Part number the two!
Did some more work on the Assembler project (part 1 here) that didn't just involve improving the interface. Ooh, ooh - what?
Parsing. That's what.
Run (or F5) now results in the entered code being parsed and added to an Instructions collection. The following style is correctly parsed:
<mnemonic><datatype> <source_operand>, <dest_operand>
An example of which is:
; an optional comment which isn't explicitly captured
mnemonic is the instruction (move, add, etc).
datatype is the size, which is one of .b (byte), .w (word), or .l (long-word).
source_operand and dest_operand are where stuff happens (values, registers, addresses, and the like).
Each line in the source is read one by one. Once a line is read, a regular expression [1] match is attempted (hard-coded add and move for now).
(add|move)(?:\.(l|b|w)?)? *(.*), (\w*)
Note that this expression is far from complete. First change would involve removing the rigid white-space structure.
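To make the behaviour of that expression concrete, here is a minimal sketch in Python (picked purely for illustration; the project itself may be written in something else). The helper name split_line and the sample source lines are made up for this example. The last call shows the rigid white-space problem: without a space after the comma, nothing matches.

```python
import re

# The expression from above, unchanged.
PATTERN = re.compile(r"(add|move)(?:\.(l|b|w)?)? *(.*), (\w*)")


def split_line(line):
    """Return (mnemonic, datatype, source, dest), or None if the line doesn't match.

    Hypothetical helper; re.match anchors the pattern at the start of the line.
    """
    m = PATTERN.match(line)
    return m.groups() if m else None


print(split_line("move.l d0, d1"))             # ('move', 'l', 'd0', 'd1')
print(split_line("add.w #5, d0 ; a comment"))  # ('add', 'w', '#5', 'd0')
print(split_line("add d0, d1"))                # ('add', None, 'd0', 'd1')
print(split_line("move.l d0,d1"))              # None (no space after the comma)
```

Note that the trailing comment is simply left unmatched rather than captured, which fits the "optional comment which isn't explicitly captured" behaviour described earlier.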
The reason for the datatype being optional in the regex is that I thought it could be omitted from assembler source. A bit of research seems to show that everyone always includes a datatype (or size, as I think is the nomenclature), so I've dropped the idea of having a default type. Still, this now means I can detect any missing types and give a specific error about them.
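Because the datatype group is optional, a missing size shows up as an empty capture rather than a failed match, and that is what makes a targeted error possible. A rough sketch in Python, assuming a made-up MissingDatatypeError and require_datatype helper (neither is from the project):

```python
import re

# Same expression as above; group 2 (the size) is None when omitted.
PATTERN = re.compile(r"(add|move)(?:\.(l|b|w)?)? *(.*), (\w*)")


class MissingDatatypeError(ValueError):
    """Hypothetical error for an instruction with no .b/.w/.l suffix."""


def require_datatype(line, line_number):
    """Return the size letter, or None if the line isn't a known instruction."""
    m = PATTERN.match(line)
    if m is None:
        return None  # not something this pattern recognises
    mnemonic, datatype, _source, _dest = m.groups()
    if datatype is None:
        raise MissingDatatypeError(
            f"line {line_number}: '{mnemonic}' needs a size (.b, .w or .l)"
        )
    return datatype


print(require_datatype("move.l d0, d1", 1))  # l
try:
    require_datatype("move d0, d1", 2)
except MissingDatatypeError as err:
    print(err)  # line 2: 'move' needs a size (.b, .w or .l)
```

A bare dot with no letter ("move. d0, d1") also leaves the group empty, so it would get the same error.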
Anyway, back to parsing.
If the line fails the regex match, then onto the next line. If it succeeds, the parsing continues and an Instruction object is created containing the mnemonic and datatype. I haven't yet figured out how the source and destination operands are going to be represented. In assembler code, they can be addresses, registers, or whatever else, so I don't really know what's going to happen until I just start typing.
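The read-match-create loop described above might look something like the following sketch. It is in Python for illustration only, and the field names on Instruction are assumptions. Since the operand representation is still undecided, the operands are kept here as raw strings as a placeholder.

```python
import re
from dataclasses import dataclass
from typing import Optional

PATTERN = re.compile(r"(add|move)(?:\.(l|b|w)?)? *(.*), (\w*)")


@dataclass
class Instruction:
    # Field names are hypothetical; operands stay as unparsed text for now.
    mnemonic: str
    datatype: Optional[str]
    source: str
    dest: str


def parse(source_text):
    """Read the source line by line, skipping anything that fails the match."""
    instructions = []
    for line in source_text.splitlines():
        m = PATTERN.match(line)
        if m is None:
            continue  # failed the match: on to the next line
        instructions.append(Instruction(*m.groups()))
    return instructions


program = "move.l d0, d1\n; just a comment\nsub.l d0, d1\nadd.b #1, d2"
for ins in parse(program):
    print(ins.mnemonic, ins.datatype, ins.source, ins.dest)
# move l d0 d1
# add b #1 d2
```

The comment line and the unsupported sub are silently skipped, which matches the "onto the next line" behaviour but also means typos currently go unreported.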
Undecided if mnemonics should inherit from a baseMnemonic and be individual classes, or if I should continue with the mnemonic simply being an enum on Instruction. I like instructions being self-contained, but comparisons become more involved. I'm leaning more towards baseMnemonic with a defined Interface so instructions can do whatever the hell they want as long as they take in an input and give back an output.
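For a feel of the class-per-mnemonic option, here is one possible shape, sketched in Python. Everything in it is hypothetical: the execute method, its two-argument signature, and the lookup table are illustrative guesses at "take in an input and give back an output", and the sketch ignores datatype sizes and overflow entirely.

```python
from abc import ABC, abstractmethod


class BaseMnemonic(ABC):
    """Hypothetical interface: every mnemonic takes inputs and returns an output."""

    name = ""

    @abstractmethod
    def execute(self, source, dest):
        """Return the value that should end up in the destination."""


class Move(BaseMnemonic):
    name = "move"

    def execute(self, source, dest):
        return source  # destination is overwritten with the source value


class Add(BaseMnemonic):
    name = "add"

    def execute(self, source, dest):
        return source + dest  # destination accumulates the source value


# Lookup by the text captured from the regex, instead of comparing enum values.
MNEMONICS = {cls.name: cls() for cls in (Move, Add)}

print(MNEMONICS["add"].execute(2, 3))   # 5
print(MNEMONICS["move"].execute(2, 3))  # 2
```

The trade-off mentioned above shows up here: each instruction is self-contained, but asking "is this a move?" now means an isinstance check or a name comparison rather than a plain enum equality test.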
Thinking further ahead, I'm also undecided if the entire assembler source should be parsed in one go or instead proceed one line at a time. The latter will allow code to be edited while the program is running, which is quite nice, and goes hand-in-hand with the Step debugger feature.
Right now, I can't think of any advantages of parsing in one go (parse each line and populate Instructions immediately before execution even starts). Maybe "compiler" optimisations at a later date? Performance isn't even a minor consideration, so I don't see that as much of a win. I think the step-by-step method simply gives more advantages and so that may very well be the route to take.
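The line-at-a-time approach can be sketched as follows. This is Python for illustration, and the Stepper class and its step method are invented names, not part of the project. Parsing is deferred until a line is actually stepped onto, so an edit to a line that hasn't run yet is picked up automatically.

```python
import re

PATTERN = re.compile(r"(add|move)(?:\.(l|b|w)?)? *(.*), (\w*)")


class Stepper:
    """Hypothetical step debugger core: parses one line per step."""

    def __init__(self, lines):
        self.lines = lines   # shared, editable list of source lines
        self.position = 0    # index of the next line to look at

    def step(self):
        """Parse and return the next matching line's groups, or None at the end."""
        while self.position < len(self.lines):
            line = self.lines[self.position]
            self.position += 1
            m = PATTERN.match(line)
            if m:
                return m.groups()
        return None


source = ["move.l d0, d1", "add.l d1, d2"]
stepper = Stepper(source)
print(stepper.step())        # ('move', 'l', 'd0', 'd1')
source[1] = "add.w d1, d3"   # edit a line that hasn't run yet
print(stepper.step())        # ('add', 'w', 'd1', 'd3')
print(stepper.step())        # None
```

Parsing everything up front would instead have captured the original second line, so the edit would need a re-parse to take effect.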
[1] I can highly recommend Regex Pal for writing and testing expressions.