QEMU switched to a data-driven representation with a Python program to autogenerate the "check bit patterns and extract fields" code, and it's one of the better design overhauls we've done: we started using it mostly for new code but went back and converted some of the old handwritten decoders too. It's much easier to add a new instruction when you only need to add a line like
USADA8 ---- 0111 1000 rd:4 ra:4 rm:4 0001 rn:4
and add a function trans_USADA8() that gets called with the field values, compared to trying to find the right place in a big existing set of handcoded switch and if statements to add the extra checks for the insn.
For cases where everything you need is really in the datafiles (e.g. a simple disassembler) or where you're providing a user facing API that you want to have match the architecture documentation closely, the tradeoffs are different.
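To make the pattern-line idea concrete, here is a minimal Python sketch of how a line like the USADA8 one could be turned into a mask/value test plus field extraction. This is not QEMU's actual generator: `compile_pattern`, `decode`, and the string-returning `trans_USADA8` handler are hypothetical names for illustration, and the pattern syntax handled is only the subset visible in the example line (fixed `0`/`1` bits, `-` for don't-care bits, and `name:width` fields).

```python
def compile_pattern(pattern, width=32):
    """Compile 'NAME bits... field:len ...' into (name, mask, value, fields).

    Tokens are read from the most significant bit down. '0'/'1' are fixed
    bits, '-' is a don't-care bit, 'name:len' is a field of len bits.
    """
    name, *tokens = pattern.split()
    mask = value = 0
    fields = {}  # field name -> (shift, width)
    pos = width
    for tok in tokens:
        if ":" in tok:
            fname, flen = tok.split(":")
            pos -= int(flen)
            fields[fname] = (pos, int(flen))
        else:
            for ch in tok:
                pos -= 1
                if ch in "01":
                    mask |= 1 << pos
                    value |= int(ch) << pos
                elif ch != "-":
                    raise ValueError(f"unexpected character {ch!r}")
    if pos != 0:
        raise ValueError(f"pattern covers {width - pos} bits, expected {width}")
    return name, mask, value, fields


def decode(insn, patterns, handlers):
    """Call trans_<NAME>(**fields) for the first pattern that matches insn."""
    for name, mask, value, fields in patterns:
        if (insn & mask) == value:
            args = {f: (insn >> shift) & ((1 << flen) - 1)
                    for f, (shift, flen) in fields.items()}
            return handlers["trans_" + name](**args)
    return None  # no pattern matched


# Hypothetical handler: a real translator would emit code, not a string.
def trans_USADA8(rd, ra, rm, rn):
    return f"usada8 r{rd}, r{rn}, r{rm}, r{ra}"


PATTERNS = [compile_pattern("USADA8 ---- 0111 1000 rd:4 ra:4 rm:4 0001 rn:4")]
HANDLERS = {"trans_USADA8": trans_USADA8}

if __name__ == "__main__":
    # 0xE7812314: cond=1110, rd=1, ra=2, rm=3, rn=4
    print(decode(0xE7812314, PATTERNS, HANDLERS))
```

Adding an instruction in this scheme means one more entry in `PATTERNS` and one more handler, which is the property the comment above is praising.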
Also for QEMU I tend to value "works the same regardless of target architecture" over "we can do a clever thing for this one case but all the others will be different".
Years ago, I ended up creating my own aarch64 definition, which I use to generate disassemblers, interpreters, and recompilers (dynamic and static) automatically: https://github.com/daeken/SharpRetro/blob/main/Aarch64Genera...
It doesn't have perfect support, but it has served as an incredibly useful resource. I've since generalized it to work for other architectures, and that same repo has definitions for MIPS (specifically the PSX CPU), DMG, and the groundwork for an x86 core. The goal is to be able to define these once, then generate any future targets automatically.
It's a kind of serialization/deserialization, or what I think Python and some others call "pickling". Same task. Turn these raw bit patterns into typed values.
Ada probably comes closest of the major languages to pulling it off. It separates the abstract, programmer's view of a data type from its implementation, the low-level representation of that type.
Specify a bunch of records like:
-- Assumes Low_Order_First bit order (bit 0 is the least significant bit).
for Instruction use record
   Condition at 0 range 28 .. 31;
   ImmFlag   at 0 range 25 .. 25;
   Opcode    at 0 range 21 .. 24;
   CondFlag  at 0 range 20 .. 20;
   Rn        at 0 range 16 .. 19;
   Rd        at 0 range 12 .. 15;
   Operand   at 0 range  0 .. 11;
end record;
Then aim a pointer at your instructions and read them as records/structs.
It works particularly cleanly with a nice RISC encoding like ARM. I'm not actually sure if that would work in Ada. The use representation syntax might not be generic enough.
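For readers without Ada to hand, the same "describe the layout once, then read words as records" idea can be roughly sketched in Python. This is only an illustration: the table assumes the classic ARM data-processing layout (condition in bits 31..28, immediate flag at bit 25, and so on), reuses the field names from the record above, and the helper names `extract`, `decode_fields`, and `read_words` are made up for this example.

```python
# Field name -> (high bit, low bit), bit 0 being the least significant bit.
# Assumed layout: classic ARM data-processing instruction.
ARM_DP_FIELDS = {
    "Condition": (31, 28),
    "ImmFlag":   (25, 25),
    "Opcode":    (24, 21),
    "CondFlag":  (20, 20),
    "Rn":        (19, 16),
    "Rd":        (15, 12),
    "Operand":   (11, 0),
}


def extract(word, hi, lo):
    """Return bits hi..lo (inclusive) of word as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_fields(word, layout=ARM_DP_FIELDS):
    """Read one 32-bit instruction word as a dict of named fields."""
    return {name: extract(word, hi, lo) for name, (hi, lo) in layout.items()}


def read_words(code):
    """Yield little-endian 32-bit words from a bytes object."""
    for off in range(0, len(code) - 3, 4):
        yield int.from_bytes(code[off:off + 4], "little")


if __name__ == "__main__":
    # 0xE0812003 encodes ADD r2, r1, r3 stored little-endian.
    for word in read_words(bytes.fromhex("032081e0")):
        print(hex(word), decode_fields(word))
```

Unlike an Ada representation clause, nothing here is checked by the compiler, so overlapping or missing bit ranges in the table would go unnoticed.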
1) instructions which "bend" the format, like ARM instructions such as STMIA or B which combine multiple fields to make a larger immediate value or mask.
2) recognizing instructions which use special values in fields (like ARM condition = 1111) to represent a special instruction.
3) instruction encodings with split fields, like the split immediate in RISC-V S-type instructions.
4) instruction encodings which have too many instruction-specific quirks to fit into any reasonable schema, like 68000.
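Items 2 and 3 in the list above are easy to show in a few lines of Python. The sketch below reassembles the split 12-bit immediate of a RISC-V S-type instruction (imm[11:5] in bits 31..25, imm[4:0] in bits 11..7) and checks for the special ARM condition value 1111. The helper names are hypothetical; the point is only that a plain "one field, one contiguous bit range" record cannot express either case without extra logic.

```python
def bits(word, hi, lo):
    """Return bits hi..lo (inclusive) of word as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def s_type_imm(insn):
    """Reassemble the split immediate of a RISC-V S-type instruction."""
    imm = (bits(insn, 31, 25) << 5) | bits(insn, 11, 7)
    return sign_extend(imm, 12)


def arm_cond_is_special(insn):
    """True when the ARM condition field holds the special value 1111."""
    return bits(insn, 31, 28) == 0b1111


if __name__ == "__main__":
    print(s_type_imm(0x00512423))  # sw x5, 8(x2)
    print(s_type_imm(0xFE512E23))  # sw x5, -4(x2)
    print(arm_cond_is_special(0xE0812003))
```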
With hindsight I would have loved to do a proper compiler but that’s undergrad level really. I really recommend it as a toy post-food-coma project for when you’re stuck with the family either next week or at the end of December :)
anthk•2mo ago
https://github.com/pts/pts-mips-emulator
thesnide•2mo ago