Hello all,
Mainly to see the benefits that it could have (or not), I started to implement a PoC of an LLVM-based JIT compiler for Dolphin. After some Google searching, I didn't find any previous attempts, but I have to admit that it was only while writing this message that I found out about this thread: https://forums.dolphin-emu.org/Thread-i-d-like-to-make-a-llvm-jit-for-dolphin-but-i-don-t-know-where-to-start. It is a bit old, so I don't know whether it turned into something interesting or not (does anyone have any information?).
As it was my first time playing with the Dolphin code, I wanted to start with something really simple that still works. Moreover, since the LLVM code generator is known to be quite slow compared to a custom solution (like Dolphin's), I looked for a "fail fast" approach that would *not* involve writing all the PPC semantics in LLVM IR by hand. My approach has thus been the following:
- take the CachedInterpreter and, instead of generating a list of callbacks, use LLVM to generate a list of "call XX" instructions (where XX is an immediate representing the interpreter function to call); this is what we will call v1 (a rough sketch of the idea is given after these lists)
- verify that this works
- replace these calls with their actual LLVM IR implementations (generated from the existing C++ code), inline everything, and see what happens (what we will call v2)
- and also verify that it actually works
- v1 is implemented here: https://github.com/aguinet/dolphin/tree/feature/llvm_jit_simple
- v2 is implemented here: https://github.com/aguinet/dolphin/tree/feature/llvm_jit
- note that this has only been compiled and tested under Linux, and that v2 contains some nasty hacks to generate the LLVM IR of the interpreter and retrieve it at runtime (see https://github.com/aguinet/dolphin/blob/feature/llvm_jit/Source/Core/Core/CMakeLists.txt#L670 for instance).
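
To make the v1 idea concrete, here is a rough, self-contained sketch of what the "list of calls" generation could look like with the LLVM C++ API. This is not the code from my branch: names like InterpreterOp and EmitBlockV1 are made up for illustration, and the assumed handler signature is a plain 32-bit instruction word rather than Dolphin's UGeckoInstruction.

```cpp
// Rough sketch of the v1 idea (InterpreterOp and EmitBlockV1 are hypothetical names).
#include <llvm/IR/IRBuilder.h>
#include <llvm/IR/Module.h>
#include <cstdint>
#include <utility>
#include <vector>

using InterpreterOp = void (*)(uint32_t);  // assumed handler signature

// Build one LLVM function per PPC block: a straight sequence of "call handler(inst)",
// each handler address baked into the IR as an integer immediate.
llvm::Function* EmitBlockV1(llvm::Module& M,
                            const std::vector<std::pair<InterpreterOp, uint32_t>>& block)
{
  llvm::LLVMContext& Ctx = M.getContext();
  llvm::IRBuilder<> B(Ctx);

  auto* FnTy = llvm::FunctionType::get(B.getVoidTy(), /*isVarArg=*/false);
  auto* Fn = llvm::Function::Create(FnTy, llvm::Function::ExternalLinkage, "ppc_block", &M);
  B.SetInsertPoint(llvm::BasicBlock::Create(Ctx, "entry", Fn));

  auto* HandlerTy = llvm::FunctionType::get(B.getVoidTy(), {B.getInt32Ty()}, false);

  for (const auto& op : block)
  {
    // "call XX": turn the handler's address into a function pointer and call it.
    llvm::Value* Callee = B.CreateIntToPtr(
        B.getInt64(reinterpret_cast<uint64_t>(op.first)),
        llvm::PointerType::getUnqual(HandlerTy));
    B.CreateCall(HandlerTy, Callee, {B.getInt32(op.second)});
  }
  B.CreateRetVoid();
  return Fn;
}
```

v2 then replaces each of these opaque calls with the inlined LLVM IR of the corresponding handler.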

About v2, it also works on these same games, and it is really slow. I optimized the code a bit, so the LLVM IR generation step now takes about 100µs for "small" blocks (<10 instructions) and can go up to 1ms for "big" blocks. As expected, the actual code generation is what takes most of the time, and it can go up to 2ms.
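
As an illustration of how the two phases can be timed separately, here is a standalone sketch. It is not what my branch actually does: it reuses the hypothetical EmitBlockV1() and handler signature from the earlier sketch, adds a DummyHandler stand-in, and uses the ORC LLJIT API. Since LLJIT only compiles a module when one of its symbols is looked up, the lookup() call is where the code-generation time shows up.

```cpp
// Standalone timing sketch (depends on the EmitBlockV1 sketch above).
#include <llvm/ExecutionEngine/Orc/LLJIT.h>
#include <llvm/Support/Error.h>
#include <llvm/Support/TargetSelect.h>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <memory>

static void DummyHandler(uint32_t) {}  // stand-in for an interpreter routine

int main()
{
  llvm::InitializeNativeTarget();
  llvm::InitializeNativeTargetAsmPrinter();

  auto Ctx = std::make_unique<llvm::LLVMContext>();
  auto Mod = std::make_unique<llvm::Module>("jit_block", *Ctx);

  using Clock = std::chrono::steady_clock;
  auto ToUs = [](Clock::duration d) {
    return std::chrono::duration_cast<std::chrono::microseconds>(d).count();
  };

  // Phase 1: LLVM IR generation (the ~100µs-1ms part).
  auto T0 = Clock::now();
  EmitBlockV1(*Mod, {{&DummyHandler, 0x7c0802a6u}, {&DummyHandler, 0x38600000u}});
  auto T1 = Clock::now();

  // Phase 2: actual code generation (the part that can reach ~2ms).
  auto JIT = llvm::cantFail(llvm::orc::LLJITBuilder().create());
  llvm::cantFail(JIT->addIRModule(
      llvm::orc::ThreadSafeModule(std::move(Mod), std::move(Ctx))));
  auto Sym = llvm::cantFail(JIT->lookup("ppc_block"));  // forces compilation of the module
  (void)Sym;
  auto T2 = Clock::now();

  std::printf("IR generation: %lldus, code generation: %lldus\n",
              static_cast<long long>(ToUs(T1 - T0)),
              static_cast<long long>(ToUs(T2 - T1)));
}
```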
My questions are:
- inlining the full interpreter is a bit overkill. What are the trade-offs between inlined code and runtime calls that should be considered in this case?
- I have seen a "block linking" feature in the current JIT implementations. My guess is that it links blocks together to avoid a "round trip" through the main "scheduler" (see the toy sketch after this list), but I might be completely wrong. Is there any documentation on this process?
- (Somewhat related to the previous one) The granularity of the JIT process seems to be the basic block. Would it be doable to link blocks together so that full functions can be optimized? (which is where the LLVM optimizer would really shine)
- and, more generally, do you think that we could achieve something usable with this approach?
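
Regarding the block-linking question, here is the toy sketch I mentioned, a C++ illustration of my guess (this is not Dolphin's actual code, and the guess itself may be wrong): an unlinked block returns to a central dispatcher, which has to look the successor up in the block cache; a linked block calls its statically known successor directly and skips that round trip.

```cpp
// Toy illustration of the block-linking guess; everything here is made up.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

using BlockFn = uint32_t (*)();  // a compiled block; returns the next PPC address (toy convention)

std::unordered_map<uint32_t, BlockFn> g_block_cache;  // PPC address -> compiled block

uint32_t BlockB() { std::puts("B"); return 0; }  // 0 = "leave the JIT" in this toy

// Unlinked: just report where execution continues; the dispatcher does the lookup.
uint32_t BlockA_unlinked() { std::puts("A"); return 0x80000020u; }

// Linked: the successor is known when A is compiled, so transfer control to it directly.
uint32_t BlockA_linked() { std::puts("A"); return BlockB(); }

// Central dispatcher: one cache lookup + one indirect call per executed block.
void RunFrom(uint32_t pc)
{
  while (pc != 0)
    pc = g_block_cache.at(pc)();
}

int main()
{
  g_block_cache[0x80000000u] = &BlockA_unlinked;
  g_block_cache[0x80000020u] = &BlockB;
  RunFrom(0x80000000u);  // A, round trip through the dispatcher, then B

  g_block_cache[0x80000000u] = &BlockA_linked;
  RunFrom(0x80000000u);  // A reaches B directly, no dispatcher in between
}
```

Linking or inlining whole chains of blocks like this is also what would give LLVM a function-sized region to optimize, which is what the previous question is about.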
Thanks everyone for your help and remarks!
Regards,