(01-18-2017, 01:21 AM)degasus Wrote: [ -> ]I'd expect maybe like 1% speedup for one hour of work. Likely with a few failed attempts.
Edit: Most of the CPU time is spent here, this needs to be optimized as hell: https://github.com/dolphin-emu/dolphin/tree/master/Source/Core/Core/PowerPC/JitArm64
Hej, sorry if this doesn't belong in this forum, but I guess other users are interested as well, so I'll ask my questions here.
As I understand it, Dolphin takes blocks of PowerPC opcodes from the original GC/Wii executables and translates them into ARM64 code that then gets executed at "native speed".
So, what's the main reason for the slowdown?
Is the translation from PowerPC to ARM64 way too slow OR is the generated ARM64 code so bad?
Running the generated code is what takes time.
(01-22-2017, 12:19 AM)olihey Wrote: [ -> ]Is the translation from PowerPC to ARM64 way too slow OR is the generated ARM64 code so bad?
The generated ARM64 code is bad, mostly because of a few limitations (these apply to both x64 and ARMv8):
The PPC code pages are RWX, and maybe in unprotected mode, so we have to care about block invalidation far too often.
The PPC has 32 registers. ARM has the same number, but we need a few for the emulation system itself, so we can't allocate them statically and have to load and flush all used registers in every block.
Many PPC instructions have some strange side effects. It's hard to tell whether a side effect is used, so we end up calculating lots of stuff that is never used.
We lack all kinds of inlining (because of block cache invalidation), nor do we support in-block forward jumps (divergent register allocations). So our blocks are very small.
Please also keep in mind that optimizing such blocks is close to impossible. There are conditional exits after many instructions, as e.g. every load/store instruction may fail.
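To make the invalidation point concrete, here is a minimal Python sketch of a block cache that has to throw away compiled blocks whenever the guest writes into memory they were built from. It is a toy model, not Dolphin's code: `BlockCache`, `compile_block`, and the "a block runs until the first 0 word" rule are all made up for illustration.

```python
class BlockCache:
    """Toy block cache: maps a guest start address to a compiled block."""

    def __init__(self):
        self.blocks = {}        # start address -> (exclusive end address, callable)
        self.compile_count = 0  # how often we had to (re)compile

    def lookup(self, pc, memory, compile_block):
        """Return the compiled block starting at pc, compiling it on a miss."""
        entry = self.blocks.get(pc)
        if entry is None:
            entry = compile_block(memory, pc)
            self.blocks[pc] = entry
            self.compile_count += 1
        return entry[1]

    def invalidate(self, addr):
        """Drop every block covering addr; return how many were dropped."""
        stale = [start for start, (end, _) in self.blocks.items()
                 if start <= addr < end]
        for start in stale:
            del self.blocks[start]
        return len(stale)


def compile_block(memory, pc):
    """Hypothetical 'compiler': a block runs up to and including the first 0 word.

    The generated callable bakes in a snapshot of the guest words, just like
    real generated code bakes in the guest instructions it was built from.
    """
    end = pc
    while memory[end] != 0:
        end += 1
    values = tuple(memory[pc:end])
    return end + 1, (lambda: sum(values))
```

Because the snapshot is baked in, a guest write to an address inside the block makes the cached callable wrong until `invalidate` is called for that address. That is also why inlining gets harder: an inlined callee would have to be invalidated everywhere it was copied to.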
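As a rough illustration of why unused side effects and conditional exits hurt, here is a small Python sketch of a backward liveness pass over one block. It is a toy model under my own assumptions, not what the JIT actually does: `Op`, its three boolean fields, and `needed_flag_writes` are hypothetical, and "flags" stands in for something like a condition register field.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    writes_flags: bool = False  # updates the flags as a side effect
    reads_flags: bool = False   # consumes the flags
    may_exit: bool = False      # may leave the block early (e.g. a load/store)


def needed_flag_writes(block):
    """Return the indices of ops whose flag side effect really must be computed."""
    needed = []
    # We cannot see what runs after the block, so assume the flags are read.
    live = True
    for i in range(len(block) - 1, -1, -1):
        op = block[i]
        if op.writes_flags:
            if live:
                needed.append(i)
            live = False  # this write hides every earlier flag write
        if op.reads_flags:
            live = True
        if op.may_exit:
            # If the op bails out, the guest state before it must be complete.
            live = True
    return sorted(needed)
```

With `add.`, `subf.`, `bc` in one block, only the flags of `subf.` are needed. Put a load between the two arithmetic ops, or split the block after `add.`, and the flags of `add.` have to be computed as well, even if nobody ever reads them.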
We're actually quite happy that we get the performance we have right now. There is a good article within
https://www.alchemistowl.org/pocorgtfo/pocorgtfo06.pdf about the implemented optimizations.
(01-22-2017, 12:43 AM)degasus Wrote: [ -> ]There is a good article within https://www.alchemistowl.org/pocorgtfo/pocorgtfo06.pdf about the implemented optimizations.
Awesome read, I had no idea. Should be a must read for every Dolphin user so they can appreciate what you guys are doing.
From reading this, it is really pretty challenging to emulate a 900 MHz PowerPC on a 2.014 GHz ARM processor, like the Nvidia Shield TV has. Amazing work you have done.

We can still do a lot, but indeed, there won't be big speedups. But in total, they will add up.
The simplest way to start is likely fixing a few interpreter fallbacks, e.g. we've added a few of them with the last CoreTiming change. It's just a matter of compiling this code with any ARM compiler and copying the assembly back.
A bigger task is to support inlining. I'm working on this one a bit, but I have no clue how well it will go.
If you want to spend a year on a very cool task, restart with an IR-based JIT and support all kinds of forward in-block branches. This will allow us to skip a bit of overhead and get better register tracking within those (far bigger) blocks.
Just come to IRC, #dolphin-emu @ freenode, and ask me how to do so.
@olihey: May I suggest the stack-based BranchLinkRegister optimization? It's both in the article and in the x64 JIT, but not in the ARMv8 one. I'd expect a 5% to 10% speedup for likely 200 lines of code. The code is hard to debug, but as it would be a port, this should be manageable. The article talks about using the host's return address prediction, but IMO the main speedup comes from avoiding the dispatcher.
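For anyone wondering what "avoiding the dispatcher" means here, this is a very rough Python sketch of the idea, not the x64 JIT's actual code. It only counts dispatcher lookups for a trace of guest `bl` (call) and `blr` (return) events. The `count_dispatcher_lookups` name, the trace format, and the "clear the stack on a mismatch" policy are assumptions I made for the sketch.

```python
def count_dispatcher_lookups(trace, use_return_stack):
    """Count how often a guest return has to go through the dispatcher.

    trace is a list of ("bl", return_address) and ("blr", lr_value) events.
    Direct branches like bl are assumed to be linked block-to-block, so in
    this model only blr can cost a dispatcher lookup.
    """
    stack = []    # guest return addresses pushed at each call
    lookups = 0
    for kind, addr in trace:
        if kind == "bl":
            if use_return_stack:
                stack.append(addr)
        elif kind == "blr":
            if use_return_stack and stack and stack[-1] == addr:
                # LR matches what the call pushed: return straight to the
                # code after the call, no lookup needed.
                stack.pop()
            else:
                # Mismatch (or optimization off): fall back to the dispatcher.
                # Assumed policy for this sketch: drop the whole stack.
                stack.clear()
                lookups += 1
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return lookups
```

For well-behaved call/return pairs the stack version never hits the dispatcher, while the plain version pays one lookup per return. The hard-to-debug part is everything this toy leaves out, e.g. guest code that changes LR or unwinds several frames at once.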
Just chiming in that the newest build has a 1-2 fps speedup on the Shield TV in Wind Waker, which is now nearly full speed. Thanks to you, degasus, and pull request 4735. Impressive stuff. I was actually tinkering around with the Shield's profiling tools, investigating whether there was any low-hanging fruit, with little success before I came across this topic. It seems I'll need to learn a lot about the JIT if I want to contribute.