Dolphin, the GameCube and Wii emulator - Forums

Full Version: Compiling Win32 ARM64
OK, I installed WinDbg locally on the ARM64 device (so no remote debugging for now).
The issue is:

Invalid instruction exception: mrs x3, ctr_el0

OK, that's the code I added for cache flushing. However, I did not expect it to be executed under the interpreter, but apparently it is.

Call Stack:
FlushCacheSection
Arm64XEmitter::FlushCacheSection
Arm64XEmitter::FlushCache
VertexLoaderARM64::GenerateVertexLoader
VertexLoaderARM64::VertexLoaderARM64
std::make_unique

So we still use the JIT even when the interpreter is enabled?
Yes, code is still generated for the GPU vertex loaders and format conversion - no matter the "CPU Core" option - so you'll probably need to disable that if the codegen isn't working - see VertexLoaderBase::CreateVertexLoader() in Source/Core/VideoCommon/VertexLoaderBase.cpp

There doesn't seem to be any config or easy option to disable this right now, so likely code mods will be needed if you want to disable for now.
(04-24-2018, 09:05 AM)JonnyH Wrote: [ -> ]Yes, code is still generated for the GPU vertex loaders and format conversion - no matter the "CPU Core" option - so you'll probably need to disable that if the codegen isn't working - see VertexLoaderBase::CreateVertexLoader() in Source/Core/VideoCommon/VertexLoaderBase.cpp

There doesn't seem to be any config or easy option to disable this right now, so likely code mods will be needed if you want to disable for now.

Ok thanks!

I figured out that the low-level cache flushing instructions are not allowed at EL0 on Windows, so I was getting invalid instruction exceptions. I replaced them with FlushInstructionCache(GetCurrentProcess(), start, end - start);
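
For anyone following along, a minimal sketch of what such a replacement could look like. FlushCodeRange is a made-up helper name, not Dolphin's actual function, and the non-Windows branch (the GCC/Clang builtin __builtin___clear_cache) is only there so the sketch compiles elsewhere. Only the FlushInstructionCache call comes from the post above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper (not Dolphin's real API): make freshly written machine
// code in [start, end) visible to instruction fetch.
inline bool FlushCodeRange(const std::uint8_t* start, const std::uint8_t* end)
{
  assert(start <= end);
  if (start == end)
    return true;  // empty range, nothing to flush
#ifdef _WIN32
  // User-mode Windows: ask the OS to do the cache maintenance instead of
  // issuing the cache instructions directly.
  return FlushInstructionCache(GetCurrentProcess(), start,
                               static_cast<SIZE_T>(end - start)) != 0;
#elif defined(__GNUC__) || defined(__clang__)
  // Assumption for non-Windows builds: use the compiler builtin.
  __builtin___clear_cache(reinterpret_cast<char*>(const_cast<std::uint8_t*>(start)),
                          reinterpret_cast<char*>(const_cast<std::uint8_t*>(end)));
  return true;
#else
  return false;  // no known flush primitive for this toolchain
#endif
}
```

The emitter would call this once after writing a block of code, e.g. FlushCodeRange(block_start, block_end), before jumping into it.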

And... it's running! Both Interpreter and Cached Interpreter work. I am getting 5-7 fps in Cached Interpreter mode running the Zelda: TWW intro. On my desktop Core i7 Skylake @ 4.2 GHz I am getting 9-13 fps with the same settings. So losing only about 50% of the speed against a desktop-class Core i7 is a good sign :)
(04-24-2018, 10:32 AM)Gerdya Wrote: [ -> ]Ok thanks!

I figured out that the low-level cache flushing instructions are not allowed at EL0 on Windows, so I was getting invalid instruction exceptions. I replaced them with FlushInstructionCache(GetCurrentProcess(), start, end - start);

And... it's running! Both Interpreter and Cached Interpreter work. I am getting 5-7 fps in Cached Interpreter mode running the Zelda: TWW intro. On my desktop Core i7 Skylake @ 4.2 GHz I am getting 9-13 fps with the same settings. So losing only about 50% of the speed against a desktop-class Core i7 is a good sign :)

Amazing results. Could you try dropping the W18 register here: https://github.com/dolphin-emu/dolphin/blob/master/Source/Core/Core/PowerPC/JitArm64/JitArm64_RegCache.cpp#L341 This might already fix the JIT. Losing 50% of the speed with 50% of the clock speed is an amazing result :D

By the way, where are your pull requests?
(04-25-2018, 07:39 AM)degasus Wrote: [ -> ]Amazing results. Could you try dropping the W18 register here: https://github.com/dolphin-emu/dolphin/blob/master/Source/Core/Core/PowerPC/JitArm64/JitArm64_RegCache.cpp#L341 This might already fix the JIT. Losing 50% of the speed with 50% of the clock speed is an amazing result :D

By the way, where are your pull requests?

Hi Degasus,

For pull requests I would need a public repository, right? Currently I have the repository only locally on my development machine. Are there any good descriptions of how to move a local repository to a public one so that I can open pull requests?

Meanwhile, I debugged quite a bit. The crash is in PPCAnalyzer::Analyze() when calling:
block->m_physical_address.clear()

This is a std::set data structure, and the crash is an access violation inside ntdll.

Now I have a hypothesis:

The code in NTDLL which crashes looks like

mov  x8, x18
ldr    x9, [x8, #0x60]   ---> x8=0x3860.0000 -> EA=0x38600060 -> access violation

The thing is that Windows uses X18 as a private register for TLS. So if the JIT touches X18, we have a problem.

If this is the case, is it possible to disable the use of X18?
(05-02-2018, 04:20 AM)Gerdya Wrote: [ -> ]For pull requests I would need a public repository, right? Currently I have the repository only locally on my development machine. Are there any good descriptions of how to move a local repository to a public one so that I can open pull requests?

You need to create a GitHub account and clone (fork) our repository there. Afterwards, you can add this clone as a new remote to your local repository and push your new commits there.
But honestly, explaining git here on the forum is likely not the best way. There are plenty of tutorials online, or you could ask on IRC. I'm also fine with the plain patch, so I could create the pull requests myself.

(05-02-2018, 04:20 AM)Gerdya Wrote: [ -> ]The thing is that Windows uses X18 as a private register for TLS. So if the JIT touches X18, we have a problem.

If this is the case, is it possible to disable the use of X18?

Have you read my comment about dropping the line with W18? We use the 32-bit register notation there, but W18 == X18 (!= Q18, which is a NEON register): https://github.com/dolphin-emu/dolphin/blob/master/Source/Core/Core/PowerPC/JitArm64/JitArm64_RegCache.cpp#L341
(05-02-2018, 05:18 AM)degasus Wrote: [ -> ]You need to create a GitHub account and clone (fork) our repository there. Afterwards, you can add this clone as a new remote to your local repository and push your new commits there.
But honestly, explaining git here on the forum is likely not the best way. There are plenty of tutorials online, or you could ask on IRC. I'm also fine with the plain patch, so I could create the pull requests myself.

I did create a public repo in March:
https://github.com/Gerdya/dolphin-gerdya

But I assume I just did some steps wrong. So yes, I guess I need to read up more.

(05-02-2018, 05:18 AM)degasus Wrote: [ -> ]Have you read my comment about dropping the line with W18? We use the 32-bit register notation there, but W18 == X18 (!= Q18, which is a NEON register): https://github.com/dolphin-emu/dolphin/blob/master/Source/Core/Core/PowerPC/JitArm64/JitArm64_RegCache.cpp#L341

Sorry, you already told me before... I did not dare to ask what you meant by "drop register W18". In any case, it is now commented out and no longer available for allocation.

Now the crash in NTDLL is gone :)
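
As a rough illustration of the idea (not the actual Dolphin register cache code), this is how an allocation list could leave out register 18 on platforms that reserve it. AllocatableHostRegs, Contains and the register range are made up for this sketch; the real allocation order lives in JitArm64_RegCache.cpp.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a JIT's list of host GPRs that may be handed out
// to guest registers. Registers are identified by their index (0..30).
inline std::vector<int> AllocatableHostRegs(int first, int last, bool reserve_x18)
{
  assert(0 <= first && first <= last && last <= 30);
  std::vector<int> regs;
  for (int r = first; r <= last; ++r)
  {
    // W18 is just the low 32 bits of X18, so skipping index 18 covers both names.
    if (reserve_x18 && r == 18)
      continue;
    regs.push_back(r);
  }
  return regs;
}

// Small helper for checking membership in the list.
inline bool Contains(const std::vector<int>& regs, int reg)
{
  for (int r : regs)
  {
    if (r == reg)
      return true;
  }
  return false;
}
```

A Windows build would pass reserve_x18 = true (for example via an #ifdef _WIN32), so the allocator never clobbers the register the OS relies on.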

The next crashes have to do with code generation. I have figured out the following so far:

1) X28 is used as memory base address
2) It is initialized in JitAsm.cpp (note MEM_REG = X28)

    // set the mem_base based on MSR flags
    LDR(INDEX_UNSIGNED, ARM64Reg::W28, PPC_REG, PPCSTATE_OFF(msr));
    FixupBranch physmem = TBNZ(ARM64Reg::W28, 31 - 27);
    MOVP2R(MEM_REG, Memory::physical_base);
    FixupBranch membaseend = B();
    SetJumpTarget(physmem);
    MOVP2R(MEM_REG, Memory::logical_base);
    SetJumpTarget(membaseend);

Indeed, the content of X28 is equal to logical_base, so this works. However, *logical_base is not accessible, so I get an access violation on each load and store using X28.
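
To spell out what the emitted code above decides, here is a plain C++ sketch of the same selection. Bit 27 in PowerPC's MSB-first numbering is MSR.DR (data address translation), which is bit 31 - 27 = 4 counted from the LSB. SelectMemBase and MSR_DR_MASK are names invented for this sketch; only the bit test and the two base pointers come from the snippet above.

```cpp
#include <cassert>
#include <cstdint>

// MSR.DR: bit 27 in PowerPC (MSB = bit 0) numbering, i.e. bit 4 from the LSB.
constexpr std::uint32_t MSR_DR_MASK = 1u << (31 - 27);

// Hypothetical helper mirroring the TBNZ in the emitted code:
// DR set   -> use the logical (translated) base,
// DR clear -> use the physical base.
inline std::uint8_t* SelectMemBase(std::uint32_t msr, std::uint8_t* physical_base,
                                   std::uint8_t* logical_base)
{
  assert(physical_base != nullptr && logical_base != nullptr);
  return (msr & MSR_DR_MASK) != 0 ? logical_base : physical_base;
}
```

With an MSR of 0x00003032 (a value that appears in a log later in this thread) bit 4 is set, so the logical base is selected, which matches X28 being equal to logical_base.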

I do see where physical_base is allocated/committed, but I do not see where logical_base is allocated or committed. Any hints?

Thanks,
Gerdya
(05-02-2018, 08:21 AM)Gerdya Wrote: [ -> ]The next crashes have to do with code generation. I have figured out the following so far:

1) X28 is used as memory base address
2) It is initialized in JitAsm.cpp (note MEM_REG = X28)

    // set the mem_base based on MSR flags
    LDR(INDEX_UNSIGNED, ARM64Reg::W28, PPC_REG, PPCSTATE_OFF(msr));
    FixupBranch physmem = TBNZ(ARM64Reg::W28, 31 - 27);
    MOVP2R(MEM_REG, Memory::physical_base);
    FixupBranch membaseend = B();
    SetJumpTarget(physmem);
    MOVP2R(MEM_REG, Memory::logical_base);
    SetJumpTarget(membaseend);

Indeed, the content of X28 is equal to logical_base, so this works. However, *logical_base is not accessible, so I get an access violation on each load and store using X28.

I do see where physical_base is allocated/committed, but I do not see where logical_base is allocated or committed. Any hints?

This is a memory fault which happens on purpose. We call it fastmem. There is a nice article about it in https://www.alchemistowl.org/pocorgtfo/pocorgtfo06.pdf (warning: 100 MB). Edit: It is on page 9: 3.4 Dolphin intentionally makes thousands of segfaults
Does Dolphin also crash when it is not running within a debugger? If so, you can disable this feature with an ini setting:
Dolphin.ini
[Core]
Fastmem = False

This will slow things down a lot, but it is still a lot faster than the interpreter.
I am going in circles somehow. It's frustrating.
The issue is the following:

After adding a few traces, the issue most of the time looks like this:
49:22:350 e:\dolphin-master-org\dolphin\source\core\core\powerpc\jitarm64\jit.cpp:562 D[JIT]: JIT64 PC: 8031cb64 SRR0: 803082a4 SRR1: 00003032 FPSCR: 00000004 MSR: 00003032 LR: 8031cb64 r00: 00000000 r01: 80414450 r02: 80407520 r03: 804145a0 r04: 00000001 r05: 00000000 r06: 804144c0 r07: 80414500 r08: 80414543 r09: 0000c200 r10: 0000c208 r11: 80414688 r12: 802b69c0 r13: 804058e0 r14: 00000000 r15: 00000000 r16: 00000000 r17: 00000000 r18: 00000000 r19: 01000000 r20: 80414540 r21: 80414600 r22: 80414640 r23: cc000000 r24: 804145c0 r25: 00000000 r26: 00000000 r27: 00a00000 r28: 80414580 r29: 80414a04 r30: 00000003 r31: 00000001  
49:22:360 e:\dolphin-master-org\dolphin\source\core\core\memtools.cpp:39 N[JIT]: EXCEPTION: CODE:c0000005  ACCESSTYPE:       0  BADADDRESS:00000002CC005020
49:22:361 e:\dolphin-master-org\dolphin\source\core\core\memtools.cpp:39 N[JIT]: EXCEPTION: CODE:c0000005  ACCESSTYPE:       1  BADADDRESS:00000002CC005020
49:22:371 e:\dolphin-master-org\dolphin\source\core\core\memtools.cpp:39 N[JIT]: EXCEPTION: CODE:c0000005  ACCESSTYPE:       0  BADADDRESS:00000002CC005022
49:22:372 e:\dolphin-master-org\dolphin\source\core\core\memtools.cpp:39 N[JIT]: EXCEPTION: CODE:c0000005  ACCESSTYPE:       1  BADADDRESS:00000002CC005022
49:22:381 e:\dolphin-master-org\dolphin\source\core\core\memtools.cpp:39 N[JIT]: EXCEPTION: CODE:c0000005  ACCESSTYPE:       8  BADADDRESS:FFFFFE9261370000

The last line tells us that we are at an invalid PC location (ACCESSTYPE = 8). Up until then everything looks OK. The problem is that it is not deterministic. Looking at the log file, sometimes it runs for only 800 lines, and one run later it might run for 1500 lines, so it is extremely hard to catch with the debugger.
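
For reading these log lines, a small sketch that decodes the two fields. On Windows the access-type value of an access violation is 0 for a read, 1 for a write and 8 for an execute (DEP) fault. The function names are invented for this sketch, and the base value 0x200000000 used in the note below is only an assumption obtained by subtracting the guest address 0xCC005020 from the logged BADADDRESS; it is not stated anywhere in the thread.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Decode the ACCESSTYPE field of a Windows access violation
// (ExceptionInformation[0] of the exception record).
inline std::string AccessTypeName(std::uint64_t access_type)
{
  switch (access_type)
  {
  case 0:
    return "read";
  case 1:
    return "write";
  case 8:
    return "execute";
  default:
    return "unknown";
  }
}

// Hypothetical helper: map a faulting host address back to a 32-bit guest
// address, given the host base of the 4 GiB fastmem window.
// Returns false if the fault lies outside that window.
inline bool FaultToGuestAddress(std::uint64_t fault, std::uint64_t base, std::uint32_t* guest)
{
  assert(guest != nullptr);
  constexpr std::uint64_t kGuestSpace = 1ull << 32;
  if (fault < base || fault - base >= kGuestSpace)
    return false;
  *guest = static_cast<std::uint32_t>(fault - base);
  return true;
}
```

Under that assumed base, 00000002CC005020 maps to guest address 0xCC005020 (a hardware register access, which fastmem is expected to back-patch), while FFFFFE9261370000 falls outside the window, so it is not a fastmem fault at all.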

I did catch the issue with the debugger once. The reason was a wrong trampoline for the fault handler.

00000217`8e0349f0 a9bf47fe stp         lr,xip1,[sp,#-0x10]!
00000217`8e0349f4 2a1903e0 mov         w0,w25
00000217`8e0349f8 d28cd91e mov         lr,#0x66C8
00000217`8e0349fc f2a1e55e movk        lr,#0xF2A lsl #0x10
00000217`8e034a00 f2cffefe movk        lr,#0x7FF7 lsl #0x20
00000217`8e034a04 d63f03c0 blr         lr
00000217`8e034a08 a8c17bf1 ldp         xip1,lr,[sp],#0x10     <<<<<< different pair order than store
00000217`8e034a0c d65f03c0 ret         lr

The issue was that the registers were restored in a different order than they were put on the stack, which led to PC = 0 after the return.
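
To make the effect of the swapped pair concrete, here is a tiny simulation of the stp/ldp pair from the disassembly above. It is only a model: Regs, Stack and RoundTrip are invented for this sketch, and the value of x17 (xip1) at the time of the push is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Regs
{
  std::uint64_t x17 = 0;  // xip1
  std::uint64_t x30 = 0;  // lr
};

// Models a descending stack of 64-bit slots; slots.back() is [sp].
struct Stack
{
  std::vector<std::uint64_t> slots;

  // stp a, b, [sp, #-0x10]! : a ends up at [sp], b at [sp + 8]
  void PushPair(std::uint64_t a, std::uint64_t b)
  {
    slots.push_back(b);
    slots.push_back(a);
  }

  // ldp a, b, [sp], #0x10 : a is loaded from [sp], b from [sp + 8]
  void PopPair(std::uint64_t& a, std::uint64_t& b)
  {
    assert(slots.size() >= 2);
    a = slots.back();
    slots.pop_back();
    b = slots.back();
    slots.pop_back();
  }
};

// Push (lr, xip1) as in the trampoline, then restore either in the same
// order or in the swapped order seen in the faulty trampoline.
inline Regs RoundTrip(const Regs& in, bool swapped_restore)
{
  Stack stack;
  stack.PushPair(in.x30, in.x17);  // stp lr, xip1, [sp, #-0x10]!
  Regs out;
  if (swapped_restore)
    stack.PopPair(out.x17, out.x30);  // ldp xip1, lr, [sp], #0x10
  else
    stack.PopPair(out.x30, out.x17);  // ldp lr, xip1, [sp], #0x10
  return out;
}
```

With the swapped restore, lr receives whatever xip1 held at the push, so if xip1 was 0 the following ret jumps to address 0.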

However, I reviewed the generators of the trampolines and I have no explanation for how this could have happened. Everything looks OK.
Both ABI_PushRegisters(gprs_to_push) and ABI_PopRegisters(gprs_to_push) should, from what I understand, push and pop registers in the very same order. The order is defined by an iterator from common:BitSet<u32>. I do not see how the iterator would iterate over the bitfield in two different orders.

In summary, something goes wrong non-deterministically (it fails at different points). Second observation: in the logs, the PC is set to an invalid location right after an exception is handled that sets up a trampoline for device access. Third observation: the last valid exception is almost always a write operation. Fourth observation: in one case I could trace the issue to a wrong trampoline (see above), but I cannot explain how this trampoline could have been generated incorrectly.

I appreciate any ideas on how to go on from here.
Maybe we violate something in the ABI specs and some Windows syscall overwrites our stack. We use the stack in an uncommon way, without proper stack guards or anything similar.