Dolphin, the GameCube and Wii emulator - Forums

Full Version: Mystic Speed Boost from Self-Built Version on Linux?
You're currently viewing a stripped down version of our content. View the full version with proper formatting.

Mastermind_X

Hi folks,

I'm observing a mysterious speed boost when I compare the pre-compiled Ubuntu builds to a self-compiled version of Dolphin. I was stunned by the fact that dolphin suffered from sever frame rate drops (~5 FPS on SSBM event match 15, ~25 FPS in Mario Kart Wii permanently). I wanted to investigate a bit in order to tweak the frame rate at least a little bit using compile time optimizations but
I was completely stumped when the frame rate drops immediately disappeared when running my first self-compiled dolphin build. I did not touch the cmake configuration nor the compiler flags, from what I saw it compiled using -O3 and -msse2. I'm on Debian Stretch, gcc version 4.9.3. I compared pre-built 4.0-7945, pre-built 5.0-rc27 and self-built git (master, ec28d7df924f427fc9ae4830c520aaf121cb3713)

Now my obvious question would be whether there is a simple explanation why the official dolphin build is so much slower than my n00bish compiled version. Does the build-bot compile code for a generic x86_64 machine without any SIMD instructions?

Cheers,
mx

Specs:
Intel® Core™ i7-4800MQ CPU (4 Hyper-Threaded @ 3.7 GHz, 6 MB Cache)
On-Chip GPU: Intel HD 4600
Discrete GPU: GT 730M (should be irrelevant as the needed kmods are currently not installed (different painful story ...))
Our buildbot compiles with -O2 by default iirc. SSE2 is forced-enabled for all x64 builds, so this is no issue.

To be honest, this should have almost no effect as we spend most of our time within just-in-time generated code. So the optimization level of the compiler doesn't affect us much.

Could you compare the output of "perf top" with both builds? Maybe there is a difference somewhere... Also be careful to test exactly the same, eg just opening the video setting window while the emulation is running affect the used settings a lot :/

About your painful story, did you try bumblebee?

Mastermind_X

Hi degasus,

thanks for your reply! I agree, the quality of the JITed code should not be influenced by compiler flags to much. Nevertheless, had there been any efforts to generate code that makes use of vector instructions and if not do you think it would be worth a try? Sorry for asking if that's already the case, I didn't have time to look into dolphin's internals (yet).
Also, I'll run some measurements using perf once I get home from the office. Do you have any specific hardware events in mind or should I just investigate on how much time the CPU spends within what parts of dolphin?

(10-09-2015, 08:54 PM)degasus Wrote: [ -> ]About your painful story, did you try bumblebee?

Yes, I did. (https://github.com/Bumblebee-Project/bbswitch/issues/78 and 1)

Basically a lost cause.
We use SSE for the PPC float handling. The GC PPC has SIMD with 2-float width, but else we don't vectorize ourself. Our Jit doesn't have any optimization passes, it's an one-pass compiler from PPC to x64.
We use some special instructions up to AVX, but they don't speed up much. The biggest performance hit is not vectoring but because of strange PPC behavior. If you want to read a bit more about this weird behavior, start on page 6.
There are plans to implement a LLVM based jit, but it's a very long term task and right now, nobody knows if it will be faster at all...

For the measurement, I just hope to see one function with a much higher load on the ubuntu builds. I don't want to belive this is a general slowdown over all called C functions...

Mastermind_X

Hi degasus,

sorry this took so long, busy times.

First, the measurements:

In order to minimize unwanted high-latency side-effects, I created a ramdisk containing the ISO (RMGP01) and set dolphin's CPU affinity to 7. (My machine has 4 hyper-threaded cores, 7 refers to the last one, just to avoid confusion.). I then measured only this one CPU using perf and let it take 500k samples (just cycle count) in the save file selection dialog, which appears to me to be deterministic when rendering the effects in the background. I also set dolphin's internal resolution to 2x, to create some workload for the CPU.

Code:
$ taskset 0x80 ./ubuntu_build/usr/games/dolphin-emu-nogui tmp/RMGP01.wbfs
$ perf top -cpu=7 > results_ubuntu
$ taskset 0x80 ./git_build/dolphin/Build5/Binaries/dolphin-emu-nogui tmp/RMGP01.wbfs
$ perf top -cpu=7 > results_git

Results:
Code:
$ cat results_git
  5,05%  dolphin-emu-nogui              [.] OGL::ProgramShaderCache::UploadConstants
  3,49%  [kernel]                       [k] __i915_wait_request
  1,99%  dolphin-emu-nogui              [.] OpcodeDecoder_Run<false>
  1,56%  [kernel]                       [k] i915_parse_cmds
  1,50%  libpthread-2.19.so             [.] pthread_mutex_lock
  1,49%  libc-2.19.so                   [.] memset
  1,19%  dolphin-emu-nogui              [.] GetCRC32
  1,13%  dolphin-emu-nogui              [.] VertexLoaderManager::RunVertices
  1,11%  dolphin-emu-nogui              [.] RunGpuLoop
  1,09%  [kernel]                       [k] gen6_ring_get_seqno
  1,02%  libdrm_intel.so.1.0.0          [.] 0x00000000000060ef
  1,02%  dolphin-emu-nogui              [.] TextureCache::Load
  1,00%  dolphin-emu-nogui              [.] CommandProcessor::GatherPipeBursted
  1,00%  perf-3862.map                  [.] 0x000000004187608e
  0,88%  dolphin-emu-nogui              [.] LoadBPReg
  0,82%  dolphin-emu-nogui              [.] JitInterface::CompileExceptionCheck
  0,71%  dolphin-emu-nogui              [.] GeneratePixelShader<ShaderUid<pixel_shader_uid_data> >
  0,67%  dolphin-emu-nogui              [.] CommandProcessor::SetCPStatusFromGPU
  0,66%  libc-2.19.so                   [.] __memcpy_sse2_unaligned
  0,65%  dolphin-emu-nogui              [.] VertexManager::CalculateZSlope
  0,60%  dolphin-emu-nogui              [.] OGL::ProgramShaderCache::SetShader
  0,59%  perf-3862.map                  [.] 0x000000004263efe6
  0,58%  dolphin-emu-nogui              [.] OGL::Renderer::UpdateEFBCache
  0,55%  dolphin-emu-nogui              [.] ReadDataFromFifo
  0,51%  perf-3862.map                  [.] 0x0000000042a7a9e1
  0,51%  dolphin-emu-nogui              [.] IndexGenerator::AddStrip<true>
  0,44%  perf-3862.map                  [.] 0x0000000042a7864d
  0,43%  libpthread-2.19.so             [.] __pthread_mutex_unlock_usercnt
  0,41%  libc-2.19.so                   [.] __memcmp_sse4_1
  0,34%  dolphin-emu-nogui              [.] LoadXFReg
  0,34%  [kernel]                       [k] copy_user_enhanced_fast_string
  0,34%  dolphin-emu-nogui              [.] GPFifo::Write64
  0,32%  libdrm_intel.so.1.0.0          [.] 0x000000000000853c
  0,32%  dolphin-emu-nogui              [.] VertexManager::PrepareForAdditionalData
  0,31%  [kernel]                       [k] i915_gem_execbuffer_relocate_entry
  0,31%  dolphin-emu-nogui              [.] GetVertexShaderUid
  0,30%  dolphin-emu-nogui              [.] GPFifo::FastCheckGatherPipe
  0,29%  i965_dri.so                    [.] 0x000000000033bca8
  0,27%  perf-3862.map                  [.] 0x0000000042a786f1
  0,26%  dolphin-emu-nogui              [.] ZeldaAudioRenderer::ApplyReverb
  0,26%  [kernel]                       [k] __list_del_entry
  0,25%  dolphin-emu-nogui              [.] VertexShaderManager::TransformToClipSpace
  0,24%  libc-2.19.so                   [.] _int_malloc
  0,23%  perf-3862.map                  [.] 0x0000000042a7b37e
  0,23%  dolphin-emu-nogui              [.] CommandProcessor::SetCPStatusFromCPU
  0,23%  libc-2.19.so                   [.] __memmove_ssse3_back
  0,22%  dolphin-emu-nogui              [.] Renderer::EFBToScaledX
  0,22%  libdrm_intel.so.1.0.0          [.] 0x00000000000055ae
  0,21%  dolphin-emu-nogui              [.] VertexManager::Flush
  0,20%  perf-3862.map                  [.] 0x000000004187617a
  0,20%  dolphin-emu-nogui              [.] TextureCache::DoPartialTextureUpdates
  0,20%  perf-3862.map                  [.] 0x0000000042a7a9f3

Code:
$ cat results_ubuntu
  2,13%  [kernel]                 [k] __i915_wait_request
  1,46%  dolphin-emu-nogui        [.] 0x000000000036c420
  0,99%  dolphin-emu-nogui        [.] std::less<unsigned long>::operator()
  0,83%  libc-2.19.so             [.] __memcpy_sse2_unaligned
  0,76%  dolphin-emu-nogui        [.] 0x0000000000440c3b
  0,71%  [kernel]                 [k] i915_parse_cmds
  0,68%  [kernel]                 [k] gen6_ring_get_seqno
  0,61%  libc-2.19.so             [.] memset
  0,60%  libpthread-2.19.so       [.] pthread_mutex_lock
  0,53%  dolphin-emu-nogui        [.] 0x0000000000440c14
  0,51%  perf-5085.map            [.] 0x00000000428d808e
  0,50%  dolphin-emu-nogui        [.] 0x0000000000440ca6
  0,49%  dolphin-emu-nogui        [.] 0x0000000000440cbb
  0,47%  dolphin-emu-nogui        [.] 0x0000000000440ccc
  0,47%  dolphin-emu-nogui        [.] 0x0000000000440d6a
  0,46%  libdrm_intel.so.1.0.0    [.] 0x00000000000060ef
  0,45%  dolphin-emu-nogui        [.] 0x000000000031172c
  0,41%  dolphin-emu-nogui        [.] 0x0000000000314a1c
  0,36%  perf-5085.map            [.] 0x0000000040a74606
  0,35%  dolphin-emu-nogui        [.] std::swap<int>
  0,33%  dolphin-emu-nogui        [.] 0x0000000000314a09
  0,33%  perf-5085.map            [.] 0x0000000040eb57ed
  0,31%  dolphin-emu-nogui        [.] std::__atomic_base<unsigned char*>::operator+=
  0,29%  dolphin-emu-nogui        [.] 0x0000000000440c0f
  0,29%  dolphin-emu-nogui        [.] 0x0000000000440c9b
  0,29%  perf-5085.map            [.] 0x0000000040eb3459
  0,28%  dolphin-emu-nogui        [.] std::_Hashtable<unsigned int, unsigned int, std::allocator<unsigned int>, std::__detail::_Identity, std::equal_to<unsigned int>, std::hash<unsigned
  0,27%  libc-2.19.so             [.] __memcmp_sse4_1
  0,27%  dolphin-emu-nogui        [.] std::__atomic_base<unsigned char*>::operator unsigned char*
  0,26%  dolphin-emu-nogui        [.] 0x0000000000145cdb
  0,26%  dolphin-emu-nogui        [.] 0x0000000000440cd6
  0,26%  dolphin-emu-nogui        [.] 0x0000000000440da0
  0,25%  dolphin-emu-nogui        [.] 0x000000000036c400
  0,24%  dolphin-emu-nogui        [.] 0x0000000000440cbe
  0,23%  libc-2.19.so             [.] __memmove_ssse3_back
  0,23%  dolphin-emu-nogui        [.] 0x00000000003149fc
  0,22%  dolphin-emu-nogui        [.] 0x00000000003b14dc
  0,21%  dolphin-emu-nogui        [.] 0x000000000036c3a9
  0,21%  dolphin-emu-nogui        [.] 0x0000000000440d74
  0,19%  dolphin-emu-nogui        [.] 0x00000000003149dd
  0,19%  libpthread-2.19.so       [.] __pthread_mutex_unlock_usercnt
  0,19%  dolphin-emu-nogui        [.] 0x0000000000440c0b
  0,18%  dolphin-emu-nogui        [.] std::array<short, 88ul>::operator[]
  0,18%  dolphin-emu-nogui        [.] 0x0000000000440d44
  0,18%  dolphin-emu-nogui        [.] 0x0000000000440bf5
  0,17%  dolphin-emu-nogui        [.] 0x0000000000440c02
  0,17%  dolphin-emu-nogui        [.] 0x000000000011a143
  0,17%  dolphin-emu-nogui        [.] 0x0000000000440c19
  0,17%  i965_dri.so              [.] 0x0000000000184e56
  0,17%  dolphin-emu-nogui        [.] 0x0000000000320ad2
  0,17%  dolphin-emu-nogui        [.] 0x000000000015042a



Observations:
  1. General performance is at 100% for the git build and at roughly 60% for the ubuntu build.
  2. I cannot really make sense of the addresses perf gives. Interpreting them as absolute values ends up in unallocated memory (checked statically and wihin gdb), interpreting them as relative addresses to the start of the text section does not lead me to anything useful either (not aligned to function beginnings). Before I start investing more time into this problem: Do you have an idea how I could resolve the addresses within the ubuntu dolphin build in an easier manner than reverse engineering the binary? Do you maybe have a ubuntu build with symbols?
  3. In one of my earlier measurements, I noticed that dolphin spends a good amount of time waiting for fgets to return. The ramdisk solution provided a 10% instant speed boost, you might want to think about creating a new config option that makes dolphin read the whole image into the memory instead of reading it from the disk.

I may also happily provide the two binaries and my .dolphin config, if you need them.

Cheers,
MX
(10-23-2015, 03:37 AM)Mastermind_X Wrote: [ -> ]In one of my earlier measurements, I noticed that dolphin spends a good amount of time waiting for fgets to return. The ramdisk solution provided a 10% instant speed boost, you might want to think about creating a new config option that makes dolphin read the whole image into the memory instead of reading it from the disk.

Dolphin currently does disc reads on the same thread as all CPU emulation, so waiting for reads to finish unfortunately blocks a lot of other things from executing. To get rid of that problem, I've been working on moving disc reads to a separate thread: https://github.com/dolphin-emu/dolphin/pull/2149

It should be fast enough to make disc reads non-noticeable without the drawbacks of reading the whole image into memory (like increased startup times and increased RAM usage), but it hasn't been tested much yet.
This sounds like you're running with two different kind of configuration. We usually load a game-ini on boot, maybe this is loaded in one run but not in the other? They may slow down or speed up a lot for required emulation features.

Do both builds use the same amount of CPU time? Maybe ubuntu is blocked a lot. So if you open dolphin within gdb (hint: handle SIGSEGV nostop noprint) and just look at the backtrace randomly, is there a general mutex/lock which you're usually waiting in the "active" threads (GPU + CPU emulation)?

> I cannot really make sense of the addresses perf gives.
perf has no clue how to deal with JITed code. But good luck, we have an option to write a map file. Just set "PerfMapDir = /tmp" in ~/.dolphin-emu/Config/Dolphin.ini. Keep in mind that the file /tmp/perf-XXXX.map must have the same UID as used for perf. So either chown this file, or use perf without sudo.
But bad luck, this won't help for missing symbols within the dolphin package itself...

@JosJuice Is there a way to use async reads in the same way as the wii does itself? It would be less overhead compared to another thread.
(10-23-2015, 06:15 AM)degasus Wrote: [ -> ]@JosJuice Is there a way to use async reads in the same way as the wii does itself? It would be less overhead compared to another thread.

You mean calling something to start the read and then getting notified once it's done? There isn't a good cross-platform way to do that in C/C++ as far as I know. Besides, being able to do decryption and decompression on the DVD thread instead of the CPU thread is a nice bonus, and that's actually how the Wii handles decryption if you consider Starlet to be the equivalent of the DVD thread.