• Login
  • Register
  • Dolphin Forums
  • Home
  • FAQ
  • Download
  • Wiki
  • Code


Dolphin, the GameCube and Wii emulator - Forums › Dolphin Emulator Discussion and Support › Development Discussion v
« Previous 1 … 31 32 33 34 35 … 117 Next »

Mystic Speed Boost from Self-Built Version on Linux?
View New Posts | View Today's Posts

Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Threaded Mode
Mystic Speed Boost from Self-Built Version on Linux?
10-08-2015, 08:22 PM
#1
Mastermind_X
Unregistered
 
Hi folks,

I'm observing a mysterious speed boost when I compare the pre-compiled Ubuntu builds to a self-compiled version of Dolphin. I was stunned by the fact that dolphin suffered from sever frame rate drops (~5 FPS on SSBM event match 15, ~25 FPS in Mario Kart Wii permanently). I wanted to investigate a bit in order to tweak the frame rate at least a little bit using compile time optimizations but
I was completely stumped when the frame rate drops immediately disappeared when running my first self-compiled dolphin build. I did not touch the cmake configuration nor the compiler flags, from what I saw it compiled using -O3 and -msse2. I'm on Debian Stretch, gcc version 4.9.3. I compared pre-built 4.0-7945, pre-built 5.0-rc27 and self-built git (master, ec28d7df924f427fc9ae4830c520aaf121cb3713)

Now my obvious question would be whether there is a simple explanation why the official dolphin build is so much slower than my n00bish compiled version. Does the build-bot compile code for a generic x86_64 machine without any SIMD instructions?

Cheers,
mx

Specs:
Intel® Core™ i7-4800MQ CPU (4 Hyper-Threaded @ 3.7 GHz, 6 MB Cache)
On-Chip GPU: Intel HD 4600
Discrete GPU: GT 730M (should be irrelevant as the needed kmods are currently not installed (different painful story ...))
Reply
10-09-2015, 08:54 PM
#2
degasus Offline
Developer
**********
Developers (Some Administrators and Super Moderators)
Posts: 1,828
Threads: 10
Joined: May 2012
Our buildbot compiles with -O2 by default iirc. SSE2 is forced-enabled for all x64 builds, so this is no issue.

To be honest, this should have almost no effect as we spend most of our time within just-in-time generated code. So the optimization level of the compiler doesn't affect us much.

Could you compare the output of "perf top" with both builds? Maybe there is a difference somewhere... Also be careful to test exactly the same, eg just opening the video setting window while the emulation is running affect the used settings a lot :/

About your painful story, did you try bumblebee?
Find
Reply
10-10-2015, 12:23 AM
#3
Mastermind_X
Unregistered
 
Hi degasus,

thanks for your reply! I agree, the quality of the JITed code should not be influenced by compiler flags to much. Nevertheless, had there been any efforts to generate code that makes use of vector instructions and if not do you think it would be worth a try? Sorry for asking if that's already the case, I didn't have time to look into dolphin's internals (yet).
Also, I'll run some measurements using perf once I get home from the office. Do you have any specific hardware events in mind or should I just investigate on how much time the CPU spends within what parts of dolphin?

(10-09-2015, 08:54 PM)degasus Wrote: About your painful story, did you try bumblebee?

Yes, I did. (https://github.com/Bumblebee-Project/bbswitch/issues/78 and 1)

Basically a lost cause.
Reply
10-10-2015, 01:10 AM (This post was last modified: 10-10-2015, 01:11 AM by degasus.)
#4
degasus Offline
Developer
**********
Developers (Some Administrators and Super Moderators)
Posts: 1,828
Threads: 10
Joined: May 2012
We use SSE for the PPC float handling. The GC PPC has SIMD with 2-float width, but else we don't vectorize ourself. Our Jit doesn't have any optimization passes, it's an one-pass compiler from PPC to x64.
We use some special instructions up to AVX, but they don't speed up much. The biggest performance hit is not vectoring but because of strange PPC behavior. If you want to read a bit more about this weird behavior, start on page 6.
There are plans to implement a LLVM based jit, but it's a very long term task and right now, nobody knows if it will be faster at all...

For the measurement, I just hope to see one function with a much higher load on the ubuntu builds. I don't want to belive this is a general slowdown over all called C functions...
Find
Reply
10-23-2015, 03:37 AM
#5
Mastermind_X
Unregistered
 
Hi degasus,

sorry this took so long, busy times.

First, the measurements:

In order to minimize unwanted high-latency side-effects, I created a ramdisk containing the ISO (RMGP01) and set dolphin's CPU affinity to 7. (My machine has 4 hyper-threaded cores, 7 refers to the last one, just to avoid confusion.). I then measured only this one CPU using perf and let it take 500k samples (just cycle count) in the save file selection dialog, which appears to me to be deterministic when rendering the effects in the background. I also set dolphin's internal resolution to 2x, to create some workload for the CPU.

Code:
$ taskset 0x80 ./ubuntu_build/usr/games/dolphin-emu-nogui tmp/RMGP01.wbfs
$ perf top -cpu=7 > results_ubuntu
$ taskset 0x80 ./git_build/dolphin/Build5/Binaries/dolphin-emu-nogui tmp/RMGP01.wbfs
$ perf top -cpu=7 > results_git

Results:
Code:
$ cat results_git
  5,05%  dolphin-emu-nogui              [.] OGL::ProgramShaderCache::UploadConstants
  3,49%  [kernel]                       [k] __i915_wait_request
  1,99%  dolphin-emu-nogui              [.] OpcodeDecoder_Run<false>
  1,56%  [kernel]                       [k] i915_parse_cmds
  1,50%  libpthread-2.19.so             [.] pthread_mutex_lock
  1,49%  libc-2.19.so                   [.] memset
  1,19%  dolphin-emu-nogui              [.] GetCRC32
  1,13%  dolphin-emu-nogui              [.] VertexLoaderManager::RunVertices
  1,11%  dolphin-emu-nogui              [.] RunGpuLoop
  1,09%  [kernel]                       [k] gen6_ring_get_seqno
  1,02%  libdrm_intel.so.1.0.0          [.] 0x00000000000060ef
  1,02%  dolphin-emu-nogui              [.] TextureCache::Load
  1,00%  dolphin-emu-nogui              [.] CommandProcessor::GatherPipeBursted
  1,00%  perf-3862.map                  [.] 0x000000004187608e
  0,88%  dolphin-emu-nogui              [.] LoadBPReg
  0,82%  dolphin-emu-nogui              [.] JitInterface::CompileExceptionCheck
  0,71%  dolphin-emu-nogui              [.] GeneratePixelShader<ShaderUid<pixel_shader_uid_data> >
  0,67%  dolphin-emu-nogui              [.] CommandProcessor::SetCPStatusFromGPU
  0,66%  libc-2.19.so                   [.] __memcpy_sse2_unaligned
  0,65%  dolphin-emu-nogui              [.] VertexManager::CalculateZSlope
  0,60%  dolphin-emu-nogui              [.] OGL::ProgramShaderCache::SetShader
  0,59%  perf-3862.map                  [.] 0x000000004263efe6
  0,58%  dolphin-emu-nogui              [.] OGL::Renderer::UpdateEFBCache
  0,55%  dolphin-emu-nogui              [.] ReadDataFromFifo
  0,51%  perf-3862.map                  [.] 0x0000000042a7a9e1
  0,51%  dolphin-emu-nogui              [.] IndexGenerator::AddStrip<true>
  0,44%  perf-3862.map                  [.] 0x0000000042a7864d
  0,43%  libpthread-2.19.so             [.] __pthread_mutex_unlock_usercnt
  0,41%  libc-2.19.so                   [.] __memcmp_sse4_1
  0,34%  dolphin-emu-nogui              [.] LoadXFReg
  0,34%  [kernel]                       [k] copy_user_enhanced_fast_string
  0,34%  dolphin-emu-nogui              [.] GPFifo::Write64
  0,32%  libdrm_intel.so.1.0.0          [.] 0x000000000000853c
  0,32%  dolphin-emu-nogui              [.] VertexManager::PrepareForAdditionalData
  0,31%  [kernel]                       [k] i915_gem_execbuffer_relocate_entry
  0,31%  dolphin-emu-nogui              [.] GetVertexShaderUid
  0,30%  dolphin-emu-nogui              [.] GPFifo::FastCheckGatherPipe
  0,29%  i965_dri.so                    [.] 0x000000000033bca8
  0,27%  perf-3862.map                  [.] 0x0000000042a786f1
  0,26%  dolphin-emu-nogui              [.] ZeldaAudioRenderer::ApplyReverb
  0,26%  [kernel]                       [k] __list_del_entry
  0,25%  dolphin-emu-nogui              [.] VertexShaderManager::TransformToClipSpace
  0,24%  libc-2.19.so                   [.] _int_malloc
  0,23%  perf-3862.map                  [.] 0x0000000042a7b37e
  0,23%  dolphin-emu-nogui              [.] CommandProcessor::SetCPStatusFromCPU
  0,23%  libc-2.19.so                   [.] __memmove_ssse3_back
  0,22%  dolphin-emu-nogui              [.] Renderer::EFBToScaledX
  0,22%  libdrm_intel.so.1.0.0          [.] 0x00000000000055ae
  0,21%  dolphin-emu-nogui              [.] VertexManager::Flush
  0,20%  perf-3862.map                  [.] 0x000000004187617a
  0,20%  dolphin-emu-nogui              [.] TextureCache::DoPartialTextureUpdates
  0,20%  perf-3862.map                  [.] 0x0000000042a7a9f3

Code:
$ cat results_ubuntu
  2,13%  [kernel]                 [k] __i915_wait_request
  1,46%  dolphin-emu-nogui        [.] 0x000000000036c420
  0,99%  dolphin-emu-nogui        [.] std::less<unsigned long>::operator()
  0,83%  libc-2.19.so             [.] __memcpy_sse2_unaligned
  0,76%  dolphin-emu-nogui        [.] 0x0000000000440c3b
  0,71%  [kernel]                 [k] i915_parse_cmds
  0,68%  [kernel]                 [k] gen6_ring_get_seqno
  0,61%  libc-2.19.so             [.] memset
  0,60%  libpthread-2.19.so       [.] pthread_mutex_lock
  0,53%  dolphin-emu-nogui        [.] 0x0000000000440c14
  0,51%  perf-5085.map            [.] 0x00000000428d808e
  0,50%  dolphin-emu-nogui        [.] 0x0000000000440ca6
  0,49%  dolphin-emu-nogui        [.] 0x0000000000440cbb
  0,47%  dolphin-emu-nogui        [.] 0x0000000000440ccc
  0,47%  dolphin-emu-nogui        [.] 0x0000000000440d6a
  0,46%  libdrm_intel.so.1.0.0    [.] 0x00000000000060ef
  0,45%  dolphin-emu-nogui        [.] 0x000000000031172c
  0,41%  dolphin-emu-nogui        [.] 0x0000000000314a1c
  0,36%  perf-5085.map            [.] 0x0000000040a74606
  0,35%  dolphin-emu-nogui        [.] std::swap<int>
  0,33%  dolphin-emu-nogui        [.] 0x0000000000314a09
  0,33%  perf-5085.map            [.] 0x0000000040eb57ed
  0,31%  dolphin-emu-nogui        [.] std::__atomic_base<unsigned char*>::operator+=
  0,29%  dolphin-emu-nogui        [.] 0x0000000000440c0f
  0,29%  dolphin-emu-nogui        [.] 0x0000000000440c9b
  0,29%  perf-5085.map            [.] 0x0000000040eb3459
  0,28%  dolphin-emu-nogui        [.] std::_Hashtable<unsigned int, unsigned int, std::allocator<unsigned int>, std::__detail::_Identity, std::equal_to<unsigned int>, std::hash<unsigned
  0,27%  libc-2.19.so             [.] __memcmp_sse4_1
  0,27%  dolphin-emu-nogui        [.] std::__atomic_base<unsigned char*>::operator unsigned char*
  0,26%  dolphin-emu-nogui        [.] 0x0000000000145cdb
  0,26%  dolphin-emu-nogui        [.] 0x0000000000440cd6
  0,26%  dolphin-emu-nogui        [.] 0x0000000000440da0
  0,25%  dolphin-emu-nogui        [.] 0x000000000036c400
  0,24%  dolphin-emu-nogui        [.] 0x0000000000440cbe
  0,23%  libc-2.19.so             [.] __memmove_ssse3_back
  0,23%  dolphin-emu-nogui        [.] 0x00000000003149fc
  0,22%  dolphin-emu-nogui        [.] 0x00000000003b14dc
  0,21%  dolphin-emu-nogui        [.] 0x000000000036c3a9
  0,21%  dolphin-emu-nogui        [.] 0x0000000000440d74
  0,19%  dolphin-emu-nogui        [.] 0x00000000003149dd
  0,19%  libpthread-2.19.so       [.] __pthread_mutex_unlock_usercnt
  0,19%  dolphin-emu-nogui        [.] 0x0000000000440c0b
  0,18%  dolphin-emu-nogui        [.] std::array<short, 88ul>::operator[]
  0,18%  dolphin-emu-nogui        [.] 0x0000000000440d44
  0,18%  dolphin-emu-nogui        [.] 0x0000000000440bf5
  0,17%  dolphin-emu-nogui        [.] 0x0000000000440c02
  0,17%  dolphin-emu-nogui        [.] 0x000000000011a143
  0,17%  dolphin-emu-nogui        [.] 0x0000000000440c19
  0,17%  i965_dri.so              [.] 0x0000000000184e56
  0,17%  dolphin-emu-nogui        [.] 0x0000000000320ad2
  0,17%  dolphin-emu-nogui        [.] 0x000000000015042a



Observations:
  1. General performance is at 100% for the git build and at roughly 60% for the ubuntu build.
  2. I cannot really make sense of the addresses perf gives. Interpreting them as absolute values ends up in unallocated memory (checked statically and wihin gdb), interpreting them as relative addresses to the start of the text section does not lead me to anything useful either (not aligned to function beginnings). Before I start investing more time into this problem: Do you have an idea how I could resolve the addresses within the ubuntu dolphin build in an easier manner than reverse engineering the binary? Do you maybe have a ubuntu build with symbols?
  3. In one of my earlier measurements, I noticed that dolphin spends a good amount of time waiting for fgets to return. The ramdisk solution provided a 10% instant speed boost, you might want to think about creating a new config option that makes dolphin read the whole image into the memory instead of reading it from the disk.

I may also happily provide the two binaries and my .dolphin config, if you need them.

Cheers,
MX
Reply
10-23-2015, 04:49 AM
#6
JosJuice Offline
Developer
**********
Developers (Some Administrators and Super Moderators)
Posts: 9,074
Threads: 7
Joined: Oct 2014
(10-23-2015, 03:37 AM)Mastermind_X Wrote: In one of my earlier measurements, I noticed that dolphin spends a good amount of time waiting for fgets to return. The ramdisk solution provided a 10% instant speed boost, you might want to think about creating a new config option that makes dolphin read the whole image into the memory instead of reading it from the disk.

Dolphin currently does disc reads on the same thread as all CPU emulation, so waiting for reads to finish unfortunately blocks a lot of other things from executing. To get rid of that problem, I've been working on moving disc reads to a separate thread: https://github.com/dolphin-emu/dolphin/pull/2149

It should be fast enough to make disc reads non-noticeable without the drawbacks of reading the whole image into memory (like increased startup times and increased RAM usage), but it hasn't been tested much yet.
Find
Reply
10-23-2015, 06:15 AM
#7
degasus Offline
Developer
**********
Developers (Some Administrators and Super Moderators)
Posts: 1,828
Threads: 10
Joined: May 2012
This sounds like you're running with two different kind of configuration. We usually load a game-ini on boot, maybe this is loaded in one run but not in the other? They may slow down or speed up a lot for required emulation features.

Do both builds use the same amount of CPU time? Maybe ubuntu is blocked a lot. So if you open dolphin within gdb (hint: handle SIGSEGV nostop noprint) and just look at the backtrace randomly, is there a general mutex/lock which you're usually waiting in the "active" threads (GPU + CPU emulation)?

> I cannot really make sense of the addresses perf gives.
perf has no clue how to deal with JITed code. But good luck, we have an option to write a map file. Just set "PerfMapDir = /tmp" in ~/.dolphin-emu/Config/Dolphin.ini. Keep in mind that the file /tmp/perf-XXXX.map must have the same UID as used for perf. So either chown this file, or use perf without sudo.
But bad luck, this won't help for missing symbols within the dolphin package itself...

@JosJuice Is there a way to use async reads in the same way as the wii does itself? It would be less overhead compared to another thread.
Find
Reply
10-23-2015, 07:01 AM
#8
JosJuice Offline
Developer
**********
Developers (Some Administrators and Super Moderators)
Posts: 9,074
Threads: 7
Joined: Oct 2014
(10-23-2015, 06:15 AM)degasus Wrote: @JosJuice Is there a way to use async reads in the same way as the wii does itself? It would be less overhead compared to another thread.

You mean calling something to start the read and then getting notified once it's done? There isn't a good cross-platform way to do that in C/C++ as far as I know. Besides, being able to do decryption and decompression on the DVD thread instead of the CPU thread is a nice bonus, and that's actually how the Wii handles decryption if you consider Starlet to be the equivalent of the DVD thread.
Find
Reply
« Next Oldest | Next Newest »


  • View a Printable Version
Forum Jump:


Users browsing this thread: 1 Guest(s)



Powered By MyBB | Theme by Fragma