Why doesn't adding more cores to my CPU help with the performance in Dolphin?
02-15-2018, 01:49 AM (This post was last modified: 02-17-2018, 10:58 PM by mstreurman.)
#1
mstreurman Offline
Above and Beyond
*******
Posts: 1,239
Threads: 11
Joined: Nov 2015
First of all, I thought General Discussion was the correct place to post this; if not, I apologize and ask a mod to move it to the correct forum.
-------------------------------------------------------------------------------------------------
I used to be a trainer for IBM, and after explaining that more cores don't necessarily mean more performance and speed in all applications, I often got the question: "WHY?" So I had this awesome game/practical experiment for the trainees to do.

The Set up:

1. I call myself the "Program" or "Application"
2. I call the blackboard/whiteboard "L2 Cache"
3. I call my students "Cores"
4. I call the piece of paper I give them "L1 cache"
5. I call the calculations on the L1 cache "Threads"

During this practical experiment the Cores must stay seated at their desks. The Cores are only allowed to communicate answers to each other through the L2 Cache.

The Program gives each Core an L1 cache with one of the following Threads:
Core1: A+B=C
Core2: X+Y=Z
Core3: C+Z=N
Core4: A+N=Q

The Program updates the L2 Cache with the following information:
A=1
B=2
X=3
Y=4

Then the Program tells the Cores to update the L2 Cache with their name and Thread.

The execution:

The Program asks "Please give me the value of Q"

So Core4 starts calculating but gets stuck: he does A+N=Q > 1+N=Q, so he doesn't have an answer. He knows by looking at the L2 Cache that Core3 should know the value of N, so he asks: "Could you please write the value of N in the L2 Cache so I know it too?" and waits.

Core3 starts calculating but finds out almost immediately that he is stuck, because he doesn't know anything except N=C+Z. He looks at the L2 Cache, sees that Core1 and Core2 know the answers for C and Z, and asks them: "Please write the values of C and Z in the L2 Cache so I know them too" and waits.

Core1 and Core2 find the values needed for their Threads in the L2 Cache and start working:
Core1 does A+B=C=1+2=3, walks to the L2 Cache and updates it with C=3
Core2 does X+Y=Z=3+4=7, walks to the L2 Cache at the same time as Core1 and updates it with Z=7

Now Core3 sees the values of C and Z and starts doing his Thread C+Z=N=3+7=10 and walks to the L2 cache and updates it with N=10

Now Core4 sees the value of N and starts doing his Thread and does A+N=Q=1+10=11 and walks to the L2 cache and updates it with Q=11

Now the Program has the answer after about 2 minutes. As you can see, 2 of the Cores are doing nothing most of the time, because they have to wait until the other Cores are done with their Threads.
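
The four-Core run above can be sketched in code. This is only a minimal illustration of the classroom game, not of how any real program is written: the names run_four_cores, l2_cache and ready are made up for the example, and each "Core" is an ordinary Python thread that blocks until its inputs appear on the shared board.

```python
import threading

def run_four_cores():
    # The shared "L2 Cache": the board that every Core reads from and writes to.
    l2_cache = {"A": 1, "B": 2, "X": 3, "Y": 4}
    lock = threading.Lock()
    # One flag per value, raised once that value is written on the board.
    ready = {name: threading.Event() for name in "ABXYCZNQ"}
    for name in l2_cache:
        ready[name].set()
    finish_order = []

    def core(left, right, result):
        # Wait (doing nothing) until both inputs are on the board.
        ready[left].wait()
        ready[right].wait()
        with lock:
            l2_cache[result] = l2_cache[left] + l2_cache[right]
            finish_order.append(result)
        ready[result].set()

    threads = [
        threading.Thread(target=core, args=("A", "N", "Q")),  # Core4
        threading.Thread(target=core, args=("C", "Z", "N")),  # Core3
        threading.Thread(target=core, args=("A", "B", "C")),  # Core1
        threading.Thread(target=core, args=("X", "Y", "Z")),  # Core2
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return l2_cache, finish_order

if __name__ == "__main__":
    cache, order = run_four_cores()
    print(cache["Q"], order)
```

Core1 and Core2 can finish in either order, but N always comes after both of them and Q always comes last: the dependencies force most of the work to happen one step at a time, no matter how many Cores are available.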

-------------------------------------------------------------------------------------------------

Windows has a component called a "Scheduler". The scheduler decides which Thread goes to which Core to make things work as efficiently as possible.
So I clear the L2 Cache, leave only the basic information, give one of the Cores all the Threads, and tell him to write down every single step he takes in his L1 cache and to update the L2 Cache with the value of Q when he is done.
After about 1 minute he walks up to the L2 Cache and updates it with Q=11.
As you can see, this removes a lot of the waiting steps and gives the answer much quicker.
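
As a rough sketch of this second round, here is the same work done by one Core that keeps every intermediate value on his own sheet of paper. The function name run_single_core and the steps list are assumptions made for the illustration.

```python
def run_single_core(a, b, x, y):
    # The Core's private "L1 cache": every step is written down locally,
    # so nothing has to be fetched from or posted to the shared board.
    steps = []
    c = a + b
    steps.append(f"C={c}")
    z = x + y
    steps.append(f"Z={z}")
    n = c + z
    steps.append(f"N={n}")
    q = a + n
    steps.append(f"Q={q}")
    return q, steps

if __name__ == "__main__":
    q, steps = run_single_core(1, 2, 3, 4)
    print(q, steps)
```

The four additions are still done one after another, but there is no walking to the board and no waiting for anyone else between them.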

-------------------------------------------------------------------------------------------------

Finally I clear the L2 Cache, leave the basic information again, and give one Core an L1 cache with the following Thread: Q=(2*A)+B+X+Y. He updates the L2 Cache 15 seconds later with Q=11. This is called "Optimization".
This final step is the hardest one because it is rarely obvious where such optimizations can be made. Luckily, some optimizations can be done by the compiler and take little or no effort from the developer; others are harder and need out-of-the-box thinking from the developers themselves.
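
The optimized Thread comes from substituting the intermediate values into each other: Q = A+N = A+(C+Z) = A+(A+B)+(X+Y) = (2*A)+B+X+Y. A small check, with the hypothetical names q_chained and q_optimized, confirms that both forms always agree:

```python
def q_chained(a, b, x, y):
    # The original four Threads, run in dependency order.
    c = a + b
    z = x + y
    n = c + z
    return a + n

def q_optimized(a, b, x, y):
    # The single Thread from the final round of the experiment.
    return (2 * a) + b + x + y

if __name__ == "__main__":
    print(q_chained(1, 2, 3, 4), q_optimized(1, 2, 3, 4))
```

The gain in the experiment does not come from doing less arithmetic. It comes from no longer needing the intermediate values C, Z and N at all, so there is nothing left to look up or wait for.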

-------------------------------------------------------------------------------------------------

As you can all see in this practical experiment, the final step is the fastest, but only 1 Core works hard to produce values while the other Cores sit idle. This is of course a very simple explanation and in no way covers all the difficulties and intricacies of multi-core processing.

The way Dolphin handles multi-core CPUs is by giving all the tasks that are done on the Wii/GC CPU to a single Thread and all the tasks that are done by the Wii/GC GPU to another Thread (an extreme simplification). Windows will in turn see that these are very hard-working threads and assign each of them to its own Core. It is very hard to split the tasks handled by one of these threads across different Cores, because updating the L2 Cache takes extra time and can actually make things a lot slower. And even though the two threads run on different Cores, they often have to wait for each other because one needs a piece of information that is handled by the other.
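
To make the "waiting for each other" part concrete, here is a toy sketch of two threads connected by a queue. It is not Dolphin's code and does not use any Dolphin API; the command names "draw", "readback" and "stop" and the function run_emulated_frame are invented for the illustration.

```python
import queue
import threading

def run_emulated_frame(commands):
    fifo = queue.Queue()     # commands travelling from the CPU thread to the GPU thread
    results = queue.Queue()  # answers travelling back to the CPU thread
    readbacks = []

    def gpu_thread():
        drawn = 0
        while True:
            cmd = fifo.get()
            if cmd == "stop":
                break
            if cmd == "draw":
                drawn += 1
            elif cmd == "readback":
                # The CPU thread asked for something only the GPU thread knows.
                results.put(drawn)

    def cpu_thread():
        for cmd in commands:
            fifo.put(cmd)
            if cmd == "readback":
                # The CPU thread stalls here until the GPU thread has caught up.
                readbacks.append(results.get())
        fifo.put("stop")

    gpu = threading.Thread(target=gpu_thread)
    cpu = threading.Thread(target=cpu_thread)
    gpu.start()
    cpu.start()
    cpu.join()
    gpu.join()
    return readbacks

if __name__ == "__main__":
    print(run_emulated_frame(["draw", "draw", "readback", "draw", "readback"]))
```

As long as the CPU thread only sends "draw" commands, both threads can work at the same time. Every "readback" forces the CPU thread to sit idle until the other thread has processed everything queued before it, just like Core4 waiting at his desk for N.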

I know this was a huge wall of text and it is actually a lot easier to see/do this in real life and see the effects of this than by just reading about it, but this usually was quite well understood by all the trainees after doing this practical experiment.
Check my profile for up to date specs.
02-15-2018, 02:11 AM
#2
themanuel Offline
Parasitic Member of the Community
*****
Posts: 828
Threads: 63
Joined: Oct 2009
That's a nice demonstration. Thanks for sharing it.
Windows 10 Pro x64  |  i7-9700K @ 4.6-5.0GHz  |  MSI Z370 Gaming Plus  |  MSI RX 5700 8GB Factory-OC  |  16 GB DDR4-3000
02-15-2018, 07:41 AM
#3
Shonumi Offline
Linux User/Tester
**********
Administrators
Posts: 6,503
Threads: 55
Joined: Dec 2011
This is good. I like it. Maybe the official FAQ can link to it.
02-16-2018, 01:17 AM
#4
chumpz Offline
Member
***
Posts: 203
Threads: 72
Joined: Dec 2017
mstreurman anticipated that I was going to ask this. LOL :)
Windows 10 Home
Intel Core I5 8300H @2.3Ghz up to 4Ghz
12GB DDR4 RAM
GTX 1050 4GB
5400 RPM SSHD
Dell G7 15 7588
02-17-2018, 01:29 AM
#5
DrHouse64 Offline
A woman yet a man, a man yet a woman
****
Posts: 343
Threads: 18
Joined: Jun 2013
Good explanation.

I remember now: there was a PR by phire that was more or less a placeholder. The idea was to drop dual core and write a threaded "single core" that is more accurate and able to use several threads when it can.
I guess no one has worked on this yet? This is not a "work on this please" request, I'm simply curious. I like this idea.
From France with love.
Laptop ROG : W10 / Ryzen 7 4800HS @2.9 GHz (4.2 GHz Turbo disabled unless necessary for better thermals) / 16 Go DDR4 / RTX 2060 MaxQ (6 Go GDDR6)
02-17-2018, 01:38 AM
#6
degasus Offline
Developer
**********
Developers (Some Administrators and Super Moderators)
Posts: 1,827
Threads: 10
Joined: May 2012
(02-17-2018, 01:29 AM)DrHouse64 Wrote: I remember now: there was a PR by phire that was more or less a placeholder. The idea was to drop dual core and write a threaded "single core" that is more accurate and able to use several threads when it can.
I guess no one has worked on this yet? This is not a "work on this please" request, I'm simply curious. I like this idea.

The idea was to write some kind of queue for rendering tasks. So the GPU emulation shall be done on the same thread as the CPU emulation, but the backend tasks shall be threaded. You can already check its performance by using the null backend on single core, though without rendering.

The point is how to queue such tasks. https://github.com/dolphin-emu/dolphin/pull/6042 is a proposed way for serialization, but without the threading for now. So there is indeed progress on this task.
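
As a very rough illustration of what "serializing rendering tasks" can mean in general, the sketch below packs tasks into a flat byte buffer and replays them later. It does not show the format or the code of the linked PR; the opcodes OP_CLEAR and OP_DRAW, the record layout, and the functions serialize and replay are all assumptions made up for the example.

```python
import struct

# Hypothetical opcodes, invented for this sketch.
OP_CLEAR = 1
OP_DRAW = 2

# Each record: one unsigned byte for the opcode, one unsigned 32-bit argument.
RECORD = "<BI"

def serialize(tasks):
    # Flatten (opcode, argument) pairs into one byte buffer that could be
    # handed to another thread as a single unit.
    buf = bytearray()
    for opcode, arg in tasks:
        buf += struct.pack(RECORD, opcode, arg)
    return bytes(buf)

def replay(buf):
    # Walk the buffer record by record, in the order the tasks were queued.
    executed = []
    for opcode, arg in struct.iter_unpack(RECORD, buf):
        executed.append((opcode, arg))
    return executed

if __name__ == "__main__":
    data = serialize([(OP_CLEAR, 0), (OP_DRAW, 36)])
    print(len(data), replay(data))
```

Once the tasks live in a self-contained buffer like this, the code that produces them and the code that executes them no longer have to run in the same place, which is what would make threading the backend possible later.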
02-17-2018, 07:15 AM
#7
DrHouse64 Offline
A woman yet a man, a man yet a woman
****
Posts: 343
Threads: 18
Joined: Jun 2013
(02-17-2018, 01:38 AM)degasus Wrote: The idea was to write some kind of queue for rendering tasks. So the GPU emulation shall be done on the same thread as the CPU emulation, but the backend tasks shall be threaded. You can already check its performance by using the null backend on single core, though without rendering.

The point is how to queue such tasks. https://github.com/dolphin-emu/dolphin/pull/6042 is a proposed way for serialization, but without the threading for now. So there is indeed progress on this task.

Oh, so this is what 6042 is about. I see, thanks.
From France with love.
Laptop ROG : W10 / Ryzen 7 4800HS @2.9 GHz (4.2 GHz Turbo disabled unless necessary for better thermals) / 16 Go DDR4 / RTX 2060 MaxQ (6 Go GDDR6)
02-18-2018, 01:06 AM
#8
linkdude64 Offline
Member
***
Posts: 78
Threads: 3
Joined: Dec 2014
Great post!
Specs:
Win 10 x64 LTSB (Look it up!); ASUS ROG X370 Mobo; AMD Ryzen 2600X w/PBO; 2x8GB GSkill FlareX DDR4 3200Mhz; ASUS ROG RX580 8GB; Samsung 960 EVO nVME 