First of all, I thought General Discussion was the correct spot to post this; if not, I apologize and ask a mod to move it to the correct forum.
-------------------------------------------------------------------------------------------------
So I used to be a trainer for IBM, and after explaining that more cores doesn't necessarily mean more performance and speed in all applications, I often got the following question: "WHY?" So I had this awesome game/practical experiment for the trainees to do.
The Set up:
1. I call myself the "Program" or "Application"
2. I call the blackboard/whiteboard "L2 Cache"
3. I call my students "Cores"
4. I call the piece of paper I give them "L1 cache"
5. I call the calculations on the L1 cache "Threads"
During this practical experiment the Cores must stay seated at their desks. The Cores are only allowed to communicate answers to each other through the L2 Cache.
The Program gives each Core an L1 cache with one of the following Threads:
Core1: A+B=C
Core2: X+Y=Z
Core3: C+Z=N
Core4: A+N=Q
The Program updates the L2 Cache with the following information:
A=1
B=2
X=3
Y=4
Then the Program tells the Cores to update the L2 Cache with their name and Thread.
The execution:
The Program asks "Please give me the value of Q"
So Core4 starts calculating but gets stuck: A+N=Q becomes 1+N=Q, so he doesn't have an answer. He sees on the L2 Cache that Core3 should know the value of N, so he asks: "Could you please write the value of N in the L2 Cache so I know it too?" and waits.
Core3 starts calculating... but is stuck almost immediately, because all he knows is N=C+Z. He looks at the L2 Cache, sees that Core1 and Core2 know the answers to C and Z, and asks them: "Please write the values of C and Z in the L2 Cache so I know them too", then waits.
Core1 and Core2 know the values needed for their Threads from the L2 Cache and start doing them:
Core1 does A+B=C=1+2=3, walks to the L2 Cache and updates it with C=3
Core2 does X+Y=Z=3+4=7, walks to the L2 Cache at the same time as Core1 and updates it with Z=7
Now Core3 sees the values of C and Z and starts doing his Thread C+Z=N=3+7=10 and walks to the L2 cache and updates it with N=10
Now Core4 sees the value of N and starts doing his Thread and does A+N=Q=1+10=11 and walks to the L2 cache and updates it with Q=11
Now the Program has the answer, after about 2 minutes. As you can see, two of the Cores were doing nothing most of the time, because they had to wait until the other Cores were done with their Threads.
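The classroom game maps neatly onto real synchronization primitives. Here is a minimal sketch in Python, using a shared dict as the L2 Cache and a Condition variable as the cores watching the board (the names are mine for illustration, not real CPU APIs):

```python
import threading

l2_cache = {"A": 1, "B": 2, "X": 3, "Y": 4}  # the whiteboard
board = threading.Condition()                 # cores wait on the board

def core(needs, result, op):
    """Wait until every needed value is on the L2 Cache, then compute."""
    with board:
        # Like a Core staring at the board until C and Z show up
        board.wait_for(lambda: all(k in l2_cache for k in needs))
        l2_cache[result] = op(*(l2_cache[k] for k in needs))
        board.notify_all()  # tell the waiting cores the board changed

threads = [
    threading.Thread(target=core, args=(["A", "B"], "C", lambda a, b: a + b)),
    threading.Thread(target=core, args=(["X", "Y"], "Z", lambda x, y: x + y)),
    threading.Thread(target=core, args=(["C", "Z"], "N", lambda c, z: c + z)),
    threading.Thread(target=core, args=(["A", "N"], "Q", lambda a, n: a + n)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(l2_cache["Q"])  # 11
```

Core3 and Core4 spend almost their whole lifetime inside `wait_for`, exactly like the students sitting idle at their desks.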
-------------------------------------------------------------------------------------------------
Windows has a component called the "Scheduler". The Scheduler decides which Thread goes to which Core, to make things run as efficiently as possible.
So I clear the L2 Cache, leave only the basic information, and give one Core all the Threads. I tell him to write down every single step he takes in his L1 cache, and when he is done, to update the L2 Cache with the value of Q.
After about 1 minute he walks up to the L2 Cache and updates it with Q=11
As you see here, this removes most of the waiting and produces the answer much quicker.
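The single-core version of the same work, sketched the same way: all four Threads run in dependency order on one core, intermediates stay in local variables (the L1 cache), and only the final answer goes back to the board.

```python
l2_cache = {"A": 1, "B": 2, "X": 3, "Y": 4}  # the whiteboard

# L1 cache: every intermediate step stays local, no trips to the board
c = l2_cache["A"] + l2_cache["B"]    # A+B=C
z = l2_cache["X"] + l2_cache["Y"]    # X+Y=Z
n = c + z                            # C+Z=N
l2_cache["Q"] = l2_cache["A"] + n    # A+N=Q, the only L2 update
print(l2_cache["Q"])  # 11
```

No locks, no waiting, one write to the shared cache instead of four.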
-------------------------------------------------------------------------------------------------
Finally, I clear the L2 Cache again, leave the basic information, and give one Core an L1 cache with a single Thread: Q=(2*A)+B+X+Y. He updates the L2 Cache 15 seconds later with Q=11. This is called "Optimization".
This final step is the hardest to do, because you never know in advance where these optimizations are possible. Luckily some of them can be done by the compiler and take little or no effort from the developer; others are harder and need out-of-the-box thinking from the developers themselves.
-------------------------------------------------------------------------------------------------
As you can see in this practical experiment, the final step is the fastest, but it leaves the other Cores idle while only one Core works hard. This is of course a very simplified explanation and in no way covers all the difficulties and intricacies of multi-core processing.
The way Dolphin handles multi-core CPUs is by giving all the tasks done by the Wii/GC CPU to one thread, and all the tasks done by the Wii/GC GPU to another thread (an extreme simplification). Windows in turn sees that these are very hard-working threads and assigns each to its own Core. It is very hard to split the tasks handled by one of these threads across multiple Cores, because the extra updating of the L2 Cache takes time and can actually make things a lot slower. And even though these threads run on different Cores, they often have to wait for each other, because one needs a piece of information that the other is still working on.
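That mutual waiting can be sketched too. This is a toy model, not Dolphin's real code: a "CPU" thread produces work that a "GPU" thread consumes through a tiny buffer, so each side regularly blocks waiting for the other, just like the Cores at the whiteboard.

```python
import queue
import threading

fifo = queue.Queue(maxsize=1)  # tiny buffer forces frequent waiting
frames_drawn = []

def cpu_thread():
    for frame in range(3):
        fifo.put(("draw", frame))  # blocks whenever the GPU thread is behind
    fifo.put(("stop", None))

def gpu_thread():
    while True:
        cmd, frame = fifo.get()    # blocks until the CPU thread produces work
        if cmd == "stop":
            break
        frames_drawn.append(frame)

t1 = threading.Thread(target=cpu_thread)
t2 = threading.Thread(target=gpu_thread)
t1.start(); t2.start()
t1.join(); t2.join()
print(frames_drawn)  # [0, 1, 2]
```

Two cores are busy, but neither runs at full speed: the handoffs themselves cost time, which is why splitting the work further can make things slower instead of faster.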
I know this was a huge wall of text; it is a lot easier to see and do this in real life than to just read about it, but after doing this practical experiment the trainees usually understood it quite well.