Dolphin, the GameCube and Wii emulator - Forums

Full Version: Does raising internal resolution and SSAA also increase cpu load?
Pages: 1 2 3
(05-13-2012, 11:09 AM)NaturalViolence Wrote: [ -> ]
Quote:Furthermore, which is even more interesting, in the third test, why is there no 36 fps at all resolutions? Why does fps still scale up to 53 and 67 fps when lowering resolution, when the card can deliver 99 fps at higher settings? The only explanation is that performance is affected by the cpu when changing resolution, not the gpu.

That's the only explanation you can think of? REALLY?

No. It's a bandwidth issue. Without a cached copy the efb copies have to be constantly updated before either thread can proceed. This causes both threads to stall while a new copy is created and downloaded/uploaded (not necessarily in that order). The bigger the copies are the longer this takes. This slows down rendering tremendously.

And you are saying that updating efb copies is slower when a higher internal resolution is used?
Well NV, I'm not sure the giant red face is entirely necessary, I'm just asking for help with a question.

Regardless, we seem to be getting somewhere. You've mentioned FIFO, and I understood from your post that a FIFO system is used with lle but not hle. Is this correct? Neobrain's post confused me. It is not immediately obvious to me why this would slow down the graphics rendering, could you please elaborate a bit more on this?

Also, can you confirm that if using a hypothetical all powerful gpu (with unlimited rendering performance), would emulation performance at Ir x4 9xSSAA be the same as with Ir x1?

(crosses fingers that the response doesn't come with an even bigger, even redder face lol)
(05-13-2012, 08:59 PM)bret emerald Wrote: [ -> ]Neobrain's post confused me.

Sorry for the confusion, but I'm not entirely sure either what NV is hinting at (I guess he's trying to point out the correct thing though). However, it's a fact that the GPU FIFO has nothing to do with DSP LLE. Also, a FIFO isn't something that is "used", it's a core part of the GC's/Wii's GPU which is being emulated (independently of any audio configuration, obviously).

EDIT: Oh well, I guess NV's point was that to emulate the GPU FIFO properly, Dolphin has to use very sequential programming: Usually applications are programmed in a way that allows the CPU and the GPU to work in parallel, i.e. the CPU just tells the GPU what stuff to render and can continue doing other stuff while GPU is rendering. However, for proper GFX emulation in Dolphin, the CPU needs to wait for the GPU to finish rendering. Anyway, I guess NV will be able to make his point clearer, it'll probably boil down to this reason.
Quote:Well NV, I'm not sure the giant red face is entirely necessary, I'm just asking for help with a question.

You are. And people are giving you stupid answers and answering different questions than the one you asked.

Trust me the image I was originally planning to use was far more harsh, but I decided it was too harsh and would probably get me a warning if I posted it.

Quote:You've mentioned FIFO, I understood from your post that a FIFO system is used with lle but not hle. Is this correct?

No. HLE runs out of sync with the cpu thread, LLE runs in sync with the cpu thread. FIFO is used to sync the video and cpu threads.

Quote:Oh well, I guess NV's point was that to emulate the GPU FIFO properly, Dolphin has to use very sequential programming: Usually applications are programmed in a way that allows the CPU and the GPU to work in parallel, i.e. the CPU just tells the GPU what stuff to render and can continue doing other stuff while GPU is rendering. However, for proper GFX emulation in Dolphin, the CPU needs to wait for the GPU to finish rendering. Anyway, I guess NV will be able to make his point clearer, it'll probably boil down to this reason.

Ding ding ding ding ding!

Give this man a cookie for the correct answer (do Germans eat cookies or have you replaced them with some more efficient dessert?).

It may not affect LLE directly but it's the reason your performance drops when you raise your IR even when LLE is on. We use the term "bottleneck" very loosely in these forums.

If anyone wants to write a technical post on why, I would be interested. Is the CPU doing something with the finished frame?
From what I understand (I could be totally wrong):
Let's say you're running dolphin with dual core on and hle audio. Two threads are doing most of the work, the cpu thread and the video/audio thread (they're not called that but that doesn't really matter). The cpu thread emulates the GC/Wii cpu.

When the emulated cpu needs to tell the gpu to do something it writes the command to a buffer, called the command buffer or FIFO buffer. It uses first-in-first-out queue logic, which is where the name comes from. If the buffer is full and the cpu thread is trying to push a command onto the buffer it must stall until there is room in the buffer. While stalling it just sits there waiting, doing nothing.

The video/audio thread is responsible for gpu and dsp emulation. It runs in parallel and reads commands from the buffer. Each command is removed from the buffer after the read is complete (which is called a pop or dequeue operation). So in other words the cpu thread fills the buffer while the video/audio thread depletes the buffer. The buffer is how the real hardware keeps the cpu and gpu running asynchronously in parallel without the cpu getting too far ahead. The real hardware also has a command processor to handle managing the buffer but I'm not sure if dolphin emulates this.
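
For anyone who finds this easier to follow in code, here is a minimal sketch of the fill/deplete behaviour described above. It is not Dolphin's code. The thread names, the buffer size and the command values are all made up for illustration, and it only uses Python's standard bounded queue to show a producer that blocks when the buffer is full and a consumer that blocks when it is empty.

```python
import queue
import threading

FIFO_CAPACITY = 4  # hypothetical size, chosen only for this demo


def run_demo(num_commands=20):
    """Push commands from a 'cpu' thread and pop them on a 'video' thread."""
    fifo = queue.Queue(maxsize=FIFO_CAPACITY)  # bounded, first in first out
    executed = []

    def cpu_thread():
        for command in range(num_commands):
            fifo.put(command)  # blocks (stalls) while the buffer is full
        fifo.put(None)         # sentinel: no more commands

    def video_thread():
        while True:
            command = fifo.get()  # blocks (stalls) while the buffer is empty
            if command is None:
                break
            executed.append(command)  # stand-in for "emulate the command"

    threads = [threading.Thread(target=cpu_thread),
               threading.Thread(target=video_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return executed


if __name__ == "__main__":
    print(run_demo())
```

The commands always come out in the order they went in, and the producer can never get more than FIFO_CAPACITY commands ahead of the consumer.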

Sometimes the video thread can emulate the command on its own (for example writing to a register will only require cpu code). Sometimes it needs to hardware accelerate a task by running a shader on the gpu. The video thread cannot proceed until the shader finishes (it stalls) but the cpu thread might be doing work while the gpu is doing work if the fifo buffer isn't full. If the gpu is not able to keep up the buffer will fill up and both threads will be stalled. The amount of time they remain stalled waiting for the gpu/video thread to finish is dependent on how long it takes the gpu to complete its task. Higher IR means longer thread stalls if the buffer fills up and therefore lower performance.

Keep in mind that since video/audio are on the same thread in this example the hle dsp emulator won't be able to do any work while the thread is waiting for a shader to complete.

If the buffer fills up then the gpu performance significantly impacts performance, which we call a gpu bottleneck. If the buffer doesn't fill up then performance is almost entirely dependent on cpu performance.

This all assumes that no efb copy emulation or efb copy to texture emulation is being done. Things get a bit more complicated when you account for efb copy to ram emulation. And even more complicated when you add lle dsp emulation and lle on thread.
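
To put rough numbers on the full-buffer versus not-full cases, here is a toy timing model. Everything in it is an assumption for illustration: each command costs the cpu thread a fixed cpu_cost to prepare and the video thread/gpu a fixed gpu_cost to execute, and the buffer holds capacity commands. Real workloads are nowhere near this uniform.

```python
def total_time(num_commands, cpu_cost, gpu_cost, capacity):
    """Time until the last command finishes, in arbitrary units."""
    push = []   # time each command enters the buffer
    pop = []    # time the video thread removes it from the buffer
    done = 0.0  # time the video thread finished the previous command
    for i in range(num_commands):
        # The cpu thread needs cpu_cost after its previous push.
        ready = (push[-1] if push else 0.0) + cpu_cost
        # If the buffer is full, it stalls until an older command is popped.
        if i >= capacity:
            ready = max(ready, pop[i - capacity])
        push.append(ready)
        # The video thread pops once it is free and the command is there.
        start = max(ready, done)
        pop.append(start)
        done = start + gpu_cost
    return done


if __name__ == "__main__":
    # gpu slower than cpu: buffer fills, doubling gpu cost ~doubles the time
    print(total_time(100, 1.0, 4.0, 2), total_time(100, 1.0, 8.0, 2))
    # cpu slower than gpu: buffer never fills, gpu cost barely matters
    print(total_time(100, 4.0, 1.0, 2), total_time(100, 4.0, 2.0, 2))
```

In this model the total time tracks whichever side is slower, which is the gpu bottleneck versus cpu bottleneck distinction made above.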

EFB copy to texture doesn't really complicate things any further since it's being done entirely on the gpu with the efb copies being stored as textures in vram.

I'll have to continue this tomorrow.

To-do:
-talk about efb copy emulation, various methods, how they affect thread stalling
-talk about the effect of changing IR and how that affects things
-talk about lle dsp emulation and how the thread sync affects things
-give a "big picture" summary

Edit: Add one more day, I didn't have any time to work on this today.
Fwiw, you make it sound like flipper and the DSP read their commands from the same buffer. That is not the case, of course.
......when did I imply that? All I said was that they're emulated by the same thread, which is true.

Edit: Oh I see what you mean. I'm still not sure how I could have phrased that better, since it really is the thread that reads from the buffer.
Thanks to NV and Neobrain for the very informative answers. I think the bottom line is that emulation is a whole different bag of cats to traditional gaming, but sometimes it's hard to break the mindset.
Dolphin is more like a pipeline than true parallelism. It can do several things in parallel but if something is running slower than something else it will stall the other "stages" and slow everything down.

I don't have time to elaborate on how everything works so I'll just try and explain your results.

In the first example:
Quote:****efb copies to texture:

Ir x1___________223fps
Ir x4___________220fps
Ir x2 SSAA x9___99fps

We see that the framerate is going down significantly when IR goes up. The logical conclusion to draw from this data is that the cpu thread is spending lots of time stalled waiting for the video thread because the fifo buffer is full and the video thread is spending lots of time stalled waiting for the gpu to complete its shaders. The result is a "gpu bottleneck". In other words the gpu is not keeping up with the rest of the system and is acting as the weakest link. The amount of time spent stalled is going up as IR goes up because the shaders take longer to complete, more time stalled = lower framerate.
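
Since stalls add time to each frame, it can help to convert the quoted fps figures into time per frame. The helper below is just that arithmetic; the function name is made up, and the only inputs are the numbers from the quoted test.

```python
def frame_time_ms(fps):
    """Average time spent on one frame, in milliseconds."""
    return 1000.0 / fps


if __name__ == "__main__":
    for label, fps in [("Ir x1", 223), ("Ir x4", 220), ("Ir x2 SSAA x9", 99)]:
        print(f"{label}: {frame_time_ms(fps):.2f} ms per frame")
    # Extra time per frame between the first and last rows of the test:
    extra = frame_time_ms(99) - frame_time_ms(223)
    print(f"extra: {extra:.2f} ms per frame")
```

Going from 223 fps to 99 fps works out to roughly 5.6 ms of extra time per frame.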

Quote:****efb copies to ram (cache):

Ir x1___________128fps
Ir x4___________85fps
Ir x2 SSAA x9___55fps

Now efb copy to ram has been turned on. This makes the stalls even longer as the cpu and video/audio threads not only have to wait on the gpu but also have to wait on the memory transfers and texture/ram encoding/decoding. There are now two sources of stalling in the video/audio thread. The gpu is still the bottleneck and so the framerate still goes down as IR goes up. Note that the stall introduced by ram copy emulation is constant and does not scale with IR. If it did we would see a more exponential decay of framerate (think 128 fps -> 32 fps -> 2 fps).

Quote:****efb copies to ram (no cache):

Ir x1___________67fps
Ir x4___________53fps
Ir x2 SSAA x9___36fps

The situation gets even worse when the cache functions are disabled. With the cache option turned off the texture copies are updated every single time they are used instead of only when changes are made to the ram copies. Thus the stalls from efb copy emulation are more frequent and therefore framerate is even lower. This does not change the fact that the gpu is still producing stalls as well and therefore higher IR still decreases framerate.
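
A small sketch of the difference between the two settings, counting how often a texture copy gets refreshed. This is only an illustration of "update on every use" versus "update only when the ram copy changed". The version counter standing in for the ram contents is hypothetical, and how Dolphin actually detects changes is not covered here.

```python
def count_refreshes(ram_versions, use_cache):
    """Count texture-copy updates over a sequence of uses.

    ram_versions lists, for each use of the copy, a made-up version
    number for the ram contents at that moment.
    """
    refreshes = 0
    last_seen = None
    for version in ram_versions:
        # Without the cache every use refreshes the copy; with it,
        # only a change in the ram contents does.
        if not use_cache or version != last_seen:
            refreshes += 1
            last_seen = version
    return refreshes


if __name__ == "__main__":
    uses = [1, 1, 1, 2, 2, 3, 3, 3]  # contents change twice over 8 uses
    print(count_refreshes(uses, use_cache=True))   # 3
    print(count_refreshes(uses, use_cache=False))  # 8
```

Each refresh is a potential stall, so the uncached case pays for it on all eight uses instead of three.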

Quote:****hle audio

Ir x1___________87fps
Ir x4___________82fps
Ir x2 SSAA x9___55fps

****lle audio (not on thread)

Ir x1___________73fps
Ir x4___________70fps
Ir x2 SSAA x9___50fps

Remember how I said video/audio emulation were done by the same thread? Well you have now introduced a THIRD stall. Since the LLE dsp emulator runs in sync with the cpu thread the cpu thread stalls whenever the lle dsp emulator is doing any sort of work. Framerate is now very low as there are three sources of cpu thread stalling. In other words:

1. CPU thread does some work
2. CPU thread waits for video/audio thread to finish shaders (stall, scales with IR)
3. CPU thread waits for video/audio thread to emulate efb copies (stall, depends on how many efb copies the game needs, how big they are, how often the game engine uses them, and what the game engine does with them)
4. CPU thread waits for video/audio thread to finish emulating audio with the lle dsp emulator (stall, depends on how many sounds are running at once and what the game engine is doing with them)
5. When all of this is done the cpu thread can do some more work, so we go back to stage 1

The above sequence is not how dolphin actually works, it's just to illustrate some of your performance bottlenecks.

Note that with dsp lle on thread the cpu thread still stalls when dsp lle is doing work but the video thread can now be doing work while the audio thread is doing work, which it could not before. In other words stages 2 and 4 can run in parallel, potentially improving performance.

Stages 1, 2, and 3 can run in parallel only if the cpu thread is the bottleneck (the fifo buffer isn't filling up).
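
The simplified stage list above can be written as a toy formula. Like the list itself, this is not how dolphin actually works. The per-stage times are made-up numbers in milliseconds, and the model only covers the gpu-bottleneck case where the cpu thread waits on everything in turn, with "dsp lle on thread" letting stages 2 and 4 overlap.

```python
def frame_time(cpu, shaders, efb, dsp, dsp_on_thread):
    """Time for one pass through stages 1-4, in the same units as the inputs."""
    if dsp_on_thread:
        # Stages 2 and 4 overlap, so only the longer of the two is paid for.
        return cpu + max(shaders, dsp) + efb
    # Everything is waited on one after the other.
    return cpu + shaders + efb + dsp


if __name__ == "__main__":
    # Hypothetical costs: 4 ms cpu work, 3 ms shaders, 2 ms efb copies, 2 ms dsp
    print(frame_time(4, 3, 2, 2, dsp_on_thread=False))  # 11
    print(frame_time(4, 3, 2, 2, dsp_on_thread=True))   # 9
```

Raising IR corresponds to a bigger shaders value, which lengthens the frame in both variants as long as shader time exceeds dsp time.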