Dolphin, the GameCube and Wii emulator - Forums

Full Version: Multithreading
(04-24-2012, 01:27 AM)nbmatt Wrote: [ -> ]Never mind. You all desperately need to brush up on your people skills. It's a very long drop from that ivory tower you're standing on.

Emulation is a complicated thing, and I imagine most if not all of the changes made to Dolphin have their reasons. There is still a lot I have to understand about how Dolphin works and what all the settings do. I also know that just about everybody here knows more than I do and that it's worth listening to what they have to say; just about every possible 'optimization' you can think of has been thought of before.

On the flip side of the coin though, I do think some people may be a bit 'heavy-handed' in dealing with the newbie questions, but that is simply the result of having to answer the same questions time and time again. Don't take it personally and try to learn from the answers they give you. If you take the time to really understand what is going on under the bonnet of Dolphin, it will give you a lot more understanding of what can and can't be done and how you can squeeze the very best performance from your system.
He made that post a month ago and hasn't been seen since. I think he left.

romalias

(12-11-2011, 05:04 AM)Xtreme2damax Wrote: [ -> ]Never. Dolphin cannot use more than two cores. Attempting to do so would create instability and decrease speed, contrary to popular belief. You would have found the answer you were looking for had you searched the forum; there are hundreds of threads explaining why it's neither possible nor feasible.

(originally posted in support thread)

I have heard so many reasons why Dolphin will never utilize more than two cores; however, programming APIs have come a long way since Dolphin began. Today it is possible to simply add in a few pragmas and run many of the math loops in parallel, or even on the GPU.
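To make that concrete, here is a minimal sketch of the "add a pragma" idea. This is not Dolphin code, just an invented loop where each iteration only touches its own element, so a single OpenMP pragma is enough to split it across cores (building with -fopenmp or /openmp is assumed):

[CODE Sketch]

#include <vector>

// Invented stand-alone example, not a real Dolphin loop.
void ScaleVertices(std::vector<float>& positions, float scale)
{
    // Each iteration writes a distinct element, so the loop can be
    // divided across threads safely with one pragma.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(positions.size()); ++i)
        positions[i] *= scale;
}

[End Code]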

As for the claim that the performance increase would not be that substantial, I beg to differ. I have built my own version of Dolphin using parallel pragmas where suggested by the Intel compiler, and thrown in several of my own in long or intensive loops, and the result was doubled performance in very heavy-load games such as Epic Mickey. These routines do increase speed, especially in the assembler / disassembler regions.

To be exact, optimizing for multicore brought my frame rate from a dismal 8 FPS in heavy-load regions of Epic Mickey (unplayable) to full speed in many areas. For reference, I am running an AMD quad-core Phenom II 955 BE @ 3.5 GHz. I can't imagine how much this would reduce the overhead on an eight-core. This is simply a result of performing loops and math operations in parallel where possible. I believe this should become a standard option (it could be a separate code path enabled from the Hacks window).

I, however, am neither a coder nor a developer. I'll grant that some of the performance was simply a result of heavily optimized ICC code; however, even the optimized code left Mickey in an unplayable state without the addition of multicore support. In its current state my build does have many synchronization issues, but it is still very playable, and they are mainly audio sync issues which were already present.

You can see the results of the added parallel processing in the YouTube video link posted with this (once it has processed).

I would very much appreciate either a) some assistance with addressing the synchronization issues, as I just decided to tinker with my source at random about a week ago and found myself redoing bits and pieces and swapping out libraries for performance, or b) someone more skilled than I and more familiar with the design of the emulator re-evaluating the assessment of dismal performance gains from using four cores rather than two.

OpenMP pragmas could be used (in the right hands) where I used Intel pragmas to recreate the gains in portable, platform-independent code for Linux/OS X/Windows users to enjoy equally.

Another optimization technique I'd like to try out for AMD cores is using Open64 to compile (Linux only, unfortunately. QQ).
Some of the common compilers already do parallel loop optimisations.
(06-05-2012, 04:21 PM)romalias Wrote: [ -> ]To be exact, optimizing for multicore brought my frame rate from a dismal 8 FPS in heavy-load regions of Epic Mickey (unplayable) to full speed in many areas. For reference, I am running an AMD quad-core Phenom II 955 BE @ 3.5 GHz. I can't imagine how much this would reduce the overhead on an eight-core. This is simply a result of performing loops and math operations in parallel where possible. I believe this should become a standard option (it could be a separate code path enabled from the Hacks window).

Yeah right. Provide a patch and a way to reproduce your results, then maybe someone will listen to what you say.
Video link is bust
no patch

I call bullshit.
This seems too much like pimping of his YouTube profile, which is decked out with illegal content.

romalias

(06-06-2012, 07:54 AM)Squall Leonhart Wrote: [ -> ]Video link is bust
no patch

I call bullshit.
This seems too much like pimping of his YouTube profile, which is decked out with illegal content.

Sorry it took so long to post back, but here is the primary source area I'm attempting to parallelize, along with my code changes. The revisions show promise; I am able to get full speed in a demanding game such as Mickey even on my (inferior?) Phenom II, but there are some major syncing issues that (not being a developer/coder) I'm unable to deal with at present.

Give me another month or two and no life, and I might be able to figure something out to keep things synced a bit better. As it stands now I have all four cores at 100% when playing; the problem with that is that it leaves no room for the code cache / JIT etc. So the frame rate is only stable when things aren't being loaded; at all other times the screen will pause to let the CPU core catch up.

Maybe a dev could help out; I need a way of limiting the amount of processor time the FIFO/GPU threads are allowed to take, in order to leave enough room for other tasks to be performed in real time.

Also, this code is not portable and can only be compiled with ICC builds. I'm sure the devs would know how to make it work using open-source libraries though (OpenMP?). I'm happy to continue working on it myself, but I would appreciate help.

[CODE Excerpt]

FIFO.CPP

while (GpuRunningState)
{
    g_video_backend->PeekMessages();

    // Cilk Plus change: handle the async request on a worker thread while
    // SetCpStatus() keeps running on this thread, then join.
    cilk_spawn VideoFifo_CheckAsyncRequest();
    CommandProcessor::SetCpStatus();
    cilk_sync;

    // check if we are able to run this buffer
    while (GpuRunningState && !CommandProcessor::interruptWaiting && fifo.bFF_GPReadEnable &&
           fifo.CPReadWriteDistance && !AtBreakpoint() && !PixelEngine::WaitingForPEInterrupt())
    {
        if (!GpuRunningState)
            break;

        fifo.isGpuReadingData = true;
        CommandProcessor::isPossibleWaitingSetDrawDone = fifo.bFF_GPLinkEnable ? true : false;

        u32 readPtr = fifo.CPReadPointer;
        u8* uData = Memory::GetPointer(readPtr);

        if (readPtr == fifo.CPEnd)
            readPtr = fifo.CPBase;
        else
            readPtr += 32;

        _assert_msg_(COMMANDPROCESSOR, (s32)fifo.CPReadWriteDistance - 32 >= 0,
            "Negative fifo.CPReadWriteDistance = %i in FIFO Loop !\nThat can produce instability in the game. Please report it.",
            fifo.CPReadWriteDistance - 32);

        ReadDataFromFifo(uData, 32);

        // Cilk Plus change: decode the opcodes just read on a worker thread
        // while the read-pointer bookkeeping below runs on this thread.
        cilk_spawn OpcodeDecoder_Run(g_bSkipCurrentFrame);

        Common::AtomicStore(fifo.CPReadPointer, readPtr);
        Common::AtomicAdd(fifo.CPReadWriteDistance, -32);
        if ((GetVideoBufferEndPtr() - g_pVideoData) == 0)
            Common::AtomicStore(fifo.SafeCPReadPointer, fifo.CPReadPointer);
        CommandProcessor::SetCpStatus();

        // This call is pretty important in DualCore mode and must be called in the FIFO Loop.
        // If we don't, s_swapRequested or s_efbAccessRequested won't be set to false
        // leading the CPU thread to wait in Video_BeginField or Video_AccessEFB thus slowing things down.
        VideoFifo_CheckAsyncRequest();
        CommandProcessor::isPossibleWaitingSetDrawDone = false;
        cilk_sync;
    }
    // (excerpt truncated here; the rest of the outer loop, including the
    // YieldCPU() call mentioned below, is not shown)
[End Code]

Note that I'm still learning to use cilk_spawn to create the parallel regions and Parallel Studio to find optimal placements; I'm not familiar with anything else... So I leave it up to someone else to port a complete product, if ever there is one.
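For anyone wanting to try a portable version, here is a rough sketch of how the cilk_spawn / cilk_sync pair above could be expressed with OpenMP tasks instead of the ICC-only Cilk Plus keywords. RunAsyncRequest() and DecodeOpcodes() are made-up stand-ins for VideoFifo_CheckAsyncRequest() and OpcodeDecoder_Run(), not real Dolphin functions, and building with -fopenmp (or /openmp) is assumed:

[CODE Sketch]

#include <cstdio>

// Made-up stand-ins for VideoFifo_CheckAsyncRequest() and OpcodeDecoder_Run().
void RunAsyncRequest() { std::puts("async request handled"); }
void DecodeOpcodes()   { std::puts("opcodes decoded"); }

int main()
{
    #pragma omp parallel
    {
        #pragma omp single       // one thread creates the work
        {
            #pragma omp task     // runs on a worker thread, like cilk_spawn
            RunAsyncRequest();

            DecodeOpcodes();     // the spawning thread keeps working meanwhile

            #pragma omp taskwait // join point, like cilk_sync
        }
    }
    return 0;
}

[End Code]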
Can't seem to get YouTube to cooperate...

This will have to do, unbelievers.

Image link

http://postimage.org/image/ttl2bg14n/

There is a bit of code just below the excerpt that calls the YieldCPU() function. That is where the GPU thread rests after the CPU thread stops sending it commands to draw the frame. Search for the various places where YieldCPU() is called in the FIFO, CommandProcessor, and PixelEngine source code; these are the points where the GPU thread allows time for the CPU thread to catch up.

There is currently no way to limit the GPU thread. Both the CPU and GPU threads just go for broke executing as much as they have time for.
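To illustrate what such a yield point amounts to, here is a rough stand-alone sketch, not Dolphin's actual code: std::this_thread::yield() stands in for the YieldCPU() call mentioned above, and pendingWork stands in for fifo.CPReadWriteDistance.

[CODE Sketch]

#include <atomic>
#include <thread>

std::atomic<bool> running{true};
std::atomic<int>  pendingWork{0}; // stand-in for fifo.CPReadWriteDistance

void GpuLoopSketch()
{
    while (running.load())
    {
        if (pendingWork.load() == 0)
        {
            // Nothing queued: give the timeslice back instead of spinning at 100%,
            // so the CPU thread (JIT, audio, etc.) gets time to run.
            std::this_thread::yield();
            continue;
        }

        // ... process one 32-byte FIFO chunk here ...
        pendingWork.fetch_sub(1);
    }
}

[End Code]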

Once upon a time, I made a commit labelled "multi-threaded FIFO" which is what you have done above...

Have fun.
But the idea is great. By the way, the vertex loader could be done in parallel without synchronisation issues. I wouldn't use OpenMP or similar, though; OpenCL would have bigger benefits: the vertex information can be streamed undecoded (and smaller) to the GPU and processed there, with the decoded vertices stored directly in a GPU buffer for rendering.
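Purely to illustrate the "no synchronisation issues" part of that claim (the post above argues an OpenCL/GPU path would be the better fit), here is a hypothetical CPU-side sketch. DecodeOneVertex() and the fixed-stride layout are invented for the example and are not taken from Dolphin's VertexLoader:

[CODE Sketch]

#include <cstdint>
#include <thread>
#include <vector>

struct DecodedVertex
{
    float pos[3];
};

// Hypothetical per-vertex decoder; a real one would unpack the configured
// GameCube vertex attributes from the raw bytes.
DecodedVertex DecodeOneVertex(const std::uint8_t* src)
{
    DecodedVertex v = { { float(src[0]), float(src[1]), float(src[2]) } };
    return v;
}

void DecodeAllVertices(const std::uint8_t* raw, std::size_t stride,
                       std::vector<DecodedVertex>& out)
{
    std::size_t numThreads = std::thread::hardware_concurrency();
    if (numThreads == 0)
        numThreads = 1;

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < numThreads; ++t)
    {
        // Each worker writes a disjoint set of output slots, so no locks are needed.
        workers.emplace_back([=, &out] {
            for (std::size_t i = t; i < out.size(); i += numThreads)
                out[i] = DecodeOneVertex(raw + i * stride);
        });
    }
    for (std::thread& w : workers)
        w.join();
}

[End Code]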
Yeah, using transform feedback (aka stream out) would probably greatly improve VertexLoader's performance. However, changing VertexLoader to make use of this feature while keeping backwards compatibility likely needs a great deal of architectural rework.

That said, if anyone has the motivation to implement it, go ahead. ;)