Dolphin, the GameCube and Wii emulator - Forums

Full Version: How to build/optimize for Haswell (AVX2) ?
Pages: 1 2 3
(06-17-2014, 05:59 AM)tecfreak Wrote: [ -> ]You were right. There is absolutely no difference.

I dunno. Think of all the things you and your computer could do with an extra three seconds...
depending on the maximum IPC, clock rate, and number of cores, it could be a lot. Like, my computer can get a whole 42000000000-ish things done in 3 seconds at its fastest, or even up to 368400000000 if you include the GPU.

that's a lot of things, don't you think?
Depends on the task. Where it can be significant for some, it'll be a rounding error for others. In the case of Dolphin, it's not statistically relevant. The increase in benchmarked performance is less than 1%.

Btw, I see what you did there :p A quad-core CPU running @ 3.5 GHz gives you 42000000000 cycles in 3 seconds, though you're not taking into account superscalar operations. It could theoretically have an IPC greater than 1, in which case your estimate is smaller than the "absolute maximum".
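
For reference, the arithmetic behind that guess can be sketched as below. This is only a back-of-the-envelope upper bound, and `peak_operations` and its parameters are made-up names for illustration. The `ipc` argument is the superscalar factor mentioned above, and the default of 1 reproduces the quad-core @ 3.5 GHz figure.

```python
def peak_operations(cores, clock_hz, seconds, ipc=1.0):
    """Upper bound on instructions retired: cores * clock * time * IPC.

    Real workloads land well below this, since it ignores stalls,
    cache misses, and anything that is not perfectly parallel.
    """
    return round(cores * clock_hz * seconds * ipc)


# Quad-core at 3.5 GHz for 3 seconds, one instruction per cycle.
print(peak_operations(cores=4, clock_hz=3.5e9, seconds=3))  # 42000000000

# Same chip with a hypothetical IPC of 2: the ceiling doubles.
print(peak_operations(cores=4, clock_hz=3.5e9, seconds=3, ipc=2))  # 84000000000
```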
I'm using a dual-core processor with a maximum IPC of 2; I forget which instruction can be done twice per clock, though. I was never successful at getting this thing to unlock the extra cores, and I sure as hell wouldn't be able to hold it above stock on this mobo (once unlocked, this one usually goes up to 125 W consumption at stock clocks, which is where it gets extremely dangerous for this mobo).
kinkinkijkin Wrote:forget which instruction can be done twice per clock

It's probably not that the instruction "normally" finishes in half a cycle, but that superscalar execution lets two of them be issued in the same cycle, so the effective throughput is two per clock. I'm not too knowledgeable on x86 or x64 architectures or assembly (more so ARM) but that's what I gather when we're talking about Intel or AMD. If you're interested, have a look at this extensive document about instruction latencies (measured in core cycles): http://www.agner.org/optimize/instruction_tables.pdf
(06-17-2014, 05:59 AM)tecfreak Wrote: [ -> ]
(06-17-2014, 05:11 AM)shuffle2 Wrote: [ -> ]it would be interesting if you posted the same dolphin sources built with normal settings and then with march=haswell. I have a very hard time believing it really "runs smoother".
(but if it's true, I'd like to know why).

I ran the official benchmark to figure out whether disabling HT in the BIOS/EFI settings makes any difference or not.

Here are the results:
@march=native (core-avx2) and O3

HT ON:
12min50sec
http://i.imgur.com/UIDCAxc.jpg

HT OFF:
9min32sec
http://i.imgur.com/k9Ong2p.jpg

That's ~25% less run time!


Bench results with default build settings of same source will follow.

Edit:
You were right. There is absolutely no difference.

@default build settings with HT off
9min35sec
http://i.imgur.com/096BImh.jpg
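
To put the three timings above side by side, here is a small sketch that converts them to seconds and compares them. The helper names (`to_seconds`, `time_reduction`, `speedup`) are made up for this example; only the timings come from the benchmark runs.

```python
def to_seconds(minutes, seconds):
    """Convert a 'XminYsec' benchmark time to seconds."""
    return minutes * 60 + seconds


def time_reduction(slow_s, fast_s):
    """Fraction of run time saved going from the slow run to the fast run."""
    return (slow_s - fast_s) / slow_s


def speedup(slow_s, fast_s):
    """How many times faster the fast run is."""
    return slow_s / fast_s


ht_on_native = to_seconds(12, 50)    # march=native, O3, HT on
ht_off_native = to_seconds(9, 32)    # march=native, O3, HT off
ht_off_default = to_seconds(9, 35)   # default build settings, HT off

# HT off vs. HT on, same build
print(f"{time_reduction(ht_on_native, ht_off_native):.1%}")   # 25.7%
print(f"{speedup(ht_on_native, ht_off_native):.2f}x")         # 1.35x

# march=native vs. default build, both with HT off
print(f"{time_reduction(ht_off_default, ht_off_native):.1%}")  # 0.5%
```

So turning HT off cut run time by about a quarter, while the Haswell-specific build saved about half a percent, which is in line with the "less than 1%" figure mentioned earlier in the thread.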
why do we have such a performance drop with HT on, btw? Is it the same for Linux and OSX systems or an isolated Windows issue?
Maybe Dolphin should only use one thread per core when HT is enabled?
(06-30-2014, 07:51 AM)Oehr Wrote: [ -> ]why do we have such a performance drop with HT on, btw? Is it the same for Linux and OSX systems or an isolated Windows issue?
Maybe Dolphin should only use one thread per core when HT is enabled?
I ran these benches on Linux (Ubuntu 14.04, Linux 3.13 x64).
Oehr Wrote:why do we have such a performance drop with HT on, btw?

Shared resources. For example, both hardware threads on a physical core share that core's L1 data cache, so each one effectively gets only part of it.

Oehr Wrote:Maybe Dolphin should only use one thread per core when HT is enabled?

The OS controls thread scheduling in this case, and most modern OSes already try to put busy threads on separate physical cores. The only way to completely remove any possibility of a performance hit from HT is to turn it off.
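
For anyone who wants to experiment without rebooting into the BIOS, a process on Linux can restrict itself to one logical CPU per physical core. The sketch below is not something Dolphin does, and every function name in it is made up for illustration. It assumes the Linux sysfs topology files (`thread_siblings_list`) and Python's `os.sched_setaffinity`, so it does nothing useful on other systems. It also doesn't stop other processes from landing on the sibling threads, so it is not equivalent to turning HT off.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,4' or '0-1' into sorted CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        elif part:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Keep the lowest-numbered logical CPU from each group of HT siblings."""
    return {parse_cpu_list(text)[0] for text in sibling_lists if text.strip()}


def read_sibling_lists():
    """Read the HT sibling list of every logical CPU (Linux sysfs only)."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    return lists


def pin_current_process():
    """Restrict this process to one logical CPU per physical core."""
    if not hasattr(os, "sched_setaffinity"):
        return set()  # affinity calls are not available on this platform
    wanted = one_cpu_per_core(read_sibling_lists())
    # Never ask for CPUs we are not already allowed to run on.
    target = wanted & os.sched_getaffinity(0)
    if target:
        os.sched_setaffinity(0, target)
    return target
```

On a 4-core/8-thread chip where the siblings are paired as 0,4 / 1,5 / 2,6 / 3,7, `one_cpu_per_core` would pick CPUs 0 to 3, and calling `pin_current_process()` early would keep the process's threads on those.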
thanks naturalviolence and tecfreak!

I thought that Dolphin had additional issues with systems beyond 4 threads/cores. So comparing a 4-core and a 6- or 8-core CPU with identical single-core performance, does Dolphin actually get another boost, or is 4 cores its sweet spot?
Dolphin can use:
1 thread for CPU emulation (and optionally DSP emulation)
1 thread for GPU emulation
optionally 1 thread for DSP emulation

That's basically one thread for each major chip in the Wii/GC. The DSP thread will basically never be doing more work than either of the other two, so it never fills up a third core.

That can leave spare cores if you're on anything other than a dual core, which means that any other programs you're running, plus the OS, can use those cores and avoid hindering Dolphin.

All this together means that Dolphin will run mostly the same on a dual-core and a quad-core chip with the same single-threaded IPS if nothing else is going on, and throwing more cores at it just gives extra room for other stuff to happen at the same time.