Dolphin, the GameCube and Wii emulator - Forums

Full Version: OpenGL performance regression on (ATI/AMD) Radeon GPUs since 4.0-1778
(11-02-2014, 12:41 PM) JMC47 wrote: Yeah, but we don't use coherent mapping any more on AMD; we use Pinned Memory now. Unless the newer drivers are totally backwards, there's no way that could be the issue at all.

How about pushing a new commit that re-enables coherent mapping for ATI/AMD cards only, while leaving the current behavior as it is for NVIDIA, and posting a test version of that (with the latest master-dev changes, of course) for benchmarking?
Like I said, we're using the same exact setup as before that build now. There's literally nothing to do.
(11-02-2014, 12:53 PM) JMC47 wrote: Like I said, we're using the same exact setup as before that build now. There's literally nothing to do.

Is that test rig using an ATI HD5850? What's the closest equivalent to that card? A HD6850 seems like a good match. Same GPU architecture and it's using the exact same driver optimizations. (good for benchmarks / regression testing)
I'm thinking about how to handle this. I kind of want to combine all your stuff into one thread so none of the information gets lost. I can make a build that re-enables buffer storage for all cards, but I can't really do one that enables it for just AMD. As of right now, we're using Pinned Memory on AMD cards, which is even faster than Buffer Storage in my testing.

If you're absolutely certain that this is a slowdown, I'll make a build that re-enables buffer storage and we'll see what happens. The reason we didn't enable coherent mapping for one GPU manufacturer but not the others is that it makes the code very messy, and the slowdown is likely a driver issue that will randomly get fixed in the future. We stumbled into the fact that Pinned Memory was actually faster than Buffer Storage on AMD/ATi cards, which is how we got around it.

If that's no longer the case, then obviously we need to adjust again. We'll need multiple people to reproduce this though, and you'll need to be very exact with your benchmarks for us. If you have lots of ATi cards, that will be very useful to us in handling this.
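
Roughly, the selection described above could be sketched like this. It is only an illustration of the decision, written in Python for brevity rather than taken from the Dolphin source: the names `StreamMode` and `pick_stream_mode` are made up, and the final fallback path is an assumption that nobody in this thread has stated. GL_AMD_pinned_memory is an AMD vendor extension, so checking for it stands in for a vendor check.

```python
from enum import Enum


class StreamMode(Enum):
    """Hypothetical labels for the vertex streaming paths discussed here."""
    PINNED_MEMORY = "GL_AMD_pinned_memory"
    BUFFER_STORAGE = "GL_ARB_buffer_storage"
    MAP_AND_ORPHAN = "fallback"  # assumed fallback, not named in this thread


def pick_stream_mode(extensions):
    """Pick a streaming path from the set of extension strings a driver reports."""
    # Prefer pinned memory whenever the driver offers it (in practice: AMD).
    if "GL_AMD_pinned_memory" in extensions:
        return StreamMode.PINNED_MEMORY
    # Otherwise use buffer storage where it is available.
    if "GL_ARB_buffer_storage" in extensions:
        return StreamMode.BUFFER_STORAGE
    # Assumption: older drivers get some plain map/orphan style path.
    return StreamMode.MAP_AND_ORPHAN


print(pick_stream_mode({"GL_AMD_pinned_memory", "GL_ARB_buffer_storage"}))
print(pick_stream_mode({"GL_ARB_buffer_storage"}))
```

Keying on the extension rather than on a vendor string is one way to avoid the messy per-manufacturer branch mentioned above; whether the real code does it this way is not something this sketch claims.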
I'm not seeing a regression here either. My system:
- AMD Radeon HD7790
- Catalyst 14.9
- Intel Core 2 Duo E6750 clocked at 3.2 GHz
- Win 7 64-bit

Tested for this regression in OpenGL of course. (otherwise mostly using D3D)

Wind Waker:
4.0-1769: 36 fps
4.0-1778: 19 fps
4.0-2010: 36 fps
4.0-3921: 38 fps
4.0-3926: 38 fps

New Super Mario Bros title screen:
4.0-1769: 88 fps
4.0-1778: 86 fps
4.0-2010: 88 fps
4.0-3921: 101 fps
4.0-3926: 101 fps

So again, -1 on this regression from me, and the latest builds are faster than the builds before the bad change (1778).
Even if we can't confirm a regression, it's nice to double check things.
@JMC: Merge all performance-related threads into one and then delete the old ones. SGTM.

P.S. PM is enabled. You can post PMs from now on.
Now I know why you're unable to reproduce this performance regression. It's because it happens mostly in complex, vertex/geometry-heavy scenes.
With light-to-medium geometry complexity, the performance is about the same.

NSMB tested: In W1-1 there's almost no difference in performance between the two builds, while in the W-select "spinning round islands" screen, all builds without coherent mapping are 10+% slower on Radeon GPUs.

The performance hit in the SMG1_observatory_test "benchmark" is even worse...much worse (about 25~30%)

Coherent mapping *does* make AMD GPUs faster.

EDIT: Updated / corrected all previous posts.
Getting the same results as mimimi.

Tested with Super Mario Galaxy in the observatory.
66fps (1769)
48fps (1778)
66fps (2010)
67fps (4053)
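
To put a number on that drop, here is a small sketch of the arithmetic using the 1769 and 1778 figures above. The helper name `percent_drop` is just for illustration.

```python
def percent_drop(before_fps, after_fps):
    """Relative slowdown going from before_fps to after_fps, in percent."""
    return (before_fps - after_fps) / before_fps * 100.0


# Observatory figures quoted above: 4.0-1769 (66 fps) vs 4.0-1778 (48 fps).
drop = percent_drop(66, 48)
print(f"{drop:.1f}% slower")  # prints "27.3% slower"
```

That falls inside the "about 25~30%" range reported earlier for the observatory test.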
After a long round of NSMB benchmark runs (multiple versions x 10 runs with 2 different GPUs), I have some pretty interesting results for you:

NSMB_title OpenGL @ 4xIR, EFB to RAM, borderless fullscreen, framelimit OFF. FPS average of 7 runs:
------------------------------------------------------------
4.0-1778 (non-coherent with GL_ARB_buffer_storage) = 36 fps
4.0-2010 (non-coherent with GL_AMD_pinned_memory) = 38 fps
4.0-1769 (coherent mapping) = 38 fps

NSMB_W1-1 OpenGL @ 4xIR, EFB to RAM, borderless fullscreen, framelimit OFF. FPS average of 7 runs:
------------------------------------------------------------
4.0-1778 (non-coherent with GL_ARB_buffer_storage) = 32 fps
4.0-2010 (non-coherent with GL_AMD_pinned_memory) = 34 fps
4.0-1769 (coherent mapping) = 34 fps

NSMB_W1_map OpenGL @ 4xIR, EFB to RAM, borderless fullscreen, framelimit OFF. FPS average of 7 runs:
------------------------------------------------------------
4.0-1778 (non-coherent with GL_ARB_buffer_storage) = 50 fps
4.0-2010 (non-coherent with GL_AMD_pinned_memory) = 55 fps
4.0-1769 (coherent mapping) = 59 fps

NSMB_W-select_spinning_islands OpenGL @ 4xIR, EFB to RAM, borderless fullscreen, framelimit OFF. FPS average of 7 runs:
------------------------------------------------------------
4.0-1778 (non-coherent with GL_ARB_buffer_storage) = 35 fps
4.0-2010 (non-coherent with GL_AMD_pinned_memory) = 35 fps
4.0-1769 (coherent mapping) = 39 fps


Conclusion:

1. GL_ARB_buffer_storage is still slow on AMD, even with the latest drivers (GL_AMD_pinned_memory is faster)
2. Coherent mapping is as fast as [non-coherent + AMD_pinned_memory] in light-to-medium complexity scenes, but significantly faster in vertex-heavy scenes.
3. Coherent mapping + GL_AMD_pinned_memory = win/win for Radeon GPUs.
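
For anyone who wants to check point 2 against the numbers, here is a quick sketch that compares coherent mapping with the pinned-memory build per scene. The FPS values are copied from the four NSMB listings above; the dictionary keys and the helper `gain_percent` are only illustrative names.

```python
# FPS figures copied from the four NSMB listings above.
results = {
    "NSMB_title": {"buffer_storage": 36, "pinned_memory": 38, "coherent": 38},
    "NSMB_W1-1": {"buffer_storage": 32, "pinned_memory": 34, "coherent": 34},
    "NSMB_W1_map": {"buffer_storage": 50, "pinned_memory": 55, "coherent": 59},
    "NSMB_W-select_spinning_islands": {
        "buffer_storage": 35, "pinned_memory": 35, "coherent": 39,
    },
}


def gain_percent(baseline_fps, candidate_fps):
    """How much faster candidate_fps is than baseline_fps, in percent."""
    return (candidate_fps - baseline_fps) / baseline_fps * 100.0


# Coherent mapping (4.0-1769) relative to pinned memory (4.0-2010), per scene.
gains = {
    scene: gain_percent(fps["pinned_memory"], fps["coherent"])
    for scene, fps in results.items()
}

for scene, gain in gains.items():
    print(f"{scene}: coherent mapping {gain:+.1f}% vs pinned memory")
```

The title screen and W1-1 come out at +0.0%, the W1 map at about +7.3% and the spinning islands screen at about +11.4%.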



And finally, some OpenGL vs. Direct3D results* for comparison (at the same settings):
* NOTE: There's also a performance regression with the Direct3D backend on AMD GPUs. The framerates for D3D should be about 10% higher than those listed below.

NSMB_title
---------------
4.0-1778 OGL = 38 fps
4.0-4049 D3D = 55 fps*


NSMB_W1-1
-----------------
4.0-1778 OGL = 34 fps
4.0-4049 D3D = 47 fps*


NSMB_W1_map
---------------------
4.0-1778 OGL = 59 fps
4.0-4049 D3D = 64 fps*


NSMB_W-select_spinning_islands
-------------------------------------------
4.0-1778 OGL = 39 fps
4.0-4049 D3D = 46 fps*

Conclusion: Direct3D is still much faster than OGL on AMD GPUs, and *way* faster with EFB to RAM.

-------------------------------------------------------------

EDIT: Even more bench results coming up in Part 2