Finally, someone has been able to reproduce the OpenGL performance regression on AMD/ATI GPUs that was reported a year ago, but this time using an NVIDIA GPU.
NOTE #1: In that thread I benchmarked 4.0-1769 (exact same performance as 4.0-1776).
Disabling coherent mapping causes a performance drop on nearly all modern GPUs, regardless of the brand (AMD or NVIDIA) or the use of vendor-specific OpenGL extensions (e.g. GL_AMD_pinned_memory).
Here's a quick NSMB benchmark @ 4xIR, EFB2Tex with the old HD6850 + the latest drivers (Cat. 15.11.1):
NOTE #2: Can't do a 6xIR bench because the old builds don't support anything higher than 4x.
w1-1
====
4.0-1776 OGL = 116 FPS
4.0-1778 OGL = 101 FPS
4.0-2010 OGL = 116 FPS
--------------------------
4.0-1778 D3D = 140 FPS
--------------------------
4.0-8187 OGL = 116 FPS
4.0-8187 D3D = 138 FPS
w2-overview
=========
4.0-1776 OGL = 88 FPS
4.0-1778 OGL = 74 FPS
4.0-2010 OGL = 85 FPS
-------------------------
4.0-1778 D3D = 100 FPS
--------------------------
4.0-8187 OGL = 97 FPS
4.0-8187 D3D = 126 FPS
NOTE #3: 4.0-2010 is the same as 4.0-1778, but with GL_AMD_pinned_memory instead of GL_ARB_buffer_storage.
Pinned memory narrows the gap but doesn't fully close it; coherent mapping is still faster. In vertex-heavy scenes, the difference is much more pronounced (in other titles, there's a HUGE difference between coherent mapping and non-coherent mapping with pinned memory).
Even worse, in some cases Pinned Memory can be slower than Buffer Storage:
Cake Intro
=======
4.0-1776 OGL = 75 FPS
4.0-1778 OGL = 74 FPS
4.0-2010 OGL = 72 FPS
OTOH, coherent mapping doesn't have any of these drawbacks / caveats.
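To put the OpenGL numbers above in relative terms, here's a quick Python sketch that computes the percentage change of each build against 4.0-1776. The FPS values are copied straight from the tables; treating 4.0-1776 as the coherent-mapping baseline is my reading of the notes, and the names (FPS, BASELINE, relative_change, report) are just made up for this example.

```python
# FPS figures copied from the benchmark tables (HD6850, OpenGL backend only).
# Assumption: 4.0-1776 is the baseline with coherent mapping, 4.0-1778 has
# coherent mapping disabled, 4.0-2010 uses pinned memory (see NOTE #3).
FPS = {
    "w1-1":        {"4.0-1776": 116, "4.0-1778": 101, "4.0-2010": 116},
    "w2-overview": {"4.0-1776": 88,  "4.0-1778": 74,  "4.0-2010": 85},
    "Cake Intro":  {"4.0-1776": 75,  "4.0-1778": 74,  "4.0-2010": 72},
}
BASELINE = "4.0-1776"


def relative_change(scene, build):
    """Percentage change in FPS of `build` versus the baseline build."""
    base = FPS[scene][BASELINE]
    return 100.0 * (FPS[scene][build] - base) / base


def report():
    """Return one formatted line per (scene, non-baseline build) pair."""
    lines = []
    for scene, results in FPS.items():
        for build in results:
            if build == BASELINE:
                continue
            change = relative_change(scene, build)
            lines.append(f"{scene:<12} {build}: {change:+.1f}%")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

With these figures, 4.0-1778 comes out roughly 13% slower in w1-1 and 16% slower in w2-overview, while 4.0-2010 matches the baseline in w1-1 but is still about 3% behind in w2-overview and 4% behind in the Cake Intro.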
NOTE #4: The latest dev. build (4.0-8187) is so much faster than the other builds because of a bunch of other optimizations (SSE-optimized vertex loaders, texture cache rewrite, more efficient JIT, etc.)
