
VideoCommon > Code with SSSE3/SSE4.1 intrinsic functions
02-20-2010, 02:35 PM
#1
nodchip Offline
Junior Member
**
Posts: 8
Threads: 1
Joined: Feb 2010
This is my first post for this forum.

I created a patch that speeds up the current code. The changes are:
  1. TextureDecoder.cpp decodebytesARGB8_4()
  2. TextureDecoder.cpp decodebytesC8_To_Raw16()
  3. VertexLoader.cpp VertexLoader::RunVertices()
  4. VertexLoader_Position.cpp Pos_ReadIndex_Float()
  5. VertexLoader_TextCoord.cpp TexCoord_ReadIndex16_Float2()
The code may be slower on old CPUs because it checks whether SSSE3/SSE4.1 is available every time these routines run.
I tested with Visual Studio 2008 on a Core i7. I don't know whether it works on old CPUs or on Linux.
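One way to avoid re-checking on every call is to query CPUID once and cache the result. The sketch below is not the patch's code and not Dolphin's cpu_info; CpuFeatures, DetectCpuFeatures and GetCpuFeatures are hypothetical names. It relies on the CPUID leaf 1 feature bits in ECX (bit 9 for SSSE3, bit 19 for SSE4.1) and is x86-only.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid
#else
#include <cpuid.h>    // __get_cpuid (GCC/Clang, x86 only)
#endif

// Hypothetical container for the two flags discussed in this thread.
struct CpuFeatures {
    bool ssse3;
    bool sse41;
};

// Runs CPUID leaf 1 and extracts the SSSE3 / SSE4.1 bits from ECX.
inline CpuFeatures DetectCpuFeatures() {
    uint32_t ecx = 0;
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);
    ecx = static_cast<uint32_t>(regs[2]);
#else
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid(1, &a, &b, &c, &d)) {
        ecx = c;
    }
#endif
    CpuFeatures f;
    f.ssse3 = ((ecx >> 9) & 1u) != 0;   // CPUID.1:ECX bit 9
    f.sse41 = ((ecx >> 19) & 1u) != 0;  // CPUID.1:ECX bit 19
    return f;
}

// Detection runs only on the first call; later calls return the cached copy.
inline const CpuFeatures& GetCpuFeatures() {
    static const CpuFeatures features = DetectCpuFeatures();
    assert(!features.sse41 || features.ssse3);  // SSE4.1 CPUs also have SSSE3
    return features;
}
```

A routine would then branch on GetCpuFeatures().ssse3, or pick a function pointer once at start-up, instead of running the detection itself.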


Attached Files
.patch   VideoCommon-SSSE3-SSE41-rev5089.patch (Size: 13.55 KB / Downloads: 342)
02-20-2010, 11:40 PM (This post was last modified: 02-20-2010, 11:41 PM by Jack Frost.)
#2
Jack Frost Offline
aka. BhaaL
**********
Developers (Some Administrators and Super Moderators)
Posts: 499
Threads: 3
Joined: Oct 2009
You could have simply used cpu_info.bSSSE3 and cpu_info.bSSE41 instead of copying all this stuff over.

The real question is if it is worth the gain, or whether it would make more sense to optimize other parts calling those particular functions (while retaining maintainability and compatibility).
02-20-2010, 11:51 PM
#3
nodchip Offline
Junior Member
**
Posts: 8
Threads: 1
Joined: Feb 2010
(02-20-2010, 11:40 PM)Jack Frost Wrote: You could have simply used cpu_info.bSSSE3 and cpu_info.bSSE41 instead of copying all this stuff over.

The real question is if it is worth the gain, or whether it would make more sense to optimize other parts calling those particular functions (while retaining maintainability and compatibility).

I checked the performance with a profiler (AMD CodeAnalyst). The speedup from the patch is a few percent. I did not notice any speedup in actual games. If you think the patch is not worth it, please reject it.
02-21-2010, 03:31 AM
#4
Xtreme2damax Offline
New & Improved
********
Global Moderators
Posts: 3,135
Threads: 91
Joined: Mar 2009
A little more time consuming, but would hard-coding SSSE3 or SSE4 functions into the video plugins be viable as well? This could offer a substantial benefit, like with GSdx for PCSX2.

Anything that improves performance is good as far as I'm concerned so long as accurate emulation isn't sacrificed in the process. From what I've observed Dolphin has a large bottleneck with video/graphics emulation, so it could do Dolphin a world of good if someone was willing to optimize the vertex loaders and implement vertex caching.
02-21-2010, 02:22 PM
#5
nodchip Offline
Junior Member
**
Posts: 8
Threads: 1
Joined: Feb 2010
(02-21-2010, 03:31 AM)Xtreme2damax Wrote: A little more time consuming, but would hard-coding SSSE3 or SSE4 functions into the video plugins be viable as well? This could offer a substantial benefit, like with GSdx for PCSX2.

Anything that improves performance is good as far as I'm concerned so long as accurate emulation isn't sacrificed in the process. From what I've observed Dolphin has a large bottleneck with video/graphics emulation, so it could do Dolphin a world of good if someone was willing to optimize the vertex loaders and implement vertex caching.
OK, let me describe what I've done. The profiling results below were measured on the following system.
OS: Windows 7 64-bit, CPU: Core i7M, GPU: GeForce 260M

1. TextureDecoder.cpp decodebytesARGB8_4()
The routine decodes 64 consecutive bytes into a 4x4-pixel texture block. The current code works as follows.
  1. Reads 2 bytes from src (then src+2, src+4, ...) and 2 bytes from src+32 (then src+34, src+36, ...).
  2. Packs them into one 4-byte pixel.
  3. Swaps the byte order with Common::swap32().
  4. Writes 4 bytes to dst.
  5. Repeats 4x4=16 times.
I changed the code to decode the 4x4 pixels in a loop using SIMD instructions.
  1. Reads 16x4=64 bytes from src with four _mm_stream_load_si128() calls.
  2. Interleaves the 1st (then 2nd, 3rd and 4th) 8 bytes with the 5th (then 6th, 7th and 8th) 8 bytes into 16 bytes of data with _mm_unpacklo_epi16()/_mm_unpackhi_epi16().
  3. Swaps the byte order with _mm_shuffle_epi8().
  4. Writes 16 bytes to dst.
  5. Repeats 4 times.
The profiling result of the current code is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x77c91b0 TexDecoder_Decode_real 1 11.87

1 function, 39 instructions, Total: 1468 samples, 11.87% of samples in the module, 0.92% of total session samples
The profiling result after the modification is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x76d91b0 TexDecoder_Decode_real 1 9.21

1 function, 48 instructions, Total: 1171 samples, 9.21% of samples in the module, 0.67% of total session samples
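The SIMD steps for decodebytesARGB8_4() described above can be sketched as follows. This is not the code from the attached patch: DecodeTileScalar and DecodeTileSSSE3 are hypothetical names, the 16 output pixels are written contiguously rather than into texture rows, and unaligned SSE2 loads and stores stand in for _mm_stream_load_si128() so that no 16-byte alignment is assumed. A little-endian x86 host is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define SKETCH_TARGET_SSSE3
#endif

// Scalar reference: pair each 2-byte value at src with the one 32 bytes later,
// then reverse the byte order of the combined 32-bit pixel.
void DecodeTileScalar(uint32_t* dst, const uint8_t* src) {
    for (int i = 0; i < 16; ++i) {
        uint16_t lo, hi;
        std::memcpy(&lo, src + 2 * i, 2);
        std::memcpy(&hi, src + 32 + 2 * i, 2);
        const uint32_t v = (static_cast<uint32_t>(hi) << 16) | lo;
        dst[i] = (v >> 24) | ((v >> 8) & 0x0000FF00u) |
                 ((v << 8) & 0x00FF0000u) | (v << 24);
    }
}

// SIMD version: interleave 16-bit halves, then byte-swap four pixels at once.
SKETCH_TARGET_SSSE3
void DecodeTileSSSE3(uint32_t* dst, const uint8_t* src) {
    assert(dst != nullptr && src != nullptr);
    // Shuffle mask that reverses the bytes inside each 32-bit lane.
    const __m128i kSwap32 =
        _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3);
    for (int half = 0; half < 2; ++half) {
        const __m128i first = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(src + 16 * half));
        const __m128i second = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(src + 32 + 16 * half));
        const __m128i px0 = _mm_shuffle_epi8(_mm_unpacklo_epi16(first, second), kSwap32);
        const __m128i px1 = _mm_shuffle_epi8(_mm_unpackhi_epi16(first, second), kSwap32);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 8 * half), px0);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 8 * half + 4), px1);
    }
}
```

Each loop iteration here produces two groups of four pixels, so two iterations cover the same 16 pixels as the four repetitions in the list above.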

2. TextureDecoder.cpp decodebytesC8_To_Raw16()
The routine decodes 8 consecutive bytes into 16 bytes of 1x8-pixel paletted texture data. The current code works as follows.
  1. Reads one 1-byte index from src (then src+1, src+2, ...).
  2. Reads the 2-byte pixel from the texture lookup table.
  3. Swaps the byte order with Common::swap16().
  4. Writes 2 bytes to dst.
  5. Repeats 8 times.
I changed the code to swap the byte order of all 8 pixels at once using SIMD instructions.
  1. Reads one 1-byte index from src (then src+1, src+2, ...).
  2. Reads the 2-byte pixel from the texture lookup table and stores it in an xmm register.
  3. Repeats 8 times.
  4. Swaps the byte order with _mm_shuffle_epi8().
  5. Writes 16 bytes to dst with _mm_stream_si128().
The profiling result of the current code is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x76d91b0 TexDecoder_Decode_real 1 9.21

1 function, 48 instructions, Total: 1171 samples, 9.21% of samples in the module, 0.67% of total session samples
The profiling result after the modification is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x65e9250 TexDecoder_Decode_real 1 6.87

1 function, 60 instructions, Total: 830 samples, 6.87% of samples in the module, 0.52% of total session samples
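A minimal sketch of the decodebytesC8_To_Raw16() change described above, under stated assumptions: DecodeC8RowScalar and DecodeC8RowSSSE3 are hypothetical names, the eight looked-up pixels are gathered through a small aligned array instead of being inserted into a register one by one, and an unaligned store replaces _mm_stream_si128() so that dst need not be 16-byte aligned.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define SKETCH_TARGET_SSSE3
#endif

// Scalar reference: look up each index and byte-swap each 16-bit pixel.
void DecodeC8RowScalar(uint16_t* dst, const uint8_t* src, const uint16_t* tlut) {
    for (int i = 0; i < 8; ++i) {
        const uint16_t v = tlut[src[i]];
        dst[i] = static_cast<uint16_t>((v >> 8) | (v << 8));
    }
}

// SIMD version: eight scalar lookups, then a single shuffle swaps all pixels.
SKETCH_TARGET_SSSE3
void DecodeC8RowSSSE3(uint16_t* dst, const uint8_t* src, const uint16_t* tlut) {
    assert(dst != nullptr && src != nullptr && tlut != nullptr);
    alignas(16) uint16_t pixels[8];
    for (int i = 0; i < 8; ++i) {
        pixels[i] = tlut[src[i]];  // the table lookup itself stays scalar
    }
    // Shuffle mask that swaps the two bytes inside each 16-bit lane.
    const __m128i kSwap16 =
        _mm_set_epi8(14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1);
    const __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(pixels));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), _mm_shuffle_epi8(v, kSwap16));
}
```

Only the byte swap and the store are vectorized; SSSE3 has no gather instruction, so the palette lookups remain one per pixel.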

3. VertexLoader.cpp VertexLoader::RunVertices()
The bottleneck of the routine is the calculation of the texture coordinate scale factors. The current routine calculates the 8 scale factors one at a time. I changed the code to calculate 4 scale factors at once, in 2 passes. The profiling result of the current code is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x65ebdc0 VertexLoader::RunVertices 1 7.6

1 function, 95 instructions, Total: 918 samples, 7.60% of samples in the module, 0.57% of total session samples
The profiling result after the modification is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x786bdc0 VertexLoader::RunVertices 1 5.42

1 function, 85 instructions, Total: 700 samples, 5.42% of samples in the module, 0.47% of total session samples
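The thread does not show how the patch vectorizes the scale-factor calculation in RunVertices(), so the following is only one possible illustration of "4 at once, 2 passes". It assumes each scale factor is 2^-frac for an integer frac in 0..31 and builds the float directly from its IEEE-754 exponent bits with SSE2. ComputeTexScales and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_TARGET_SSE2 __attribute__((target("sse2")))
#else
#define SKETCH_TARGET_SSE2
#endif

// Writes scale[i] = 2^-frac[i] for 8 texture coordinate slots.
// Assumption: every frac[i] is in the range 0..31.
SKETCH_TARGET_SSE2
void ComputeTexScales(float scale[8], const uint32_t frac[8]) {
    assert(scale != nullptr && frac != nullptr);
    const __m128i kBias = _mm_set1_epi32(127);  // IEEE-754 single exponent bias
    for (int i = 0; i < 8; i += 4) {            // 2 passes, 4 factors per pass
        const __m128i f =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(frac + i));
        // A float equal to 2^e has exponent field (e + 127) and a zero mantissa,
        // so 2^-frac is the bit pattern (127 - frac) << 23.
        const __m128i bits = _mm_slli_epi32(_mm_sub_epi32(kBias, f), 23);
        _mm_storeu_ps(scale + i, _mm_castsi128_ps(bits));
    }
}
```

If the scale factors come from a lookup table instead, the same two-pass structure applies, but the four values per pass would have to be loaded individually first.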

4. VertexLoader_Position.cpp Pos_ReadIndex_Float()
The current code works as follows.
  1. Reads 4 bytes from pData.
  2. Swaps the byte order with Common::swap32().
  3. Writes 4 bytes to VertexManager::s_pCurBufferPointer.
  4. Repeats 2 or 3 times.
I changed the code to swap the byte order of all components at once using SIMD instructions.
  1. Reads 16 bytes from pData.
  2. Swaps the byte order with _mm_shuffle_epi8().
  3. Writes 16 bytes to VertexManager::s_pCurBufferPointer.
Reading 16 bytes is more than needed, since only 8 or 12 bytes are used. It does no harm, though: the whole byte swap is still a single instruction, and the extra 3rd or 4th value is overwritten when the next vertex is written. The profiling result of the current code is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x787d440 Pos_ReadIndex16_Float3 1 12.89

1 function, 20 instructions, Total: 1664 samples, 12.89% of samples in the module, 1.12% of total session samples
The profiling result after the modification is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x664d400 Pos_ReadIndex16_Float3 1 2.2
0x664c8b0 Pos_ReadIndex_Float<1> 1 7.18

2 functions, 16 instructions, Total: 1135 samples, 9.37% of samples in the module, 0.69% of total session samples
Note: Inline expansion is applied to Pos_ReadIndex_Float<1> by the compiler in the first profiling result.
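The over-read and over-write trick described for Pos_ReadIndex_Float() can be sketched like this. WritePositionSwapped is a hypothetical stand-in, not the patched function; it assumes both buffers have at least 16 accessible bytes at the given pointers, which is the caller's responsibility.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define SKETCH_TARGET_SSSE3
#endif

// Byte-swaps four 32-bit values in one shuffle, stores all 16 bytes, but
// advances the output pointer by only 8 or 12 bytes. The surplus bytes are
// overwritten by whatever is written next.
SKETCH_TARGET_SSSE3
uint8_t* WritePositionSwapped(uint8_t* dst, const uint8_t* src, int components) {
    assert(components == 2 || components == 3);
    const __m128i kSwap32 =
        _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3);
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), _mm_shuffle_epi8(v, kSwap32));
    return dst + 4 * components;
}
```

The last vertex in a buffer still reads and writes a full 16 bytes, so the source and destination both need a few bytes of slack at the end.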

5. VertexLoader_TextCoord.cpp TexCoord_ReadIndex16_Float2()
The current code works as follows.
  1. Reads 4 bytes from pData.
  2. Swaps the byte order with Common::swap32().
  3. Writes 4 bytes to VertexManager::s_pCurBufferPointer.
  4. Repeats 2 times.
I changed the code to swap the byte order of both components at once using SIMD instructions.
  1. Reads 8 bytes from pData.
  2. Swaps the byte order with _mm_shuffle_epi8().
  3. Writes 8 bytes to VertexManager::s_pCurBufferPointer.
The profiling result of the current code is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x6649130 TexCoord_ReadIndex16_Float2 1 18.41

1 function, 20 instructions, Total: 2229 samples, 18.41% of samples in the module, 1.36% of total session samples
The profiling result after the modification is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x66f9130 TexCoord_ReadIndex16_Float2 1 16.39

1 function, 16 instructions, Total: 1887 samples, 16.39% of samples in the module, 1.26% of total session samples
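For the two-component texture coordinate case, the 8-byte read and write can use the 64-bit SSE2 load and store, so nothing beyond the 8 bytes is touched. This is a sketch with a hypothetical name (WriteTexCoordSwapped), not the code from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define SKETCH_TARGET_SSSE3
#endif

// Loads exactly 8 bytes, byte-swaps both 32-bit values with one shuffle,
// and stores exactly 8 bytes.
SKETCH_TARGET_SSSE3
uint8_t* WriteTexCoordSwapped(uint8_t* dst, const uint8_t* src) {
    assert(dst != nullptr && src != nullptr);
    const __m128i kSwap32 =
        _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3);
    // _mm_loadl_epi64 reads the low 64 bits and zeroes the upper half.
    const __m128i v = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));
    // _mm_storel_epi64 writes only the low 64 bits.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(dst), _mm_shuffle_epi8(v, kSwap32));
    return dst + 8;
}
```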

By the way, cpu_info does not work correctly in my environment: bSSSE3 and bSSE41 are always false. I am not sure why.
02-22-2010, 01:55 PM
#6
boogerlad Offline
Above and Beyond
*******
Posts: 1,134
Threads: 21
Joined: Apr 2009
interesting... I assume that lower % of samples in the module means less overhead right? Less samples=better/more optimized? I'm not a programmer yet, so I don't quite understand everything.
02-22-2010, 04:11 PM
#7
nodchip Offline
Junior Member
**
Posts: 8
Threads: 1
Joined: Feb 2010
(02-22-2010, 01:55 PM)boogerlad Wrote: interesting... I assume that lower % of samples in the module means less overhead right? Less samples=better/more optimized? I'm not a programmer yet, so I don't quite understand everything.

That is basically right. For example, "11.87% of samples in the module" means that 11.87% of the computation time is spent in the routine and 100.0%-11.87%=88.13% is spent in the other routines. If the result after the modification is "9.21% of samples in the module", the speedup of the routine is 11.87/9.21=1.29 times and the speedup of the module is (11.87+88.13)/(9.21+88.13)=1.03 times.
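The arithmetic above can be written as two small functions. The names are hypothetical. Like the post, the sketch treats the "after" percentage as if it were measured against the original module total; the two profiling runs each report percentages of their own total, so the result is an estimate.

```cpp
#include <cassert>
#include <cmath>

// Speedup of the routine itself: share of time before / share of time after.
double RoutineSpeedup(double beforePct, double afterPct) {
    assert(beforePct > 0.0 && afterPct > 0.0);
    return beforePct / afterPct;
}

// Speedup of the whole module, assuming the other routines keep their
// original cost of (100 - beforePct) percent.
double ModuleSpeedup(double beforePct, double afterPct) {
    const double rest = 100.0 - beforePct;
    assert(std::isfinite(rest) && rest >= 0.0);
    return 100.0 / (afterPct + rest);
}
```

With the figures from the post, RoutineSpeedup(11.87, 9.21) is about 1.29 and ModuleSpeedup(11.87, 9.21) is about 1.03.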
02-23-2010, 01:40 AM
#8
ector Offline
PPSSPP author, Dolphin co-founder
*
Project Owner  Developers (Administrators)
Posts: 189
Threads: 2
Joined: Mar 2009
Nice stuff. If bSSSE3 and bSSE41 don't work, they should be fixed; you could just set them using your code. If you do that, I'll give you commit access so you can submit it. Just PM me your Gmail/Google Code account name and I'll add you.
02-23-2010, 08:16 AM
#9
KHRZ Offline
Above and Beyond
*******
Posts: 1,527
Threads: 61
Joined: Mar 2009
Is the speedup only for processors before SSSE3? In any case, props
Specs: intel i5 3570k @ 3.4GHz;
16GB RAM; Radeon HD 7900;
Win8 64-Bit
02-23-2010, 09:22 AM
#10
boogerlad Offline
Above and Beyond
*******
Posts: 1,134
Threads: 21
Joined: Apr 2009
yes. It's not very noticeable though. 2-3fps faster in mkwii for me.