VideoCommon > Code with SSSE3/SSE4.1 intrinsic functions

nodchip · 02-20-2010, 02:35 PM #1

This is my first post for this forum.

I created a patch that speeds up the current code. The changes are in:

TextureDecoder.cpp decodebytesARGB8_4()
TextureDecoder.cpp decodebytesC8_To_Raw16()
VertexLoader.cpp VertexLoader::RunVertices()
VertexLoader_Position.cpp Pos_ReadIndex_Float()
VertexLoader_TextCoord.cpp TexCoord_ReadIndex16_Float2()

The code may be slower on old CPUs because it checks whether SSSE3/SSE4.1 is available every time these routines are called.
I tested with Visual Studio 2008 on a Core i7. I don't know whether it works on old CPUs or on Linux.
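
The cost of checking on every call can be avoided by detecting the features once and caching the result. A minimal sketch, assuming GCC or Clang on x86, where `__builtin_cpu_supports` is a compiler builtin. `CpuFeatures`, `DetectCpuFeatures` and `GetCpuFeatures` are hypothetical names for illustration, not Dolphin's cpu_info:

```cpp
#include <cassert>

// Hypothetical feature flags; not Dolphin's actual cpu_info structure.
struct CpuFeatures
{
    bool ssse3;
    bool sse41;
};

// Queries the CPU. On compilers or targets without the builtin this
// conservatively reports "not supported".
inline CpuFeatures DetectCpuFeatures()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return CpuFeatures{__builtin_cpu_supports("ssse3") != 0,
                       __builtin_cpu_supports("sse4.1") != 0};
#else
    return CpuFeatures{false, false};
#endif
}

// Function-local static: detection runs on the first call only,
// later calls just return the cached flags.
inline const CpuFeatures& GetCpuFeatures()
{
    static const CpuFeatures features = DetectCpuFeatures();
    return features;
}
```

A hot routine would then test `GetCpuFeatures().ssse3`, or better, pick a function pointer once at startup so the inner loop has no branch at all.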

**Jack Frost** · (This post was last modified: 02-20-2010, 11:41 PM by Jack Frost.)

You could have simply used cpu_info.bSSSE3 and cpu_info.bSSE41 instead of copying all this stuff over.

The real question is if it is worth the gain, or whether it would make more sense to optimize other parts calling those particular functions (while retaining maintainability and compatibility).

nodchip · 02-20-2010, 11:51 PM #3

(02-20-2010, 11:40 PM)Jack Frost Wrote: You could have simply used cpu_info.bSSSE3 and cpu_info.bSSE41 instead of copying all this stuff over.

The real question is if it is worth the gain, or whether it would make more sense to optimize other parts calling those particular functions (while retaining maintainability and compatibility).

I checked the performance with a profiler (AMD CodeAnalyst). The speedup from the patch is a few percent. I did not notice any speedup in actual games. If you think the patch is not worth the gain, please reject it.

**Xtreme2damax** · 02-21-2010, 03:31 AM #4

A little more time-consuming, but would hard-coding SSSE3 or SSE4 functions into the video plugins be viable as well? This could offer a substantial benefit, as with Gsdx for Pcsx2.

Anything that improves performance is good as far as I'm concerned so long as accurate emulation isn't sacrificed in the process. From what I've observed Dolphin has a large bottleneck with video/graphics emulation, so it could do Dolphin a world of good if someone was willing to optimize the vertex loaders and implement vertex caching.

nodchip · 02-21-2010, 02:22 PM #5

(02-21-2010, 03:31 AM)Xtreme2damax Wrote: A little more time-consuming, but would hard-coding SSSE3 or SSE4 functions into the video plugins be viable as well? This could offer a substantial benefit, as with Gsdx for Pcsx2.

Anything that improves performance is good as far as I'm concerned so long as accurate emulation isn't sacrificed in the process. From what I've observed Dolphin has a large bottleneck with video/graphics emulation, so it could do Dolphin a world of good if someone was willing to optimize the vertex loaders and implement vertex caching.

OK, let me describe what I've done. The profiling results below were measured on the following system.
OS: Windows 7 64-bit, CPU: Core i7M, GPU: GeForce 260M

1. TextureDecoder.cpp decodebytesARGB8_4()
The routine decodes 64 consecutive bytes into a 4x4-pixel block of texture data. The current code works as follows.

Reads 4 bytes (2+2 bytes) from src (then src+2, src+4, ...) and src+32 (then src+34, src+36, ...), respectively.
Packs them into a 4-byte pixel.
Swaps the byte order with Common::swap32().
Writes 4 bytes to dst.
Repeats 4x4=16 times.

I changed the code to decode the 4x4 pixels in a loop using SIMD instructions.

Reads 16x4=64 bytes from src with four _mm_stream_load_si128() calls.
Packs the 1st (then 2nd, 3rd and 4th) 8 bytes and the 5th (then 6th, 7th and 8th) 8 bytes into 16 bytes with _mm_unpacklo_epi16()/_mm_unpackhi_epi16().
Swaps the byte order with _mm_shuffle_epi8().
Writes 16 bytes to dst.
Repeats 4 times.
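
The steps above can be sketched roughly as follows. This is not the patch itself: the function names are hypothetical, the exact byte order inside each pair is an assumption, the four output rows are written contiguously for simplicity, and unaligned `_mm_loadu_si128`/`_mm_storeu_si128` are used in place of `_mm_stream_load_si128()` so that no alignment or SSE4.1 is required. The SIMD path is only compiled when SSSE3 is enabled at build time; otherwise it falls back to the scalar version.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSSE3__)
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8
#endif

// Scalar reference. Assumed tile layout: 16 two-byte (A,R) pairs in the first
// 32 bytes, then 16 two-byte (G,B) pairs. Each output pixel is B,G,R,A in memory.
inline void DecodeTileScalar(uint8_t* dst, const uint8_t* src)
{
    for (int i = 0; i < 16; ++i)
    {
        dst[4 * i + 0] = src[32 + 2 * i + 1];  // B
        dst[4 * i + 1] = src[32 + 2 * i];      // G
        dst[4 * i + 2] = src[2 * i + 1];       // R
        dst[4 * i + 3] = src[2 * i];           // A
    }
}

// Same result, one 4-pixel row per store when SSSE3 is available at compile time.
inline void DecodeTile(uint8_t* dst, const uint8_t* src)
{
    assert(dst != nullptr && src != nullptr);
#if defined(__SSSE3__)
    // Shuffle mask that reverses the bytes inside each 32-bit lane.
    const __m128i swap32 =
        _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3);
    const __m128i* in = reinterpret_cast<const __m128i*>(src);
    __m128i* out = reinterpret_cast<__m128i*>(dst);
    const __m128i ar0 = _mm_loadu_si128(in + 0);  // AR pairs, pixels 0-7
    const __m128i ar1 = _mm_loadu_si128(in + 1);  // AR pairs, pixels 8-15
    const __m128i gb0 = _mm_loadu_si128(in + 2);  // GB pairs, pixels 0-7
    const __m128i gb1 = _mm_loadu_si128(in + 3);  // GB pairs, pixels 8-15
    // Interleave 16-bit pairs into A,R,G,B per pixel, then byte-swap each pixel.
    _mm_storeu_si128(out + 0, _mm_shuffle_epi8(_mm_unpacklo_epi16(ar0, gb0), swap32));
    _mm_storeu_si128(out + 1, _mm_shuffle_epi8(_mm_unpackhi_epi16(ar0, gb0), swap32));
    _mm_storeu_si128(out + 2, _mm_shuffle_epi8(_mm_unpacklo_epi16(ar1, gb1), swap32));
    _mm_storeu_si128(out + 3, _mm_shuffle_epi8(_mm_unpackhi_epi16(ar1, gb1), swap32));
#else
    DecodeTileScalar(dst, src);
#endif
}
```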

The profiling result of the current code is below.

Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x77c91b0 TexDecoder_Decode_real 1 11.87

1 function, 39 instructions, Total: 1468 samples, 11.87% of samples in the module, 0.92% of total session samples

The profiling result after the modification is below.

Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x76d91b0 TexDecoder_Decode_real 1 9.21

1 function, 48 instructions, Total: 1171 samples, 9.21% of samples in the module, 0.67% of total session samples

2. TextureDecoder.cpp decodebytesC8_To_Raw16()
The routine decodes 8 consecutive bytes into 16 bytes of 1x8-pixel palette texture data. The current code works as follows.

Reads one 1-byte index from src (then src+1, src+2, ...).
Reads the 2-byte pixel from the texture lookup table.
Swaps the byte order with Common::swap16().
Writes 2 bytes to dst.
Repeats 8 times.

I changed the code to swap the byte order of all 8 pixels at once using SIMD instructions.

Reads one 1-byte index from src (then src+1, src+2, ...).
Reads the 2-byte pixel from the texture lookup table and stores it into an xmm register.
Repeats 8 times.
Swaps the byte order with _mm_shuffle_epi8().
Writes 16 bytes to dst with _mm_stream_si128().
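
A rough sketch of the second variant, under some assumptions: `DecodeC8Row`, `Swap16` and the 256-entry `lut` parameter are hypothetical names, the eight pixels are gathered through a small stack array instead of directly into a register, and `_mm_storeu_si128` replaces `_mm_stream_si128()` so that dst does not have to be 16-byte aligned. Without SSSE3 at compile time it falls back to eight scalar swaps.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSSE3__)
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8
#endif

// Swaps the two bytes of a 16-bit value.
inline uint16_t Swap16(uint16_t v)
{
    return static_cast<uint16_t>((v >> 8) | (v << 8));
}

// Decodes 8 palette indices into 8 byte-swapped 16-bit pixels.
// `lut` is a hypothetical 256-entry palette; dst must hold 8 pixels.
inline void DecodeC8Row(uint16_t* dst, const uint8_t* src, const uint16_t* lut)
{
    assert(dst != nullptr && src != nullptr && lut != nullptr);

    // The table lookups stay scalar: SSSE3 has no gather instruction.
    uint16_t pixels[8];
    for (int i = 0; i < 8; ++i)
        pixels[i] = lut[src[i]];

#if defined(__SSSE3__)
    // Shuffle mask that swaps the two bytes inside each 16-bit lane.
    const __m128i swap16 =
        _mm_set_epi8(14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1);
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pixels));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), _mm_shuffle_epi8(v, swap16));
#else
    for (int i = 0; i < 8; ++i)
        dst[i] = Swap16(pixels[i]);
#endif
}
```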

The profiling result of the current code is below.

Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x76d91b0 TexDecoder_Decode_real 1 9.21

1 function, 48 instructions, Total: 1171 samples, 9.21% of samples in the module, 0.67% of total session samples

The profiling result after the modification is below.

Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x65e9250 TexDecoder_Decode_real 1 6.87

1 function, 60 instructions, Total: 830 samples, 6.87% of samples in the module, 0.52% of total session samples

3. VertexLoader.cpp VertexLoader::RunVertices()
The bottleneck of the routine is the calculation of the texture coordinate scalings. The current routine calculates the 8 scalings one at a time. I changed the code to calculate 4 scalings at once, in 2 passes. The profiling result of the current code is below.

Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x65ebdc0 VertexLoader::RunVertices 1 7.6

1 function, 95 instructions, Total: 918 samples, 7.60% of samples in the module, 0.57% of total session samples

The profiling result after the modification is below.

Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x786bdc0 VertexLoader::RunVertices 1 5.42

1 function, 85 instructions, Total: 700 samples, 5.42% of samples in the module, 0.47% of total session samples
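
To illustrate the "4 scalings at once, in 2 passes" idea from RunVertices() above: the post does not show how a scaling is computed, so the sketch below simply assumes each scale is 2^-frac for a small non-negative integer frac. That formula, the function name `ComputeScales`, and the exponent-bit trick are all assumptions made for illustration, not the patch's code. It needs only SSE2 and falls back to `std::ldexp` otherwise.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 integer intrinsics
#endif

// Hypothetical: scale[i] = 2^-frac[i] for eight texture coordinate sets.
// Valid for 0 <= frac[i] <= 31.
inline void ComputeScales(float scale[8], const int32_t frac[8])
{
    for (int i = 0; i < 8; ++i)
        assert(frac[i] >= 0 && frac[i] <= 31);

#if defined(__SSE2__)
    const __m128i bias = _mm_set1_epi32(127);  // IEEE-754 single exponent bias
    for (int i = 0; i < 8; i += 4)             // two passes, four scales each
    {
        const __m128i f = _mm_loadu_si128(reinterpret_cast<const __m128i*>(frac + i));
        // Bit pattern of 2^-frac: exponent field = 127 - frac, mantissa = 0.
        const __m128i bits = _mm_slli_epi32(_mm_sub_epi32(bias, f), 23);
        _mm_storeu_ps(scale + i, _mm_castsi128_ps(bits));
    }
#else
    for (int i = 0; i < 8; ++i)
        scale[i] = std::ldexp(1.0f, -frac[i]);
#endif
}
```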

4. VertexLoader_Position.cpp Pos_ReadIndex_Float()
The current code works as follows.

Reads 4 bytes from pData.
Swaps the byte order with Common::swap32().
Writes 4 bytes to VertexManager::s_pCurBufferPointer.
Repeats 2 or 3 times.

I changed the code to swap the byte order of all components at once using SIMD instructions.

Reads 16 bytes from pData.
Swaps the byte order with _mm_shuffle_epi8().
Writes 16 bytes to VertexManager::s_pCurBufferPointer.

In this case, 16 bytes is more than we need to read, because only 8 or 12 bytes are used. This does no harm: the whole byte swap is done by one instruction, and the extra 3rd or 4th value is overwritten by the next vertex. The profiling result of the current code is below.

Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x787d440 Pos_ReadIndex16_Float3 1 12.89

1 function, 20 instructions, Total: 1664 samples, 12.89% of samples in the module, 1.12% of total session samples

The profiling result after the modification is below.

Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x664d400 Pos_ReadIndex16_Float3 1 2.2
0x664c8b0 Pos_ReadIndex_Float<1> 1 7.18

2 functions, 16 instructions, Total: 1135 samples, 9.37% of samples in the module, 0.69% of total session samples

Note: Inline expansion is applied to Pos_ReadIndex_Float<1> by the compiler in the first profiling result.
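
The over-read and over-write trick from Pos_ReadIndex_Float() above can be sketched like this. `SwapPositions` is a hypothetical name, and the sketch works on plain byte buffers rather than on Dolphin's vertex buffer pointer. The important assumption is stated in the comment: both buffers need 4 spare bytes after the last vertex, because 16 bytes are touched for every 12-byte vertex.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSSE3__)
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8
#endif

// Byte-swaps `count` vertices of three 32-bit values each (12 bytes per vertex).
// Both buffers must have at least 4 spare bytes after the last vertex, because
// the SIMD path reads and writes 16 bytes per vertex.
inline void SwapPositions(uint8_t* dst, const uint8_t* src, std::size_t count)
{
    assert(count == 0 || (dst != nullptr && src != nullptr));
    for (std::size_t v = 0; v < count; ++v, src += 12, dst += 12)
    {
#if defined(__SSSE3__)
        // Shuffle mask that reverses the bytes inside each 32-bit lane.
        const __m128i swap32 =
            _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3);
        const __m128i in = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
        // Writes 16 bytes; the last 4 are overwritten by the next vertex.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), _mm_shuffle_epi8(in, swap32));
#else
        for (int i = 0; i < 12; i += 4)
        {
            dst[i + 0] = src[i + 3];
            dst[i + 1] = src[i + 2];
            dst[i + 2] = src[i + 1];
            dst[i + 3] = src[i + 0];
        }
#endif
    }
}
```

The pointers advance by 12 bytes even though 16 are written, which is why the extra value never survives except after the very last vertex.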

5. VertexLoader_TextCoord.cpp TexCoord_ReadIndex16_Float2()
The current code works as follows.

Reads 4 bytes from pData.
Swaps the byte order with Common::swap32().
Writes 4 bytes to VertexManager::s_pCurBufferPointer.
Repeats 2 times.

I changed the code to swap the byte order of both components at once using SIMD instructions.

Reads 8 bytes from pData.
Swaps the byte order with _mm_shuffle_epi8().
Writes 8 bytes to VertexManager::s_pCurBufferPointer.
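
For the 8-byte case, a sketch could use the 64-bit load and store intrinsics so that nothing beyond the two floats is read or written. `SwapTexCoord` is a hypothetical name, and whether the patch uses `_mm_loadl_epi64()`/`_mm_storel_epi64()` is an assumption; the post only names `_mm_shuffle_epi8()`.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSSE3__)
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8
#endif

// Byte-swaps two consecutive 32-bit values (8 bytes in, 8 bytes out).
inline void SwapTexCoord(uint8_t* dst, const uint8_t* src)
{
    assert(dst != nullptr && src != nullptr);
#if defined(__SSSE3__)
    // Shuffle mask that reverses the bytes inside each 32-bit lane.
    const __m128i swap32 =
        _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3);
    // Loads exactly 8 bytes into the low half; the high half is zeroed.
    const __m128i in = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));
    // Stores only the low 8 bytes of the shuffled result.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(dst), _mm_shuffle_epi8(in, swap32));
#else
    for (int i = 0; i < 8; i += 4)
    {
        dst[i + 0] = src[i + 3];
        dst[i + 1] = src[i + 2];
        dst[i + 2] = src[i + 1];
        dst[i + 3] = src[i + 0];
    }
#endif
}
```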

The profiling result of the current code is below.

Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x6649130 TexCoord_ReadIndex16_Float2 1 18.41

1 function, 20 instructions, Total: 2229 samples, 18.41% of samples in the module, 1.36% of total session samples

The profiling result after the modification is below.

Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x66f9130 TexCoord_ReadIndex16_Float2 1 16.39

1 function, 16 instructions, Total: 1887 samples, 16.39% of samples in the module, 1.26% of total session samples

By the way, cpu_info does not work well in my environment: bSSSE3 and bSSE41 are always false. I wonder why.

**boogerlad** · 02-22-2010, 01:55 PM #6

interesting... I assume that lower % of samples in the module means less overhead right? Less samples=better/more optimized? I'm not a programmer yet, so I don't quite understand everything.

nodchip · 02-22-2010, 04:11 PM #7

(02-22-2010, 01:55 PM)boogerlad Wrote: interesting... I assume that lower % of samples in the module means less overhead right? Less samples=better/more optimized? I'm not a programmer yet, so I don't quite understand everything.

That is basically right. For example, "11.87% of samples in the module" means that 11.87% of the module's computation time is spent in the routine and 100.0%-11.87%=88.13% is spent in the other routines. If the result after the modification is "9.21% of samples in the module", the speedup of the routine is 11.87/9.21=1.29 times and the speedup of the module is (11.87+88.13)/(9.21+88.13)=1.03 times.
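
The same arithmetic as a small sketch. `RoutineSpeedup` and `ModuleSpeedup` are hypothetical helper names, and the calculation follows the simplification used above: the time spent in the other routines is assumed unchanged, and the "after" percentage is treated as if it were measured on the same scale as the "before" one.

```cpp
#include <cassert>

// Speedup of the routine itself, from its share of module samples
// before and after the change.
inline double RoutineSpeedup(double before_pct, double after_pct)
{
    assert(after_pct > 0.0);
    return before_pct / after_pct;
}

// Speedup of the whole module, assuming the other routines still take
// (100 - before_pct) percent of the original module time.
inline double ModuleSpeedup(double before_pct, double after_pct)
{
    assert(before_pct >= 0.0 && before_pct <= 100.0 && after_pct > 0.0);
    const double rest_pct = 100.0 - before_pct;
    return 100.0 / (after_pct + rest_pct);
}
```

With the figures above, `RoutineSpeedup(11.87, 9.21)` is about 1.29 and `ModuleSpeedup(11.87, 9.21)` is about 1.03.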

***ector*** · 02-23-2010, 01:40 AM #8

Nice stuff. If bSSSE3 and SSE4 don't work, they should be fixed - could just set them using your code. If you do that, I'll give you commit access so you can submit it, just pm me your gmail/google code account name and I'll add you.

**KHRZ** · 02-23-2010, 08:16 AM #9

Is the speedup only for processors before SSSE3? In any case, props

**boogerlad** · 02-23-2010, 09:22 AM #10

yes. It's not very noticeable though. 2-3fps faster in mkwii for me.