(02-21-2010, 03:31 AM)Xtreme2damax Wrote: [ -> ]A little more time consuming, but what about hard coding SSSE3 or SSE4 functions into the Video Plugins be viable as well? This could offer a substantial benefit like with Gsdx for Pcsx2.
Anything that improves performance is good as far as I'm concerned so long as accurate emulation isn't sacrificed in the process. From what I've observed Dolphin has a large bottleneck with video/graphics emulation, so it could do Dolphin a world of good if someone was willing to optimize the vertex loaders and implement vertex caching.
OK, here is a description of what I've done. The profiling results below were measured on the following system.
OS: Windows 7 64-bit, CPU: Core i7M, GPU: GeForce 260M
1. TextureDecoder.cpp decodebytesARGB8_4()
The routine decodes 64 consecutive bytes into a 4x4-pixel block of texture data. The current code works as follows.
- Reads 2 bytes from src (then src+2, src+4, ...) and 2 bytes from src+32 (then src+34, src+36, ...).
- Packs them into one 4-byte pixel.
- Swaps the byte order with Common::swap32().
- Writes 4 bytes to dst.
- Repeats 4x4=16 times.
I changed the code to decode the 4x4 pixels in a loop using SIMD instructions.
- Reads 16x4=64 bytes from src with four _mm_stream_load_si128() calls.
- Packs the 1st (then 2nd, 3rd and 4th) 8 bytes and the 5th (then 6th, 7th and 8th) 8 bytes into 16 bytes of data with _mm_unpacklo_epi16()/_mm_unpackhi_epi16().
- Swaps the byte order with _mm_shuffle_epi8().
- Writes 16 bytes to dst.
- Repeats 4 times.
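For illustration, here is a minimal sketch of both variants, not the actual patch. The function names are hypothetical, and I use unaligned _mm_loadu_si128()/_mm_storeu_si128() instead of the streaming load so that the sketch needs only SSSE3 and no 16-byte alignment. It processes two 16-byte registers per iteration, so the loop runs 2 times rather than 4.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

// GCC/Clang: enable SSSE3 for the annotated function only (no -mssse3 needed).
#if defined(__GNUC__)
#define SKETCH_SSSE3 __attribute__((target("ssse3")))
#else
#define SKETCH_SSSE3
#endif

// Scalar reference (hypothetical name): one pixel per iteration.
// src holds 64 bytes: 16 two-byte pairs at src and 16 more at src + 32.
void DecodeBlockScalar(std::uint32_t* dst, const std::uint8_t* src)
{
    assert(dst != nullptr && src != nullptr);
    for (int i = 0; i < 16; ++i)
    {
        const std::uint8_t* a = src + 2 * i;       // pair from the first half
        const std::uint8_t* b = src + 32 + 2 * i;  // pair from the second half
        // Packing gives the bytes a0 a1 b0 b1; the 32-bit swap reverses them.
        const std::uint8_t px[4] = {b[1], b[0], a[1], a[0]};
        std::memcpy(dst + i, px, sizeof px);
    }
}

// SIMD sketch (hypothetical name): eight pixels per iteration.
SKETCH_SSSE3
void DecodeBlockSSSE3(std::uint32_t* dst, const std::uint8_t* src)
{
    assert(dst != nullptr && src != nullptr);
    // Shuffle mask that reverses the 4 bytes inside each 32-bit lane.
    const __m128i swap32 =
        _mm_setr_epi8(3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12);
    for (int half = 0; half < 2; ++half)
    {
        const __m128i lo = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(src + 16 * half));
        const __m128i hi = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(src + 32 + 16 * half));
        // Interleave 16-bit pairs (lo0, hi0, lo1, hi1, ...): four packed pixels each.
        const __m128i p0 = _mm_shuffle_epi8(_mm_unpacklo_epi16(lo, hi), swap32);
        const __m128i p1 = _mm_shuffle_epi8(_mm_unpackhi_epi16(lo, hi), swap32);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 8 * half), p0);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 8 * half + 4), p1);
    }
}
```

Both functions should produce byte-identical output for the same 64-byte input, which is an easy way to check the shuffle mask.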
The profiling result of the current code is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x77c91b0 TexDecoder_Decode_real 1 11.87
1 function, 39 instructions, Total: 1468 samples, 11.87% of samples in the module, 0.92% of total session samples
The profiling result after the modification is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x76d91b0 TexDecoder_Decode_real 1 9.21
1 function, 48 instructions, Total: 1171 samples, 9.21% of samples in the module, 0.67% of total session samples
2. TextureDecoder.cpp decodebytesC8_To_Raw16()
The routine decodes 8 consecutive bytes into 16 bytes of 1x8-pixel palette texture data. The current code works as follows.
- Reads one 1-byte index from src (then src+1, src+2, ...).
- Reads the 2-byte pixel data from the texture lookup table.
- Swaps the byte order with Common::swap16().
- Writes 2 bytes to dst.
- Repeats 8 times.
I changed the code to swap the byte order of all 8 pixels at once using SIMD instructions.
- Reads one 1-byte index from src (then src+1, src+2, ...).
- Reads the 2-byte pixel data from the texture lookup table and stores it in an XMM register.
- Repeats 8 times.
- Swaps the byte order with _mm_shuffle_epi8().
- Writes 16 bytes to dst with _mm_stream_si128().
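A rough sketch of the two variants follows, again with hypothetical names and not the actual patch. To keep it free of alignment requirements I gather the eight lookups in a small local array and write with _mm_storeu_si128() rather than the non-temporal _mm_stream_si128(), which needs a 16-byte aligned dst.

```cpp
#include <cassert>
#include <cstdint>
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

// GCC/Clang: enable SSSE3 for the annotated function only (no -mssse3 needed).
#if defined(__GNUC__)
#define SKETCH_SSSE3 __attribute__((target("ssse3")))
#else
#define SKETCH_SSSE3
#endif

// Stand-in for a 16-bit byte swap.
inline std::uint16_t Swap16(std::uint16_t v)
{
    return static_cast<std::uint16_t>((v >> 8) | (v << 8));
}

// Scalar reference: look up, swap and write one pixel at a time.
void DecodeC8Scalar(std::uint16_t* dst, const std::uint8_t* src,
                    const std::uint16_t* tlut)
{
    assert(dst != nullptr && src != nullptr && tlut != nullptr);
    for (int i = 0; i < 8; ++i)
        dst[i] = Swap16(tlut[src[i]]);
}

// SIMD sketch: eight lookups first, then one shuffle and one 16-byte store.
SKETCH_SSSE3
void DecodeC8SSSE3(std::uint16_t* dst, const std::uint8_t* src,
                   const std::uint16_t* tlut)
{
    assert(dst != nullptr && src != nullptr && tlut != nullptr);
    std::uint16_t raw[8];
    for (int i = 0; i < 8; ++i)
        raw[i] = tlut[src[i]];  // the table lookups stay scalar

    // Shuffle mask that swaps the 2 bytes inside each 16-bit lane.
    const __m128i swap16 =
        _mm_setr_epi8(1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14);
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(raw));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst),
                     _mm_shuffle_epi8(v, swap16));
}
```

Only the swap and the store are vectorized here; the lookups themselves are still done one index at a time, as in the steps above.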
The profiling result of the current code is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x76d91b0 TexDecoder_Decode_real 1 9.21
1 function, 48 instructions, Total: 1171 samples, 9.21% of samples in the module, 0.67% of total session samples
The profiling result after the modification is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x65e9250 TexDecoder_Decode_real 1 6.87
1 function, 60 instructions, Total: 830 samples, 6.87% of samples in the module, 0.52% of total session samples
3. VertexLoader.cpp VertexLoader::RunVertices()
The bottleneck of this routine is the calculation of the texture coordinate scalings. The current routine calculates the 8 scalings one at a time. I changed the code to calculate 4 scalings at once, in 2 passes. The profiling result of the current code is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x65ebdc0 VertexLoader::RunVertices 1 7.6
1 function, 95 instructions, Total: 918 samples, 7.60% of samples in the module, 0.57% of total session samples
The profiling result after the modification is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x786bdc0 VertexLoader::RunVertices 1 5.42
1 function, 85 instructions, Total: 700 samples, 5.42% of samples in the module, 0.47% of total session samples
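To make the "4 scalings at once, 2 passes" idea from item 3 concrete, here is a small sketch. It rests on an assumption that is not stated above: that each scaling is 1/2^frac for a small integer frac (0 to 31). The function names and the frac array are hypothetical, and the real RunVertices() code may compute the scalings differently.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// GCC/Clang: enable SSE2 for the annotated function only (matters on 32-bit builds).
#if defined(__GNUC__)
#define SKETCH_SSE2 __attribute__((target("sse2")))
#else
#define SKETCH_SSE2
#endif

// Scalar reference: eight scalings, one at a time.
// ASSUMPTION: scale = 1 / 2^frac with frac in [0, 31].
void ScalesScalar(float out[8], const std::uint32_t frac[8])
{
    for (int i = 0; i < 8; ++i)
    {
        assert(frac[i] < 32);
        out[i] = 1.0f / static_cast<float>(1u << frac[i]);
    }
}

// SIMD sketch: four scalings per pass, two passes.
SKETCH_SSE2
void ScalesSSE2(float out[8], const std::uint32_t frac[8])
{
    const __m128i bias = _mm_set1_epi32(127);  // IEEE-754 single exponent bias
    for (int i = 0; i < 8; i += 4)
    {
        const __m128i f =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(frac + i));
        // A float with exponent field (127 - f) and a zero mantissa equals 2^-f.
        const __m128i bits = _mm_slli_epi32(_mm_sub_epi32(bias, f), 23);
        _mm_storeu_ps(out + i, _mm_castsi128_ps(bits));
    }
}
```

Because every value is an exact power of two, the two functions should agree bit for bit under that assumption.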
4. VertexLoader_Position.cpp Pos_ReadIndex_Float()
The current code works as follows.
- Reads 4 bytes from pData.
- Swaps the byte order with Common::swap32().
- Writes 4 bytes to VertexManager::s_pCurBufferPointer.
- Repeats 2 or 3 times.
I changed the code to swap the byte order of all components at once using SIMD instructions.
- Reads 16 bytes from pData.
- Swaps the byte order with _mm_shuffle_epi8().
- Writes 16 bytes to VertexManager::s_pCurBufferPointer.
In this case, reading 16 bytes is more than necessary because only 8 or 12 bytes are needed. It does no harm, though: the whole byte swap is done by one instruction, and the extra 3rd or 4th value is overwritten by the next vertex. The profiling result of the current code is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x787d440 Pos_ReadIndex16_Float3 1 12.89
1 function, 20 instructions, Total: 1664 samples, 12.89% of samples in the module, 1.12% of total session samples
The profiling result after the modification is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x664d400 Pos_ReadIndex16_Float3 1 2.2
0x664c8b0 Pos_ReadIndex_Float<1> 1 7.18
2 functions, 16 instructions, Total: 1135 samples, 9.37% of samples in the module, 0.69% of total session samples
Note: In the first profiling result, the compiler inlined Pos_ReadIndex_Float<1>, so it does not appear as a separate function.
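The 16-byte read and write from item 4 can be sketched like this for the 3-component case. The function name and the plain byte pointers are hypothetical stand-ins for pData and the vertex buffer pointer. The sketch assumes that 16 bytes are readable at src and 16 bytes are writable at dst, even though only 12 are consumed.

```cpp
#include <cassert>
#include <cstdint>
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

// GCC/Clang: enable SSSE3 for the annotated function only (no -mssse3 needed).
#if defined(__GNUC__)
#define SKETCH_SSSE3 __attribute__((target("ssse3")))
#else
#define SKETCH_SSSE3
#endif

// Swaps four big-endian 32-bit values with one shuffle, then advances the
// output by only three of them (hypothetical helper).
// PRECONDITION: src has 16 readable bytes and dst has 16 writable bytes.
SKETCH_SSSE3
std::uint8_t* WritePos3SSSE3(std::uint8_t* dst, const std::uint8_t* src)
{
    assert(dst != nullptr && src != nullptr);
    // Shuffle mask that reverses the 4 bytes inside each 32-bit lane.
    const __m128i swap32 =
        _mm_setr_epi8(3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12);
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst),
                     _mm_shuffle_epi8(v, swap32));
    // The 4th lane holds unrelated data; the next vertex overwrites it.
    return dst + 12;
}
```

The caveat is at the ends of the buffers: the last vertex still writes 4 bytes past its own data, so the destination needs that much slack, and the source must stay readable for 4 bytes past the last position.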
5. VertexLoader_TextCoord.cpp TexCoord_ReadIndex16_Float2()
The current code works as follows.
- Reads 4 bytes from pData.
- Swaps the byte order with Common::swap32().
- Writes 4 bytes to VertexManager::s_pCurBufferPointer.
- Repeats 2 times.
I changed the code to swap the byte order of both components at once using SIMD instructions.
- Reads 8 bytes from pData.
- Swaps the byte order with _mm_shuffle_epi8().
- Writes 8 bytes to VertexManager::s_pCurBufferPointer.
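One possible way to do the 8-byte version is with the 64-bit load and store intrinsics, so that nothing outside the two floats is touched. This is a sketch with a hypothetical function name; the steps above do not say which load and store intrinsics the patch uses, so _mm_loadl_epi64()/_mm_storel_epi64() are my choice here.

```cpp
#include <cassert>
#include <cstdint>
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

// GCC/Clang: enable SSSE3 for the annotated function only (no -mssse3 needed).
#if defined(__GNUC__)
#define SKETCH_SSSE3 __attribute__((target("ssse3")))
#else
#define SKETCH_SSSE3
#endif

// Swaps two big-endian 32-bit values using the low 64 bits of an XMM register
// (hypothetical helper). Reads and writes exactly 8 bytes.
SKETCH_SSSE3
std::uint8_t* WriteTexCoord2SSSE3(std::uint8_t* dst, const std::uint8_t* src)
{
    assert(dst != nullptr && src != nullptr);
    // Shuffle mask that reverses the 4 bytes inside each 32-bit lane.
    const __m128i swap32 =
        _mm_setr_epi8(3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12);
    // Loads 8 bytes into the low half; the high half is zeroed.
    const __m128i v = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));
    // Stores only the low 8 bytes of the shuffled register.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(dst),
                     _mm_shuffle_epi8(v, swap32));
    return dst + 8;
}
```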
The profiling result of the current code is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x6649130 TexCoord_ReadIndex16_Float2 1 18.41
1 function, 20 instructions, Total: 2229 samples, 18.41% of samples in the module, 1.36% of total session samples
The profiling result after the modification is below.
Quote:CS:EIP Symbol + Offset 64-bit Timer samples
0x66f9130 TexCoord_ReadIndex16_Float2 1 16.39
1 function, 16 instructions, Total: 1887 samples, 16.39% of samples in the module, 1.26% of total session samples
By the way, cpu_info does not work well in my environment: bSSSE3 and bSSE41 are always false. I wonder why.
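As an independent cross-check (this is not Dolphin's cpu_info code), the two flags can be read straight from CPUID leaf 1, where ECX bit 9 reports SSSE3 and ECX bit 19 reports SSE4.1. The struct and function names below are hypothetical; the sketch uses __get_cpuid() from <cpuid.h> on GCC/Clang and __cpuid() from <intrin.h> on MSVC.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>  // __cpuid
#else
#include <cpuid.h>   // __get_cpuid (GCC/Clang)
#endif

// Hypothetical result type for this sketch.
struct SimdSupport
{
    bool ssse3;
    bool sse41;
};

// Queries CPUID leaf 1 and decodes the SSSE3 and SSE4.1 feature bits in ECX.
SimdSupport QuerySimdSupport()
{
    std::uint32_t ecx = 0;
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};  // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    ecx = static_cast<std::uint32_t>(regs[2]);
#else
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid(1, &a, &b, &c, &d))  // returns 0 if leaf 1 is unsupported
        ecx = c;
#endif
    return SimdSupport{((ecx >> 9) & 1u) != 0,    // ECX bit 9:  SSSE3
                       ((ecx >> 19) & 1u) != 0};  // ECX bit 19: SSE4.1
}
```

If this reports true on the same machine while bSSSE3 and bSSE41 stay false, the problem is more likely in how cpu_info decodes the bits than in the CPU or the OS.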