Align decompress sequences loop to 32+16 bytes
The alignment padding is placed before the loop, not inside it, so it is executed once per call and should not hurt performance by itself. The only way it can hurt is if the loop's performance was already unstable and we now pin it to the bad case.

This consistently gets us into the good case with gcc-{7,8,9} and clang-9 on an Intel i9-9900K. gcc-5 is 5% worse than its best case, but its performance is stable. We also get consistently good behavior on my MacBook Pro when compiled with both clang and gcc-8. There the loop ends up in the case where 50% of uops come from the DSB and 50% from MITE, but the performance is the same as in the 85% DSB case, so that's fine.
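For illustration only, a minimal sketch of how a "32+16" alignment can be requested with GCC/Clang inline assembly on x86-64: align to 32 bytes, emit a single nop, then align to 16 bytes, which leaves the next instruction at an address that is 16 modulo 32. The macro name ALIGN_LOOP_32_PLUS_16 and the function sum_bytes are hypothetical stand-ins, not names from this change, and the loop is a placeholder for the real sequence decoding loop. The compiler is free to place instructions between the padding and the loop top, so the disassembly should be checked.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro (not from the commit): pad so that the next
 * instruction starts at an address congruent to 16 modulo 32.
 * Only emitted for GCC-compatible compilers targeting x86-64. */
#if defined(__GNUC__) && defined(__x86_64__)
#  define ALIGN_LOOP_32_PLUS_16() \
       __asm__ volatile(".p2align 5\n\tnop\n\t.p2align 4")
#else
#  define ALIGN_LOOP_32_PLUS_16() ((void)0)
#endif

/* Placeholder hot loop standing in for the sequence decoding loop. */
static uint32_t sum_bytes(const uint8_t *src, size_t len)
{
    uint32_t total = 0;
    size_t i;

    /* The padding sits before the loop, so it runs once per call. */
    ALIGN_LOOP_32_PLUS_16();
    for (i = 0; i < len; ++i) {
        total += src[i];
    }
    return total;
}
```

The padding does not change what the function computes. Whether the loop top actually lands on the intended boundary depends on the compiler and optimization level, so it is worth confirming with objdump -d or an equivalent tool.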