  1. Nov 20, 2023
    • Nick Terrell's avatar
      [huf] Improve fast huffman decoding speed in linux kernel · 3ff79dbe
      Nick Terrell authored
      gcc in the linux kernel was not unrolling the inner loops of the Huffman
      decoder, which was destroying decoding performance. The compiler was
      generating crazy code with all sorts of branches, I suspect because of
      Spectre mitigations, but I'm not certain. Once the loops were manually
      unrolled, performance was restored.
      
      Additionally, when gcc couldn't prove that the variable left shift in
      the 4X2 decode loop wasn't greater than 63, it inserted checks to verify
      it. To fix this, mask the shift amount with `entry.nbBits & 0x3F`, which
      allows gcc to eliminate the check (a short sketch of the trick follows
      this entry). This is a no-op, because `entry.nbBits` is guaranteed to be
      less than 64.
      
      Lastly, introduce the `HUF_DISABLE_FAST_DECODE` macro to disable the
      fast C loops for Issue #3762. That way, if there is still a performance
      regression even after this change, users can opt out at compile time.
      v1.5.5-kernel
      3ff79dbe
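      A minimal sketch of the masking trick described above. This is not the
      actual zstd decode loop; `bitContainer`, `nbBits`, and the helper name
      are illustrative stand-ins for the real decoder state:

      /* Hypothetical helper, not the zstd source. */
      static inline unsigned long long
      HUF_consumeBits_sketch(unsigned long long bitContainer, unsigned nbBits)
      {
          /* nbBits is already guaranteed to be < 64, so the mask is a runtime
           * no-op, but it makes the bound visible to the compiler and lets it
           * drop the range-check branch around the variable shift. */
          return bitContainer << (nbBits & 0x3F);
      }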
  2. Nov 17, 2023
  3. Nov 16, 2023
  4. Apr 04, 2023
  5. Apr 03, 2023
  6. Apr 02, 2023
  7. Apr 01, 2023
  8. Mar 31, 2023
  9. Mar 30, 2023
    • Yoni Gilad's avatar
      seekable_format: Add unit test for multiple decompress calls · 649a9c85
      Yoni Gilad authored
      This does the following:
      1. Compress test data into multiple frames
      2. Perform a series of small decompressions and seeks forward, checking
         that compressed data wasn't reread unnecessarily.
      3. Perform some seeks forward and backward to ensure correctness.
      649a9c85
    • Yoni Gilad's avatar
      seekable_format: Prevent rereading frame when seeking forward · 618bf84e
      Yoni Gilad authored
      When decompressing a seekable file, if seeking forward within
      a frame (by issuing multiple ZSTD_seekable_decompress calls
      with a small gap between them), the frame will be unnecessarily
      reread from the beginning. This patch makes it continue using
      the current frame data and simply skip over the unneeded bytes
      (a sketch of this access pattern follows the entry).
      618bf84e
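      A minimal sketch of the access pattern described above, assuming the
      seekable-format entry points (ZSTD_seekable_create/initBuff/decompress/free)
      from contrib/seekable_format/zstd_seekable.h; the buffer size, gap size,
      and `read_with_forward_seeks` are illustrative, and error handling is
      abbreviated:

      #include "zstd.h"            /* ZSTD_isError */
      #include "zstd_seekable.h"

      int read_with_forward_seeks(const void* archive, size_t archiveSize)
      {
          ZSTD_seekable* const zs = ZSTD_seekable_create();
          size_t ret = ZSTD_seekable_initBuff(zs, archive, archiveSize);
          if (ZSTD_isError(ret)) { ZSTD_seekable_free(zs); return -1; }

          char buf[64];
          unsigned long long offset = 0;
          for (int i = 0; i < 4; ++i) {
              /* Small decompression ... */
              ret = ZSTD_seekable_decompress(zs, buf, sizeof(buf), offset);
              if (ZSTD_isError(ret)) break;
              /* ... then seek forward by a small gap within the same frame.
               * Before this patch, each call restarted decompression from the
               * beginning of the frame; now the frame state is reused and the
               * unneeded bytes are simply skipped. */
              offset += sizeof(buf) + 16;
          }

          ZSTD_seekable_free(zs);
          return ZSTD_isError(ret) ? -1 : 0;
      }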
  10. Mar 29, 2023
  11. Mar 28, 2023
    • Yann Collet's avatar
      Merge pull request #3573 from facebook/dependabot/github_actions/github/codeql-action-2.2.8 · 262e553b
      Yann Collet authored
      Bump github/codeql-action from 2.2.6 to 2.2.8
    • daniellerozenblit's avatar
      mmap for windows (#3557) · b2ad17a6
      daniellerozenblit authored
      * mmap for windows
      
      * remove enabling mmap for testing
      
      * rename FIO dictionary initialization methods + un-const dictionary objects in free functions
      
      * remove enabling mmap for testing
      
      * initDict returns void, underlying setDictBuffer methods return the size of the set buffer
      
      * fix comment
    • Han Zhu's avatar
      Remove clang-only branch hints from ZSTD_decodeSequence · b558190a
      Han Zhu authored
      Looking at the __builtin_expect in ZSTD_decodeSequence:
      
      {   size_t offset;
      #if defined(__clang__)
          if (LIKELY(ofBits > 1)) {
      #else
          if (ofBits > 1) {
      #endif
              ZSTD_STATIC_ASSERT(ZSTD_lo_isLongOffset == 1);
      
      From profile-annotated assembly, the probability of ofBits > 1 is about 75%
      (101k counts out of 135k counts). This is much smaller than the recommended
      likelihood for using __builtin_expect, which is 99%. As a result, clang moved
      the else block further away, which hurts cache locality. Removing this
      __builtin_expect along with two others in ZSTD_decodeSequence gave better
      performance when PGO is enabled. I suggest removing these branch hints and
      relying on PGO, which leverages runtime profiles from actual workloads to
      calculate branch probabilities instead (a small sketch of the hint pattern
      follows this entry).
      b558190a
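      A minimal sketch of the branch-hint pattern discussed above, assuming
      LIKELY is a thin wrapper over __builtin_expect (zstd defines a similar
      macro in lib/common/compiler.h); the decode helpers here are purely
      illustrative stand-ins:

      #include <stddef.h>

      #if defined(__GNUC__) || defined(__clang__)
      #  define LIKELY(x) (__builtin_expect(!!(x), 1))
      #else
      #  define LIKELY(x) (x)
      #endif

      /* Stand-ins for the real long/short offset decode paths. */
      static size_t decode_long_offset(void)  { return 1; }
      static size_t decode_short_offset(void) { return 0; }

      /* With the hint: the compiler lays out the else block far away, which
       * hurts when the branch is only ~75% taken. */
      size_t decode_offset_hinted(unsigned ofBits)
      {
          if (LIKELY(ofBits > 1)) return decode_long_offset();
          return decode_short_offset();
      }

      /* Without the hint: block layout is left to PGO's measured
       * branch probabilities. */
      size_t decode_offset_unhinted(unsigned ofBits)
      {
          if (ofBits > 1) return decode_long_offset();
          return decode_short_offset();
      }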
    • Han Zhu's avatar
      Inline BIT_reloadDStream · e6dccbf4
      Han Zhu authored
      Inlining `BIT_reloadDStream` provided a >3% decompression speed improvement for
      the clang PGO-optimized zstd binary, measured using the Silesia corpus at
      compression level 1. The win comes from improved register allocation, which leads
      to fewer spills and reloads. Take a look at this comparison of
      profile-annotated hot assembly before and after this change:
      https://www.diffchecker.com/UjDGIyLz/. The diff is a bit messy, but notice three
      fewer moves after inlining.
      
      In general, LLVM's register allocator works better when it can see more code. For
      example, when the register allocator sees a call instruction, it partitions the
      registers into caller-saved and callee-saved registers, and it is not free to do
      whatever it wants with all the registers for the current function. Inlining the
      callee lets the register allocator access all registers and use them more
      flexibly (a small sketch of forcing inlining follows this entry).
      e6dccbf4
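      A minimal sketch of forcing a hot helper to be inlined in C, assuming an
      always_inline attribute is available (zstd has its own FORCE_INLINE
      macros in lib/common/compiler.h; the names and elided refill logic below
      are illustrative only):

      #if defined(__GNUC__) || defined(__clang__)
      #  define ALWAYS_INLINE static inline __attribute__((always_inline))
      #else
      #  define ALWAYS_INLINE static inline
      #endif

      /* Hot bit-reload helper: once inlined, the register allocator sees the
       * whole decode loop and can keep more state in registers across what
       * used to be a call boundary, reducing spills and reloads. */
      ALWAYS_INLINE unsigned reload_stream(unsigned long long* bitContainer,
                                           unsigned* bitsConsumed)
      {
          /* ...refill logic elided... */
          (void)bitContainer;
          *bitsConsumed = 0;
          return 0;
      }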
    • Elliot Gorokhovsky's avatar
      Merge pull request #3551 from embg/seq_prod_fuzz · 57e1b459
      Elliot Gorokhovsky authored
      Provide an interface for fuzzing sequence producer plugins
    • Yann Collet's avatar
      Merge pull request #3568 from facebook/readme_cmake_fat · abb3585c
      Yann Collet authored
      Add instructions for building Universal2 on macOS via CMake
  12. Mar 27, 2023