High-Performance PNG Decoding
2D Vector Graphics Engine
In general, everybody who has implemented a DEFLATE decoder has hit a ceiling: at some point it can't be improved further, because the algorithm is inherently scalar and it's almost impossible to use any kind of SIMD to process the bit-stream. However, one could definitely replace the memory loads with permutations, which are done completely within a ZMM register, so theoretically the latency could be improved a lot, provided that decoding doesn't often exceed the limit of this lookup (8 or 9 bits). Memory safety doesn't play any role in terms of performance in this case. Because this is a completely scalar algorithm, lower latency translates directly into more throughput.
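To make the "8 or 9 bits" limit concrete, here is a minimal scalar sketch of table-driven Huffman decoding with a bounded fast table and a bit-by-bit fallback for longer codes. It is an illustration under stated assumptions, not any particular decoder's implementation: the names `FAST_BITS`, `build_tables`, `decode`, and `encode` are hypothetical, the table lives in ordinary memory rather than in a ZMM register, and only the canonical code assignment follows RFC 1951.

```python
FAST_BITS = 9   # assumed width of the fast lookup table
MAX_BITS = 15   # longest Huffman code DEFLATE allows


def canonical_codes(lengths):
    """Assign canonical Huffman codes from code lengths (RFC 1951, 3.2.2)."""
    max_len = max(lengths)
    bl_count = [0] * (max_len + 1)
    for n in lengths:
        if n:
            bl_count[n] += 1
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    codes = {}
    for sym, n in enumerate(lengths):
        if n:
            codes[sym] = (next_code[n], n)
            next_code[n] += 1
    return codes


def reverse_bits(value, width):
    """Reverse the low `width` bits (codes are stored MSB-first in the stream)."""
    out = 0
    for _ in range(width):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out


def build_tables(lengths, fast_bits=FAST_BITS):
    """Return (fast, slow): a direct-indexed table plus a fallback dict."""
    fast = [None] * (1 << fast_bits)
    slow = {}
    for sym, (code, n) in canonical_codes(lengths).items():
        if n <= fast_bits:
            # Fill every index whose low n bits match the reversed code.
            for idx in range(reverse_bits(code, n), 1 << fast_bits, 1 << n):
                fast[idx] = (sym, n)
        else:
            slow[(code, n)] = sym
    return fast, slow


def decode(bits, count, fast, slow, fast_bits=FAST_BITS):
    """Decode `count` symbols from `bits`, an int holding the stream LSB-first."""
    out, pos, mask = [], 0, (1 << fast_bits) - 1
    for _ in range(count):
        entry = fast[(bits >> pos) & mask]
        if entry is not None:          # fast path: a single table lookup
            sym, n = entry
            pos += n
        else:                          # slow path: code longer than fast_bits
            code, n = 0, 0
            while (code, n) not in slow:
                if n >= MAX_BITS:
                    raise ValueError("invalid Huffman code")
                code = (code << 1) | ((bits >> pos) & 1)
                pos += 1
                n += 1
            sym = slow[(code, n)]
        out.append(sym)
    return out


def encode(symbols, lengths):
    """Helper for experiments: pack symbols into an LSB-first bit integer."""
    codes = canonical_codes(lengths)
    bits, pos = 0, 0
    for s in symbols:
        code, n = codes[s]
        bits |= reverse_bits(code, n) << pos
        pos += n
    return bits
```

With code lengths `[3, 3, 3, 3, 3, 2, 4, 4]` (the example alphabet from RFC 1951) and `fast_bits=3`, the two 4-bit symbols miss the fast table and take the fallback loop. This is the case that the comment above says must stay rare for the lookup to pay off.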