
Removing newlines in FASTA file increases ZSTD compression ratio by 10x


Zstandard's long range mode works wonders for genome sequences without newlines

First released with Zstandard 1.3.2 in 2017, the --long range match finder increases the compressor’s search window to at least 128MiB, improving deduplication inside large files. This optional feature had substantial performance overheads at launch, but various optimisations have since brought its performance within striking distance of Zstandard’s fast defaults.

September 12, 2025

I speculated that the poor performance of zstd --long on FASTA input might be caused by the newline bytes (0x0A) that punctuate every 60 characters of sequence, breaking the hashes used for long range pattern matching. Indeed, removing within-record newlines with seqtk seq -l 0 tripled the compression ratio (CR) of zstd --long to 11, yielding a 232GiB file while increasing compression time by only ~20% over Zstandard defaults.
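To make the newline-removal step concrete, here is a minimal Python sketch of what unwrapping a FASTA file means: each record keeps its header line, and its sequence lines are joined into a single line. This is only an illustration of the idea, not how seqtk is implemented; the function name unwrap_fasta and the sample records are hypothetical.

```python
import io


def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line.

    Hypothetical helper for illustration; it buffers one record's
    sequence in memory at a time.
    """
    seq_parts = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if seq_parts:
                yield "".join(seq_parts)
                seq_parts = []
            yield line
        elif line:
            # Sequence line: collect it without its newline.
            seq_parts.append(line)
    # Flush the final record.
    if seq_parts:
        yield "".join(seq_parts)


# Made-up example input with sequences wrapped at 4 characters.
wrapped = io.StringIO(">seq1\nACGT\nACGT\n>seq2\nGGCC\nTT\n")
for out_line in unwrap_fasta(wrapped):
    print(out_line)
# prints:
# >seq1
# ACGTACGT
# >seq2
# GGCCTT
```

The unwrapped output can then be written to a file or piped into zstd --long. For very large records a streaming tool such as seqtk is likely the better choice, since this sketch holds a whole record's sequence in memory before emitting it.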
