Removing newlines in FASTA file increases ZSTD compression ratio by 10x
Zstandard's long range mode works wonders for genome sequences without newlines
First released with Zstandard 1.3.2 in 2017, the --long range match finder increases the compressor’s search window to at least 128MiB, improving deduplication inside large files. This optional feature had substantial performance overheads at launch, but various optimisations have since brought its performance within shooting distance of Zstandard’s fast defaults.
September 12, 2025
I speculated that this poor performance might be caused by the newline bytes (0x0A) punctuating every 60 characters of sequence, breaking the hashes used for long range pattern matching. Indeed, removing within-record newlines using seqtk seq -l 0 tripled zstd --long’s compression ratio (CR) to 11, yielding a 232GiB file while increasing compression time by only ~20% over Zstandard defaults.
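To make the transformation concrete, here is a minimal Python sketch of what removing within-record newlines does to a FASTA file. It is an illustration only, not seqtk's implementation: the function name unwrap_fasta and the demo records are hypothetical, and it assumes plain FASTA input where header lines start with ">".

```python
from typing import Iterable, Iterator


def unwrap_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with each record's sequence joined onto one line.

    Hypothetical helper for illustration; it mimics the effect of
    `seqtk seq -l 0` on plain FASTA, not seqtk's actual code.
    """
    chunks = []  # sequence fragments belonging to the current record
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if it had sequence.
            if chunks:
                yield "".join(chunks)
                chunks = []
            yield line
        elif line:
            # Sequence line: keep the bases, drop the line break.
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


# Made-up records, wrapped at 8 characters instead of the usual 60.
wrapped = [">chr_demo\n", "ACGTACGT\n", "TTGA\n", ">chr_demo2\n", "GGCC\n"]
unwrapped = list(unwrap_fasta(wrapped))
print("\n".join(unwrapped))
```

After this step each record is one header line followed by one uninterrupted sequence line, so repeated stretches of sequence appear as identical byte runs no matter where the original line breaks fell. For real genomes, seqtk seq -l 0 is the tool used in the text; the function above is only meant to show the transformation on small inputs.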