CDC: Why Decompression Is Worth the Complexity

When I started building ncps, a Nix cache protocol server, the fundamental question was: should we chunk compressed NARs or decompress first?

Decompressing before chunking adds CPU cost and complexity. Compressing individual chunks adds storage overhead. It seemed wasteful. The naive approach — chunk the compressed NAR directly — is simpler and avoids the extra round-trip through decompression and recompression.

But I had data. Over a year of real-world use, I'd accumulated 60,910 NAR files across multiple nixpkgs revisions and platforms. Before committing to the complex path, I wanted empirical evidence.

So I tested both approaches on my actual store.

The Question: Does Compression Destroy CDC Benefits?

The premise is simple: content-defined chunking relies on finding patterns in byte sequences. Compression scrambles those patterns. If you chunk compressed data, how much of the deduplication benefit do you lose?

I decided to answer this empirically rather than guess. Using my ncps store as the dataset, I processed the accumulated NARs through two different chunking strategies and compared the results.

The Test: Compressed vs. Uncompressed Chunking

Strategy 1: Chunk the Compressed NAR Directly

  • Take each NAR as-is (typically zstd or xz compressed)
  • Run FastCDC on the compressed bytes
  • Store resulting chunks

This is simpler. No decompression. No recompression. Just chunk what you receive.

Strategy 2: Decompress → Chunk Raw Bytes → Recompress Chunks

  • Decompress the NAR
  • Run FastCDC on the raw, uncompressed bytes
  • Compress individual chunks with zstd before storage

This adds overhead: decompression on every upload, recompression for every new chunk. But in theory, operating on uncompressed data should reveal redundancy that compression obscures.

For the experimental test, FastCDC was configured with:

  • Minimum chunk size: 16 KB
  • Average chunk size: 64 KB (normalized)
  • Maximum chunk size: 256 KB

(Note: To handle the massive throughput required for decompressing and chunking gigabytes of data on the fly, I built and open-sourced kalbasit/fastcdc, which is currently the fastest FastCDC implementation in Go. Performance matters when you choose the CPU-heavy path.)

(Note: The production ncps instance uses a larger configuration — 64 KB min / 256 KB avg / 1 MB max — to reduce metadata overhead at scale. The test results below reflect the smaller configuration to maximize dedup hit rate.)
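To make the setup concrete, here is a minimal Go sketch of the chunking loop with the experimental parameters. It is illustrative only: it uses a jotfs/fastcdc-go-style API and the klauspost zstd bindings rather than ncps's actual code (ncps uses kalbasit/fastcdc, whose interface may differ), and example.nar.zst is a hypothetical input file.

```go
package main

import (
	"fmt"
	"io"
	"os"

	fastcdc "github.com/jotfs/fastcdc-go" // stand-in API; ncps uses kalbasit/fastcdc
	"github.com/klauspost/compress/zstd"
)

// chunkStream runs FastCDC over r with the experimental parameters and
// returns the resulting chunk sizes. Strategy 1 would pass the compressed
// NAR directly; Strategy 2 passes a decompressing reader instead.
func chunkStream(r io.Reader) ([]int, error) {
	chunker, err := fastcdc.NewChunker(r, fastcdc.Options{
		MinSize:     16 << 10,  // 16 KB minimum
		AverageSize: 64 << 10,  // 64 KB average (normalized)
		MaxSize:     256 << 10, // 256 KB maximum
	})
	if err != nil {
		return nil, err
	}
	var sizes []int
	for {
		chunk, err := chunker.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		sizes = append(sizes, chunk.Length)
	}
	return sizes, nil
}

func main() {
	f, err := os.Open("example.nar.zst") // hypothetical zstd-compressed NAR
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Strategy 2: decompress first, then chunk the raw bytes.
	dec, err := zstd.NewReader(f)
	if err != nil {
		panic(err)
	}
	defer dec.Close()

	sizes, err := chunkStream(dec)
	if err != nil {
		panic(err)
	}
	fmt.Printf("chunked into %d pieces\n", len(sizes))
}
```

Strategy 1 is the same loop with the file handle passed to chunkStream directly, skipping the zstd reader.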

The Results: A Decisive Win for Decompression

I ran the test on a carefully selected dataset of 57,988 NAR files to isolate the chunking behavior:

| Metric | Compressed Chunking | Uncompressed Chunking |
|---|---|---|
| Dedup Hit Rate | 6.4% | 47.8% |
| Total Chunks Created | 2,120,515 | 6,929,698 |
| Unique Chunks Stored | 1,982,166 | 3,611,860 |
| Duplicate Chunks | 138,349 | 3,317,838 |
| Raw Chunk Storage | 140 GB | 301 GB |
| With Per-Chunk Compression | n/a | 115 GB |

That first row is the killer: 6.4% dedup hit rate from compressed chunking vs. 47.8% from uncompressed.

Compressed chunking barely works. Compression removes redundancy by design, driving the output toward high Shannon entropy, so identical content in two NARs no longer produces identical bytes and the chunks FastCDC cuts from the compressed streams almost never match. The duplicates are still there; they're just invisible when you're looking at scrambled data.

Uncompressed chunking finds that nearly 48% of chunks are duplicates. Even accounting for the storage overhead of per-chunk recompression (roughly 3%), the final store is 115 GB instead of the 140 GB that compressed chunking produces. That's an 18% reduction compared to the flawed approach, and it shows that the storage and deduplication gains more than justify the added CPU cost of decompressing and recompressing chunks.

What This Means for ncps

This empirical finding made the decision clear: ncps chunks uncompressed NARs. The pipeline (sketched in Go right after the list) is:

  1. Receive compressed NAR (zstd or xz)
  2. Decompress to raw bytes
  3. Stream through FastCDC
  4. For each chunk:
    • Compute BLAKE3 hash (content-address key)
    • Check if chunk already exists (dedup lookup)
    • If it exists: add the chunk hash to the file's recipe without storing duplicate data
    • If new: compress with zstd, store in object store
  5. Record the recipe (ordered list of chunk hashes) in metadata
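As a rough illustration of steps 2 through 5, here is a hedged Go sketch of the ingest loop. The chunkStore interface is hypothetical (ncps's real storage and metadata layers look different), the chunker API again follows jotfs/fastcdc-go rather than kalbasit/fastcdc, and the BLAKE3 and zstd packages are ordinary community libraries chosen for illustration.

```go
package ingest

import (
	"io"

	fastcdc "github.com/jotfs/fastcdc-go" // stand-in API; ncps uses kalbasit/fastcdc
	"github.com/klauspost/compress/zstd"
	"lukechampine.com/blake3"
)

// chunkStore is a hypothetical dedup-aware object store.
type chunkStore interface {
	Has(hash [32]byte) bool
	Put(hash [32]byte, compressed []byte) error
}

// ingestNAR decompresses a zstd NAR, chunks the raw bytes, and returns the
// recipe (ordered chunk hashes) for the caller to record in metadata.
func ingestNAR(compressedNAR io.Reader, store chunkStore) ([][32]byte, error) {
	// Step 2: decompress the incoming NAR (zstd shown; xz is handled analogously).
	dec, err := zstd.NewReader(compressedNAR)
	if err != nil {
		return nil, err
	}
	defer dec.Close()

	// Step 3: stream the raw bytes through FastCDC (production-sized parameters).
	chunker, err := fastcdc.NewChunker(dec, fastcdc.Options{
		MinSize:     64 << 10,  // 64 KB
		AverageSize: 256 << 10, // 256 KB
		MaxSize:     1 << 20,   // 1 MB
	})
	if err != nil {
		return nil, err
	}

	enc, err := zstd.NewWriter(nil) // used only via EncodeAll for per-chunk compression
	if err != nil {
		return nil, err
	}
	defer enc.Close()

	// Step 4: hash, dedup-check, and store each chunk while building the recipe.
	var recipe [][32]byte
	for {
		chunk, err := chunker.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		hash := blake3.Sum256(chunk.Data) // content-address key
		if !store.Has(hash) {
			// New chunk: compress with zstd and store it in the object store.
			if err := store.Put(hash, enc.EncodeAll(chunk.Data, nil)); err != nil {
				return nil, err
			}
		}
		// Existing or new, the recipe records the chunk hash in order.
		recipe = append(recipe, hash)
	}

	// Step 5: the caller persists the recipe in metadata.
	return recipe, nil
}
```

The real implementation also has to handle xz input, concurrent uploads, and metadata transactions; the sketch only captures the shape of the loop.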

On retrieval, chunks are fetched, decompressed, and reassembled in order. HTTP Content-Encoding: zstd provides transport-level compression for clients.
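Retrieval under the same assumptions is the mirror image; chunkFetcher is again a hypothetical interface standing in for ncps's real storage layer.

```go
package retrieve

import (
	"io"

	"github.com/klauspost/compress/zstd"
)

// chunkFetcher is a hypothetical lookup interface over the chunk object store.
type chunkFetcher interface {
	Get(hash [32]byte) ([]byte, error) // returns the zstd-compressed chunk
}

// assembleNAR streams the reassembled, uncompressed NAR to w by walking the
// recipe in order. Transport-level compression (Content-Encoding: zstd) is
// applied separately by the HTTP layer.
func assembleNAR(w io.Writer, recipe [][32]byte, store chunkFetcher) error {
	dec, err := zstd.NewReader(nil) // used only via DecodeAll
	if err != nil {
		return err
	}
	defer dec.Close()

	for _, hash := range recipe {
		compressed, err := store.Get(hash)
		if err != nil {
			return err
		}
		raw, err := dec.DecodeAll(compressed, nil)
		if err != nil {
			return err
		}
		if _, err := w.Write(raw); err != nil {
			return err
		}
	}
	return nil
}
```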

The complexity is real. But the empirical validation made it unavoidable: decompression first is the right call.

The Chunk Distribution: Real-World Scale

At 60,910 NARs, the chunking behavior is well-established. Here's the actual chunk size distribution from ncps (using the 64 KB min / 256 KB avg / 1 MB max configuration):

| Size Range | Count | Percentage |
|---|---|---|
| < 64 KB (tails) | 491,590 | 45.0% |
| 64–128 KB | 271,790 | 24.9% |
| 128–256 KB | 211,939 | 19.4% |
| 256–512 KB | 93,471 | 8.6% |
| 512 KB–1 MB | 21,814 | 2.0% |
| > 1 MB | 1,794 | 0.2% |

The sub-64 KB chunks are FastCDC tails: small chunks that fall below the 64 KB minimum because of how boundaries align at the ends of files. At 45% of the distribution by count, they account for a significant share of metadata overhead. Tracking millions of tiny chunks in a database introduces I/O bottlenecks that can offset the storage savings, which is exactly what justifies the larger chunk sizes used in production.

A further 44.3% of chunks fall in the 64–256 KB range, a comfortable size for individual I/O operations. Chunks over 1 MB are rare (0.2%), confirming that the maximum chunk size constraint is working as intended.

The larger chunk configuration in production (64 KB min / 256 KB avg / 1 MB max, compared to the experimental 16 KB / 64 KB / 256 KB setup) naturally reduces perfect-match deduplication opportunities, since fewer boundaries mean fewer ways for chunks to align. But an average chunk size four times larger also means roughly a quarter as many chunks, cutting metadata rows by about 75% and significantly lowering disk I/O during ingestion and retrieval.

Production Reality: My ncps Store

The experimental test on 57,988 NARs showed the potential. But how does this play out in a real, growing cache?

My ncps instance tells the story:

  • Total NARs pushed: 60,910
  • Total chunks created (references): 1,881,303
  • Unique chunks stored: 1,092,187
  • Duplicate chunk references: 789,116
  • Actual dedup hit rate: 41.95%
  • Raw chunk storage: 342 GB (uncompressed)
  • Compressed chunk storage: 122 GB (with per-chunk zstd)

A 41.95% dedup hit rate in production is remarkably close to the 47.8% from the experimental test. The difference is explained by two factors. First, the diversity in a real cache: over a year, multiple nixpkgs revisions, different platforms (Linux/macOS, ARM/AMD64), and one-off packages reduce perfect-match opportunities. Second, the production configuration uses a larger average chunk size (256 KB vs. 64 KB in the test), which naturally reduces the number of perfect duplicate matches but significantly reduces metadata overhead and disk I/O during operations.

For context: the original Nix-compressed NARs for these 60,910 files would have consumed roughly 150 GB on disk using standard compression. Instead, CDC with per-chunk compression stores the same content in 122 GB — beating standard compression by 18% by squeezing out cross-file redundancies that whole-file compression misses.

By storing only 1,092,187 unique chunks behind 1,881,303 references, the cache saves the equivalent of 789,116 chunk storage operations and the space they would have consumed. That's not wasted space; that's genuine deduplication working in production.

Why Attic Got This Right

Attic made the same architectural choice: decompress, chunk, recompress. They justified it with design reasoning; I've now validated it with production data from a real cache.

This convergence matters. It's not a coincidence. Both of us independently concluded that the complexity is justified because the math works. My data just proves it at scale.

The Broader Lesson

When you're designing a system, the choice between simplicity and correctness isn't always obvious. Chunking compressed data is simpler. It requires fewer CPU cycles, fewer round-trips through compression routines. It's an attractive path.

But empirical testing on real data reveals the cost: chunking compressed data drops the dedup hit rate from 47.8% to 6.4% in the experimental test, while production with uncompressed chunking sustains 41.95%. That's not a small benefit. Of the 1.88 million chunk references across 60,910 packages, 789,116 are duplicates that consume no additional storage. Chunking compressed data directly would have thrown nearly all of that away.

For any Nix cache implementation — whether Attic, ncps, or something else — this finding stands: decompress first, even though it costs more. The deduplication benefits far outweigh the added complexity.


All production data in this post comes from a live ncps instance over 12 months: 60,910 NAR files, 1,881,303 total chunk references, 1,092,187 unique chunks stored, 41.95% dedup hit rate. Raw storage: 342 GB uncompressed, 122 GB with per-chunk zstd compression. The compressed vs. uncompressed chunking test was run on a 57,988-NAR dataset using FastCDC with normalized chunking (16 KB min, 64 KB avg, 256 KB max) and BLAKE3 content hashing.