Content-Defined Chunking (CDC) Results for NAR Storage
I’ve been benchmarking Content-Defined Chunking (CDC) to optimize storage for our NAR files. The goal was to maximize deduplication by decompressing the source NARs before chunking, then re-compressing the unique chunks for final storage.
The Pipeline: Compressed NAR → Decompress → CDC Chunking → Re-compress Chunks → Store
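The pipeline above could be sketched roughly as follows. This is a minimal illustration, not the chunker used for the benchmark: the gear-style rolling hash, the size parameters, `zlib` as a stand-in codec for both decompression and re-compression, and the in-memory `store` dict are all assumptions made for the example.

```python
import hashlib
import random
import zlib

# Hypothetical parameters, not the values used in the benchmark.
MIN_SIZE = 2 * 1024    # never cut a chunk before this many bytes
MAX_SIZE = 64 * 1024   # always cut a chunk once it reaches this many bytes
MASK = (1 << 13) - 1   # cut when the low 13 bits of the rolling hash are zero

# Fixed pseudo-random table for the gear hash: one 64-bit value per byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def cdc_chunks(data):
    """Yield content-defined chunks of `data` using a gear rolling hash."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]  # trailing partial chunk


def store_nar(compressed_nar, store):
    """Decompress, chunk, and store only unseen chunks.

    Returns (total_chunks, new_chunks) for this NAR.
    """
    raw = zlib.decompress(compressed_nar)          # Decompress
    total = new = 0
    for chunk in cdc_chunks(raw):                  # CDC chunking
        total += 1
        key = hashlib.sha256(chunk).hexdigest()    # content address
        if key not in store:
            store[key] = zlib.compress(chunk)      # Re-compress and store
            new += 1
    return total, new
```

Calling `store_nar` twice with the same input and the same `store` adds no new chunks on the second call, which is the deduplication effect the benchmark measures across different NARs that share content.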
Summary Statistics
| Metric | Value |
|---|---|
| Total Files (NARs) | 60,784 |
| Original Compressed Size | 161.06 GB |
| Post-Deduplication Size | 125.46 GB |
| Net Space Saved | 35.60 GB (22.1%) |
| Unique Chunks Stored | 3,934,658 |
| Deduplication Hit Rate | 47.72% |
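As a quick cross-check, the net saving in the table follows directly from the two size rows. The variable names below are only illustrative; the numbers are copied from the table.

```python
original_gb = 161.06  # Original Compressed Size, from the table
deduped_gb = 125.46   # Post-Deduplication Size, from the table

saved_gb = original_gb - deduped_gb
saved_pct = saved_gb / original_gb * 100

print(f"saved {saved_gb:.2f} GB ({saved_pct:.1f}%)")
# prints: saved 35.60 GB (22.1%)
```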
Database Insights
The following queries highlight the storage efficiency and the chunk distribution across the dataset.
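1. Physical Storage Impact
This compares the raw size of the chunks (a transient size that exists only during the CDC process) with their final compressed size on disk. Note that the 307 GB "raw" size was never actually stored; it represents the data processed through the pipeline. pg_size_pretty reports binary units (1 GB here is 1024³ bytes), so the 117 GB below corresponds to roughly 125 to 126 decimal GB, which is consistent with the 125.46 GB in the summary table if that table uses decimal gigabytes.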
ncps=# SELECT
    pg_size_pretty(SUM("size")) AS raw_physical_size,
    pg_size_pretty(SUM("compressed_size")) AS compressed_physical_size
FROM "chunks";
 raw_physical_size | compressed_physical_size
-------------------+--------------------------
 307 GB            | 117 GB
(1 row)

2. Deduplication Efficiency
This query tracks how many chunks were generated in total versus how many unique chunks were actually stored after deduplication.
ncps=# WITH chunk_stats AS (
    SELECT
        COUNT(*) AS total_chunks_created,
        COUNT(DISTINCT "chunk_id") AS unique_chunks_stored
    FROM "nar_file_chunks"
)
SELECT
    total_chunks_created,
    unique_chunks_stored,
    (total_chunks_created - unique_chunks_stored) AS duplicate_chunks,
    ROUND(((total_chunks_created - unique_chunks_stored)::NUMERIC / total_chunks_created) * 100, 2) AS dedup_hit_rate_percent
FROM chunk_stats;
 total_chunks_created | unique_chunks_stored | duplicate_chunks | dedup_hit_rate_percent
----------------------+----------------------+------------------+------------------------
              7525610 |              3934658 |          3590952 |                  47.72
(1 row)

Key Observations
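- Impact of Decompression: Decompressing NARs before chunking is essential. In a compressed stream, a small change in the input can alter most of the output that follows (the "avalanche effect"), so chunking the compressed bytes would find few matching chunks. Chunking the decompressed data avoids this and yielded a 47.72% hit rate.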
- Compression Efficiency: The individual chunks remain highly compressible, moving from 307 GB raw to 117 GB compressed (a ~2.6:1 ratio).
- Storage Trade-off: We've achieved a 22.1% net reduction in disk footprint compared to the original compressed NARs. This comes at the cost of increased metadata complexity, moving from ~60k objects to nearly 4 million unique chunk records, plus about 7.5 million NAR-to-chunk mapping rows in "nar_file_chunks".
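To put the observations above in per-object terms, the reported figures can be combined into a few derived metrics. This is a back-of-the-envelope sketch: the inputs are the rounded values from the query output, and it assumes the "GB" values printed by pg_size_pretty are binary (1024³ bytes), so the per-chunk sizes are approximate.

```python
nar_count = 60_784          # Total Files (NARs)
total_chunks = 7_525_610    # rows in "nar_file_chunks"
unique_chunks = 3_934_658   # distinct chunks actually stored
raw_gib = 307               # SUM("size"), rounded by pg_size_pretty
compressed_gib = 117        # SUM("compressed_size"), rounded by pg_size_pretty

KIB_PER_GIB = 1024 * 1024

hit_rate = (total_chunks - unique_chunks) / total_chunks * 100
chunks_per_nar = total_chunks / nar_count
avg_raw_kib = raw_gib * KIB_PER_GIB / unique_chunks
avg_stored_kib = compressed_gib * KIB_PER_GIB / unique_chunks
compression_ratio = raw_gib / compressed_gib

print(f"dedup hit rate:        {hit_rate:.2f}%")
print(f"chunk refs per NAR:    {chunks_per_nar:.1f}")
print(f"avg raw chunk size:    {avg_raw_kib:.1f} KiB")
print(f"avg stored chunk size: {avg_stored_kib:.1f} KiB")
print(f"compression ratio:     {compression_ratio:.2f}:1")
```

The chunk references per NAR give a rough sense of how many mapping rows each stored file adds, which is the metadata cost weighed against the disk savings.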