Content-Defined Chunking (CDC) Results for NAR Storage

I’ve been benchmarking Content-Defined Chunking (CDC) to optimize storage for our NAR files. The goal was to maximize deduplication by decompressing the source NARs before chunking, then re-compressing the unique chunks for final storage.

The Pipeline: Compressed NAR → Decompress → CDC Chunking → Re-compress Chunks → Store
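To make the stages concrete, here is a minimal sketch of such a pipeline. Everything in it is an assumption for illustration: the gear-style rolling hash, the size limits, SHA-256 chunk IDs, zlib for both compression steps, and the names `cdc_chunks`, `store_nar`, and `load_nar`. It is not the chunker or codec used for the numbers in this post.

```python
import hashlib
import random
import zlib

# Hypothetical parameters, chosen only for illustration.
MIN_SIZE = 2 * 1024    # never cut a chunk shorter than this
MAX_SIZE = 64 * 1024   # always cut once a chunk reaches this length
MASK = (1 << 13) - 1   # cut when the low 13 hash bits are zero (about one candidate per 8 KiB)

# Gear table: one fixed pseudo-random 64-bit value per possible byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def cdc_chunks(data: bytes):
    """Yield content-defined chunks of data using a gear rolling hash."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]  # trailing partial chunk


def store_nar(compressed_nar: bytes, store: dict) -> list:
    """Decompress a NAR, chunk it, and store only chunks not seen before.

    Returns the ordered chunk IDs needed to rebuild the NAR.
    """
    raw = zlib.decompress(compressed_nar)
    chunk_ids = []
    for chunk in cdc_chunks(raw):
        chunk_id = hashlib.sha256(chunk).hexdigest()
        if chunk_id not in store:
            store[chunk_id] = zlib.compress(chunk)  # re-compress unique chunks only
        chunk_ids.append(chunk_id)
    return chunk_ids


def load_nar(chunk_ids: list, store: dict) -> bytes:
    """Rebuild the uncompressed NAR from its ordered chunk list."""
    return b"".join(zlib.decompress(store[cid]) for cid in chunk_ids)
```

Because cut points depend only on nearby content, two NARs that share long runs of bytes tend to produce the same chunks for those runs, even when the shared data sits at different offsets.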

Summary Statistics

 Metric                   | Value
--------------------------+------------------
 Total Files (NARs)       | 60,784
 Original Compressed Size | 161.06 GB
 Post-Deduplication Size  | 125.46 GB
 Net Space Saved          | 35.60 GB (22.1%)
 Unique Chunks Stored     | 3,934,658
 Deduplication Hit Rate   | 47.72%
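The net saving follows directly from the two size figures. A quick check, using only the numbers from the summary (the variable names are mine):

```python
original_gb = 161.06       # original compressed NARs
deduplicated_gb = 125.46   # after CDC, deduplication, and re-compression

saved_gb = original_gb - deduplicated_gb
saved_pct = saved_gb / original_gb * 100

print(f"saved: {saved_gb:.2f} GB ({saved_pct:.1f}%)")  # saved: 35.60 GB (22.1%)
```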

Database Insights

The following queries highlight the storage efficiency and the chunk distribution across the dataset.

1. Physical Storage Impact

This compares the raw (uncompressed) size of the chunks, which exists only transiently during the CDC process, with their final compressed size on disk. Note that the 307 GB "raw" size was never actually stored; it is the uncompressed size of the rows in the "chunks" table as they passed through the pipeline.

ncps=# SELECT
    pg_size_pretty(SUM("size")) AS raw_physical_size,
    pg_size_pretty(SUM("compressed_size")) AS compressed_physical_size
FROM "chunks";

 raw_physical_size | compressed_physical_size 
-------------------+--------------------------
 307 GB            | 117 GB
(1 row)
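One caveat when comparing these figures with the summary: `pg_size_pretty` reports 1024-based units even though it labels them "GB". The sketch below computes the compression ratio and converts the compressed figure to decimal gigabytes. It assumes the summary sizes are in decimal GB, which is not stated; under that assumption, 117 (binary) GB lands close to the 125.46 GB post-deduplication figure, given that `pg_size_pretty` rounds to whole units.

```python
GIB = 1024 ** 3  # bytes per unit as reported by pg_size_pretty

raw_gib = 307
compressed_gib = 117


def gib_to_decimal_gb(gib: float) -> float:
    """Convert 1024-based gigabytes to 1000-based gigabytes."""
    return gib * GIB / 1e9


print(f"{raw_gib / compressed_gib:.2f}:1")            # 2.62:1
print(f"{gib_to_decimal_gb(compressed_gib):.1f} GB")  # 125.6 GB
```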

2. Deduplication Efficiency

This query tracks how many chunks were generated in total versus how many unique chunks were actually stored after deduplication.

ncps=# WITH chunk_stats AS (
    SELECT
        COUNT(*) AS total_chunks_created,
        COUNT(DISTINCT "chunk_id") AS unique_chunks_stored
    FROM "nar_file_chunks"
)
SELECT
    total_chunks_created,
    unique_chunks_stored,
    (total_chunks_created - unique_chunks_stored) AS duplicate_chunks,
    ROUND(((total_chunks_created - unique_chunks_stored)::NUMERIC / total_chunks_created) * 100, 2) AS dedup_hit_rate_percent
FROM chunk_stats;

 total_chunks_created | unique_chunks_stored | duplicate_chunks | dedup_hit_rate_percent 
----------------------+----------------------+------------------+------------------------
              7525610 |              3934658 |          3590952 |                  47.72
(1 row)
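The same arithmetic outside the database, plus two averages derived from the counts above and the NAR count from the summary. The function name `dedup_hit_rate` is mine, not part of any tool.

```python
def dedup_hit_rate(total_chunks: int, unique_chunks: int) -> float:
    """Percentage of chunk references that pointed at an already-stored chunk."""
    return (total_chunks - unique_chunks) / total_chunks * 100


total = 7_525_610   # rows in nar_file_chunks
unique = 3_934_658  # distinct chunk_id values
nars = 60_784       # NAR files in the dataset

print(round(dedup_hit_rate(total, unique), 2))  # 47.72
print(round(total / unique, 2))                 # 1.91 references per unique chunk
print(round(total / nars, 1))                   # 123.8 chunks per NAR on average
```

On average, then, each stored chunk is referenced a little under twice, and each NAR is described by roughly 124 chunk references.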

Key Observations

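  • Impact of Decompression: Decompressing NARs before chunking is essential. Compression has an "avalanche effect": a small change in the input alters most of the compressed output that follows, so chunking the compressed NARs directly would find few matching chunks. Chunking the decompressed data avoids this and allowed for a 47.72% hit rate.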
  • Compression Efficiency: The individual chunks remain highly compressible, moving from 307 GB raw to 117 GB compressed (a ~2.6:1 ratio).
  • Storage Trade-off: We've achieved a 22.1% net reduction in disk footprint compared to the original compressed NARs. This comes at the cost of increased metadata complexity, moving from ~60k objects to nearly 4 million unique chunk records.
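The avalanche effect from the first observation is easy to reproduce on a toy input. The sketch below is an illustration under assumptions, not a measurement on real NARs: the data is synthetic, zlib stands in for whatever compression the NARs actually use, and fixed-size block comparison stands in for CDC to keep the example short. It flips one byte near the start of the input and compares how many blocks still match before and after compression.

```python
import zlib

BLOCK = 1024  # fixed block size, used only to compare the two streams


def shared_fraction(a: bytes, b: bytes) -> float:
    """Fraction of fixed-size blocks that are byte-identical at the same offset."""
    blocks_a = [a[i:i + BLOCK] for i in range(0, len(a), BLOCK)]
    blocks_b = [b[i:i + BLOCK] for i in range(0, len(b), BLOCK)]
    same = sum(x == y for x, y in zip(blocks_a, blocks_b))
    return same / max(len(blocks_a), len(blocks_b))


# Synthetic, compressible stand-in for an uncompressed NAR.
original = "".join(
    f"record {i}: value={i * 7919 % 1000}\n" for i in range(20000)
).encode()

# Change a single byte near the start.
buf = bytearray(original)
buf[100] ^= 0xFF
modified = bytes(buf)

print("uncompressed blocks shared:", shared_fraction(original, modified))
print("compressed blocks shared:  ",
      shared_fraction(zlib.compress(original), zlib.compress(modified)))
```

The uncompressed inputs differ in exactly one block, so nearly everything is shared. The compressed streams typically share far less, because the change alters the encoding and alignment of the data that follows it.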