Format Evolution and Decision Record¶

This document records how the crushr format architecture emerged through deliberate experimentation and elimination, not incremental feature accumulation.

It is not a changelog. It is a design evidence record.

The format variants described below are research history. They are not runtime feature toggles for canonical extraction behavior.

Each phase documents:

the hypothesis under test
why it was plausible
how it was evaluated
what the results showed
the resulting architectural decision

Current Outcome (Summary)¶

The current crushr architecture is defined by:

Extent identity as primary truth
Mirrored dictionaries for naming
Fail-closed naming semantics
No central authority required for recovery

The following were rejected:

manifest-led designs
metadata-heavy recovery strategies
placement-based optimizations
leadership-based dictionary systems (FORMAT-15)

FORMAT-05 — Self-identifying blocks¶

Hypothesis¶

Embedding identity directly with payload blocks enables recovery without reliance on central metadata.

Why this seemed plausible¶

Traditional archive failures are dominated by metadata corruption. Moving identity closer to data may preserve recoverability.

Test¶

Compared under deterministic corruption (truncation, overwrite, fragmentation):

- payload-only recovery
- metadata-indexed recovery
- self-identifying block recovery

Result¶

Metadata-indexed recovery failed early under corruption
Self-identifying blocks retained recoverable structure
Overhead was acceptable relative to recovery gains
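
The idea of a self-identifying block can be sketched as follows. This is a minimal illustration, not crushr's on-disk layout: the `BLK1` marker, the header fields, and the use of CRC32 are all assumptions made for the example. Each block carries its own id, length, and checksum, so a scanner can recover valid blocks from damaged bytes without consulting any index.

```python
import struct
import zlib

MAGIC = b"BLK1"  # hypothetical 4-byte marker, not crushr's real magic
HEADER = struct.Struct("<4sIII")  # magic, block id, payload length, CRC32 of payload


def encode_block(block_id: int, payload: bytes) -> bytes:
    """Prefix the payload with a header that identifies and checksums it."""
    return HEADER.pack(MAGIC, block_id, len(payload), zlib.crc32(payload)) + payload


def scan_blocks(data: bytes) -> dict[int, bytes]:
    """Recover every block whose header and checksum still validate.

    No index is consulted: the scan walks the raw bytes looking for MAGIC.
    """
    found: dict[int, bytes] = {}
    pos = data.find(MAGIC)
    while pos != -1:
        if pos + HEADER.size <= len(data):
            _, block_id, length, crc = HEADER.unpack_from(data, pos)
            start = pos + HEADER.size
            payload = data[start : start + length]
            # Reject truncated payloads and payloads whose checksum no longer matches.
            if len(payload) == length and zlib.crc32(payload) == crc:
                found[block_id] = payload
        pos = data.find(MAGIC, pos + 1)
    return found


archive = b"".join(encode_block(i, f"payload-{i}".encode()) for i in range(3))

# Flip one byte inside block 1's payload and truncate the tail of block 2.
block_len = HEADER.size + len(b"payload-0")
scratch = bytearray(archive)
scratch[block_len + HEADER.size] ^= 0xFF
damaged = bytes(scratch[:-2])

recovered = scan_blocks(damaged)
```

In this sketch only block 0 survives the damage, and it is recovered without any surrounding metadata.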

Decision¶

Promoted — established the foundation for extent identity.

FORMAT-07 — Metadata-heavy reinforcement¶

Hypothesis¶

Increasing metadata redundancy improves recovery reliability.

Why this seemed plausible¶

Redundant metadata is a common resilience strategy in archive formats.

Test¶

Introduced expanded metadata layers and duplication strategies.

Result¶

Increased archive size significantly
Did not improve recovery in proportion to cost
Metadata remained a correlated failure domain

Decision¶

Rejected — redundancy at the metadata layer does not solve structural fragility.

FORMAT-08 — Placement optimization¶

Hypothesis¶

Strategic placement of metadata and payload improves survivability under corruption.

Why this seemed plausible¶

Physical layout can influence which regions are more likely to survive partial damage.

Test¶

Varied:

- metadata placement strategies
- payload clustering patterns

Result¶

No consistent recovery advantage
Outcomes were highly dependent on corruption pattern
Added complexity without deterministic benefit

Decision¶

Rejected — placement strategy is not a reliable recovery mechanism.

FORMAT-09/10 — Incremental refinement phase¶

Hypothesis¶

Iterative tuning of prior designs may yield compound improvements.

Why this seemed plausible¶

Earlier phases established partial success; refinement might converge on optimal behavior.

Test¶

Multiple minor variations across:

- metadata structure
- block organization
- recovery heuristics

Result¶

No breakthrough improvement
Confirmed that structural assumptions, not tuning, were the limiting factor

Decision¶

Neutral / transitional — provided evidence that a structural shift was required.

FORMAT-11 — Extent identity consolidation¶

Hypothesis¶

Treating extents as independently verifiable units will maximize recovery under corruption.

Why this seemed plausible¶

Earlier phases showed payload-adjacent identity outperformed metadata-centered designs.

Test¶

Implemented:

- per-extent hashing (BLAKE3)
- independent validation
- removal of central dependency for payload reconstruction

Result¶

High recovery rates under all tested corruption modes
Payload integrity preserved even when metadata was lost
Clear separation between structural truth and naming
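
Independent per-extent validation might look like the sketch below. It is an illustration under stated assumptions: the `Extent` field names are hypothetical, and because BLAKE3 is not in the Python standard library the example substitutes BLAKE2b from `hashlib` as a stand-in. The point is that each extent verifies against data it carries itself, so losing metadata does not prevent payload salvage.

```python
import hashlib
from dataclasses import dataclass


def digest(payload: bytes) -> bytes:
    # Stand-in: the source names BLAKE3, which is not in the Python standard
    # library, so this sketch substitutes BLAKE2b with a 32-byte digest.
    return hashlib.blake2b(payload, digest_size=32).digest()


@dataclass(frozen=True)
class Extent:
    extent_id: int       # hypothetical field names, for illustration only
    payload: bytes
    payload_hash: bytes

    def is_valid(self) -> bool:
        """Validate this extent using only data it carries itself."""
        return digest(self.payload) == self.payload_hash


def make_extent(extent_id: int, payload: bytes) -> Extent:
    return Extent(extent_id, payload, digest(payload))


def salvage(extents: list[Extent]) -> dict[int, bytes]:
    """Keep every extent that verifies; no manifest or index is consulted."""
    return {e.extent_id: e.payload for e in extents if e.is_valid()}


extents = [make_extent(i, f"data-{i}".encode()) for i in range(4)]
# Simulate corruption of one payload; the other extents are unaffected.
extents[2] = Extent(2, b"garbage", extents[2].payload_hash)
salvaged = salvage(extents)
```

Here `salvaged` contains extents 0, 1, and 3. The damaged extent is dropped rather than returned with unverified content.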

Decision¶

Promoted (core architecture) — extent identity becomes the primary invariant.

FORMAT-12 — Inline naming¶

Hypothesis¶

Attaching naming data directly to extents enables named recovery without centralized metadata.

Why this seemed plausible¶

If identity works locally, naming might also survive when colocated with payload.

Test¶

Compared:

- extent_identity_only
- extent_identity_inline_path
- manifest-based naming

Measured:

- recovery rate
- name retention
- archive size overhead

Result¶

Named recovery matched manifest-based approaches
Significant duplication cost for repeated paths
Demonstrated feasibility of decentralized naming
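
The duplication cost can be made concrete with a small calculation. This is a sketch with hypothetical numbers, not measured results: it assumes a file may span many extents and that inline naming stores the full path on each one, then compares that against storing each path once.

```python
def inline_naming_bytes(extents_per_path: dict[str, int]) -> int:
    """Bytes spent on names when every extent carries its full path."""
    return sum(len(path.encode()) * count for path, count in extents_per_path.items())


def single_copy_naming_bytes(extents_per_path: dict[str, int]) -> int:
    """Bytes spent on names when each path is stored exactly once."""
    return sum(len(path.encode()) for path in extents_per_path)


# Hypothetical workload: path -> number of extents that file was split into.
workload = {"logs/app/2024/server.log": 500, "README.md": 1}
inline = inline_naming_bytes(workload)         # 24 * 500 + 9 = 12009
baseline = single_copy_naming_bytes(workload)  # 24 + 9 = 33
```

The gap grows linearly with the number of extents per file, which is why repeated paths dominate the naming overhead in this model.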

Decision¶

Promoted (transitional) — validated decentralized naming, but not efficient enough long-term.

FORMAT-12-STRESS — Inline naming under scale¶

Hypothesis¶

Inline naming remains viable under large-scale workloads.

Test¶

Applied large datasets with high path repetition.

Result¶

Path duplication caused measurable archive bloat
Performance degraded under repeated string storage

Decision¶

Demoted — naming must be decoupled from per-extent duplication.

FORMAT-13 — Dictionary introduction¶

Hypothesis¶

Centralizing naming into a dictionary reduces duplication while preserving recovery.

Why this seemed plausible¶

Separating naming from extents may retain benefits while reducing overhead.

Test¶

Introduced dictionary structures mapping extents → paths.

Result¶

Archive size improved significantly
Naming restored efficiently
Introduced new dependency risk (dictionary survival)
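
A dictionary of this kind can be sketched as a path table plus an extent-to-index map. The function names and structure below are assumptions for illustration, not crushr's actual dictionary encoding. The sketch also shows the dependency risk: when the dictionary is gone, the extent is still recoverable but has no name.

```python
from typing import Optional

Dictionary = tuple[list[str], dict[int, int]]


def build_dictionary(extent_paths: dict[int, str]) -> Dictionary:
    """Store each distinct path once; map extent id -> index into the path table."""
    paths: list[str] = []
    index_of: dict[str, int] = {}
    extent_to_path: dict[int, int] = {}
    for extent_id, path in extent_paths.items():
        if path not in index_of:
            index_of[path] = len(paths)
            paths.append(path)
        extent_to_path[extent_id] = index_of[path]
    return paths, extent_to_path


def name_for(extent_id: int, dictionary: Optional[Dictionary]) -> Optional[str]:
    """Return the path for an extent, or None when no name can be established."""
    if dictionary is None:
        return None  # payload is still recoverable, but only anonymously
    paths, extent_to_path = dictionary
    idx = extent_to_path.get(extent_id)
    return paths[idx] if idx is not None else None


dictionary = build_dictionary({0: "a/big.bin", 1: "a/big.bin", 2: "b.txt"})
```

With this layout, `"a/big.bin"` is stored once even though two extents refer to it, and `name_for(1, None)` returns `None` to model a lost dictionary.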

Decision¶

Promoted with caution — effective but introduces a recoverability dependency.

FORMAT-14A — Mirrored dictionaries¶

Hypothesis¶

Replicating dictionaries removes the single-point-of-failure introduced in FORMAT-13.

Why this seemed plausible¶

Redundant but independent copies may allow naming recovery even when partially corrupted.

Test¶

multiple dictionary copies
no primary designation
independent validation via checksums

Result¶

Naming preserved if any valid dictionary survives
No coordination dependency required
Balanced size vs recovery tradeoff
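
The mirrored model can be sketched as follows. The serialization (JSON), the checksum (SHA-256), and the function names are assumptions chosen for a self-contained example; the source only states that copies are validated independently via checksums and that none is primary. If no copy validates, the sketch returns `None` instead of guessing, in line with fail-closed naming.

```python
import hashlib
import json
from typing import Optional


def seal(dictionary: dict[str, str]) -> bytes:
    """Serialize a dictionary copy and prepend a checksum of the body."""
    body = json.dumps(dictionary, sort_keys=True).encode()
    return hashlib.sha256(body).digest() + body


def open_copy(blob: bytes) -> Optional[dict[str, str]]:
    """Return the dictionary if this copy validates on its own, else None."""
    checksum, body = blob[:32], blob[32:]
    if len(checksum) < 32 or hashlib.sha256(body).digest() != checksum:
        return None
    return json.loads(body)


def recover_names(copies: list[bytes]) -> Optional[dict[str, str]]:
    """Accept the first copy that validates; no copy is designated primary.

    If no copy validates, return None rather than guess at names.
    """
    for blob in copies:
        names = open_copy(blob)
        if names is not None:
            return names
    return None


names = {"0": "a.txt", "1": "b.txt"}
sealed = seal(names)
corrupted = sealed[:-1] + b"X"  # body altered, checksum no longer matches
truncated = sealed[:10]         # too short to hold a full checksum
```

Calling `recover_names([corrupted, truncated, sealed])` returns the original mapping from the one surviving copy, while `recover_names([corrupted, truncated])` returns `None`.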

Decision¶

Promoted (final naming architecture) — mirrored dictionaries adopted.

Comparison artifact schemas for active families:

- schemas/crushr-lab-salvage-format12-inline-path-comparison.v1.schema.json
- schemas/crushr-lab-salvage-format12-stress-comparison.v2.schema.json
- schemas/crushr-lab-salvage-format13-comparison.v1.schema.json
- schemas/crushr-lab-salvage-format13-stress-comparison.v1.schema.json
- schemas/crushr-lab-salvage-format14a-dictionary-resilience.v1.schema.json
- schemas/crushr-lab-salvage-format14a-dictionary-resilience-stress.v1.schema.json
- schemas/crushr-lab-salvage-format15.v1.schema.json
- schemas/crushr-lab-salvage-format15-stress.v1.schema.json

FORMAT-15 — Factored dictionary leadership¶

Hypothesis¶

Introducing a “leader” dictionary reduces redundancy while preserving recovery.

Why this seemed plausible¶

Reducing duplication could improve efficiency without sacrificing correctness.

Test¶

designated primary dictionary
fallback handling for secondary structures

Result¶

Recovery degraded when leader was corrupted
Naming collapsed despite surviving data
No meaningful size advantage over mirrored model
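
One way to reason about this result is a toy probability model. It is purely illustrative and not derived from the experiments: it assumes each dictionary copy is corrupted independently with probability `p`, and that in the leader model names cannot be restored once the leader is lost.

```python
def naming_loss_mirrored(p: float, copies: int) -> float:
    """Naming is lost only if every mirrored copy is corrupted."""
    return p ** copies


def naming_loss_leader(p: float) -> float:
    """Naming is lost as soon as the single leader is corrupted."""
    return p


# Illustrative inputs only: 10% corruption chance per copy, three mirrors.
mirrored = naming_loss_mirrored(0.1, 3)
leader = naming_loss_leader(0.1)
```

Under these assumptions the mirrored model loses naming far less often than the leader model at the same per-copy corruption rate. Real corruption is often correlated, so the gap in practice may be smaller than this model suggests.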

Decision¶

Rejected — leadership reintroduces a central point of failure.

Branch Outcomes¶

| Design branch | Status | Reason |
|---|---|---|
| Metadata-heavy / manifest-led | Rejected | Fragile under corruption |
| Placement optimization | Rejected | Non-deterministic benefit |
| Extent identity | Promoted | Strong recovery invariant |
| Inline naming | Transitional | Correct but inefficient |
| Central dictionary | Partial success | Efficient but fragile |
| Mirrored dictionaries | Promoted | Best resilience/size balance |
| Dictionary leadership (FORMAT-15) | Rejected | Reintroduced failure point |

Remaining Open Questions¶

The current architecture is stable, but not final. Active areas:

Compression strategy vs identity placement
Dictionary scaling limits under extreme datasets
Optimal tail-frame indexing for large archives
Benchmark-driven validation vs ZIP / 7z under corruption

Key Takeaway¶

crushr’s architecture is not the result of incremental feature design.

It is the result of repeatedly asking:

“What survives when the archive is broken?”

and removing every design that failed to answer that question correctly.
