Format Evolution and Decision Record¶
This document records how the crushr format architecture emerged through deliberate experimentation and elimination, not incremental feature accumulation.
It is not a changelog. It is a design evidence record.
The format variants described below are research history. They are not runtime feature toggles for canonical extraction behavior.
Each phase documents:
- the hypothesis under test
- why it was plausible
- how it was evaluated
- what the results showed
- the resulting architectural decision
Current Outcome (Summary)¶
The current crushr architecture is defined by:
- Extent identity as primary truth
- Mirrored dictionaries for naming
- Fail-closed naming semantics
- No central authority required for recovery
The following were rejected:
- manifest-led designs
- metadata-heavy recovery strategies
- placement-based optimizations
- leadership-based dictionary systems (FORMAT-15)
FORMAT-05 — Self-identifying blocks¶
Hypothesis¶
Embedding identity directly with payload blocks enables recovery without reliance on central metadata.
Why this seemed plausible¶
Traditional archive failures are dominated by metadata corruption. Moving identity closer to data may preserve recoverability.
Test¶
Compared under deterministic corruption (truncation, overwrite, fragmentation):
- payload-only recovery
- metadata-indexed recovery
- self-identifying block recovery
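The corruption modes named above can be made reproducible by driving them from a seed. The following is a minimal sketch, not the crushr lab harness: the function names, the zero-fill overwrite, and the reading of "fragmentation" as dropping a seeded subset of fixed-size pieces are all assumptions made for illustration.

```python
import random


def truncate(data: bytes, keep_fraction: float) -> bytes:
    """Keep only the leading keep_fraction of the archive bytes."""
    return data[: int(len(data) * keep_fraction)]


def overwrite(data: bytes, seed: int, span: int) -> bytes:
    """Zero out one span-byte window at an offset chosen by a seeded RNG."""
    if span >= len(data):
        return bytes(len(data))
    rng = random.Random(seed)
    start = rng.randrange(len(data) - span + 1)
    return data[:start] + bytes(span) + data[start + span:]


def fragment(data: bytes, seed: int, piece: int) -> bytes:
    """Split into piece-sized chunks and keep a seeded subset, in order.

    Assumption: "fragmentation" here means some chunks go missing while
    the survivors keep their relative order.
    """
    rng = random.Random(seed)
    chunks = [data[i:i + piece] for i in range(0, len(data), piece)]
    return b"".join(c for c in chunks if rng.random() >= 0.5)
```

Because every random choice comes from `random.Random(seed)`, the same seed reproduces the same damage, which lets different format variants be compared against identical corrupted inputs.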
Result¶
- Metadata-indexed recovery failed early under corruption
- Self-identifying blocks retained recoverable structure
- Overhead was acceptable relative to recovery gains
Decision¶
Promoted — established the foundation for extent identity.
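To make the idea of a self-identifying block concrete, here is a hedged sketch of one possible framing. The marker `BLK1`, the header layout, and the use of CRC32 are hypothetical choices for this example and do not describe crushr's on-disk format. The point is only that a scanner can rediscover blocks in a damaged buffer without consulting any index.

```python
import struct
import zlib

MAGIC = b"BLK1"                    # hypothetical marker, not crushr's
HEADER = struct.Struct("<4sIII")   # magic, block_id, payload_len, crc32


def encode_block(block_id: int, payload: bytes) -> bytes:
    """Prefix the payload with a header that carries its own identity."""
    header = HEADER.pack(MAGIC, block_id, len(payload), zlib.crc32(payload))
    return header + payload


def scan_blocks(buf: bytes) -> dict[int, bytes]:
    """Recover every block whose header and payload still verify."""
    found = {}
    pos = buf.find(MAGIC)
    while pos != -1:
        body_start = pos + HEADER.size
        if body_start <= len(buf):
            _, block_id, length, crc = HEADER.unpack_from(buf, pos)
            payload = buf[body_start:body_start + length]
            if len(payload) == length and zlib.crc32(payload) == crc:
                found[block_id] = payload
                # Skip past the verified block before searching again.
                pos = buf.find(MAGIC, body_start + length)
                continue
        # Header or payload did not verify: resume one byte further on.
        pos = buf.find(MAGIC, pos + 1)
    return found
```

In this sketch a block that fails its checksum is simply skipped, so damage to one block does not prevent its neighbors from being found.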
FORMAT-07 — Metadata-heavy reinforcement¶
Hypothesis¶
Increasing metadata redundancy improves recovery reliability.
Why this seemed plausible¶
Redundant metadata is a common resilience strategy in archive formats.
Test¶
Introduced expanded metadata layers and duplication strategies.
Result¶
- Increased archive size significantly
- Did not improve recovery in proportion to cost
- Metadata remained a correlated failure domain
Decision¶
Rejected — redundancy at the metadata layer does not solve structural fragility.
FORMAT-08 — Placement optimization¶
Hypothesis¶
Strategic placement of metadata and payload improves survivability under corruption.
Why this seemed plausible¶
Physical layout can influence which regions are more likely to survive partial damage.
Test¶
Varied:
- metadata placement strategies
- payload clustering patterns
Result¶
- No consistent recovery advantage
- Outcomes were highly dependent on corruption pattern
- Added complexity without deterministic benefit
Decision¶
Rejected — placement strategy is not a reliable recovery mechanism.
FORMAT-09/10 — Incremental refinement phase¶
Hypothesis¶
Iterative tuning of prior designs may yield compound improvements.
Why this seemed plausible¶
Earlier phases established partial success; refinement might converge on optimal behavior.
Test¶
Multiple minor variations across:
- metadata structure
- block organization
- recovery heuristics
Result¶
- No breakthrough improvement
- Confirmed that structural assumptions, not tuning, were the limiting factor
Decision¶
Neutral / transitional — provided evidence that a structural shift was required.
FORMAT-11 — Extent identity consolidation¶
Hypothesis¶
Treating extents as independently verifiable units will maximize recovery under corruption.
Why this seemed plausible¶
Earlier phases showed payload-adjacent identity outperformed metadata-centered designs.
Test¶
Implemented:
- per-extent hashing (BLAKE3)
- independent validation
- removal of central dependency for payload reconstruction
Result¶
- High recovery rates under all tested corruption modes
- Payload integrity preserved even when metadata was lost
- Clear separation between structural truth and naming
Decision¶
Promoted (core architecture) — extent identity becomes the primary invariant.
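The sketch below illustrates independent per-extent validation under stated assumptions. The `Extent` fields and helper names are hypothetical. The source names BLAKE3, but the example substitutes BLAKE2b from Python's standard library so it runs without third-party packages.

```python
import hashlib
from dataclasses import dataclass


def digest(data: bytes) -> bytes:
    # Stand-in hash: BLAKE2b replaces BLAKE3 for this example only.
    return hashlib.blake2b(data, digest_size=32).digest()


@dataclass(frozen=True)
class Extent:
    file_id: int      # structural identity, deliberately not a path
    offset: int       # position of this payload within its file
    payload: bytes
    checksum: bytes


def make_extent(file_id: int, offset: int, payload: bytes) -> Extent:
    return Extent(file_id, offset, payload, digest(payload))


def is_valid(extent: Extent) -> bool:
    """Each extent verifies on its own; no index or manifest is needed."""
    return digest(extent.payload) == extent.checksum


def rebuild(extents: list[Extent]) -> dict[int, list[tuple[int, bytes]]]:
    """Group the extents that verify by file_id, ordered by offset."""
    files: dict[int, list[tuple[int, bytes]]] = {}
    for extent in extents:
        if is_valid(extent):
            files.setdefault(extent.file_id, []).append(
                (extent.offset, extent.payload)
            )
    return {fid: sorted(parts) for fid, parts in files.items()}
```

Note that `rebuild` returns files keyed by an identifier rather than a name, which mirrors the separation between structural truth and naming recorded in the results above.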
FORMAT-12 — Inline naming¶
Hypothesis¶
Attaching naming data directly to extents enables named recovery without centralized metadata.
Why this seemed plausible¶
If identity works locally, naming might also survive when colocated with payload.
Test¶
Compared:
- extent_identity_only
- extent_identity_inline_path
- manifest-based naming
Measured:
- recovery rate
- name retention
- archive size overhead
Result¶
- Named recovery matched manifest-based approaches
- Significant duplication cost for repeated paths
- Demonstrated feasibility of decentralized naming
Decision¶
Promoted (transitional) — validated decentralized naming, but not efficient enough long-term.
FORMAT-12-STRESS — Inline naming under scale¶
Hypothesis¶
Inline naming remains viable under large-scale workloads.
Test¶
Applied large datasets with high path repetition.
Result¶
- Path duplication caused measurable archive bloat
- Performance degraded under repeated string storage
Decision¶
Demoted — naming must be decoupled from per-extent duplication.
FORMAT-13 — Dictionary introduction¶
Hypothesis¶
Centralizing naming into a dictionary reduces duplication while preserving recovery.
Why this seemed plausible¶
Separating naming from extents may retain benefits while reducing overhead.
Test¶
Introduced dictionary structures mapping extents → paths.
Result¶
- Archive size improved significantly
- Naming restored efficiently
- Introduced new dependency risk (dictionary survival)
Decision¶
Promoted with caution — effective but introduces a recoverability dependency.
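The size trade-off between inline naming (FORMAT-12) and a dictionary (FORMAT-13) can be illustrated with simple byte counting. This is a rough model, not a measurement: it ignores framing and compression, and the 4-byte per-extent reference is an assumed value.

```python
def inline_naming_bytes(extent_paths: list[str]) -> int:
    """Every extent stores its full path, so repeats are paid for each time."""
    return sum(len(path.encode("utf-8")) for path in extent_paths)


def dictionary_naming_bytes(extent_paths: list[str], ref_size: int = 4) -> int:
    """Each distinct path is stored once; extents hold a fixed-size reference.

    ref_size is an assumption for illustration, not a crushr constant.
    """
    table = sum(len(path.encode("utf-8")) for path in set(extent_paths))
    return table + ref_size * len(extent_paths)


# A large file split into 1000 extents, plus one small single-extent file.
paths = ["data/big.bin"] * 1000 + ["notes.txt"]
print(inline_naming_bytes(paths))      # 12009
print(dictionary_naming_bytes(paths))  # 4025
```

Under this model the dictionary wins whenever paths repeat and are longer than the reference, which is the duplication effect the stress phase observed. The saving comes at the cost of the dictionary becoming something recovery depends on.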
FORMAT-14A — Mirrored dictionaries¶
Hypothesis¶
Replicating dictionaries removes the single-point-of-failure introduced in FORMAT-13.
Why this seemed plausible¶
Redundant but independent copies may allow naming recovery even when partially corrupted.
Test¶
- multiple dictionary copies
- no primary designation
- independent validation via checksums
Result¶
- Naming preserved if any valid dictionary survives
- No coordination dependency required
- Balanced size vs recovery tradeoff
Decision¶
Promoted (final naming architecture) — mirrored dictionaries adopted.
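A minimal sketch of the mirrored model follows, assuming a JSON body guarded by a CRC32 prefix. The encoding, the function names, and the reading of fail-closed naming as "return no name rather than a guessed one" are assumptions for this example, not crushr's actual layout or semantics.

```python
import json
import zlib
from typing import Optional


def encode_dictionary(names: dict[int, str]) -> bytes:
    """Serialize an extent-id -> path mapping behind a CRC32 prefix."""
    body = json.dumps(names, sort_keys=True).encode("utf-8")
    return zlib.crc32(body).to_bytes(4, "little") + body


def decode_dictionary(blob: bytes) -> Optional[dict[int, str]]:
    """Return the mapping only if the checksum verifies, else None."""
    if len(blob) < 4:
        return None
    body = blob[4:]
    if zlib.crc32(body) != int.from_bytes(blob[:4], "little"):
        return None
    return {int(key): path for key, path in json.loads(body).items()}


def resolve_names(
    copies: list[bytes], extent_ids: list[int]
) -> dict[int, Optional[str]]:
    """Use the first copy that validates; no copy is treated as primary.

    Fail-closed: when nothing validates, every extent stays unnamed (None).
    """
    for blob in copies:
        names = decode_dictionary(blob)
        if names is not None:
            return {eid: names.get(eid) for eid in extent_ids}
    return {eid: None for eid in extent_ids}
```

Because each copy validates independently, the order of `copies` carries no authority: any single surviving copy is enough to restore names, and extent payloads remain recoverable without names if all copies are lost.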
Comparison artifact schemas for active families:
- schemas/crushr-lab-salvage-format12-inline-path-comparison.v1.schema.json
- schemas/crushr-lab-salvage-format12-stress-comparison.v2.schema.json
- schemas/crushr-lab-salvage-format13-comparison.v1.schema.json
- schemas/crushr-lab-salvage-format13-stress-comparison.v1.schema.json
- schemas/crushr-lab-salvage-format14a-dictionary-resilience.v1.schema.json
- schemas/crushr-lab-salvage-format14a-dictionary-resilience-stress.v1.schema.json
- schemas/crushr-lab-salvage-format15.v1.schema.json
- schemas/crushr-lab-salvage-format15-stress.v1.schema.json
FORMAT-15 — Factored dictionary leadership¶
Hypothesis¶
Introducing a “leader” dictionary reduces redundancy while preserving recovery.
Why this seemed plausible¶
Reducing duplication could improve efficiency without sacrificing correctness.
Test¶
- designated primary dictionary
- fallback handling for secondary structures
Result¶
- Recovery degraded when leader was corrupted
- Naming collapsed despite surviving data
- No meaningful size advantage over mirrored model
Decision¶
Rejected — leadership reintroduces a central point of failure.
Branch Outcomes¶
| Design branch | Status | Reason |
|---|---|---|
| Metadata-heavy / manifest-led | Rejected | Fragile under corruption |
| Placement optimization | Rejected | Non-deterministic benefit |
| Extent identity | Promoted | Strong recovery invariant |
| Inline naming | Transitional | Correct but inefficient |
| Central dictionary | Partial success | Efficient but fragile |
| Mirrored dictionaries | Promoted | Best resilience/size balance |
| Dictionary leadership (FORMAT-15) | Rejected | Reintroduced failure point |
Remaining Open Questions¶
The current architecture is stable, but not final. Active areas:
- Compression strategy vs identity placement
- Dictionary scaling limits under extreme datasets
- Optimal tail-frame indexing for large archives
- Benchmark-driven validation against ZIP and 7z under corruption
Key Takeaway¶
crushr’s architecture is not the result of incremental feature design.
It is the result of repeatedly asking:
“What survives when the archive is broken?”
and removing every design that failed to answer that question correctly.