73 experiments across 4 research tracks. 30 hours of distributed compute. 0.000 standard deviation. A complete record of systematic pipeline hardening through adversarial self-research.
Failure Mode 04 (FM-04) is Metric Shadowing: a measurement becomes the target and stops being useful as a measurement. The pipeline diagnosed this in itself during Track 1.
The v1 metric was sustain_rate × mean_strength_weight. The system optimized for the highest score, but the skeptic had been softened so much that it was barely challenging anything. The metric went up. The quality went down.
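A minimal sketch of how a score of this shape can be gamed. The field names `sustained` and `strength_weight`, the choice to average the weight over all findings, and the sample data are assumptions for illustration, not the pipeline's actual schema.

```python
def v1_score(findings):
    """Hypothetical v1 score: sustain_rate * mean_strength_weight."""
    if not findings:
        return 0.0
    # Fraction of findings that survived the skeptic's challenge.
    sustain_rate = sum(f["sustained"] for f in findings) / len(findings)
    # Assumption: the mean is taken over all findings, sustained or not.
    mean_strength_weight = sum(f["strength_weight"] for f in findings) / len(findings)
    return sustain_rate * mean_strength_weight


# Illustrative data: a strict skeptic rejects the two weak findings.
strict_skeptic = [
    {"sustained": True, "strength_weight": 1.0},
    {"sustained": False, "strength_weight": 0.5},
    {"sustained": False, "strength_weight": 0.5},
    {"sustained": True, "strength_weight": 1.0},
]
# A softened skeptic sustains everything, including the weak findings.
soft_skeptic = [dict(f, sustained=True) for f in strict_skeptic]

print(v1_score(strict_skeptic))  # 0.375
print(v1_score(soft_skeptic))    # 0.75
```

The findings are identical in both runs; only the skeptic changed, and the score doubled.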
The same framework built to detect metric shadowing in organizations caught metric shadowing in the Coherence pipeline's own scoring logic.
The fix was structural: replace the metric entirely. The v2 correctness metric uses reproducibility as a multiplicative gate and verifies evidence through Jaccard overlap. You cannot game evidence verification by softening the skeptic.
correctness = reproducibility × stability_rate × mean(evidence_score × diagnostic_weight × strength_weight)
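The formula can be read as the following sketch. The per-finding dictionary keys mirror the terms in the formula, but the data layout, the value ranges, and the handling of an empty findings list are assumptions.

```python
from statistics import mean


def correctness(reproducibility, stability_rate, findings):
    """Sketch of the v2 formula.

    reproducibility and stability_rate multiply the whole score, so a
    value of 0 for either one zeroes the result regardless of findings.
    """
    if not findings:
        return 0.0  # Assumption: no findings means no credit.
    per_finding = mean(
        [
            f["evidence_score"] * f["diagnostic_weight"] * f["strength_weight"]
            for f in findings
        ]
    )
    return reproducibility * stability_rate * per_finding


# Hypothetical findings: one with verified evidence, one with none.
verified = {"evidence_score": 0.5, "diagnostic_weight": 1.0, "strength_weight": 1.0}
unverified = {"evidence_score": 0.0, "diagnostic_weight": 1.0, "strength_weight": 1.0}

print(correctness(1.0, 1.0, [verified]))              # 0.5
print(correctness(1.0, 1.0, [verified, unverified]))  # 0.25
print(correctness(0.0, 1.0, [verified]))              # 0.0
```

Because evidence_score sits inside the product, a finding the skeptic sustains but whose evidence cannot be verified contributes nothing, and it drags the mean down.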
13 experiments · Spark 1 · March 10 evening
Baseline 0.058. Explored evidence-first processes, citation verification, dimension-specific matching. Incremental gains to 0.156 (2.7x).
Forcing verbatim word-copying in contradiction findings (cr_017) raised Jaccard overlap from ~0.17 to 0.50. Score: 0.233, a 4x improvement over baseline.
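Jaccard overlap is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to word sets to show why verbatim copying scores higher than paraphrase. The tokenizer and the example sentences are assumptions; the pipeline's actual tokenization is not described here.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between the word sets of two strings."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


source = "The policy requires weekly reviews"
paraphrase = "Reviews must happen every week"
verbatim = "policy requires weekly reviews"

print(jaccard(source, paraphrase))  # 1 shared word out of 9 distinct words
print(jaccard(source, verbatim))    # 0.8
```

The paraphrase carries the same meaning but shares only one word with the source, so it is hard to verify by overlap. Copied wording is easy to verify.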
Dropping the misalignment dimension from Authority (cr_024) raised the score to 0.400. Fewer findings of higher quality beat more findings of mixed quality.
Five different experiments reached the 0.667 ceiling through different mechanisms, all with the same outcome: one perfect finding with maximum diagnostic weight and verified evidence.
98% of extracted items are never cited in any finding. Broad extraction ensures coverage; aggressive filtering ensures quality.
Edge documents (employee reviews, regulatory filings) are scarce but high-signal. Setting asymmetric limits preserved edge signal while reducing center noise.
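One way asymmetric limits could be expressed is a per-document cap that depends on the document's class. Everything in this sketch is hypothetical: the cap values, the `doc_class`, `doc_id`, and `score` fields, and the choice to keep the top-scoring items are illustrative assumptions, not the pipeline's configuration.

```python
from collections import defaultdict

# Hypothetical caps: generous for scarce edge documents, tight for center ones.
LIMITS = {"edge": 20, "center": 5}


def apply_limits(items, limits=LIMITS):
    """Keep the top-scoring items per document, capped by the document's class."""
    by_doc = defaultdict(list)
    for item in items:
        by_doc[item["doc_id"]].append(item)

    kept = []
    for doc_items in by_doc.values():
        # Assumption: all items from one document share its class.
        cap = limits[doc_items[0]["doc_class"]]
        ranked = sorted(doc_items, key=lambda it: it["score"], reverse=True)
        kept.extend(ranked[:cap])
    return kept


# Illustrative input: ten items each from one center and one edge document.
items = [
    {"doc_id": "memo-1", "doc_class": "center", "score": i / 10} for i in range(10)
] + [
    {"doc_id": "review-1", "doc_class": "edge", "score": i / 10} for i in range(10)
]

kept = apply_limits(items)
print(sum(1 for it in kept if it["doc_class"] == "center"))  # 5
print(sum(1 for it in kept if it["doc_class"] == "edge"))    # 10
```

The center document loses half its items while the edge document keeps all of them.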
Why cr_021 and not cr_032? The 0.667 experiments achieve their scores through extreme concentration: a single finding. The cr_021 config produces 3 sustained findings across multiple dimensions, which makes it the more diagnostically useful output.