When the System Found Itself
One of the failure modes in our framework is Metric Shadowing: a measurement becomes the target and ceases to be useful. The pipeline diagnosed this condition on itself during Track 1.
The v1 scoring metric rewarded throughput. The system optimized its way to the highest score by weakening the adversarial agent until it was barely challenging anything. The metric went up; the quality went down.
The same framework built to detect this condition in organizations caught it in its own scoring logic.
The fix was structural: replace the metric entirely. The v2 correctness metric uses reproducibility as a gate and verifies that evidence actually supports each conclusion. You cannot game evidence verification by weakening the adversarial agent.
The Autoresearch Loop
The v2 metric is a composite that gates on reproducibility first, then evaluates evidence quality and diagnostic significance. It rewards fewer, higher-quality findings over a larger volume of weaker ones.
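A minimal sketch of what such a gated composite can look like. The structure (hard reproducibility gate first, then evidence and significance terms) follows the description above, but the field names, the `Finding` shape, and the way the terms combine are illustrative assumptions, not the pipeline's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    evidence_verified: bool   # cited evidence actually supports the conclusion
    diagnostic_weight: float  # 0.0 (inert) .. 1.0 (maximally diagnostic)

def score_v2(findings: list[Finding], reproducible: bool) -> float:
    """Gated composite: reproducibility is a hard gate, not a weighted term."""
    if not reproducible or not findings:
        return 0.0  # a non-reproducible run scores zero regardless of content
    # Evidence verification rate: checked against the source material, so it
    # cannot be gamed by weakening the adversarial agent.
    evidence = sum(f.evidence_verified for f in findings) / len(findings)
    # Averaging diagnostic weight (rather than summing it) is what rewards
    # fewer, higher-quality findings over a larger volume of weaker ones.
    significance = sum(f.diagnostic_weight for f in findings) / len(findings)
    return evidence * significance
```

Averaging over findings, rather than summing, is what makes padding costly: every weak finding added drags the average diagnostic weight down, which is exactly the gradient the v1 throughput metric lacked.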
Compute Distribution
- Spark 1 (DGX): Scoring v1 (13) + Scoring v2 (41) = 54 experiments
- Spark 2 (DGX): Extraction (19 experiments)
- M2 Studio: Cross-entity validation (9 trials)
- Total compute: ~30 hours across 3 nodes, all local qwen3:32b inference
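All inference was local. As a concrete reference point, a single call in this setup might look like the following, assuming an Ollama-style runtime serves the model; the client library and options shown are an assumption, not something this writeup specifies.

```python
import ollama  # assumes qwen3:32b is pulled and served locally via Ollama

response = ollama.chat(
    model="qwen3:32b",
    messages=[{"role": "user", "content": "List the evidence cited for finding F1."}],
    options={"temperature": 0},  # deterministic decoding supports the 0.000-stdev goal
)
print(response["message"]["content"])
```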
Track 1: The Throughput Metric
Track 2: The Correctness Metric
Four Phases of Discovery
Phase 1: Foundation. Baseline 0.058. Explored evidence-first processes, citation verification, dimension-specific matching. Incremental gains to 0.156 (2.7x).
Phase 2: Evidence Breakthrough. A single change to how the model grounds its reasoning in source evidence (cr_017) lifted evidence verification scores from ~0.17 to 0.50. Score: 0.233, a 4x improvement over baseline.
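The actual cr_017 change is not reproduced in this writeup; the sketch below shows the general shape of an evidence-grounding instruction, where the model must quote source material before it is allowed to conclude anything. The wording is illustrative.

```python
# Illustrative evidence-grounding instruction (not the actual cr_017 text).
# Forcing a verbatim quote before each conclusion makes every conclusion
# mechanically checkable against the source, which is what evidence
# verification scores.
GROUNDED_PROMPT = """\
For each finding, follow this order strictly:
1. EVIDENCE: quote the exact source passage, with its document ID.
2. CONCLUSION: state only what the quoted passage supports.
Never state a conclusion that has no quoted evidence directly above it."""
```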
Phase 3: Structural Concentration. Concentrating the scoring on fewer diagnostic dimensions (cr_024) pushed the score to 0.400. Fewer findings of higher quality beat more findings of mixed quality.
Phase 4: Maximum Extraction. Five different experiments reached the 0.667 ceiling through different mechanisms but with the same outcome: one perfect finding with maximum diagnostic weight and verified evidence.
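To see why concentration pays off under an averaging metric, consider a toy calculation; the numbers are invented, and only the mechanism mirrors cr_024.

```python
# Toy illustration of dimension concentration: averaging over fewer,
# higher-weight diagnostic dimensions raises the composite score.
diffuse      = [0.9, 0.2, 0.1, 0.15, 0.1]  # one strong finding diluted by inert ones
concentrated = [0.9, 0.8]                  # low-value categories removed entirely

def mean(xs):
    return sum(xs) / len(xs)

print(round(mean(diffuse), 2))       # 0.29: padding drags the average down
print(round(mean(concentrated), 2))  # 0.85: fewer findings, higher composite
```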
The Verification Bottleneck
Track 3: Extraction
98% of extracted items are never cited in any finding. Broad extraction ensures coverage; aggressive filtering ensures quality.
Some document types are scarce but high-signal. Setting asymmetric extraction limits preserved those signals while reducing noise from higher-volume sources.
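A sketch of what asymmetric limits can look like in practice; the source types, caps, and field names below are placeholders rather than the tuned production values.

```python
# Hypothetical per-source-type extraction caps: scarce, high-signal documents
# get generous limits; plentiful, noisy sources get tight ones.
EXTRACTION_LIMITS = {
    "incident_report": 50,  # scarce but high-signal: extract aggressively
    "design_doc":      40,
    "meeting_notes":   10,  # high-volume and noisy: cap hard
    "chat_log":         5,
}

def cap_extractions(items: list[dict], source_type: str) -> list[dict]:
    """Keep only the highest-confidence items, up to the per-source cap."""
    limit = EXTRACTION_LIMITS.get(source_type, 20)  # default cap is a guess
    ranked = sorted(items, key=lambda item: item["confidence"], reverse=True)
    return ranked[:limit]
```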
The Bullwhip Effect in Diagnostics
In supply chains, small demand signals become large order swings at the factory. The same pattern appears in the pipeline: upstream extraction instability cascades into downstream scoring failures.
A 1.6x extraction improvement produced an 11.5x scoring improvement. Small upstream changes create large downstream effects.
4 of 19 extraction experiments triggered downstream penalties. Rigid or verbose extractions didn't just score poorly; they poisoned the scoring stage.
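The amplification is what you would expect when downstream scoring multiplies stage qualities together and one stage responds nonlinearly to its input. A toy model of that mechanism (the curve and numbers are invented; only the shape of the effect matches what Track 3 observed):

```python
# Toy bullwhip: score is the product of extraction quality and a verification
# rate that turns on sharply once extraction quality clears a threshold.
def downstream_score(extraction_quality: float) -> float:
    verification_rate = min(1.0, max(0.0, 2.5 * (extraction_quality - 0.3)))
    return extraction_quality * verification_rate

before, after = 0.35, 0.56  # a 1.6x upstream improvement
print(downstream_score(after) / downstream_score(before))  # ~8.3x downstream
```

In this toy, a 1.6x upstream change produces roughly an 8x downstream swing; the real pipeline showed 11.5x, the same order of amplification.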
It's Not the Model — It's the System
Each infrastructure layer compounds performance. The same model (qwen3:32b) with different system layers produces dramatically different results.
Track 4: Cross-Entity Validation
Perfect Reproducibility
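The claim behind this heading is the 0.000 stdev reported under What Ships below. A sketch of how such a check can be expressed, where `run_pipeline` is a hypothetical entry point and determinism comes from fixed decoding settings:

```python
import statistics

def reproducibility_stdev(run_pipeline, config, entity, trials: int = 3) -> float:
    """Score the same config on the same entity several times; identical
    scores across all trials yield a population stdev of exactly 0.0."""
    scores = [run_pipeline(config, entity).score for _ in range(trials)]
    return statistics.pstdev(scores)

# Expected for every shipped config: reproducibility_stdev(...) == 0.0
```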
What the System Taught Us
1. Evidence verification is the bottleneck, not reasoning. A single change to how evidence is grounded produced the largest single improvement (4x). The model reasons well; it just doesn't naturally anchor its reasoning to source material.
2. Constraints hurt more than they help. Rigid formatting and structural constraints produced regressions in 11 experiments. Tell the model what to attend to, not how to format (see the sketch after this list).
3. Not all findings carry equal weight. Some categories of findings are diagnostically inert. Removing low-value categories entirely produced the highest scores.
4. Extraction has diminishing returns. Extraction improved 1.6x while scoring improved 11.5x, but once extraction quality crosses a threshold, further upstream optimization yields minimal downstream gain.
5. The metric itself requires scrutiny. The v2 metric rewards one perfect finding over several good ones. Whether to optimize for depth or breadth is a research decision that should be made deliberately.
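A sketch of the contrast in lesson 2. Both prompts below are invented examples of the two instruction styles, not the pipeline's actual prompts.

```python
# Invented example of a rigid formatting constraint: the kind of
# instruction that produced regressions in 11 experiments.
CONSTRAINED = """Output exactly 5 findings. Each finding must be one sentence
of at most 20 words, formatted as: [ID] | [severity] | [text]."""

# Invented example of an attention-directing instruction: it says what to
# look at, not how to format the answer.
DIRECTED = """Focus on evidence of process failure: skipped reviews,
unverified claims, metrics that moved while quality did not.
Report only findings you can support with a quoted source passage."""
```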
Where the Leverage Is
What Ships
Why cr_021 and not cr_032? The 0.667 experiments achieve their scores through extreme concentration: a single finding. The cr_021 config instead produces 3 sustained findings across multiple dimensions, which is the more diagnostically useful output. A consolidated sketch of the shipped configuration follows the list below.
- Scoring: Rewritten to produce concise, evidence-grounded findings with calibrated quality thresholds.
- Extraction: Asymmetric limits tuned per source type. Adversarial filtering. Calibrated confidence thresholds.
- Metric: The v2 correctness metric is locked in as the production scoring standard.
- Reproducibility: 0.000 stdev confirmed across all configs and entities.
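Pulling the four items together, a hypothetical consolidated configuration; every key and value here is an illustrative stand-in, since this writeup does not publish the tuned production settings.

```python
# Illustrative consolidation of the shipped components; names and values
# are stand-ins for the actual tuned configuration.
SHIPPED = {
    "scoring": {
        "base_config": "cr_021",  # 3 sustained findings across dimensions
        "style": "concise, evidence-grounded",
    },
    "extraction": {
        "limits": "asymmetric per source type",
        "adversarial_filtering": True,
        "confidence_thresholds": "calibrated",
    },
    "metric": "v2 gated composite",
    "reproducibility": {"required_stdev": 0.0},
}
```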