Coherence Record

Autoresearch: Pipeline Hardening

73 experiments across 4 research tracks. ~30 hours of distributed compute. 0.000 standard deviation across all runs. A complete record of systematic pipeline hardening through adversarial self-research.

  • 73 experiments
  • 4 research tracks
  • 0.000 stdev (all runs)
  • 11.5x scoring improvement
  • 1.6x extraction improvement
  • ~30h compute time

When the System Found Itself

Failure Mode 04 is Metric Shadowing: a measurement becomes the target and ceases to be useful. The pipeline diagnosed this failure mode on itself during Track 1.

What Happened

The v1 metric was sustain_rate × mean_strength_weight. The system optimized for the highest score, but the skeptic had been softened so far that it was barely challenging anything. The metric went up. The quality went down.

Why This Matters

The Coherence pipeline diagnosed FM-04 on itself. The same framework built to detect metric shadowing in organizations caught metric shadowing in its own scoring logic.

The fix was structural: replace the metric entirely. The v2 correctness metric uses reproducibility as a multiplicative gate and verifies evidence through Jaccard overlap. You cannot game evidence verification by softening the skeptic.
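
The evidence check is a token-overlap test. Below is a minimal sketch of a Jaccard overlap function of the kind described here; the function name and whitespace tokenization are illustrative assumptions, not the pipeline's actual code.

    def jaccard_overlap(finding_text: str, evidence_text: str) -> float:
        """Token-level Jaccard overlap between a finding and its cited evidence.

        Sketch only: the real pipeline's tokenization and thresholds are
        not documented here.
        """
        finding_tokens = set(finding_text.lower().split())
        evidence_tokens = set(evidence_text.lower().split())
        if not finding_tokens or not evidence_tokens:
            return 0.0
        # Overlap of shared tokens relative to all tokens in either text.
        return len(finding_tokens & evidence_tokens) / len(finding_tokens | evidence_tokens)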

The Autoresearch Loop

  • Step 1, Hypothesis: the agent proposes a specific prompt change, scoped to a single variable.
  • Step 2, Modification: the prompt file is modified programmatically and the diff is recorded.
  • Step 3, Triple-blind trial: 3 independent trials against Nike run_079. Same evidence, different scoring runs.
  • Step 4, Measurement: mean, stdev, and component metrics are computed and logged to JSONL.
  • Step 5, Decision: if the score improves, keep the change; if it regresses, revert. Then propose the next hypothesis.
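
One iteration of the loop, sketched in Python below; apply_patch and score_once are hypothetical stand-ins for the pipeline's real prompt-modification and scoring calls.

    import json
    import statistics

    def run_iteration(hypothesis, apply_patch, score_once, best_so_far,
                      log_path="experiments.jsonl", n_trials=3):
        """One pass through the autoresearch loop (illustrative sketch)."""
        diff = apply_patch(hypothesis)                    # Step 2: modify prompt, record diff
        scores = [score_once() for _ in range(n_trials)]  # Step 3: triple-blind trials
        record = {                                        # Step 4: measurement
            "hypothesis": hypothesis,
            "diff": diff,
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores),
            "trials": scores,
        }
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")            # append one JSONL record
        if record["mean"] <= best_so_far:                 # Step 5: decision
            apply_patch(hypothesis, revert=True)          # regression: revert the change
            return best_so_far
        return record["mean"]                             # improvement: keep the change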
The Correctness Metric (v2)

correctness = reproducibility × stability_rate × mean(evidence_score × diagnostic_weight × strength_weight)
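
Transcribed into Python below; the per-finding field names are illustrative assumptions. Because reproducibility multiplies the whole expression, a run that fails to reproduce zeroes out the score regardless of finding quality.

    def correctness(reproducibility, stability_rate, findings):
        """v2 correctness metric, a direct transcription of the formula above.

        findings: list of dicts with evidence_score, diagnostic_weight, and
        strength_weight per sustained finding (field names are illustrative).
        """
        if not findings:
            return 0.0
        per_finding = [
            f["evidence_score"] * f["diagnostic_weight"] * f["strength_weight"]
            for f in findings
        ]
        # Reproducibility and stability gate the mean per-finding quality.
        return reproducibility * stability_rate * sum(per_finding) / len(per_finding)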

Compute Distribution

Experiment Distribution Across Tracks & Devices
  • Spark 1 (DGX): Scoring v1 (13) + Scoring v2 (41) = 54 experiments
  • Spark 2 (DGX): Extraction (19 experiments)
  • M2 Studio: Cross-entity validation (9 trials)
  • Total compute: ~30 hours across 3 nodes, all local qwen3:32b inference

The Throughput Metric

13 experiments · Spark 1 · March 10 evening

Figure: Scoring v1 trajectory. Gold line = composite score; blue bars = skeptic throughput; red markers = experiments flagged low_challenge_rate (skeptic too lenient). Most prompt changes had no measurable effect, and the metric was abandoned after ar_013.

The Correctness Metric: 41 Experiments

Figure: Correctness score across 41 experiments, in three phases: Foundation (cr_001–012), Breakthroughs (cr_013–025), and Maximum Extraction (cr_026–041). Dashed line = running best. 0.058 baseline → 0.667 ceiling, an 11.5x improvement.

Four Phases of Discovery

Phase 1 Foundation

Baseline 0.058. Explored evidence-first processes, citation verification, dimension-specific matching. Incremental gains to 0.156 (2.7x).

Phase 2 Evidence Breakthrough

Forcing verbatim word-copying in contradiction findings (cr_017) jumped Jaccard overlap from ~0.17 to 0.50. Score: 0.233, a 4x improvement over baseline.

Phase 3 Structural Concentration

Dropping the misalignment dimension from Authority (cr_024) jumped the score to 0.400. Fewer findings of higher quality beat more findings of mixed quality.

Phase 4 Maximum Extraction

Five different experiments reached the 0.667 ceiling through different mechanisms but the same outcome: one perfect finding with maximum diagnostic weight and verified evidence.

The Verification Bottleneck

Figure: Evidence score vs. correctness. Evidence verification is the primary driver of correctness. The bottleneck is not reasoning; it is whether evidence actually supports the conclusion.
Figure: Sustained findings vs. diagnostic weight. As quality concentrated into fewer findings, diagnostic weight rose. The 0.667 cluster sustains 1 finding at maximum weight.

Fewer, Sharper Extractions Win

Figure: Extraction score vs. total items extracted. The relationship is inverse. Baseline: 57 items, 0.283 score. Optimal (arx_017): 45 items, 0.450 score. Star = best result with asymmetric limits (center=3, edge=5).

Key Insight

98% of extracted items are never cited in any finding. Broad extraction ensures coverage; aggressive filtering ensures quality.

Edge vs. Center

Edge documents (employee reviews, regulatory filings) are scarce but high-signal. Setting asymmetric limits preserved edge signal while reducing center noise.
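
A sketch of how such asymmetric per-document caps might be applied; only the center=3 / edge=5 values come from this record, while the item schema and is_edge flag are assumptions.

    # Asymmetric extraction limits (arx_017-style).
    EXTRACTION_LIMITS = {"center": 3, "edge": 5}

    def cap_extractions(items_by_document):
        """Keep the highest-confidence items per document, up to its cap."""
        kept = []
        for doc, items in items_by_document.items():
            # Edge documents (employee reviews, regulatory filings) are scarce
            # but high-signal, so they get the larger per-document budget.
            limit = EXTRACTION_LIMITS["edge" if doc.is_edge else "center"]
            kept.extend(sorted(items, key=lambda i: i.confidence, reverse=True)[:limit])
        return kept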

Extraction Trajectory

Figure: Score trajectory with penalty markers. Red markers = skeptic_too_harsh penalties. Pattern: upstream rigidity (rigid or verbose extractions) cascades into downstream scoring failures. The extraction and scoring systems are coupled.

Do the Prompts Generalize?

Figure: Cross-entity correctness scores. Three entities, three sectors, three evidence profiles. Reproducibility is near-perfect across all three (mean 0.997). Score differences reflect evidence quality, not prompt overfitting.
  • Sustained count stdev: 0.0
  • Mean reproducibility: 0.997
  • Entities confirmed stable: 3/3

Perfect Reproducibility

Standard deviation across all 73 experiments = 0.000. Every experiment produced identical results across all trials; the system is fully deterministic given the same inputs. This was not a design goal. It is an emergent property of structured prompts with seeded sampling and fixed evidence corpora.
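
The record does not name the inference server; below is a minimal sketch assuming an Ollama-style local endpoint, where a fixed seed and zero temperature make sampling repeatable. The endpoint, seed value, and timeout are illustrative assumptions.

    import requests

    def generate(prompt: str) -> str:
        """Deterministic local inference call (illustrative, Ollama-style API)."""
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "qwen3:32b",
                "prompt": prompt,
                "stream": False,
                "options": {"seed": 42, "temperature": 0},  # seeded, greedy decoding
            },
            timeout=600,
        )
        return response.json()["response"]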

What the System Taught Us

  • 1. Evidence verification is the bottleneck, not reasoning. Forcing verbatim word-copying produced the largest single improvement (4x). The model reasons well; it just doesn't naturally ground its reasoning in evidence language (see the worked example after this list).
  • 2. Constraints hurt more than they help. Templates, citation limits, and format requirements produced regressions in 11 experiments. Tell the model what to attend to, not how to format.
  • 3. Alignment findings are dead weight. At diagnostic weight 0.2, removing alignment entirely produced the highest scores. “Your narrative matches reality” is not actionable intelligence.
  • 4. Extraction has diminishing returns: a 1.6x extraction improvement vs. an 11.5x scoring improvement. Once extraction quality crosses a threshold, further upstream optimization yields minimal downstream gain.
  • 5. The metric itself requires scrutiny. The v2 metric rewards one perfect finding over several good ones. Whether to optimize for depth or breadth is a research decision that should be made deliberately.
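
A hypothetical worked example of why verbatim copying moves the Jaccard overlap, reusing the jaccard_overlap sketch from earlier; the sentences are invented, not drawn from the evidence corpus.

    evidence = "gross margin declined for the third consecutive quarter"

    paraphrase = "profitability has been steadily eroding recently"
    verbatim = "finding: gross margin declined for the third consecutive quarter"

    print(jaccard_overlap(paraphrase, evidence))  # 0.0: no shared tokens
    print(jaccard_overlap(verbatim, evidence))    # ~0.89: shares nearly every token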

Where the Leverage Is

Figure: Scoring vs. extraction improvement. Scoring prompt changes dominate; resource allocation should reflect the leverage difference.
  • Scoring improvement: 11.5x
  • Extraction improvement: 1.6x

What Ships

  • scoring_prompts_best.py (cr_021 era · 0.267 correctness)
  • extractor_prompts_best.py (arx_017 · 0.450 throughput)

Why cr_021 and not cr_032? The 0.667 experiments achieve their scores through extreme concentration: a single finding. The cr_021 config produces 3 sustained findings across multiple dimensions, which is the more diagnostically useful output.

What Changed

  • Scoring prompts: 1–2 sentence findings using topic vocabulary, evidence-discipline blocks, and rewritten few-shot examples.
  • Extraction prompts: asymmetric limits (center=3, edge=5), an adversarial scrutiny filter, and a confidence floor of 0.6 (sketched below).
  • Metric: finding_derived_v1 is the production standard.
  • Reproducibility: 0.000 stdev confirmed across all configs and entities.
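
A minimal sketch of the filtering order implied above: confidence floor first, adversarial scrutiny second. Only the 0.6 floor comes from this record; the item schema and the survives_scrutiny callable are assumptions.

    CONFIDENCE_FLOOR = 0.6

    def filter_extractions(items, survives_scrutiny):
        """Drop low-confidence items, then apply an adversarial scrutiny pass."""
        confident = [item for item in items if item["confidence"] >= CONFIDENCE_FLOOR]
        return [item for item in confident if survives_scrutiny(item)]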
AR-001: “Automation may observe, measure, and report. It may not decide, approve, or publish.”