Coherence Record

Autoresearch: Pipeline Hardening

73 experiments across 4 research tracks. ~30 hours of distributed compute. 0.000 standard deviation across all runs. A complete record of systematic pipeline hardening through adversarial self-research.

  • 73 experiments
  • 4 research tracks
  • 0.000 stdev (all runs)
  • 11.5x scoring improvement
  • 1.6x extraction improvement
  • ~30h compute time

When the System Found Itself

Failure Mode 04 is Metric Shadowing: a measurement becomes the target and ceases to be useful. The pipeline diagnosed this failure mode on itself during Track 1.

What Happened

The v1 metric was sustain_rate × mean_strength_weight. The system optimized for the highest score, but the skeptic had been softened so far that it was barely challenging anything. The metric went up. The quality went down.

Why This Matters

The Coherence pipeline diagnosed FM-04 on itself. The same framework built to detect metric shadowing in organizations caught metric shadowing in its own scoring logic.

The fix was structural: replace the metric entirely. The v2 correctness metric uses reproducibility as a multiplicative gate and verifies evidence through Jaccard overlap. You cannot game evidence verification by softening the skeptic.
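
The evidence check is a token-overlap test. Below is a minimal sketch of a Jaccard overlap function of the kind described here; the function name and whitespace tokenization are illustrative assumptions, not the pipeline's actual code.

    def jaccard_overlap(finding_text: str, evidence_text: str) -> float:
        """Token-level Jaccard overlap between a finding and its cited evidence.

        Sketch only: the real pipeline's tokenization and thresholds are
        not documented here.
        """
        finding_tokens = set(finding_text.lower().split())
        evidence_tokens = set(evidence_text.lower().split())
        if not finding_tokens or not evidence_tokens:
            return 0.0
        # Overlap of shared tokens relative to all tokens in either text.
        return len(finding_tokens & evidence_tokens) / len(finding_tokens | evidence_tokens)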

The Autoresearch Loop

  • Step 1, Hypothesis: the agent proposes a specific prompt change, scoped to a single variable.
  • Step 2, Modification: the prompt file is modified programmatically and the diff is recorded.
  • Step 3, Triple-blind trial: 3 independent trials against Nike run_079. Same evidence, different scoring runs.
  • Step 4, Measurement: mean, stdev, and component metrics are computed and logged to JSONL.
  • Step 5, Decision: if the score improves, keep the change; if it regresses, revert. Then propose the next hypothesis.
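
One iteration of the loop, sketched in Python below; apply_patch and score_once are hypothetical stand-ins for the pipeline's real prompt-modification and scoring calls.

    import json
    import statistics

    def run_iteration(hypothesis, apply_patch, score_once, best_so_far,
                      log_path="experiments.jsonl", n_trials=3):
        """One pass through the autoresearch loop (illustrative sketch)."""
        diff = apply_patch(hypothesis)                    # Step 2: modify prompt, record diff
        scores = [score_once() for _ in range(n_trials)]  # Step 3: triple-blind trials
        record = {                                        # Step 4: measurement
            "hypothesis": hypothesis,
            "diff": diff,
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores),
            "trials": scores,
        }
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")            # append one JSONL record
        if record["mean"] <= best_so_far:                 # Step 5: decision
            apply_patch(hypothesis, revert=True)          # regression: revert the change
            return best_so_far
        return record["mean"]                             # improvement: keep the change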
The Correctness Metric (v2)

correctness = reproducibility × stability_rate × mean(evidence_score × diagnostic_weight × strength_weight)
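
Transcribed into Python below; the per-finding field names are illustrative assumptions. Because reproducibility multiplies the whole expression, a run that fails to reproduce zeroes out the score regardless of finding quality.

    def correctness(reproducibility, stability_rate, findings):
        """v2 correctness metric, a direct transcription of the formula above.

        findings: list of dicts with evidence_score, diagnostic_weight, and
        strength_weight per sustained finding (field names are illustrative).
        """
        if not findings:
            return 0.0
        per_finding = [
            f["evidence_score"] * f["diagnostic_weight"] * f["strength_weight"]
            for f in findings
        ]
        # Reproducibility and stability gate the mean per-finding quality.
        return reproducibility * stability_rate * sum(per_finding) / len(per_finding)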

Compute Distribution

Experiment Distribution Across Tracks & Devices
  • Spark 1 (DGX): Scoring v1 (13) + Scoring v2 (41) = 54 experiments
  • Spark 2 (DGX): Extraction (19 experiments)
  • M2 Studio: Cross-entity validation (9 trials)
  • Total compute: ~30 hours across 3 nodes, all local qwen3:32b inference

The Throughput Metric

13 experiments · Spark 1 · March 10 evening

Figure: Scoring v1 trajectory. Gold line = composite score; blue bars = skeptic throughput; red markers = experiments flagged low_challenge_rate (skeptic too lenient). Most prompt changes had no measurable effect, and the metric was abandoned after ar_013.

The Correctness Metric: 41 Experiments

Figure: Correctness score across 41 experiments, in three phases: Foundation (cr_001–012), Breakthroughs (cr_013–025), and Maximum Extraction (cr_026–041). Dashed line = running best. 0.058 baseline → 0.667 ceiling, an 11.5x improvement.

Four Phases of Discovery

Phase 1 Foundation

Baseline 0.058. Explored evidence-first processes, citation verification, dimension-specific matching. Incremental gains to 0.156 (2.7x).

Phase 2 Evidence Breakthrough

Forcing verbatim word-copying in contradiction findings (cr_017) jumped Jaccard overlap from ~0.17 to 0.50. Score: 0.233, a 4x improvement over baseline.

Phase 3 Structural Concentration

Dropping the misalignment dimension from Authority (cr_024) jumped the score to 0.400. Fewer findings of higher quality beat more findings of mixed quality.

Phase 4 Maximum Extraction

Five different experiments reached the 0.667 ceiling through different mechanisms but the same outcome: one perfect finding with maximum diagnostic weight and verified evidence.

The Verification Bottleneck

Figure: Evidence score vs. correctness. Evidence verification is the primary driver of correctness. The bottleneck is not reasoning; it is whether evidence actually supports the conclusion.
Figure: Sustained findings vs. diagnostic weight. As quality concentrated into fewer findings, diagnostic weight rose. The 0.667 cluster sustains 1 finding at maximum weight.

Fewer, Sharper Extractions Win

Figure: Extraction score vs. total items extracted. The relationship is inverse. Baseline: 57 items, 0.283 score. Optimal (arx_017): 45 items, 0.450 score. Star = best result with asymmetric limits (center=3, edge=5).

Key Insight

98% of extracted items are never cited in any finding. Broad extraction ensures coverage; aggressive filtering ensures quality.

Edge vs. Center

Edge documents (employee reviews, regulatory filings) are scarce but high-signal. Setting asymmetric limits preserved edge signal while reducing center noise.
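
A sketch of how such asymmetric per-document caps might be applied; only the center=3 / edge=5 values come from this record, while the item schema and is_edge flag are assumptions.

    # Asymmetric extraction limits (arx_017-style).
    EXTRACTION_LIMITS = {"center": 3, "edge": 5}

    def cap_extractions(items_by_document):
        """Keep the highest-confidence items per document, up to its cap."""
        kept = []
        for doc, items in items_by_document.items():
            # Edge documents (employee reviews, regulatory filings) are scarce
            # but high-signal, so they get the larger per-document budget.
            limit = EXTRACTION_LIMITS["edge" if doc.is_edge else "center"]
            kept.extend(sorted(items, key=lambda i: i.confidence, reverse=True)[:limit])
        return kept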

Extraction Trajectory

Figure: Score trajectory with penalty markers. Red markers = skeptic_too_harsh penalties. Pattern: upstream rigidity (rigid or verbose extractions) cascades into downstream scoring failures. The extraction and scoring systems are coupled.

Do the Prompts Generalize?

Figure: Cross-entity correctness scores. Three entities, three sectors, three evidence profiles. Reproducibility is near-perfect across all three (mean 0.997). Score differences reflect evidence quality, not prompt overfitting.
  • Sustained count stdev: 0.0
  • Mean reproducibility: 0.997
  • Entities confirmed stable: 3/3

Perfect Reproducibility

Standard deviation across all 73 experiments = 0.000. Every experiment produced identical results across all trials; the system is fully deterministic given the same inputs. This was not a design goal. It is an emergent property of structured prompts with seeded sampling and fixed evidence corpora.
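
The record does not name the inference server; below is a minimal sketch assuming an Ollama-style local endpoint, where a fixed seed and zero temperature make sampling repeatable. The endpoint, seed value, and timeout are illustrative assumptions.

    import requests

    def generate(prompt: str) -> str:
        """Deterministic local inference call (illustrative, Ollama-style API)."""
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "qwen3:32b",
                "prompt": prompt,
                "stream": False,
                "options": {"seed": 42, "temperature": 0},  # seeded, greedy decoding
            },
            timeout=600,
        )
        return response.json()["response"]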

What the System Taught Us

  • 1. Evidence verification is the bottleneck, not reasoning. Forcing verbatim word-copying produced the largest single improvement (4x). The model reasons well; it just doesn't naturally ground its reasoning in evidence language (see the worked example after this list).
  • 2. Constraints hurt more than they help. Templates, citation limits, and format requirements produced regressions in 11 experiments. Tell the model what to attend to, not how to format.
  • 3. Alignment findings are dead weight. At diagnostic weight 0.2, removing alignment entirely produced the highest scores. “Your narrative matches reality” is not actionable intelligence.
  • 4. Extraction has diminishing returns: a 1.6x extraction improvement vs. an 11.5x scoring improvement. Once extraction quality crosses a threshold, further upstream optimization yields minimal downstream gain.
  • 5. The metric itself requires scrutiny. The v2 metric rewards one perfect finding over several good ones. Whether to optimize for depth or breadth is a research decision that should be made deliberately.
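
A hypothetical worked example of why verbatim copying moves the Jaccard overlap, reusing the jaccard_overlap sketch from earlier; the sentences are invented, not drawn from the evidence corpus.

    evidence = "gross margin declined for the third consecutive quarter"

    paraphrase = "profitability has been steadily eroding recently"
    verbatim = "finding: gross margin declined for the third consecutive quarter"

    print(jaccard_overlap(paraphrase, evidence))  # 0.0: no shared tokens
    print(jaccard_overlap(verbatim, evidence))    # ~0.89: shares nearly every token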

Where the Leverage Is

Figure: Scoring vs. extraction improvement. Scoring prompt changes dominate; resource allocation should reflect the leverage difference.
  • Scoring improvement: 11.5x
  • Extraction improvement: 1.6x

What Ships

  • scoring_prompts_best.py (cr_021 era · 0.267 correctness)
  • extractor_prompts_best.py (arx_017 · 0.450 throughput)

Why cr_021 and not cr_032? The 0.667 experiments achieve their scores through extreme concentration: a single finding. The cr_021 config produces 3 sustained findings across multiple dimensions, which is the more diagnostically useful output.

What Changed

  • Scoring prompts: 1–2 sentence findings using topic vocabulary, evidence-discipline blocks, and rewritten few-shot examples.
  • Extraction prompts: asymmetric limits (center=3, edge=5), an adversarial scrutiny filter, and a confidence floor of 0.6 (sketched below).
  • Metric: finding_derived_v1 is the production standard.
  • Reproducibility: 0.000 stdev confirmed across all configs and entities.
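
A minimal sketch of the filtering order implied above: confidence floor first, adversarial scrutiny second. Only the 0.6 floor comes from this record; the item schema and the survives_scrutiny callable are assumptions.

    CONFIDENCE_FLOOR = 0.6

    def filter_extractions(items, survives_scrutiny):
        """Drop low-confidence items, then apply an adversarial scrutiny pass."""
        confident = [item for item in items if item["confidence"] >= CONFIDENCE_FLOOR]
        return [item for item in confident if survives_scrutiny(item)]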
AR-001: “Automation may observe, measure, and report. It may not decide, approve, or publish.”