When the Skeptic Was the Variable

At a Glance

Experiments

across 2 phases

Compute

11h26m

two Sparks

Variants

prompt edits

Variants Won

none passed all gates

Baseline Drift

0.000

harness stable

The Setup

The original autoresearch (cr_001 through cr_041, March 2026) hardened the diagnostic pipeline against qwen3:32b. Output was deterministic. Standard deviation across identical reruns: 0.000. The system was diagnostic, not stochastic.

On 2026-03-27, Gemma 4 31B replaced qwen3:32b as the production extraction standard after a head-to-head race study. Scoring was supposed to remain locked to qwen3:32b. The intent was preserved in writing. It was not preserved in code.

Investigation traced the cause to a single point: the skeptic was tuned for qwen3 finding-text style. Gemma 4 phrases inferences differently. The skeptic read those differences as speculative. Round 2 killed them.

Methodology

The loop is the same one that hardened qwen3. Adapted for the new target.

1. Hypothesis    → one focused change, recorded
2. Variant       → prompt edit, applied to a non-production scoring_prompts.py
3. Three trials  → identical inputs, three runs, measure variance
4. Gates         → 5 correctness gates evaluated on the aggregated result
5. Decision      → keep or revert; log to experiments.jsonl
6. Stop rules    → metric shadow, regression streak, plateau, budget, manual

Reference run

Boeing fleet_101_20260220 (sandbox, 2026-04-16). Three rejected findings on the original gemma4 run, all carrying correct dimension tags (compression, diffusion). A correctly calibrated skeptic should sustain at least the compression and diffusion findings.

Gates

Every variant had to pass all five:

1. Reproducibility stdev      ≤ 0.02 across 3 trials
2. Evidence verification      ≥ baseline (catches inflation)
3. Mean weighted strength     ≥ baseline (catches dimension collapse)
4. FM-01 detected             in ≥ 2 of 3 trials
5. most_stressed_dimension    ≠ misalignment (compliance gate)

Plus a metric-shadowing tripwire: if sustain rises +30% while evidence drops −20%, halt. The original autoresearch's FM-04 Moment showed this is necessary.

Compute

Both Sparks for Phase 1, round-robin. Spark 2 only for Phase 2 (qwen2.5:72b loaded only there). Six hours forty minutes wall clock for Phase 1's 19 experiments. Four hours forty-six for Phase 2's 6 experiments. Mean per-experiment: 21 minutes on gemma4:31b, 47 minutes on qwen2.5:72b.

Phase 1 Results — Gemma 4:31B End-to-End

Variant	Sustain	Stdev	Correctness	FM-01	Decision
v00 baseline (current production)	0.200	0.000	0.800	3/3	keep
v01 inference-tolerant skeptic	0.200	0.000	0.800	3/3	keep
v02 evidence-allowlist skeptic	0.200	0.000	0.800	3/3	keep
v03 R2-softened skeptic	0.467	0.231	0.000	3/3	gate_fail
v04 sustain-floor skeptic	0.267	0.116	0.352	3/3	gate_fail
v05 quote-the-flaw skeptic	0.267	0.116	0.352	3/3	gate_fail
v06 alternative-bar skeptic	0.200	0.000	0.800	3/3	keep
v07 authority-verbatim quotation	0.067	0.116	0.047	0/3	gate_fail
v08 combined v01+v07	0.200	0.000	0.800	3/3	keep

Three patterns emerged.

The variants that did nothing

v01, v02, v06, and the v08 combined version all returned 0.200 sustain, 0.000 stdev, 0.800 correctness. Identical numbers. Three different prompt edits, zero behavioral change. The skeptic ignored them. Either the changes landed in inactive parts of the prompt, or the skeptic's decision logic on this dataset is dominated by signals these edits did not touch.

The variant that broke reproducibility

v03 (R2-softened) lifted sustain from 0.200 to 0.467. Cleanly reproducible across both runs. And introduced 0.231 stdev within trial sets. Cleanly reproducible on that too. The same variant produced the same trial-distribution every time we tested it. The destabilization is not noise. It is what gemma4 does when the skeptic prompt allows more latitude.

The reproducibility gate killed it. Correctness 0.000.

Key Finding

The 0.02 stdev gate that defined the previous autoresearch is not achievable on gemma4 under any prompt that meaningfully softens skeptic decisions. Gemma 4 at temperature 0.0 is not deterministic the way qwen3:32b was.

The variant that regressed

v07 forced the Authority agent to include verbatim quoted text from cited evidence in its reasoning. This is the cr_017 pattern from the original autoresearch, where verbatim word-copying produced the largest single improvement (4x evidence verification). Authority does not respond the same way.

Sustain crashed from 0.200 to 0.067. FM-01 detection dropped to 0 of 3. The structural inferences that authority findings are made of cannot be expressed as direct quotes from any single source. They emerge from combining evidence types (job postings + employee reviews + customer reviews) into a structural read. Forcing verbatim quoting destroyed the inference pattern.

Negative result. Clean evidence. Authority is not Truth.

What the Gates Caught

Two of Phase 1's gates were load-bearing:

Reproducibility (stdev ≤ 0.02). Killed v03, v04, v05. All three lifted sustain. None did so deterministically. The gate prevented promotion of variants that would have produced different scores on different days, defeating the entire purpose of the diagnostic.

FM-01 detection (≥ 2 of 3 trials). Killed v07. Sustain numbers and evidence scores looked superficially fine on aggregate. But the actual FM-01 detection rate was zero. The gate refused to call a variant a winner just because its arithmetic looked acceptable.

The metric-shadow tripwire never fired. The variants that lifted sustain did so without inflating evidence scores. Gemma 4 is honest in that one respect.

Reference Stability

Three baseline runs across the queue (positions 1, 10, 20) all produced exactly 0.200 sustain, 0.000 stdev, 0.800 correctness, FM-01 detected 3 of 3. Six hours and forty minutes apart. Same model, same prompts, same data. The harness has zero drift over a full day.

This matters. It means the variance we saw in v03, v04, v05, v07 is a property of the variants, not of the harness. The autoresearch instrument is sound. What it surfaced is real.

Phase 2 Results — Dual-Model

If gemma4 cannot be calibrated to deterministic high-sustain through prompt changes alone, the alternative is a different model for the skeptic. Phase 2 tested dual-model: extraction stays on gemma4:31b; scoring and skeptic move to qwen2.5:72b — the largest model on the fleet, and the same lineage the original calibration was tuned against.

Six experiments. Spark 2 only. Four hours forty-six minutes wall clock.

Variant on qwen2.5:72b	Sustain	Correctness	FM-01	Decision
v00 baseline (run 1)	0.400	0.500	3/3	keep
v00 baseline (rerun)	0.400	0.500	3/3	keep
v03 R2-softened	0.400	0.500	3/3	keep
v03 R2-softened (rerun)	0.400	0.500	3/3	keep
v04 sustain-floor	0.400	0.500	3/3	keep
v00 final drift check	0.400	0.500	3/3	keep

Six experiments, six identical results. qwen2.5:72b on Boeing fleet_101 produces sustain 0.400 deterministically, regardless of prompt variant. This is a stronger constraint than Phase 1 surfaced. Gemma 4 had a low-sustain deterministic mode AND a higher-sustain unstable mode. Qwen 2.5:72b has only the deterministic mode, anchored at 0.400.

Cross-phase comparison

Model configuration	Sustain	Stdev	Notes
gemma4:31b end-to-end	0.200	0.000	Deterministic floor
gemma4:31b + R2-softened	0.467	0.231	Lifts sustain, breaks stability
qwen2.5:72b skeptic + gemma4 extraction	0.400	0.000	Deterministic, prompt-invariant
qwen3:32b skeptic + gemma4 extraction (production)	0.580	low	Ledger avg n=113

The dual-model alternative is better than gemma4-only on every gate. It is worse than qwen3:32b on the primary metric. Larger model did not win.

Counter-Intuitive Finding

qwen2.5:72b has 2.25x the parameters of qwen3:32b. On aggregate across the production fleet (n=8 across multiple entities), qwen2.5:72b sustains at 0.62 — slightly higher than qwen3's 0.58. On Boeing specifically, qwen2.5:72b sits at 0.40. The model size advantage does not transfer. Calibration to the specific corpus matters more than parameter count.

The Decision

The qwen3:32b split is the production answer. Not provisional. Not a bridge to anything.

extraction:        gemma4:31b
scoring + skeptic: qwen3:32b

run_pipeline.sh has been carrying this configuration since 2026-04-23 as a temporary measure pending the autoresearch outcome. The autoresearch confirms it. The configuration is now permanent.

Two ways this decision could shift in the future, neither active:

A future model release reaches qwen3:32b's calibration profile while running faster. We rerun this study with that model in the comparison set.
The reference corpus changes substantially. Calibration is corpus-specific; if Boeing's evidence shape stops being representative, we re-baseline.

Until either condition triggers, the split is fixed.

What This Cost. What It Returned.

25 experiments. 11 hours 26 minutes of compute. Two Sparks. One Boeing reference run. One harness, one orchestrator, nine prompt variants. Three stable baseline reruns in Phase 1, three in Phase 2.

Returned: a clean negative result. The hypothesis that the skeptic could be re-tuned to recover deterministic high-sustain on gemma4 through prompt changes alone is wrong. The data is unambiguous. The gates worked.

Also returned: a structural understanding the original autoresearch did not need to develop. Qwen 3 was deterministic at temperature zero. Gemma 4 is not. The 0.000 stdev that defined the original hardening was a property of the model, not a property of the methodology. The methodology applied to gemma4 surfaces real variance that the previous methodology hid.

The split (extraction gemma4, scoring qwen3:32b) was restored on 2026-04-23. The autoresearch confirms it on 2026-04-25.

Investigation Trail

2026-04-16 · Sandbox fleet on gemma4:31b drops FM-01 detection from 14/15 to 2/16
2026-04-23 · Investigation traces cause to skeptic calibration mismatch. Split restored.
2026-04-23 · Phase 1 autoresearch scaffolded.
2026-04-24 · Phase 1 runs. 19 experiments, 6h40m.
2026-04-24 to 25 · Phase 2 runs. 6 experiments, 4h46m.
2026-04-25 · Split locked as permanent production configuration.