At a Glance
The Setup
The original autoresearch (cr_001 through cr_041, March 2026) hardened the diagnostic pipeline against qwen3:32b. Output was deterministic. Standard deviation across identical reruns: 0.000. The system was diagnostic, not stochastic.
On 2026-03-27, Gemma 4 31B replaced qwen3:32b as the production extraction standard after a head-to-head race study. Scoring was supposed to remain locked to qwen3:32b. The intent was preserved in writing. It was not preserved in code.
Investigation traced the cause to a single point: the skeptic was tuned for qwen3 finding-text style. Gemma 4 phrases inferences differently. The skeptic read those differences as speculative. Round 2 killed them.
Methodology
The loop is the same one that hardened qwen3. Adapted for the new target.
1. Hypothesis → one focused change, recorded 2. Variant → prompt edit, applied to a non-production scoring_prompts.py 3. Three trials → identical inputs, three runs, measure variance 4. Gates → 5 correctness gates evaluated on the aggregated result 5. Decision → keep or revert; log to experiments.jsonl 6. Stop rules → metric shadow, regression streak, plateau, budget, manual
Reference run
Boeing fleet_101_20260220 (sandbox, 2026-04-16). Three rejected findings on the original gemma4 run, all carrying correct dimension tags (compression, diffusion). A correctly calibrated skeptic should sustain at least the compression and diffusion findings.
Gates
Every variant had to pass all five:
1. Reproducibility stdev ≤ 0.02 across 3 trials 2. Evidence verification ≥ baseline (catches inflation) 3. Mean weighted strength ≥ baseline (catches dimension collapse) 4. FM-01 detected in ≥ 2 of 3 trials 5. most_stressed_dimension ≠ misalignment (compliance gate)
Plus a metric-shadowing tripwire: if sustain rises +30% while evidence drops −20%, halt. The original autoresearch's FM-04 Moment showed this is necessary.
Compute
Both Sparks for Phase 1, round-robin. Spark 2 only for Phase 2 (qwen2.5:72b loaded only there). Six hours forty minutes wall clock for Phase 1's 19 experiments. Four hours forty-six for Phase 2's 6 experiments. Mean per-experiment: 21 minutes on gemma4:31b, 47 minutes on qwen2.5:72b.
Phase 1 Results — Gemma 4:31B End-to-End
| Variant | Sustain | Stdev | Correctness | FM-01 | Decision |
|---|---|---|---|---|---|
| v00 baseline (current production) | 0.200 | 0.000 | 0.800 | 3/3 | keep |
| v01 inference-tolerant skeptic | 0.200 | 0.000 | 0.800 | 3/3 | keep |
| v02 evidence-allowlist skeptic | 0.200 | 0.000 | 0.800 | 3/3 | keep |
| v03 R2-softened skeptic | 0.467 | 0.231 | 0.000 | 3/3 | gate_fail |
| v04 sustain-floor skeptic | 0.267 | 0.116 | 0.352 | 3/3 | gate_fail |
| v05 quote-the-flaw skeptic | 0.267 | 0.116 | 0.352 | 3/3 | gate_fail |
| v06 alternative-bar skeptic | 0.200 | 0.000 | 0.800 | 3/3 | keep |
| v07 authority-verbatim quotation | 0.067 | 0.116 | 0.047 | 0/3 | gate_fail |
| v08 combined v01+v07 | 0.200 | 0.000 | 0.800 | 3/3 | keep |
Three patterns emerged.
The variants that did nothing
v01, v02, v06, and the v08 combined version all returned 0.200 sustain, 0.000 stdev, 0.800 correctness. Identical numbers. Three different prompt edits, zero behavioral change. The skeptic ignored them. Either the changes landed in inactive parts of the prompt, or the skeptic's decision logic on this dataset is dominated by signals these edits did not touch.
The variant that broke reproducibility
v03 (R2-softened) lifted sustain from 0.200 to 0.467. Cleanly reproducible across both runs. And introduced 0.231 stdev within trial sets. Cleanly reproducible on that too. The same variant produced the same trial-distribution every time we tested it. The destabilization is not noise. It is what gemma4 does when the skeptic prompt allows more latitude.
The reproducibility gate killed it. Correctness 0.000.
The variant that regressed
v07 forced the Authority agent to include verbatim quoted text from cited evidence in its reasoning. This is the cr_017 pattern from the original autoresearch, where verbatim word-copying produced the largest single improvement (4x evidence verification). Authority does not respond the same way.
Sustain crashed from 0.200 to 0.067. FM-01 detection dropped to 0 of 3. The structural inferences that authority findings are made of cannot be expressed as direct quotes from any single source. They emerge from combining evidence types (job postings + employee reviews + customer reviews) into a structural read. Forcing verbatim quoting destroyed the inference pattern.
Negative result. Clean evidence. Authority is not Truth.
What the Gates Caught
Two of Phase 1's gates were load-bearing:
Reproducibility (stdev ≤ 0.02). Killed v03, v04, v05. All three lifted sustain. None did so deterministically. The gate prevented promotion of variants that would have produced different scores on different days, defeating the entire purpose of the diagnostic.
FM-01 detection (≥ 2 of 3 trials). Killed v07. Sustain numbers and evidence scores looked superficially fine on aggregate. But the actual FM-01 detection rate was zero. The gate refused to call a variant a winner just because its arithmetic looked acceptable.
The metric-shadow tripwire never fired. The variants that lifted sustain did so without inflating evidence scores. Gemma 4 is honest in that one respect.
Reference Stability
Three baseline runs across the queue (positions 1, 10, 20) all produced exactly 0.200 sustain, 0.000 stdev, 0.800 correctness, FM-01 detected 3 of 3. Six hours and forty minutes apart. Same model, same prompts, same data. The harness has zero drift over a full day.
This matters. It means the variance we saw in v03, v04, v05, v07 is a property of the variants, not of the harness. The autoresearch instrument is sound. What it surfaced is real.
Phase 2 Results — Dual-Model
If gemma4 cannot be calibrated to deterministic high-sustain through prompt changes alone, the alternative is a different model for the skeptic. Phase 2 tested dual-model: extraction stays on gemma4:31b; scoring and skeptic move to qwen2.5:72b — the largest model on the fleet, and the same lineage the original calibration was tuned against.
Six experiments. Spark 2 only. Four hours forty-six minutes wall clock.
| Variant on qwen2.5:72b | Sustain | Stdev | Correctness | FM-01 | Decision |
|---|---|---|---|---|---|
| v00 baseline (run 1) | 0.400 | 0.000 | 0.500 | 3/3 | keep |
| v00 baseline (rerun) | 0.400 | 0.000 | 0.500 | 3/3 | keep |
| v03 R2-softened | 0.400 | 0.000 | 0.500 | 3/3 | keep |
| v03 R2-softened (rerun) | 0.400 | 0.000 | 0.500 | 3/3 | keep |
| v04 sustain-floor | 0.400 | 0.000 | 0.500 | 3/3 | keep |
| v00 final drift check | 0.400 | 0.000 | 0.500 | 3/3 | keep |
Six experiments, six identical results. qwen2.5:72b on Boeing fleet_101 produces sustain 0.400 deterministically, regardless of prompt variant. This is a stronger constraint than Phase 1 surfaced. Gemma 4 had a low-sustain deterministic mode AND a higher-sustain unstable mode. Qwen 2.5:72b has only the deterministic mode, anchored at 0.400.
Cross-phase comparison
| Model configuration | Sustain | Stdev | Notes |
|---|---|---|---|
| gemma4:31b end-to-end | 0.200 | 0.000 | Deterministic floor |
| gemma4:31b + R2-softened | 0.467 | 0.231 | Lifts sustain, breaks stability |
| qwen2.5:72b skeptic + gemma4 extraction | 0.400 | 0.000 | Deterministic, prompt-invariant |
| qwen3:32b skeptic + gemma4 extraction (production) | 0.580 | low | Ledger avg n=113 |
The dual-model alternative is better than gemma4-only on every gate. It is worse than qwen3:32b on the primary metric. Larger model did not win.
The Decision
The qwen3:32b split is the production answer. Not provisional. Not a bridge to anything.
extraction: gemma4:31b scoring + skeptic: qwen3:32b
run_pipeline.sh has been carrying this configuration since 2026-04-23 as a temporary measure pending the autoresearch outcome. The autoresearch confirms it. The configuration is now permanent.
Two ways this decision could shift in the future, neither active:
- A future model release reaches qwen3:32b's calibration profile while running faster. We rerun this study with that model in the comparison set.
- The reference corpus changes substantially. Calibration is corpus-specific; if Boeing's evidence shape stops being representative, we re-baseline.
Until either condition triggers, the split is fixed.
What This Cost. What It Returned.
25 experiments. 11 hours 26 minutes of compute. Two Sparks. One Boeing reference run. One harness, one orchestrator, nine prompt variants. Three stable baseline reruns in Phase 1, three in Phase 2.
Returned: a clean negative result. The hypothesis that the skeptic could be re-tuned to recover deterministic high-sustain on gemma4 through prompt changes alone is wrong. The data is unambiguous. The gates worked.
Also returned: a structural understanding the original autoresearch did not need to develop. Qwen 3 was deterministic at temperature zero. Gemma 4 is not. The 0.000 stdev that defined the original hardening was a property of the model, not a property of the methodology. The methodology applied to gemma4 surfaces real variance that the previous methodology hid.
The split (extraction gemma4, scoring qwen3:32b) was restored on 2026-04-23. The autoresearch confirms it on 2026-04-25.
- 2026-04-16 · Sandbox fleet on gemma4:31b drops FM-01 detection from 14/15 to 2/16
- 2026-04-23 · Investigation traces cause to skeptic calibration mismatch. Split restored.
- 2026-04-23 · Phase 1 autoresearch scaffolded.
- 2026-04-24 · Phase 1 runs. 19 experiments, 6h40m.
- 2026-04-24 to 25 · Phase 2 runs. 6 experiments, 4h46m.
- 2026-04-25 · Split locked as permanent production configuration.