Multi-Model Orchestration: Matching the Right Model to Each Research Task

Why One Model Is Not Enough

“Pick the best model, run everything through it” is the laziest approach. The problem is the same as asking one analyst to cover macro, credit, quant, and execution simultaneously: each task has a different cognitive profile, and no single model is best at all of them.

We run a multi-model architecture. Deciding which model handles which task is itself part of analytical quality.

Model Roster and Division of Labor

Claude — Core reasoning. Thesis construction, evidence chains, adversarial stress-testing, cross-language synthesis — the highest-stakes cognitive tasks in the workflow. Why we chose it, and where it falls short → see Claude Reasoning Engine.

OpenAI o-series (o3 / o4-mini) — Math and quantitative reasoning. Option payoff structure modeling, scenario analysis with probability distributions, verification of quantitative claims in research reports. The o-series excels at multi-step mathematical reasoning where each step must be logically validated — a different cognitive capability from the natural language reasoning Claude excels at.

A concrete scenario: the risk assessment perspective requires evaluating the asymmetry structure of a thesis — “if right, the path payoff is 4x; if wrong, the confidence drawdown is 1x.” The mathematical verification of that 4:1 structure goes through the o-series. The logical construction of the thesis itself goes through Claude. Each model does what it does best.
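The arithmetic behind that 4:1 structure is simple enough to write down. What follows is a minimal sketch, assuming a two-outcome simplification (the thesis is either right or wrong) with payoffs measured in units of the downside. The function names and the probability input are illustrative assumptions, not part of our workflow.

```python
def expected_payoff(p_right: float, upside: float = 4.0, downside: float = 1.0) -> float:
    """Expected payoff of a two-outcome bet, in units of the downside.

    p_right is an assumed probability that the thesis is correct.
    """
    if not 0.0 <= p_right <= 1.0:
        raise ValueError("p_right must be in [0, 1]")
    return p_right * upside - (1.0 - p_right) * downside


def breakeven_probability(upside: float = 4.0, downside: float = 1.0) -> float:
    """Probability of being right at which the expected payoff is zero."""
    if upside <= 0 or downside <= 0:
        raise ValueError("upside and downside must be positive")
    return downside / (upside + downside)
```

Under these assumptions a 4:1 structure breaks even when the thesis is right 20% of the time (1 / (4 + 1)); any higher probability gives a positive expected payoff.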

GPT-5.5 — Visual data parsing. Embedded charts in earnings presentations, scanned regulatory documents, shipping manifest images, satellite imagery — things text-only models cannot handle. GPT-5.5 extracts structured data from visual sources, which then feeds into Claude for reasoning. Parsing is one task; reasoning is another. The best parsing model is not necessarily the best reasoning model.

Gemma 4 — Preprocessing and triage. Initial news filtering, summarization of low-priority sources, routine translation, metadata extraction. These tasks do not need a frontier model — a capable but cheaper option works fine. The savings get concentrated on the analytical tasks where model quality actually affects outcomes.

Routing Logic

Model selection is neither random nor manual. We define task categories with explicit routing rules.

Structured analytical reasoning (thesis building, evidence chains, adversarial review) → Claude
Quantitative verification and mathematical modeling → OpenAI o-series
Visual data extraction → GPT-5.5
Preprocessing, triage, and routine extraction → Gemma 4
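
The four rules above amount to a lookup table. A minimal sketch of how such a table might be expressed, assuming hypothetical category names and placeholder model labels (these strings are not real API model identifiers):

```python
from enum import Enum


class TaskCategory(Enum):
    """Hypothetical task categories mirroring the four routing rules."""
    STRUCTURED_REASONING = "structured_reasoning"
    QUANT_VERIFICATION = "quant_verification"
    VISUAL_EXTRACTION = "visual_extraction"
    PREPROCESSING = "preprocessing"


# Placeholder labels for illustration only, not real API model identifiers.
ROUTING_TABLE = {
    TaskCategory.STRUCTURED_REASONING: "claude",
    TaskCategory.QUANT_VERIFICATION: "openai-o-series",
    TaskCategory.VISUAL_EXTRACTION: "gpt-5.5",
    TaskCategory.PREPROCESSING: "gemma-4",
}


def route(category: TaskCategory) -> str:
    """Return the model label for a category; fail loudly if no rule exists."""
    try:
        return ROUTING_TABLE[category]
    except KeyError:
        raise ValueError(f"no routing rule for {category!r}") from None
```

Keeping the rules in one explicit table, rather than scattered through workflow code, makes them easier to review and to change.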

Routing happens at the workflow level, not the conversation level. A single research process may invoke three or four models in sequence: Gemma 4 for initial news triage, GPT-5.5 for parsing visual data from a shipping report, Claude for building the analytical thesis, o-series for verifying the quantitative risk assessment. Each stage’s output feeds the next. The human researcher reviews the final synthesis, not the intermediate routing.
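
The sequential hand-off described above can be sketched as a list of stages, each consuming the previous stage's output. This is an illustration under stated assumptions: the `Stage` and `run_workflow` names are hypothetical, the model labels are placeholders, and each stage is a stub that only tags the payload instead of calling a model.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Stage:
    name: str                   # human-readable stage name
    model: str                  # placeholder label, not a real API identifier
    run: Callable[[Any], Any]   # stand-in for the actual model call


@dataclass
class WorkflowResult:
    output: Any
    trace: list = field(default_factory=list)  # (stage name, model) in run order


def run_workflow(stages: list, initial_input: Any) -> WorkflowResult:
    """Run stages in order; each stage's output becomes the next stage's input."""
    result = WorkflowResult(output=initial_input)
    for stage in stages:
        result.output = stage.run(result.output)
        result.trace.append((stage.name, stage.model))
    return result


def _tag(key: str) -> Callable[[dict], dict]:
    """Build a stub step that marks the payload instead of calling a model."""
    def step(payload: dict) -> dict:
        return {**payload, key: True}
    return step


STAGES = [
    Stage("news triage", "gemma-4", _tag("triaged")),
    Stage("visual parsing", "gpt-5.5", _tag("charts_parsed")),
    Stage("thesis construction", "claude", _tag("thesis_built")),
    Stage("quant verification", "openai-o-series", _tag("risk_verified")),
]
```

Calling `run_workflow(STAGES, {"source": "shipping report"})` returns the final payload together with a trace of which model label handled which stage. The trace is what a researcher would consult when an intermediate step needs a second look.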

Why This Matters for Investment Quality

The multi-model approach is not a technical luxury. It is a direct investment in analytical quality. Force a reasoning-optimized model to do quantitative computation — you get slower, less reliable results. Force a vision model to do long-form reasoning — you get shallow analysis. Use a frontier model for routine preprocessing — you burn budget that could go to tasks where model quality makes a real difference.

The analogy: we would not have a fundamental analyst make calls for the risk manager. Same logic applies to models. Model-level specialization mirrors analytical-level specialization. Both improve outcomes.

Maturity Issues with Routing Itself

To be honest, the routing rules above look clean on paper. In practice, they are messier.

A few engineering problems we have not fully solved:

Ambiguous task boundaries. “Should this go to reasoning or quant?” — not always a clear call. A credit analysis report might contain qualitative industry judgment alongside DCF parameter sensitivity testing. Splitting it into two sub-tasks routed separately is correct, but where to cut and how to split still relies on human judgment. Automation is not there yet.

Model updates invalidate routing. Last quarter, o3 was best at a certain class of probability reasoning. This quarter, a Claude update may have closed the gap. Routing rules are not write-once — every major model update requires re-running benchmarks to decide whether the division of labor needs adjustment. We do this quarterly, but ideally it would be continuous.
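
One way to make the re-benchmarking step mechanical is to compare fresh scores against the current assignment and flag categories where a challenger now leads. The sketch below is an assumption about how that comparison could look, not a description of our tooling: the function name, the score format (higher is better), and the `min_margin` threshold are all hypothetical.

```python
def propose_routing_changes(current: dict, scores: dict, min_margin: float = 0.02) -> dict:
    """Flag categories where another model beats the incumbent by min_margin.

    current: {category: incumbent model label}
    scores:  {category: {model label: benchmark score, higher is better}}
    Returns  {category: (incumbent, challenger)} as proposals for human review.
    """
    changes = {}
    for category, incumbent in current.items():
        by_model = scores.get(category, {})
        if incumbent not in by_model:
            continue  # no fresh score for the incumbent; leave it to a human
        best = max(by_model, key=by_model.get)
        if best != incumbent and by_model[best] - by_model[incumbent] >= min_margin:
            changes[category] = (incumbent, best)
    return changes
```

The margin guards against flipping a rule on benchmark noise. The output is a list of proposals, not an automatic rewrite of the routing table.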

Error propagation. Upstream model output feeds downstream models. If the upstream model gets something wrong — say GPT-5.5 extracts an incorrect number from a chart — downstream Claude builds a plausible-looking causal chain on top of bad data. Currently we rely on human researchers spot-checking at key nodes. There is no automated cross-model consistency verification yet.
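
We do not have automated cross-model verification, but the simplest possible building block is easy to picture: when the same figure is available from two independent readings (for example, a chart extraction and the report's own text), compare them and escalate on disagreement. A minimal sketch, assuming a hypothetical helper name and an arbitrary 1% tolerance:

```python
import math


def needs_human_review(extracted: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Return True when two readings of the same figure disagree beyond rel_tol.

    extracted: value parsed from a visual source (e.g. a chart)
    reference: the same figure obtained independently (e.g. from report text)
    """
    return not math.isclose(extracted, reference, rel_tol=rel_tol)
```

This only catches numeric disagreements where a second source exists. It does nothing about a wrong number that has no independent reference, which is why spot-checks remain necessary.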

Non-linear costs. In theory, routing preprocessing to cheap models and core reasoning to expensive ones saves money. In practice, when a research workflow calls four models with multiple iteration rounds each, total cost is not necessarily lower than running everything through one good model. What the architecture reduces is quality risk, not necessarily dollars.
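
The cost point is easier to see as arithmetic: total cost is the sum, over stages, of price per call times calls per round times rounds. The sketch below uses arbitrary cost units and made-up call counts purely to show the shape of the calculation; none of the numbers are measurements from our workflow.

```python
def workflow_cost(stages) -> int:
    """Sum price_per_call * calls_per_round * rounds over all stages.

    stages: iterable of (price_per_call, calls_per_round, rounds) tuples,
    with prices in arbitrary units.
    """
    return sum(price * calls * rounds for price, calls, rounds in stages)


# Made-up numbers for illustration only.
multi_model = [
    (1, 20, 1),   # cheap triage model: many calls, one pass
    (10, 5, 2),   # vision model: a few calls, two rounds
    (20, 4, 3),   # reasoning model: three iteration rounds
    (15, 4, 3),   # quant model: three iteration rounds
]
single_model = [(20, 10, 2)]  # one strong model, fewer hand-offs
```

With these invented inputs the multi-model workflow costs 540 units against 400 for the single-model run: iteration rounds on the expensive stages dominate, and the cheap triage stage barely moves the total.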

None of this has made us abandon multi-model architecture — the single-model ceiling is lower. But it is worth documenting, so we do not make the actual operation sound smoother than it is.