Foundation Models · Feb 5, 2026

Why We Chose Claude as Our Primary Reasoning Engine

Investment research is not a summarization task. It is a reasoning task. We needed a model that could construct causal chains, pressure-test assumptions, and work across Chinese and English source material simultaneously. This article describes our technical selection process.

Reasoning, Not Summarization

Most AI applications in finance treat language models as search engines — feed data in, get a summary out. That is not what we need. We need a model that can build causal chains from a 200-page annual report, pressure-test its own arguments, and work across Chinese and English source material in the same analytical pass.

Claude is our current primary model. This choice is not permanent — we re-evaluate quarterly — but as of today, for structured analytical reasoning, it is the most reliable engine we have tested.

Long Context: Big Window Is Not the Same as Good Window

A single Chinese annual report runs 200+ pages. Add an 80-page sell-side initiation report, then layer on macro data and news. All of it has to be held in working memory at once and reasoned over in a single pass, not retrieved in fragments through RAG. Claude’s context window is large enough to take this material without chunking.

But a large window does not automatically mean good performance. What we care about is tail-end degradation: once the context fills past 80%, most models start losing track of early inputs. In our testing, Claude’s reasoning quality stays relatively stable in that range. This matters because the most important contradictions and buried disclosures in annual reports tend to live in footnotes and tail-end paragraphs, and catching them means checking that late material against statements made hundreds of pages earlier.
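The 80% mark can be checked before a request is sent. The following is a minimal sketch, not our production tooling: the function names, the document token counts, and the 200,000-token window are all illustrative assumptions, and the real window size should come from the model provider's documentation.

```python
def fill_ratio(doc_tokens: dict[str, int], window_tokens: int) -> float:
    """Share of the context window consumed by the given documents."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return sum(doc_tokens.values()) / window_tokens


def docs_past_threshold(
    doc_tokens: dict[str, int], window_tokens: int, threshold: float = 0.8
) -> list[str]:
    """Documents that end beyond the threshold, in the order they are loaded."""
    limit = threshold * window_tokens
    used = 0
    flagged = []
    for name, tokens in doc_tokens.items():
        used += tokens
        if used > limit:
            flagged.append(name)
    return flagged


# Hypothetical token counts, for illustration only.
docs = {
    "annual_report": 120_000,
    "initiation_report": 45_000,
    "macro_and_news": 15_000,
}
WINDOW = 200_000  # assumed window size, not a documented limit

print(fill_ratio(docs, WINDOW))
print(docs_past_threshold(docs, WINDOW))
```

Documents flagged this way are the ones whose conclusions deserve a second look, since they are read while the window is already heavily loaded.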

We have not published formal benchmark comparisons (benchmarking models is not the job of a research publication), but our internal testing suggests that for causal reasoning over 100K+ tokens of Chinese-language material, Claude’s output consistency is meaningfully better than the other options we have tested. This judgment may change with model iterations, which is why we re-test quarterly.

Causal Chains: Structure, Not Summary

Our methodology requires building explicit A→B→C evidence chains in which every link carries a label: “fact,” “assumption,” or, for a conclusion drawn from earlier links, “inference.” A concrete example: when analyzing a company’s cash flow deterioration, the chain might be “accounts receivable turnover days increasing (fact) → downstream customers losing payment capacity (assumption) → leading signal of industry demand contraction (inference).” At each step the model has to judge which label applies, and it has to flag when assumptions carry too much weight.
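One way to picture such a chain is as a small data structure. This is a sketch under our own naming assumptions (`Link`, `LinkKind`, `assumption_share`, `needs_review` are illustrative, not part of any library), and the 0.5 threshold is an arbitrary placeholder rather than a calibrated value.

```python
from dataclasses import dataclass
from enum import Enum


class LinkKind(Enum):
    FACT = "fact"
    ASSUMPTION = "assumption"
    INFERENCE = "inference"


@dataclass(frozen=True)
class Link:
    claim: str
    kind: LinkKind


def assumption_share(chain: list[Link]) -> float:
    """Fraction of links that are not directly observed facts."""
    if not chain:
        return 0.0
    unverified = sum(1 for link in chain if link.kind is not LinkKind.FACT)
    return unverified / len(chain)


def needs_review(chain: list[Link], max_share: float = 0.5) -> bool:
    """Flag chains in which non-fact links exceed the allowed share."""
    return assumption_share(chain) > max_share


# The cash flow example from the text, link by link.
chain = [
    Link("Accounts receivable turnover days increasing", LinkKind.FACT),
    Link("Downstream customers losing payment capacity", LinkKind.ASSUMPTION),
    Link("Leading signal of industry demand contraction", LinkKind.INFERENCE),
]

print(needs_review(chain))
```

In this example two of the three links are not facts, so the chain is flagged. A share-based check is deliberately crude: it says nothing about which assumption is the weakest, only that the chain leans more on judgment than on evidence.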

Claude handles this kind of multi-step argumentation well. It does not tend to forget the first premise by the fourth step — a common problem in long-chain reasoning. It can also identify the weakest link in the chain and proactively note “the evidence here is insufficient.”

To be clear: this is our judgment within our specific use case. Causal chain quality depends heavily on prompt design — different prompting approaches may yield different results.

Bilingual Reasoning vs. Bilingual Translation

Most models can translate. Cross-language reasoning is a different skill.

A specific scenario: a PBOC statement uses the phrase “合理充裕” (reasonably ample). The literal English translation works, but it loses the signal strength the phrase carries in the context of Chinese monetary policy, including its subtle difference from “适度” (moderate) or “充足” (sufficient). We need the model to understand this Chinese context while comparing it against related language in Fed minutes within the same reasoning framework.

Claude can process Chinese regulatory filings and English credit reports in the same analytical pass, integrating information from both languages at the reasoning level rather than the translation level. It is not perfect — it occasionally misreads nuances in Chinese financial terminology — but it operates at the right level.

Falsification Testing: Making the Model Attack Its Own Arguments

Our Popperian methodology requires every thesis to have defined falsification criteria: if X happens, the thesis is wrong.

Before publishing a view, we use Claude for adversarial testing — asking it to construct the strongest counter-argument, identify the most likely failure mode, and evaluate whether our falsification criteria cover enough of the risk space. This is the most demanding use of the model’s reasoning depth: it needs to understand not just your thesis, but where your thesis might be wrong.
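The three adversarial tasks can be assembled into a prompt mechanically. What follows is a sketch, not our actual prompt: the `Thesis` class, the `build_adversarial_prompt` function, the wording of the instructions, and the example thesis are all hypothetical, and no model API is called.

```python
from dataclasses import dataclass, field


@dataclass
class Thesis:
    statement: str
    falsifiers: list[str] = field(default_factory=list)


def build_adversarial_prompt(thesis: Thesis) -> str:
    """Turn a thesis and its falsification criteria into a review prompt."""
    if not thesis.falsifiers:
        raise ValueError("A thesis needs at least one falsification criterion")
    criteria = "\n".join(
        f"{i}. {text}" for i, text in enumerate(thesis.falsifiers, start=1)
    )
    return (
        "You are reviewing an investment thesis as an adversary.\n"
        f"Thesis: {thesis.statement}\n"
        "Falsification criteria (if any of these happens, the thesis is wrong):\n"
        f"{criteria}\n"
        "Tasks:\n"
        "A. Construct the strongest counter-argument to the thesis.\n"
        "B. Identify the most likely failure mode.\n"
        "C. Name risks that the criteria above do not cover.\n"
    )


# Hypothetical thesis, for illustration only.
thesis = Thesis(
    statement="Company X's margin recovery is sustainable.",
    falsifiers=["Gross margin falls for two consecutive quarters."],
)
print(build_adversarial_prompt(thesis))
```

Refusing to build a prompt for a thesis with no falsifiers is a design choice: it enforces the rule that every thesis states in advance what would prove it wrong.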

Honestly, this is also where Claude occasionally disappoints. Its counter-arguments are sometimes too “polite” — it flags risks without genuinely trying to demolish the thesis. We have done considerable prompt-level tuning to address this, and it is improving, but not yet where we want it.

Where Claude Falls Short

Two clear weaknesses:

Quantitative computation. Complex financial modeling, Monte Carlo simulations, optimization problems — Claude handles these slowly and unreliably. Not impossible, just error-prone enough that we do not trust the output. These tasks get routed to other models. → See Multi-Model Orchestration

Visual data parsing. Chart images, scanned financial statements, complex table extraction from PDFs — Claude’s multimodal capabilities lag here. Every provider is iterating fast in this space; the assessment could look completely different in three months.

One more thing that is not a weakness per se but requires management: Claude defaults to “balanced” analysis. Investment research sometimes needs sharp, directional judgment — not an even-handed listing of both sides. We handle this at the prompt level by explicitly requiring “give your judgment, do not hedge,” which works better than the default, though Claude’s politeness instinct still surfaces occasionally.

Quarterly Evaluation: How We Decide to Keep Using It

Each quarter we re-run the same test suite across major models. The test set includes:

  • A historical annual report with known conclusions (can the model independently reach them?)
  • A set of bilingual central bank policy documents (cross-language reasoning quality)
  • An investment thesis with deliberately embedded logical traps (falsification capability)
  • A 150K-token long document (tail-end degradation)

The evaluation criterion is not “who scores highest” but “who is most stable on the dimensions we care about most.” Reasoning consistency is weighted far above generation speed or cost.
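To make the difference between "highest score" and "most stable" concrete, here is one possible scoring scheme. It is a sketch: the dimension names, the weights, the penalty of one standard deviation, and every number in the example are invented for illustration and are not evaluation results.

```python
from statistics import mean, pstdev

# Hypothetical weights, one per test in the suite; they sum to 1.0.
WEIGHTS = {
    "known_conclusions": 0.30,
    "cross_language": 0.25,
    "falsification": 0.25,
    "long_document": 0.20,
}


def stability_score(runs: list[float], penalty: float = 1.0) -> float:
    """Mean score over repeated runs, minus a penalty for run-to-run spread."""
    return mean(runs) - penalty * pstdev(runs)


def overall(results: dict[str, list[float]]) -> float:
    """Weighted sum of per-dimension stability scores."""
    return sum(WEIGHTS[dim] * stability_score(results[dim]) for dim in WEIGHTS)


# Synthetic inputs: one model is steady, the other peaks higher but varies.
steady = {dim: [0.8, 0.8, 0.8] for dim in WEIGHTS}
volatile = {dim: [1.0, 0.9, 0.6] for dim in WEIGHTS}

print(overall(steady) > overall(volatile))
```

The volatile model has the higher average on every dimension, yet the steady one ranks first once spread is penalized. Speed and cost are left out of the sketch on purpose, since the text weights them far below consistency.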

As of the current evaluation cycle, Claude remains the best overall choice. But the gap is narrowing — particularly in causal chain construction, where competitors are improving fast. The next evaluation could go differently.

Position in the Stack

Claude handles the reasoning layer only. Upstream: MCP-connected data sources (→ see Data Layer). Downstream: structured research memos. Tasks it handles poorly get routed to other models (→ see Multi-Model Orchestration). Final judgment is made by human researchers.
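The routing rule can be sketched as a lookup table. The task categories below mirror the weaknesses described earlier, but the names `TaskKind`, `ROUTES`, and `route`, and the two non-Claude model labels, are placeholders we made up for this example; the article does not name the other models.

```python
from enum import Enum


class TaskKind(Enum):
    REASONING = "reasoning"
    QUANTITATIVE = "quantitative"
    VISUAL = "visual"


# Placeholder labels: only "claude" comes from the text.
ROUTES = {
    TaskKind.REASONING: "claude",
    TaskKind.QUANTITATIVE: "quant-model-placeholder",
    TaskKind.VISUAL: "vision-model-placeholder",
}

# Stage order as described in the text, ending with human review.
PIPELINE = ("data_sources", "reasoning", "research_memo", "human_review")


def route(kind: TaskKind) -> str:
    """Return the label of the model that should handle this kind of task."""
    return ROUTES[kind]


print(route(TaskKind.REASONING))
```

Keeping the table explicit makes a quarterly re-evaluation cheap to act on: changing the primary reasoning model is a one-line edit.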

Choosing Claude is not a declaration of faith. This article documents the current rationale, not a conclusion.