Every system component rests on a peer-reviewed academic tradition. This is not a bibliography but an intellectual genome: each design decision traces back to a specific body of research.
This lineage runs from Bayes' original probability theorem through Black-Litterman portfolio view combination to Tetlock's superforecasting methodology. It defines the mathematical core of our Thesis Tracking system: beliefs are probabilities, evidence updates beliefs, and calibration is a trainable skill.
The original theorem underpinning all probabilistic reasoning in our system. 260 years before AI practitioners made "prior" and "posterior" everyday vocabulary, Bayes established the framework for inferring cause probabilities from observed data. Our Thesis Tracking system does fundamentally the same thing — updating confidence in theses from market evidence.
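As a minimal sketch of what updating confidence in a thesis from evidence can look like, the snippet below applies Bayes' rule to a single binary thesis. The function name `bayes_update` and the numbers are illustrative assumptions, not the system's actual interface.

```python
def bayes_update(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """Return P(thesis | evidence) from a prior and the two likelihoods."""
    numerator = p_evidence_if_true * prior
    evidence = numerator + p_evidence_if_false * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / evidence


# Hypothetical example: start at 40% confidence, then see two pieces of evidence.
confidence = 0.40
for lik_true, lik_false in [(0.7, 0.3), (0.6, 0.5)]:
    confidence = bayes_update(confidence, lik_true, lik_false)
    print(round(confidence, 4))
```

Evidence that is equally likely whether or not the thesis holds leaves the confidence unchanged, which is the formal version of "not all news is information."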
The milestone paper that brought Bayesian inference into investment management. The Black-Litterman model addressed the practical problems of Markowitz mean-variance optimization: instead of requiring precise return expectations for every asset, it starts from market-equilibrium returns as the prior and lets investors express "views" with stated confidence, which are blended with that prior into posterior expected returns. Our Thesis Tracking architecture directly borrows this blending framework: theses play the role of views, market data supplies the evidence, and the system continuously combines both.
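The blending step can be illustrated with the one-asset special case of the Black-Litterman formula, where it reduces to a precision-weighted average. This is a sketch under stated assumptions: in the usual presentation of the model the equilibrium return acts as the prior, `prior_variance` stands for the scaled variance (tau times sigma squared), and all names and numbers are hypothetical.

```python
def blend_view(equilibrium_return: float, prior_variance: float,
               view_return: float, view_variance: float) -> tuple[float, float]:
    """One-asset Black-Litterman blend: precision-weighted mean and posterior variance."""
    prior_precision = 1.0 / prior_variance
    view_precision = 1.0 / view_variance
    posterior_variance = 1.0 / (prior_precision + view_precision)
    posterior_mean = posterior_variance * (
        prior_precision * equilibrium_return + view_precision * view_return
    )
    return posterior_mean, posterior_variance


# Hypothetical inputs: 5% equilibrium return, 9% view, equal uncertainty on both.
mean, variance = blend_view(0.05, 0.0004, 0.09, 0.0004)
print(round(mean, 4), round(variance, 6))
```

With equal uncertainty the posterior lands halfway between equilibrium and view; a less confident view (larger `view_variance`) pulls the result back toward equilibrium.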
The mathematical bible of hierarchical Bayesian models and posterior updating. When we need to share information across related theses (e.g., multiple companies in the same sector), hierarchical models enable partial pooling — each thesis has its own parameters, but the parameters themselves are drawn from a higher-level distribution. This book provides the complete path from theory to computation.
The operational manual for calibrated prediction, distilled from the Good Judgment Project's empirical research. Tetlock found that the best forecasters share common traits: frequent small updates (not dramatic pivots), distinguishing known from unknown, actively seeking disconfirming evidence. These principles are directly encoded in our Prior Assignment and Evidence Accumulation systems.
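Calibration becomes trainable once it is scored. The Good Judgment Project scored forecasters with Brier scores; the sketch below computes the common binary form (mean squared error between forecast probability and outcome). The function name and sample forecasts are illustrative assumptions.

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes (lower is better)."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and the same length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical track record: three theses, two of which played out.
print(brier_score([0.8, 0.3, 0.6], [1, 0, 1]))
```

A forecaster who always answers 50% scores 0.25 on this scale, which gives a natural baseline for judging whether updates are adding value.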
The most dangerous mistake in market research is treating correlation as causation. This lineage provides the complete toolkit from time-series causal tests to structural causal models, ensuring every 'A caused B' judgment has methodological backing.
The foundational definition of predictive causality in time series: if the history of X improves prediction of Y, then X "Granger-causes" Y. Not true causation, but extremely practical for financial signal detection — our signal pipeline uses Granger causality tests extensively as initial screening.
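A minimal, dependency-free sketch of a one-lag Granger test follows. It compares a restricted regression (y on its own lag) with an unrestricted one (adding the lag of x) and returns the F statistic. A production pipeline would more likely use a statistics library and several lags; the helper names and the synthetic data here are assumptions for illustration only.

```python
import random


def _solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def _rss(rows, targets):
    """Residual sum of squares of an OLS fit of targets on rows (normal equations)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    beta = _solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2 for r, t in zip(rows, targets))


def granger_f_stat(x, y):
    """F statistic for 'x Granger-causes y' using a single lag."""
    targets = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    rss_r, rss_u = _rss(restricted, targets), _rss(unrestricted, targets)
    dof = len(targets) - 3  # observations minus unrestricted parameters
    return (rss_r - rss_u) / (rss_u / dof)


# Synthetic data in which x leads y by one step (an assumption for the demo).
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 200)]
f_xy = granger_f_stat(x, y)  # large: the history of x helps predict y
f_yx = granger_f_stat(y, x)  # small: the history of y does not help predict x
```

The asymmetry between the two statistics is the screening signal; it says nothing about whether a third variable drives both series.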
The potential outcomes framework (Rubin Causal Model): causal effects defined as "treated outcome vs. counterfactual untreated outcome." We cannot run RCTs in financial research, but this framework forces rigorous counterfactual thinking — "What would the market have done if this policy hadn't been enacted?"
The systematic theory of causal graphs (DAGs) and do-calculus. Pearl's causal ladder — association (seeing), intervention (doing), counterfactual (imagining) — provides a deeper causal reasoning framework than statistical regression. Our Causal Chain Visualizer lets analysts draw DAGs while the system checks for confounding, mediation, or collider bias.
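The kind of structural check described above can be sketched on a small graph. The code flags candidate confounders (common causes of a proposed cause and effect) and colliders (nodes with two or more parents). It is a simplified illustration, not do-calculus and not the Causal Chain Visualizer's actual logic; the graph and function names are hypothetical.

```python
def ancestors(dag, node, cut_outgoing=None):
    """Nodes with a directed path into `node`; edges leaving `cut_outgoing` are ignored."""
    parents = {}
    for src, children in dag.items():
        if src == cut_outgoing:
            continue
        for child in children:
            parents.setdefault(child, set()).add(src)
    seen, stack = set(), [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def candidate_confounders(dag, cause, effect):
    """Ancestors of `cause` that also reach `effect` by a path not passing through `cause`."""
    return ancestors(dag, cause) & ancestors(dag, effect, cut_outgoing=cause)


def colliders(dag):
    """Nodes that have two or more direct parents."""
    counts = {}
    for children in dag.values():
        for child in children:
            counts[child] = counts.get(child, 0) + 1
    return {node for node, count in counts.items() if count >= 2}


# Hypothetical analyst-drawn DAG: node -> list of direct effects.
dag = {
    "rate_cut": ["credit_growth", "equity_rally"],
    "credit_growth": ["equity_rally"],
    "equity_rally": [],
}
print(candidate_confounders(dag, "credit_growth", "equity_rally"))
print(colliders(dag))
```

Here `rate_cut` drives both credit growth and the rally, so a claim that credit growth caused the rally needs to control for it.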
Instrumental variables (IV), regression discontinuity (RDD), difference-in-differences (DiD) — practical causal inference methods when RCTs are impossible. These are not theoretical tools — they are the actual methods behind our stress testing and temporal robustness checks. When an analyst claims causation, the system demands a control strategy.
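Of the three methods, difference-in-differences is the easiest to show in a few lines: the estimate is the change in the treated group minus the change in the control group, and it is only meaningful under the parallel-trends assumption. The function name and figures below are hypothetical.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences estimate from four lists of observations."""
    def mean(values):
        return sum(values) / len(values)

    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical margins (in %) for firms affected by a policy vs. unaffected peers.
effect = diff_in_diff([10.0, 10.0], [16.0, 16.0], [9.0, 9.0], [12.0, 12.0])
print(effect)  # treated rose by 6, control by 3, so the estimated effect is 3
```

The control group's change stands in for the counterfactual: what the treated firms would have done had the policy not been enacted.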
If you cannot articulate what would prove you wrong, your thesis is not rigorous enough to publish. This lineage, from philosophy of science to statistical methodology, defines our standard for what constitutes good research.
Falsifiability as the demarcation criterion between science and non-science. Popper's core insight: a theory that cannot be falsified is not a scientific theory. Our Popperian Exit Protocol comes directly from this principle — every thesis must define its own "death condition" before publication.
A more nuanced methodology than Popper: research programmes have a "hard core" and "protective belt," tolerating anomalies temporarily without immediate abandonment. This explains why our falsification triggers are not binary — theses can absorb negative evidence within a time window, but the protective belt cannot expand indefinitely.
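A non-binary falsification trigger can be sketched as a thesis that tolerates a limited number of negative signals inside a rolling window. The class, field names, and thresholds are hypothetical and only illustrate the "protective belt with a limit" idea; the real trigger logic is not specified in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Thesis:
    name: str
    death_condition: str          # stated before publication
    max_negative_in_window: int   # size of the protective belt (assumed parameter)
    window_days: int              # how long a negative signal keeps counting
    negative_days: list = field(default_factory=list)

    def record_negative(self, day: int) -> None:
        """Log the day on which disconfirming evidence arrived."""
        self.negative_days.append(day)

    def status(self, today: int) -> str:
        """Classify the thesis from the negative evidence still inside the window."""
        recent = [d for d in self.negative_days if 0 <= today - d < self.window_days]
        if len(recent) > self.max_negative_in_window:
            return "falsified"
        return "under_pressure" if recent else "intact"


thesis = Thesis("Copper deficit", "inventories rebuild for two quarters", 2, 30)
thesis.record_negative(1)
thesis.record_negative(5)
print(thesis.status(10))  # two anomalies are absorbed
thesis.record_negative(8)
print(thesis.status(10))  # a third within the window exceeds the belt
```

In this sketch old evidence ages out of the window; a stricter variant could make the falsified state permanent once reached.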
How to design statistical tests with genuine power. Mayo's "severe testing" concept: a good test not only passes when the hypothesis is true, but more importantly, has the power to reject when the hypothesis is false. Our quantitative stress testing draws theoretical support from here — a test's value lies not in confirmation, but in its power to reject false hypotheses.
From Knight's fundamental distinction between risk and uncertainty, to Markowitz's quantitative framework, to Taleb's systematic critique of normality assumptions — this lineage shapes our understanding of what can be modeled and what cannot.
The fundamental distinction between risk (randomness with calculable probabilities) and uncertainty (unknowns that cannot be assigned probabilities). This distinction remains the cornerstone of financial analysis 105 years later — our system explicitly separates "modelable risk" from "unmodelable uncertainty," maintaining humility toward the latter.
The foundational paper of mean-variance optimization and the origin of quantitative risk management. Although subsequent work (especially Taleb) argued that variance-based risk measures, and the normality assumptions often paired with them, fail under extreme conditions, Markowitz's "risk-return tradeoff" framework remains the starting point for all portfolio analysis.
Fat-tailed risk, model overfitting, and why stress testing is non-negotiable. Taleb's core argument: extreme events have far greater impact than normal distribution models predict, and financial systems are most fragile to precisely these events. This directly drives our insistence on Monte Carlo simulation and sensitivity analysis: we must consider possibilities beyond the model.
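A small Monte Carlo comparison makes the fat-tail point concrete. The sketch draws from a Gaussian and from a Student-t distribution with 3 degrees of freedom rescaled to the same unit variance, then counts how often each exceeds four standard deviations. The choice of distribution, threshold, and sample size are assumptions for illustration.

```python
import math
import random

rng = random.Random(42)


def gaussian() -> float:
    """Standard normal draw."""
    return rng.gauss(0.0, 1.0)


def student_t3_unit_variance() -> float:
    """Student-t draw with 3 degrees of freedom, rescaled so its variance is 1."""
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))
    t = z / math.sqrt(chi2 / 3.0)
    return t / math.sqrt(3.0)  # Var(t_3) = 3, so divide by sqrt(3)


def tail_frequency(draw, threshold: float, n: int) -> float:
    """Fraction of n draws whose absolute value exceeds the threshold."""
    return sum(1 for _ in range(n) if abs(draw()) > threshold) / n


n = 50_000
gauss_tail = tail_frequency(gaussian, 4.0, n)
fat_tail = tail_frequency(student_t3_unit_variance, 4.0, n)
print(gauss_tail, fat_tail)
```

Both series have the same variance, yet the fat-tailed one produces four-sigma moves far more often, which is exactly what a variance-only risk model cannot see.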
Beyond resilience: antifragile systems don't just withstand shocks — they benefit from them. Our self-improving pipeline (every mistake generates a permanent rule, the system gets stronger over time) pursues antifragility — not avoiding mistakes, but ensuring every mistake makes the system better.
Human cognitive limitations and systematic biases are not bugs — they are design inputs. Understanding System 1/System 2 tells us when to use fast models (Gemini) vs. deep reasoning (Claude); understanding working memory limits tells us why context windows matter.
Bounded rationality: humans (and AI) are not optimizers but satisficers — seeking "good enough" solutions rather than global optima. This insight directly influences our system design: we don't pursue a single optimal model, but find "good enough" model assignments for each cognitive task.
Working memory capacity limits: humans can hold roughly seven (plus or minus two) chunks of information at once. AI context windows are an analogous constraint: Claude with 1M tokens can handle far more material simultaneously, but still faces attention dilution. Our Cognitive Load Optimization is directly inspired by Miller's research.
Prospect theory and the closely related heuristics-and-biases research revealed systematic human biases under uncertainty: loss aversion, the certainty effect, and anchoring. These biases exist not only in human analysts but may also be encoded in AI training data. Our Red Team Analysis is specifically designed to detect such biases.
The dual-system theory of System 1 (fast, intuitive, automatic) and System 2 (slow, deliberate, conscious). This maps directly to our model architecture: Gemini Flash and Haiku play the role of System 1 (fast triage, sentiment tagging, news filtering), while Claude Opus plays the role of System 2 (deep reasoning, causal analysis, thesis construction).
From information theory's mathematical foundation to Transformer architecture to Constitutional AI to Model Context Protocol — every layer of the stack has academic roots. We're not 'using AI tools' — we're conducting research within a theoretically grounded computational framework.
The foundational work of information theory. Shannon's core concepts — information entropy, channel capacity, redundancy — directly drive our signal-to-noise filtering. When news arrives, the system measures not "is it important" but "how much new information does it carry" — this is Shannon's information theory operationalized.
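Measuring how much new information an item carries corresponds to Shannon's surprisal, the negative log of the probability assigned to the event beforehand. The sketch below computes surprisal and entropy in bits; how the system actually estimates those prior probabilities is not described here, so the inputs are hypothetical.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Information content, in bits, of an event that had the given prior probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def entropy_bits(distribution: list[float]) -> float:
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in distribution if p > 0.0)


# A widely expected outcome carries little information; a surprise carries more.
print(surprisal_bits(0.5))    # 1.0 bit
print(surprisal_bits(0.125))  # 3.0 bits
print(entropy_bits([0.5, 0.5]))
```

An event that was already certain carries zero bits, which is why a fully anticipated headline can safely be filtered out however important its topic sounds.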
The Transformer architecture paper — the underlying architecture of every model in our stack (Claude, GPT, Gemini, Llama, DeepSeek, Mistral). Self-attention enables long-range dependencies across sequences, the computational foundation for processing full annual reports and cross-document references.
The RAG paradigm: combining retrieval and generation so models access external knowledge rather than relying solely on parametric memory. Our Vector Store and Semantic Search system derives directly from this — models don't need to "remember" all research reports, just retrieve relevant context at inference time.
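The retrieve-then-generate flow can be sketched with a toy retriever. A real vector store would use learned embeddings; here a bag-of-words vector and cosine similarity stand in for them, and the documents, function names, and prompt format are all hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a learned embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vector, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str]) -> str:
    """Prepend retrieved context to the question before it is sent to a model."""
    context = "\n".join(retrieve(query, documents, k=2))
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = [
    "copper inventories fell sharply in the third quarter",
    "the central bank held rates steady",
    "retail sales rose on holiday demand",
]
print(retrieve("what happened to copper inventories", docs))
```

Only the retrieval step is shown; the generation step would pass the output of `build_prompt` to whichever model handles the query.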
The theoretical basis for AI supervising AI. Constitutional AI demonstrated that a model can critique and revise its own outputs against a written set of principles, reducing reliance on human feedback labels. Our Self-Improving Error Log (every mistake generates a permanent rule) applies an analogous principle: the system constrains future behavior using rules derived from past mistakes.
The universal connector standard for AI-to-tool integration. MCP enables AI models to access external tools and data sources in a standardized way — instead of custom integrations per data source, all sources expose themselves to all models through a unified protocol. Our MCP Server Mesh is built entirely on this standard.
Extracting millions of interpretable features from Claude via sparse autoencoders. This research demonstrated that AI model internals are not incomprehensible black boxes: individual neurons are often polysemantic, but the learned features (directions in activation space) do correspond to interpretable concepts. For high-stakes financial applications, this interpretability is the foundation of trust.
AI capability is advancing faster than interpretability research — if we cannot see how the model thinks, we cannot trust its conclusions. This directly drives our insistence on extended thinking audit trails: every reasoning chain must be visible, auditable, and traceable.
Garbage in, garbage out — no matter how powerful the model. This lineage from the relational data model to financial research data cleaning standards defines what 'clean data' means in our system.
The foundational paper of the relational data model. Codd's normalization theory (eliminating redundancy, ensuring consistency) is the conceptual ancestor of our normalized data layer — though we handle unstructured and semi-structured data, the underlying principle of "one fact stored once" remains.
This paper implicitly established data cleaning standards for financial research: survivorship bias, delisting bias, outlier treatment. Our data denoising pipeline's delisting bias correction and outlier detection derive directly from the best practices Fama-French established.
Tidy data principles: each variable a column, each observation a row, each type of observational unit a table. Seemingly simple, but in cross-market data normalization (different trading calendars, accounting standards, currencies), these principles anchor data consistency.
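As a small illustration of these principles, the sketch below reshapes a "wide" table, where fiscal years are spread across columns, into tidy rows with one observation each. The `melt` helper, tickers, and figures are hypothetical.

```python
def melt(wide_rows: list[dict], id_key: str, var_name: str, value_name: str) -> list[dict]:
    """Turn one wide row per entity into one tidy row per (entity, variable) observation."""
    tidy = []
    for row in wide_rows:
        for key, value in row.items():
            if key != id_key:
                tidy.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy


# Wide layout: the fiscal year is hidden in the column names.
wide = [
    {"ticker": "AAA", "2023": 1.2, "2024": 1.5},
    {"ticker": "BBB", "2023": 0.7, "2024": 0.9},
]
tidy = melt(wide, "ticker", "fiscal_year", "eps")
for row in tidy:
    print(row)
```

Once the year is an explicit column, series from markets with different calendars or currencies can be aligned by joining on it rather than by matching column names.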
A systematic taxonomy of data quality problems: missing values, duplicates, inconsistencies, staleness. This classification framework is the academic skeleton of our data pipeline QA — each problem category has corresponding detection and repair strategies.
The newest lineage: from agent memory architectures to multi-agent debate to financial-domain empirical validation. These 2023-2025 papers directly shaped our agent collaboration architecture and our positioning of 'AI-assisted judgment, not AI-replaced judgment.'
Dual-tier agent memory architecture: working memory (current context) and long-term memory (cross-session persistence). Our Self-Improving Error Log and cross-session institutional memory derive directly from MemGPT/Letta's design — agents learn not just within a single conversation, but accumulate knowledge across conversations.
The closest published analog to our agent architecture: multiple LLM agents simulate different roles in a trading firm (analyst, trader, risk manager), reaching trading decisions through bull/bear debate. Key reported finding: the multi-agent debate setup produced better trading decisions than single-agent baselines.
A reasoning model trained primarily through large-scale reinforcement learning and reported to perform comparably to OpenAI o1 on math benchmarks; its R1-Zero precursor was trained with RL alone, without supervised fine-tuning. We deploy DeepSeek-R1 locally for quantitative verification: the dual requirements of data sovereignty and rigorous mathematical reasoning match this model's design goals.
Empirical validation of multi-agent debate for interpretable factor discovery. FactorMAD demonstrated that debate mechanisms improve not just factor quality but also interpretability — consistent with our adversarial review pipeline philosophy: debate produces not just better conclusions, but more transparent reasoning.
Systematic benchmarking of LLM trading capabilities. Core finding: most LLM agents fail to beat simple buy-and-hold. This is not a negation of AI, but a validation of our positioning — AI assists human judgment, it does not replace it. Autonomous trading is not the goal; augmented analytical capability is.