Foundational Literature

Lineage 1 · Bayesian Tradition

From Bayes' original probability theorem to Black-Litterman portfolio view combination to Tetlock's superforecasting methodology. This lineage defines the mathematical core of our Thesis Tracking system: beliefs are probabilities, evidence updates beliefs, and calibration is a trainable skill.

Bayes, T. (1763). An Essay towards Solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London.

The original theorem underpinning all probabilistic reasoning in our system. 260 years before AI practitioners made "prior" and "posterior" everyday vocabulary, Bayes established the framework for inferring cause probabilities from observed data. Our Thesis Tracking system does fundamentally the same thing — updating confidence in theses from market evidence.
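
As a minimal sketch of what "updating confidence in a thesis from evidence" means, the snippet below applies Bayes' theorem to a single binary thesis. The function name and the probabilities are hypothetical illustrations, not values or code from our system.

```python
def bayes_update(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """Return P(thesis | evidence) for a binary thesis via Bayes' theorem."""
    numerator = p_evidence_if_true * prior
    evidence = numerator + p_evidence_if_false * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / evidence


# Hypothetical numbers: a 40% prior, and data twice as likely if the thesis holds.
posterior = bayes_update(prior=0.40, p_evidence_if_true=0.60, p_evidence_if_false=0.30)
print(round(posterior, 4))  # 0.5714
```

Evidence that is equally likely under both hypotheses leaves the prior unchanged, which is why only diagnostic evidence should move a thesis.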

Black, F. & Litterman, R. (1992). Global Portfolio Optimization. Financial Analysts Journal, 48(5), 28-43.

The milestone paper that brought Bayesian inference into investment management. The Black-Litterman model solved the practical problems of Markowitz mean-variance optimization: instead of requiring precise return expectations for every asset, investors express "views" that are blended with the returns implied by market equilibrium, which serve as the prior, with each side weighted by its uncertainty. Our Thesis Tracking architecture directly borrows this framework: theses are views, the market-implied baseline is the prior, and the system continuously blends both.
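
For reference, the blended expected-return vector is commonly written in later expositions of the model as below. The notation is the conventional one and is an assumption here, not taken from our system: Π is the vector of equilibrium-implied returns, Σ the asset covariance matrix, τ a scalar that scales uncertainty in the equilibrium estimate, P the matrix that picks out the assets in each view, Q the vector of view returns, and Ω the covariance of view errors.

```latex
\mu_{BL} = \left[(\tau\Sigma)^{-1} + P^{\top}\Omega^{-1}P\right]^{-1}
           \left[(\tau\Sigma)^{-1}\Pi + P^{\top}\Omega^{-1}Q\right]
```

The expression is a precision-weighted average: a confident view (small Ω) pulls the result toward Q, while a vague view leaves it close to Π.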

Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A. & Rubin, D.B. (2013). Bayesian Data Analysis. 3rd edition, CRC Press.

The mathematical bible of hierarchical Bayesian models and posterior updating. When we need to share information across related theses (e.g., multiple companies in the same sector), hierarchical models enable partial pooling — each thesis has its own parameters, but the parameters themselves are drawn from a higher-level distribution. This book provides the complete path from theory to computation.
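
Partial pooling can be illustrated with the simplest normal-normal case, where the within-thesis and between-thesis variances are treated as known. That is a simplifying assumption; a full hierarchical model would estimate them. The function name and all numbers below are hypothetical.

```python
def partial_pool(group_mean: float, n_obs: int, within_var: float,
                 between_var: float, grand_mean: float) -> float:
    """Shrink one group's sample mean toward the grand mean.

    The weight on the group's own data grows with the number of observations
    and with the spread between groups (normal-normal model, known variances).
    """
    weight = between_var / (between_var + within_var / n_obs)
    return weight * group_mean + (1.0 - weight) * grand_mean


# Hypothetical sector: grand mean signal 0.02, one company showing 0.10.
few_obs = partial_pool(group_mean=0.10, n_obs=4, within_var=0.04,
                       between_var=0.01, grand_mean=0.02)
many_obs = partial_pool(group_mean=0.10, n_obs=36, within_var=0.04,
                        between_var=0.01, grand_mean=0.02)
print(few_obs, many_obs)
```

With 4 observations the estimate sits halfway between the company and the sector; with 36 it is dominated by the company's own data.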

Tetlock, P. & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.

The operational manual for calibrated prediction, distilled from the Good Judgment Project's empirical research. Tetlock found that the best forecasters share common traits: frequent small updates (not dramatic pivots), distinguishing known from unknown, actively seeking disconfirming evidence. These principles are directly encoded in our Prior Assignment and Evidence Accumulation systems.
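
Calibration in forecasting tournaments is typically scored with the Brier score. The sketch below uses the common binary form (mean squared error between forecast probability and outcome, ranging from 0 to 1). The forecasts and outcomes are made-up illustrations.

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes.

    Lower is better; always answering 0.5 scores 0.25.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical track record: three theses, two of which resolved true.
confident = brier_score([0.9, 0.7, 0.2], [1, 1, 0])
hedged = brier_score([0.5, 0.5, 0.5], [1, 1, 0])
print(confident, hedged)
```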

Lineage 2 · Causal Inference

The most dangerous mistake in market research is treating correlation as causation. This lineage provides the complete toolkit from time-series causal tests to structural causal models, ensuring every 'A caused B' judgment has methodological backing.

Granger, C. (1969). Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica, 37(3), 424-438.

The foundational definition of predictive causality in time series: if the history of X improves prediction of Y, then X "Granger-causes" Y. Not true causation, but extremely practical for financial signal detection — our signal pipeline uses Granger causality tests extensively as initial screening.
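
The idea can be sketched in standard-library Python as an F-test comparing two lag-1 regressions: one predicting Y from its own past, one that also includes the past of X. This is an illustrative toy on synthetic data, not our pipeline code; in practice a library routine such as `grangercausalitytests` in statsmodels would normally be used. All function names below are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def rss(rows, y):
    """Residual sum of squares from an OLS fit of y on the given regressors."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2 for r, t in zip(rows, y))


def granger_f(x, y):
    """F statistic for the null 'lag-1 x adds nothing beyond lag-1 y' in predicting y."""
    target = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    rss_r, rss_u = rss(restricted, target), rss(unrestricted, target)
    dof = len(target) - 3  # observations minus parameters in the larger model
    return (rss_r - rss_u) / (rss_u / dof)


# Synthetic data in which x leads y by one step, but not the other way round.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + rng.gauss(0, 0.5) for t in range(1, 200)]
f_xy = granger_f(x, y)  # large: past x helps predict y
f_yx = granger_f(y, x)  # small: past y does not help predict x
print(f_xy, f_yx)
```

A large statistic only says that X's history carries predictive information, which is why the test serves as a screen rather than as proof of causation.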

Rubin, D. (1974). Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. Journal of Educational Psychology, 66(5), 688-701.

The potential outcomes framework (Rubin Causal Model): causal effects defined as "treated outcome vs. counterfactual untreated outcome." We cannot run RCTs in financial research, but this framework forces rigorous counterfactual thinking — "What would the market have done if this policy hadn't been enacted?"

Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd edition, Cambridge University Press.

The systematic theory of causal graphs (DAGs) and do-calculus. Pearl's causal ladder — association (seeing), intervention (doing), counterfactual (imagining) — provides a deeper causal reasoning framework than statistical regression. Our Causal Chain Visualizer lets analysts draw DAGs while the system checks for confounding, mediation, or collider bias.

Angrist, J. & Pischke, J. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.

Instrumental variables (IV), regression discontinuity (RDD), difference-in-differences (DiD): practical causal inference methods when RCTs are impossible. These are not merely theoretical tools; they are the actual methods behind our stress testing and temporal robustness checks. When an analyst claims causation, the system demands an identification strategy.
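
Of the three methods, difference-in-differences is the easiest to show in a few lines: the change in a treated group minus the change in a comparison group over the same period. The sketch assumes parallel trends (both groups would have moved alike without the event) and uses made-up numbers.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post) -> float:
    """Difference-in-differences estimate of a treatment effect.

    Valid only under the parallel-trends assumption.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical margins before and after a policy that hit only the treated firms.
effect = diff_in_diff(
    treated_pre=[1.0, 1.2], treated_post=[2.0, 2.2],
    control_pre=[0.8, 1.0], control_post=[1.1, 1.3],
)
print(round(effect, 2))  # 0.7
```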

Lineage 3 · Falsification & Scientific Method

If you cannot articulate what would prove you wrong, your thesis is not rigorous enough to publish. This lineage, from philosophy of science to statistical methodology, defines our standard for what constitutes good research.

Popper, K. (1934/1959). The Logic of Scientific Discovery. Routledge.

Falsifiability as the demarcation criterion between science and non-science. Popper's core insight: a theory that cannot be falsified is not a scientific theory. Our Popperian Exit Protocol comes directly from this principle — every thesis must define its own "death condition" before publication.

Lakatos, I. (1978). The Methodology of Scientific Research Programmes. Cambridge University Press.

A more nuanced methodology than Popper: research programmes have a "hard core" and "protective belt," tolerating anomalies temporarily without immediate abandonment. This explains why our falsification triggers are not binary — theses can absorb negative evidence within a time window, but the protective belt cannot expand indefinitely.

Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge University Press.

How to design statistical tests with genuine power. Mayo's "severe testing" concept: a good test not only passes when the hypothesis is true, but more importantly, has the power to reject when the hypothesis is false. Our quantitative stress testing draws theoretical support from here — a test's value lies not in confirmation, but in its power to reject false hypotheses.
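
The "power to reject" part of this idea can be made concrete with the classical power of a one-sided z-test. This is only one ingredient of a severity assessment, not an implementation of Mayo's full framework, and the effect sizes and sample sizes are hypothetical.

```python
from statistics import NormalDist


def z_test_power(effect: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Probability that a one-sided z-test rejects H0 when the true mean shift is `effect`."""
    standard_normal = NormalDist()
    z_crit = standard_normal.inv_cdf(1.0 - alpha)
    shift = effect * n ** 0.5 / sigma
    return 1.0 - standard_normal.cdf(z_crit - shift)


# The same true effect is far easier to detect with more observations.
weak_test = z_test_power(effect=0.5, sigma=1.0, n=10)
strong_test = z_test_power(effect=0.5, sigma=1.0, n=50)
print(weak_test, strong_test)
```

A thesis that survives the weak test has learned little, since that test would often have passed it even if it were false.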

Lineage 4 · Risk & Uncertainty

From Knight's fundamental distinction between risk and uncertainty, to Markowitz's quantitative framework, to Taleb's systematic critique of normality assumptions — this lineage shapes our understanding of what can be modeled and what cannot.

Knight, F. (1921). Risk, Uncertainty and Profit. Houghton Mifflin.

The fundamental distinction between risk (randomness with calculable probabilities) and uncertainty (unknowns that cannot be assigned probabilities). This distinction remains the cornerstone of financial analysis more than a century later: our system explicitly separates "modelable risk" from "unmodelable uncertainty," maintaining humility toward the latter.

Markowitz, H. (1952). Portfolio Selection. Journal of Finance, 7(1), 77-91.

The foundational paper of mean-variance optimization and the origin of quantitative risk management. Although later critics (especially Taleb) argued that the normality assumptions commonly paired with it fail under extreme conditions, Markowitz's "risk-return tradeoff" framework remains the starting point for all portfolio analysis.

Taleb, N.N. (2007). The Black Swan: The Impact of the Highly Improbable. Random House.

Fat-tailed risk, model overfit, and why stress testing is non-negotiable. Taleb's core argument: extreme events have far greater impact than normal distribution models predict, and financial systems are most fragile to precisely these events. This directly drives our insistence on Monte Carlo simulation and sensitivity analysis — we must consider possibilities beyond the model.
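
A small Monte Carlo experiment shows how much a normality assumption can understate extreme moves. The sketch compares how often a standard normal and a unit-variance Student-t with 3 degrees of freedom fall below -4 standard deviations. The distribution choice, threshold, and sample size are illustrative assumptions, not parameters from our system.

```python
import random


def normal_draw(rng: random.Random) -> float:
    """One draw from a standard normal."""
    return rng.gauss(0.0, 1.0)


def student_t_draw(rng: random.Random, dof: int = 3) -> float:
    """One draw from a Student-t, rescaled to unit variance (requires dof > 2)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))
    t = z / (chi2 / dof) ** 0.5
    return t / (dof / (dof - 2)) ** 0.5


def tail_frequency(draw, threshold: float, n_sims: int, rng: random.Random) -> float:
    """Share of simulated draws that fall below `threshold`."""
    hits = sum(1 for _ in range(n_sims) if draw(rng) < threshold)
    return hits / n_sims


rng = random.Random(42)  # fixed seed so the run is repeatable
normal_tail = tail_frequency(normal_draw, -4.0, 50_000, rng)
fat_tail = tail_frequency(student_t_draw, -4.0, 50_000, rng)
print(normal_tail, fat_tail)
```

Both distributions have the same variance, so a model calibrated only on volatility cannot tell them apart, yet the fat-tailed one produces extreme losses far more often.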

Taleb, N.N. (2012). Antifragile: Things That Gain from Disorder. Random House.

Beyond resilience: antifragile systems don't just withstand shocks — they benefit from them. Our self-improving pipeline (every mistake generates a permanent rule, the system gets stronger over time) pursues antifragility — not avoiding mistakes, but ensuring every mistake makes the system better.

Lineage 5 · Cognition & Behavior

Human cognitive limitations and systematic biases are not bugs — they are design inputs. Understanding System 1/System 2 tells us when to use fast models (Gemma 4) vs. deep reasoning (Claude); understanding working memory limits tells us why context windows matter.

Simon, H. (1955). A Behavioral Model of Rational Choice. Quarterly Journal of Economics, 69(1), 99-118.

Bounded rationality: humans (and AI) are not optimizers but satisficers — seeking "good enough" solutions rather than global optima. This insight directly influences our system design: we don't pursue a single optimal model, but find "good enough" model assignments for each cognitive task.

Miller, G. (1956). The Magical Number Seven, Plus or Minus Two. Psychological Review, 63(2), 81-97.

Working memory capacity limits: humans can hold roughly seven (plus or minus two) chunks of information at once. AI context windows are an analogous constraint: Claude with 1M tokens can handle more chunks simultaneously, but still faces attention dilution. Our Cognitive Load Optimization derives directly from Miller's research.

Kahneman, D. & Tversky, A. (1979). Prospect Theory: An Analysis of Decision under Risk. Econometrica, 47(2), 263-292.

Prospect theory revealed systematic human biases in decisions under risk: loss aversion, reference dependence, and the certainty effect (anchoring, often listed alongside them, comes from Tversky and Kahneman's 1974 work on heuristics). These biases exist not only in human analysts but may also be encoded in AI training data. Our Red Team Analysis is specifically designed to detect such biases.

Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.

The dual-system theory of System 1 (fast, intuitive, automatic) and System 2 (slow, deliberate, conscious). This maps directly to our model architecture: Gemma 4 / Haiku is System 1 (fast triage, sentiment tagging, news filtering), Claude Opus is System 2 (deep reasoning, causal analysis, thesis construction).

Lineage 6 · AI & Computation

From information theory's mathematical foundation to Transformer architecture to Constitutional AI to Model Context Protocol — every layer of the stack has academic roots. We're not 'using AI tools' — we're conducting research within a theoretically grounded computational framework.

Shannon, C. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379-423.

The foundational work of information theory. Shannon's core concepts — information entropy, channel capacity, redundancy — directly drive our signal-to-noise filtering. When news arrives, the system measures not "is it important" but "how much new information does it carry" — this is Shannon's information theory operationalized.
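
"How much new information does it carry" has a direct Shannon reading: the surprisal of an event is the negative log of its probability, and entropy is the expected surprisal. The sketch below computes both in bits. The probabilities are hypothetical, and how our system estimates them is not shown.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Information content of an event, in bits: rarer events carry more."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def entropy_bits(distribution: list[float]) -> float:
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in distribution if p > 0.0)


expected_news = surprisal_bits(0.5)     # a coin-flip outcome: 1 bit
surprising_news = surprisal_bits(0.125)  # a 1-in-8 outcome: 3 bits
source_entropy = entropy_bits([0.5, 0.25, 0.25])
print(expected_news, surprising_news, source_entropy)  # 1.0 3.0 1.5
```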

Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017.

The Transformer architecture paper — the underlying architecture of every model in our stack (Claude, GPT, Gemma 4, Llama, DeepSeek, Mistral). Self-attention enables long-range dependencies across sequences, the computational foundation for processing full annual reports and cross-document references.

Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.

The RAG paradigm: combining retrieval and generation so models access external knowledge rather than relying solely on parametric memory. Our Vector Store and Semantic Search system derives directly from this — models don't need to "remember" all research reports, just retrieve relevant context at inference time.
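
The retrieval half of this pattern can be sketched with a toy bag-of-words similarity search. Real systems, including the one in Lewis et al., use learned dense embeddings; word counts stand in for them here purely for illustration. The documents and function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


docs = [
    "copper inventories fell sharply in the third quarter",
    "the central bank held interest rates steady",
    "semiconductor export controls were tightened",
]
top = retrieve("what happened to copper inventories", docs, k=1)
print(top)
```

The retrieved passages would then be placed in the model's prompt, so the answer is grounded in stored reports rather than in parametric memory.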

Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic.

The theoretical basis for AI supervising AI. Constitutional AI demonstrated that a model can critique and revise its own outputs against a written set of principles, with AI-generated feedback standing in for much of the human labeling. Our Self-Improving Error Log (every mistake generates a permanent rule) applies this principle: the system constrains future behavior using past mistakes.

Anthropic (2024). Model Context Protocol (MCP). Open standard specification.

The universal connector standard for AI-to-tool integration. MCP enables AI models to access external tools and data sources in a standardized way — instead of custom integrations per data source, all sources expose themselves to all models through a unified protocol. Our MCP Server Mesh is built entirely on this standard.

Templeton, A. et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic.

Extracting millions of interpretable features from Claude 3 Sonnet via sparse autoencoders. This research demonstrated that AI model internals are not incomprehensible black boxes: learned features (directions in activation space, rather than individual neurons) do correspond to interpretable concepts. For high-stakes financial applications, this interpretability is the foundation of trust.

Amodei, D. (2025). The Urgency of Interpretability. darioamodei.com.

AI capability is advancing faster than interpretability research — if we cannot see how the model thinks, we cannot trust its conclusions. This directly drives our insistence on extended thinking audit trails: every reasoning chain must be visible, auditable, and traceable.

Lineage 7 · Data Engineering

Garbage in, garbage out — no matter how powerful the model. This lineage from the relational data model to financial research data cleaning standards defines what 'clean data' means in our system.

Codd, E.F. (1970). A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6), 377-387.

The foundational paper of the relational data model. Codd's normalization theory (eliminating redundancy, ensuring consistency) is the conceptual ancestor of our normalized data layer — though we handle unstructured and semi-structured data, the underlying principle of "one fact stored once" remains.

Fama, E.F. & French, K.R. (1992). The Cross-Section of Expected Stock Returns. Journal of Finance, 47(2), 427-465.

This paper implicitly established data cleaning standards for financial research: survivorship bias, delisting bias, outlier treatment. Our data denoising pipeline's delisting bias correction and outlier detection derive directly from the best practices Fama-French established.

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1-23.

Tidy data principles: each variable a column, each observation a row, each type of observation a table. Seemingly simple, but in cross-market data normalization (different trading calendars, accounting standards, currencies), these principles anchor data consistency.
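
The principle is easiest to see on a "wide" table where years are spread across columns. The sketch reshapes it so each row is one observation. The tickers, column names, and figures are made up for illustration.

```python
def melt(rows: list[dict], id_key: str, var_name: str, value_name: str) -> list[dict]:
    """Reshape wide rows into tidy rows: one (id, variable, value) per row."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue
            tidy.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy


# Wide layout: one row per company, one column per fiscal year.
wide = [
    {"ticker": "AAA", "2023": 1.2, "2024": 1.5},
    {"ticker": "BBB", "2023": 0.7, "2024": 0.9},
]
tidy_rows = melt(wide, id_key="ticker", var_name="fiscal_year", value_name="eps")
for row in tidy_rows:
    print(row)
```

Once every observation is a row, attributes such as currency or accounting standard can be added as ordinary columns instead of being encoded in column names.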

Chu, X. et al. (2016). Data Cleaning: Overview and Emerging Challenges. SIGMOD 2016.

A systematic taxonomy of data quality problems: missing values, duplicates, inconsistencies, staleness. This classification framework is the academic skeleton of our data pipeline QA — each problem category has corresponding detection and repair strategies.

Lineage 8 · Agentic AI & Financial Agents

The newest lineage: from agent memory architectures to multi-agent debate to financial-domain empirical validation. These 2023-2025 papers directly shaped our agent collaboration architecture and our positioning of 'AI-assisted judgment, not AI-replaced judgment.'

Packer, C. et al. (2023). MemGPT: Towards LLMs as Operating Systems. UC Berkeley. (Project since continued as Letta.)

Dual-tier agent memory architecture: working memory (current context) and long-term memory (cross-session persistence). Our Self-Improving Error Log and cross-session institutional memory derive directly from MemGPT/Letta's design — agents learn not just within a single conversation, but accumulate knowledge across conversations.

Xiao, Y. et al. (2024). TradingAgents: Multi-Agents LLM Financial Trading Framework. arXiv 2412.20138.

The closest published analog to our agent architecture: multiple LLM agents simulate different roles in a trading firm (analyst, trader, risk manager), reaching trading decisions through bull/bear debate. Key finding: decisions reached through multi-agent debate significantly outperform those of a single agent.

DeepSeek AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2501.12948.

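A reasoning model trained primarily through large-scale reinforcement learning: the R1-Zero variant develops reasoning with no supervised fine-tuning, and the full R1 model reports performance comparable to OpenAI o1 on math benchmarks. We deploy DeepSeek-R1 locally for quantitative verification, since the dual requirements of data sovereignty and rigorous mathematical reasoning are both met by an open-weight reasoning model.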

Duan, Y. et al. (2025). FactorMAD: Multi-Agent Debate Framework for Alpha Factor Mining. ACM ICAIF 2025.

Empirical validation of multi-agent debate for interpretable factor discovery. FactorMAD demonstrated that debate mechanisms improve not just factor quality but also interpretability — consistent with our adversarial review pipeline philosophy: debate produces not just better conclusions, but more transparent reasoning.

Chen et al. (2025). StockBench: Can LLM Agents Trade Stocks Profitably? arXiv 2510.02209.

Systematic benchmarking of LLM trading capabilities. Core finding: most LLM agents fail to beat simple buy-and-hold. This is not a negation of AI, but a validation of our positioning — AI assists human judgment, it does not replace it. Autonomous trading is not the goal; augmented analytical capability is.
