NeurIPS 2026 Competition Track

Agenthon 2026

Verifiable AI for quantitative finance.

A four-track competition testing whether AI agents can produce finance outputs that survive automated, leakage-controlled, cheat-resistant evaluation.

Submission: Docker agent
Admissibility: g0-g3 gates
Ranking: Track metrics

Core question

Can AI agents produce finance answers that can be checked by machine?

Agenthon extends the Alphathon program into its first NeurIPS edition. The competition keeps the four-track structure and hard finance setting, while adding sealed held-out data, automated leakage controls, reproducible reruns, and public leaderboards.

Every official submission runs offline in a sandboxed Docker container. It must pass integrity, schema, cutoff/resource, and domain-semantics gates before any leaderboard metric is computed.
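The gate-then-score order described above can be sketched in a few lines. This is only an illustration, not the organizers' harness: the mapping of g0 to g3 onto the four checks follows the order in which the text lists them (an assumption), and the `Run` record, its fields, and the check functions are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Run:
    """Hypothetical record of one sandboxed run; all fields are illustrative."""
    integrity_ok: bool    # outcome of the integrity check
    schema_ok: bool       # outputs conform to the track schema
    limits_ok: bool       # data cutoff and resource rules respected
    semantics_ok: bool    # track-specific domain checks passed
    raw_metric: float     # track metric, only meaningful if admissible


# Assumed mapping of gate ids to checks, in the order the text lists them.
GATES: list[tuple[str, Callable[[Run], bool]]] = [
    ("g0 integrity", lambda r: r.integrity_ok),
    ("g1 schema", lambda r: r.schema_ok),
    ("g2 cutoff/resource", lambda r: r.limits_ok),
    ("g3 domain semantics", lambda r: r.semantics_ok),
]


def first_failed_gate(run: Run) -> Optional[str]:
    """Return the label of the first gate that fails, or None if all pass."""
    for label, check in GATES:
        if not check(run):
            return label
    return None


def score(run: Run) -> Optional[float]:
    """Compute the metric only for admissible runs; otherwise return None."""
    return run.raw_metric if first_failed_gate(run) is None else None
```

Returning the label of the first failed gate, rather than a bare boolean, keeps enough information to aggregate failures by gate later.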

Four tracks

Same competition spine, four finance problems.

T1 Coding

Quant-finance coding agents

Build a Docker agent that solves quantitative finance coding tasks under pytest and financial-invariant checks.

Verb: solve
Metric: pass@1 / pass@3
Gate: pytest + invariants
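The page names pass@1 and pass@3 but does not say how they are estimated. One widely used choice is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n attempts are drawn per task and c of them pass. Whether Agenthon uses this estimator or simply runs exactly k attempts is an assumption here; the function name and signature are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n attempts contains no passing attempt.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With three attempts and one pass, `pass_at_k(3, 1, 1)` is 1/3 and `pass_at_k(3, 1, 3)` is 1.0, since any set of all three attempts includes the passing one.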

T2 Forecasting

Reasoning-augmented time series

Forecast future panels using time-series data plus a time-stamped text corpus, then prove information uplift over text-blind baselines.

Verb: forecast
Metric: CRPS composite
Gate: as-of cutoff + calibration
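As a rough illustration of the T2 metric, the sketch below estimates CRPS from forecast samples using the standard identity CRPS = E|X - y| - 0.5 E|X - X'|, then expresses uplift as the relative CRPS reduction against a baseline. How the official "CRPS composite" combines series or horizons is not stated on this page, so the single-observation form and the `relative_uplift` helper are assumptions.

```python
from itertools import product


def crps_from_samples(samples: list[float], observed: float) -> float:
    """Sample-based CRPS estimate for one observation (lower is better)."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one forecast sample")
    # Mean absolute error of the samples against the realized value.
    accuracy = sum(abs(x - observed) for x in samples) / n
    # Mean absolute difference between all ordered pairs of samples.
    spread = sum(abs(a - b) for a, b in product(samples, repeat=2)) / (n * n)
    return accuracy - 0.5 * spread


def relative_uplift(agent_crps: float, baseline_crps: float) -> float:
    """Fraction by which the agent lowers CRPS relative to a baseline."""
    if baseline_crps <= 0:
        raise ValueError("baseline CRPS must be positive")
    return 1.0 - agent_crps / baseline_crps
```

A positive `relative_uplift` means the text-aware forecast scored better than the text-blind baseline; zero or negative means the text added nothing measurable.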

T3 Simulation

Accelerated market simulation

Submit an ABIDES-compatible simulator that is faster while preserving matching-engine semantics and market stylized facts.

Verb: simulate
Metric: events/sec
Gate: semantic regression
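The T3 rule that speed counts only once semantics are preserved can be sketched as a gated throughput function. This is a simplified assumption of what a semantic regression might compare: here it is exact equality of the trade tape produced by a reference run and a candidate run on the same order flow. The `Trade` layout and the function name are hypothetical, and stylized-fact checks are not covered.

```python
from typing import Optional, Sequence, Tuple

# A trade as (price, quantity, buy_order_id, sell_order_id); layout is hypothetical.
Trade = Tuple[float, int, int, int]


def gated_throughput(
    reference_tape: Sequence[Trade],
    candidate_tape: Sequence[Trade],
    events_processed: int,
    elapsed_seconds: float,
) -> Optional[float]:
    """Return events/sec only if the candidate reproduces the reference trades."""
    if list(candidate_tape) != list(reference_tape):
        return None  # semantic regression: no throughput is reported
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return events_processed / elapsed_seconds
```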

T4 Tabular

Evidence-grounded prediction

Predict labels, values, or rankings over tabular entities with citations from a frozen evidence corpus and confidence intervals.

Verb: analyze
Metric: quality + coverage
Gate: faithfulness + embargo
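The embargo half of the T4 gate can be illustrated with a small citation check. This sketch assumes the frozen corpus is available as a mapping from document id to publication date and that each prediction lists the ids it cites; both are hypothetical representations. It does not test faithfulness itself, meaning whether the cited text actually supports the prediction.

```python
from datetime import date


def citation_violations(
    cited_ids: list[str],
    corpus: dict[str, date],
    as_of: date,
) -> list[str]:
    """Return reasons why a prediction's citations are inadmissible."""
    problems = []
    if not cited_ids:
        problems.append("prediction cites no evidence")
    for doc_id in cited_ids:
        if doc_id not in corpus:
            problems.append(f"{doc_id}: not in the frozen corpus")
        elif corpus[doc_id] > as_of:
            problems.append(f"{doc_id}: dated after the embargo cutoff")
    return problems
```

An empty list means the citations clear this check; any entry names the offending document and the reason.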

Protocol

Submit, check, score.

The visible rule is simple. A submission is ranked only after it passes the same admissibility sequence used by every track.

Submit

A Docker image implements one stable CLI verb for its track.

Check

Gates g0-g3 verify integrity, schema, cutoff/resource rules, and domain semantics.

Score

Only admissible runs receive a track metric and bootstrap confidence interval.
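A bootstrap confidence interval of the kind mentioned above can be sketched with a percentile bootstrap over per-unit scores. The resampling unit, the number of resamples, and the interval type used in official scoring are not stated on this page, so the defaults below are assumptions.

```python
import random
from statistics import mean


def bootstrap_ci(
    scores: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-unit scores."""
    if not scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(scores)
    # Resample units with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = max(lo_idx, min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1))
    return means[lo_idx], means[hi_idx]
```

Seeding the generator means a rerun on the same scores gives the same interval, which matters when results are meant to be reproducible.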

Public/private firewall

Transparent grading, sealed answers.

Each track has a public practice repo and a private sealed exam repo. Public repos include practice tasks, baselines, smoke scorers, and participant docs. Private repos hold held-out tasks, oracle keys, final scoring logic, and audit material.

Public practice | Private exam
Public-dev and validation units | Private-test held-out units
Runnable baselines and smoke scorer | Oracle solutions and final scorer
Manifest and canary safety checks | Canary registry and audit logs

Leaderboard

Competition scores will appear here.

Each team row will show its total score first. Expanding a row reveals the per-track score breakdown used to compute or explain that total.

2026 phases

Development, final, verification.

Development opens August 3

Public repos open, teams practice, and the validation leaderboard goes live.

Development: Aug 3 - Sep 28

Public repos open. Teams practice, iterate, and use the live validation leaderboard.

Final: Sep 29 - Oct 12

One submission per team is evaluated on sealed private-test units with a hidden leaderboard.

Verification: Oct 13 - Oct 25

Organizers rerun top submissions with fresh seeds and review reproducibility.

Scientific outputs

Agenthon is designed to explain failures, not just rank winners.

Cross-track failure map

Gate failures are labeled and aggregated to show where finance agents break down.

Information uplift

T2 isolates whether text and reasoning beat strong text-blind forecasting baselines.

Speed-realism frontier

T3 measures throughput only after semantic fidelity and stylized facts survive checks.

Faithfulness under embargo

T4 requires evidence-backed predictions that do not cite future or unsupported facts.

Resources

Built from the Agenthon 2026 source docs.

Executive summary: Competition structure, tracks, and repo split
Technical design spec: Submission contract, gates, and scoring formulas
Glossary: Plain-English definitions for competition terms
Team and repo guide: Onboarding deck speaker script

Real problems from real desks.

Sponsor Agenthon 2026.

Thank you from the organizers.

Lead Organizers

Christos Koutsoyannis

Pawel Polak

Industry Co-Organizers

David Rosenberg

Gary Kazantsev

Ioana Boier

Track Leads

Quant-finance coding agents

Reasoning-augmented time series

Accelerated market simulation

Evidence-grounded prediction