FinReason Cup: Agentic Financial Reasoning, Hedging, and Audit Verification
(Proposal for the 2026 IEEE Big Data Cup)
Zhuohan Xie1,2, Yankai Chen1,2,3, Xueqing Peng1, Yuechen Jiang1,5, Yupeng Cao1,4, Preslav Nakov2, Steve Liu1,2,3
1The Fin AI, 2MBZUAI, 3McGill University, 4Stevens Institute of Technology, 5University of Manchester
1 Description and Goals
We propose FinReason Cup, a 2026 IEEE Big Data Cup challenge on verifiable and agentic financial reasoning. The challenge addresses a central limitation of current financial AI systems: strong language models can often produce fluent financial explanations (OpenAI et al., 2023), but their intermediate reasoning, numerical consistency, and workflow execution remain difficult to verify. This limitation is especially consequential in finance, where errors in valuation, risk management, or audit verification can lead to materially wrong decisions.
Recent financial AI benchmarks and shared tasks make this problem timely. Broad evaluation suites such as FinBen, MultiFinBen, and Plutus have expanded financial LLM evaluation across general, multilingual, multimodal, and low-resource settings (Xie et al., 2024a; Peng et al., 2025b,a). Shared tasks have further highlighted realistic financial NLP scenarios, including LLM-based financial challenges, financial misinformation detection, and financial document causality detection (Xie et al., 2024b; Liu et al., 2025; Sandoval et al., 2025). However, many existing evaluations still emphasize final predictions or static benchmark performance. They do not fully test whether a system’s financial reasoning can be executed, audited, and reproduced under hidden evaluation conditions.
FinReason Cup builds on two organizer-developed benchmarks. The first is FinChain, a symbolic benchmark for verifiable chain-of-thought financial reasoning (Xie et al., 2025). FinChain uses parameterized financial templates and executable Python traces, enabling scalable generation of financial reasoning problems whose final answers and intermediate steps can be automatically checked. The second is HERCULEAN, an agentic benchmark for financial intelligence across professional workflows (Peng et al., 2026). For this challenge, we select two HERCULEAN workflows that are particularly suitable for a data competition: market-neutral hedging and financial audit verification.
The proposed cup introduces three complementary tasks: (1) verifiable financial chain reasoning, which evaluates symbolic multi-step financial calculations; (2) market-neutral hedging, which evaluates sequential financial decision-making over prices, news, and filings; and (3) financial audit verification, which evaluates whether systems can verify reported financial facts against XBRL calculation networks and US-GAAP taxonomy relationships. Together, these tasks test whether financial AI systems can move beyond plausible explanations toward reasoning that is reproducible, auditable, and executable.
2 Target Audience
FinReason Cup is intended for researchers and practitioners working on financial NLP, large language models, tool-augmented reasoning, program-aided reasoning, time-series decision making, accounting AI, and trustworthy AI. It is also relevant to quantitative finance, risk management, auditing, regulatory technology, and financial supervision, where systems must be judged not only by their final predictions but also by whether their reasoning is reproducible and auditable.
We expect participation from teams interested in building financial reasoning systems that combine language models with symbolic execution, retrieval, structured data processing, XBRL analysis, and market simulation. The challenge offers multiple entry points: Task 1 can be approached through prompting, fine-tuning, program synthesis, or symbolic solvers; Task 2 invites sequential decision-making, forecasting, and risk-control methods; and Task 3 targets tool-augmented verification over structured financial filings.
3 Related Work
Financial NLP and financial AI have been the subject of several recent shared tasks and competitions. FinNLP-AgentScen 2024 studied financial challenges for large language models, including scenario-based tasks that encouraged participants to reason over realistic financial settings (Xie et al., 2024b). Other shared tasks have focused on specific information-processing problems, such as financial misinformation detection (Liu et al., 2025), financial document causality detection (Sandoval et al., 2025), and Arabic financial NLP (Malaysha et al., 2024). These competitions have helped establish financial NLP as an active shared-task area, but they primarily evaluate task-specific predictions, extraction results, or classification outputs.
In parallel, benchmark efforts such as FinBen, MultiFinBen, and Plutus have broadened financial LLM evaluation across multilingual, multimodal, and low-resource settings (Xie et al., 2024a; Peng et al., 2025b,a). These benchmarks are valuable for measuring model capability, but they are generally designed as static evaluation suites rather than live competitions with private leaderboards, code-based reproducibility checks, and hidden test construction.
FinReason Cup is complementary to these efforts. Rather than focusing only on final answers, it evaluates whether systems can produce financial reasoning traces that are executable, auditable, or grounded in a controlled financial environment. Compared with prior competitions, the proposed cup combines three elements that are rarely evaluated together: (i) symbolic financial chain reasoning with automatic step-level verification, (ii) sequential market-neutral decision-making under hidden evaluation windows, and (iii) deterministic audit verification over XBRL filing structures. This design makes the competition practically relevant for finance while preserving the controlled evaluation setting expected of a Big Data Cup.
4 Organizers
The organizing team includes researchers with experience in financial AI, agentic benchmark development, computational linguistics, machine learning, and shared-task organization. The team has developed FinChain and HERCULEAN, which provide the technical basis for data generation, financial environments, evaluation scripts, and baseline systems. The tasks will be jointly coordinated by the organizers.
The corresponding organizer is Zhuohan Xie. Contact information will be provided through the official submission system.
Zhuohan Xie is a postdoctoral associate at MBZUAI and a researcher at The Fin AI, with research interests in financial AI, reasoning, and benchmark development. He has co-organized shared tasks at ImageCLEF, CLEF PAN, COLING, and SemEval, as well as FinMMEval.
Yankai Chen is a postdoctoral researcher affiliated with MBZUAI and McGill University, working with Prof. Xue (Steve) Liu. His research focuses on human-centered AI and knowledge mining, with recent work on agentic AI, human-agent collaboration, retrieval-augmented generation, personalization, graph data mining, and information retrieval. He will contribute to agentic system design, retrieval-based reasoning, and benchmark methodology.
Xueqing Peng is a postdoctoral associate at Yale University and a member of The Fin AI. Her research focuses on machine learning, natural language processing, large language model evaluation, and benchmark construction for high-stakes domains. She is the lead contributor of HERCULEAN and has contributed to financial reasoning and evaluation resources including MultiFinBen and Fin-o1.
Yuechen Jiang is a researcher in financial AI, misinformation detection, and multimodal benchmark construction. She has contributed to financial and cross-domain evaluation resources including FinBen, MultiFinBen, RFC Bench, and Appear2Meaning, and has worked with The Fin AI and NaCTeM collaborators. She will support data construction, annotation design, and evaluation analysis.
Yupeng Cao is a PhD candidate at Stevens Institute of Technology and a member of The Fin AI. His research spans natural language processing, multimodal learning, trustworthy AI, and financial AI applications. He has contributed to financial AI benchmarks and systems including FinAudio, MultiFinBen, Fin-o1, and RFC Bench.
Preslav Nakov is a professor at MBZUAI. He has extensive experience organizing shared tasks and evaluation campaigns, including CLEF CheckThat!, SemEval, VarDial, WMT, NLP4IF, and ECML/PKDD-related tasks. He will advise on benchmark design, evaluation rigor, and community outreach.
Steve Liu is Associate Vice President of Research and Professor of Computer Science and Machine Learning at MBZUAI. He is also a professor at McGill University. His research interests include AI, machine learning, intelligent computing and communications systems, sustainable computing, IoT, and cyber-physical systems. He is an IEEE Fellow and a Fellow of the Canadian Academy of Engineering, and will advise on evaluation infrastructure, scalable system design, and research-to-practice outreach.
5 Format of the Competition
FinReason Cup will be organized as a competition-style Big Data Cup challenge. The public phase will release training and development data, baseline systems, and evaluation scripts. The private phase will evaluate systems on hidden test data. Teams may participate in any subset of the tasks, with each task maintaining its own leaderboard and evaluation protocol.
The proposed sector is Finance and Business Forecasting, matching the Big Data Cup topics of interest. We plan to host the competition on Kaggle, using public and private leaderboards where feasible, following the Big Data Cup proposal guidance (IEEE Big Data Cup Chairs, 2026). If Kaggle code-submission constraints make the HERCULEAN-derived environments difficult to package, we will use Kaggle for prediction-file submission and provide an organizer-maintained Docker evaluation server for the sequential hedging and XBRL auditing tasks. Prize-eligible finalist teams will be required to submit source code or a reproducible notebook, together with a six-page system-description report.
6 Description of the Tasks
We propose three shared tasks spanning symbolic financial reasoning, market decision-making, and audit verification. The first task is based on FinChain, while the second and third tasks are derived from HERCULEAN.
Task 1: Verifiable Financial Chain Reasoning. Given a natural-language financial problem generated from a parameterized symbolic template, the task is to output the final answer and a structured reasoning trace.
This task evaluates whether systems can perform multi-step financial calculations in a way that is both correct and verifiable. Problems cover financial topics such as accounting ratios, valuation, interest rates, portfolio analysis, risk metrics, derivatives, capital budgeting, and financial statement analysis. Each instance is backed by an executable trace, allowing automatic checking of both the final answer and the intermediate reasoning steps. Hidden test cases will be generated from held-out seeds and template variants to reduce memorization.
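To make the idea of a parameterized template with an executable trace concrete, the following is a minimal sketch in Python. It is not taken from FinChain: the template (`compound_interest_instance`), the `Instance` fields, the parameter ranges, and the seed values are all hypothetical and chosen only for illustration. The point it shows is that a seed fully determines the problem text, the intermediate steps, and the final answer, so held-out seeds yield unseen but automatically checkable instances.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    question: str
    steps: list    # ordered (step_name, value) pairs: the executable trace
    answer: float


def compound_interest_instance(seed: int) -> Instance:
    """Hypothetical template: future value under annual compounding."""
    rng = random.Random(seed)                       # seed fully determines the instance
    principal = rng.randrange(1_000, 10_001, 500)   # illustrative parameter ranges
    rate_pct = rng.randint(1, 10)
    years = rng.randint(1, 10)

    # Each intermediate quantity is computed in code, so it can be checked.
    growth = (1 + rate_pct / 100) ** years
    future_value = round(principal * growth, 2)

    question = (
        f"An investor deposits ${principal} at {rate_pct}% annual interest, "
        f"compounded yearly. What is the balance after {years} years?"
    )
    steps = [("growth_factor", growth), ("future_value", future_value)]
    return Instance(question, steps, future_value)


# Public instances come from one seed range; a hidden instance from a held-out seed.
public = [compound_interest_instance(s) for s in range(3)]
hidden = compound_interest_instance(10_000)
```

Regenerating an instance from the same seed reproduces it exactly, which is what allows organizers to rebuild and audit the hidden set after the competition.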
Task 2: Market-Neutral Hedging. Given a market environment with prices, news, and corporate filings for a pool of assets, the task is to select an asset pair and manage a market-neutral position over time.
This task is derived from the HERCULEAN hedging workflow. In the first stage, the system selects an ordered pair of assets from a candidate universe. In the second stage, it issues daily actions such as LONG SHORT, SHORT LONG, HOLD, or CLOSE. The position is implemented as a dollar-neutral pair trade, so performance depends on relative reasoning across assets rather than simple directional market exposure. To avoid leakage, the official evaluation will run on hidden time windows or held-out assets.
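The second stage can be sketched as follows, under assumptions that the proposal does not fix. The action spellings (`LONG_SHORT`, `SHORT_LONG`), the convention that an action chosen on day t earns the return from day t to day t+1, and the normalization per unit of notional in each leg are all illustrative choices, not the HERCULEAN specification.

```python
def pair_returns(prices_a, prices_b, actions):
    """Daily returns of a dollar-neutral pair trade, per unit of notional per leg.

    Assumption: actions[t] is decided on day t and earns the return from
    day t to day t+1, so each price list has len(actions) + 1 entries.
    """
    direction = 0   # +1: long A / short B, -1: short A / long B, 0: flat
    returns = []
    for t, action in enumerate(actions):
        if action == "LONG_SHORT":
            direction = 1
        elif action == "SHORT_LONG":
            direction = -1
        elif action == "CLOSE":
            direction = 0
        elif action != "HOLD":          # HOLD keeps the current position
            raise ValueError(f"unknown action: {action}")

        r_a = prices_a[t + 1] / prices_a[t] - 1
        r_b = prices_b[t + 1] / prices_b[t] - 1
        # Equal dollar amounts on both legs: only the return spread matters.
        returns.append(direction * (r_a - r_b))
    return returns
```

If both assets rise by the same percentage, the spread is zero and so is the position's return, which is the sense in which the trade removes simple directional market exposure.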
Task 3: Financial Audit Verification. Given an SEC-style filing bundle, a target US-GAAP concept, and a reporting period, the task is to identify the reported value and verify the correct value using the filing’s XBRL calculation network.
This task is derived from the HERCULEAN auditing workflow. Systems must reason over XBRL facts, calculation linkbases, taxonomy concepts, balance semantics, and period constraints. Unlike narrative financial analysis, this task leaves little room for approximate answers: a correct system must extract the reported value and compute or verify the value implied by the filing’s calculation network.
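A toy version of this check is sketched below. It assumes a calculation network stored as a mapping from a parent concept to weighted children, with weights of +1 or -1 as in XBRL calculation linkbases. The dictionary layout, the function name `audit`, the tolerance, and the numbers are made up for illustration and do not come from any real filing or from the HERCULEAN environment.

```python
# Toy calculation network: parent concept -> [(child concept, weight)].
calc_network = {
    "GrossProfit": [("Revenues", 1), ("CostOfRevenue", -1)],
}

# Toy reported facts for a single reporting period (illustrative values only).
facts = {"Revenues": 1200.0, "CostOfRevenue": 700.0, "GrossProfit": 480.0}


def audit(concept, network, facts, tol=0.5):
    """Compare the reported value with the value implied by its children."""
    reported = facts[concept]
    # Implied value: weighted sum of the children's reported facts.
    implied = sum(weight * facts[child] for child, weight in network[concept])
    return {
        "reported": reported,
        "implied": implied,
        "consistent": abs(reported - implied) <= tol,
    }


result = audit("GrossProfit", calc_network, facts)
```

Here the reported gross profit (480.0) disagrees with the value implied by revenues minus cost of revenue (500.0), so the instance would be flagged as inconsistent. A real system would additionally have to match facts by period and context and handle balance semantics, which this sketch omits.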
| Task | Source | Public Release | Private Evaluation |
|---|---|---|---|
| Task 1: Chain Reasoning | FinChain templates and executable traces | Training/dev instances with answers and traces | Held-out seeds and template variants |
| Task 2: Hedging | HERCULEAN market environment: prices, news, filings | Public market windows and baseline environment | Hidden time windows or held-out assets |
| Task 3: Audit Verification | HERCULEAN SEC/XBRL auditing environment | Public audit instances, filing bundles, taxonomy metadata | Newly collected hidden audit instances from public filings |
Table 1: Planned data sources and release strategy. Task 1 is generated from organizer-controlled templates. Tasks 2 and 3 are derived from HERCULEAN workflows grounded in public financial data and filings.
7 Datasets and Statistics
Task 1. We will release FinChain-style training and development instances, including natural-language problems, final answers, executable Python traces, and the ChainEval evaluator. The hidden test set will be generated from held-out random seeds and template variants so that participants cannot solve the task by memorizing public instances.
Task 2. We will release a public HERCULEAN-derived hedging environment with market data, daily news summaries, and corporate filing signals. The private evaluation will use hidden market windows or held-out assets. Participants will submit a reproducible system that can be run by the organizers without access to future information.
Task 3. We will release public audit-development instances with filing bundles, target concepts, taxonomy metadata, and expected outputs. The private evaluation will use newly collected SEC/XBRL filing cases and hidden labels. All filing data will be drawn from public sources, and we will complete a legal review before redistributing derived artifacts.
We summarize our planned data release strategy in Table 1.
8 Evaluation Methods
For Task 1, systems will be evaluated using final-answer accuracy and ChainEval. Final-answer accuracy checks whether the final numerical or categorical answer is correct, with tolerance-based matching for numerical outputs. ChainEval evaluates step-level alignment between the submitted reasoning trace and the executable reference trace.
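As an illustration of tolerance-based matching, the sketch below uses Python's `math.isclose`. The tolerance values and both function names are assumptions, and `step_match_rate` is only a simplified positional comparison; it is not the ChainEval metric, whose alignment procedure is defined by the FinChain evaluator.

```python
import math


def answers_match(pred, gold, rel_tol=1e-4, abs_tol=1e-6):
    """True if pred equals gold: exact (case-insensitive) for categorical
    answers, within a numeric tolerance otherwise. Tolerances are illustrative."""
    if isinstance(gold, str):
        return isinstance(pred, str) and pred.strip().lower() == gold.strip().lower()
    try:
        return math.isclose(float(pred), float(gold), rel_tol=rel_tol, abs_tol=abs_tol)
    except (TypeError, ValueError):
        return False   # unparsable prediction counts as wrong


def step_match_rate(pred_steps, gold_steps):
    """Fraction of reference step values reproduced at the same position.
    A simplified stand-in for step-level alignment, not ChainEval itself."""
    if not gold_steps:
        return 0.0
    hits = sum(answers_match(p, g) for p, g in zip(pred_steps, gold_steps))
    return hits / len(gold_steps)
```

Under these illustrative tolerances, a prediction of 1628.9 against a reference of 1628.89 counts as correct, while a trace that reproduces one of two reference steps scores 0.5.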
For Task 2, systems will be evaluated using financial outcome metrics over the private hedging horizon. The primary metric will be the Sharpe Ratio (SR) (Sharpe, 1994), with Cumulative Return (CR) (Ariel, 1987) and Maximum Drawdown (MDD) (Magdon-Ismail and Atiya, 2004) reported as secondary metrics. We will also enforce validity constraints: systems must select valid pairs, avoid future information, and produce actions in the required schema.
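The three Task 2 metrics can be computed from a series of daily strategy returns as in the sketch below. The annualization factor of 252 trading days, the zero default risk-free rate, and the use of the sample standard deviation are common conventions assumed here; the proposal does not specify them, and the official evaluator may differ.

```python
import math
import statistics


def cumulative_return(returns):
    """Compounded return over the whole horizon."""
    wealth = 1.0
    for r in returns:
        wealth *= 1 + r
    return wealth - 1


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized mean excess return divided by its sample standard deviation.
    Needs at least two observations; returns 0.0 if the series has no variance."""
    excess = [r - risk_free for r in returns]
    sd = statistics.stdev(excess)
    if sd == 0:
        return 0.0
    return statistics.mean(excess) / sd * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough loss of the wealth curve, as a positive fraction."""
    wealth, peak, mdd = 1.0, 1.0, 0.0
    for r in returns:
        wealth *= 1 + r
        peak = max(peak, wealth)
        mdd = max(mdd, (peak - wealth) / peak)
    return mdd
```

For example, daily returns of +10%, -50%, and +20% give a cumulative return of -34% and a maximum drawdown of 50%, since the wealth curve falls from 1.10 to 0.55 before partially recovering.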
| Milestone | Tentative Date |
|---|---|
| Data release and platform setup | June 1, 2026 |
| Public leaderboard opens | June 2026 |
| Letter of intent due | June 25, 2026 |
| Private evaluation period | October-November 2026 |
| Final report/solutions due | November 15, 2026 |
| Winners announced | November 25, 2026 |
| Big Data Cup session | December 2026 |
Table 2: Tentative timeline for FinReason Cup, following the Big Data Cup competition schedule.
For Task 3, systems will be evaluated using the audit-verification metrics from HERCULEAN: Accuracy (ACC), Structural Error Rate (SER), Extraction Error Rate (EER), and Calculation Error Rate (CER). ACC measures the proportion of instances passing all checks. SER captures malformed outputs, EER captures incorrect reported-value extraction, and CER captures incorrect recomputation or verification after correct extraction.
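One way to read these four metrics is as a sequential check in which each instance receives exactly one outcome. The sketch below follows that reading. The output field names (`reported_value`, `verified_value`), the tolerance, and the choice to divide every count by the total number of instances are assumptions made for illustration, not the HERCULEAN definitions.

```python
from collections import Counter


def classify(output, gold, tol=0.5):
    """Return 'ok', 'structural', 'extraction', or 'calculation' for one instance."""
    # Structural check: required fields must exist and parse as numbers.
    try:
        reported = float(output["reported_value"])
        verified = float(output["verified_value"])
    except (KeyError, TypeError, ValueError):
        return "structural"
    # Extraction check: the reported value must match the filing.
    if abs(reported - gold["reported_value"]) > tol:
        return "extraction"
    # Calculation check: only reached after a correct extraction.
    if abs(verified - gold["verified_value"]) > tol:
        return "calculation"
    return "ok"


def audit_metrics(outputs, golds, tol=0.5):
    """ACC, SER, EER, and CER as fractions of all instances (assumed denominator)."""
    counts = Counter(classify(o, g, tol) for o, g in zip(outputs, golds))
    n = len(golds)
    return {
        "ACC": counts["ok"] / n,
        "SER": counts["structural"] / n,
        "EER": counts["extraction"] / n,
        "CER": counts["calculation"] / n,
    }
```

With this shared denominator the four values sum to one, which makes the error breakdown easy to read on a leaderboard. If CER were instead normalized by the number of correctly extracted instances, that property would no longer hold.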
9 Session Structure
We expect FinReason Cup to be presented as part of the IEEE Big Data Cup program. We envision (1) a short organizer presentation summarizing the tasks, data, evaluation protocol, and main lessons learned; (2) presentations by the best-performing systems for the individual tasks and the best reproducible open-source system; (3) a poster or demo session for participating teams; and (4) a short panel or discussion on verifiable financial AI, including challenges in reasoning-trace evaluation, hidden-test construction, and responsible use of financial agents. Table 2 gives the tentative timeline.
We propose awards for the best system in each task, the best reproducible open-source system, and the best student team. If IEEE Big Data Cup prize support or external sponsorship is available, we will allocate cash prizes across the main categories. Otherwise, awards will be given as official certificates and conference recognition.
References
Robert A. Ariel. 1987. A monthly effect in stock returns. Journal of Financial Economics, 18(1):161-174.
IEEE Big Data Cup Chairs. 2026. Call for Big Data Cup proposals. Conference call for proposals. https://bigdataieee.org/BigData2026/calls/bigdata-cup/.
Zhiwei Liu, Keyi Wang, Zhuo Bao, Xin Zhang, Jiping Dong, Kailai Yang, Mohsinul Kabir, Polydoros Giannouris, Rui Xing, Park Seongchan, et al. 2025. FinNLP-FNP-LLMFinLegal-2025 shared task: Financial misinformation detection challenge task. In Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal), pages 271-276.
Malik Magdon-Ismail and Amir F. Atiya. 2004. Maximum drawdown. Risk Magazine, 17(10):99-102.
Sanad Malaysha, Mo El-Haj, Saad Ezzini, Mohammed Khalilia, Mustafa Jarrar, Sultan Almujaiwel, Ismail Berrada, and Houda Bouamor. 2024. AraFinNLP 2024: The first Arabic financial NLP shared task. arXiv preprint arXiv:2407.09818.
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Xueqing Peng, Triantafillos Papadopoulos, Efstathia Soufleri, Polydoros Giannouris, Ruoyu Xiang, Yan Wang, Lingfei Qian, Jimin Huang, Qianqian Xie, and Sophia Ananiadou. 2025a. Plutus: Benchmarking large language models in low-resource Greek finance. arXiv preprint arXiv:2502.18772.
Xueqing Peng, Lingfei Qian, Yan Wang, Ruoyu Xiang, Yueru He, Yang Ren, Mingyang Jiang, Jeff Zhao, Huan He, Yi Han, et al. 2025b. MultiFinBen: A multilingual, multimodal, and difficulty-aware benchmark for financial LLM evaluation. arXiv preprint arXiv:2506.14028.
Xueqing Peng, Zhuohan Xie, et al. 2026. HERCULEAN: An agentic benchmark for financial intelligence. Preprint.
A. M. Sandoval, B. C. Coronado, J. P. Zamorano, Y. A. T. Orta, and D. Samy. 2025. The financial document causality detection shared task (FinCausal 2025). In Proceedings of the International Conference on Computational Linguistics, COLING, pages 214-221.
William F. Sharpe. 1994. The Sharpe ratio. Journal of Portfolio Management, 21(1):49-58.
Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, et al. 2024a. FinBen: A holistic financial benchmark for large language models. Advances in Neural Information Processing Systems, 37:95716-95743.
Qianqian Xie, Jimin Huang, Dong Li, Zhengyu Chen, Ruoyu Xiang, Mengxi Xiao, Yangyang Yu, Vijayasai Somasundaram, Kailai Yang, Chenhan Yuan, et al. 2024b. FinNLP-AgentScen-2024 shared task: Financial challenges in large language models-FinLLMs. In Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning, pages 119-126.
Zhuohan Xie et al. 2025. FinChain: A symbolic benchmark for verifiable chain-of-thought financial reasoning. arXiv preprint arXiv:2506.02515.
