I'm thrilled to share our new paper, "FinForge: Semi-Synthetic Financial Benchmark Generation," accepted at AI4Finance, AAAI'26!
Key Contributions:
- FinForge Framework: A hybrid pipeline integrating manual/programmatic corpus construction with rigorous LM-based synthesis (see the sketch below).
- FinForge-5k Dataset: A new snapshot benchmark comprising over 5,000 human-validated Q&A pairs across 11 financial subdomains, derived from a curated corpus of 100,000 verified documents (143M tokens).
- Benchmarking Results: Evaluation of state-of-the-art open- and closed-source models reveals significant variance in financial reasoning capabilities, with leading models achieving approximately 80% accuracy.
Huge thanks to my co-authors @glennmatlin, Anant Gupta, Anirudh JM, Rayan Castilla, and Yi Mei Ng for this collaboration.
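For intuition, here's a minimal sketch of what a semi-synthetic Q&A loop of this shape could look like. Every name and prompt here is hypothetical, not FinForge's actual API; the real pipeline (corpus curation, synthesis constraints, validation protocol) is in the paper.

```python
"""Toy semi-synthetic Q&A loop: verified documents in, LM-drafted candidate
Q&A pairs out, with a human-validation gate before anything enters the
benchmark. All names are hypothetical, not FinForge's actual code."""
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class QAPair:
    question: str
    answer: str
    source_doc_id: str
    validated: bool = False  # flipped to True only after human review

def synthesize(doc_id: str, text: str,
               lm: Callable[[str], Tuple[str, str]]) -> QAPair:
    """Prompt an LM for one Q&A pair grounded in a single verified document."""
    prompt = ("Write one question answerable solely from the document below, "
              "then its answer.\n\n" + text)
    question, answer = lm(prompt)
    return QAPair(question, answer, doc_id)

# Stub LM so the sketch runs end to end; swap in a real model call.
def stub_lm(prompt: str) -> Tuple[str, str]:
    return "What revenue does the filing report?", "$1.2B"

candidate = synthesize("doc-00001", "…verified filing excerpt…", stub_lm)
print(candidate)  # goes to human validators; kept only if validated is True
```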
Is it better to show a model many different images once (Diversity), or to extract as much information as possible from a small set of images (Density)?
I have always wanted to do an ablation study on this and recently I got the chance to do exactly that. Why? In applied domains like robotics, manufacturing, or banking, we rarely have the luxury of internet-scale diverse image datasets. We are often "Data Poor" in terms of diversity but "Data Rich" in depth.
The takeaway? Density is efficient for facts but dangerous for reasoning (logical collapse) if you don't have larger-scale data.
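To make the setup concrete, here's a budget-matched sketch of the two arms such an ablation compares: the same total number of training samples, spread over many unique images in one arm and mined deeply from a small set in the other. The budget, depth factor, and arm construction are illustrative assumptions, not the exact experimental config.

```python
"""Diversity vs. Density under a fixed sample budget: both arms see exactly
BUDGET training samples, but the dense arm draws them from far fewer unique
images. Numbers and names are illustrative, not the actual experiment."""

BUDGET = 10_000   # total (image, annotation) samples per arm
DEPTH = 8         # annotations extracted per image in the dense arm

def make_arm(images, budget, depth):
    """Return (image, annotation_index) pairs totalling exactly `budget`."""
    unique = images[: budget // depth]  # fewer unique images when depth > 1
    return [(img, k) for img in unique for k in range(depth)]

images = [f"img_{i:05d}" for i in range(BUDGET)]

diverse_arm = make_arm(images, BUDGET, depth=1)      # many images, seen once
dense_arm   = make_arm(images, BUDGET, depth=DEPTH)  # few images, mined deeply

assert len(diverse_arm) == len(dense_arm) == BUDGET  # budget-matched arms
print(len({i for i, _ in diverse_arm}), "vs", len({i for i, _ in dense_arm}))
# Train one model per arm, then compare fact-recall vs. reasoning evals to
# see where density pays off and where it collapses.
```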
For anyone who likes getting their papers summarized, I've tried to make a bento-card summarizer for research papers: it gives you a quick overview of a paper, with optional deep dives where you're interested. Of course, this has a lot of room to improve, so do check out the space and give your feedback!
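Under the hood, the idea is just a two-tier summary: a short tile per section up front, and a deep dive generated only when you expand one. Here's a rough sketch of that flow; the function names and prompts are placeholders, not the space's actual code.

```python
"""Two-tier 'bento card' flow: short overview tiles for every section, with
deep dives rendered lazily on request. Names and prompts are placeholders."""
from typing import Callable

SECTIONS = ["Problem", "Method", "Results", "Limitations"]

def bento_overview(paper_text: str, summarize: Callable[[str], str]) -> dict:
    """One two-sentence tile per section; cheap enough to render up front."""
    return {s: summarize(f"In two sentences, state the paper's {s.lower()}:"
                         f"\n\n{paper_text}")
            for s in SECTIONS}

def deep_dive(paper_text: str, section: str,
              summarize: Callable[[str], str]) -> str:
    """Generated only when the reader expands a tile."""
    return summarize(f"Explain the paper's {section.lower()} in detail, "
                     f"with evidence:\n\n{paper_text}")

# Stub summarizer so the sketch runs; the app would call a real LLM here.
tiles = bento_overview("…full paper text…", lambda p: "stub summary")
print(tiles["Method"])
```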