-
Hammer: Robust Function-Calling for On-Device Language Models via Function Masking
Paper • 2410.04587 • Published • 2 -
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs
Paper • 2402.15491 • Published • 15 -
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
Paper • 2406.18518 • Published • 25 -
APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay
Paper • 2504.03601 • Published • 18
Collections
Collections including paper arxiv:2509.24002
-
Attention Is All You Need
Paper • 1706.03762 • Published • 121 -
Scaling Laws for Neural Language Models
Paper • 2001.08361 • Published • 10 -
Training Compute-Optimal Large Language Models
Paper • 2203.15556 • Published • 11 -
Analogy Generation by Prompting Large Language Models: A Case Study of InstructGPT
Paper • 2210.04186 • Published
-
Less is More: Recursive Reasoning with Tiny Networks
Paper • 2510.04871 • Published • 513 -
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
Paper • 2509.25541 • Published • 142 -
Agent Learning via Early Experience
Paper • 2510.08558 • Published • 277 -
DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
Paper • 2509.25454 • Published • 148
-
RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation
Paper • 2509.16198 • Published • 129 -
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Paper • 2509.16941 • Published • 21 -
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
Paper • 2303.08896 • Published • 4 -
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Paper • 2404.16130 • Published • 7
-
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
Paper • 2508.20453 • Published • 63 -
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
Paper • 2509.24002 • Published • 180 -
TheMCPCompany: Creating General-purpose Agents with Task-specific Tools
Paper • 2510.19286 • Published • 9 -
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
Paper • 2508.14704 • Published • 43
-
The Smol Training Playbook
📚 3.12k • The secrets to building world-class LLMs
-
LLM-in-Sandbox Elicits General Agentic Intelligence
Paper • 2601.16206 • Published • 86 -
EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience
Paper • 2601.15876 • Published • 92 -
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
Paper • 2510.08697 • Published • 39
-
LongCodeZip: Compress Long Context for Code Language Models
Paper • 2510.00446 • Published • 108 -
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
Paper • 2509.26507 • Published • 550 -
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
Paper • 2509.24002 • Published • 180 -
GEM: A Gym for Agentic LLMs
Paper • 2510.01051 • Published • 91
-
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Paper • 2508.15760 • Published • 47 -
LiveMCPBench: Can Agents Navigate an Ocean of MCP Tools?
Paper • 2508.01780 • Published • 21 -
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
Paper • 2304.08244 • Published • 1 -
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
Paper • 2508.16153 • Published • 162
-
Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
Paper • 2508.09789 • Published • 5 -
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Paper • 2508.13186 • Published • 20 -
ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
Paper • 2508.04038 • Published • 1 -
Prompt Orchestration Markup Language
Paper • 2508.13948 • Published • 48