
Qwen3.7-Max is a new-generation flagship model designed for the era of intelligent agents.
Qwen3.7-Max is designed to serve as a universal agentic foundation: it can write and debug code, automate office workflows, and sustain autonomous execution across tasks spanning hundreds or even thousands of steps.
The core strengths of Qwen3.7-Max lie in the breadth and depth of its agentic capabilities. In coding, it handles everything from front-end prototyping to complex multi-file engineering projects. In office productivity, it automates workflows through MCP integration and multi-agent collaboration. In long-horizon autonomous execution, it maintained coherent reasoning throughout a fully autonomous kernel optimization experiment that lasted 35 hours and involved more than 1,000 tool calls, demonstrating stable execution over extended periods. It also performs consistently whether it is deployed under Claude Code, OpenClaw, Qwen Code, or another framework, which reflects strong cross-framework generalization.
In coding agent benchmarks, Qwen3.7-Max achieves leading results on SWE-Pro (60.6), SWE-Multilingual (78.3), SciCode (53.5), and QwenSVG (1608). On Terminal Bench 2.0-Terminus it scores 69.7, surpassing DS-V4-Pro Max (67.9). On SWE-Verified it scores 80.4, on par with Opus-4.6 Max (80.8) and DS-V4-Pro Max (80.6).
Gains in general agent benchmarks are even more notable. Qwen3.7-Max leads on MCP-Mark (60.8 vs. GLM-5.1's 57.5), MCP-Atlas (76.4 vs. Opus-4.6's 75.8), and Skillsbench (59.2 vs. K2.6's 56.2), while demonstrating strong GPU kernel optimization capabilities on Kernel Bench L3 with a median speedup of 1.98x and a 96% acceleration rate. It also performs strongly on BFCL-V4 (75.0), Qwenclaw (64.3), and ClawEval (65.2), closely trailing Opus-4.6 Max. On the office automation benchmark SpreadSheetBench-v1, it achieves a top-tier score of 87.0.
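The two Kernel Bench figures above can be read as simple aggregates over per-task timings. The sketch below shows one plausible way to compute them. It assumes that speedup is baseline time divided by optimized time, and that the acceleration rate is the share of tasks with a speedup above 1. The source does not define either metric, and the function name and timing values are hypothetical.

```python
from statistics import median


def summarize_speedups(baseline_ms, optimized_ms):
    """Return (median speedup, fraction of tasks with speedup > 1).

    Assumed definitions, not taken from the benchmark itself:
    speedup = baseline time / optimized time, per task.
    """
    if len(baseline_ms) != len(optimized_ms) or not baseline_ms:
        raise ValueError("need two non-empty lists of equal length")
    speedups = [b / o for b, o in zip(baseline_ms, optimized_ms)]
    accelerated = sum(1 for s in speedups if s > 1.0)
    return median(speedups), accelerated / len(speedups)


# Hypothetical timings in milliseconds; these are not Kernel Bench data.
baseline = [10.0, 8.0, 12.0, 5.0]
optimized = [5.0, 4.0, 4.0, 5.5]

median_speedup, acceleration_rate = summarize_speedups(baseline, optimized)
print(f"median speedup: {median_speedup:.2f}x, accelerated: {acceleration_rate:.0%}")
```

Reporting the median rather than the mean keeps a few very large speedups from dominating the summary, which may be why a median is quoted here.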
In reasoning benchmarks, Qwen3.7-Max takes the lead on GPQA Diamond (92.4 vs. Opus-4.6's 91.3), HLE (41.4 vs. Opus-4.6's 40.0), HMMT 2026 Feb (97.1 vs. Opus-4.6's 96.2), IMOAnswerBench (90.0 vs. DS-V4-Pro's 89.8), and Apex (44.5 vs. DS-V4-Pro's 38.3).
In general capabilities and multilingual performance, Qwen3.7-Max stands out on IFBench (79.1 vs. DS-V4-Pro's 77.0), showcasing precise instruction-following ability. It also leads on WMT24++ (85.8) and MAXIFE (89.2), indicating top-tier multilingual comprehension and translation quality. Strong results on SuperGPQA (73.6) and QwenWorldBench (57.3) further reinforce its broad general competence.
The benchmark scores above were obtained across a diverse range of agentic frameworks. Qwen3.7-Max is not optimized for any single framework: it delivers consistently strong performance under Claude Code, OpenClaw, Qwen Code, and various custom tool-use frameworks, making it a reliable foundation for a wide range of agentic systems.