Best Reasoning LLMs

Top AI models ranked by reasoning ability

o3 is the strongest reasoning model in this ranking, scoring 80 on reasoning and 85 on math while supporting vision, function calling, and tool use, which makes it the choice when output quality matters more than cost.

#	Model	Provider	Reasoning score	Price (in/out)	Value
1	o3	OpenAI	80%	$2 / $8	10.1
2	GPT-5.5	OpenAI	78%	$5 / $30	2.7
3	GPT-5.4	OpenAI	76%	$2.5 / $15	5.3
4	Claude Opus 4.8	Anthropic	75%	$5 / $25	3.1
5	Claude Opus 4.7	Anthropic	74%	$5 / $25	3.1
6	Claude Opus 4.6	Anthropic	73%	$5 / $25	3.1
7	GPT-5.2	OpenAI	73%	$1.75 / $14	5.5
8	DeepSeek R1	DeepSeek	72%	$0.28 / $0.42	171.4
9	Gemini 3.1 Pro	Google	72%	$2 / $12	6.3
10	o1	OpenAI	72%	$15 / $60	1.2

Rankings based on public benchmark data. Prices in USD per 1M tokens (direct provider). Updated June 2026.

Our Take

o3 sets the ceiling. Reasoning at 80 and math at 85 are the highest marks in this lineup: on reasoning it sits two points above GPT-5.5 (78) and eight above the 72 shared by DeepSeek R1, Gemini 3.1 Pro, and o1. When you need the best possible answer on a genuinely hard problem, this is the model.

It also brings the full capability set. Vision, function-calling, tool use, all present, with a 200K context window. That matters for reasoning because the hardest problems often involve images, structured data, or multi-step tool interaction. A model that reasons well but cannot call tools or read a diagram is constrained in ways o3 is not.

The cost is real. At $2/$8 per 1M tokens, o3 is not cheap, and reasoning models spend heavily on thinking tokens billed at output rates, so a single hard problem can produce a long and expensive reasoning trace. Budget for that if you commit.
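To make the thinking-token effect concrete, here is a minimal sketch of the arithmetic. It assumes the $2/$8 per-1M-token prices from the table and that thinking tokens are billed at the output rate, as described above. The function name `request_cost` and the token counts are hypothetical values chosen for illustration, not measurements.

```python
# Prices from the table above, in USD per 1M tokens.
O3_INPUT_PER_M = 2.00
O3_OUTPUT_PER_M = 8.00


def request_cost(input_tokens, answer_tokens, thinking_tokens,
                 input_per_m=O3_INPUT_PER_M, output_per_m=O3_OUTPUT_PER_M):
    """Estimate the USD cost of one request.

    Assumption: thinking tokens are billed at the output rate,
    so they are added to the visible answer tokens.
    """
    billed_output = answer_tokens + thinking_tokens
    return (input_tokens * input_per_m + billed_output * output_per_m) / 1_000_000


# Hypothetical hard problem: 5K-token prompt, 1K-token answer, 20K thinking tokens.
with_thinking = request_cost(5_000, 1_000, 20_000)
without_thinking = request_cost(5_000, 1_000, 0)

print(f"with thinking:    ${with_thinking:.3f}")
print(f"without thinking: ${without_thinking:.3f}")
```

Under these made-up token counts, the thinking tokens account for most of the bill, which is why the per-token price alone understates what a reasoning-heavy workload costs.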

The alternative worth naming is DeepSeek R1, at 72 reasoning and 78 math. At $0.28/$0.42 against o3's $2/$8, it costs about one-seventh as much on input and about one-nineteenth as much on output, or roughly a tenth overall. For text-only reasoning at volume, R1 closes much of the gap and saves enormous sums. o3 earns its premium specifically when you need the top of the curve, vision, or tool calls, not for routine analytical work where R1 suffices.
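How close R1 comes to "a tenth of the price" depends on the mix of input and output tokens. The sketch below uses only the listed prices; the `output_share` parameter (the fraction of billed tokens that are output, including thinking tokens) and the helper names are assumptions introduced for illustration.

```python
# (input, output) prices in USD per 1M tokens, taken from the table above.
PRICES = {
    "o3": (2.00, 8.00),
    "DeepSeek R1": (0.28, 0.42),
}


def blended_price(model, output_share):
    """Average price per 1M tokens when `output_share` of tokens are output."""
    input_price, output_price = PRICES[model]
    return input_price * (1 - output_share) + output_price * output_share


def r1_cost_fraction(output_share):
    """R1's blended price as a fraction of o3's at the same token mix."""
    return blended_price("DeepSeek R1", output_share) / blended_price("o3", output_share)


# Sweep from input-only to output-only workloads.
for share in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"output share {share:.2f}: R1 costs {r1_cost_fraction(share):.3f}x o3")
```

The fraction shrinks as the workload becomes more output-heavy, so reasoning workloads that emit many thinking tokens sit at the cheaper end of the range for R1.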

o3 remains the ceiling for reasoning that touches tools, and nothing here closes that gap. R1 is the value play for pure-text chains: roughly the same thinking, none of the tool-calling tax.

Last updated June 2026
