Ai IFC - Artificial intelligence
Ai IFC Ask anything, can help with writing, studying, exchanging ideas and more.

Ai IFC o1-mini

Ai IFC o1-mini

We're releasing OpenAI o1-mini, a cost-efficient reasoning model. o1-mini excels at STEM, especially math and coding—nearly matching the performance of OpenAI o1 on evaluation benchmarks such as AIME and Codeforces. We expect o1-mini will be a faster, cost-effective model for applications that require reasoning without broad world knowledge.

Today, we are launching o1-mini to tier 5 API users⁠(opens in a new window) at a cost that is 80% cheaper than OpenAI o1-preview. ChatGPT Plus, Team, Enterprise, and Edu users can use o1-mini as an alternative to o1-preview, with higher rate limits and lower latency (see Model Speed⁠).

Optimized for STEM Reasoning

Large language models such as o1 are pre-trained on vast text datasets. While these high-capacity models have broad world knowledge, they can be expensive and slow for real-world applications. In contrast, o1-mini is a smaller model optimized for STEM reasoning during pretraining. After training with the same high-compute reinforcement learning (RL) pipeline as o1, o1-mini achieves comparable performance on many useful reasoning tasks, while being significantly more cost efficient.

When evaluated on benchmarks requiring intelligence and reasoning, o1-mini performs well compared to o1-preview and o1. However, o1-mini performs worse on tasks requiring non-STEM factual knowledge (see Limitations⁠).

Math Performance vs Inference Cost

GPT-4o

GPT-4o mini

o1-preview

o1-mini

o1

Mathematics: In the high school AIME math competition, o1-mini (70.0%) is competitive with o1 (74.4%)–while being significantly cheaper–and outperforms o1-preview (44.6%). o1-mini’s score (about 11/15 questions) places it in approximately the top 500 US high-school students.

Coding: On the Codeforces competition website, o1-mini achieves 1650 Elo, which is again competitive with o1 (1673) and higher than o1-preview (1258). This Elo score puts the model at approximately the 86th percentile of programmers who compete on the Codeforces platform. o1-mini also performs well on the HumanEval coding benchmark and high-school level cybersecurity capture the flag challenges (CTFs).

Codeforces

HumanEval

Cybersecurity CTF

STEM: On some academic benchmarks requiring reasoning, such as GPQA (science) and MATH-500, o1-mini outperforms GPT-4o. o1-mini does not perform as well as GPT-4o on tasks such as MMLU and lags behind o1-preview on GPQA due to its lack of broad world knowledge.

MMLU

0-shot CoT

GPQA

Diamond, 0-shot CoT

MATH-500

0-shot CoT

Human preference evaluation: We had human raters compare o1-mini to GPT-4o on challenging, open-ended prompts in various domains, using the same methodology as our o1-preview vs GPT-4o comparison. Similar to o1-preview, o1-mini is preferred to GPT-4o in reasoning-heavy domains, but is not preferred to GPT-4o in language-focused domains.

Human preference evaluation vs chatgpt-4o-latest

o1-preview

o1-mini

Personal WritingEditing TextComputer ProgrammingData AnalysisMathematical Calculation020406080100Win Rate vs GPT-4o (%)Domain

Model Speed

As a concrete example, we compared responses from GPT-4o, o1-mini, and o1-preview on a word reasoning question. While GPT-4o did not answer correctly, both o1-mini and o1-preview did, and o1-mini reached the answer around 3-5x faster.

o1-mini is trained using the same alignment and safety techniques as o1-preview. The model has 59% higher jailbreak robustness on an internal version of the StrongREJECT dataset compared to GPT-4o. Before deployment, we carefully assessed the safety risks of o1-mini using the same approach to preparedness, external red-teaming, and safety evaluations as o1-preview. We are publishing the detailed results from these evaluations in the accompanying system card.

Metric

GPT-4o

o1-mini

% Safe completions refusal on harmful prompts (standard)

% Safe completions on harmful prompts (Challenging: jailbreaks & edge cases)

% Compliance on benign edge cases (“not over-refusal”)

Goodness@0.1 StrongREJECT jailbreak eval  (Souly et al. 2024⁠(opens in a new window)

Human sourced jailbreak eval

Limitations and What’s Next

Due to its specialization on STEM reasoning capabilities, o1-mini’s factual knowledge on non-STEM topics such as dates, biographies, and trivia is comparable to small LLMs such as GPT-4o mini. We will improve these limitations in future versions, as well as experiment with extending the model to other modalities and specialities outside of STEM.

Posting Komentar