An analysis of OpenAI models: o1 vs GPT-4o

Erich H.

Sep 27, 2024 — 11 min read

Analysis: OpenAI o1 vs GPT-4o

It’s been two weeks since OpenAI announced a new pair of flagship models: OpenAI o1 and o1 mini. These models build on the foundation laid by their GPT-4o predecessors.

The o1 and o1-mini models mark a significant advance in complex reasoning and problem-solving. They are designed to tackle some of the most challenging tasks across various fields.

In a comparison on the International Mathematics Olympiad (IMO) qualifying exam, GPT-4o managed to solve only 13% of the problems correctly, while the o1 model achieved an 83% success rate.

Let's dig into what these models are and an analysis of OpenAI o1 vs GPT-4o, as well as their counterparts OpenAI o1 mini vs GPT-4o mini.

o1 vs all LLMs

The LLM Arena leaderboard is the first place we look when a new model is released.

Its leaderboard showcases the performance rankings of different language models. They compare how a model scores across different benchmarks and tasks and give scores on how models perform. This can show you how a model stacks up against other models, in terms of capabilities and accuracy.

The o1-preview and o1-mini models rank first and third, respectively, on the leaderboard as of today.

o1-preview takes the top spot and o1-mini takes third.

How OpenAI o1 and o1 mini work

These models are trained to spend time thinking through problems before generating a response. This trained thought process is meant to mimic the ways humans solve problems. Specifically designed for complex reasoning tasks, OpenAI believes these models enable new capabilities not accessible by previous models.

The o1 models are not intended to be used for everyday tasks. OpenAI has specified they excel in the fields of science, coding, math, and the like.

o1-preview thinking through a mathematical question in ChatGPT

How to access OpenAI o1 and o1 mini models

These models are currently labeled as an “early preview.” ChatGPT plus users and developers and access the models directly in ChatGPT or for their own products via the API. They currently lack access to web browsing, file and image uploading, and other available functions to models such as GPT-4o.

Access o1 and o1 mini in ChatGPT

To use these models in ChatGPT, first make sure you are a paid subscriber of ChatGPT Plus or ChatGPT for Teams.

Step 1: Head to ChatGPT
Step 2: Choose o1-preview or o1-mini from the dropdown menu
Step 3: Submit a prompt that requires complex reasoning or thinking

Access o1 and o1 mini through the API

To use these models through the api, you must first have a developer account with Tier 4 access (at this time).

Step 1: Visit platform.openai.com
Step 2: Generate an API key with proper permissions
Step 3: Make a call through the chat completions endpoint to either o1-preview or o1-mini

Current o1 and o1 mini limitations

There are beta limitations to using these models both in ChatGPT and the API.

They are text input only
User and assistant messages are supported. System messages are not yet supported.
Streaming tokens are not supported.
The Assistants API and Batch API are not supported.
Tools, function calling, and response format parameters are not supported.
Logprobs are not supported.
Temperature, top_p and n are fixed at 1 and presence_penalty and frequency_penalty are fixed at 0.

OpenAI has indicated that support for some of these parameters may be added in the coming weeks as these models transition out of beta.

Comparing OpenAI o1 vs GPT-4o

To understand the differences in these models, let’s look at 4o’s costs, capabilities, specifications, and specializations for each model side-by-side with its o1 counterpart:

Cost comparisons: o1 vs GPT-4o and o1-mini vs GPT-4o-mini

Model	Input Tokens Cost	Output Tokens Cost
o1	$15 / 1M input tokens	$60 / 1M output tokens
o1 mini	$3 / 1M input tokens	$12 / 1M output tokens
gpt-4o	$5 / 1M input tokens	$15 / 1M output tokens
gpt-4o-mini	$0.15 / 1M input tokens	$0.6 / 1M output tokens

*Please note that when using these models directly in ChatGPT, you are not charged per token. The cost analysis presented here pertains solely to API usage.

Let's compare the increases in input and output tokens:

Transition	Input Tokens Cost Increase	Output Tokens Cost Increase
GPT-4o to o1	200% increase (3× higher)	300% increase (4× higher)
GPT-4o-mini to o1-mini	1,900% increase (20× higher)	1,900% increase (20× higher)

Now let's do an overall comparison of o1-preview and GPT-4o:

GPT-4o vs o1-preview: Overall comparison

Model	GPT-4o	o1-preview
Model Description	High-intelligence flagship model for complex, multi-step tasks. Multimodal: accepts text and image inputs, outputs text. 2x faster and 50% cheaper than GPT-4 Turbo. Superior performance in non-English languages and vision tasks.	Reasoning model trained to solve hard problems across domains. Produces a long internal chain of thought before responding. Focused on complex reasoning and problem-solving.
Intelligence & Reasoning	High intelligence equivalent to GPT-4 Turbo. Suitable for complex, multi-step tasks.	Specialized in complex reasoning and multi-domain problem-solving. Trained with reinforcement learning to enhance reasoning abilities.
Internal Reasoning Process	Focuses on delivering quick and efficient responses. No explicit mention of internal reasoning chains.	Emphasizes a long internal chain of thought ("thinks before answering"). Aims to improve reasoning quality through deep processing.
Speed	Generates text 2x faster than GPT-4 Turbo. Optimized for efficiency.	No explicit speed metrics provided. Deep reasoning process may result in longer response times.
Cost	50% cheaper than GPT-4 Turbo. Cost-effective for high-intelligence tasks.	Specific cost comparisons not provided. Designed for complex reasoning tasks.
Multimodality	Accepts text and image inputs; outputs text. Suitable for tasks requiring visual understanding.	No mention of multimodal capabilities. Likely limited to text inputs.
Specializations	Excels in non-English language performance. Superior vision capabilities.	Excels in solving hard problems across domains. Leverages complex reasoning.
Context Window	128,000 tokens	128,000 tokens
Max Output Tokens	Up to 16,384 tokens	Up to 32,768 tokens
Training Data	Up to Oct 2023	Up to Oct 2023
Ideal Use Cases	Tasks requiring high intelligence and efficiency. Multilingual applications. Tasks involving image inputs.	Complex problem-solving across various domains. Research and advanced reasoning tasks. Applications benefiting from deep internal processing.

Now, an overall comparison of the smaller o1-mini and GPT-4o-mini:

GPT-4o-mini and o1-mini: Overall comparison

Model	GPT-4o-mini	o1-mini
Model Description	Most advanced model in the small models category. Multimodal: accepts text and image inputs, outputs text. Higher intelligence than gpt-3.5-turbo; just as fast. OpenAI's cheapest model.	Faster and cheaper reasoning model. Particularly good at coding, math, and science. Produces a long internal chain of thought before responding.
Intelligence & Reasoning	More intelligent than gpt-3.5-turbo. Ideal for smaller tasks requiring advanced understanding.	Specialized in coding, math, and science reasoning. Benefits from deep internal reasoning process.
Internal Reasoning Process	Delivers quick and efficient responses. No explicit mention of internal reasoning chains.	Emphasizes a long internal chain of thought. Enhances reasoning quality through deep processing.
Speed	Matches the speed of gpt-3.5-turbo. Optimized for fast, lightweight tasks.	Marketed as faster and cheaper within reasoning models. Deep reasoning may affect response times.
Cost	OpenAI's cheapest model yet. More capable and cheaper than gpt-3.5-turbo.	Cheaper option within reasoning models. Specific cost details not provided.
Multimodality	Accepts text and image inputs; outputs text. Suitable for vision tasks.	No mention of multimodal capabilities. Likely limited to text inputs.
Specializations	Designed for fast, lightweight tasks, including vision. Recommended over gpt-3.5-turbo.	Particularly adept at coding, math, and science. Leverages specialized reasoning abilities.
Context Window	128,000 tokens	128,000 tokens
Max Output Tokens	Up to 16,384 tokens	Up to 65,536 tokens
Training Data	Up to Oct 2023	Up to Oct 2023
Ideal Use Cases	Small to medium tasks previously using gpt-3.5-turbo. Applications needing a balance of speed, cost and advanced capabilities. Vision tasks.	Coding projects, mathematical computations, scientific inquiries. Tasks requiring detailed reasoning without multimodal inputs. Situations benefiting from longer output responses.

Here is a breakdown of differences between to two pairs of models:

OpenAI o1 vs GPT 4o: Key Differences

Model	GPT-4o Series	o1 Series
Primary Focus	High intelligence with efficiency. Multimodal capabilities (text and images).	Complex reasoning and problem-solving. Specialized internal thought processes.
Multimodal Support	Yes (text and image inputs)	No explicit support mentioned
Speed and Cost	Faster and cheaper compared to previous models. Optimized for efficiency.	Reasoning models may have longer response times. Cost details less specific.
Ideal For	Applications needing speed, cost-effectiveness, and high intelligence. Tasks involving images and multiple languages.	Tasks requiring deep reasoning, such as complex problem-solving. Coding, math, and science applications.
Max Output Tokens	GPT-4o: Up to 16,384 tokens GPT-4o-mini: Up to 16,384 tokens	o1-preview: Up to 32,768 tokens o1-mini: Up to 65,536 tokens

When to use GPT-4o and OpenAI o1

While GPT-4o is optimized for general applications with rapid response times and multimodal support, the o1 series is optimized for tasks demanding intensive cognitive processing and extensive output capacities.

The GPT-4o series is designed for high efficiency and multimodal capabilities, including text and image inputs, making it ideal for applications that require speed, cost-effectiveness, and diverse functionalities.

The o1 series, in contrast, is focused on complex reasoning and problem-solving with specialized internal thought processes. It supports higher token limits specifically suited for deep reasoning tasks like coding, mathematics, and scientific research.

OpenAI's o1 and o1-mini models excel in complex reasoning tasks but come at a higher cost than the more efficient and cost-effective GPT-4o series. Take into account the capabilities and cost trade-offs to decide which model to use based on your needs.

Example prompts analyzing performance of OpenAI o1 vs GPT-4o

Let's look over some example prompts that highlight the complex reasoning abilities of o1-preview.

1. The 'Strawberry' test

Prompt:

How many r's are there in strawberry?

Explanation:

This is a simple evaluation used to assess the ability of LLMs to perform basic character-level task. While language models excel at generating text based on patterns they've learned, they can struggle with precise, low-level operations like counting individual letters.

How o1 responds:

o1 thinking and counting the correct number of r;s in the word strawberry

How GPT-4o responds:

The o1 model systematically examines each character, ensuring precise results. In contrast, GPT-4o, optimized for efficiency and speed, provides a quick but less accurate answer by not fully engaging in the processing needed for exact counting of the letters.

3. Math puzzles

Prompt:

Solve the game of 24 (use all 4 provided numbers exactly once each and +-/* to make 24) for [9 8 8 3]

Explanation:
This a mathematical puzzle that challenges the models to use all four numbers exactly once, combining them with basic arithmetic operations (addition, subtraction, multiplication, division) to reach the total of 24.

It serves as an example to illustrate the difference in how these models handle complex reasoning and problem-solving tasks.

How o1 responds:

How GPT-4o responds:

GPT-4o improperly trying the puzzle (8/2.112 = 3.78)

The o1 model outperforms GPT-4o in this mathematical puzzle by breaking down the process and thinking for 8 seconds before responding. It systematically explores possible solutions then provides a step-by-step solution that can be easily followed and verified.

GPT-4o lacks the reasoning process and cognitive capability needed for such puzzles and prioritizes a faster response, but not accurate, response.

3. Up-to-date Information

Prompt:

Who won the Nobel Prize in Literature in 2024?

Explanation:

OpenAI o1 has access to information up to today's current date and GPT-4o isn't aware of events after it's last update. It does have access to browsing, but may be speculative and make wrong assumptions.

As of today, September 27, 2024, the Nobel Prize in Literature for 2024 has not yet been announced and will be in early October.

How o1 responds:

o1 thinking for 9 seconds and responding correctly.

How GPT-4o responds:

GPT-4o searching the web and wrongly identifying Hélène Cixous

The knowledge cutoff for GPT-4o is September 2021. When asked about future events or information it doesn't possess, GPT-4o can attempt to generate a plausible answer based on patterns and probabilities learned during training.

O1 can access information up to today's date. This means it will think through and the scenario and know that the Nobel Prize in Literature for 2024 has not yet been announced.

In short, GPT-4o may speculate on current and future events where o1 can access current information and think through what that means for the future rather than speculating based on it's training.

4. Complex Mathematical Problem Solving

Prompt:

Prove that every bounded sequence in ℝ has a convergent subsequence. Provide a detailed explanation and proof.

Explanation:

This prompt requires the model to recall and articulate the Bolzano-Weierstrass Theorem, involving advanced mathematical reasoning and the construction of a formal proof. The o1 model, optimized for deep reasoning and higher token limits, can provide a comprehensive step-by-step proof, whereas GPT-4o might offer a more concise or less detailed response due to its optimization for efficiency.

How o1 responds:

How GPT-4o responds:

The o1 model is optimized for intensive cognitive processing and has higher token limits. This allows it to deliver detailed, step-by-step proofs and explanations required for advanced mathematical concepts.

In contrast, GPT-4o is designed for efficiency and may offer shorter, less detailed answers that might not fully address the complexity of advanced mathematical problems.

5. Advanced Coding Tasks

Prompt:

Write a Python function that implements the A (A-star) search algorithm to find the shortest path between two nodes in a graph. The function should handle graphs represented as adjacency lists, include an appropriate heuristic for estimating distances, and return both the optimal path and its total cost. Provide explanations and comments within your code.*

Explanation:

This task requires the implementation of the A* search algorithm, which combines features of Dijkstra's algorithm and heuristic-guided search. It involves complex data structures like priority queues, understanding of graph theory, and the application of heuristics to guide the search efficiently. The o1 model, optimized for intensive cognitive processing and extensive outputs, can generate accurate and well-documented code that correctly implements the algorithm. It can also provide detailed explanations of how the heuristic influences the search process and ensures optimality.

How o1 responds:

How GPT-4o responds:

For advanced coding tasks, the o1 model excels due to its capacity for deep reasoning and extensive output capabilities.

It can generate accurate, well-structured code for complex algorithms, include detailed comments, and explain the logic and time complexity behind the solution. This makes it highly suitable for sophisticated programming challenges.

GPT-4o, being optimized for general applications and efficiency, may not provide code that is as detailed or accurate, and might lack thorough explanations, making it less effective.

About PromptLayer

PromptLayer is a prompt management system that helps you iterate on prompts faster — further speeding up the development cycle! Use their prompt CMS to update a prompt, run evaluations, and deploy it to production in minutes. Check them out here. 🍰

An analysis of OpenAI models: o1 vs GPT-4o

Erich H.

o1 vs all LLMs

How OpenAI o1 and o1 mini work

How to access OpenAI o1 and o1 mini models

Access o1 and o1 mini in ChatGPT

Access o1 and o1 mini through the API

Current o1 and o1 mini limitations

Comparing OpenAI o1 vs GPT-4o

Cost comparisons: o1 vs GPT-4o and o1-mini vs GPT-4o-mini

GPT-4o vs o1-preview: Overall comparison

GPT-4o-mini and o1-mini: Overall comparison

OpenAI o1 vs GPT 4o: Key Differences

When to use GPT-4o and OpenAI o1

Example prompts analyzing performance of OpenAI o1 vs GPT-4o

1. The 'Strawberry' test

3. Math puzzles

3. Up-to-date Information

4. Complex Mathematical Problem Solving

5. Advanced Coding Tasks

About PromptLayer

Read more

Why LLMs Get Distracted and How to Write Shorter Prompts

The Agentic System Design Interview: How to evaluate AI Engineers

What is Context Engineering?

Best Practices for Evaluating Back-and-Forth Conversational AI