
OpenAI’s GPT-OSS models are designed for high performance across diverse environments, from enterprise infrastructure to edge devices. (Image Source: ChatGPT-4o)
OpenAI Releases GPT-OSS-120B & GPT-OSS-20B as Open-Weight Language Models
Key Takeaways:
OpenAI has released GPT-OSS-120B and GPT-OSS-20B, two state-of-the-art open-weight language models available under the Apache 2.0 license.
The models match or outperform proprietary systems like GPT-4o, o3-mini, and o4-mini on benchmarks including HealthBench, AIME, and Tau-Bench.
GPT-OSS-120B runs on a single 80 GB GPU, while the smaller GPT-OSS-20B operates on edge devices with only 16 GB of memory.
The models support tool use, chain-of-thought (CoT) reasoning, structured outputs, and adjustable reasoning effort for latency-performance tradeoffs.
OpenAI conducted extensive safety evaluations, including adversarial fine-tuning and reviews from independent experts under its Preparedness Framework.
OpenAI Launches GPT-OSS Models to Advance Open AI Development
OpenAI has introduced GPT-OSS-120B and GPT-OSS-20B, two high-performance open-weight language models optimized for reasoning tasks, tool use, and cost-effective deployment. Released under the Apache 2.0 license, the models are designed to run on consumer hardware, are fully customizable, and support structured outputs.
GPT-OSS-120B achieves near-parity with o4-mini on reasoning benchmarks and operates on a single 80 GB GPU. GPT-OSS-20B delivers comparable results to o3-mini, while requiring just 16 GB of memory, making it ideal for on-device or local inference scenarios.
The models outperform other open systems on complex tasks, with GPT-OSS-120B exceeding proprietary models like o1 and GPT-4o on tool use, function calling, chain-of-thought (CoT) reasoning, and HealthBench metrics. They integrate seamlessly with OpenAI’s Responses API, support agentic workflows with excellent instruction following, and adapt reasoning intensity based on latency needs.
Model Architecture and Performance
OpenAI trained the GPT-OSS models using its most advanced pre-training and post-training techniques, with a focus on reasoning, efficiency, and real-world usability across varied deployment environments. While the company has previously open-sourced models like Whisper and CLIP, these are its first open-weight language models since GPT-2.
Each model uses a Transformer architecture with a Mixture-of-Experts (MoE) design, which activates only a subset of parameters per token for improved efficiency (a toy routing sketch follows the table):
| Model | Total Params | Active Params per Token | Layers | Experts | Active Experts per Token | Context Length |
|---|---|---|---|---|---|---|
| GPT-OSS-120B | 117B | 5.1B | 36 | 128 | 4 | 128k |
| GPT-OSS-20B | 21B | 3.6B | 24 | 32 | 4 | 128k |
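To make the routing concrete, here is a minimal PyTorch sketch of top-k expert routing of the kind an MoE layer performs. The layer sizes, router design, and expert shapes are illustrative placeholders, not the released GPT-OSS configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy Mixture-of-Experts layer: route each token to its top-k experts."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=32, top_k=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        scores, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(scores, dim=-1)  # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():  # only k of num_experts ever run per token
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(10, 512)).shape)  # torch.Size([10, 512])
```

Because only the selected experts execute for each token, compute per token scales with active parameters (5.1B or 3.6B) rather than total parameters, which is what lets the 117B model fit useful inference onto a single 80 GB GPU.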
To further improve inference and memory efficiency, both models use alternating dense and locally banded sparse attention patterns (similar to GPT-3), as well as grouped multi-query attention with a group size of 8. Rotary Positional Embeddings (RoPE) are used for positional encoding.
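As a rough illustration of the grouped attention described above, the sketch below shares one key/value head across each group of 8 query heads. The head counts and dimensions are assumptions for illustration, and the banded-sparse masking is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def grouped_attention(q, k, v, group_size=8):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    d = q.shape[-1]
    k = k.repeat_interleave(group_size, dim=1)  # each KV head serves 8 query heads
    v = v.repeat_interleave(group_size, dim=1)
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(1, 64, 16, 64)  # 64 query heads (assumed count)
k = torch.randn(1, 8, 16, 64)   # 8 shared key/value heads
v = torch.randn(1, 8, 16, 64)
print(grouped_attention(q, k, v).shape)  # torch.Size([1, 64, 16, 64])
```

Sharing key/value heads this way shrinks the KV cache by the group factor, which matters most at long context lengths like 128k tokens.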
The models support context lengths of up to 128,000 tokens and were trained on a mostly English, text-only dataset with strong representation from STEM, coding, and general knowledge. Tokenization was performed using the new o200k_harmony tokenizer, which is also being open-sourced as part of the release. You can read more in their model card.
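Because the tokenizer is open-sourced, it can be loaded by name. The snippet below assumes a tiktoken release recent enough to include the o200k_harmony encoding:

```python
import tiktoken

# Assumes a tiktoken version that ships the open-sourced o200k_harmony encoding.
enc = tiktoken.get_encoding("o200k_harmony")
tokens = enc.encode("GPT-OSS supports contexts of up to 128,000 tokens.")
print(len(tokens), enc.decode(tokens))
```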
Post-Training and Reasoning Control
Post-training followed the same methodology used for OpenAI’s o4-mini, including supervised fine-tuning and reinforcement learning (RL) stages. The models are aligned with the OpenAI Model Spec, supporting chain-of-thought reasoning, tool use, and instruction following. By applying the same techniques used in OpenAI’s state-of-the-art proprietary reasoning models, the GPT-OSS models show strong post-training performance across reasoning tasks.
Like OpenAI’s proprietary o-series, the models offer low, medium, and high reasoning modes, allowing developers to optimize for either latency or performance by adjusting a system message parameter.
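As a hedged sketch of what that looks like in practice: with GPT-OSS served behind an OpenAI-compatible endpoint (for example via vLLM or Ollama), the reasoning level can be set in the system message. The server URL below is a placeholder:

```python
from openai import OpenAI

# Placeholder endpoint: any OpenAI-compatible server hosting GPT-OSS locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "Reasoning: high"},  # low | medium | high
        {"role": "user", "content": "Why is the sum of two odd integers always even?"},
    ],
)
print(resp.choices[0].message.content)
```

Lower settings trade reasoning depth for latency and cost; higher settings spend more chain-of-thought tokens on harder problems.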
Benchmark Results: Reasoning, Math, Health, and Tools
GPT-OSS-120B and GPT-OSS-20B were evaluated across standard academic benchmarks to assess their capabilities in coding, competition mathematics, health-related reasoning, and agentic tool use. In these evaluations, the models frequently matched or exceeded OpenAI’s proprietary reasoning models, including o3, o3-mini, and o4-mini:
Codeforces (competition coding):
GPT-OSS-120B achieves a higher Elo rating than o3-mini and performs on par with o4-mini, demonstrating strong results in competitive programming tasks.
Humanity’s Last Exam (expert-level questions across disciplines):
On this benchmark, which tests reasoning across a diverse set of expert-level academic domains, GPT-OSS-120B and GPT-OSS-20B perform competitively—particularly in tool-assisted scenarios. While proprietary models like o4-mini lead the category, GPT-OSS models still deliver meaningful results across highly specialized prompts, further validating their broad applicability.
HealthBench (realistic health conversations):
On both standard and hard evaluations, GPT-OSS-120B and GPT-OSS-20B outperform GPT-4o, o3, and o3-mini, making them particularly strong in health-related reasoning benchmarks.
AIME 2024 & 2025 (competition mathematics):
Both models achieve high accuracy, with GPT-OSS-120B reaching 97.9% on AIME 2024 and GPT-OSS-20B scoring 98.7%. On AIME 2025, both models again outperform o3-mini, and GPT-OSS-20B surpasses even o3 on certain tasks—despite its smaller size.
Tau-Bench (tool use and function calling):
The models show robust performance in few-shot function calling, long-answer generation, and chain-of-thought (CoT) reasoning, closely tracking the accuracy of o4-mini and o3 across tool-augmented evaluations.
GPQA (PhD-level science reasoning):
GPT-OSS-120B performs competitively with o4-mini and outpaces o3-mini, particularly on advanced scientific reasoning tasks that test deep domain knowledge.
MMLU (multi-subject academic knowledge):
The models perform solidly across MMLU categories, with GPT-OSS-120B achieving scores near o4-mini and GPT-OSS-20B outperforming o3-mini, reinforcing their strength in general academic understanding.
Safety Training and Adversarial Testing
Safety remains foundational to OpenAI’s approach, especially when releasing open models. In addition to comprehensive safety training and evaluation, the team tested an adversarially fine-tuned version of GPT-OSS-120B under its Preparedness Framework. The models performed comparably to OpenAI’s frontier systems on internal benchmarks, meeting the same safety standards as its latest proprietary models. The evaluation methodology was reviewed by external experts, and the results—shared in a research paper and model card—mark a significant step toward setting new safety norms for open-weight AI systems.
During pre-training, harmful data was filtered, including content related to Chemical, Biological, Radiological, and Nuclear (CBRN) threats. Post-training employed deliberative alignment and instruction hierarchy techniques to teach refusals of unsafe prompts, defend against prompt injections, and uphold ethical boundaries.
To simulate worst-case misuse scenarios, OpenAI fine-tuned adversarial versions of the GPT-OSS models on specialized biological and cybersecurity datasets, mimicking potential attacker behavior. These variants were tested under the Preparedness Framework, and three independent expert groups reviewed the methodology. The models did not reach high-risk capability thresholds, supporting their public release. The testing methods and recommendations can be viewed in their safety paper and model card.
To encourage community-led safety research and contribute to a safer open-source ecosystem, OpenAI is launching a Red Teaming Challenge with a $500,000 prize pool, inviting researchers, developers, and enthusiasts to identify novel vulnerabilities. Results will be published and open-sourced at the end of the challenge. To learn more or participate, visit OpenAI’s official website.
Deployment and Availability
The models are now freely downloadable on Hugging Face and come quantized in MXFP4 for efficient deployment (see the loading sketch after the memory figures below):
GPT-OSS-120B runs within 80 GB of GPU memory
GPT-OSS-20B runs within 16 GB of memory
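As a minimal sketch of local use, the snippet below loads the 20B model through Hugging Face Transformers. It assumes a recent transformers release with GPT-OSS support and hardware with roughly 16 GB of accelerator memory:

```python
from transformers import pipeline

# Assumes a transformers version that supports GPT-OSS; the weights download
# from Hugging Face (openai/gpt-oss-20b) on first use.
pipe = pipeline("text-generation", model="openai/gpt-oss-20b",
                torch_dtype="auto", device_map="auto")
messages = [{"role": "user", "content": "Summarize the Apache 2.0 license in one sentence."}]
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1])  # the assistant's reply
```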
To support integration, OpenAI is releasing both the Harmony prompt format and a Harmony renderer (available in Python and Rust), along with inference references for PyTorch and Apple Metal, and a collection of sample tools for easier adoption. Ahead of launch, deployment partners included Azure, Hugging Face, Ollama, vLLM, AWS, Together AI, Databricks, Cloudflare, and others.
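For prompt construction, the Harmony renderer can be used directly. This sketch follows the published examples for the openai-harmony Python package; the exact API names may vary across versions:

```python
from openai_harmony import (
    Conversation, HarmonyEncodingName, Message, Role, load_harmony_encoding,
)

# Render a conversation into Harmony-format token IDs for a GPT-OSS backend.
enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
convo = Conversation.from_messages([
    Message.from_role_and_content(Role.USER, "What does MXFP4 quantization do?"),
])
tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)
print(tokens[:16])  # prompt token IDs, ready for inference
```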
On the hardware side, OpenAI collaborated with industry leaders including NVIDIA, AMD, Cerebras, and Groq to optimize model performance across a wide range of deployment environments.
OpenAI has also partnered with early adopters including AI Sweden, Orange, and Snowflake to explore real-world applications of the open models. These use cases range from on-premises deployment for data security to fine-tuning on specialized datasets. By offering best-in-class open models alongside API-hosted options, OpenAI aims to give everyone—from individual developers to enterprises and governments—the flexibility to run and customize AI on their own infrastructure.
Microsoft is also enabling GPU-optimized GPT-OSS-20B inference on Windows devices via ONNX Runtime, available in Foundry Local and the AI Toolkit for VS Code.
For developers who need multimodal support, built-in tools, or tight integration with OpenAI’s platform, proprietary models accessed via the API remain the best fit. OpenAI says it is listening to developer feedback and may explore API support for GPT-OSS in the future.
Developers can explore the models through OpenAI’s open model playground and access detailed guides for using different ecosystem providers or fine-tuning the models.
Why Open Models Matter
OpenAI describes GPT-OSS-120B and GPT-OSS-20B as a milestone in delivering powerful open models that balance capability, safety, and customizability. They complement hosted models by offering developers—especially those with infrastructure limitations—a high-performance, self-hosted option.
The release supports broader goals of democratizing AI, especially for emerging markets, research labs, and government applications. By enabling flexible deployment and fine-tuning, OpenAI aims to accelerate innovation across sectors while advancing transparency and alignment research.
Q&A: GPT-OSS Open Model Release
Q: What are GPT-OSS-120B and GPT-OSS-20B?
A: Two open-weight language models from OpenAI, optimized for reasoning, tool use, and efficient deployment.
Q: How do they compare to proprietary models like GPT-4o or o3-mini?
A: They match or outperform those models on benchmarks in math, health, and coding, especially with tool use enabled.
Q: What kind of hardware do they require?
A: GPT-OSS-120B runs on a single 80 GB GPU, and GPT-OSS-20B runs on 16 GB, making it suitable for local or on-device inference.
Q: What safety measures were taken before release?
A: OpenAI ran robust safety evaluations, including adversarial fine-tuning, and had results reviewed by external experts.
Q: Where can developers get the models and supporting tools?
A: The models are available on Hugging Face, along with open-source tokenizers, renderers, and deployment references.
What This Means
By releasing GPT-OSS-120B and GPT-OSS-20B, OpenAI is setting a new standard for open-weight model capabilities and responsible deployment. These models lower barriers to entry for developers around the world—offering customizable, high-performance tools that rival proprietary systems.
With the launch of GPT-OSS, OpenAI is now producing both closed and open-weight models at a level competitive with other state-of-the-art systems. This dual approach gives developers, enterprises, and governments more choice in how they access, deploy, and fine-tune AI—whether they prefer hosted APIs or self-managed infrastructure.
As AI infrastructure grows more distributed and diverse, accessible models like GPT-OSS allow innovation to flourish beyond cloud platforms. This release reinforces the value of a healthy open-source ecosystem—one where safety, performance, and transparency can grow together.
By offering top-tier open models alongside its proprietary lineup, OpenAI is expanding who gets to build with cutting-edge AI—and where.
Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.