
Nvidia Open-Sources Parakeet V2 Speech Recognition Model

[Image: a home office workstation running automatic speech recognition software, with an audio waveform and live transcript on screen. Image Source: ChatGPT-4o]


Nvidia has open-sourced Parakeet V2, a high-performance automatic speech recognition (ASR) model that sets a new benchmark in transcription accuracy and speed. Capable of transcribing an hour of audio in just one second, the 600-million-parameter model is now freely available under a commercially permissive license for developers and researchers worldwide.

Parakeet V2 ranks first on the Open ASR leaderboard with a Word Error Rate (WER) of 6.05%, outperforming leading models like OpenAI’s Whisper and ElevenLabs’ Scribe. It’s released under a Creative Commons BY-4.0 license, allowing for both commercial and non-commercial use.

Key Features and Capabilities

Beyond speed and accuracy, Parakeet V2 is designed with features that support real-world use cases in transcription and voice AI:

  • Word-level timestamp predictions for aligning transcripts with audio

  • Automatic punctuation and capitalization for more readable outputs

  • Song-to-lyric transcription, supporting use cases in music and media

  • Robust handling of spoken numbers, a common pain point in ASR systems

It also supports transcription of long audio segments—up to 24 minutes in a single pass—thanks to the model’s architecture and use of full attention during training.
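For recordings longer than that 24-minute single-pass window, a common approach is to split the audio into segments of at most 24 minutes, with a small overlap so words aren't cut at the boundaries, and transcribe each segment separately. A minimal sketch of the boundary arithmetic (the 5-second overlap is an illustrative choice, not something the model requires):

```python
def chunk_bounds(total_s: float, window_s: float = 24 * 60, overlap_s: float = 5.0):
    """Yield (start, end) second offsets that cover `total_s` seconds of audio
    using windows of at most `window_s`, each overlapping the previous one
    by `overlap_s` to avoid splitting words at a boundary."""
    if total_s <= window_s:
        yield (0.0, total_s)
        return
    start = 0.0
    while start < total_s:
        end = min(start + window_s, total_s)
        yield (start, end)
        if end >= total_s:
            break
        start = end - overlap_s

# A 50-minute recording splits into three overlapping windows:
for start, end in chunk_bounds(50 * 60):
    print(start, end)
```

Each (start, end) pair can then be sliced out of the source audio and passed to the model as its own segment.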

These capabilities make Parakeet V2 a strong option for developers—significantly lowering the barrier to building voice assistants, transcription platforms, and other advanced speech applications.

Built for Performance and Scale

Parakeet V2 is based on the FastConformer encoder architecture with a TDT decoder, trained using Nvidia’s NeMo toolkit. It’s optimized for GPU-accelerated hardware, particularly Nvidia’s A100 GPUs, delivering faster training and inference than CPU-based systems.

The model was trained using a combination of:

  • Self-supervised learning with wav2vec on the LibriLight dataset

  • Stage 1 training for 150,000 steps on 128 A100 GPUs

  • Stage 2 fine-tuning using 500 hours of high-quality human-transcribed audio

On the Hugging Face Open ASR leaderboard, Parakeet V2 reports an RTFx (real-time factor multiplier) of 3380 using a batch size of 128—indicating extremely fast inference speeds, though performance will vary with different datasets and configurations.
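As a quick sanity check on that figure: RTFx is simply the duration of the audio divided by the wall-clock time taken to transcribe it, so an RTFx of 3380 means roughly an hour of audio in just over a second. A minimal illustration of the arithmetic:

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Real-time factor multiplier: seconds of audio transcribed
    per second of wall-clock processing time."""
    return audio_seconds / processing_seconds

# At RTFx = 3380, one hour of audio takes about 1.07 seconds to transcribe.
one_hour = 3600.0
print(round(one_hour / 3380, 2))  # → 1.07
```

This is why batch size matters in the leaderboard number: larger batches keep the GPU saturated, and smaller batches or streaming workloads will see a lower effective RTFx.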

Designed for Developers, Researchers, and Industry

The model is accessible through the Hugging Face demo and is ready for immediate use in a variety of speech-to-text applications, including:

  • Voice assistants

  • Transcription and captioning services

  • Voice analytics platforms

  • Conversational AI systems

Developers can fine-tune or adapt the model using the Nvidia NeMo toolkit, which requires an up-to-date PyTorch installation and compatible GPU hardware.
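A rough sketch of what loading the model through NeMo looks like. The model id `nvidia/parakeet-tdt-0.6b-v2` and the exact return shape of `.transcribe()` are assumptions here; consult the Hugging Face model card and the NeMo ASR documentation for the current names and API:

```python
# Sketch: loading Parakeet V2 via the NeMo toolkit and transcribing audio files.
try:
    import nemo.collections.asr as nemo_asr
except ImportError:
    nemo_asr = None  # NeMo not installed; the sketch below is illustrative only

def transcribe(paths):
    """Return one transcription result per input audio file path."""
    if nemo_asr is None:
        raise RuntimeError("pip install nemo_toolkit[asr] to run this sketch")
    model = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/parakeet-tdt-0.6b-v2"  # assumed id; check the model card
    )
    return model.transcribe(paths)

# Example usage (requires NeMo, a compatible GPU, and a local WAV file):
# transcribe(["interview.wav"])
```

Downloading the checkpoint and running inference requires compatible GPU hardware, as noted above; the function itself is just the load-then-transcribe pattern NeMo uses for pretrained ASR models.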

What This Means

With Parakeet V2, Nvidia is reinforcing its leadership in both AI research and practical deployment. By open-sourcing a top-tier ASR model under a permissive license, the company is giving developers and researchers a powerful tool that’s ready to be used across industries—from media and customer support to healthcare and education.

Crucially, this release also lowers the cost and complexity of adding accurate speech recognition to real-world products. Rather than relying on closed APIs or expensive enterprise contracts, developers can now fine-tune or integrate Parakeet V2 directly using open-source tools and off-the-shelf hardware. That’s a major shift, especially for smaller teams, startups, and academic labs that often struggle to access cutting-edge models.

And because it supports long-form audio, word-level timestamps, and complex formats like lyrics or spoken numbers, it’s adaptable to a wide range of use cases that go beyond basic transcription.

By making a top-performing ASR model freely available, Nvidia is removing key barriers to entry for innovation in voice technology—and setting a new standard for what open-source AI can deliver.

Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.