Whisper Medium

Audio

Whisper Medium (769 M params, Apache-2.0)

Mid-sized ASR model that reaches about 2.9 % WER on LibriSpeech clean and runs on a GPU with ~3 GB of VRAM.

  • Spec sheet. 769 M-param encoder-decoder Transformer, 30 s audio window, trained on 680,000 hours of speech across 99 languages.

  • Real-world accuracy. Self-reported 2.9 % WER on LibriSpeech clean and 5.9 % on LibriSpeech other, a strong result for a model of this size.

  • Runs light. FP16 weights ≈1.5 GB (769 M parameters × 2 bytes each); end-to-end inference fits on cards with ~3 GB VRAM, so no H100 is needed.

  • Multilingual + translate. Auto-detects the spoken language across 99 languages and can return either a same-language transcript or an English translation.
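
The numbers in the spec sheet can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes 2 bytes per FP16 parameter, 1 GB = 10^9 bytes, and non-overlapping 30 s windows (real long-form decoding may shift windows differently). The function names are illustrative and not part of any Whisper or Norman API.

```python
import math

PARAMS = 769_000_000   # parameter count quoted in the spec sheet
BYTES_PER_FP16 = 2     # one half-precision float occupies 2 bytes
WINDOW_SECONDS = 30    # audio window quoted in the spec sheet


def fp16_weight_gb(params: int = PARAMS) -> float:
    """Approximate size of the FP16 weights in gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_FP16 / 1e9


def window_count(duration_seconds: float, window: int = WINDOW_SECONDS) -> int:
    """Windows needed to cover a clip, assuming no overlap (a simplification)."""
    return max(1, math.ceil(duration_seconds / window))


if __name__ == "__main__":
    print(f"FP16 weights: ~{fp16_weight_gb():.2f} GB")
    print(f"10-minute clip: {window_count(600)} windows")
```

The weight estimate covers parameters only; activations, decoding buffers, and framework overhead come on top of it, which is why the VRAM needed at inference time is higher than the weight size alone.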

Why pick it for Norman AI?

Whisper Medium gives us accuracy close to the large model, full multilingual coverage, and a permissive Apache-2.0 license, all in a package light enough for edge GPUs or co-located microservices. It is a good fit for call transcripts, video captions, or voice-note features without new hardware or licensing headaches.

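# Assumes `norman` is an already-initialized Norman client and that
# this call runs inside an async function.
response = await norman.invoke(
    {
        "model_name": "whisper-medium",
        "inputs": [
            {
                "display_title": "Prompt",
                # Path to the local audio file to process
                "data": "/Users/alice/Desktop/meeting_record.mp3"
            },
            {
                "display_title": "Prompt",
                # Task the model should perform on the audio
                "data": "transcribe"
            }
        ]
    }
)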
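
If the same request is needed for several files or tasks, the payload can be built by a small helper. This is a minimal sketch that mirrors the dictionary shape of the call above. `build_whisper_request` is a hypothetical helper, and accepting "translate" as a task value is an assumption based on the translation feature described earlier, not a confirmed Norman input.

```python
from typing import Any, Dict

# "transcribe" appears in the example above; "translate" is an ASSUMED value.
ALLOWED_TASKS = {"transcribe", "translate"}


def build_whisper_request(audio_path: str, task: str = "transcribe") -> Dict[str, Any]:
    """Build a payload with the same shape as the invoke() example above."""
    if task not in ALLOWED_TASKS:
        raise ValueError(f"unsupported task: {task!r}")
    return {
        "model_name": "whisper-medium",
        "inputs": [
            {"display_title": "Prompt", "data": audio_path},
            {"display_title": "Prompt", "data": task},
        ],
    }


# Usage, inside an async function with an initialized client (assumed):
#     response = await norman.invoke(build_whisper_request("meeting_record.mp3"))
```

Validating the task string before sending keeps typos from turning into a remote error, at the cost of having to update `ALLOWED_TASKS` if the service accepts other values.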