
Distil-Whisper small.en (166 M params, MIT)

Lean English-only ASR that keeps near-Whisper accuracy in a phone-friendly footprint.

  • 6× faster, 49 % smaller. Distilled from Whisper small.en: same encoder, trimmed decoder → within 1 % WER of the teacher on out-of-distribution test sets, yet needs a fraction of the compute.

  • Mini spec sheet. 4-layer decoder, 30 s audio window, English-only. Checkpoint ≈ 350 MB FP16; 4-bit quant drops below 100 MB, so real-time runs on CPUs or 1 GB GPUs.

  • Accuracy in numbers. Short-form WER 12.1 %, long-form 12.8 %—just a few points behind Whisper-large-v3 while decoding 5-6× faster.

  • Chunk + batch friendly. Built-in 15 s chunking and batching make hour-long transcripts 9× quicker than Whisper’s original loop.

  • Plugs wherever Whisper does. Supported in transformers ≥ 4.35 and whisper.cpp, and usable as a speculative-decoding assistant for bigger Whisper models—a drop-in swap of the model ID is all it takes.
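The drop-in swap can be sketched with the Hugging Face `pipeline` API (the checkpoint ID `distil-whisper/distil-small.en` is the published one; the keyword values mirror the chunking and batching figures above). A minimal sketch, not Norman's internal wiring:

```python
# Sketch of the drop-in swap via the Hugging Face `pipeline` API.
# Only the model ID changes relative to a stock Whisper pipeline.
WHISPER_ID = "openai/whisper-small.en"        # original teacher checkpoint
DISTIL_ID = "distil-whisper/distil-small.en"  # distilled replacement

ASR_KWARGS = dict(
    task="automatic-speech-recognition",
    model=DISTIL_ID,
    chunk_length_s=15,  # Distil-Whisper's recommended long-form chunk size
    batch_size=8,       # transcribe chunks in parallel
)

def transcribe(path: str) -> str:
    from transformers import pipeline  # requires transformers >= 4.35
    asr = pipeline(**ASR_KWARGS)
    return asr(path)["text"]
```

Swapping back to vanilla Whisper is just `model=WHISPER_ID`; every other argument stays identical.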

Why pick it for Norman AI?

This gives us near-Whisper accuracy for English calls, demos, or edge devices without spinning up a beefy GPU. Use it as a standalone ASR tier or as an “assistant” model to halve latency for our existing Whisper pipelines—same API, lower bill.

response = await norman.invoke(
    {
        # Model ID assumed to follow the page title above.
        "model_name": "distil-whisper-small.en",
        "inputs": [
            {
                # ASR takes a single audio input to transcribe.
                "display_title": "Audio",
                "data": "/Users/alice/Desktop/sample_input.aac"
            }
        ]
    }
)
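The "assistant" tier mentioned above relies on speculative decoding: the distilled model drafts tokens and the full Whisper model verifies them, so output matches the teacher at roughly half the latency. A minimal sketch using transformers' assisted generation, assuming we pair the distilled checkpoint with its whisper-small.en teacher (not Norman's internal pipeline):

```python
# Speculative decoding sketch: Distil-Whisper drafts, full Whisper verifies.
TEACHER_ID = "openai/whisper-small.en"
ASSISTANT_ID = "distil-whisper/distil-small.en"  # shares the teacher's encoder/tokenizer

def transcribe_assisted(audio, sampling_rate=16_000):
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
    processor = AutoProcessor.from_pretrained(TEACHER_ID)
    teacher = AutoModelForSpeechSeq2Seq.from_pretrained(TEACHER_ID)
    assistant = AutoModelForSpeechSeq2Seq.from_pretrained(ASSISTANT_ID)
    features = processor(
        audio, sampling_rate=sampling_rate, return_tensors="pt"
    ).input_features
    # assistant_model enables assisted (speculative) generation.
    ids = teacher.generate(features, assistant_model=assistant)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```

Because the verifier is the full Whisper model, transcripts are bit-identical to running the teacher alone; the distilled assistant only buys speed.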