asr-wav2vec2-librispeech

Audio

wav2vec 2.0 LibriSpeech (317 M params, Apache-2.0)

English-only speech-to-text that scores under 2 % WER on LibriSpeech test-clean and still fits on a laptop.

  • Pre-train → fine-tune. Starts from the 317 M-param wav2vec 2.0 large-960h encoder, adds two DNN layers on top, then CTC-fine-tunes on the full 960 h LibriSpeech training set (see the loading sketch after this list).

  • Benchmark numbers. 1.90 % WER on test-clean and 3.96 % on test-other, accurate enough for production captioning.

  • Lean deploys. FP32 weights are ≈1.3 GB, and FP16 roughly halves that to ≈650 MB (fits any 2 GB GPU); 4-bit quant shrinks the weights to roughly 160 MB, so real-time CPU inference is doable (see the footprint arithmetic below).

  • No strings attached. Pure Apache-2.0 weights, a single-language tokenizer, and 16 kHz input: no licensing drama and no external language model to ship.
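
The recipe above matches SpeechBrain's published asr-wav2vec2-librispeech checkpoint. Assuming that is the underlying model, a minimal local sketch looks like this (the savedir and file name are illustrative, not part of this card):

from speechbrain.inference.ASR import EncoderASR

# Assumption: the card's weights correspond to speechbrain/asr-wav2vec2-librispeech on Hugging Face.
asr = EncoderASR.from_hparams(
    source="speechbrain/asr-wav2vec2-librispeech",
    savedir="pretrained_models/asr-wav2vec2-librispeech",
)

# transcribe_file expects a 16 kHz mono recording; the path is a placeholder.
print(asr.transcribe_file("sample_input.wav"))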
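
A quick back-of-the-envelope check of the footprint figures, assuming 317 M parameters and counting weights only (quantization metadata adds a little on top):

# Weight-only memory footprint for a 317 M-parameter model.
params = 317e6
for precision, bytes_per_param in [("FP32", 4), ("FP16", 2), ("4-bit", 0.5)]:
    print(f"{precision}: {params * bytes_per_param / 1e6:.0f} MB")
# FP32: 1268 MB, FP16: 634 MB, 4-bit: 159 MB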

Why pick it for Norman AI?

Drop-in English ASR with top-tier accuracy and a sub-2 GB footprint means instant call transcripts, voice-note search, or captioning, without new GPUs or legal hoops.

# Transcribe a local audio file with the English ASR model.
response = await norman.invoke(
    {
        "model_name": "asr-wav2vec2-librispeech",
        "inputs": [
            {
                "display_title": "Prompt",
                "data": "/Users/alice/Desktop/sample_input.aac"
            }
        ]
    }
)