Qwen2.5-1.5B

Qwen 2.5 1.5B (1.54B params, Apache-2.0)

Pocket-sized Qwen that still handles marathon prompts.

  • Spec sheet. 28-layer decoder with Grouped-Query Attention (12 Q / 2 KV heads), RoPE, and SwiGLU. Native 32K-token window, stretchable to 128K with YaRN / Dual Chunk Attention (config sketch below).

  • Small model, big scores. Beats Phi-2 and Gemma-2B on MMLU (56.5) and on the other reasoning and coding benchmarks tabulated in the Qwen2 paper.

  • Runs on almost anything. BF16 weights need ≈4.6 GB of VRAM; an int-4 quant dips below 1.2 GB, perfect for laptops and edge GPUs.

  • Multilingual + structured-output savvy. Trained on 29 languages, with better JSON/table handling and sturdier system-prompt obedience.

  • Plug-and-play. Supported in transformers >= 4.37, vLLM, llama.cpp, Ollama, etc.; from_pretrained("Qwen/Qwen2.5-1.5B") and you’re off, as sketched below.
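
A minimal loading-and-generation sketch with the Hugging Face transformers API, assuming transformers >= 4.37, PyTorch, and (for the optional int-4 path) bitsandbytes; the prompt is just an illustration:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # BF16 weights: ~4.6 GB of VRAM
    device_map="auto",
    # Uncomment to quantize to int-4 and drop below ~1.2 GB:
    # quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

# This is the base (non-instruct) checkpoint, so use plain completion:
inputs = tokenizer("The quadratic formula states that", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))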

Why pick it for Norman AI?

Qwen 2.5 1.5B gives us long-context chat, solid math/code chops, and full Apache-2.0 freedom, all inside a 5 GB GPU envelope: ideal for cost-squeezed inference tiers, per-tenant fine-tunes, or on-device assistants.
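
The 128K stretch mentioned in the spec sheet is opt-in. Following the Qwen2.5 model cards, you enable it by adding a rope_scaling block to the checkpoint's config.json (factor 4.0 = 131072 / 32768):

{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}

Static YaRN scaling applies to every prompt, long or short, so the Qwen team recommends leaving it off unless you actually need contexts past 32K.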


# A short multi-turn chat history in the OpenAI-style messages format.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant",
     "content": "Sure! Here are some ways to eat bananas and dragonfruits together"},
    {"role": "user", "content": "What about solving a 2x + 3 = 7 equation?"},
]

# Send the chat history to the model through the Norman AI SDK.
response = await norman.invoke(
    {
        "model_name": "qwen2.5-1.5b",
        "inputs": [
            {
                "display_title": "Prompt",
                "data": messages
            }
        ]
    }
)