TinyLlama-1.1B (1.1B params, Apache-2.0)

Pocket-size Llama-2 drop-in you can run almost anywhere.

  • 4K-token brain. Keeps Llama-2’s tokenizer and 4,096-token RoPE window, so existing prompts and LoRA adapters just work.

  • Trained like a giant. Pre-trained on ~3T tokens of web + code, then chat-aligned with UltraChat → DPO; scores well against models 3–7× its size.

  • Featherweight deploys. ~2 GB VRAM in BF16; a 4-bit quant is only ≈637 MB, small enough for laptops, phones, or edge gateways.

  • Plug-and-play. Works out of the box with transformers >= 4.34, vLLM, llama.cpp, Ollama, and GGUF exports; just from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0") and go (see the sketch after this list).

  • No legal knots. Apache-2.0 means full commercial use—no “Llama license” clauses.
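
A minimal sketch of the transformers path (the prompt and generation settings are illustrative, and device_map="auto" assumes accelerate is installed):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~2 GB; use quantization_config=BitsAndBytesConfig(load_in_4bit=True) for the ~637 MB footprint
    device_map="auto",
)

# Build the prompt with the model's own chat template, then generate.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Give me three ways to combine bananas and dragonfruit."},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))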

Why pick it for Norman AI?

TinyLlama lets us spin up an on-device or “budget” inference tier without touching our Llama-based toolchain—same prompt format, one-tenth the RAM. Ideal for edge apps, speculative decoding helpers, and per-tenant fine-tunes where every gigabyte counts.
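
Because it keeps the Llama-2 tokenizer, TinyLlama also slots in as the draft model for transformers' assisted (speculative) generation; a sketch, assuming an illustrative Llama-2-7B-Chat target (any model sharing the tokenizer works):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-2-7b-chat-hf"      # illustrative target model
draft_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # TinyLlama as the cheap drafter

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Explain speculative decoding in two sentences.", return_tensors="pt").to(target.device)
# assistant_model switches generate() into assisted decoding: the draft proposes tokens, the target verifies them.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))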

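The request below shows the same chat-style payload going through norman.invoke; it assumes an already-initialized norman client, and the registered model name is a per-deployment detail.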

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant",
     "content": "Sure! Here are some ways to eat bananas and dragonfruits together"},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

response = await norman.invoke(
    {
        "model_name": "qwen3-4b",
        "inputs": [
            {
                "display_title": "Prompt",
                "data": messages
            }
        ]
    }
)