How LLM Fit Check works

LLM Fit Check estimates the total memory a large language model needs — weights + KV cache + compute buffers — and compares it against your GPU VRAM, Apple unified memory or system RAM. Model data comes live from Hugging Face: real parameter counts, exact GGUF quantization file sizes, and each model's true architecture (layers, grouped-query attention heads, context window), including special cases like DeepSeek's MLA compressed KV cache and Gemma's sliding-window attention.

How much memory do common model sizes need?

Typical totals at 8K context for standard GQA architectures (use the calculator above for your exact model, quant and context):

Model size	Q4_K_M (recommended)	FP16 (full precision)	Fits on (at Q4_K_M)
7–8B	~6–7 GB	~18 GB	RTX 3060 12GB, any 16GB Mac
13–14B	~11–12 GB	~32 GB	RTX 4070 12GB (tight), RTX 4080 16GB
32B	~21–23 GB	~68 GB	RTX 3090/4090 24GB, 36GB Mac
70B	~43–46 GB	~145 GB	2× 24GB GPUs, 64GB+ Mac
Mixtral 8x7B (47B MoE)	~29 GB	~94 GB	RTX 5090 32GB, 48GB Mac

Model data refreshes continuously from the Hugging Face API; the trending list updates every 6 hours.

How much VRAM does a 7B, 13B or 70B model need?

At the popular Q4_K_M quantization with 8K context: a 7–8B model needs about 6–7 GB, a 13–14B model about 11–12 GB, a 32B model about 21–23 GB, and a 70B model about 43–46 GB. Higher precision (Q6_K, Q8_0, FP16) and longer context windows push these numbers up; use the calculator above for your exact setup.

What is Q4_K_M and which GGUF quantization should I use?

GGUF quantizations shrink a model's weights to fewer bits per weight. Q4_K_M (~4.85 bits/weight) is the sweet spot for most people — roughly 70% smaller than FP16 with minimal quality loss. Q6_K and Q8_0 are near-lossless but bigger; Q2_K is a last resort with noticeable degradation. If a model almost fits, try the next smaller quant before giving up.
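
The "try the next smaller quant" advice can be sketched in a few lines of Python. This is an illustration only, not the calculator's actual code: it looks at weights alone (no KV cache or compute buffer), the function names are hypothetical, and only the Q4_K_M figure comes from the text above. The Q6_K and Q8_0 values are nominal bits-per-weight figures assumed here, and real GGUF files deviate slightly from them.

```python
# Approximate bits per weight for each format.
# Q4_K_M is taken from the text; the other values are nominal assumptions.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q4_K_M": 4.85,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Weight size in GB (1 GB = 1e9 bytes): parameters x bits / 8."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def largest_quant_that_fits(params_billion: float, budget_gb: float):
    """Return the highest-precision quant whose weights fit, or None."""
    by_precision = sorted(BITS_PER_WEIGHT, key=BITS_PER_WEIGHT.get, reverse=True)
    for quant in by_precision:
        if weights_gb(params_billion, quant) <= budget_gb:
            return quant
    return None


def savings_vs_fp16(quant: str) -> float:
    """Fraction of FP16 size saved by a quant (about 0.70 for Q4_K_M)."""
    return 1 - BITS_PER_WEIGHT[quant] / BITS_PER_WEIGHT["FP16"]
```

For example, largest_quant_that_fits(8, 6) walks down from FP16 and stops at Q4_K_M, because the assumed Q6_K weights for an 8B model would already exceed a 6 GB weight budget.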

How is the memory estimate calculated?

Weights = parameters × bits-per-weight ÷ 8 (or the exact published GGUF file size when available). KV cache = 2 × layers × KV heads × head dimension × bytes per element × context tokens, where the factor of 2 covers the separate key and value tensors. The KV estimate is grouped-query-attention aware, uses each model's real config, and is allocated upfront for the configured context (as llama.cpp does). The compute buffer assumes flash attention (the llama.cpp / LM Studio / Ollama default since late 2025), so it stays nearly flat as context grows. A configurable safety margin separates “fits” from “tight”.
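
The two formulas above can be written out as a minimal Python sketch. It is not LLM Fit Check's implementation: the ModelConfig class, the function names, the fixed 0.5 GB compute buffer and the 1 GB = 1e9 bytes convention are all assumptions made for illustration. The example config (32 layers, 8 KV heads, head dimension 128, about 8.03B parameters) is chosen to resemble a typical 8B GQA model.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    params: float   # total parameter count
    layers: int     # number of transformer layers
    kv_heads: int   # KV heads (fewer than attention heads under GQA)
    head_dim: int   # dimension of each head


def weights_bytes(cfg: ModelConfig, bits_per_weight: float) -> float:
    """Weights = parameters x bits-per-weight / 8."""
    return cfg.params * bits_per_weight / 8


def kv_cache_bytes(cfg: ModelConfig, context_tokens: int,
                   bytes_per_element: int = 2) -> int:
    """KV cache = 2 (K and V) x layers x KV heads x head dim x bytes x tokens."""
    return (2 * cfg.layers * cfg.kv_heads * cfg.head_dim
            * bytes_per_element * context_tokens)


def total_gb(cfg: ModelConfig, bits_per_weight: float, context_tokens: int,
             compute_buffer_gb: float = 0.5) -> float:
    """Weights + KV cache + an assumed flat compute buffer, in GB (1e9 bytes)."""
    total = weights_bytes(cfg, bits_per_weight) + kv_cache_bytes(cfg, context_tokens)
    return total / 1e9 + compute_buffer_gb


# Illustrative 8B-class GQA config (assumed values, not fetched from Hugging Face).
example = ModelConfig(params=8.03e9, layers=32, kv_heads=8, head_dim=128)
example_total = total_gb(example, bits_per_weight=4.85, context_tokens=8192)
```

With a 16-bit cache this config needs 131,072 bytes of KV cache per token, so an 8K context takes about 1.07 GB, and the Q4_K_M total lands in the 6–7 GB range quoted for 7–8B models.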

Do MoE models like Mixtral or DeepSeek need less memory?

No — all parameters must be in memory, not just the active experts. Mixtral 8x7B holds 46.7B parameters even though only ~13B are active per token; the active count makes it faster, not smaller. The exception is the KV cache: DeepSeek's MLA architecture compresses it ~28× compared to standard attention.
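
The gap between total and active parameters is easy to see with a little arithmetic. The sketch below uses the Mixtral figures from the answer above and the ~4.85 bits/weight Q4_K_M figure; the function and constant names are hypothetical and it covers weights only.

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight size in GB (1 GB = 1e9 bytes) for a given parameter count."""
    return params_billion * bits_per_weight / 8


MIXTRAL_TOTAL_B = 46.7    # all experts, from the text above
MIXTRAL_ACTIVE_B = 13.0   # approximate parameters used per token
Q4_K_M_BITS = 4.85

# What has to be resident in memory: every expert, active or not.
resident_gb = quantized_size_gb(MIXTRAL_TOTAL_B, Q4_K_M_BITS)

# What is actually read per token: this drives speed, not footprint.
touched_per_token_gb = quantized_size_gb(MIXTRAL_ACTIVE_B, Q4_K_M_BITS)
```

Under these assumptions the resident weights come to roughly 28 GB while each token touches only about 8 GB of them, which is why a sparse MoE generates tokens like a much smaller model but still has to be budgeted like a large one.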

How much of my Mac's unified memory can a model use?

macOS lets the GPU wire about 2/3 of unified memory below 36 GB, and 3/4 from 36 GB up (Metal's recommended working-set limit, which Ollama and LM Studio respect). A 36 GB MacBook therefore offers ~27 GB to models — minus whatever macOS and your open apps already use, which you can set in the hardware panel.
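
As a rough sketch, the rule above reduces to a small function. The names are hypothetical and the 2/3 and 3/4 fractions are the approximations stated in the text, not values queried from Metal.

```python
def metal_gpu_limit_gb(unified_gb: float) -> float:
    """Approximate GPU-wirable memory: 3/4 from 36 GB up, 2/3 below."""
    fraction = 3 / 4 if unified_gb >= 36 else 2 / 3
    return unified_gb * fraction


def available_for_model_gb(unified_gb: float, already_used_gb: float = 0.0) -> float:
    """GPU limit minus what macOS and open apps already occupy (never negative)."""
    return max(0.0, metal_gpu_limit_gb(unified_gb) - already_used_gb)
```

For a 36 GB machine metal_gpu_limit_gb returns 27.0, matching the figure above; passing an already_used_gb value mirrors the adjustment made in the hardware panel.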

Can I run a model that's bigger than my VRAM?

Often yes — llama.cpp-style engines can keep some layers in system RAM and run them on the CPU (partial offload). It works, but expect a steep slowdown: even a small spill can halve your tokens per second. LLM Fit Check shows these models with an orange “Offloads” verdict and how many GB would land in RAM.
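
One way to picture the verdict logic is the sketch below. It is an assumption about how such a check could be structured, not the tool's actual code: the function name, the 10% default margin and the "Too large" label are hypothetical, while "Fits", "Tight" and "Offloads" follow the wording used on this page.

```python
def verdict(required_gb: float, vram_gb: float, ram_gb: float,
            safety_margin: float = 0.10):
    """Return (label, spill_gb), where spill_gb is what would land in system RAM."""
    if required_gb <= vram_gb * (1 - safety_margin):
        return ("Fits", 0.0)
    if required_gb <= vram_gb:
        return ("Tight", 0.0)       # fits, but inside the safety margin
    spill_gb = required_gb - vram_gb
    if spill_gb <= ram_gb:
        return ("Offloads", spill_gb)   # some layers run from system RAM on the CPU
    return ("Too large", spill_gb)      # hypothetical label: RAM cannot absorb the spill
```

For instance, a model needing 29 GB on a 24 GB GPU with 64 GB of system RAM comes back as ("Offloads", 5.0): it runs, with 5 GB served from RAM at reduced speed.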