How to Run LLMs Locally in 2026: Step-by-Step Guide for Beginners
Everything you need to know — from hardware to software to your first model running in minutes.
Why Run LLMs Locally?
Running LLMs locally gives you complete privacy (your prompts never leave your machine), zero cost per query, offline capability, and full control over model choice and customization. Once downloaded, you can run thousands of queries with no API bills, rate limits, or internet required.
Privacy
Your prompts never leave your machine. No data sent to third-party servers. Critical for sensitive work, personal data, or confidential business use.
Zero cost per query
No API bills, no rate limits, no subscription fees. After hardware, inference is free. Run thousands of queries without watching usage metrics.
Works offline
No internet connection required once models are downloaded. Works on a plane, in a bunker, or when cloud services go down.
Full customization
Fine-tune on your own data. Swap system prompts. Run uncensored models. Integrate into local apps. You control everything.
What Hardware Do You Need?
The most important number is VRAM (GPU memory) or unified memory on Apple Silicon. This determines the largest model you can run and at what quality.
- 8 GB VRAM (RTX 4060): runs 7B models well, a great starting point
- 12 GB VRAM (RTX 4070): unlocks 13B models at Q4 quantization
- 24 GB VRAM or unified (RTX 3090, Mac mini M4 24GB): runs 30B+ models
- CPU only: works, but slow (3–8 t/s vs 30–80 t/s on GPU)
Use the VRAM Calculator to check exactly what models fit your hardware. See also the budget hardware guide if you're deciding what to buy.
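As a rough way to reason about these tiers, the weights of a quantized model take about parameters × bits-per-weight ÷ 8 bytes, plus some runtime overhead. The sketch below assumes approximate bits-per-weight values for common GGUF quantizations and a flat overhead figure. Both are illustrative assumptions, not the formula behind the VRAM Calculator, and the helper name is hypothetical.

```python
# Rough VRAM estimate for a quantized model. The bits-per-weight values and
# the flat overhead are approximations assumed for illustration only.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def estimate_vram_gb(params_billion: float, quant: str, overhead_gb: float = 0.7) -> float:
    """Return an approximate VRAM footprint in GB (weights plus overhead)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # params (in billions) * bits per weight / 8 bits per byte gives GB of weights
    weights_gb = params_billion * bits / 8
    return round(weights_gb + overhead_gb, 1)


for size in (7, 13):
    print(f"{size}B at Q4_K_M: ~{estimate_vram_gb(size, 'Q4_K_M')} GB")
```

Under these assumptions the 7B and 13B estimates land close to the ~5 GB and ~8.5 GB figures quoted elsewhere in this guide. Longer contexts need more memory for the KV cache, which this sketch does not model.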
Choose Your Software
Ollama
Easiest: recommended for beginners
One-command install, automatic model downloads, a built-in REST API, and support for macOS, Windows, and Linux. The fastest path from zero to running. Supports all major model families. Best choice for most users.
https://ollama.com
LM Studio
Best GUI: no terminal required
A polished desktop application with a model browser, chat interface, and local server mode. Download models with a click, manage multiple models, and run an OpenAI-compatible API endpoint. Ideal if you prefer a graphical interface.
https://lmstudio.ai
llama.cpp
Advanced: maximum control
The underlying inference engine that Ollama and LM Studio are built on. Direct command-line interface, highly configurable, minimal overhead. Best for fine-tuned setups, unusual hardware, or integrating into your own tools.
https://github.com/ggml-org/llama.cpp
Step-by-Step: Run Your First Model with Ollama
1. Install Ollama
Download from ollama.com and run the installer. On macOS and Windows it installs as a background service. On Linux, run:
curl -fsSL https://ollama.com/install.sh | sh
2. Download and run a model
Open your terminal and run one command. Ollama downloads the model automatically and starts a chat session:
ollama run qwen3:4b
This downloads Qwen3 4B (~3 GB), the recommended starter in 2026. Popular alternatives:
ollama run qwen3:8b (8 GB VRAM, stronger)
ollama run deepseek-r1:7b (reasoning)
ollama run gemma3:4b (Google Gemma 3)
3. Start chatting
Type your prompt and press Enter. The model responds in the terminal. Type /bye to exit. The model stays cached for fast subsequent runs.
4. (Optional) Add a chat UI
Ollama exposes a local API at http://localhost:11434. Connect any OpenAI-compatible frontend; Open WebUI is popular:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main
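If you would rather call the local API from your own script than from a chat UI, a minimal sketch using only the Python standard library might look like the following. It assumes Ollama is running on the default address above and that the qwen3:4b model from step 2 is already downloaded. The /api/generate endpoint and the "response" field follow Ollama's REST API docs; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address from step 4


def build_generate_request(prompt: str, model: str = "qwen3:4b") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "qwen3:4b") -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.loads(resp.read())["response"]


# Example (needs a running Ollama server with the model pulled):
# print(ask("Explain quantization in one sentence."))
```

Setting "stream" to False asks for one JSON object rather than a stream of partial replies, which keeps the parsing to a single json.loads call.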
Recommended Starter Models by Hardware
| Hardware Tier | Hardware Examples | Starter Model | Notes |
|---|---|---|---|
| 4–6 GB VRAM / CPU only | Integrated GPU, old GPU, CPU | Qwen3 4B Q4 / Phi-4-mini 3.8B Q4 / Gemma 3 4B Q4 | Small but capable; fast on modern CPUs |
| 8 GB VRAM | RTX 4060, RTX 3070, RTX 2080 | Llama 3.1 8B Q4_K_M / DeepSeek-R1-Distill-8B / Qwen3 8B | Best all-round 7–8B; fast, capable |
| 12 GB VRAM | RTX 4070, Arc B580 12GB | Qwen3 14B Q4_K_M / DeepSeek-R1-Distill-14B Q4 | Significant quality step up at 14B |
| 16 GB VRAM (NVIDIA) | RTX 4060 Ti 16GB, RTX 4070 Ti Super | Qwen3 14B Q8 / Llama 3.1 8B FP16 | 16 GB unlocks Q8 quality — big improvement |
| 16 GB unified (Apple) | Mac mini M4 16GB, MacBook Air M4 16GB | Llama 3.1 8B Q8 / Qwen3 8B Q8 | Higher quality quant fits in 16 GB |
| 24 GB VRAM / unified | RTX 3090, RTX 4090, Mac mini M4 24GB | Qwen3 32B Q4_K_M / DeepSeek-R1-Distill-32B / Llama 3.3 70B Q4 (partial) | Large model territory |
All models above are available via ollama run <model>. See LLM Model Sizes Explained for a full VRAM breakdown.
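If you want the table as a quick lookup in a setup script, a small sketch could look like this. The model names are copied from the table; the function name is hypothetical, and reusing the discrete-GPU picks for Apple tiers the table does not list separately is an assumption.

```python
# Lookup that mirrors the starter-model table above. Each entry is
# (minimum memory in GB, picks for discrete GPUs, picks for Apple unified memory).
# Apple picks for tiers the table does not list separately are assumed to be
# the same as the discrete-GPU picks.
SMALL = ["Qwen3 4B Q4", "Phi-4-mini 3.8B Q4", "Gemma 3 4B Q4"]
TIER_8 = ["Llama 3.1 8B Q4_K_M", "DeepSeek-R1-Distill-8B", "Qwen3 8B"]
TIER_12 = ["Qwen3 14B Q4_K_M", "DeepSeek-R1-Distill-14B Q4"]
TIER_16_NVIDIA = ["Qwen3 14B Q8", "Llama 3.1 8B FP16"]
TIER_16_APPLE = ["Llama 3.1 8B Q8", "Qwen3 8B Q8"]
TIER_24 = ["Qwen3 32B Q4_K_M", "DeepSeek-R1-Distill-32B", "Llama 3.3 70B Q4 (partial)"]

TIERS = [
    (24, TIER_24, TIER_24),
    (16, TIER_16_NVIDIA, TIER_16_APPLE),
    (12, TIER_12, TIER_12),
    (8, TIER_8, TIER_8),
    (0, SMALL, SMALL),
]


def starter_models(memory_gb: float, apple_unified: bool = False) -> list[str]:
    """Return the table's starter picks for the highest tier that fits."""
    for minimum_gb, discrete, unified in TIERS:
        if memory_gb >= minimum_gb:
            return unified if apple_unified else discrete
    raise ValueError("memory_gb must be non-negative")


print(starter_models(8))
print(starter_models(16, apple_unified=True))
```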
Common Mistakes to Avoid
Downloading a model that is too large for your VRAM
Check VRAM requirements first. A 13B model at Q4 needs ~8.5 GB VRAM. If you only have 8 GB, it will load partially onto CPU and run very slowly. Use the VRAM Calculator before downloading.
Using FP16 when Q4_K_M is available
Full FP16 precision uses roughly twice the VRAM of Q8_0 and more than three times that of Q4_K_M, with minimal quality gain for most tasks. Start with Q4_K_M or Q8_0 quantized GGUF models: the quality loss is small and the VRAM savings are large.
Expecting GPU speeds on a CPU-only system
CPU inference works but is slow. A 7B model on a modern CPU runs at roughly 3–8 t/s: usable, not snappy. If you need interactive speed, you need a GPU or Apple Silicon.
Not leaving headroom in VRAM
A model that exactly fills your VRAM will crash or stall when it tries to allocate KV cache for longer contexts. Leave 1–2 GB free. A 7B model at Q4 uses ~5 GB — fine on 8 GB.
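The headroom rule can be written as a one-line check. This sketch uses the approximate model sizes quoted in this guide and assumes a default headroom of 1.5 GB, the midpoint of the 1–2 GB suggested above. The function name is hypothetical.

```python
def fits_with_headroom(model_gb: float, vram_gb: float, headroom_gb: float = 1.5) -> bool:
    """Return True if loading the model leaves at least headroom_gb of VRAM free."""
    return vram_gb - model_gb >= headroom_gb


# Approximate sizes quoted in this guide
print(fits_with_headroom(5.0, 8.0))   # 7B at Q4 on an 8 GB card
print(fits_with_headroom(8.5, 8.0))   # 13B at Q4 on an 8 GB card
print(fits_with_headroom(8.5, 12.0))  # 13B at Q4 on a 12 GB card
```

The first and third checks pass and the second fails, matching the advice that a 13B model at Q4 belongs on a 12 GB card rather than an 8 GB one.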
Picking the biggest model available instead of the most suitable
A 7B model fine-tuned for your task often beats a generic 13B. Check the model card. DeepSeek Coder 7B outperforms many larger general models on coding tasks.
Frequently Asked Questions
Do I need a GPU to run LLMs locally?
No, but a GPU makes a large difference. On CPU-only hardware, a 7B model runs at 3–8 tokens per second — usable but slow for interactive chat. With a GPU (even an RTX 4060 with 8 GB VRAM), the same model runs at 30–50 tokens per second. If you only have CPU, use llama.cpp with quantized GGUF models — it is the most CPU-efficient runtime.
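To see what those speeds mean in practice, the short sketch below converts tokens per second into waiting time for a single reply. The 300-token reply length is an assumption chosen for illustration; the speed ranges are the ones quoted in the answer above.

```python
def seconds_for_reply(tokens: int, tokens_per_second: float) -> float:
    """Time in seconds to generate a reply of the given length."""
    return round(tokens / tokens_per_second, 1)


REPLY_TOKENS = 300  # assumed length of a medium-sized answer

# (slowest, fastest) tokens per second, taken from the ranges quoted above
SPEED_RANGES = {
    "CPU only": (3, 8),
    "RTX 4060 (8 GB)": (30, 50),
}

for label, (slow, fast) in SPEED_RANGES.items():
    best = seconds_for_reply(REPLY_TOKENS, fast)
    worst = seconds_for_reply(REPLY_TOKENS, slow)
    print(f"{label}: {best} to {worst} s for a {REPLY_TOKENS}-token reply")
```

With these numbers a CPU-only machine takes 37.5 to 100 seconds for the reply, while the RTX 4060 takes 6 to 10 seconds.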
What is the minimum VRAM to run an LLM locally?
You can run small models (1B–3B) on 4 GB VRAM at Q4 quantization. For a genuinely useful LLM experience, 8 GB VRAM (RTX 4060, RTX 3070) runs 7B models well. 12 GB (RTX 4070) opens up 13B models. More VRAM always means better options.
Can I run LLMs locally on CPU only?
Yes. llama.cpp and Ollama both support CPU-only inference. Performance is much slower — expect 3–8 tokens per second for a 7B model on a modern CPU. For occasional or non-interactive use this is fine. Apple Silicon Macs are a special case: the unified memory architecture delivers GPU-class performance without a discrete GPU.
Sources & methodology
VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide specifically I leaned on:
- Ollama project — install commands, model names, and the local REST API behaviour are taken straight from the upstream README and docs.
- LM Studio — official site for the GUI workflow described in the "Choose Your Software" section.
- llama.cpp project — the inference engine behind both; CPU-only behaviour and GGUF quantization names follow its conventions.
- XiongjieDai GPU-Benchmarks-on-LLM-Inference — the "RTX 4060 → 30–50 t/s, CPU → 3–8 t/s" speed ranges are sanity-checked against community llama-bench runs here.
Spot a number that does not match the linked source? Email billybobgurr@gmail.com and I will update the guide.