How to Run LLMs Locally in 2026: Step-by-Step Guide for Beginners

Everything you need to know — from hardware to software to your first model running in minutes.

Why Run LLMs Locally?

Running LLMs locally gives you complete privacy (your prompts never leave your machine), zero cost per query, offline capability, and full control over model choice and customization. Once a model is downloaded, you can run thousands of queries with no API bills, no rate limits, and no internet connection.

Privacy

Your prompts never leave your machine. No data sent to third-party servers. Critical for sensitive work, personal data, or confidential business use.

Zero cost per query

No API bills, no rate limits, no subscription fees. Once you own the hardware, inference costs nothing beyond electricity. Run thousands of queries without watching usage metrics.

Works offline

No internet connection required once models are downloaded. Works on a plane, in a bunker, or when cloud services go down.

Full customization

Fine-tune on your own data. Swap system prompts. Run uncensored models. Integrate into local apps. You control everything.

What Hardware Do You Need?

The most important number is VRAM (GPU memory) or unified memory on Apple Silicon. This determines the largest model you can run and at what quality.

- 8 GB VRAM (RTX 4060): runs 7B models well, a great starting point
- 12 GB VRAM (RTX 4070): unlocks 13B models at Q4 quantization
- 24 GB VRAM or unified memory (RTX 3090, Mac mini M4 24GB): runs 30B+ models
- CPU only: works, but slow (3–8 t/s vs 30–80 t/s on GPU)

Use the VRAM Calculator to check exactly what models fit your hardware. See also the budget hardware guide if you're deciding what to buy.
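
For a quick back-of-the-envelope check before reaching for a calculator, the rule of thumb behind the tiers above can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not the site's calculator: it assumes about 4.5 bits per weight for a Q4 model (the figure for llama.cpp's Q4_0 layout) and a flat 1 GB allowance for context and runtime buffers. The function names are illustrative.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.5,
                     overhead_gb: float = 1.0) -> float:
    """Rough VRAM estimate: quantized weights plus a flat allowance.

    bits_per_weight=4.5 and overhead_gb=1.0 are assumptions for a Q4 model
    with a short context; real usage varies by runtime and context length.
    """
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits(params_billion: float, vram_gb: float, **kwargs) -> bool:
    """True if the estimate is within the available VRAM."""
    return estimate_vram_gb(params_billion, **kwargs) <= vram_gb


for size in (7, 13, 32):
    print(f"{size}B at Q4: ~{estimate_vram_gb(size):.1f} GB")
```

Under these assumptions a 7B model lands just under 5 GB and a 13B model a little over 8 GB, which is why 8 GB cards are comfortable with 7B models but not with 13B.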

Choose Your Software

Ollama

Easiest — recommended for beginners

One-command install, automatic model downloads, a built-in REST API, and support for macOS, Windows, and Linux. The fastest path from zero to a running model. Supports all major model families. The best choice for most users.

https://ollama.com

LM Studio

Best GUI — no terminal required

A polished desktop application with a model browser, chat interface, and local server mode. Download models with a click, manage multiple models, and run an OpenAI-compatible API endpoint. Ideal if you prefer a graphical interface.

https://lmstudio.ai

llama.cpp

Advanced — maximum control

The inference engine underlying both Ollama and LM Studio. Direct command-line interface, highly configurable, minimal overhead. Best for finely tuned setups, unusual hardware, or integrating into your own tools.

https://github.com/ggml-org/llama.cpp

Step-by-Step: Run Your First Model with Ollama

1. Install Ollama

Download from ollama.com and run the installer. On macOS and Windows it installs as a background service. On Linux, run:
```
curl -fsSL https://ollama.com/install.sh | sh
```
2. Download and run a model

Open your terminal and run one command. Ollama downloads the model automatically and starts a chat session:
```
ollama run qwen3:4b
```
This downloads Qwen3 4B (~3 GB), the recommended starter in 2026. Popular alternatives: ollama run qwen3:8b (stronger, suits 8 GB VRAM), ollama run deepseek-r1:7b (reasoning), and ollama run gemma3:4b (Google Gemma 3).
3. Start chatting

Type your prompt and press Enter. The model responds in the terminal. Type /bye to exit. The model stays cached for fast subsequent runs.
4. (Optional) Add a chat UI

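Ollama exposes a local API at http://localhost:11434, including OpenAI-compatible endpoints under /v1. Connect any compatible frontend; Open WebUI is a popular choice:
```
# The named volume keeps chats and settings when the container is recreated
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main
```
Then open http://localhost:3000 in your browser.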
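
The same local API can be called from your own scripts. The sketch below uses only the Python standard library and Ollama's documented `/api/generate` endpoint with streaming turned off, so the reply arrives as a single JSON object with a `response` field. The model name is the starter from step 2; the 120-second timeout and the helper names are assumptions for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "qwen3:4b") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "qwen3:4b", timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text.

    Requires the Ollama service to be running and the model to be downloaded.
    """
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the service running, `print(generate("Explain quantization in one sentence."))` prints the model's reply. The first call may be slow while the model loads into memory.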

Recommended Starter Models by Hardware

| Hardware Tier | Hardware Examples | Starter Model | Notes |
| --- | --- | --- | --- |
| 4–6 GB VRAM / CPU only | Integrated GPU, old GPU, CPU | Qwen3 4B Q4 / Phi-4-mini 3.8B Q4 / Gemma 3 4B Q4 | Small but capable; fast on modern CPUs |
| 8 GB VRAM | RTX 4060, RTX 3070, RTX 2080 | Llama 3.1 8B Q4_K_M / DeepSeek-R1-Distill-8B / Qwen3 8B | Best all-round 7–8B; fast, capable |
| 12 GB VRAM | RTX 4070, Arc B580 12GB | Qwen3 14B Q4_K_M / DeepSeek-R1-Distill-14B Q4 | Significant quality step up at 14B |
| 16 GB VRAM (NVIDIA) | RTX 4060 Ti 16GB, RTX 4070 Ti Super | Qwen3 14B Q8 / Llama 3.1 8B FP16 | Q8 quality at 14B is a big improvement, but both picks nearly fill 16 GB (8B FP16 weights alone are ~16 GB), so keep contexts short |
| 16 GB unified (Apple) | Mac mini M4 16GB, MacBook Air M4 16GB | Llama 3.1 8B Q8 / Qwen3 8B Q8 | Higher quality quant fits in 16 GB |
| 24 GB VRAM / unified | RTX 3090, RTX 4090, Mac mini M4 24GB | Qwen3 32B Q4_K_M / DeepSeek-R1-Distill-32B / Llama 3.3 70B Q4 (partial) | Large model territory |

All models above are available via ollama run <model>. See LLM Model Sizes Explained for a full VRAM breakdown.
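
If you want to script the choice, the table reduces to a simple threshold lookup. The sketch below encodes one pick per tier (the Qwen3 option from each row, an arbitrary choice for illustration) and returns the largest tier your memory clears. The `TIERS` list and `starter_model` function are hypothetical names, not part of any tool.

```python
# (minimum GB of VRAM or unified memory, starter model), largest tier first.
# One pick per row of the table; the other models in each row work equally well.
TIERS = [
    (24, "Qwen3 32B Q4_K_M"),
    (16, "Qwen3 14B Q8"),
    (12, "Qwen3 14B Q4_K_M"),
    (8, "Qwen3 8B"),
    (0, "Qwen3 4B Q4"),  # 4-6 GB VRAM or CPU only
]


def starter_model(vram_gb: float) -> str:
    """Return the starter model for the largest tier that vram_gb reaches."""
    for min_gb, model in TIERS:
        if vram_gb >= min_gb:
            return model
    raise ValueError("vram_gb must be non-negative")


print(starter_model(10))  # a 10 GB card falls back to the 8 GB tier
```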

Common Mistakes to Avoid

Downloading a model that is too large for your VRAM

Check VRAM requirements first. A 13B model at Q4 needs ~8.5 GB VRAM. If you only have 8 GB, it will load partially onto CPU and run very slowly. Use the VRAM Calculator before downloading.

Using FP16 when Q4_K_M is available

FP16 uses roughly twice the VRAM of Q8_0 and more than three times that of Q4_K_M, with minimal quality gain for most tasks. Start with Q4_K_M or Q8_0 quantized GGUF models. The quality loss is small; the VRAM savings are large.
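
To put numbers on the trade-off, the sketch below compares weight sizes for an 8B model across the three formats. The bits-per-weight values are assumptions: 16 for FP16, 8.5 for Q8_0 (8-bit values plus a per-block scale), and about 4.85 for Q4_K_M, which is an approximate average because that format mixes block types. Only the weights are counted, not context memory.

```python
# Approximate bits per weight for common GGUF formats (Q4_K_M is an average).
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}


def weights_gb(params_billion: float, quant: str) -> float:
    """Size of the model weights alone, in GB."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


fp16_size = weights_gb(8, "FP16")
for quant in BITS_PER_WEIGHT:
    size = weights_gb(8, quant)
    print(f"8B {quant}: {size:.2f} GB ({fp16_size / size:.1f}x smaller than FP16)")
```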

Expecting GPU speeds on a CPU-only system

CPU inference works, but it is slow. A 7B model on a modern CPU runs at 3–8 t/s: usable, not snappy. If you need interactive speed, you need a GPU or Apple Silicon.
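
Tokens per second is easier to judge as waiting time. The sketch below converts the speed ranges quoted in this guide (3–8 t/s on CPU, 30–50 t/s on an RTX 4060) into seconds for a 300-token answer. The 300-token length is an assumption chosen for illustration, and prompt processing time is ignored.

```python
def seconds_to_generate(tokens: int, tokens_per_second: float) -> float:
    """Time to generate `tokens` output tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


ANSWER_TOKENS = 300  # assumed length of a few-paragraph answer

for label, rate in [("CPU, slow end", 3), ("CPU, fast end", 8),
                    ("RTX 4060, slow end", 30), ("RTX 4060, fast end", 50)]:
    print(f"{label}: {seconds_to_generate(ANSWER_TOKENS, rate):.1f} s")
```

At the slow CPU end that answer takes over a minute and a half; on the GPU it arrives in ten seconds or less.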

Not leaving headroom in VRAM

A model that exactly fills your VRAM will crash or stall when it tries to allocate KV cache for longer contexts. Leave 1–2 GB free. A 7B model at Q4 uses ~5 GB — fine on 8 GB.
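
The headroom is mostly KV cache, which grows linearly with context length. A minimal sketch of the standard estimate follows, assuming an FP16 cache and using the published Llama 3.1 8B architecture (32 layers, 8 key/value heads, head dimension 128) as the worked example. Runtimes may quantize the cache or allocate extra buffers, so treat the result as approximate.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB.

    Each layer stores one key and one value vector per KV head per token,
    hence the factor of 2. bytes_per_value=2 assumes an FP16 cache.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1024**3


# Llama 3.1 8B: 32 layers, 8 KV heads, head dimension 128
for context in (8192, 32768):
    print(f"{context} tokens: {kv_cache_gib(32, 8, 128, context):.1f} GiB")
```

For this model an 8K context needs about 1 GiB and a 32K context about 4 GiB, which is why a model that loads fine can still run out of memory on long conversations.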

Picking the biggest model available instead of the most suitable

A 7B model fine-tuned for your task often beats a generic 13B. Check the model card. DeepSeek Coder 7B outperforms many larger general models on coding tasks.

Frequently Asked Questions

Do I need a GPU to run LLMs locally?

No, but a GPU makes a large difference. On CPU-only hardware, a 7B model runs at 3–8 tokens per second — usable but slow for interactive chat. With a GPU (even an RTX 4060 with 8 GB VRAM), the same model runs at 30–50 tokens per second. If you only have CPU, use llama.cpp with quantized GGUF models — it is the most CPU-efficient runtime.

What is the minimum VRAM to run an LLM locally?

You can run small models (1B–3B) on 4 GB VRAM at Q4 quantization. For a genuinely useful LLM experience, 8 GB VRAM (RTX 4060, RTX 3070) runs 7B models well. 12 GB (RTX 4070) opens up 13B models. More VRAM always means better options.

Can I run LLMs locally on CPU only?

Yes. llama.cpp and Ollama both support CPU-only inference. Performance is much slower — expect 3–8 tokens per second for a 7B model on a modern CPU. For occasional or non-interactive use this is fine. Apple Silicon Macs are a special case: the unified memory architecture delivers GPU-class performance without a discrete GPU.

Ready to pick hardware? Browse all hardware, use the VRAM Calculator, or read the budget hardware guide.

Related Guides

Ollama vs llama.cpp

Compare the two most popular local LLM runtimes side by side.

LM Studio Hardware Requirements

What hardware you need to run LM Studio smoothly.

Best GPUs for LLMs

Top GPU picks for running local AI models in 2025.

LLM RAM Requirements

How much RAM and VRAM you need for different model sizes.

Popular hardware for local LLMs

RTX 4060 (8 GB)

Budget pick. Runs 7B-8B models at 30-50 tok/s.

RTX 4060 Ti 16 GB

Sweet spot. Runs 13B-14B at full speed. Best value.

RTX 4090 (24 GB)

Top consumer GPU. Runs 70B models with offloading.

Sources & methodology

VRAM and tokens-per-second figures on this page are synthesised from open community benchmarks. The sitewide formula and the full source list are on the methodology page. For this guide specifically I leaned on:

Ollama project — install commands, model names, and the local REST API behaviour are taken straight from the upstream README and docs.
LM Studio — official site for the GUI workflow described in the "Choose Your Software" section.
llama.cpp project — the inference engine behind both; CPU-only behaviour and GGUF quantization names follow its conventions.
XiongjieDai GPU-Benchmarks-on-LLM-Inference — the "RTX 4060 → 30–50 t/s, CPU → 3–8 t/s" speed ranges are sanity-checked against community llama-bench runs here.

Spot a number that does not match the linked source? Email billybobgurr@gmail.com and I will update the guide.