How to Run an LLM Locally on Windows & Mac
Running a large language model locally gives you privacy, predictable latency, and offline capability. This guide walks through deciding whether to host locally, verifying hardware/OS compatibility, choosing a model and precision, preparing files, and step-by-step Windows and Mac instructions.
- Quick answer to whether local hosting makes sense and the fastest route to a working setup.
- Practical hardware/OS checks and model/runtime recommendations for consumer machines.
- Concrete step-by-step install, quantize, run commands, plus pitfalls and a short implementation checklist.
Quick answer
Yes—if you need privacy, low latency, or offline access and have a modern GPU (or accept CPU-only tradeoffs), you can run many LLMs locally using quantized weights and a lightweight runtime (llama.cpp, GGML-based runtimes, or PyTorch with CPU/GPU). For best balance of speed and resource use, choose a quantized model (4-bit/8-bit) and a runtime that matches your OS and GPU capabilities.
Decide whether to run an LLM locally
Local hosting benefits: data privacy, faster inference without network hops, predictable cost, and full control over updates and extensions. Downsides: hardware requirements, maintenance burden, and limited model size or latency if you lack a GPU.
- Prefer local if you process sensitive data, need offline operation, or require deterministic latency.
- Prefer cloud if you need the largest models (multi-100B), quick scaling, or minimal maintenance.
- Hybrid option: run smaller local models for common queries and offload heavy tasks to cloud models.
Verify hardware & OS requirements (Windows & Mac)
Check CPU, RAM, disk, and GPU compatibility before choosing a model and runtime. Below are practical minimums and recommended specs for useful local LLM work.
| Tier | Use case | Minimum | Recommended |
|---|---|---|---|
| Basic (CPU only) | Small models, experimentation | 8 cores, 16 GB RAM, 20 GB disk | 12+ cores, 32 GB RAM, NVMe |
| Consumer GPU | Fast inference, 7B–13B models | RTX 2060 / 6–8 GB VRAM | RTX 3070 / 8–12 GB VRAM |
| High-end GPU | 13B–70B models, low latency | RTX 4090 / 24 GB VRAM | Ampere/Hopper class 24–48 GB VRAM |
OS notes:
- Windows: WSL2 + GPU passthrough (NVIDIA) or native GPU drivers work well; ensure CUDA/DirectML support for chosen runtime.
- Mac: Apple Silicon (M1/M2) can run efficient models via Metal-native runtimes (llama.cpp, GGML builds). Intel Macs can run CPU or external GPU setups but expect slower speeds.
- Disk: keep at least 2x model size free for downloads and temporary quantization files.
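The checks above can be scripted. Here is a minimal sketch using only the Python standard library; note that the physical-RAM lookup via `os.sysconf` works on Linux and macOS but not native Windows, hence the fallback:

```python
import os
import shutil

def hardware_report(path="."):
    """Collect basic capacity numbers relevant to local LLM hosting."""
    report = {"cpu_cores": os.cpu_count()}
    total, used, free = shutil.disk_usage(path)
    report["disk_free_gb"] = round(free / 1e9, 1)
    # Physical RAM via sysconf is available on Linux/macOS, not native Windows.
    try:
        page = os.sysconf("SC_PAGE_SIZE")
        pages = os.sysconf("SC_PHYS_PAGES")
        report["ram_gb"] = round(page * pages / 1e9, 1)
    except (ValueError, OSError, AttributeError):
        report["ram_gb"] = None  # on native Windows, check Task Manager instead
    return report

print(hardware_report())
```

Compare the numbers it prints against the table above before picking a model tier.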
Choose a model, precision, and runtime
Pick a model size that fits memory limits and desired accuracy. Precision (float32, float16, int8, 4-bit) affects memory and speed; lower precision is faster but slightly lower fidelity.
- Model families: Llama 2, MPT, Falcon, and other open-source LLMs each have tradeoffs in license, performance, and toolchain support.
- Precision: float32/16 for highest quality; int8/4-bit quantized formats (GGML, Q4_0/Q4_K_M, etc.) for best speed and low memory.
- Runtimes: llama.cpp / GGML for CPU & Metal; vLLM / Transformers + bitsandbytes for GPU; Ollama offers a simpler, app-like experience on macOS (and now Windows).
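A quick way to sanity-check whether a model fits in memory: the weights alone need roughly parameters × bytes per weight. The 1.2 overhead factor below is a rough assumption to cover KV cache and activations, not a precise figure:

```python
def model_memory_gb(n_params_billion: float, bits_per_weight: int,
                    overhead: float = 1.2) -> float:
    """Rough memory footprint: weights plus a ~20% margin (the overhead
    factor is an assumption covering KV cache and activations)."""
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

# A 7B model at different precisions:
for bits in (32, 16, 8, 4):
    print(f"7B @ {bits}-bit ~ {model_memory_gb(7, bits)} GB")
```

This shows why 4-bit quantization matters: a 7B model drops from roughly 17 GB at float16 to about 4 GB at 4-bit, which fits in 8 GB of VRAM with room for context.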
Install prerequisites (drivers, WSL, Python, Homebrew)
Install and verify drivers and tools specific to your OS and chosen runtime before downloading models.
Windows
- Install WSL2 (optional): enable WSL, install Ubuntu from Microsoft Store, set WSL2 as default. Useful for Linux-native runtimes.
- NVIDIA GPU: install the latest Game Ready or Studio drivers, plus the CUDA toolkit if using CUDA-based runtimes; verify with `nvidia-smi`.
- Install Python 3.10+ and pip; use virtualenv or conda for isolated environments.
Mac
- Apple Silicon: install Homebrew, then install dependencies. Use Metal-enabled builds (llama.cpp with metal backend) for best performance.
- Intel Mac: Homebrew + Python; expect slower CPU-only inference unless using external GPU solutions.
- Verify Homebrew with `brew --version` and Python with `python3 --version`.
Download, quantize, and prepare model files
Obtain model weights from the model distributor (obey licenses). Convert or quantize to a runtime-friendly format to reduce memory and increase throughput.
- Download: use official release pages or model hubs. Verify checksums when available.
- Convert: many models provide converters (e.g., Hugging Face -> GGML). Use the correct converter for your runtime.
- Quantize: run quantization scripts (e.g., the llama.cpp `quantize` utility, or bitsandbytes quantization). Keep the original files until you confirm the quantized model works.
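The checksum step can be done with the standard library. A sketch that streams the file in chunks, so multi-GB weight files never need to fit in memory:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a (potentially multi-GB) model file through SHA-256."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, expected_hex: str) -> bool:
    """Compare against the digest published on the model's release page."""
    return sha256_of(path) == expected_hex.lower()
```

Run `verify("./models/model.bin", "<published digest>")` after every download; a mismatch usually means a truncated or corrupted file.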
| Format | Runtime | Notes |
|---|---|---|
| GGML (.bin) | llama.cpp, ggml tools | Fast CPU/Metal inference, good for low-memory machines |
| PyTorch (.pt, .bin) | Transformers, vLLM | Flexible; often converted for efficient inference |
| GGUF | Modern runtimes | Metadata-rich, supports quantized variants |
Run the model: Windows & Mac step-by-step
Below are concise example flows: a typical Windows (WSL/CUDA) path and a Mac (Apple Silicon + llama.cpp) path. Adjust paths, model names, and flags to your environment.
Windows (WSL + CUDA) quick flow
- Open a WSL terminal (Ubuntu) and create a venv: `python3 -m venv llm-venv && source llm-venv/bin/activate`
- Install PyTorch: `pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118` (match your CUDA version).
- Install transformers and a runtime: `pip install transformers accelerate`, or follow your runtime's docs (vLLM, GGUF support).
- Place the quantized model in a folder and run a sample script: `python run_inference.py --model ./models/your-model`
- For llama.cpp builds on WSL, build from source, copy in the GGML `.bin`, and run: `./main -m ./models/model.ggml.bin -p "Hello world"`
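The sample script named above, `run_inference.py`, is not a standard tool; a minimal sketch of what it might contain, assuming a Transformers-format model and the `transformers`/`accelerate` packages installed earlier (all argument names besides `--model` are illustrative):

```python
import argparse

def parse_args(argv=None):
    p = argparse.ArgumentParser(description="Minimal local inference script")
    p.add_argument("--model", required=True, help="Path to model directory")
    p.add_argument("--prompt", default="Hello", help="Prompt text")
    p.add_argument("--max-new-tokens", type=int, default=128)
    return p.parse_args(argv)

def run(args):
    # Heavy imports are deferred so bad arguments fail fast.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tok = AutoTokenizer.from_pretrained(args.model)
    model = AutoModelForCausalLM.from_pretrained(args.model, device_map="auto")
    inputs = tok(args.prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=args.max_new_tokens)
    print(tok.decode(out[0], skip_special_tokens=True))

# Typical invocation: run(parse_args())
```

`device_map="auto"` (from `accelerate`) places layers on the GPU when one is available and falls back to CPU otherwise.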
Mac (Apple Silicon) quick flow
- Install Homebrew, then dependencies:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" - Clone and build llama.cpp with Metal support:
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make - Download or convert a GGML model to a
.bincompatible with llama.cpp. - Run:
./main -m ./models/model.ggml.bin -p "Summarize the following:" - For Python-based deployments, install native wheels if available or use Conda with CPU builds.
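The llama.cpp invocations above can also be scripted from Python. A small wrapper sketch, assuming the `main` binary and the `-m`/`-n`/`-p` flags shown in the commands here (`build_llama_cmd` and `run_llama` are hypothetical helper names):

```python
import shlex
import subprocess

def build_llama_cmd(binary: str, model: str, prompt: str,
                    n_tokens: int = 128) -> list[str]:
    """Assemble an argument list for the llama.cpp `main` binary,
    using the -m / -n / -p flags shown above."""
    return [binary, "-m", model, "-n", str(n_tokens), "-p", prompt]

def run_llama(binary: str, model: str, prompt: str,
              n_tokens: int = 128) -> str:
    """Invoke the binary and return its stdout; raises on a non-zero exit."""
    cmd = build_llama_cmd(binary, model, prompt, n_tokens)
    print("Running:", shlex.join(cmd))
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return out.stdout
```

Passing the argument list (rather than a single shell string) avoids quoting problems when prompts contain spaces or quotes.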
Example minimal inference command (llama.cpp): `./main -m ./models/7B.ggml.q4_0.bin -n 128 -b 8 -p "Write a haiku about rain"`
Common pitfalls and how to avoid them
- Insufficient RAM/VRAM — avoid by choosing a smaller model or more aggressive quantization; monitor usage with `nvidia-smi` or Activity Monitor.
- Mismatched drivers/CUDA versions — match the CUDA toolkit to your GPU driver; check the runtime docs for supported versions.
- Wrong file formats — Verify you converted to the runtime’s expected format (GGML/GGUF vs PyTorch).
- Slow token generation — Use batching, shorter context windows, or quantized models; consider compiled runtimes like llama.cpp.
- License noncompliance — Confirm model license allows local hosting and your intended use.
Implementation checklist
- Decide local vs cloud based on privacy, latency, and scale needs.
- Verify hardware (CPU cores, RAM, disk, GPU VRAM) and OS compatibility.
- Choose model family, size, and quantization level.
- Install OS prerequisites: drivers, WSL (Windows), Homebrew (Mac), Python env.
- Download model weights and verify checksums.
- Convert/quantize to runtime format and keep originals as backup.
- Run sample inference and measure latency/memory; iterate on model/precision.
- Document setup and license terms; secure model files and runtime environment.
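For the "measure latency" step in the checklist, one option is a small harness that times any `generate(prompt)` callable. In this sketch the token count is approximated by whitespace splitting, a rough stand-in for a real tokenizer:

```python
import time

def measure_generation(generate, prompt: str, runs: int = 3) -> dict:
    """Time a generate(prompt) callable over several runs and report
    average latency and an approximate tokens-per-second figure."""
    latencies, tokens = [], 0
    for _ in range(runs):
        start = time.perf_counter()
        text = generate(prompt)
        latencies.append(time.perf_counter() - start)
        tokens += len(text.split())  # crude token proxy
    total = sum(latencies)
    return {"avg_latency_s": total / runs,
            "approx_tokens_per_s": tokens / total if total else 0.0}
```

Wrap whichever backend you chose (a Transformers `generate` call, or a subprocess invoking llama.cpp) in a one-argument function and pass it in; re-run after each model or precision change to quantify the tradeoff.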
FAQ
- Q: Can I run a 70B model on a consumer GPU?
  A: Not typically; 70B models usually need >40–80 GB VRAM or multi-GPU setups. Use quantized smaller models, or the cloud for large models.
- Q: Is quantized output noticeably worse?
  A: Modern 4-bit/8-bit quantization often preserves useful quality for many tasks; evaluate on your own prompts.
- Q: Which runtime is simplest for macOS?
  A: llama.cpp or Ollama provide the easiest paths for Apple Silicon with Metal support.
- Q: How do I keep the model secure on disk?
  A: Use disk encryption, restrict filesystem permissions, and track who can access model folders.
- Q: Where do I find model converters?
  A: Check the model hub repository or runtime project (llama.cpp, Hugging Face transformers) for official converters and scripts.

