How to Use Ollama Locally: Install, Run, and Optimize LLMs
Ollama provides a local-first runtime for large language models (LLMs), letting you download, run, and serve models on your machine or infrastructure. This guide walks through setup, model pull and run commands, automation patterns, and performance tuning so you can deploy LLMs reliably.
- Quick, actionable commands to install Ollama and verify it works.
- How to pull and manage model versions, storage guidance, and tags.
- Run models via CLI and API, automate with scripts/containers, and optimize resource use.
Quick answer (one paragraph)
Ollama installs from platform packages (a macOS app or Homebrew formula, a Linux install script, or the official Docker image). You then download models with `ollama pull <model>`, run them with `ollama run <model>` or the local REST API, and optimize by pinning model tags, allocating GPU memory, and caching model files for CI. Verify with `ollama list` and a request to the local API; automate via scripts or Docker for repeatable environments.
Prepare environment and prerequisites
Check system requirements: CPU cores, RAM, disk, and optional GPU support. Ollama runs on x86_64 and ARM (M1/M2) but model size and GPU support affect feasibility.
- Recommended minimums: 8 CPU cores, 16–32 GB RAM, 100+ GB disk for multiple models.
- GPU: NVIDIA (CUDA) or AMD (ROCm) on Linux and Windows; Apple Silicon uses Metal acceleration automatically. Check that your driver version matches what the runtime expects.
- Networking: allow outbound for initial model pulls if using remote registries; local-only workflows are supported once models are cached.
- Permissions: install with user or root depending on OS package manager; ensure Docker or container runtimes are installed if you plan to use containers.
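The checks above can be scripted as a quick preflight. This is a sketch using only the Python standard library; the thresholds mirror the recommended minimums in this guide and should be adjusted for your target models (tested on Linux/macOS, where `os.sysconf` exposes page and memory counts):

```python
# preflight.py — sanity-check CPU, RAM, and disk before installing Ollama.
import os
import shutil

def preflight(min_cpus=8, min_ram_gb=16, min_disk_gb=100, path="/"):
    """Return {check: (measured value, passed?)} for each requirement."""
    cpus = os.cpu_count() or 0
    ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    disk_gb = shutil.disk_usage(path).free / 1e9
    return {
        "cpus": (cpus, cpus >= min_cpus),
        "ram_gb": (round(ram_gb, 1), ram_gb >= min_ram_gb),
        "free_disk_gb": (round(disk_gb, 1), disk_gb >= min_disk_gb),
    }

if __name__ == "__main__":
    for check, (value, ok) in preflight().items():
        print(f"{check}: {value} {'OK' if ok else 'LOW'}")
```

Run it before installing; any `LOW` line tells you which resource to fix (or which smaller model to target) first.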
Install Ollama: download, install, verify
Install methods vary by OS; pick the one that matches your environment.
- macOS: download the Ollama app from ollama.com, or install the CLI with Homebrew: `brew install ollama`.
- Linux: run the official install script: `curl -fsSL https://ollama.com/install.sh | sh`.
- Docker: run the official image for a containerized runtime: `docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama`.
After installing, verify the binary and basic health:
ollama --version
ollama list
ollama ps
Expect a short version string, a table of locally cached models (empty on a fresh install), and a list of currently loaded models. If the commands report a connection error, the server is not running; start it with `ollama serve` (the desktop app and the Linux systemd service normally start it for you).
Pull models: commands, tags, and storage tips
Pulling stores model artifacts locally so you can run offline. Use tags to lock versions and reduce surprises.
- Basic pull: `ollama pull llama2`
- Specify a tag/variant: `ollama pull llama2:13b`
- List local models: `ollama list`
| Command | When to use |
|---|---|
| `ollama pull llama2` | Latest tag from the default registry |
| `ollama pull llama2:13b` | Pin a specific tag/variant |
| `OLLAMA_MODELS=/mnt/models ollama serve` | Store models in a custom location |
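To keep every machine on the same pinned versions, pulls can be driven from a small manifest. The one-`name:tag`-per-line format here (with `#` comments) is a convention invented for this sketch, not an Ollama feature:

```python
# sync_models.py — pull a pinned set of models so hosts stay in sync.
import os
import shutil
import subprocess

def parse_manifest(text: str) -> list[str]:
    """Return model references from a manifest, skipping blanks and comments."""
    models = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            models.append(line)
    return models

# Only attempt pulls when the CLI and manifest are actually present.
if __name__ == "__main__" and shutil.which("ollama") and os.path.exists("models.txt"):
    with open("models.txt") as fh:
        for model in parse_manifest(fh.read()):
            subprocess.run(["ollama", "pull", model], check=True)
```

Check `models.txt` into version control next to your deployment manifests so the pinned tags are reviewed like any other dependency change.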
Storage tips:
- Use fast NVMe for model storage to reduce load times.
- Keep large models on dedicated disks and point Ollama at them with the `OLLAMA_MODELS` environment variable.
- Prune unused models periodically with `ollama rm <model>`, or script retention policies.
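A retention script can automate the pruning. The sketch below parses `ollama list` output (a header row, then the model name in the first column — an assumption based on current releases; verify against your installed version) and removes anything not in an allowlist:

```python
# prune_models.py — remove cached models that are not in an allowlist.
import shutil
import subprocess

def models_to_prune(list_output: str, keep: set[str]) -> list[str]:
    """Parse `ollama list` output; return model names not in `keep`."""
    lines = list_output.strip().splitlines()
    names = [ln.split()[0] for ln in lines[1:] if ln.strip()]  # skip header row
    return [n for n in names if n not in keep]

# Only touch the local cache when the CLI is actually available.
if __name__ == "__main__" and shutil.which("ollama"):
    out = subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout
    for name in models_to_prune(out, keep={"llama2:latest", "mistral:latest"}):
        subprocess.run(["ollama", "rm", name], check=True)
```

Run it from cron or a CI cleanup stage; keeping the parse step separate makes the allowlist logic testable without a live installation.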
Run models: CLI, API examples, and flags
Run interactively, in the background, or as a service. Ollama exposes a CLI and a local HTTP API for programmatic use.
CLI: quick interactive run
ollama run llama2
# or one-shot, passing the prompt as an argument
ollama run llama2 "Summarize this text."
Useful options:
- `--verbose` prints timing statistics (load duration, tokens per second) after each response.
- Inline file content with shell substitution: `ollama run llama2 "Summarize: $(cat notes.txt)"`.
- Set the `OLLAMA_HOST` environment variable before `ollama serve` to change the bind address or port.
- For long-running services, run `ollama serve` under systemd or in a container instead of backgrounding it by hand.
REST API example
Start the server (if not already):
ollama serve  # listens on 127.0.0.1:11434 by default; override with OLLAMA_HOST
Then curl a request:
curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"llama2","prompt":"Write a one-paragraph summary.","stream":false}'
Handle responses as JSON; with `"stream": false` you get a single object whose `response` field holds the generated text. Add authentication (for example, a reverse proxy) before exposing the API beyond localhost, and use the `options` field (`temperature`, `num_predict`) to control randomness and output length.
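For programmatic use, the same request can be made from Python with only the standard library. This sketch assumes the default server on localhost:11434 and the non-streaming response shape described above:

```python
# generate.py — minimal client for Ollama's /api/generate endpoint.
import json
import urllib.request

def build_request(model: str, prompt: str, temperature: float = 0.7,
                  num_predict: int = 256) -> dict:
    """Payload for /api/generate; options bound output length and randomness."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object instead of a token stream
        "options": {"temperature": temperature, "num_predict": num_predict},
    }

def generate(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """POST the request and return the generated text."""
    body = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Call `generate("llama2", "...")` once the server is up; `build_request` is split out so payload construction can be unit-tested without a running server.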
Automate: scripts, containers, and CI integration
Automation makes environments reproducible and CI-friendly.
- Scripts: wrap install, pull, and health checks in shell scripts with exit codes for CI.
- Containers: use an Ollama base image, copy model artifacts or pull at build/start, and expose ports. Keep images small by pulling models at runtime or mounting external volumes.
- CI integration: cache downloaded model layers between runs, run lightweight smoke tests (`ollama list`, a request to `/api/generate`), and fail fast on errors.
# example CI job (pseudo)
- install ollama
- ollama serve &            # or run the server as a service/container
- ollama pull my-model:1.0.0
- run smoke test: curl /api/generate
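In CI, the server needs a moment to come up before the smoke test runs; a small readiness poll avoids flaky jobs. The endpoint and timeout below are assumptions to adapt:

```python
# wait_ready.py — block a CI job until the Ollama server answers.
import time
import urllib.error
import urllib.request

def wait_for_server(url: str = "http://localhost:11434", timeout: float = 60.0,
                    interval: float = 1.0) -> bool:
    """Poll `url` until it responds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True  # any HTTP response means the server is up
        except (urllib.error.URLError, OSError):
            time.sleep(interval)  # not ready yet; retry
    return False
```

Exit the job with a non-zero status when it returns `False`, so the pipeline fails fast instead of timing out on the generate call.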
For multi-node deployments, synchronize model versions across nodes (tags) and use shared storage or an internal registry to reduce redundant downloads.
Optimize performance: resources, GPU, and caching
Tune system resources and Ollama settings to get predictable latency and throughput.
- GPU: ensure drivers and CUDA/ROCm versions match model requirements; the `num_gpu` model option controls how many layers are offloaded to the GPU.
- Memory: monitor RAM during warm-up and set limits to avoid OOM kills; `OLLAMA_KEEP_ALIVE` controls how long an idle model stays loaded, and `OLLAMA_MAX_LOADED_MODELS` caps how many are resident at once.
- Concurrency: `OLLAMA_NUM_PARALLEL` limits simultaneous requests per model; use multiple smaller replicas for throughput.
- Caching: keep hot models on NVMe and in memory where possible; a model already loaded answers far faster than one read cold from disk.
| Area | Action |
|---|---|
| Startup time | Use local NVMe and pre-warm model at boot |
| Latency | Allocate GPU memory and reduce token limits |
| Throughput | Batch requests or run multiple replicas |
Measure with simple benchmarks: fixed prompts at varying concurrency, tracking latency (p50/p95) and GPU/CPU utilization. Adjust batch sizes and model threads accordingly.
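A minimal harness for those p50/p95 measurements might look like this; `request_fn` is any callable that performs one generation (for example, a wrapper around `/api/generate`), injected so the harness itself stays testable without a live server:

```python
# bench.py — fixed-prompt latency benchmark at a given concurrency level.
import time
from concurrent.futures import ThreadPoolExecutor

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a list of latencies."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, max(0, round(pct / 100 * len(ordered)) - 1))
    return ordered[idx]

def benchmark(request_fn, concurrency: int, requests: int) -> dict:
    """Issue `requests` calls across `concurrency` workers; report p50/p95."""
    def timed(_):
        start = time.perf_counter()
        request_fn()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, range(requests)))
    return {"p50": percentile(latencies, 50), "p95": percentile(latencies, 95)}
```

Sweep `concurrency` (1, 2, 4, 8, ...) while watching GPU/CPU utilization; the point where p95 climbs sharply is your practical concurrency limit for that model.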
Common pitfalls and how to avoid them
- Insufficient disk space — Remedy: monitor disk and use quotas; store models on larger disks and prune unused models.
- GPU driver mismatch — Remedy: verify CUDA/cuDNN versions and update drivers; test with a small GPU-accelerated model first.
- Unpinned model versions causing drift — Remedy: use explicit tags or digests when pulling; include model tag in deployment manifests.
- Exposing API without auth — Remedy: bind to localhost, use reverse proxy with auth, or enable TLS and token auth for external access.
- CI re-downloading models each run — Remedy: cache model layers/artifacts in CI and mount persistent storage for runners.
Implementation checklist
- Verify system prerequisites (CPU, RAM, disk, GPU drivers).
- Install Ollama and confirm with `ollama --version` and `ollama list`.
- Pull required models with pinned tags and confirm with `ollama list`.
- Run a smoke test via CLI and the REST API.
- Automate via scripts or container images; add CI caching for models.
- Implement monitoring, logging, and resource limits; tune GPU/memory settings.
- Document model versions and retention policy; schedule pruning.
FAQ
- Q: Can I run Ollama offline after pulling models?
- A: Yes—once models are pulled and cached locally, Ollama can run fully offline.
- Q: How do I use a GPU with Ollama?
- A: Install matching NVIDIA drivers and CUDA (or ROCm for AMD); Ollama detects and uses the GPU automatically, falling back to CPU if none is available.
- Q: How can I pin model versions to avoid unexpected updates?
- A: Pull models with explicit tags or digests (e.g., `llama2:13b`) and reference those in deployment manifests.
- Q: Is it safe to expose Ollama’s API to the internet?
- A: Not without proper authentication and TLS. Prefer internal networks, reverse proxies, and token/TLS protections if external access is required.
- Q: How do I reduce startup latency for large models?
- A: Use fast NVMe storage, pre-warm models at boot, pin them in memory if possible, and reduce unnecessary I/O by keeping artifacts locally.
