Ollama 101: Install, Pull, Run, Repeat

Ollama provides a local-first runtime for large language models (LLMs), letting you download, run, and serve models on your machine or infrastructure. This guide walks through setup, model pull and run commands, automation patterns, and performance tuning so you can deploy LLMs reliably.

  • Quick, actionable commands to install Ollama and verify it works.
  • How to pull and manage model versions, storage guidance, and tags.
  • Run models via CLI and API, automate with scripts/containers, and optimize resource use.

Quick answer (one paragraph)

Install Ollama with your platform's method (Homebrew on macOS, the official install script on Linux, or the Docker image), pull models with ollama pull <model>, run them with ollama run <model> or the local REST API, and optimize by pinning model tags, giving the runtime GPU access, and caching model files for CI. Verify with ollama list and the health endpoint at http://localhost:11434/; automate via scripts or Docker for repeatable environments.

Prepare environment and prerequisites

Check system requirements: CPU cores, RAM, disk, and optional GPU support. Ollama runs on x86_64 and ARM (including Apple silicon), but model size and GPU support determine what is practical.

  • Recommended minimums: 8 CPU cores, 16–32 GB RAM, 100+ GB disk for multiple models.
  • GPU: NVIDIA (CUDA) or AMD (ROCm) on Linux and Windows; on Apple silicon, Ollama uses Metal automatically. Check each model's size against available VRAM.
  • Networking: allow outbound access for initial model pulls; local-only workflows work once models are cached.
  • Permissions: install as user or root depending on the OS package manager; ensure Docker or another container runtime is installed if you plan to use containers.
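The minimums above can be sanity-checked from the shell on Linux; this is a sketch, and the warning thresholds simply echo this guide's recommendations:

```shell
#!/usr/bin/env sh
# Quick prerequisite check (Linux): CPU cores, RAM, free disk, optional GPU.
cores=$(nproc)
ram_gb=$(awk '/MemTotal/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)
disk_gb=$(df -BG --output=avail / | tail -1 | tr -dc '0-9')
echo "cores=${cores} ram=${ram_gb}GB free_disk=${disk_gb}GB"
[ "$cores" -ge 8 ]     || echo "warn: fewer than 8 CPU cores"
[ "$ram_gb" -ge 16 ]   || echo "warn: less than 16 GB RAM"
[ "$disk_gb" -ge 100 ] || echo "warn: less than 100 GB free disk"
command -v nvidia-smi >/dev/null && nvidia-smi -L || echo "no NVIDIA GPU detected"
```

Run it before installing to catch undersized machines early; the GPU line only reports NVIDIA hardware, so adapt it for ROCm systems.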

Install Ollama: download, install, verify

Install methods vary by OS; pick the one that matches your environment.

  • macOS: brew install ollama (Homebrew), or download the desktop app from ollama.com.
  • Linux: curl -fsSL https://ollama.com/install.sh | sh installs the binary and sets up a systemd service.
  • Docker: docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama for a containerized runtime (add --gpus=all for NVIDIA GPUs).

After installing, verify the binary and basic health:

ollama --version
ollama list
curl -s http://localhost:11434/

Expect a short version string, a (possibly empty) model table, and the reply "Ollama is running" from the health check. If the commands hang or fail, make sure the server is actually running (ollama serve manually, or the systemd service/desktop app) and that GPU drivers are installed.

Pull models: commands, tags, and storage tips

Pulling stores model artifacts locally so you can run offline. Use tags to lock versions and reduce surprises.

  • Basic pull: ollama pull llama3.2 (fetches the :latest tag from the Ollama library)
  • Specify tag/version: ollama pull llama3.2:3b
  • List local models: ollama list

Model pull examples

  Command                                  When to use
  ollama pull llama3.2                     Default library model (latest tag)
  ollama pull llama3.2:3b                  Pin a specific size/quantization tag
  OLLAMA_MODELS=/mnt/models ollama serve   Serve models from a custom storage location

Note that there is no pull-time path flag; the storage location is set on the server via the OLLAMA_MODELS environment variable.
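To enforce tag pinning in scripts, a small guard can refuse unpinned names before pulling. This is a sketch; the model names are illustrative:

```shell
# Refuse model names without an explicit tag so nothing drifts onto :latest.
require_tag() {
  case "$1" in
    *:*) return 0 ;;
    *)   echo "refusing unpinned model: $1" >&2; return 1 ;;
  esac
}
# usage: require_tag "llama3.2:3b" && ollama pull "llama3.2:3b"
```

Wiring this into CI turns "someone pulled :latest" from a silent drift into a failed job.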

Storage tips:

  • Use fast NVMe for model storage to reduce load times.
  • Keep large models on a dedicated disk and point Ollama at it with the OLLAMA_MODELS environment variable.
  • Prune unused models periodically: ollama rm <model>, or script a retention policy.
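A retention policy can be scripted around ollama list. This sketch prints models outside an allowlist so they can be piped to ollama rm; the KEEP list and the assumption that the model name is the first column are yours to adapt:

```shell
# Print locally stored models that are NOT in the KEEP allowlist.
# Pipe the output to `xargs -rn1 ollama rm` to actually prune.
KEEP="llama3.2:latest mistral:7b"
prune_candidates() {
  tail -n +2 | awk '{print $1}' | while read -r name; do
    case " $KEEP " in
      *" $name "*) ;;                # allow-listed: keep
      *) printf '%s\n' "$name" ;;    # candidate for removal
    esac
  done
}
# usage: ollama list | prune_candidates
```

Printing candidates instead of deleting directly keeps the script safe to dry-run.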

Run models: CLI, API examples, and flags

Run interactively, in the background, or as a service. Ollama exposes a CLI and a local HTTP API for programmatic use.

CLI: quick interactive run

ollama run llama2
# or pass the prompt as an argument
ollama run llama2 "Summarize this text."
# or pipe it from a file
ollama run llama2 < prompt.txt

Useful options and environment variables:

  • The prompt is a positional argument or stdin; there is no --prompt flag.
  • OLLAMA_HOST: bind the server to a different address or port (e.g. OLLAMA_HOST=127.0.0.1:8080 ollama serve).
  • OLLAMA_MODELS: relocate model storage.
  • OLLAMA_KEEP_ALIVE: how long a model stays loaded after its last request.
  • For long-running services, run ollama serve under systemd or in a container rather than backgrounding it by hand.

REST API example

Start the server (if not already):

ollama serve
# listens on 127.0.0.1:11434 by default; set OLLAMA_HOST to change

Then curl a request:

curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"llama2","prompt":"Write a one-paragraph summary.","stream":false}'

The API streams newline-delimited JSON by default; set "stream": false to get a single JSON object. Add authentication (for example a reverse proxy) before exposing the port beyond localhost, and use the options object ("num_predict", "temperature") to control output length and randomness.
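One way to handle a non-streaming reply is a tiny parser helper; this sketch uses python3 rather than assuming jq is installed, and relies on the "response" field of /api/generate:

```shell
# Extract the generated text from a non-streaming /api/generate reply.
extract_response() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["response"])'
}
# usage (assumes the server is running and the model is pulled):
# curl -s http://localhost:11434/api/generate \
#   -d '{"model":"llama2","prompt":"Say hi.","stream":false}' | extract_response
```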

Automate: scripts, containers, and CI integration

Automation makes environments reproducible and CI-friendly.

  • Scripts: wrap install, pull, and health checks in shell scripts with exit codes for CI.
  • Containers: use an Ollama base image, copy model artifacts or pull at build/start, and expose ports. Keep images small by pulling models at runtime or mounting external volumes.
  • CI integration: cache downloaded model layers between runs, run lightweight smoke tests (ollama list, ollama serve, basic generate), and fail fast on errors.
# example CI job steps (shell; model name illustrative)
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
sleep 5   # crude wait; poll the health endpoint in real jobs
ollama pull llama3.2:3b
curl -sf http://localhost:11434/api/generate \
  -d '{"model":"llama3.2:3b","prompt":"ping","stream":false}' || exit 1

For multi-node deployments, synchronize model versions across nodes (tags) and use shared storage or an internal registry to reduce redundant downloads.
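For the shared-storage pattern, a systemd drop-in can point every node's Ollama service at the same model directory. The paths below are illustrative:

```ini
# /etc/systemd/system/ollama.service.d/override.conf (illustrative path)
[Service]
Environment="OLLAMA_MODELS=/shared/models"
Environment="OLLAMA_HOST=0.0.0.0:11434"
```

Reload with systemctl daemon-reload and restart the service; binding to 0.0.0.0 exposes the API on the network, so pair it with the auth guidance above.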

Optimize performance: resources, GPU, and caching

Tune system resources and Ollama settings to get predictable latency and throughput.

  • GPU: ensure drivers meet Ollama's requirements. Ollama uses a detected GPU automatically; the num_gpu option controls how many layers are offloaded.
  • Memory: monitor RAM during warm-up; set limits to avoid OOM kills.
  • Concurrency: cap simultaneous requests and loaded models (OLLAMA_NUM_PARALLEL, OLLAMA_MAX_LOADED_MODELS) to avoid contention; use multiple smaller replicas for throughput.
  • Caching: keep hot models loaded (OLLAMA_KEEP_ALIVE) and store artifacts on NVMe.
Performance knobs

  Area          Action
  Startup time  Use local NVMe and pre-warm the model at boot
  Latency       Allocate GPU memory and reduce token limits
  Throughput    Batch requests or run multiple replicas

Measure with simple benchmarks: fixed prompts at varying concurrency, tracking latency (p50/p95) and GPU/CPU utilization. Adjust batch sizes and model threads accordingly.
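A benchmark like that can start from a small timing harness; this sketch reports total wall-clock seconds at second granularity (a simplification; the bench name and usage command are illustrative):

```shell
# Run a command N times and print total wall-clock seconds.
bench() {
  n=$1; shift
  start=$(date +%s)
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" >/dev/null 2>&1
    i=$((i + 1))
  done
  echo $(( $(date +%s) - start ))
}
# usage: bench 20 curl -s http://localhost:11434/api/generate \
#          -d '{"model":"llama2","prompt":"ping","stream":false}'
```

Run it at several concurrency levels (e.g. backgrounded in a loop) and watch GPU/CPU utilization alongside; for p50/p95 you would record per-request times instead of a total.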

Common pitfalls and how to avoid them

  • Insufficient disk space — Remedy: monitor disk and use quotas; store models on larger disks and prune unused models.
  • GPU driver mismatch — Remedy: verify driver versions meet Ollama's requirements and update as needed; test with a small GPU-accelerated model first and check that it actually runs on the GPU.
  • Unpinned model versions causing drift — Remedy: use explicit tags or digests when pulling; include model tag in deployment manifests.
  • Exposing API without auth — Remedy: bind to localhost, use reverse proxy with auth, or enable TLS and token auth for external access.
  • CI re-downloading models each run — Remedy: cache model layers/artifacts in CI and mount persistent storage for runners.

Implementation checklist

  • Verify system prerequisites (CPU, RAM, disk, GPU drivers).
  • Install Ollama and confirm with ollama --version and the health endpoint.
  • Pull required models with pinned tags and confirm with ollama list.
  • Run a smoke test via CLI and the REST API.
  • Automate via scripts or container images; add CI caching for models.
  • Implement monitoring, logging, and resource limits; tune GPU/memory settings.
  • Document model versions and retention policy; schedule pruning.

FAQ

Q: Can I run Ollama offline after pulling models?
A: Yes—once models are pulled and cached locally, Ollama can run fully offline.
Q: How do I use a GPU with Ollama?
A: Install matching NVIDIA (or AMD ROCm) drivers and make sure Ollama can see the GPU (for Docker, add --gpus=all); Ollama then offloads model layers to the GPU automatically.
Q: How can I pin model versions to avoid unexpected updates?
A: Pull models with explicit tags (e.g., llama3.2:3b) rather than relying on :latest, and reference those tags in deployment manifests.
Q: Is it safe to expose Ollama’s API to the internet?
A: Not without proper authentication and TLS. Prefer internal networks, reverse proxies, and token/TLS protections if external access is required.
Q: How do I reduce startup latency for large models?
A: Use fast NVMe storage, pre-warm models at boot, pin them in memory if possible, and reduce unnecessary I/O by keeping artifacts locally.