How to Run Ollama on Ubuntu Server


Ollama turns any Linux server into a private AI inference endpoint. Whether you're running a homelab, a cloud VM, or a bare-metal Ubuntu box, this guide walks you through everything — from a fresh Ubuntu install to serving LLM completions over your network.

Prerequisites

  • A machine running Ubuntu 20.04, 22.04, or 24.04 (desktop or server edition)
  • sudo access
  • At least 8GB of RAM (16GB+ recommended for larger models)
  • (Optional) An NVIDIA or AMD GPU for hardware-accelerated inference


Step 1: Update Your System

Always start with a fully updated system to avoid dependency conflicts:

sudo apt update && sudo apt upgrade -y

Reboot if a kernel update was applied:

sudo reboot


Step 2: Install the Ollama CLI

Ollama provides an official install script that handles everything — binary download, PATH setup, and systemd service registration.
curl -fsSL https://ollama.com/install.sh | sh

The script prints its progress as it downloads the binary, creates a service user, and registers the systemd service.

Verify the installation:
ollama --version
# ollama version is 0.17.7

Prefer a manual install? Download the binary directly from github.com/ollama/ollama/releases, move it to /usr/local/bin/ollama, and make it executable with chmod +x /usr/local/bin/ollama.


Step 3: (Optional) Install NVIDIA GPU Drivers

Skip this step if you're running CPU-only inference. If you have an NVIDIA GPU, install the drivers and CUDA toolkit before proceeding — Ollama will detect and use them automatically.

Check if a GPU is present:
lspci | grep -i nvidia

Install the recommended NVIDIA driver:

sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot

After reboot, verify the driver is loaded:

nvidia-smi

You should see your GPU listed with its driver version and VRAM. Ollama will now offload model layers to the GPU automatically when you run a model.


Step 4: Start and Enable the Ollama Service

The installer registers Ollama as a systemd service that starts automatically on boot. Check its status:
sudo systemctl status ollama

If it's not already running, start and enable it:

sudo systemctl start ollama
sudo systemctl enable ollama

Confirm Ollama is listening on its default port:

curl http://localhost:11434
# Ollama is running


Step 5: Pull Your First Model

Now let's download an LLM. We'll use Llama 3.2 (3B) — a capable model that runs well on modest hardware:
ollama pull llama3.2

You'll see a progress bar as the model weights download (~2 GB). Once done, list your locally available models:

ollama list
# NAME               ID              SIZE      MODIFIED
# llama3.2:latest    91ab477bec9d    2.0 GB    5 seconds ago

Other popular models to try:

ollama pull mistral        # Mistral 7B — great all-rounder
ollama pull gemma2:2b      # Google Gemma 2 (2B) — fast and lightweight
ollama pull codellama      # Code-focused Llama variant
ollama pull llama3.1:70b   # Llama 3.1 70B — needs 40GB+ RAM or a large GPU

Browse the full library at ollama.com/library.

Step 6: Run the Model Interactively

Test your setup with an interactive chat session:
ollama run llama3.2

You'll get a prompt:

>>> Send a message (/? for help)

Try it:

>>> Summarise what Mulesoft is in two sentences.


Type /bye to exit. The model stays cached locally for future use.

Step 7: Use the REST API

Ollama's built-in REST API lets you integrate LLMs into your own applications. By default it's only accessible on localhost; we'll open it up to the network in the next step.

Generate a completion:
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.2",
    "prompt": "What is Mulesoft best used for?",
    "stream": false
  }'

Chat endpoint:
curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.2",
    "messages": [
      { "role": "user", "content": "Give me three reasons to self-host AI models." }
    ],
    "stream": false
  }'
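The chat endpoint is stateless: every request must carry the full message history, and you append the assistant's reply before sending the next turn. Here is a minimal Python sketch of that loop; the send function is a stand-in for a real HTTP POST to /api/chat (and its replies are placeholders, not model output), so you can follow the flow without a running server:

```python
import json

def chat_turn(history, user_text, send):
    """Append a user message, get a reply via `send`, and record it.

    `send` takes the request body Ollama's /api/chat expects
    ({"model", "messages", "stream"}) and returns the decoded JSON
    response; injected here so the flow is testable offline.
    """
    history.append({"role": "user", "content": user_text})
    body = {"model": "llama3.2", "messages": history, "stream": False}
    reply = send(body)["message"]   # {"role": "assistant", "content": ...}
    history.append(reply)           # keep the reply for the next turn
    return reply["content"]

# Stand-in for a real HTTP POST (illustrative replies, not model output).
def fake_send(body):
    n = sum(1 for m in body["messages"] if m["role"] == "user")
    return {"message": {"role": "assistant", "content": f"Reply {n}"}}

history = []
chat_turn(history, "Why self-host models?", fake_send)
chat_turn(history, "Summarise that.", fake_send)
print(len(history))  # 4: two user turns, two assistant replies
```

In a real client you would replace fake_send with a POST to http://localhost:11434/api/chat; everything else stays the same.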


List local models:
curl http://localhost:11434/api/tags
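Note that both /api/generate and /api/chat stream by default: instead of one JSON body, the server sends newline-delimited JSON, one object per chunk, with the text in the response field and "done": true on the final object. A small Python sketch of reassembling a streamed reply (the sample lines below are illustrative, not real model output):

```python
import json

# Illustrative NDJSON chunks, shaped like Ollama's streaming output:
# each line is a JSON object; the final one has "done": true.
sample_stream = [
    '{"model": "llama3.2", "response": "MuleSoft is an ", "done": false}',
    '{"model": "llama3.2", "response": "integration platform.", "done": false}',
    '{"model": "llama3.2", "response": "", "done": true}',
]

def assemble(lines):
    """Concatenate the "response" field of each streamed chunk."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

print(assemble(sample_stream))  # MuleSoft is an integration platform.
```

Setting "stream": false, as in the curl examples above, collapses this into a single JSON object and is simpler for scripts that don't need token-by-token output.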


Step 8: Expose Ollama to Your Network (Optional)

By default, Ollama only listens on 127.0.0.1. To make it accessible from other machines on your network, configure it to bind to all interfaces.
Edit the systemd service override:

sudo systemctl edit ollama

This opens a blank override file. Add the following:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"

Save, then reload and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Verify it's listening on all interfaces:

ss -tlnp | grep 11434
# LISTEN 0 128 0.0.0.0:11434 0.0.0.0:*

Now you can call the API from any machine on your network:

curl http://<your-server-ip>:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "Hello!", "stream": false}'


Security note: Ollama has no built-in authentication. If exposing it to a network, protect port 11434 with a firewall rule, a reverse proxy with auth (like Nginx + basic auth), or a VPN.


Step 9: (Optional) Set Up a Firewall Rule

If you're using ufw (Ubuntu's default firewall), restrict access to trusted IPs only:
# Allow only a specific IP to reach Ollama
sudo ufw allow from 192.168.1.50 to any port 11434

# Or allow the whole local subnet
sudo ufw allow from 192.168.1.0/24 to any port 11434

# Allow SSH first so enabling the firewall doesn't lock you out of the server
sudo ufw allow OpenSSH

sudo ufw enable
sudo ufw status

Step 10: (Optional) Add a Web UI with Open WebUI

Want a ChatGPT-style interface for your server? Open WebUI connects directly to your Ollama instance and just needs Docker:
# Install Docker if not already installed
sudo apt install -y docker.io
sudo systemctl enable --now docker

# Run Open WebUI connected to Ollama
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --add-host=host.docker.internal:host-gateway \
  --restart always \
  ghcr.io/open-webui/open-webui:main
Open your browser and navigate to http://<your-server-ip>:3000. You'll have a full chat UI backed by your local Ollama instance.



Managing the Ollama Service

Task              Command
Check status      sudo systemctl status ollama
Start             sudo systemctl start ollama
Stop              sudo systemctl stop ollama
Restart           sudo systemctl restart ollama
View logs         journalctl -u ollama -f
List models       ollama list
Remove a model    ollama rm <model-name>


Troubleshooting

Ollama service fails to start: Inspect the most recent service logs for the error:

journalctl -u ollama -n 50 --no-pager

curl returns "connection refused": Make sure the service is running: sudo systemctl status ollama

GPU not detected after driver install: Ensure you rebooted after the driver installation, then check nvidia-smi. Ollama reads GPU availability at startup.

Slow inference on CPU: This is expected — LLMs are computationally expensive. Try a smaller model like gemma2:2b or llama3.2 (3B) for faster responses without a GPU.

Port 11434 not reachable from another machine: Check that OLLAMA_HOST=0.0.0.0 is set in the systemd override and that your firewall allows the port.
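To narrow down where the failure is, a quick TCP check distinguishes "service not listening" from "firewall blocking the port". A small Python sketch using only the standard library:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unreachable
        return False

# Run this on the server first (checks the service itself), then from a
# client with your server's IP in place of 127.0.0.1 (checks the network path).
reachable = port_open("127.0.0.1", 11434)
```

If the local check succeeds but the remote one fails, the problem is the OLLAMA_HOST binding or the firewall, not Ollama itself.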