
What is Ollama, and How to Deploy It in an Enterprise Data Stack?

Last updated on
May 12, 2026

What is Ollama?

Ollama is a tool designed for running large language models such as Llama 2 in local environments. It packages model weights, configuration, and supporting data into a single module, which simplifies the often complex work of setting up and configuring these models, GPU optimization in particular. Developers and researchers can run models locally without intricate setup, which makes advanced models more accessible.

Why is Ollama better on Shakudo?

Ollama Knowledge Base

What is Ollama?

Ollama is an open-source tool that lets you run large language models (LLMs) locally on your own infrastructure. Instead of sending data to external API providers, Ollama runs models like Llama, Mistral, and Phi directly on your servers — keeping your data private and your costs predictable.

For enterprises, this matters for three reasons:

  • Data privacy: no data leaves your environment. Prompts, responses, and documents never touch a third-party API.
  • Offline inference: models run without internet access, which suits air-gapped or compliance-restricted environments.
  • No per-token costs: once deployed, you pay for the infrastructure rather than per request, so there are no surprise API bills and no provider rate limits.

Ollama handles model downloading, loading, and serving through a simple REST API. It supports both CPU and GPU inference and exposes an OpenAI-compatible API, so apps built for OpenAI can often switch to Ollama by changing the base URL and model name.

Key Features

  • One-command model running — Pull and run models with simple commands (ollama run llama3.1)
  • OpenAI-compatible API — Use existing OpenAI SDKs and tools with Ollama as the backend
  • GPU acceleration — NVIDIA GPU support for fast inference (A100, A10G, L4, T4)
  • CPU inference — Runs on CPU-only nodes for smaller models
  • Model management — Pull, list, remove, and customize models easily
  • REST API — Full API for integration with any application
  • Multi-model support — Run different models for different tasks on the same deployment
  • Kubernetes-native — Deploy via Helm chart with PVC storage, Istio integration, and GPU node support

Architecture

Ollama has a straightforward architecture:

Client → Ollama Server (port 11434) → Model Storage (PVC)

In a Kubernetes deployment:

The Ollama server loads models from PVC into memory (RAM or GPU VRAM) on demand, serves inference requests via the REST API, and keeps models loaded based on the KEEP_ALIVE setting.
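
As a minimal sketch of that request flow, the snippet below builds a non-streaming request to Ollama's `/api/generate` endpoint and sets `keep_alive` for that call. The address `localhost:11434`, the model name, and the helper names `build_generate_request` and `generate` are assumptions for illustration; adjust them to your deployment.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: port-forwarded or in-cluster address


def build_generate_request(prompt, model="llama3.1", keep_alive="10m"):
    """Build a POST request for /api/generate without sending it."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,           # return one JSON object instead of a stream
        "keep_alive": keep_alive,  # how long the model stays loaded after this call
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a reachable server)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("What is machine learning?")` against a running server returns the completion text. The first call is usually slower because the model has to be loaded from storage into memory.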

Supported Models

**Note:** Ollama's default model tags typically use 4-bit (Q4) quantization. Full-precision models require significantly more memory.
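
To make the memory difference concrete, here is a rough back-of-the-envelope calculation: weight memory is approximately parameter count times bits per weight, divided by 8. This is a simplification that ignores the KV cache and runtime overhead, and real Q4 formats store slightly more than 4 bits per weight, so treat the result as a lower bound. The function name is hypothetical.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare 4-bit quantization with 16-bit weights for an 8B model.
for bits in (4, 16):
    print(f"8B model at {bits}-bit: ~{weight_memory_gb(8, bits):.0f} GB")
```

Under these assumptions an 8B model needs about 4 GB for weights at 4-bit and about 16 GB at 16-bit.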

Ollama in the Shakudo Platform

When deployed through the Shakudo platform, Ollama is managed as a stack component. The platform handles:

  • Deployment — Helm chart-based deployment with proper resource allocation
  • Networking — Istio VirtualService for external access with SSO authentication
  • Storage — PVC provisioning for model persistence
  • GPU scheduling — Node selectors and tolerations for GPU node pools
  • Upgrades — Chart upgrades with rollback capability
  • Monitoring — Pod health, readiness probes, and log access

Ollama is typically accessible at https://ollama.<your-domain>/ with SSO authentication.

Running Your First Model - Getting Started

Step 1: Check Available Models

# Via CLI (inside the pod)
kubectl exec -n hyperplane-ollama <pod-name> -- ollama list

Step 2: Pull a Model

If no models are listed, pull one:

# Via CLI
kubectl exec -n hyperplane-ollama <pod-name> -- ollama pull llama3.1

Step 3: Run a Simple Prompt

# Via CLI
kubectl exec -n hyperplane-ollama <pod-name> -- ollama run llama3.1 "What is machine learning?"

Essential Commands


OpenAI-Compatible Endpoint

Ollama supports the OpenAI API format at /v1/chat/completions:

from openai import OpenAI

# Point the OpenAI SDK at the local Ollama server instead of api.openai.com
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"  # the SDK requires a value; Ollama ignores it
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(response.choices[0].message.content)

Other Useful Endpoints

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/api/tags` | GET | List all downloaded models |
| `/api/pull` | POST | Download a model |
| `/api/version` | GET | Get Ollama version |
| `/api/show` | POST | Show model details |
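
As an illustration of calling one of these endpoints from code, the sketch below reads `/api/tags` and extracts the model names. The base URL and the helper names `model_names` and `list_models` are assumptions, and the sample body is trimmed to the one field used here.

```python
import json
import urllib.request


def model_names(tags_response):
    """Extract model names from a parsed /api/tags response."""
    return [model["name"] for model in tags_response.get("models", [])]


def list_models(base_url="http://localhost:11434"):
    """Fetch /api/tags from a running server and return the model names."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as response:
        return model_names(json.loads(response.read()))


# Illustrative response body, trimmed to the field used here.
sample = {"models": [{"name": "llama3.1:latest"}, {"name": "mistral:latest"}]}
print(model_names(sample))
```

`list_models()` needs a reachable server; `model_names` can be exercised offline against a saved response.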

Using LangChain

from langchain_community.llms import Ollama

# base_url is the Ollama server root (no /v1 suffix for the native API)
llm = Ollama(base_url="http://localhost:11434", model="llama3.1")
result = llm.invoke("What is the capital of France?")
print(result)

Model Selection Guide

This section is for customers using Ollama as a managed component inside Shakudo. Start from the Shakudo platform instead of installing or exposing Ollama manually.

1. Access the component in Shakudo

  • Sign in to your Shakudo workspace with your organization-approved account.
  • Open the workspace or environment where this component is enabled.
  • Go to the Applications or component catalog area and select Ollama.
  • If you cannot see the component, ask your workspace administrator to confirm that it is enabled for your role and environment.

2. Open the component UI

  • Use the Shakudo-provided Open, Launch, or Access action for Ollama.
  • Let Shakudo handle authentication, networking, and workspace routing. Avoid using internal service URLs unless your administrator explicitly provides them.
  • Confirm that the component opens in the expected workspace before creating or changing resources.

3. Complete a first safe use case

Open the Ollama endpoint or UI exposed through Shakudo and run a small model test, such as a short completion or embedding request, using the model that your workspace administrator has enabled.

  • Use a small non-production example first, especially when testing credentials, scans, model calls, or data connections.
  • Name the test clearly so other workspace users can recognize it as a first-run validation.

4. Monitor and validate the result

  • Check the component UI for run status, logs, traces, scan results, job history, or project activity, depending on the component.
  • Return to Shakudo if you need platform-level status, access control changes, or administrator support.
  • Record any errors, missing permissions, or unexpected results before retrying with production workloads.

5. Next steps

  • Review the use cases, administration, and troubleshooting pages in this knowledge base for deeper examples.
  • For production usage, follow your team’s Shakudo workspace policies for credentials, data access, resource limits, and approvals.

Ollama Administration & Best Practices

Model Management

Pulling Models

Models can be pulled manually or automatically:

Manual pull (recommended for production):

kubectl exec -n hyperplane-ollama <pod-name> -- ollama pull llama3.1

Auto-pull via Helm values:

ollama:
 models:
   pull:
     - llama3.1
     - mistral

Tip: In production, prefer manual pulls. Auto-pull runs on every pod start, which slows startup and can cause failures if the registry is unreachable.

Listing, Inspecting, and Removing Models

# List all downloaded models
ollama list

# Show model details (size, format, parameters)
ollama show llama3.1

# Remove unused models to free PVC space
ollama rm phi3:mini

Custom Models with Modelfile

# Modelfile
FROM llama3.1
PARAMETER temperature 0.3
PARAMETER num_predict 512
SYSTEM You are a helpful technical assistant. Be concise and accurate.

# Build the custom model from the Modelfile, then run it
ollama create my-assistant -f Modelfile
ollama run my-assistant "What is Kubernetes?"
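
The same settings can also be supplied per request instead of being baked into a custom model. The sketch below builds an `/api/chat` request body that carries the system prompt and the two parameters from the Modelfile; the helper name `chat_payload` is hypothetical, and the body would be sent as a POST to the Ollama server.

```python
import json

SYSTEM_PROMPT = "You are a helpful technical assistant. Be concise and accurate."


def chat_payload(user_message, model="llama3.1"):
    """Build an /api/chat body that mirrors the Modelfile settings."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        # Per-request equivalents of the PARAMETER lines in the Modelfile.
        "options": {"temperature": 0.3, "num_predict": 512},
        "stream": False,
    }


print(json.dumps(chat_payload("What is Kubernetes?"), indent=2))
```

A custom model is the better fit when many clients should share one configuration. Per-request options suit experiments, since nothing has to be rebuilt.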

Model Storage and PVC Sizing

| Models to Store | Recommended PVC Size |
|----------------|---------------------|
| 1–2 small models (7B–8B) | 30 GB |
| 3–5 mixed models | 100 GB |
| Large models (34B–70B) | 200 GB |

GPU Configuration

Enabling GPU

ollama:
 gpu:
   type: nvidia

resources:
 limits:
   nvidia.com/gpu: 1

tolerations:
 - key: nvidia.com/gpu
   operator: Exists
   effect: NoSchedule

GPU Comparison

| GPU | VRAM | Good For | Approx. Speed (8B model) |
|-----|------|----------|--------------------------|
| T4 | 16 GB | Dev/test, small models | ~30 tokens/sec |
| L4 | 24 GB | Production, moderate load | ~40 tokens/sec |
| A10G | 24 GB | Production, good throughput | ~50 tokens/sec |
| A100 | 80 GB | Large models, heavy load | ~80+ tokens/sec |
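
To translate these speeds into response times, a rough estimate is output tokens divided by tokens per second. The sketch below uses the approximate figures from the table and ignores prompt processing and model load time, so actual latency will be higher. The function name is hypothetical.

```python
# Approximate decode speeds for an 8B model, taken from the table above.
TOKENS_PER_SEC = {"T4": 30, "L4": 40, "A10G": 50, "A100": 80}


def generation_seconds(gpu, output_tokens):
    """Rough time to generate output_tokens, ignoring prompt processing and load time."""
    return output_tokens / TOKENS_PER_SEC[gpu]


for gpu in TOKENS_PER_SEC:
    print(f"{gpu}: ~{generation_seconds(gpu, 512):.1f} s for 512 tokens")
```

With these numbers, a 512-token answer takes roughly 17 seconds on a T4 and roughly 6 seconds on an A100.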

Monitoring GPU

nvidia-smi  # Inside pod or on node

DRA (Dynamic Resource Allocation): The Helm chart supports DRA, but leave it disabled unless your cluster explicitly supports it.

Networking & Security

Service Exposure

| Method | Use Case |
|--------|----------|
| **ClusterIP + VirtualService** | Default — expose via Istio with SSO |
| **ClusterIP + port-forward** | Development and debugging |
| **NodePort** | Direct access without Istio (not recommended for production) |

Authentication

  • SSO via Keycloak — Handled by the platform's OAuth2 proxy
  • API key — Ollama doesn't require API keys by default; authentication is handled at the gateway level
  • Network policies — Restrict access to the Ollama service from specific namespaces

Istio Sidecar

The Istio sidecar is required for external routing. Verify injection:

kubectl get pods -n hyperplane-ollama -o jsonpath='{.items[*].spec.containers[*].name}'
# Should show "ollama" and "istio-proxy"

If the sidecar is missing, add to values.yaml:

podLabels:
 sidecar.istio.io/inject: "true"

Performance Tuning

Key Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `OLLAMA_NUM_PARALLEL` | 1 | Number of parallel request sequences |
| `OLLAMA_MAX_LOADED_MODELS` | 1 | Max models loaded in memory simultaneously |
| `OLLAMA_KEEP_ALIVE` | 5m | How long to keep models loaded after last request |
| `OLLAMA_MAX_VRAM` | 0 (auto) | Max VRAM to use (0 = all available) |

Tuning Recommendations

  • Increase OLLAMA_NUM_PARALLEL if you need to handle concurrent requests (requires more VRAM)
  • Increase OLLAMA_KEEP_ALIVE (e.g., 24h) to avoid cold-start delays on frequently used models
  • Set OLLAMA_MAX_VRAM if you need to reserve GPU memory for other workloads
  • Use smaller models for latency-sensitive applications

Monitoring & Observability

Health Check

curl http://localhost:11434/api/version
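
The same check can be scripted, for example in a readiness script or a scheduled job. This is a minimal sketch: the function name `ollama_version`, the default URL, and the 2-second timeout are assumptions rather than part of any Ollama tooling.

```python
import json
import urllib.request


def ollama_version(base_url="http://localhost:11434", timeout=2.0):
    """Return the server's version string, or None if the check fails."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as response:
            return json.loads(response.read()).get("version")
    except (OSError, ValueError):
        # OSError covers refused connections, HTTP errors, and timeouts;
        # ValueError covers a body that is not valid JSON.
        return None
```

A caller can treat `None` as unhealthy, for example `print("healthy" if ollama_version() else "unreachable")`.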

Key Metrics to Monitor

  • GPU utilization: nvidia-smi or DCGM metrics
  • Memory usage: pod memory consumption vs limits
  • Request latency: time to first token and total response time
  • Error rate: failed inference requests
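
Some of these numbers can be derived from Ollama itself. A non-streaming response from `/api/generate` or `/api/chat`, or the final chunk of a streaming one, includes timing fields in nanoseconds such as `eval_count`, `eval_duration`, `total_duration`, and `load_duration`. The sketch below turns them into tokens per second and seconds; the helper name is hypothetical and the sample values are made up for illustration, not a measurement.

```python
NS_PER_SEC = 1e9


def throughput_metrics(final_response):
    """Derive tokens/sec and durations in seconds from Ollama's nanosecond fields."""
    return {
        "tokens_per_sec": final_response["eval_count"] * NS_PER_SEC
        / final_response["eval_duration"],
        "total_sec": final_response["total_duration"] / NS_PER_SEC,
        "load_sec": final_response["load_duration"] / NS_PER_SEC,
    }


# Made-up numbers for illustration only.
sample = {
    "eval_count": 200,
    "eval_duration": 5_000_000_000,
    "total_duration": 6_500_000_000,
    "load_duration": 1_000_000_000,
}
print(throughput_metrics(sample))
```

A high `load_sec` points to cold starts, which a longer `OLLAMA_KEEP_ALIVE` can reduce.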

Log Review

kubectl logs -n hyperplane-ollama <pod-name> --tail=100
kubectl logs -n hyperplane-ollama <pod-name> | grep -i error

Upgrades & Maintenance

Upgrade Process

  1. Back up values, model list, and deployment manifest
  2. Dry run: helm upgrade --dry-run --debug
  3. Execute: helm upgrade --wait --timeout 15m
  4. Validate: check pod, version, models, inference

Key Points

  • Recreate strategy = brief downtime: plan accordingly
  • Keep models.clean set to false: never enable cleanup during upgrades
  • Check the VirtualService: it may disappear after upgrades (known issue)
  • Rollback is available: helm rollback ollama -n hyperplane-ollama

Scaling Considerations

  • Vertical scaling — Move to a larger GPU (T4 → A10G → A100) for better performance
  • Horizontal scaling — Deploy multiple Ollama instances behind a load balancer
  • Model sharding — 70B+ models can be split across multiple GPUs
  • Dedicated GPU nodes — Isolate Ollama on its own node pool to avoid resource contention

Ollama Troubleshooting & FAQ

Common Issues

Model Not Loading

Problem: Model fails to load or "model not found" error.

What to check:

  • Run ollama list — is the model listed?
  • Check PVC disk space — df -h inside the pod
  • Verify the model name spelling (e.g., llama3.1 not llama-3.1)
  • Check pod logs for loading errors

Fix:

  • Pull the model again: ollama pull llama3.1
  • Free disk space by removing unused models: ollama rm <unused-model>
  • Use the exact model name from ollama list

Slow Performance / Inference

Problem: Model responses are very slow (seconds per token).

What to check:

  • Is GPU being used? Run nvidia-smi to check
  • Which model size are you running? (70B on CPU will be extremely slow)
  • How many concurrent requests? Check OLLAMA_NUM_PARALLEL
  • Check pod memory usage — may be swapping to disk

Fix:

  • Enable GPU if available (see GPU Configuration in Admin guide)
  • Use a smaller model (switch from 70B to 8B)
  • Reduce OLLAMA_NUM_PARALLEL to 1
  • Increase pod memory limits
  • Set OLLAMA_KEEP_ALIVE to keep models loaded (avoids cold start)

Out of Memory Errors

Problem: Pod is OOMKilled or returns "out of memory" error.

What to check:

  • Pod resource limits vs model size
  • How many models are loaded simultaneously
  • GPU VRAM utilization with nvidia-smi

Fix:

  • Use a smaller model (8B instead of 70B)
  • Increase memory/VRAM limits in values.yaml
  • Set OLLAMA_MAX_LOADED_MODELS: 1 to limit concurrent model loading
  • Set OLLAMA_MAX_VRAM to prevent Ollama from using all GPU memory
  • Reduce OLLAMA_NUM_PARALLEL

API Not Responding / 404 Error

Problem: A curl request to the Ollama API returns connection refused or 404, or the browser shows a blank page.

What to check:

  • Is the pod running? kubectl get pods -n hyperplane-ollama
  • Does the service exist? kubectl get svc -n hyperplane-ollama
  • Does the VirtualService exist? kubectl get virtualservice -n hyperplane-ollama
  • Is the port correct? (Should be 11434, not 8080)

Fix:

  • If pod is not running: check logs with kubectl logs
  • If VirtualService is missing: this is a known issue after upgrades — create it manually:

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
 name: ollama-vs
 namespace: hyperplane-ollama
spec:
 gateways:
 - hyperplane-istio/ingress-gateway
 hosts:
 - ollama.<your-domain>
 http:
 - match:
   - uri:
       prefix: /
   route:
   - destination:
       host: ollama
       port:
         number: 11434

  • If port is wrong: ensure VirtualService routes to port 11434

GPU Not Being Used

Problem: Ollama runs on CPU even though GPU nodes are available.

What to check:

  • Run nvidia-smi on the GPU node — is it functional?

Why is Ollama better on Shakudo?

Core Shakudo Features

Own Your AI

Keep data sovereign, protect IP, and avoid vendor lock-in with infra-agnostic deployments.

Faster Time-to-Value

Pre-built templates and automated DevOps accelerate time-to-value.

Flexible with Experts

Operating system and dedicated support ensure seamless adoption of the latest and greatest tools.