What is Ollama?
Ollama is an open-source tool that lets you run large language models (LLMs) locally on your own infrastructure. Instead of sending data to external API providers, Ollama runs models like Llama, Mistral, and Phi directly on your servers — keeping your data private and your costs predictable.
For enterprises, this matters for three reasons:
- Data privacy: No data leaves your environment. Prompts, responses, and documents never touch a third-party API.
- Offline inference: Models run without internet access once they have been downloaded, which suits air-gapped or compliance-restricted environments.
- No per-token costs: Once deployed, inference incurs no per-token fees. You pay only for the infrastructure it runs on, with no surprise API bills and no provider-imposed rate limits.
Ollama handles model downloading, loading, and serving through a simple REST API. It supports both CPU and GPU inference and exposes an OpenAI-compatible API, so apps built for OpenAI can usually switch to Ollama by changing the base URL and the model name.
Key Features
- One-command model running: Pull and run models with simple commands (`ollama run llama3.1`)
- OpenAI-compatible API: Use existing OpenAI SDKs and tools with Ollama as the backend
- GPU acceleration: NVIDIA GPU support for fast inference (A100, A10G, L4, T4)
- CPU inference: Runs on CPU-only nodes for smaller models
- Model management: Pull, list, remove, and customize models easily
- REST API: Full API for integration with any application
- Multi-model support: Run different models for different tasks on the same deployment
- Kubernetes-native: Deploy via Helm chart with PVC storage, Istio integration, and GPU node support
Architecture
Ollama has a straightforward architecture:
Client → Ollama Server (port 11434) → Model Storage (PVC)
In a Kubernetes deployment, the Ollama server loads models from the PVC into memory (RAM or GPU VRAM) on demand, serves inference requests via the REST API, and keeps each model loaded for the duration set by OLLAMA_KEEP_ALIVE.
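The load-on-demand behaviour can also be controlled per request. The sketch below builds, but does not send, two requests to `/api/generate`: one that preloads a model and keeps it in memory, and one that unloads it. It assumes a server reachable at `http://localhost:11434` (for example through a port-forward) and relies on the API's `keep_alive` field; the helper name `build_load_request` is hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: local or port-forwarded server


def build_load_request(model: str, keep_alive) -> urllib.request.Request:
    """Build a POST to /api/generate that carries no prompt.

    Without a prompt the server only loads the model, or unloads it when
    keep_alive is 0.
    """
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


preload = build_load_request("llama3.1", "24h")  # keep the model in memory for 24 hours
unload = build_load_request("llama3.1", 0)       # release the memory right away

# To send one of them against a reachable server:
#   urllib.request.urlopen(preload, timeout=300).read()
```

A per-request `keep_alive` only affects the model named in that request, so it can complement the server-wide setting rather than replace it.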
Supported Models
**Note:** Q4-quantized model variants are the default. Full-precision models require significantly more memory.
Ollama in the Shakudo Platform
When deployed through the Shakudo platform, Ollama is managed as a stack component. The platform handles:
- Deployment — Helm chart-based deployment with proper resource allocation
- Networking — Istio VirtualService for external access with SSO authentication
- Storage — PVC provisioning for model persistence
- GPU scheduling — Node selectors and tolerations for GPU node pools
- Upgrades — Chart upgrades with rollback capability
- Monitoring — Pod health, readiness probes, and log access
Ollama is typically accessible at https://ollama.<your-domain>/ with SSO authentication.
Running Your First Model - Getting Started
Step 1: Check Available Models
# Via CLI (inside the pod)
kubectl exec -n hyperplane-ollama <pod-name> -- ollama list
Step 2: Pull a Model
If no models are listed, pull one:
# Via CLI
kubectl exec -n hyperplane-ollama <pod-name> -- ollama pull llama3.1
Step 3: Run a Simple Prompt
# Via CLI
kubectl exec -n hyperplane-ollama <pod-name> -- ollama run llama3.1 "What is machine learning?"
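The same prompt can be sent over the REST API instead of the CLI. This is a minimal sketch using only the Python standard library; it assumes the server is reachable from where the script runs (for example at `http://localhost:11434` through a port-forward), and the function names are illustrative.

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(raw_body: bytes) -> str:
    """Return the generated text from a non-streaming /api/generate reply."""
    return json.loads(raw_body)["response"]


def ask(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the model's answer as a string."""
    req = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_text(resp.read())
```

With a reachable server, `print(ask("http://localhost:11434", "llama3.1", "What is machine learning?"))` should print the answer. The first call may take longer because the model has to be loaded into memory.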
Essential Commands
OpenAI-Compatible Endpoint
Ollama supports the OpenAI API format at /v1/chat/completions:
from openai import OpenAI

# Point the OpenAI SDK at Ollama's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed",  # the SDK requires a value; Ollama ignores it
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
Other Useful Endpoints
| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/api/tags` | GET | List all downloaded models |
| `/api/pull` | POST | Download a model |
| `/api/version` | GET | Get Ollama version |
| `/api/show` | POST | Show model details |
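As an illustration of the endpoints in the table, the sketch below reads `/api/tags` and prints each downloaded model with its size. It assumes the reply contains a `models` array whose entries carry `name` and `size` (in bytes); the sample reply is made up so the parsing can be tried without a server.

```python
import json
import urllib.request


def parse_tags(raw_body: bytes) -> list[tuple[str, float]]:
    """Turn an /api/tags reply into (name, size in GB) pairs, largest first."""
    models = json.loads(raw_body).get("models", [])
    pairs = [(m["name"], m.get("size", 0) / 1e9) for m in models]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def list_models(base_url: str, timeout: float = 10.0) -> list[tuple[str, float]]:
    """Fetch and parse the model list from a reachable server."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/api/tags", timeout=timeout) as resp:
        return parse_tags(resp.read())


# Offline demonstration with a made-up reply in the /api/tags shape.
sample = (
    b'{"models": [{"name": "phi3:mini", "size": 2200000000},'
    b' {"name": "llama3.1:latest", "size": 4900000000}]}'
)
for name, size_gb in parse_tags(sample):
    print(f"{name:20s} {size_gb:5.1f} GB")
```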
Using LangChain
from langchain_community.llms import Ollama

# base_url is the Ollama server root (no /v1 suffix for the native API).
llm = Ollama(base_url="http://localhost:11434", model="llama3.1")
result = llm.invoke("What is the capital of France?")
print(result)
Model Selection Guide
This section is for customers using Ollama as a managed component inside Shakudo. Start from the Shakudo platform instead of installing or exposing Ollama manually.
1. Access the component in Shakudo
- Sign in to your Shakudo workspace with your organization-approved account.
- Open the workspace or environment where this component is enabled.
- Go to the Applications or component catalog area and select Ollama.
- If you cannot see the component, ask your workspace administrator to confirm that it is enabled for your role and environment.
2. Open the component UI
- Use the Shakudo-provided Open, Launch, or Access action for Ollama.
- Let Shakudo handle authentication, networking, and workspace routing. Avoid using internal service URLs unless your administrator explicitly provides them.
- Confirm that the component opens in the expected workspace before creating or changing resources.
3. Complete a first safe use case
Open the Ollama endpoint or UI exposed through Shakudo and run a small model test, such as a short completion or embedding request, using the model that your workspace administrator has enabled.
- Use a small non-production example first, especially when testing credentials, scans, model calls, or data connections.
- Name the test clearly so other workspace users can recognize it as a first-run validation.
4. Monitor and validate the result
- Check the component UI for run status, logs, traces, scan results, job history, or project activity, depending on the component.
- Return to Shakudo if you need platform-level status, access control changes, or administrator support.
- Record any errors, missing permissions, or unexpected results before retrying with production workloads.
5. Next steps
- Review the use cases, administration, and troubleshooting pages in this knowledge base for deeper examples.
- For production usage, follow your team’s Shakudo workspace policies for credentials, data access, resource limits, and approvals.
Ollama Administration & Best Practices
Model Management
Pulling Models
Models can be pulled manually or automatically:
Manual pull (recommended for production):
kubectl exec -n hyperplane-ollama <pod-name> -- ollama pull llama3.1
Auto-pull via Helm values:
ollama:
  models:
    pull:
      - llama3.1
      - mistral
Tip: In production, prefer manual pulls. Auto-pull runs on every pod start, which slows startup and can cause failures if the registry is unreachable.
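A manual pull can also be scripted against the `/api/pull` endpoint, which is convenient when a pipeline prepares models before traffic arrives. The sketch below is an assumption-laden example rather than the platform's own tooling: it assumes the request body uses the `model` field (older Ollama releases used `name`), that the server streams one JSON object per line, and that progress lines carry `total` and `completed` byte counts.

```python
import json
import urllib.request


def describe_progress(line: bytes) -> str:
    """Summarize one JSON line from the /api/pull stream."""
    event = json.loads(line)
    status = event.get("status", "")
    total, done = event.get("total"), event.get("completed")
    if total and done is not None:
        return f"{status}: {100 * done / total:.0f}%"
    return status


def pull_model(base_url: str, model: str) -> bool:
    """Pull a model and report whether the stream ended with 'success'."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/pull",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    last = ""
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # the server sends one JSON object per line
            if line.strip():
                last = describe_progress(line)
                print(last)
    return last == "success"
```

Calling `pull_model("http://localhost:11434", "llama3.1")` against a reachable server prints progress as layers download and returns `True` only if the final status is `success`.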
Listing, Inspecting, and Removing Models
# List all downloaded models
ollama list
# Show model details (size, format, parameters)
ollama show llama3.1
# Remove unused models to free PVC space
ollama rm phi3:mini
Custom Models with Modelfile
Save the following as a file named Modelfile:
FROM llama3.1
PARAMETER temperature 0.3
PARAMETER num_predict 512
SYSTEM You are a helpful technical assistant. Be concise and accurate.
Then build and run the custom model:
ollama create my-assistant -f Modelfile
ollama run my-assistant "What is Kubernetes?"
Model Storage and PVC Sizing
| Models to Store | Recommended PVC Size |
|----------------|---------------------|
| 1–2 small models (7B–8B) | 30 GB |
| 3–5 mixed models | 100 GB |
| Large models (34B–70B) | 200 GB |
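For a rough sanity check on sizing, on-disk size can be approximated from parameter count and quantization. The sketch below is a back-of-the-envelope estimate, not the method behind the table: the 4.5 bits per weight figure and the 2x headroom factor are assumptions chosen for illustration, and the function names are hypothetical. The table's recommendations are more generous and should take precedence.

```python
import math

BITS_PER_WEIGHT_Q4 = 4.5  # assumption: 4-bit weights plus quantization metadata


def model_size_gb(params_billions: float, bits_per_weight: float = BITS_PER_WEIGHT_Q4) -> float:
    """Rough on-disk size: parameters x bits per weight, converted to GB."""
    return params_billions * bits_per_weight / 8


def suggested_pvc_gb(params_billions: list[float], headroom: float = 2.0) -> int:
    """Sum the estimated model sizes and multiply by a headroom factor."""
    total = sum(model_size_gb(p) for p in params_billions)
    return math.ceil(total * headroom)


print(suggested_pvc_gb([8, 7]))   # two small models
print(suggested_pvc_gb([8, 70]))  # one small and one large model
```

The headroom leaves space for partially downloaded layers and for adding models later; actual file sizes are reported by `ollama list`.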
GPU Configuration
Enabling GPU
ollama:
  gpu:
    type: nvidia
resources:
  limits:
    nvidia.com/gpu: 1
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
GPU Comparison
| GPU | VRAM | Good For | Approx. Speed (8B model) |
|-----|------|----------|--------------------------|
| T4 | 16 GB | Dev/test, small models | ~30 tokens/sec |
| L4 | 24 GB | Production, moderate load | ~40 tokens/sec |
| A10G | 24 GB | Production, good throughput | ~50 tokens/sec |
| A100 | 80 GB | Large models, heavy load | ~80+ tokens/sec |
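To compare your own deployment against the approximate speeds in the table, generation speed can be computed from the timing fields Ollama returns with a completed response. The sketch assumes the final `/api/generate` or `/api/chat` reply includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds); the sample values are made up.

```python
import json


def tokens_per_second(final_reply: dict) -> float:
    """Generation speed from the timing fields of a completed response.

    eval_count is the number of generated tokens and eval_duration is the
    generation time in nanoseconds.
    """
    seconds = final_reply["eval_duration"] / 1e9
    return final_reply["eval_count"] / seconds


# Made-up numbers in the shape of a finished /api/generate reply.
sample = json.loads('{"done": true, "eval_count": 120, "eval_duration": 4000000000}')
print(f"{tokens_per_second(sample):.1f} tokens/sec")
```

This figure excludes model load time and prompt processing, so it is best compared across runs with the model already loaded.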
Monitoring GPU
nvidia-smi # Inside pod or on node
DRA (Dynamic Resource Allocation): The Helm chart supports DRA, but leave it disabled unless your cluster explicitly supports it.
Networking & Security
Service Exposure
| Method | Use Case |
|--------|----------|
| **ClusterIP + VirtualService** | Default — expose via Istio with SSO |
| **ClusterIP + port-forward** | Development and debugging |
| **NodePort** | Direct access without Istio (not recommended for production) |
Authentication
- SSO via Keycloak — Handled by the platform's OAuth2 proxy
- API key — Ollama doesn't require API keys by default; authentication is handled at the gateway level
- Network policies — Restrict access to the Ollama service from specific namespaces
Istio Sidecar
The Istio sidecar is required for external routing. Verify injection:
kubectl get pods -n hyperplane-ollama -o jsonpath='{.items[*].spec.containers[*].name}'
# Should show "ollama" and "istio-proxy"
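The same check can be automated, for instance in a post-deploy script. The sketch below parses the JSON output of `kubectl get pods -o json` and lists pods without an `istio-proxy` container. It also looks at init containers, on the assumption that some Istio configurations inject the proxy there; the function names are hypothetical.

```python
import json
import subprocess


def pods_missing_sidecar(pod_list: dict, sidecar: str = "istio-proxy") -> list[str]:
    """Return names of pods whose spec has no container called `sidecar`."""
    missing = []
    for pod in pod_list.get("items", []):
        spec = pod["spec"]
        containers = spec.get("containers", []) + spec.get("initContainers", [])
        names = {c["name"] for c in containers}
        if sidecar not in names:
            missing.append(pod["metadata"]["name"])
    return missing


def check_namespace(namespace: str = "hyperplane-ollama") -> list[str]:
    """Run kubectl and report pods in the namespace that lack the sidecar."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return pods_missing_sidecar(json.loads(out))
```

An empty list from `check_namespace()` means every pod in the namespace carries the sidecar.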
If the sidecar is missing, add to values.yaml:
podLabels:
  sidecar.istio.io/inject: "true"
Performance Tuning
Key Environment Variables
| Variable | Default | Purpose |
|----------|---------|---------|
| `OLLAMA_NUM_PARALLEL` | 1 | Number of parallel request sequences |
| `OLLAMA_MAX_LOADED_MODELS` | 1 | Max models loaded in memory simultaneously |
| `OLLAMA_KEEP_ALIVE` | 5m | How long to keep models loaded after last request |
| `OLLAMA_MAX_VRAM` | 0 (auto) | Max VRAM to use (0 = all available) |
Tuning Recommendations
- Increase `OLLAMA_NUM_PARALLEL` if you need to handle concurrent requests (requires more VRAM)
- Increase `OLLAMA_KEEP_ALIVE` (e.g., `24h`) to avoid cold-start delays on frequently used models
- Set `OLLAMA_MAX_VRAM` if you need to reserve GPU memory for other workloads
- Use smaller models for latency-sensitive applications
Monitoring & Observability
Health Check
curl http://localhost:11434/api/version
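For scripted checks, the same endpoint can be polled from Python. This is a minimal sketch that returns the version string, or `None` when the server is unreachable or replies with something unexpected; it assumes the reply is a JSON object with a `version` field, and the function names are illustrative.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def parse_version(raw_body: bytes) -> str:
    """Extract the version string from an /api/version reply."""
    return json.loads(raw_body)["version"]


def ollama_version(base_url: str, timeout: float = 5.0) -> Optional[str]:
    """Return the server version, or None if the health check fails."""
    url = base_url.rstrip("/") + "/api/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError, ValueError, KeyError):
        # Connection errors, timeouts, HTTP errors, and malformed replies
        # are all treated as "not healthy".
        return None
```

A monitoring job could call `ollama_version("http://localhost:11434")` and alert when it returns `None`.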
Key Metrics to Monitor
- GPU utilization: `nvidia-smi` or DCGM metrics
- Memory usage: Pod memory consumption vs limits
- Request latency: Time to first token and total response time
- Error rate: Failed inference requests
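Request latency can be measured from the client side by timing a streamed response. The sketch below consumes the JSON lines of a streaming `/api/generate` reply and records the time to the first non-empty chunk and the total time. It assumes each line carries `response` and `done` fields; the `clock` parameter is an illustrative hook so the logic can be exercised without a server.

```python
import json
import time
from typing import Callable, Iterable


def measure_stream(lines: Iterable[bytes], clock: Callable[[], float] = time.perf_counter) -> dict:
    """Measure time to first token and total time over a streamed reply.

    Start iterating right after sending the request, otherwise the timings
    will not include the server's processing time.
    """
    start = clock()
    first_token_at = None
    chunks = 0
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("response"):
            chunks += 1
            if first_token_at is None:
                first_token_at = clock()
        if event.get("done"):
            break
    end = clock()
    return {
        "time_to_first_token_s": None if first_token_at is None else first_token_at - start,
        "total_s": end - start,
        "chunks": chunks,
    }
```

In practice `lines` would be the response object returned by `urllib.request.urlopen` for a streaming request, which yields one line at a time. A high time to first token with normal total throughput usually points at a cold start, that is, the model being loaded into memory.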
Log Review
kubectl logs -n hyperplane-ollama <pod-name> --tail=100
kubectl logs -n hyperplane-ollama <pod-name> | grep -i error
Upgrades & Maintenance
Upgrade Process
- Back up the values, model list, and deployment manifest
- Dry run: `helm upgrade --dry-run --debug`
- Execute: `helm upgrade --wait --timeout 15m`
- Validate: check pod, version, models, inference
Key Points
- Recreate strategy means brief downtime: plan accordingly
- Keep `models.clean: false`: never enable cleanup during upgrades
- Check the VirtualService: it may disappear after upgrades (known issue)
- Rollback is available: `helm rollback ollama -n hyperplane-ollama`
Scaling Considerations
- Vertical scaling — Move to a larger GPU (T4 → A10G → A100) for better performance
- Horizontal scaling — Deploy multiple Ollama instances behind a load balancer
- Model sharding — 70B+ models can be split across multiple GPUs
- Dedicated GPU nodes — Isolate Ollama on its own node pool to avoid resource contention
Ollama Troubleshooting & FAQ
Common Issues
Model Not Loading
Problem: Model fails to load or "model not found" error.
What to check:
- Run `ollama list`: is the model listed?
- Check PVC disk space: `df -h` inside the pod
- Verify the model name spelling (e.g., `llama3.1` not `llama-3.1`)
- Check pod logs for loading errors
Fix:
- Pull the model again: `ollama pull llama3.1`
- Free disk space by removing unused models: `ollama rm <unused-model>`
- Use the exact model name from `ollama list`
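Name mismatches are easy to catch programmatically. The sketch below compares a requested model name with the installed names (as reported by `ollama list` or `/api/tags`) and suggests close matches. It assumes a name without a tag refers to the `:latest` tag; the helper names are hypothetical.

```python
import difflib
from typing import Optional


def normalize(name: str) -> str:
    """Add the implicit ':latest' tag when a model name has none."""
    last_segment = name.rsplit("/", 1)[-1]
    return name if ":" in last_segment else name + ":latest"


def find_model(requested: str, installed: list[str]) -> Optional[str]:
    """Return the installed name matching the request, or None."""
    wanted = normalize(requested)
    for name in installed:
        if normalize(name) == wanted:
            return name
    return None


def suggest(requested: str, installed: list[str]) -> list[str]:
    """Return installed names that look similar to a misspelled request."""
    candidates = [normalize(n) for n in installed]
    return difflib.get_close_matches(normalize(requested), candidates, n=3, cutoff=0.6)


installed = ["llama3.1:latest", "phi3:mini"]
print(find_model("llama3.1", installed))   # exact match once the tag is added
print(suggest("llama-3.1", installed))     # misspelled name, closest candidates
```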
Slow Performance / Inference
Problem: Model responses are very slow (seconds per token).
What to check:
- Is the GPU being used? Run `nvidia-smi` to check
- Which model size are you running? (70B on CPU will be extremely slow)
- How many concurrent requests? Check `OLLAMA_NUM_PARALLEL`
- Check pod memory usage: the pod may be swapping to disk
Fix:
- Enable GPU if available (see GPU Configuration in Admin guide)
- Use a smaller model (switch from 70B to 8B)
- Reduce `OLLAMA_NUM_PARALLEL` to 1
- Increase pod memory limits
- Set `OLLAMA_KEEP_ALIVE` to keep models loaded (avoids cold start)
Out of Memory Errors
Problem: Pod is OOMKilled or returns "out of memory" error.
What to check:
- Pod resource limits vs model size
- How many models are loaded simultaneously
- GPU VRAM utilization with `nvidia-smi`
Fix:
- Use a smaller model (8B instead of 70B)
- Increase memory/VRAM limits in values.yaml
- Set `OLLAMA_MAX_LOADED_MODELS: 1` to limit concurrent model loading
- Set `OLLAMA_MAX_VRAM` to prevent Ollama from using all GPU memory
- Reduce `OLLAMA_NUM_PARALLEL`
API Not Responding / 404 Error
Problem: Requests to the Ollama API return connection refused, a 404, or a blank page.
What to check:
- Is the pod running? `kubectl get pods -n hyperplane-ollama`
- Does the service exist? `kubectl get svc -n hyperplane-ollama`
- Does the VirtualService exist? `kubectl get virtualservice -n hyperplane-ollama`
- Is the port correct? (Should be 11434, not 8080)
Fix:
- If the pod is not running: check logs with `kubectl logs`
- If the VirtualService is missing: this is a known issue after upgrades. Create it manually:
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: ollama-vs
  namespace: hyperplane-ollama
spec:
  gateways:
    - hyperplane-istio/ingress-gateway
  hosts:
    - ollama.<your-domain>
  http:
    - match:
        - uri:
            prefix: /
      route:
        - destination:
            host: ollama
            port:
              number: 11434
- If port is wrong: ensure VirtualService routes to port 11434
GPU Not Being Used
Problem: Ollama runs on CPU even though GPU nodes are available.
What to check:
- Run `nvidia-smi` on the GPU node: is it functional?

