Langfuse Overview
Langfuse is an open-source observability and analytics platform purpose-built for LLM applications. It gives your team a single place to see every prompt, response, trace, latency number, token count, and cost — across all your AI-powered tools and services, from production runs to debugging sessions.
In Shakudo environments, Langfuse sits in the observability layer, downstream of LiteLLM and alongside applications like Dify and AgentFlow. Every model call that flows through LiteLLM can be logged to Langfuse automatically, giving you a live trace of exactly what your AI stack is doing.
What Problem Does Langfuse Solve?
LLM applications are hard to debug and expensive to operate without visibility. When something goes wrong — a bad response, a spike in latency, unexpected token costs — you need to know which prompt caused it, which model handled it, and how long it took. Without Langfuse, that information lives in scattered logs or does not exist at all.
- Captures full traces: prompt, model, response, latency, token count, and cost in one view
- Lets you compare prompt versions and evaluate quality over time
- Shows which models and agents are the most expensive or the slowest
- Enables your team to replay and debug individual LLM calls
How Langfuse Fits in the Shakudo Stack
Langfuse is the observability layer for all AI activity in the environment:
- LiteLLM logs every model call to Langfuse via a success/failure callback — no app code changes needed
- Dify can send traces to Langfuse directly via its built-in Langfuse integration
- Custom Python or JavaScript apps use the Langfuse SDK to instrument their own LLM calls
- LangChain and LlamaIndex apps work with Langfuse via native callback handlers
- Langfuse stores trace data in PostgreSQL and exports artifacts to MinIO
Key Concepts
- Trace: a complete record of one logical operation — e.g. a user query from start to finish, including all LLM calls, tool uses, and sub-steps inside it.
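- Span: a unit of work within a trace that has a duration (one function, one retrieval, one tool call). Spans can nest.
- Observation: the general term for any step recorded inside a trace. Spans, generations (LLM calls, which carry model, token count, and cost), and events are all observation types; each records its input, output, and timing.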
- Session: a group of traces belonging to the same user conversation or workflow run.
- Score: a human or automated evaluation attached to a trace or span (e.g. pass/fail, 1-5 rating).
- Prompt Management: versioned prompts stored in Langfuse and linked to the traces that use them.
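To make the relationships between these concepts concrete, here is a small illustrative model in plain Python. It is a deliberate simplification and not the Langfuse schema: the class names, fields, and the `total_cost` helper are assumptions chosen for this sketch, and `Span` stands in for any step recorded inside a trace. The cost figures are arbitrary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Score:
    name: str
    value: float  # e.g. 1.0 for "pass", or a 1-5 rating


@dataclass
class Span:
    name: str
    latency_ms: float = 0.0
    total_tokens: int = 0
    cost_usd: float = 0.0
    children: List["Span"] = field(default_factory=list)  # spans can nest

    def total_cost(self) -> float:
        """Cost of this step plus everything nested inside it."""
        return self.cost_usd + sum(child.total_cost() for child in self.children)


@dataclass
class Trace:
    name: str
    session_id: Optional[str] = None  # groups traces into one conversation
    spans: List[Span] = field(default_factory=list)
    scores: List[Score] = field(default_factory=list)

    def total_cost(self) -> float:
        return sum(span.total_cost() for span in self.spans)


# One user query: a retrieval step, then an agent step wrapping an LLM call.
llm_call = Span("llm-call", latency_ms=840.0, total_tokens=512, cost_usd=0.5)
agent = Span("agent-step", children=[llm_call])
retrieval = Span("retrieval", latency_ms=35.0, cost_usd=0.25)
trace = Trace("user-query", session_id="session-abc-123", spans=[retrieval, agent])
trace.scores.append(Score("helpfulness", 4))
```

The nesting is what lets a trace roll up cost and latency from its steps, which is the view the Langfuse UI presents per trace.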
What Langfuse Is Not
- Not an LLM gateway. It does not route model calls — use LiteLLM for that.
- Not a log aggregation platform. It is purpose-built for LLM traces, not general application logs.
- Not a feature store or model registry. It tracks prompt/response quality, not model weights.
Administration & Best Practices
This page covers how to keep Langfuse stable, organised, and cost-efficient in a production Shakudo environment.
Project and API Key Organisation
Separate observability data by creating one Langfuse project per environment:
- production: all live traffic — strict access
- staging: pre-production validation
- dev: developer experiments and testing
Create one API key pair per application or team so usage is attributable and keys can be rotated independently. Revoke old keys via Settings > API Keys.
Tagging Traces for Observability
Always include user_id, session_id, and metadata on traces to enable filtering, cost attribution, and debugging:
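from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST from the environment
client = Langfuse()

# client.trace() is the v2 Python SDK call; check the SDK version you have installed
trace = client.trace(
    name="workflow-name",
    user_id="user@example.com",  # placeholder address
    session_id="session-abc-123",
    metadata={"team": "risk", "env": "prod", "version": "v2.1"},
)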
Without these fields, you cannot filter traces by the user, session, or team that triggered them.
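One way to enforce the tagging rule is to build the trace arguments through a small helper that refuses to proceed when an attribution field is empty. This is a sketch only: `build_trace_kwargs` is a hypothetical helper, not part of the Langfuse SDK, and the choice to treat `user_id` and `session_id` as mandatory is an assumption taken from the guidance above.

```python
def build_trace_kwargs(name, user_id, session_id, **metadata):
    """Return keyword arguments for a trace, refusing to omit attribution fields."""
    missing = [
        field_name
        for field_name, value in (("user_id", user_id), ("session_id", session_id))
        if not value
    ]
    if missing:
        raise ValueError(f"trace {name!r} is missing: {', '.join(missing)}")
    return {
        "name": name,
        "user_id": user_id,
        "session_id": session_id,
        "metadata": dict(metadata),  # free-form tags such as team, env, version
    }


kwargs = build_trace_kwargs(
    "workflow-name",
    user_id="user-123",
    session_id="session-abc-123",
    team="risk",
    env="prod",
)
# trace = client.trace(**kwargs)  # with the client from the example above
```

Failing fast at the call site is cheaper than discovering unattributable traces later, when cost needs to be split by team.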
Data Retention and Storage Management
Langfuse stores trace data in PostgreSQL and media files in MinIO. Both grow over time:
- Set a retention policy on the PostgreSQL langfuse database to delete old traces
- Configure TTL on the MinIO bucket to auto-expire old files
- Langfuse v3 supports configurable data retention — check Settings > Data Retention in the UI
# Monitor PostgreSQL DB size
kubectl exec -it langfuse-postgresql-0 -n hyperplane-langfuse -- \
  psql -U langfuse -c "SELECT pg_size_pretty(pg_database_size('langfuse'));"
# Monitor MinIO bucket size
mc du shakudo-minio/langfuse-<env>
Keep-Alive Timeout (GCP/GKE Production Fix)
On GCP with Cloud Load Balancer, the default Node.js keep-alive timeout (5s) is shorter than the load balancer idle timeout (600s). The load balancer can therefore reuse a connection that Node.js has already closed, which surfaces as intermittent 502 errors.
Fix (already included in the Deployment Runbook):
LANGFUSE_HTTP_KEEPALIVE_TIMEOUT_MS: "620000"
LANGFUSE_HTTP_HEADERS_TIMEOUT_MS: "621000"
Always verify these are set after upgrades — they can be reset if values.yaml is regenerated from defaults.
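The post-upgrade verification can be scripted. The sketch below checks that both variables are present, that the keep-alive timeout exceeds the load balancer idle timeout, and that the headers timeout exceeds the keep-alive timeout (the ordering the values above follow). `check_keepalive_settings` is a hypothetical helper, and passing the environment in as a plain dict is an assumption of this example; how you extract the values from the running deployment is up to you.

```python
KEEPALIVE_KEY = "LANGFUSE_HTTP_KEEPALIVE_TIMEOUT_MS"
HEADERS_KEY = "LANGFUSE_HTTP_HEADERS_TIMEOUT_MS"


def check_keepalive_settings(env, lb_idle_timeout_s=600):
    """Return a list of problems; an empty list means the settings look safe."""
    problems = []
    values = {}
    for key in (KEEPALIVE_KEY, HEADERS_KEY):
        raw = env.get(key)
        if raw is None:
            problems.append(f"{key} is not set")
            continue
        try:
            values[key] = int(raw)
        except (TypeError, ValueError):
            problems.append(f"{key} is not an integer: {raw!r}")

    keepalive = values.get(KEEPALIVE_KEY)
    headers = values.get(HEADERS_KEY)
    if keepalive is not None and keepalive <= lb_idle_timeout_s * 1000:
        problems.append("keep-alive timeout must exceed the load balancer idle timeout")
    if keepalive is not None and headers is not None and headers <= keepalive:
        problems.append("headers timeout must exceed the keep-alive timeout")
    return problems


good = check_keepalive_settings(
    {KEEPALIVE_KEY: "620000", HEADERS_KEY: "621000"}
)
reset_to_defaults = check_keepalive_settings({})
```

An empty list for `good` and two "not set" entries for `reset_to_defaults` is the distinction you want to catch after a values.yaml regeneration.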
Upgrades
Update image.tag in values.yaml to the new Langfuse version and redeploy:
helm upgrade langfuse . \
  --namespace hyperplane-langfuse \
  --values values.yaml \
  --timeout 10m \
  --wait
Langfuse v3 runs database migrations automatically on startup. Always back up the PostgreSQL database before upgrading.
Security Basics
Secrets
- Store NEXTAUTH_SECRET, SALT, database password, and MinIO credentials in a Kubernetes secret
- Never put secrets in values.yaml or ConfigMaps in plain text
Access control
- Langfuse v3 has built-in RBAC at the organisation and project level
- Use project-level roles (Owner, Admin, Member, Viewer) to limit who can see traces
- For SSO/OIDC integration, configure AUTH_CUSTOM_CLIENT_ID and related env vars
Network
- Expose Langfuse only on cluster-internal DNS unless external access is explicitly needed
- If external access is required, use an Istio VirtualService or ingress with authentication
Backup Strategy
- PostgreSQL: schedule regular pg_dump and upload to MinIO or off-cluster storage
- MinIO: include the langfuse-<env> bucket in the cluster backup policy
- API keys: if the database is lost, all API keys are lost — keep a secure record of public keys
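As a rough sketch of the PostgreSQL part of this strategy, the helper below assembles a timestamped `pg_dump` command and the matching `mc cp` upload. It only builds the argument lists; running them (for example with `subprocess.run(cmd, check=True)` from a scheduled job) is left to the caller. The `langfuse` user and database names and the `shakudo-minio` alias are taken from the commands earlier on this page; `backup_commands`, the file naming scheme, and the `backups/` prefix are assumptions for illustration.

```python
from datetime import datetime, timezone


def backup_commands(env_name, now=None):
    """Build the pg_dump and upload commands for one timestamped backup."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    dump_file = f"langfuse-{env_name}-{stamp}.dump"

    dump_cmd = [
        "pg_dump",
        "--username=langfuse",
        "--dbname=langfuse",
        "--format=custom",  # compressed archive, restorable with pg_restore
        f"--file={dump_file}",
    ]
    upload_cmd = [
        "mc",
        "cp",
        dump_file,
        f"shakudo-minio/langfuse-{env_name}/backups/{dump_file}",  # assumed prefix
    ]
    return dump_cmd, upload_cmd


dump_cmd, upload_cmd = backup_commands(
    "prod", now=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
)
```

Uploading to the same MinIO instance only protects against database loss, not cluster loss, so an off-cluster copy is still worth scheduling.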
Troubleshooting & FAQ
Use this page during live debugging. Format: Problem -> What to check -> Fix.
Deployment Issues
Pod stuck in CrashLoopBackOff
- Check: kubectl logs deployment/langfuse -n hyperplane-langfuse
- Common causes: DATABASE_URL incorrect, missing NEXTAUTH_SECRET or SALT, PostgreSQL not ready
- Fix: confirm all required env vars are in the Kubernetes secret and referenced in envFrom. Make sure the PostgreSQL pod is Running before the main pod starts.
UI loads but shows database error
- Check: whether Langfuse connects to PostgreSQL but then fails during a migration or query
- Fix: verify DATABASE_URL and DIRECT_URL both point to the correct PostgreSQL host and database. Check pod logs for Prisma migration errors.
502 or timeout errors on requests
- Check: on GCP/GKE, whether the Node.js keep-alive timeout is shorter than the load balancer idle timeout (600s)
- Fix: set LANGFUSE_HTTP_KEEPALIVE_TIMEOUT_MS=620000 and LANGFUSE_HTTP_HEADERS_TIMEOUT_MS=621000 in values.yaml and redeploy
Trace Ingestion Issues
Traces not appearing in the UI
- Check: POST to /api/public/ingestion returns errors — look in the response body for specific failures
- Check: verify the Authorization header uses the correct public/secret key pair for the project
- Fix: re-run the Step 8 validation curl command. Confirm the keys match the project in the UI.
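When checking the Authorization header by hand, it helps to see how it is built. The Langfuse public API uses HTTP Basic auth with the public key as the username and the secret key as the password. The sketch below constructs the header and a minimal ingestion request without sending it. The event shape (`batch` of events with `id`, `type`, `timestamp`, `body`) reflects my understanding of the ingestion API and should be checked against the API reference; the helper names, the host, and the key values are placeholders.

```python
import base64
import json
import urllib.request
import uuid
from datetime import datetime, timezone


def basic_auth_header(public_key, secret_key):
    """Basic auth: public key is the username, secret key is the password."""
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {"Authorization": f"Basic {token}"}


def build_ingestion_request(base_url, public_key, secret_key, trace_name):
    """Build (but do not send) a POST that creates a single test trace."""
    now = datetime.now(timezone.utc).isoformat()
    payload = {
        "batch": [
            {
                "id": str(uuid.uuid4()),  # event id
                "type": "trace-create",
                "timestamp": now,
                "body": {"id": str(uuid.uuid4()), "name": trace_name, "timestamp": now},
            }
        ]
    }
    headers = {"Content-Type": "application/json"}
    headers.update(basic_auth_header(public_key, secret_key))
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/public/ingestion",
        data=json.dumps(payload).encode(),
        headers=headers,
        method="POST",
    )


# Placeholder host and keys; substitute the values for your project.
request = build_ingestion_request(
    "http://langfuse.example.internal:3000", "pk-lf-placeholder", "sk-lf-placeholder", "smoke-test"
)
# Send with: urllib.request.urlopen(request), then inspect the response body.
```

If the decoded header does not contain the key pair shown in the project's settings, the request authenticates against a different project or fails outright, which matches the "traces not appearing" symptom.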
LiteLLM traces not appearing in Langfuse
- Check: LiteLLM litellmConfig includes success_callback and failure_callback with "langfuse"
- Check: langfuse_host, langfuse_public_key, and langfuse_secret_key are set and correct
- Fix: check LiteLLM pod logs for Langfuse callback errors. Confirm LiteLLM pod can reach the Langfuse service on port 3000.
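The reachability check can be done with a plain TCP connection attempt from inside the LiteLLM pod. This is a minimal sketch using only the standard library; `can_reach` is a hypothetical helper, and the service hostname in the usage comment is an assumed cluster-internal DNS name that you should replace with your own.

```python
import socket


def can_reach(host, port=3000, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example (hostname is an assumption, adjust to your service name):
# can_reach("langfuse.hyperplane-langfuse.svc.cluster.local", 3000)
```

A `False` result points at DNS, a NetworkPolicy, or the service itself rather than at the callback configuration, which narrows the search before reading logs.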
MinIO export error when downloading traces
- Check: LANGFUSE_S3_ENDPOINT must use the cluster-internal MinIO DNS, not an external URL
- Check: LANGFUSE_S3_FORCE_PATH_STYLE must be "true" for MinIO compatibility
- Fix: update the MinIO env vars and restart the deployment (kubectl rollout restart deployment/langfuse -n hyperplane-langfuse). Run the Step 3 MinIO health check to confirm connectivity.
Performance Issues
Langfuse UI is slow to load traces
- Check: PostgreSQL pod CPU/memory — it is the primary data store
- Check: number of traces in the database — very large tables slow down queries
- Fix: add database indexes on frequently queried fields. Set a data retention policy to purge old traces.
Trace ingestion throughput is low
- Check: the default Langfuse deployment runs a single replica; high-volume environments may need more
- Fix: scale the deployment: kubectl scale deployment/langfuse -n hyperplane-langfuse --replicas=2
Frequently Asked Questions
Q: How do I reset the admin password?
Langfuse signs users in with email and password by default, and the password-reset email requires SMTP to be configured. If email is not configured, reset the password directly in PostgreSQL by updating the users table. Contact the Shakudo team for a guided reset.
Q: Can I use Langfuse without LiteLLM?
Yes. Any application can send traces to Langfuse via the SDK, REST API, or framework callbacks (LangChain, LlamaIndex, Dify). LiteLLM integration is the most automatic path but is not required.
Q: How do I delete traces to manage storage?
Use the Langfuse UI: filter traces by date, project, or other criteria and delete them in bulk. Or use the API: DELETE /api/public/traces with filters. Set a retention policy in Settings > Data Retention to automate this.
Q: Will traces still appear if Langfuse is temporarily down?
If Langfuse is unavailable, LiteLLM and SDK clients will log errors but continue serving model requests. Traces generated during the outage are lost — there is no built-in queue or replay. For high-availability trace requirements, run Langfuse with multiple replicas.
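The "log errors but continue" behaviour can be pictured as follows. This is an illustrative pattern only, not how LiteLLM or the Langfuse SDK are implemented internally: `call_with_tracing`, `model_call`, and `send_trace` are hypothetical names, and the failing sender simulates an outage.

```python
import logging

logger = logging.getLogger("tracing")


def call_with_tracing(model_call, send_trace, prompt):
    """Serve the model response even if sending the trace fails."""
    response = model_call(prompt)
    try:
        send_trace({"input": prompt, "output": response})
    except Exception:
        # The trace is dropped: there is no queue or replay in this sketch.
        logger.exception("trace could not be delivered")
    return response


def fake_model(prompt):
    return prompt.upper()


def unavailable_backend(trace):
    raise ConnectionError("observability backend unreachable")


delivered = []
during_outage = call_with_tracing(fake_model, unavailable_backend, "hello")
after_recovery = call_with_tracing(fake_model, delivered.append, "hello again")
```

The caller receives a response in both cases, but only the second call leaves a record, which is why outage windows show up as gaps in the trace history.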
Q: How do I upgrade Langfuse to a new version?
Update image.tag in values.yaml to the new version and run helm upgrade. Langfuse v3 handles database migrations automatically on startup. Always back up the PostgreSQL database first and check the Langfuse changelog for breaking changes.

