What is Langfuse, and How to Deploy It in an Enterprise Data Stack?

Last updated on
May 12, 2026

What is Langfuse?

Langfuse is an open-source LLM engineering platform that's gaining traction among tech-savvy teams for its robust observability and tracing capabilities. Its core differentiator lies in its production-ready, asynchronous architecture that doesn't compromise application performance. Organizations appreciate Langfuse for its ability to streamline debugging, analysis, and iteration of LLM applications, offering features like model-based evaluations, user feedback collection, and manual annotations. Teams using Langfuse report significant improvements in application latency and quality, with one European fintech halving their application's response time through Langfuse's tracing insights. The platform's open-core model, extensive integrations, and focus on data ownership also make it an attractive choice for enterprises looking to maintain control over their LLM infrastructure while benefiting from advanced observability tools.

Why is Langfuse better on Shakudo?

While Langfuse is powerful on its own, deploying it on Shakudo's data and AI operating system takes it to the next level.

By running Langfuse on Shakudo, you get the best of both worlds: Langfuse's cutting-edge LLM tools and Shakudo's seamless deployment and integration capabilities. Instead of wrestling with complex setups or worrying about security, you can have Langfuse up and running in minutes with just a few clicks. Shakudo's automated DevOps ensures your Langfuse instance is always optimized and secure, while its deep integration ecosystem allows you to effortlessly connect Langfuse with your existing AI stack. This combination not only saves you time and resources but also provides a level of operational efficiency and scalability that's hard to achieve with other solutions or self-deployment.

Langfuse Knowledge Base

Langfuse Overview

Langfuse is an open-source observability and analytics platform purpose-built for LLM applications. It gives your team a single place to see every prompt, response, trace, latency number, token count, and cost — across all your AI-powered tools and services, from production runs to debugging sessions.

In Shakudo environments, Langfuse sits in the observability layer, downstream of LiteLLM and alongside applications like Dify and AgentFlow. Every model call that flows through LiteLLM can be logged to Langfuse automatically, giving you a live trace of exactly what your AI stack is doing.

What Problem Does Langfuse Solve?

LLM applications are hard to debug and expensive to operate without visibility. When something goes wrong — a bad response, a spike in latency, unexpected token costs — you need to know which prompt caused it, which model handled it, and how long it took. Without Langfuse, that information lives in scattered logs or does not exist at all.

  • Captures full traces: prompt, model, response, latency, token count, and cost in one view
  • Lets you compare prompt versions and evaluate quality over time
  • Shows which models and agents are most expensive or slow
  • Enables your team to replay and debug individual LLM calls

How Langfuse Fits in the Shakudo Stack

Langfuse is the observability layer for all AI activity in the environment:

  • LiteLLM logs every model call to Langfuse via a success/failure callback — no app code changes needed
  • Dify can send traces to Langfuse directly via its built-in Langfuse integration
  • Custom Python or JavaScript apps use the Langfuse SDK to instrument their own LLM calls
  • LangChain and LlamaIndex apps work with Langfuse via native callback handlers
  • Langfuse stores trace data in PostgreSQL and exports artifacts to MinIO

Key Concepts

  • Trace: a complete record of one logical operation — e.g. a user query from start to finish, including all LLM calls, tool uses, and sub-steps inside it.
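  • Span: a single timed step within a trace (one function call, one retrieval, one tool use). Spans can nest.
  • Observation: the general term for any step recorded inside a trace. Langfuse distinguishes spans, generations (LLM calls, which carry model, token count, and cost), and events; each observation records input, output, and timing.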
  • Session: a group of traces belonging to the same user conversation or workflow run.
  • Score: a human or automated evaluation attached to a trace or span (e.g. pass/fail, 1-5 rating).
  • Prompt Management: versioned prompts stored in Langfuse and linked to the traces that use them.
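To make the nesting concrete, here is a minimal sketch of how traces, the steps inside them, and sessions relate. The class and field names (`Step`, `Trace`, `cost_usd`, and so on) are illustrative assumptions, not the Langfuse schema or SDK types.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step inside a trace, e.g. a retrieval or an LLM call (illustrative fields)."""
    name: str
    model: str
    latency_ms: float
    total_tokens: int
    cost_usd: float


@dataclass
class Trace:
    """One logical operation; traces sharing a session_id form a session."""
    name: str
    session_id: str
    steps: list[Step] = field(default_factory=list)

    def total_cost(self) -> float:
        return sum(step.cost_usd for step in self.steps)


def cost_by_model(traces: list[Trace]) -> dict[str, float]:
    """Sum step cost per model across all traces."""
    totals: dict[str, float] = defaultdict(float)
    for trace in traces:
        for step in trace.steps:
            totals[step.model] += step.cost_usd
    return dict(totals)


trace = Trace(
    name="answer-question",
    session_id="session-abc-123",
    steps=[
        Step("retrieve", "embedding-model", 40.0, 120, 0.25),
        Step("generate", "chat-model", 900.0, 850, 1.5),
    ],
)
```

Grouping cost by model in this way is roughly what the "most expensive model" view does with real trace data.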

What Langfuse Is Not

  • Not an LLM gateway. It does not route model calls — use LiteLLM for that.
  • Not a log aggregation platform. It is purpose-built for LLM traces, not general application logs.
  • Not a feature store or model registry. It tracks prompt/response quality, not model weights.

Administration & Best Practices

This page covers how to keep Langfuse stable, organised, and cost-efficient in a production Shakudo environment.

Project and API Key Organisation

Separate observability data by creating one Langfuse project per environment:

  • production: all live traffic — strict access
  • staging: pre-production validation
  • dev: developer experiments and testing

Create one API key pair per application or team so usage is attributable and keys can be rotated independently. Revoke old keys via Settings > API Keys.
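One way to keep per-environment keys apart in application code is sketched below. The suffixed variable names (`LANGFUSE_PUBLIC_KEY_STAGING` and similar) are a hypothetical naming convention for this example; only the unsuffixed `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` names are ones the Langfuse SDK reads.

```python
import os


def langfuse_credentials(environment: str, source=None) -> dict[str, str]:
    """Pick the key pair for one environment and return it under the SDK's variable names."""
    source = os.environ if source is None else source
    suffix = environment.upper()  # "staging" -> "STAGING"
    try:
        return {
            "LANGFUSE_PUBLIC_KEY": source[f"LANGFUSE_PUBLIC_KEY_{suffix}"],
            "LANGFUSE_SECRET_KEY": source[f"LANGFUSE_SECRET_KEY_{suffix}"],
        }
    except KeyError as exc:
        # Fail loudly instead of silently sending traces to the wrong project.
        raise KeyError(
            f"no Langfuse key configured for {environment!r}: {exc.args[0]}"
        ) from None
```

Failing on a missing key is a deliberate choice here: a fallback to another environment's keys would mix production and test traces in one project.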

Tagging Traces for Observability

Always include user_id, session_id, and metadata on traces to enable filtering, cost attribution, and debugging:

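from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST from the environment
client = Langfuse()

# Python SDK v2-style call
trace = client.trace(
   name="workflow-name",
   user_id="user@example.com",  # placeholder address
   session_id="session-abc-123",
   metadata={"team": "risk", "env": "prod", "version": "v2.1"}
)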

Without tags, you cannot filter who or what triggered a given trace.
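A small guard can enforce this tagging rule before a trace is created. The sketch below is one possible approach; the helper name `build_trace_kwargs` and the required metadata keys are assumptions taken from the example values on this page, not part of the Langfuse SDK.

```python
# Assumed team convention, mirroring the metadata keys used on this page.
REQUIRED_METADATA_KEYS = ("team", "env", "version")


def build_trace_kwargs(name, user_id, session_id, metadata):
    """Return keyword arguments for a trace, or raise if a required tag is missing."""
    top_level = (("name", name), ("user_id", user_id), ("session_id", session_id))
    missing = [label for label, value in top_level if not value]
    missing += [
        f"metadata.{key}"
        for key in REQUIRED_METADATA_KEYS
        if key not in (metadata or {})
    ]
    if missing:
        raise ValueError("missing trace tags: " + ", ".join(missing))
    return {
        "name": name,
        "user_id": user_id,
        "session_id": session_id,
        "metadata": dict(metadata),
    }
```

The returned dictionary can be passed to the trace call with `**kwargs`, so untagged traces are rejected in the application rather than discovered later in the UI.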

Data Retention and Storage Management

Langfuse stores trace data in PostgreSQL and media files in MinIO. Both grow over time:

  • Set a retention policy on the PostgreSQL langfuse database to delete old traces
  • Configure TTL on the MinIO bucket to auto-expire old files
  • Langfuse v3 supports configurable data retention — check Settings > Data Retention in the UI

# Monitor PostgreSQL DB size
kubectl exec -it langfuse-postgresql-0 -n hyperplane-langfuse -- \
 psql -U langfuse -c "SELECT pg_size_pretty(pg_database_size('langfuse'));"

# Monitor MinIO bucket size
mc du shakudo-minio/langfuse-<env>
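The retention rule itself is simple date arithmetic. The following sketch shows which traces a policy of N days would purge; the function names and the (id, created_at) tuple shape are hypothetical and do not reflect how Langfuse implements retention internally.

```python
from datetime import datetime, timedelta, timezone


def retention_cutoff(now: datetime, retention_days: int) -> datetime:
    """Return the timestamp before which traces are considered expired."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return now - timedelta(days=retention_days)


def expired_trace_ids(traces, now: datetime, retention_days: int) -> list[str]:
    """traces is an iterable of (trace_id, created_at) pairs."""
    cutoff = retention_cutoff(now, retention_days)
    return [trace_id for trace_id, created_at in traces if created_at < cutoff]


now = datetime(2026, 5, 12, tzinfo=timezone.utc)
sample = [
    ("trace-old", datetime(2026, 1, 1, tzinfo=timezone.utc)),
    ("trace-new", datetime(2026, 5, 1, tzinfo=timezone.utc)),
]
```

With a 90-day policy and the sample data above, only `trace-old` falls before the cutoff.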

Keep-Alive Timeout (GCP/GKE Production Fix)

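On GCP with Cloud Load Balancer, the default Node.js keep-alive timeout (5s) is shorter than the load balancer idle timeout (600s). Node.js closes idle connections that the load balancer still considers open, so a request sent on a reused connection can fail with an intermittent 502 error.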

Fix (already included in the Deployment Runbook):

LANGFUSE_HTTP_KEEPALIVE_TIMEOUT_MS: "620000"
LANGFUSE_HTTP_HEADERS_TIMEOUT_MS:  "621000"

Always verify these are set after upgrades — they can be reset if values.yaml is regenerated from defaults.
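The relationship between the three timeouts can be checked mechanically after each upgrade. This is a minimal sketch: the function name is hypothetical, and the rule that the headers timeout should exceed the keep-alive timeout is inferred from the values above rather than stated by the runbook.

```python
def keepalive_problems(keepalive_ms: int, headers_ms: int, lb_idle_timeout_s: int = 600) -> list[str]:
    """Return a list of misconfigurations; an empty list means the values are consistent."""
    problems = []
    if keepalive_ms <= lb_idle_timeout_s * 1000:
        problems.append(
            "keep-alive timeout must exceed the load balancer idle timeout"
        )
    if headers_ms <= keepalive_ms:
        problems.append("headers timeout must exceed the keep-alive timeout")
    return problems
```

Calling it with the values from values.yaml (620000 and 621000) returns an empty list, while the Node.js default keep-alive of 5000 ms is flagged.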

Upgrades

Update image.tag in values.yaml to the new Langfuse version and redeploy:

helm upgrade langfuse . \
 --namespace hyperplane-langfuse \
 --values values.yaml \
 --timeout 10m \
 --wait

Langfuse v3 runs database migrations automatically on startup. Always back up the PostgreSQL database before upgrading.

Security Basics

Secrets

  • Store NEXTAUTH_SECRET, SALT, database password, and MinIO credentials in a Kubernetes secret
  • Never put secrets in values.yaml or ConfigMaps in plain text

Access control

  • Langfuse v3 has built-in RBAC at the organisation and project level
  • Use project-level roles (Owner, Admin, Member, Viewer) to limit who can see traces
  • For SSO/OIDC integration, configure AUTH_CUSTOM_CLIENT_ID and related env vars

Network

  • Expose Langfuse only on cluster-internal DNS unless external access is explicitly needed
  • If external access is required, use an Istio VirtualService or ingress with authentication

Backup Strategy

  • PostgreSQL: schedule regular pg_dump and upload to MinIO or off-cluster storage
  • MinIO: include the langfuse-<env> bucket in the cluster backup policy
  • API keys: if the database is lost, all API keys are lost. Keep a secure record of which application uses which key pair so that replacements can be issued and rolled out quickly.
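A scheduled backup job could assemble its pg_dump invocation as sketched below. The pod, namespace, user, and database names are taken from the monitoring command earlier on this page; the function name, the file naming scheme, and the assumption that pg_dump can authenticate inside the pod without a prompt are illustrative only.

```python
from datetime import datetime, timezone


def pg_dump_command(
    now: datetime,
    namespace: str = "hyperplane-langfuse",
    pod: str = "langfuse-postgresql-0",
    user: str = "langfuse",
    database: str = "langfuse",
):
    """Return (command, filename) for a timestamped custom-format dump."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    filename = f"langfuse-{stamp}.dump"
    command = [
        "kubectl", "exec", pod, "-n", namespace, "--",
        "pg_dump", "-U", user, "-d", database, "--format=custom",
    ]
    return command, filename


command, filename = pg_dump_command(datetime(2026, 5, 12, 3, 0, 0, tzinfo=timezone.utc))
```

The dump is written to standard output, so a caller would typically run the command with `subprocess.run(command, stdout=open(filename, "wb"), check=True)` and then upload the file to MinIO or off-cluster storage.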

Troubleshooting & FAQ

Use this page during live debugging. Format: Problem -> What to check -> Fix.

Deployment Issues

Pod stuck in CrashLoopBackOff

  • Check: kubectl logs deployment/langfuse -n hyperplane-langfuse
  • Common causes: DATABASE_URL incorrect, missing NEXTAUTH_SECRET or SALT, PostgreSQL not ready
  • Fix: confirm all required env vars are in the Kubernetes secret and referenced in envFrom. Make sure the PostgreSQL pod is Running before the main Langfuse pod starts.

UI loads but shows database error

  • Check: the web service is reachable, so the failure is in a database migration or query. Look at the pod logs for the failing statement.
  • Fix: verify DATABASE_URL and DIRECT_URL both point to the correct PostgreSQL host and database. Check pod logs for Prisma migration errors.

502 or timeout errors on requests

  • Check: GCP/GKE environments — keepAlive timeout is shorter than load balancer idle timeout
  • Fix: set LANGFUSE_HTTP_KEEPALIVE_TIMEOUT_MS=620000 and LANGFUSE_HTTP_HEADERS_TIMEOUT_MS=621000 in values.yaml and redeploy

Trace Ingestion Issues

Traces not appearing in the UI

  • Check: POST to /api/public/ingestion returns errors — look in the response body for specific failures
  • Check: verify the Authorization header uses the correct public/secret key pair for the project
  • Fix: re-run the Step 8 validation curl command. Confirm the keys match the project in the UI.
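The Langfuse public API uses HTTP Basic authentication with the public key as the username and the secret key as the password, and a swapped pair is an easy mistake to make. The sketch below builds the header and checks the usual `pk-lf-` and `sk-lf-` prefixes; the function name is hypothetical.

```python
import base64


def basic_auth_header(public_key: str, secret_key: str) -> dict[str, str]:
    """Build the Authorization header for the public API, rejecting swapped keys."""
    if not public_key.startswith("pk-lf-"):
        raise ValueError("public key should start with 'pk-lf-'")
    if not secret_key.startswith("sk-lf-"):
        raise ValueError("secret key should start with 'sk-lf-'")
    token = base64.b64encode(f"{public_key}:{secret_key}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}
```

If the header builds cleanly but ingestion still returns an authentication error, the keys most likely belong to a different project than the one open in the UI.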

LiteLLM traces not appearing in Langfuse

  • Check: LiteLLM litellmConfig includes success_callback and failure_callback with "langfuse"
  • Check: langfuse_host, langfuse_public_key, and langfuse_secret_key are set and correct
  • Fix: check LiteLLM pod logs for Langfuse callback errors. Confirm LiteLLM pod can reach the Langfuse service on port 3000.
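The two configuration checks above can be automated against the parsed LiteLLM settings. This sketch assumes the settings have already been loaded into a flat dictionary using the key names mentioned on this page; where those keys actually live in litellmConfig depends on the deployment, and the function name is hypothetical.

```python
def langfuse_callback_problems(settings: dict) -> list[str]:
    """Return a list of missing Langfuse settings; an empty list means the config looks complete."""
    problems = []
    for key in ("success_callback", "failure_callback"):
        if "langfuse" not in (settings.get(key) or []):
            problems.append(f'{key} does not include "langfuse"')
    for key in ("langfuse_host", "langfuse_public_key", "langfuse_secret_key"):
        if not settings.get(key):
            problems.append(f"{key} is not set")
    return problems
```

A passing check only confirms the settings are present; network reachability on port 3000 and key validity still need to be confirmed from the LiteLLM pod logs.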

MinIO export error when downloading traces

  • Check: LANGFUSE_S3_ENDPOINT must use the cluster-internal MinIO DNS, not an external URL
  • Check: LANGFUSE_S3_FORCE_PATH_STYLE must be "true" for MinIO compatibility
  • Fix: update the MinIO env vars and rollout restart. Run the Step 3 MinIO health check to confirm connectivity.
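Both MinIO checks can be expressed as a small validation over the environment variables. The sketch below treats a hostname ending in `.svc` or `.svc.cluster.local` as cluster-internal, which is a heuristic assumption about Kubernetes service DNS and may need adjusting for clusters with a custom domain; the function name and the example hostnames are hypothetical.

```python
from urllib.parse import urlparse

INTERNAL_SUFFIXES = (".svc", ".svc.cluster.local")  # assumed cluster DNS suffixes


def minio_env_problems(env: dict) -> list[str]:
    """Return a list of MinIO-related misconfigurations found in env."""
    problems = []
    host = urlparse(env.get("LANGFUSE_S3_ENDPOINT", "")).hostname or ""
    if not host.endswith(INTERNAL_SUFFIXES):
        problems.append("LANGFUSE_S3_ENDPOINT should use the cluster-internal MinIO DNS")
    if env.get("LANGFUSE_S3_FORCE_PATH_STYLE") != "true":
        problems.append('LANGFUSE_S3_FORCE_PATH_STYLE should be "true"')
    return problems
```

An endpoint without a scheme (for example `minio:9000`) is also flagged, because the hostname cannot be parsed reliably.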

Performance Issues

Langfuse UI is slow to load traces

  • Check: PostgreSQL pod CPU/memory — it is the primary data store
  • Check: number of traces in the database — very large tables slow down queries
  • Fix: add database indexes on frequently queried fields. Set a data retention policy to purge old traces.

Trace ingestion throughput is low

  • Check: Langfuse default deployment is single-replica — high-volume environments may need more replicas
  • Fix: scale the deployment: kubectl scale deployment/langfuse -n hyperplane-langfuse --replicas=2

Frequently Asked Questions

Q: How do I reset the admin password?

Langfuse signs users in with email and password by default, and self-service password reset emails require SMTP to be configured. If email is not configured, the password has to be reset directly in PostgreSQL by updating the users table. Contact the Shakudo team for a guided reset.

Q: Can I use Langfuse without LiteLLM?

Yes. Any application can send traces to Langfuse via the SDK, REST API, or framework callbacks (LangChain, LlamaIndex, Dify). LiteLLM integration is the most automatic path but is not required.

Q: How do I delete traces to manage storage?

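Use the Langfuse UI: filter traces by date, project, or other criteria and delete them in bulk. Or use the API: DELETE /api/public/traces, which in current Langfuse versions takes a list of trace IDs rather than a filter expression (check the API reference for your version). Set a retention policy in Settings > Data Retention to automate this.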

Q: Will traces still appear if Langfuse is temporarily down?

If Langfuse is unavailable, LiteLLM and SDK clients will log errors but continue serving model requests. Traces generated during the outage are lost — there is no built-in queue or replay. For high-availability trace requirements, run Langfuse with multiple replicas.

Q: How do I upgrade Langfuse to a new version?

Update image.tag in values.yaml to the new version and run helm upgrade. Langfuse v3 handles database migrations automatically on startup. Always back up the PostgreSQL database first and check the Langfuse changelog for breaking changes.

Why is Langfuse better on Shakudo?

Core Shakudo Features

Own Your AI

Keep data sovereign, protect IP, and avoid vendor lock-in with infra-agnostic deployments.

Faster Time-to-Value

Pre-built templates and automated DevOps accelerate time-to-value.

Flexible with Experts

Operating system and dedicated support ensure seamless adoption of the latest and greatest tools.