Overview
DataHub is an open-source metadata platform used to catalog data assets, document ownership, track lineage, and make datasets easier to discover and govern.
In a Shakudo environment, DataHub sits at the data discovery and governance layer. It connects to warehouses, BI tools, orchestration tools, and databases so teams can understand what data exists, who owns it, and how it is used.
This page is written for onboarding and deployment calls. It focuses on what customers need to understand, provide, validate, and troubleshoot in a real environment.
Where it fits in the stack
- Primary role: DataHub is a shared metadata catalog for the whole platform, not a one-off application.
- Typical deployment model: Kubernetes + Helm, with customer-specific values and secrets.
- Typical access model: private internal endpoint or customer-approved external route.
- Typical support model: validate deployment health first, then validate user workflow and integrations.
Getting Started
Start with one safe workflow in DataHub before enabling production usage. The goal is to prove connectivity, permissions, and operational ownership.
What the customer needs to provide
- metadata sources such as warehouses, databases, Airbyte, dbt, Superset, or Kafka
- ingestion credentials with read-only metadata access
- search/index backend such as Elasticsearch or OpenSearch
- Kafka and SQL metadata store configuration, either bundled or external
- initial admin users and ownership model
First workflow
- Open the DataHub UI
- Create or import the first ingestion source
- Run ingestion against one safe source first, such as a staging database
- Review datasets, schemas, ownership, and glossary terms
- Add owners, tags, and documentation for high-value assets
- Schedule ingestion after the initial result is validated
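As a rough illustration of the first ingestion source in the workflow above, the sketch below builds a recipe as a plain Python dict. The `source`/`sink` layout and the `datahub-rest` sink type follow DataHub's recipe format; the helper `build_recipe`, the hostnames, database name, username, and GMS URL are all hypothetical placeholders, and the Postgres source is only an assumed example of "one safe source".

```python
import json


def build_recipe(source_type, source_config, gms_url):
    """Return a DataHub-style ingestion recipe as a plain dict (hypothetical helper)."""
    return {
        "source": {"type": source_type, "config": source_config},
        "sink": {"type": "datahub-rest", "config": {"server": gms_url}},
    }


# Every value below is a placeholder, not a real environment setting.
recipe = build_recipe(
    "postgres",
    {
        "host_port": "staging-db.internal:5432",
        "database": "analytics",
        "username": "datahub_reader",  # read-only account
        "password": "${POSTGRES_PASSWORD}",  # keep secrets out of the recipe itself
    },
    "http://datahub-gms:8080",
)
print(json.dumps(recipe, indent=2))
```

The same structure, written as a YAML recipe file, is what `datahub ingest -c recipe.yml` expects for a manual run; the UI-based ingestion form collects the equivalent fields.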
Administration and Best Practices
Use these practices to keep DataHub reliable after the initial deployment.
- Start with a small number of high-value sources before cataloging everything
- Use read-only ingestion credentials
- Define owner and domain conventions before asking teams to contribute
- Schedule metadata ingestion during low-traffic windows on the source systems
- Monitor Elasticsearch/OpenSearch storage because metadata indexes grow over time
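One possible way to watch index growth is to poll the search backend's `_cat/indices?format=json&bytes=b` endpoint and flag large indexes. The sketch below only processes an already-parsed response; the helper name `indices_over_limit`, the 2 GiB threshold, and the sample rows are assumptions for illustration, not values from a real deployment.

```python
def indices_over_limit(cat_indices, limit_bytes):
    """Return (index, size_in_bytes) pairs above limit_bytes, largest first."""
    sized = [
        # store.size may be missing or null for some indexes; treat that as 0
        (row["index"], int(row.get("store.size") or 0))
        for row in cat_indices
    ]
    over = [(name, size) for name, size in sized if size > limit_bytes]
    return sorted(over, key=lambda pair: pair[1], reverse=True)


# Illustrative rows only; real index names and sizes differ per deployment.
sample = [
    {"index": "datasetindex_v2", "store.size": "5368709120"},
    {"index": "graph_service_v1", "store.size": "1073741824"},
    {"index": "glossarytermindex_v2", "store.size": "1048576"},
]
flagged = indices_over_limit(sample, limit_bytes=2 * 1024**3)
print(flagged)
```

With the sample rows, only the 5 GiB index is flagged. In practice the threshold would be set relative to the disk actually provisioned for the search backend.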
- Back up the DataHub metadata store (the SQL database) before upgrades
Troubleshooting & FAQ
Use this section during customer debugging calls. Format: Problem → What to check → Fix.
Ingestion job fails
- What to check: connector credentials, network access to the source, and the ingestion pod logs
- Fix: correct the source config and rerun the ingestion recipe manually
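For the network access check, a minimal sketch using only the Python standard library is shown here. It tests whether a TCP connection can be opened to the source, which separates network problems from credential problems. The function name `can_reach` and the default timeout are assumptions; to be meaningful it should be run from somewhere with the same network path as the ingestion pod.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False
```

For example, `can_reach("staging-db.internal", 5432)` (a placeholder host and port) returning `False` points to DNS, firewall, or routing rather than to the connector credentials.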
Assets do not appear in search
- What to check: GMS health, search backend health, and whether ingestion completed
- Fix: confirm the Elasticsearch/OpenSearch indexes are healthy, then rerun ingestion
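To make "search backend health" concrete, the sketch below interprets a parsed response from the Elasticsearch/OpenSearch `_cluster/health` endpoint, whose `status` field is `green`, `yellow`, or `red`. The helper name `search_backend_ok` and the choice to accept `yellow` by default are assumptions; `yellow` is common on single-node clusters where replica shards cannot be assigned.

```python
def search_backend_ok(cluster_health, allow_yellow=True):
    """Decide whether the search cluster is usable, given a _cluster/health payload."""
    status = cluster_health.get("status")
    if status == "green":
        return True
    if status == "yellow":
        # Primaries are allocated, but some replicas are not
        return allow_yellow
    # "red" (a primary shard is unassigned) or an unexpected payload
    return False


# Illustrative payloads; real responses contain more fields.
single_node = {"status": "yellow", "number_of_nodes": 1, "unassigned_shards": 12}
broken = {"status": "red", "number_of_nodes": 3, "unassigned_shards": 4}
print(search_backend_ok(single_node), search_backend_ok(broken))
```

A `red` status is worth resolving before rerunning ingestion, since writes to affected indexes may fail and assets would remain missing from search.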
Lineage is missing
- What to check: whether the source supports lineage and whether dbt/BI metadata was ingested
- Fix: add the relevant source connector or dbt manifest ingestion
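Before adding dbt manifest ingestion, it can help to confirm that the manifest actually carries dependency information. This sketch counts model-to-upstream edges in a parsed dbt `manifest.json`, relying on its `nodes` map and each node's `depends_on.nodes` list. The helper name `count_lineage_edges` and the sample project names are hypothetical.

```python
def count_lineage_edges(manifest):
    """Count upstream dependencies declared across all nodes in a dbt manifest."""
    edges = 0
    for node in manifest.get("nodes", {}).values():
        edges += len(node.get("depends_on", {}).get("nodes", []))
    return edges


# Tiny illustrative manifest; a real manifest.json has many more keys per node.
sample_manifest = {
    "nodes": {
        "model.shop.stg_orders": {"depends_on": {"nodes": ["source.shop.raw.orders"]}},
        "model.shop.orders": {
            "depends_on": {"nodes": ["model.shop.stg_orders", "model.shop.stg_customers"]}
        },
        "model.shop.stg_customers": {"depends_on": {"nodes": []}},
    }
}
print(count_lineage_edges(sample_manifest))
```

A count of zero suggests the manifest was generated from a project without `ref` or `source` relationships, in which case ingesting it would not add lineage in DataHub.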
UI loads but metadata pages error
- What to check: DataHub GMS logs and metadata store connectivity
- Fix: confirm the database and Kafka are healthy, then restart GMS

