Data Lineage¶
Data lineage tracks the origin, movement, and transformation of data through your pipelines. DataKit uses OpenLineage to capture and visualize lineage automatically.
What is Lineage?¶
Lineage answers critical questions:
| Question | Lineage Answer |
|---|---|
| Where did this data come from? | Upstream sources and transformations |
| What depends on this data? | Downstream consumers |
| What happened to this run? | Job execution events and status |
| Is this data fresh? | Last successful run timestamp |
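Concretely, these questions can be asked against the lineage backend's API. A minimal sketch, assuming a Marquez-style REST endpoint (the node-ID format and `lineage` query endpoint follow Marquez conventions; the helper functions themselves are illustrative):

```python
import json
import urllib.parse
import urllib.request


def dataset_node_id(namespace: str, name: str) -> str:
    """Build a Marquez-style node ID for a dataset."""
    return f"dataset:{namespace}:{name}"


def fetch_lineage(base_url: str, node_id: str, depth: int = 2) -> dict:
    """Query the lineage graph around a node (upstream and downstream)."""
    query = urllib.parse.urlencode({"nodeId": node_id, "depth": depth})
    with urllib.request.urlopen(f"{base_url}/api/v1/lineage?{query}", timeout=10) as resp:
        return json.load(resp)


# Example (requires a running Marquez instance):
# graph = fetch_lineage("http://localhost:5000",
#                       dataset_node_id("kafka", "production/user-events"))
```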
How It Works¶
```
┌─────────────────────────────────────────────────────────────────┐
│                         Lineage Flow                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌──────────────┐         ┌──────────────┐        ┌──────────┐ │
│   │   Pipeline   │────────▶│   Lineage    │───────▶│  Marquez │ │
│   │   Runtime    │ events  │  Collector   │ store  │    UI    │ │
│   └──────────────┘         └──────────────┘        └──────────┘ │
│          │                        ▲                      │      │
│          │                        │                      │      │
│          ▼                        │                      ▼      │
│   ┌──────────────┐         ┌──────────────┐        ┌──────────┐ │
│   │   dk.yaml    │         │ OpenLineage  │        │ Lineage  │ │
│   │   manifest   │────────▶│   Events     │        │  Graph   │ │
│   └──────────────┘ defines └──────────────┘        └──────────┘ │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
Automatic Lineage¶
When you run `dk run`, the CLI automatically:
- Reads the manifest to understand inputs/outputs
- Emits a `START` event when the pipeline begins
- Emits a `COMPLETE` or `FAIL` event when the pipeline ends
- Sends events to Marquez (or the configured backend)
No code changes required in your pipeline!
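Conceptually, the emitted event pair looks like the sketch below, with plain dictionaries standing in for the OpenLineage SDK objects. The field names follow the OpenLineage event shape shown later in this page; the helper function itself is illustrative, not DataKit internals:

```python
import uuid
from datetime import datetime, timezone


def run_event(event_type: str, job_name: str, run_id: str,
              inputs=None, outputs=None) -> dict:
    """Build an OpenLineage-shaped run event as a plain dict."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "job": {"namespace": "analytics", "name": job_name},
        "run": {"runId": run_id},
        "inputs": inputs or [],
        "outputs": outputs or [],
    }


# The same runId ties the START and COMPLETE events to one run
run_id = str(uuid.uuid4())
start = run_event("START", "my-package", run_id)
# ... pipeline executes ...
complete = run_event(
    "COMPLETE", "my-package", run_id,
    inputs=[{"namespace": "kafka", "name": "production/user-events"}],
    outputs=[{"namespace": "s3", "name": "analytics-bucket/processed/"}],
)
```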
Manual Lineage (Optional)¶
For custom lineage, use the OpenLineage Python client in your pipeline:

```python
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient.from_environment()

# Emit a custom lineage event
client.emit(
    RunEvent(
        eventType=RunState.RUNNING,
        eventTime=datetime.now(timezone.utc).isoformat(),
        job=Job(namespace="analytics", name="my-pipeline"),
        run=Run(runId=str(uuid.uuid4())),
        producer="my-pipeline",
        inputs=[...],
        outputs=[...],
    )
)
```
Viewing Lineage¶
Local Development¶
With `dk dev up`, the Marquez API is available at http://localhost:5000 and the Marquez web UI at http://localhost:3000.
CLI Lineage Command¶
Not Yet Implemented
The dk lineage CLI command is planned but not yet available. Use the Marquez Web UI at http://localhost:3000 to view lineage graphs.
Planned usage:

```shell
dk lineage my-package
```

Output:

```
Lineage for: my-package
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Upstream:
  ├─ kafka://production/user-events
  └─ postgres://users-db/users

Downstream:
  ├─ s3://analytics-bucket/processed/
  └─ dashboard/user-metrics

Last Run:
  Status:   COMPLETE
  Started:  2025-01-22T10:00:00Z
  Finished: 2025-01-22T10:05:32Z
  Duration: 5m 32s
```
Marquez UI¶
The Marquez UI provides a visual lineage graph:
```
┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│ user-events │────▶│   my-pipeline   │────▶│ processed/  │
│   (kafka)   │     │                 │     │    (s3)     │
└─────────────┘     └─────────────────┘     └─────────────┘
                             │
                             ▼
                    ┌─────────────┐
                    │ user-metrics│
                    │ (dashboard) │
                    └─────────────┘
```
OpenLineage Events¶
The DK CLI emits standard OpenLineage events:
Run Events¶
| Event Type | When Emitted |
|---|---|
| `START` | Pipeline begins execution |
| `RUNNING` | Periodic heartbeat (optional) |
| `COMPLETE` | Pipeline finished successfully |
| `FAIL` | Pipeline failed with error |
| `ABORT` | Pipeline was cancelled |
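As a sketch of how a runner might choose the terminal event type from a pipeline's outcome (the function name and signature here are illustrative, not DataKit API):

```python
def final_event_type(exit_code: int, cancelled: bool = False) -> str:
    """Map a pipeline outcome to its terminal OpenLineage event type."""
    if cancelled:
        return "ABORT"      # run was cancelled before finishing
    if exit_code == 0:
        return "COMPLETE"   # run finished successfully
    return "FAIL"           # run exited with an error
```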
Event Structure¶
```json
{
  "eventType": "COMPLETE",
  "eventTime": "2025-01-22T10:05:32.000Z",
  "job": {
    "namespace": "analytics",
    "name": "my-package"
  },
  "run": {
    "runId": "run-abc123"
  },
  "inputs": [
    {
      "namespace": "kafka",
      "name": "production/user-events"
    }
  ],
  "outputs": [
    {
      "namespace": "s3",
      "name": "analytics-bucket/processed/"
    }
  ]
}
```
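Downstream tools can consume these events directly. A small sketch that parses an event and renders its datasets as `namespace://name` URI strings, matching the style of the lineage output shown earlier (the helper itself is illustrative):

```python
import json


def dataset_uris(event_json: str) -> dict:
    """Extract input/output datasets from a run event as URI strings."""
    event = json.loads(event_json)

    def to_uri(dataset: dict) -> str:
        return f"{dataset['namespace']}://{dataset['name']}"

    return {
        "inputs": [to_uri(d) for d in event.get("inputs", [])],
        "outputs": [to_uri(d) for d in event.get("outputs", [])],
    }
```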
Lineage Configuration¶
Backend Configuration¶
Configure the OpenLineage backend in your environment:
```shell
# Environment variables
export OPENLINEAGE_URL=http://localhost:5000/api/v1/lineage
export OPENLINEAGE_API_KEY=your-api-key   # if required
```
Or in ~/.dk/config.yaml:
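A minimal sketch of what that file might contain (the `lineage` key and field names are assumptions about DataKit's config schema, not confirmed):

```yaml
# ~/.dk/config.yaml (illustrative; key names are assumed)
lineage:
  url: http://localhost:5000/api/v1/lineage
  api_key: your-api-key   # if required
```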
Supported Backends¶
| Backend | Description |
|---|---|
| Marquez | Default, open-source lineage server |
| DataHub | LinkedIn's data catalog |
| OpenMetadata | Open-source metadata platform |
| Custom | Any OpenLineage-compatible endpoint |
Lineage Best Practices¶
1. Meaningful Names¶
Use descriptive, consistent names:
```yaml
# Good
metadata:
  name: user-events-to-s3-parquet
  namespace: analytics

# Bad
metadata:
  name: pipeline1
  namespace: default
```
2. Document Inputs/Outputs¶
Include descriptions in your manifest:
```yaml
inputs:
  - name: user-events
    type: kafka-topic
    description: "Real-time user behavior events from web app"
```
3. Use Namespaces¶
Group related packages:
```yaml
metadata:
  namespace: analytics     # All analytics pipelines

# or

metadata:
  namespace: ml-training   # All ML training jobs
```
4. Tag Sensitive Data¶
Use classification for governance:
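For example, a manifest input could carry a classification tag. The `tags` and `classification` field names below are assumptions for illustration, not a confirmed manifest schema:

```yaml
inputs:
  - name: user-events
    type: kafka-topic
    tags:
      classification: pii   # assumed field; align with your governance schema
```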
Troubleshooting Lineage¶
Events Not Appearing¶
- Check that Marquez is running: `dk dev status`
- Verify the endpoint: `echo $OPENLINEAGE_URL`
- Check connectivity: `curl $OPENLINEAGE_URL/api/v1/namespaces`
Missing Connections¶
If upstream/downstream links are missing:
- Ensure consistent naming across packages
- Check that binding references match
- Verify packages are in the same namespace
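One way to catch naming drift between packages is a small consistency check: compare what the upstream package declares as outputs against what the downstream package expects as inputs. The helper below is illustrative (DataKit does not ship it):

```python
def missing_links(upstream_outputs: list[str], downstream_inputs: list[str]) -> list[str]:
    """Return downstream input names that no upstream output produces."""
    produced = set(upstream_outputs)
    return [name for name in downstream_inputs if name not in produced]


# A near-miss like "user-event" vs "user-events" breaks the lineage edge
gaps = missing_links(["user-events"], ["user-events", "user-event"])
```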
Stale Lineage¶
If lineage shows old data:
```shell
# Planned: dk lineage my-package --refresh
# For now, check directly in the Marquez UI at http://localhost:3000
```
Next Steps¶
- Governance - How lineage enables governance
- Environments - Lineage across environments
- Troubleshooting - Common lineage issues