Data Lineage¶
Data lineage tracks the origin, movement, and transformation of data through your pipelines. DataKit uses OpenLineage to capture and visualize lineage automatically.
What is Lineage?¶
Lineage answers critical questions:
| Question | Lineage Answer |
|---|---|
| Where did this data come from? | Upstream sources and transformations |
| What depends on this data? | Downstream consumers |
| What happened to this run? | Job execution events and status |
| Is this data fresh? | Last successful run timestamp |
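Concretely, these questions can be asked against the lineage backend's API. A minimal sketch, assuming a Marquez-style REST endpoint (the node-ID format and `lineage` query endpoint follow Marquez conventions; the helper functions themselves are illustrative):

```python
import json
import urllib.parse
import urllib.request


def dataset_node_id(namespace: str, name: str) -> str:
    """Build a Marquez-style node ID for a dataset."""
    return f"dataset:{namespace}:{name}"


def fetch_lineage(base_url: str, node_id: str, depth: int = 2) -> dict:
    """Query the lineage graph around a node (upstream and downstream)."""
    query = urllib.parse.urlencode({"nodeId": node_id, "depth": depth})
    with urllib.request.urlopen(f"{base_url}/api/v1/lineage?{query}", timeout=10) as resp:
        return json.load(resp)


# Example (requires a running Marquez instance):
# graph = fetch_lineage("http://localhost:5000",
#                       dataset_node_id("kafka", "production/user-events"))
```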
How It Works¶
```
┌─────────────────────────────────────────────────────────────────┐
│                         Lineage Flow                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌──────────────┐         ┌──────────────┐        ┌──────────┐ │
│   │   Pipeline   │────────▶│   Lineage    │───────▶│  Marquez │ │
│   │   Runtime    │ events  │  Collector   │ store  │    UI    │ │
│   └──────────────┘         └──────────────┘        └──────────┘ │
│          │                        ▲                      │      │
│          │                        │                      │      │
│          ▼                        │                      ▼      │
│   ┌──────────────┐         ┌──────────────┐        ┌──────────┐ │
│   │   dk.yaml    │         │ OpenLineage  │        │ Lineage  │ │
│   │   manifest   │────────▶│   Events     │        │  Graph   │ │
│   └──────────────┘ defines └──────────────┘        └──────────┘ │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
Automatic Lineage¶
When you run `dk run`, the CLI automatically:
- Reads the manifest to understand inputs/outputs
- Emits a `START` event when the pipeline begins
- Emits a `COMPLETE` or `FAIL` event when the pipeline ends
- Sends events to Marquez (or the configured backend)
No code changes required in your pipeline!
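Conceptually, the emitted event pair looks like the sketch below, with plain dictionaries standing in for the OpenLineage SDK objects. The field names follow the OpenLineage event shape shown later in this page; the helper function itself is illustrative, not DataKit internals:

```python
import uuid
from datetime import datetime, timezone


def run_event(event_type: str, job_name: str, run_id: str,
              inputs=None, outputs=None) -> dict:
    """Build an OpenLineage-shaped run event as a plain dict."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "job": {"namespace": "analytics", "name": job_name},
        "run": {"runId": run_id},
        "inputs": inputs or [],
        "outputs": outputs or [],
    }


# The same runId ties the START and COMPLETE events to one run
run_id = str(uuid.uuid4())
start = run_event("START", "my-package", run_id)
# ... pipeline executes ...
complete = run_event(
    "COMPLETE", "my-package", run_id,
    inputs=[{"namespace": "kafka", "name": "production/user-events"}],
    outputs=[{"namespace": "s3", "name": "analytics-bucket/processed/"}],
)
```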
Manual Lineage (Optional)¶
For custom lineage, use the OpenLineage Python client in your pipeline:

```python
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

client = OpenLineageClient.from_environment()

# Emit a custom lineage event
client.emit(
    RunEvent(
        eventType=RunState.RUNNING,
        eventTime=datetime.now(timezone.utc).isoformat(),
        job=Job(namespace="analytics", name="my-pipeline"),
        run=Run(runId=str(uuid.uuid4())),
        producer="my-pipeline",
        inputs=[...],
        outputs=[...],
    )
)
```
Viewing Lineage¶
Local Development¶
With `dk dev up`, the Marquez API is available at http://localhost:5000 and the Marquez web UI at http://localhost:3000.
CLI Lineage Command¶
Not Yet Implemented
The dk lineage CLI command is planned but not yet available. Use the Marquez Web UI at http://localhost:3000 to view lineage graphs.
Planned usage:

```shell
dk lineage my-package
```

Output:

```
Lineage for: my-package
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Upstream:
  ├─ kafka://production/user-events
  └─ postgres://users-db/users

Downstream:
  ├─ s3://analytics-bucket/processed/
  └─ dashboard/user-metrics

Last Run:
  Status:   COMPLETE
  Started:  2025-01-22T10:00:00Z
  Finished: 2025-01-22T10:05:32Z
  Duration: 5m 32s
```
Marquez UI¶
The Marquez UI provides a visual lineage graph:
```
┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│ user-events │────▶│   my-pipeline   │────▶│ processed/  │
│   (kafka)   │     │                 │     │    (s3)     │
└─────────────┘     └─────────────────┘     └─────────────┘
                             │
                             ▼
                    ┌─────────────┐
                    │ user-metrics│
                    │ (dashboard) │
                    └─────────────┘
```
OpenLineage Events¶
The DK CLI emits standard OpenLineage events:
Run Events¶
| Event Type | When Emitted |
|---|---|
| `START` | Pipeline begins execution |
| `RUNNING` | Periodic heartbeat (optional) |
| `COMPLETE` | Pipeline finished successfully |
| `FAIL` | Pipeline failed with error |
| `ABORT` | Pipeline was cancelled |
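As a sketch of how a runner might choose the terminal event type from a pipeline's outcome (the function name and signature here are illustrative, not DataKit API):

```python
def final_event_type(exit_code: int, cancelled: bool = False) -> str:
    """Map a pipeline outcome to its terminal OpenLineage event type."""
    if cancelled:
        return "ABORT"      # run was cancelled before finishing
    if exit_code == 0:
        return "COMPLETE"   # run finished successfully
    return "FAIL"           # run exited with an error
```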
Event Structure¶
```json
{
  "eventType": "COMPLETE",
  "eventTime": "2025-01-22T10:05:32.000Z",
  "job": {
    "namespace": "analytics",
    "name": "my-package"
  },
  "run": {
    "runId": "run-abc123"
  },
  "inputs": [
    {
      "namespace": "kafka",
      "name": "production/user-events"
    }
  ],
  "outputs": [
    {
      "namespace": "s3",
      "name": "analytics-bucket/processed/"
    }
  ]
}
```
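Downstream tools can consume these events directly. A small sketch that parses an event and renders its datasets as `namespace://name` URI strings, matching the style of the lineage output shown earlier (the helper itself is illustrative):

```python
import json


def dataset_uris(event_json: str) -> dict:
    """Extract input/output datasets from a run event as URI strings."""
    event = json.loads(event_json)

    def to_uri(dataset: dict) -> str:
        return f"{dataset['namespace']}://{dataset['name']}"

    return {
        "inputs": [to_uri(d) for d in event.get("inputs", [])],
        "outputs": [to_uri(d) for d in event.get("outputs", [])],
    }
```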
Lineage Configuration¶
Backend Configuration¶
Configure the OpenLineage backend in your environment:
```shell
# Environment variables
export OPENLINEAGE_URL=http://localhost:5000/api/v1/lineage
export OPENLINEAGE_API_KEY=your-api-key   # if required
```
Or in ~/.dk/config.yaml:
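A minimal sketch of what that file might contain (the `lineage` key and field names are assumptions about DataKit's config schema, not confirmed):

```yaml
# ~/.dk/config.yaml (illustrative; key names are assumed)
lineage:
  url: http://localhost:5000/api/v1/lineage
  api_key: your-api-key   # if required
```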
Supported Backends¶
| Backend | Description |
|---|---|
| Marquez | Default, open-source lineage server |
| DataHub | LinkedIn's data catalog |
| OpenMetadata | Open-source metadata platform |
| Custom | Any OpenLineage-compatible endpoint |
Lineage Best Practices¶
1. Meaningful Names¶
Use descriptive, consistent names:
```yaml
# Good
metadata:
  name: user-events-to-s3-parquet
  namespace: analytics

# Bad
metadata:
  name: pipeline1
  namespace: default
```
2. Document Inputs/Outputs¶
Include descriptions in your manifest:
```yaml
inputs:
  - name: user-events
    type: kafka-topic
    description: "Real-time user behavior events from web app"
```
3. Use Namespaces¶
Group related packages:
```yaml
metadata:
  namespace: analytics     # All analytics pipelines

# or

metadata:
  namespace: ml-training   # All ML training jobs
```
4. Tag Sensitive Data¶
Use classification for governance:
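For example, a manifest input could carry a classification tag. The `tags` and `classification` field names below are assumptions for illustration, not a confirmed manifest schema:

```yaml
inputs:
  - name: user-events
    type: kafka-topic
    tags:
      classification: pii   # assumed field; align with your governance schema
```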
Troubleshooting Lineage¶
Events Not Appearing¶
- Check that Marquez is running: `dk dev status`
- Verify the endpoint: `echo $OPENLINEAGE_URL`
- Check connectivity: `curl $OPENLINEAGE_URL/api/v1/namespaces`
Missing Connections¶
If upstream/downstream links are missing:
- Ensure consistent naming across packages
- Check that binding references match
- Verify packages are in the same namespace
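One way to catch naming drift between packages is a small consistency check: compare what the upstream package declares as outputs against what the downstream package expects as inputs. The helper below is illustrative (DataKit does not ship it):

```python
def missing_links(upstream_outputs: list[str], downstream_inputs: list[str]) -> list[str]:
    """Return downstream input names that no upstream output produces."""
    produced = set(upstream_outputs)
    return [name for name in downstream_inputs if name not in produced]


# A near-miss like "user-event" vs "user-events" breaks the lineage edge
gaps = missing_links(["user-events"], ["user-events", "user-event"])
```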
Stale Lineage¶
If lineage shows old data:
```shell
# Planned: dk lineage my-package --refresh
# For now, check directly in the Marquez UI at http://localhost:3000
```
Next Steps¶
- Governance - How lineage enables governance
- Environments - Lineage across environments
- Troubleshooting - Common lineage issues