Quickstart¶
This tutorial walks you through the complete Data Kit workflow: from creating a new data package to promoting it to an environment.
Time to complete: ~10 minutes
What You'll Build¶
A simple data processing model that:
- Processes input data
- Transforms the data
- Outputs results
Step 1: Create a New Package¶
Initialize a new data package:
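Per the summary table at the end of this guide, package creation is done with `dk init`; the exact argument form shown here is an assumption:

```bash
# Scaffold a new package (argument form assumed; check `dk init --help`)
dk init my-first-pipeline
cd my-first-pipeline
```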
This creates the following structure:
```
my-first-pipeline/
├── dk.yaml            # Package manifest
├── main.py            # Your pipeline code
└── requirements.txt   # Python dependencies
```
Let's look at the generated manifest:
```yaml
apiVersion: datakit.infoblox.dev/v1alpha1
kind: Transform
metadata:
  name: my-first-pipeline
  namespace: default
  version: 0.1.0
  labels:
    team: my-team
spec:
  runtime: generic-python
  mode: batch
  image: "${REGISTRY}/my-first-pipeline:${VERSION}"
  inputs:
    - dataset: my-first-pipeline-input
  outputs:
    - dataset: my-first-pipeline-output
  resources:
    cpu: "1"
    memory: "2Gi"
```
Step 2: Start Local Development¶
Start the local development stack. This deploys four services as Helm charts into a local k3d cluster:
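Per the summary table, the command is:

```bash
dk dev up
```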
- Redpanda — Kafka-compatible streaming (port 19092)
- LocalStack — AWS-compatible S3 (port 4566)
- PostgreSQL — Relational database (port 5432)
- Marquez — Data lineage tracking (ports 5000, 3000)
Each chart includes init jobs that automatically create topics, buckets, database schemas, and lineage namespaces — no manual setup required.
Seed Data
If your input assets declare `dev.seed` data, you can load it now:

```bash
dk dev seed
```

This creates tables and inserts sample rows into the local PostgreSQL. Seed data is also loaded automatically before each `dk run`.
Check the status:
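The status subcommand does not appear in this guide's summary table, so the exact verb below is a guess; adjust to whatever your `dk` version provides:

```bash
# hypothetical subcommand name; may differ in your CLI version
dk dev status
```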
Expected output:
```
Local Development Stack
───────────────────────
Chart       Status    Ports
redpanda    healthy   19092, 18081
localstack  healthy   4566
postgres    healthy   5432
marquez     healthy   5000, 3000

Endpoints:
  Kafka:            localhost:19092
  Schema Registry:  http://localhost:18081
  S3 API:           http://localhost:4566
  PostgreSQL:       localhost:5432
  Marquez API:      http://localhost:5000
  Marquez Web:      http://localhost:3000
```
View Lineage
Open http://localhost:3000 in your browser to see the Marquez lineage UI.
Chart Customization
You can override chart versions or Helm values via the config system:

```bash
dk config set dev.charts.redpanda.version 25.2.0
dk config set dev.charts.postgres.values.primary.resources.limits.memory 1Gi
```
Step 3: Validate Your Package¶
Check your package for errors:
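Per the summary table, validation is done with:

```bash
dk lint
```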
Expected output:
Step 4: Run Locally¶
Execute your pipeline against the local stack:
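Per the summary table, local execution is done with:

```bash
dk run
```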
You'll see output like:
```
▶ Starting pipeline: my-first-pipeline
  → Emitting START lineage event
  → Pulling container image...
  → Running pipeline...
  → Processing 100 messages
  → Writing to s3://local/output/
  → Emitting COMPLETE lineage event
✓ Pipeline completed successfully in 12.3s

Run ID: run-abc123
```
Step 5: Check Lineage¶
View the lineage for your run:
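A sketch of the planned invocation, assuming it takes the run ID printed by `dk run` (the syntax is hypothetical, and as noted below the command is not yet implemented):

```bash
# planned, not yet implemented; argument form hypothetical
dk lineage run-abc123
```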
Not Yet Implemented
The dk lineage command is planned but not yet implemented. For now, use the Marquez UI directly.
Open the Marquez UI to view the lineage graph:
- Marquez Web UI: http://localhost:3000 — Visual lineage graph
- Marquez API: http://localhost:5000 — REST API for querying lineage
You can also check the logs from your run:
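Per the summary table, logs are streamed with `dk logs`; passing the run ID from the earlier output is an assumption about the argument form:

```bash
# run ID argument form assumed
dk logs run-abc123
```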
Step 6: Build the Package¶
Create an OCI artifact from your package:
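Per the summary table, the build is done with:

```bash
dk build
```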
Output:
```
▶ Building package: my-first-pipeline
  → Validating manifest...
  → Bundling files...
  → Creating OCI artifact...
✓ Built: my-first-pipeline:v0.1.0

Artifact: ghcr.io/my-org/my-first-pipeline:v0.1.0
Size: 2.3 MB
```
Step 7: Publish to Registry¶
Push the artifact to your OCI registry:
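Per the summary table, publishing is done with:

```bash
dk publish
```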
Authentication Required
You may need to authenticate with your registry first:

```bash
docker login ghcr.io
```
Step 8: Promote to Environment¶
Deploy to the development environment:
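Per the summary table, promotion is done with `dk promote`; passing the target environment as an argument is an assumption:

```bash
# environment argument form assumed
dk promote dev
```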
This creates a GitOps PR that will be reviewed and merged.
Step 9: Check Status¶
Monitor your package across environments:
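Per the summary table, this is done with:

```bash
dk status
```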
Output:
```
Package: my-first-pipeline

Environment  Version  Status   Last Run
───────────  ───────  ───────  ─────────
dev          v0.1.0   running  2 min ago
int          -        -        -
prod         -        -        -
```
Step 10: Clean Up¶
When you're done, stop the local stack:
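Per the summary table, the stack is torn down with:

```bash
dk dev down
```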
Summary¶
You've completed the full DK workflow:
| Step | Command | What It Does |
|---|---|---|
| 1 | dk init | Create a new data package |
| 2 | dk dev up | Start local infrastructure |
| 2b | dk dev seed | Load sample data into local stores |
| 3 | dk lint | Validate manifests |
| 4 | dk run | Execute locally (auto-seeds if needed) |
| 5 | dk lineage | View data lineage *(not yet implemented — use Marquez UI)* |
| 6 | dk logs | Stream logs from a run |
| 7 | dk build | Create OCI artifact |
| 8 | dk publish | Push to registry |
| 9 | dk promote | Deploy to environment |
| 10 | dk status | Check deployment status |
| 11 | dk dev down | Stop local stack |
Next Steps¶
Now that you understand the basics:
- Concepts - Deep dive into data packages, manifests, and lineage
- Tutorials - Build more complex pipelines
- CLI Reference - Complete command documentation
Congratulations!
You've successfully created, run, and published your first data package!
CloudQuery Model Quickstart¶
This section walks you through creating and running a CloudQuery sync model.
What is a CloudQuery Model?
CloudQuery models use the CloudQuery CLI to sync data between sources and destinations. The dk init --runtime cloudquery command generates a config.yaml file that you run with cloudquery sync.
Prerequisites¶
In addition to the standard prerequisites, you need:
- CloudQuery CLI — Install with `brew install cloudquery/tap/cloudquery` (macOS) or see the CloudQuery docs
Step 1: Create a CloudQuery Model¶
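The `--runtime cloudquery` flag is described in the note above; the package name used here is an example:

```bash
# package name is an example; the --runtime flag selects the CloudQuery scaffold
dk init my-sync --runtime cloudquery
```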
This creates:
- `dk.yaml` — Package manifest with CloudQuery configuration
- `config.yaml` — CloudQuery sync configuration
Step 2: Configure the Sync¶
Edit config.yaml to configure your source and destination:
```yaml
kind: source
spec:
  name: my-source
  registry: cloudquery
  path: cloudquery/postgresql
  tables: ["public.my_table"]
  destinations: ["my-destination"]
  spec:
    connection_string: "${CONNECTION_STRING}"
---
kind: destination
spec:
  name: my-destination
  registry: cloudquery
  path: cloudquery/s3
  spec:
    bucket: "my-data-lake"
    path: "raw/my-sync/{{TABLE}}/{{UUID}}.parquet"
```
Step 3: Start Local Dev Stack¶
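As in the main quickstart, the stack is started with:

```bash
dk dev up
```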
This starts PostgreSQL and LocalStack for local testing.
Step 4: Run the Sync¶
CloudQuery models are run directly with the CloudQuery CLI:
```bash
export CONNECTION_STRING="postgres://postgres:postgres@localhost:5432/postgres"
cloudquery sync config.yaml
```
Why not dk run?
CloudQuery models use configuration files rather than application code. The CloudQuery CLI handles the actual sync execution.