Tutorial: Local Development¶
This tutorial covers the local development stack in detail, including how to use it effectively for pipeline development and testing.
Prerequisites: Complete the Quickstart tutorial.
Time: ~20 minutes
What You'll Learn¶
- Start and configure the development stack
- Use local Kafka for testing
- Work with MinIO (S3-compatible storage)
- View lineage in Marquez
- Debug pipeline issues locally
The Development Stack¶
The dk dev command manages a local k3d cluster with these services:
| Service | Port | Purpose |
|---|---|---|
| Redpanda | 9092 | Kafka-compatible streaming |
| MinIO | 9000 | S3-compatible object storage |
| Marquez | 5000 | Lineage tracking |
| PostgreSQL | 5432 | Marquez database |
Starting the Stack¶
Basic Start¶
This starts all services in the foreground. Press Ctrl+C to stop.
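A minimal sketch — the `up` subcommand name is an assumption based on the `dk dev` commands shown elsewhere in this tutorial:

```shell
dk dev up
```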
Background Start¶
For development sessions, run in background:
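A hedged sketch, assuming the stack supports a detach flag (check `dk dev --help` for the exact spelling):

```shell
dk dev up --detach
```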
With Custom Timeout¶
If services are slow to start:
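A sketch assuming a timeout flag exists; the flag name and units are assumptions:

```shell
# Wait up to 10 minutes for services to become healthy
dk dev up --timeout 600
```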
Checking Status¶
View the status of all services:
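```shell
dk dev status
```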
Example output:
Local Development Stack
━━━━━━━━━━━━━━━━━━━━━━━
Service Status Port Health
─────── ────── ──── ──────
kafka running 9092 healthy
minio running 9000 healthy
marquez running 5000 healthy
postgres running 5432 healthy
Services: 4/4 running
Stack: healthy
Endpoints:
Kafka: localhost:9092
MinIO API: http://localhost:9000
MinIO UI: http://localhost:9001
Marquez: http://localhost:5000
Working with Kafka¶
Accessing Kafka¶
The local Kafka is accessible at localhost:9092.
Creating Topics¶
Topics are auto-created by default, but you can create them explicitly:
docker exec -it dk-kafka kafka-topics \
--bootstrap-server localhost:9092 \
--create \
--topic user-events \
--partitions 3 \
--replication-factor 1
Listing Topics¶
Producing Test Messages¶
echo '{"id": "123", "event": "test"}' | docker exec -i dk-kafka kafka-console-producer \
--bootstrap-server localhost:9092 \
--topic user-events
Consuming Messages¶
docker exec -it dk-kafka kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic user-events \
--from-beginning
Working with MinIO (S3)¶
Accessing MinIO¶
- API: http://localhost:9000
- Console: http://localhost:9001
- Credentials: minioadmin/minioadmin
Using the MinIO Console¶
1. Open http://localhost:9001 in your browser
2. Log in with `minioadmin` / `minioadmin`
3. Browse buckets and objects
Using MinIO CLI (mc)¶
Install the MinIO client:
# macOS
brew install minio/stable/mc
# Linux
wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/
Configure for local stack:
Common operations:
# List buckets
mc ls local
# Create bucket
mc mb local/my-bucket
# Upload file
mc cp myfile.txt local/my-bucket/
# List objects
mc ls local/my-bucket/
# Download file
mc cp local/my-bucket/myfile.txt ./downloaded.txt
Using AWS CLI¶
Configure AWS CLI for MinIO:
# Create profile
aws configure --profile local
# Access Key: minioadmin
# Secret Key: minioadmin
# Region: us-east-1
# Use with endpoint override
aws --profile local --endpoint-url http://localhost:9000 s3 ls
Working with Marquez¶
Viewing Lineage¶
Open the Marquez UI at http://localhost:5000.
The UI shows:
- Jobs: Data packages and their runs
- Datasets: Inputs and outputs
- Lineage Graph: Visual data flow
Using the Marquez API¶
# List namespaces
curl http://localhost:5000/api/v1/namespaces
# List jobs in a namespace
curl http://localhost:5000/api/v1/namespaces/default/jobs
# Get job details
curl http://localhost:5000/api/v1/namespaces/default/jobs/my-pipeline
# List datasets
curl http://localhost:5000/api/v1/namespaces/default/datasets
Lineage Events¶
When you run dk run, OpenLineage events like the following are emitted to Marquez:
{
"eventType": "START",
"eventTime": "2025-01-22T10:00:00.000Z",
"job": {
"namespace": "default",
"name": "my-pipeline"
},
"run": {
"runId": "run-abc123"
},
"inputs": [...],
"outputs": [...]
}
Running Pipelines Locally¶
Basic Run¶
With Local Store Overrides¶
Create environment-specific Store manifests for local development:
apiVersion: datakit.infoblox.dev/v1alpha1
kind: Store
metadata:
name: local-events
spec:
connector: kafka
connection:
bootstrapServers: localhost:9092
secrets:
groupId: my-pipeline-dev
---
apiVersion: datakit.infoblox.dev/v1alpha1
kind: Store
metadata:
name: local-output
spec:
connector: s3
connection:
bucket: test-bucket
endpoint: http://localhost:9000
region: us-east-1
secrets:
accessKeyId: minioadmin
secretAccessKey: minioadmin
With Environment Variables¶
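Connection settings can also be overridden at run time. The variable names below are assumptions for illustration (the `AWS_*` names follow the standard AWS SDK conventions; check your pipeline's configuration reference for the exact ones):

```shell
# Hypothetical variable names — verify against your configuration reference
export KAFKA_BOOTSTRAP_SERVERS=localhost:9092
export AWS_ENDPOINT_URL=http://localhost:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
dk run
```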
Dry Run¶
See what would run without executing:
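A sketch assuming a conventional dry-run flag; the flag name is an assumption:

```shell
dk run --dry-run
```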
Debugging¶
View Container Logs¶
# Check pod logs
kubectl --context k3d-dk-local logs -l app=redpanda
kubectl --context k3d-dk-local logs -l app=marquez
Access Container Shell¶
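A sketch for opening a shell in a service pod. The resource name `deploy/redpanda` is an assumption inferred from the `app=redpanda` label used above; adjust to the actual deployment name:

```shell
kubectl --context k3d-dk-local exec -it deploy/redpanda -- /bin/bash
```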
Check Resource Usage¶
Common Issues¶
Kafka Not Connecting¶
# Check if Kafka is healthy
dk dev status
# Check Redpanda logs
kubectl --context k3d-dk-local logs -l app=redpanda
MinIO Access Denied¶
Marquez Events Missing¶
# Check Marquez health
curl http://localhost:5000/api/v1/namespaces
# Check recent runs
curl http://localhost:5000/api/v1/namespaces/default/jobs
Customizing the Stack¶
Override Configuration¶
Chart versions and Helm values can be overridden via dk config:
dk config set dev.charts.redpanda.version 25.2.0
dk config set dev.charts.postgres.values.primary.resources.limits.memory 1Gi
Stopping the Stack¶
Stop and Keep Data¶
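A sketch assuming a `down` subcommand mirrors `dk dev up`:

```shell
dk dev down
```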
Data persists in persistent volumes.
Stop and Remove Data¶
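A sketch assuming a cleanup flag exists; the flag name is an assumption (check `dk dev --help`):

```shell
dk dev down --clean
```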
Removes all data (topics, objects, lineage).
Best Practices¶
1. Use Local Store Configurations¶
Create environment-specific Store manifests for development:
# Development — uses local store manifests
dk run
# Production builds use environment-specific stores
dk build && dk publish
2. Clean Up Regularly¶
Remove old data when testing:
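For example, using the same tooling shown earlier in this tutorial:

```shell
# Delete a test topic
docker exec -it dk-kafka kafka-topics \
  --bootstrap-server localhost:9092 \
  --delete \
  --topic user-events

# Remove test objects from MinIO
mc rm --recursive --force local/my-bucket/
```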
3. Use Seed Data for Testing¶
Declare sample data directly in your DataSet YAML instead of manually creating tables or loading fixtures:
Then seed the database:
Use named profiles for different test scenarios:
dev:
seed:
inline:
- { id: 1, name: "alice" } # default
profiles:
edge-cases:
inline:
- { id: -1, name: "" }
- { id: 999, name: "O'Reilly" }
empty: {}
# Load edge-case data
dk dev seed --profile edge-cases
# Reset to an empty table
dk dev seed --profile empty --clean
# The default profile is used automatically during dk run
dk run
Seed runs are idempotent — unchanged data is skipped automatically via checksum tracking.
4. Use Lineage for Debugging¶
Check Marquez to verify:
- Pipeline ran successfully
- Correct inputs were consumed
- Outputs were produced
Summary¶
You've learned how to:
- Start and manage the development stack
- Work with local Kafka
- Use MinIO for S3 storage
- Load seed data and switch between profiles
- View lineage in Marquez
- Debug common issues
- Customize the stack
Next Steps¶
- Kafka to S3 - Build a complete pipeline
- Promoting Packages - Deploy to environments
- Troubleshooting - Common issues