Tutorial: Local Development

This tutorial covers the local development stack in detail, including how to use it effectively for pipeline development and testing.

Prerequisites: Complete the Quickstart tutorial.

Time: ~20 minutes

What You'll Learn

  • Start and configure the development stack
  • Use local Kafka for testing
  • Work with MinIO (S3-compatible storage)
  • View lineage in Marquez
  • Debug pipeline issues locally

The Development Stack

The dk dev command manages a local k3d cluster with these services:

Service      Port   Purpose
───────      ────   ───────
Redpanda     9092   Kafka-compatible streaming
MinIO        9000   S3-compatible storage
Marquez      5000   Lineage tracking
PostgreSQL   5432   Marquez database

Starting the Stack

Basic Start

dk dev up

This starts all services in the foreground. Press Ctrl+C to stop.

Background Start

For development sessions, run in background:

dk dev up --detach

With Custom Timeout

If services are slow to start:

dk dev up --timeout 120s

Checking Status

View the status of all services:

dk dev status

Example output:

Local Development Stack
━━━━━━━━━━━━━━━━━━━━━━━

Service     Status    Port     Health
───────     ──────    ────     ──────
kafka       running   9092     healthy
minio       running   9000     healthy
marquez     running   5000     healthy
postgres    running   5432     healthy

Services: 4/4 running
Stack: healthy

Endpoints:
  Kafka:      localhost:9092
  MinIO API:  http://localhost:9000
  MinIO UI:   http://localhost:9001
  Marquez:    http://localhost:5000

Working with Kafka

Accessing Kafka

The local Kafka is accessible at localhost:9092.

Creating Topics

Topics are auto-created by default, but you can create them explicitly:

docker exec -it dk-kafka kafka-topics \
  --bootstrap-server localhost:9092 \
  --create \
  --topic user-events \
  --partitions 3 \
  --replication-factor 1

Listing Topics

docker exec -it dk-kafka kafka-topics \
  --bootstrap-server localhost:9092 \
  --list

Producing Test Messages

echo '{"id": "123", "event": "test"}' | docker exec -i dk-kafka kafka-console-producer \
  --bootstrap-server localhost:9092 \
  --topic user-events

Consuming Messages

docker exec -it dk-kafka kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic user-events \
  --from-beginning
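
The console tools above exchange raw JSON lines. Application code serializes events the same way. The sketch below is pure standard library and needs no broker; the `make_event` helper is hypothetical, and the key-to-partition mapping is a simplification of keyed partitioning (real Kafka clients use murmur2 hashing, not CRC32):

```python
import json
import zlib

NUM_PARTITIONS = 3  # matches the user-events topic created above

def make_event(event_id: str, event: str) -> bytes:
    """Serialize an event to the JSON shape used with the console producer."""
    return json.dumps({"id": event_id, "event": event}).encode("utf-8")

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Simplified keyed partitioning: same key always lands on the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

payload = make_event("123", "test")
print(payload.decode())      # {"id": "123", "event": "test"}
print(partition_for("123"))  # deterministic value in 0..2
```

Keyed partitioning is what preserves per-key ordering, so if your pipeline relies on event order per user, produce with a key rather than letting the producer round-robin.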

Working with MinIO (S3)

Accessing MinIO

Using the MinIO Console

  1. Open http://localhost:9001 in your browser
  2. Login with minioadmin / minioadmin
  3. Browse buckets and objects

Using MinIO CLI (mc)

Install the MinIO client:

# macOS
brew install minio/stable/mc

# Linux
wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/

Configure for local stack:

mc alias set local http://localhost:9000 minioadmin minioadmin

Common operations:

# List buckets
mc ls local

# Create bucket
mc mb local/my-bucket

# Upload file
mc cp myfile.txt local/my-bucket/

# List objects
mc ls local/my-bucket/

# Download file
mc cp local/my-bucket/myfile.txt ./downloaded.txt

Using AWS CLI

Configure AWS CLI for MinIO:

# Create profile
aws configure --profile local
# Access Key: minioadmin
# Secret Key: minioadmin
# Region: us-east-1

# Use with endpoint override
aws --profile local --endpoint-url http://localhost:9000 s3 ls
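
Both mc and the AWS CLI address MinIO with path-style URLs, where the bucket is the first path segment rather than part of the hostname. A minimal sketch of that URL shape against the local endpoint; the `object_url` helper is hypothetical:

```python
from urllib.parse import quote

ENDPOINT = "http://localhost:9000"  # MinIO API endpoint from dk dev status

def object_url(bucket: str, key: str, endpoint: str = ENDPOINT) -> str:
    """Path-style S3 URL: endpoint, then bucket, then object key."""
    return f"{endpoint}/{quote(bucket)}/{quote(key)}"

print(object_url("my-bucket", "myfile.txt"))
# http://localhost:9000/my-bucket/myfile.txt
```

This is why the `--endpoint-url` override is required: without it, the AWS CLI would try to resolve a real AWS hostname instead of localhost.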

Working with Marquez

Viewing Lineage

Open the Marquez UI at http://localhost:5000.

The UI shows:

  • Jobs: Data packages and their runs
  • Datasets: Inputs and outputs
  • Lineage Graph: Visual data flow

Using the Marquez API

# List namespaces
curl http://localhost:5000/api/v1/namespaces

# List jobs in a namespace
curl http://localhost:5000/api/v1/namespaces/default/jobs

# Get job details
curl http://localhost:5000/api/v1/namespaces/default/jobs/my-pipeline

# List datasets
curl http://localhost:5000/api/v1/namespaces/default/datasets
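
The same queries can be scripted with the standard library. The response shape assumed below (a top-level `namespaces` list of objects with a `name` field) matches typical Marquez responses, but verify it against your version:

```python
import json
from urllib.request import urlopen

MARQUEZ = "http://localhost:5000/api/v1"

def parse_namespaces(body: dict) -> list[str]:
    # Assumed response shape: {"namespaces": [{"name": "default", ...}, ...]}
    return [ns["name"] for ns in body.get("namespaces", [])]

def list_namespaces(base: str = MARQUEZ) -> list[str]:
    """Fetch namespace names from a running Marquez (requires the stack up)."""
    with urlopen(f"{base}/namespaces") as resp:
        return parse_namespaces(json.load(resp))

sample = {"namespaces": [{"name": "default"}, {"name": "food_delivery"}]}
print(parse_namespaces(sample))  # ['default', 'food_delivery']
```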

Lineage Events

When you run dk run, lineage events are emitted to Marquez. A START event looks like this:

{
  "eventType": "START",
  "eventTime": "2025-01-22T10:00:00.000Z",
  "job": {
    "namespace": "default",
    "name": "my-pipeline"
  },
  "run": {
    "runId": "run-abc123"
  },
  "inputs": [...],
  "outputs": [...]
}
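
Each run emits a START event like the one above and a matching COMPLETE (or FAIL) event carrying the same runId, which is how Marquez ties the pair into a single run. A sketch of that pairing, with inputs/outputs omitted as in the example; the builder function is hypothetical, not dk's actual emitter:

```python
import json
from datetime import datetime, timezone

def lineage_event(event_type: str, job: str, run_id: str,
                  namespace: str = "default") -> dict:
    """Build a minimal OpenLineage-style event (inputs/outputs omitted)."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "job": {"namespace": namespace, "name": job},
        "run": {"runId": run_id},
    }

start = lineage_event("START", "my-pipeline", "run-abc123")
complete = lineage_event("COMPLETE", "my-pipeline", "run-abc123")
# The shared runId is what joins the two events into one run in Marquez
assert start["run"]["runId"] == complete["run"]["runId"]
print(json.dumps(start, indent=2))
```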

Running Pipelines Locally

Basic Run

dk run ./my-pipeline

With Local Store Overrides

Create environment-specific Store manifests for local development:

store/local-events.yaml
apiVersion: datakit.infoblox.dev/v1alpha1
kind: Store
metadata:
  name: local-events
spec:
  connector: kafka
  connection:
    bootstrapServers: localhost:9092
  secrets:
    groupId: my-pipeline-dev

store/local-output.yaml
apiVersion: datakit.infoblox.dev/v1alpha1
kind: Store
metadata:
  name: local-output
spec:
  connector: s3
  connection:
    bucket: test-bucket
    endpoint: http://localhost:9000
    region: us-east-1
  secrets:
    accessKeyId: minioadmin
    secretAccessKey: minioadmin

With these manifests in place, run the pipeline as usual:

dk run ./my-pipeline

With Environment Variables

dk run ./my-pipeline \
  --env DEBUG=true \
  --env BATCH_SIZE=100
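
Inside the pipeline, these variables arrive as plain strings in the process environment. A sketch of reading them with defensive defaults; the variable names come from the command above, and the `env_flag` helper is hypothetical:

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret common truthy strings ('1', 'true', 'yes', 'on')."""
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() in {"1", "true", "yes", "on"}

# Simulate what dk run --env would inject:
os.environ["DEBUG"] = "true"
os.environ["BATCH_SIZE"] = "100"

debug = env_flag("DEBUG")
batch_size = int(os.environ.get("BATCH_SIZE", "500"))
print(debug, batch_size)  # True 100
```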

Dry Run

See what would run without executing:

dk run ./my-pipeline --dry-run

Debugging

View Container Logs

# Check pod logs
kubectl --context k3d-dk-local logs -l app=redpanda
kubectl --context k3d-dk-local logs -l app=marquez

Access Container Shell

docker exec -it dk-kafka /bin/bash

Check Resource Usage

docker stats

Common Issues

Kafka Not Connecting

# Check if Kafka is healthy
dk dev status

# Check Redpanda logs
kubectl --context k3d-dk-local logs -l app=redpanda

MinIO Access Denied

# Verify credentials
mc admin info local

# Check bucket policy
mc anonymous get local/my-bucket

Marquez Events Missing

# Check Marquez health
curl http://localhost:5000/api/v1/namespaces

# Check recent runs
curl http://localhost:5000/api/v1/namespaces/default/jobs

Customizing the Stack

Override Configuration

Chart versions and Helm values can be overridden via dk config:

dk config set dev.charts.redpanda.version 25.2.0
dk config set dev.charts.postgres.values.primary.resources.limits.memory 1Gi

Stopping the Stack

Stop and Keep Data

dk dev down

Topic, object, and lineage data is retained in the cluster's persistent volumes.

Stop and Remove Data

dk dev down --volumes

Removes all data (topics, objects, lineage).

Best Practices

1. Use Local Store Configurations

Create environment-specific Store manifests for development:

# Development — uses local store manifests
dk run

# Production builds use environment-specific stores
dk build && dk publish

2. Clean Up Regularly

Remove old data when testing:

dk dev down --volumes
dk dev up

3. Use Seed Data for Testing

Declare sample data directly in your DataSet YAML instead of manually creating tables or loading fixtures:

dataset/source.yaml
spec:
  dev:
    seed:
      inline:
        - { id: 1, name: "alice" }
        - { id: 2, name: "bob" }

Then seed the database:

dk dev seed

Use named profiles for different test scenarios:

dev:
  seed:
    inline:
      - { id: 1, name: "alice" }      # default
    profiles:
      edge-cases:
        inline:
          - { id: -1, name: "" }
          - { id: 999, name: "O'Reilly" }
      empty: {}

# Load edge-case data
dk dev seed --profile edge-cases

# Reset to an empty table
dk dev seed --profile empty --clean

# The default profile is used automatically during dk run
dk run

Seed runs are idempotent — unchanged data is skipped automatically via checksum tracking.
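
The checksum tracking can be pictured as hashing the canonical seed rows and skipping the load whenever the hash matches the last applied one. A hypothetical sketch of that idea, not dk's actual implementation:

```python
import hashlib
import json

def seed_checksum(rows: list[dict]) -> str:
    """Stable hash of seed data: canonical, key-sorted JSON."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def seed(rows: list[dict], state: dict) -> bool:
    """Load rows unless their checksum matches the previously applied one."""
    checksum = seed_checksum(rows)
    if state.get("last_applied") == checksum:
        return False  # unchanged -> skipped
    # ... the actual load into the local store would happen here ...
    state["last_applied"] = checksum
    return True

rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
state: dict = {}
print(seed(rows, state))  # True  (first load applies)
print(seed(rows, state))  # False (unchanged, skipped)
```

Canonicalizing before hashing (sorted keys, fixed separators) is what makes the check stable across reordered dict keys, so only genuine data changes trigger a reload.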

4. Use Lineage for Debugging

Check Marquez to verify:

  • Pipeline ran successfully
  • Correct inputs were consumed
  • Outputs were produced

Summary

You've learned how to:

  • Start and manage the development stack
  • Work with local Kafka
  • Use MinIO for S3 storage
  • Load seed data and switch between profiles
  • View lineage in Marquez
  • Debug common issues
  • Customize the stack

Next Steps