Tutorial: Local Development

This tutorial covers the local development stack in detail, including how to use it effectively for pipeline development and testing.

Prerequisites: Complete the Quickstart tutorial.

Time: ~20 minutes

What You'll Learn

  • Start and configure the development stack
  • Use local Kafka for testing
  • Work with MinIO (S3-compatible storage)
  • View lineage in Marquez
  • Debug pipeline issues locally

The Development Stack

The dk dev command manages a local k3d cluster with these services:

Service      Port   Purpose
───────      ────   ───────
Redpanda     9092   Kafka-compatible streaming
MinIO        9000   S3-compatible storage
Marquez      5000   Lineage tracking
PostgreSQL   5432   Marquez database

Starting the Stack

Basic Start

dk dev up

This starts all services in the foreground. Press Ctrl+C to stop.

Background Start

For development sessions, run in background:

dk dev up --detach

With Custom Timeout

If services are slow to start:

dk dev up --timeout 120s

Checking Status

View the status of all services:

dk dev status

Example output:

Local Development Stack
━━━━━━━━━━━━━━━━━━━━━━━

Service     Status    Port     Health
───────     ──────    ────     ──────
kafka       running   9092     healthy
minio       running   9000     healthy
marquez     running   5000     healthy
postgres    running   5432     healthy

Services: 4/4 running
Stack: healthy

Endpoints:
  Kafka:      localhost:9092
  MinIO API:  http://localhost:9000
  MinIO UI:   http://localhost:9001
  Marquez:    http://localhost:5000

Working with Kafka

Accessing Kafka

The local Kafka is accessible at localhost:9092.

Creating Topics

Topics are auto-created by default, but you can create them explicitly:

docker exec -it dk-kafka kafka-topics \
  --bootstrap-server localhost:9092 \
  --create \
  --topic user-events \
  --partitions 3 \
  --replication-factor 1

Listing Topics

docker exec -it dk-kafka kafka-topics \
  --bootstrap-server localhost:9092 \
  --list

Producing Test Messages

echo '{"id": "123", "event": "test"}' | docker exec -i dk-kafka kafka-console-producer \
  --bootstrap-server localhost:9092 \
  --topic user-events

Consuming Messages

docker exec -it dk-kafka kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic user-events \
  --from-beginning
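
The console tools above exchange raw JSON lines. Application code serializes events the same way. The sketch below is pure standard library and needs no broker; the `make_event` helper is hypothetical, and the key-to-partition mapping is a simplification of keyed partitioning (real Kafka clients use murmur2 hashing, not CRC32):

```python
import json
import zlib

NUM_PARTITIONS = 3  # matches the user-events topic created above

def make_event(event_id: str, event: str) -> bytes:
    """Serialize an event to the JSON shape used with the console producer."""
    return json.dumps({"id": event_id, "event": event}).encode("utf-8")

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Simplified keyed partitioning: same key always lands on the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

payload = make_event("123", "test")
print(payload.decode())      # {"id": "123", "event": "test"}
print(partition_for("123"))  # deterministic value in 0..2
```

Keyed partitioning is what preserves per-key ordering, so if your pipeline relies on event order per user, produce with a key rather than letting the producer round-robin.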

Working with MinIO (S3)

Accessing MinIO

Using the MinIO Console

  1. Open http://localhost:9001 in your browser
  2. Login with minioadmin / minioadmin
  3. Browse buckets and objects

Using MinIO CLI (mc)

Install the MinIO client:

# macOS
brew install minio/stable/mc

# Linux
wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/

Configure for local stack:

mc alias set local http://localhost:9000 minioadmin minioadmin

Common operations:

# List buckets
mc ls local

# Create bucket
mc mb local/my-bucket

# Upload file
mc cp myfile.txt local/my-bucket/

# List objects
mc ls local/my-bucket/

# Download file
mc cp local/my-bucket/myfile.txt ./downloaded.txt

Using AWS CLI

Configure AWS CLI for MinIO:

# Create profile
aws configure --profile local
# Access Key: minioadmin
# Secret Key: minioadmin
# Region: us-east-1

# Use with endpoint override
aws --profile local --endpoint-url http://localhost:9000 s3 ls
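
Both mc and the AWS CLI address MinIO with path-style URLs, where the bucket is the first path segment rather than part of the hostname. A minimal sketch of that URL shape against the local endpoint; the `object_url` helper is hypothetical:

```python
from urllib.parse import quote

ENDPOINT = "http://localhost:9000"  # MinIO API endpoint from dk dev status

def object_url(bucket: str, key: str, endpoint: str = ENDPOINT) -> str:
    """Path-style S3 URL: endpoint, then bucket, then object key."""
    return f"{endpoint}/{quote(bucket)}/{quote(key)}"

print(object_url("my-bucket", "myfile.txt"))
# http://localhost:9000/my-bucket/myfile.txt
```

This is why the `--endpoint-url` override is required: without it, the AWS CLI would try to resolve a real AWS hostname instead of localhost.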

Working with Marquez

Viewing Lineage

Open the Marquez UI at http://localhost:5000.

The UI shows:

  • Jobs: Data packages and their runs
  • Datasets: Inputs and outputs
  • Lineage Graph: Visual data flow

Using the Marquez API

# List namespaces
curl http://localhost:5000/api/v1/namespaces

# List jobs in a namespace
curl http://localhost:5000/api/v1/namespaces/default/jobs

# Get job details
curl http://localhost:5000/api/v1/namespaces/default/jobs/my-pipeline

# List datasets
curl http://localhost:5000/api/v1/namespaces/default/datasets
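
The same queries can be scripted with the standard library. The response shape assumed below (a top-level `namespaces` list of objects with a `name` field) matches typical Marquez responses, but verify it against your version:

```python
import json
from urllib.request import urlopen

MARQUEZ = "http://localhost:5000/api/v1"

def parse_namespaces(body: dict) -> list[str]:
    # Assumed response shape: {"namespaces": [{"name": "default", ...}, ...]}
    return [ns["name"] for ns in body.get("namespaces", [])]

def list_namespaces(base: str = MARQUEZ) -> list[str]:
    """Fetch namespace names from a running Marquez (requires the stack up)."""
    with urlopen(f"{base}/namespaces") as resp:
        return parse_namespaces(json.load(resp))

sample = {"namespaces": [{"name": "default"}, {"name": "food_delivery"}]}
print(parse_namespaces(sample))  # ['default', 'food_delivery']
```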

Lineage Events

When you run dk run, lineage events are emitted to Marquez. A START event looks like this:

{
  "eventType": "START",
  "eventTime": "2025-01-22T10:00:00.000Z",
  "job": {
    "namespace": "default",
    "name": "my-pipeline"
  },
  "run": {
    "runId": "run-abc123"
  },
  "inputs": [...],
  "outputs": [...]
}
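
Each run emits a START event like the one above and a matching COMPLETE (or FAIL) event carrying the same runId, which is how Marquez ties the pair into a single run. A sketch of that pairing, with inputs/outputs omitted as in the example; the builder function is hypothetical, not dk's actual emitter:

```python
import json
from datetime import datetime, timezone

def lineage_event(event_type: str, job: str, run_id: str,
                  namespace: str = "default") -> dict:
    """Build a minimal OpenLineage-style event (inputs/outputs omitted)."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "job": {"namespace": namespace, "name": job},
        "run": {"runId": run_id},
    }

start = lineage_event("START", "my-pipeline", "run-abc123")
complete = lineage_event("COMPLETE", "my-pipeline", "run-abc123")
# The shared runId is what joins the two events into one run in Marquez
assert start["run"]["runId"] == complete["run"]["runId"]
print(json.dumps(start, indent=2))
```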

Running Pipelines Locally

Basic Run

dk run ./my-pipeline

With Local Store Overrides

Create environment-specific Store manifests for local development:

store/local-events.yaml
apiVersion: datakit.infoblox.dev/v1alpha1
kind: Store
metadata:
  name: local-events
spec:
  connector: kafka
  connection:
    bootstrapServers: localhost:9092
  secrets:
    groupId: my-pipeline-dev

store/local-output.yaml
apiVersion: datakit.infoblox.dev/v1alpha1
kind: Store
metadata:
  name: local-output
spec:
  connector: s3
  connection:
    bucket: test-bucket
    endpoint: http://localhost:9000
    region: us-east-1
  secrets:
    accessKeyId: minioadmin
    secretAccessKey: minioadmin

With these manifests in place, run the pipeline as usual:

dk run ./my-pipeline

With Environment Variables

dk run ./my-pipeline \
  --env DEBUG=true \
  --env BATCH_SIZE=100
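
Inside the pipeline, these variables arrive as plain strings in the process environment. A sketch of reading them with defensive defaults; the variable names come from the command above, and the `env_flag` helper is hypothetical:

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret common truthy strings ('1', 'true', 'yes', 'on')."""
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() in {"1", "true", "yes", "on"}

# Simulate what dk run --env would inject:
os.environ["DEBUG"] = "true"
os.environ["BATCH_SIZE"] = "100"

debug = env_flag("DEBUG")
batch_size = int(os.environ.get("BATCH_SIZE", "500"))
print(debug, batch_size)  # True 100
```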

Dry Run

See what would run without executing:

dk run ./my-pipeline --dry-run

Debugging

View Container Logs

# Check pod logs
kubectl --context k3d-dk-local logs -l app=redpanda
kubectl --context k3d-dk-local logs -l app=marquez

Access Container Shell

docker exec -it dk-kafka /bin/bash

Check Resource Usage

docker stats

Common Issues

Kafka Not Connecting

# Check if Kafka is healthy
dk dev status

# Check Redpanda logs
kubectl --context k3d-dk-local logs -l app=redpanda

MinIO Access Denied

# Verify credentials
mc admin info local

# Check bucket policy
mc anonymous get local/my-bucket

Marquez Events Missing

# Check Marquez health
curl http://localhost:5000/api/v1/namespaces

# Check recent runs
curl http://localhost:5000/api/v1/namespaces/default/jobs

Customizing the Stack

Override Configuration

Chart versions and Helm values can be overridden via dk config:

dk config set dev.charts.redpanda.version 25.2.0
dk config set dev.charts.postgres.values.primary.resources.limits.memory 1Gi

Stopping the Stack

Stop and Keep Data

dk dev down

Topic, object, and lineage data is retained in the cluster's persistent volumes.

Stop and Remove Data

dk dev down --volumes

Removes all data (topics, objects, lineage).

Best Practices

1. Use Local Store Configurations

Create environment-specific Store manifests for development:

# Development — uses local store manifests
dk run

# Production builds use environment-specific stores
dk build && dk publish

2. Clean Up Regularly

Remove old data when testing:

dk dev down --volumes
dk dev up

3. Use Seed Data for Testing

Declare sample data directly in your DataSet YAML instead of manually creating tables or loading fixtures:

dataset/source.yaml
spec:
  dev:
    seed:
      inline:
        - { id: 1, name: "alice" }
        - { id: 2, name: "bob" }

Then seed the database:

dk dev seed

Use named profiles for different test scenarios:

dev:
  seed:
    inline:
      - { id: 1, name: "alice" }      # default
    profiles:
      edge-cases:
        inline:
          - { id: -1, name: "" }
          - { id: 999, name: "O'Reilly" }
      empty: {}

# Load edge-case data
dk dev seed --profile edge-cases

# Reset to an empty table
dk dev seed --profile empty --clean

# The default profile is used automatically during dk run
dk run

Seed runs are idempotent — unchanged data is skipped automatically via checksum tracking.
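
The checksum tracking can be pictured as hashing the canonical seed rows and skipping the load whenever the hash matches the last applied one. A hypothetical sketch of that idea, not dk's actual implementation:

```python
import hashlib
import json

def seed_checksum(rows: list[dict]) -> str:
    """Stable hash of seed data: canonical, key-sorted JSON."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def seed(rows: list[dict], state: dict) -> bool:
    """Load rows unless their checksum matches the previously applied one."""
    checksum = seed_checksum(rows)
    if state.get("last_applied") == checksum:
        return False  # unchanged -> skipped
    # ... the actual load into the local store would happen here ...
    state["last_applied"] = checksum
    return True

rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
state: dict = {}
print(seed(rows, state))  # True  (first load applies)
print(seed(rows, state))  # False (unchanged, skipped)
```

Canonicalizing before hashing (sorted keys, fixed separators) is what makes the check stable across reordered dict keys, so only genuine data changes trigger a reload.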

4. Use Lineage for Debugging

Check Marquez to verify:

  • Pipeline ran successfully
  • Correct inputs were consumed
  • Outputs were produced

Summary

You've learned how to:

  • Start and manage the development stack
  • Work with local Kafka
  • Use MinIO for S3 storage
  • Load seed data and switch between profiles
  • View lineage in Marquez
  • Debug common issues
  • Customize the stack

Next Steps