Frequently Asked Questions¶
Answers to common questions about DataKit.
General Questions¶
What is DataKit?¶
DataKit is a system for building, publishing, and operating data pipelines with built-in governance, lineage tracking, and GitOps-based deployment. It provides:
- Data Packages: Self-contained, versioned bundles for data pipelines
- Local Development: Docker-based development stack
- Lineage Tracking: Automatic OpenLineage integration
- GitOps Deployment: Environment promotion through pull requests
What problem does DataKit solve?¶
DataKit addresses common challenges in data engineering:
| Challenge | DataKit Solution |
|---|---|
| Pipeline deployment complexity | GitOps-based promotion workflow |
| Lack of data lineage | Automatic OpenLineage events |
| Configuration drift | Immutable OCI artifacts |
| Governance gaps | Built-in classification and policies |
| Development friction | Local Docker stack |
How is DataKit different from Airflow/Dagster/Prefect?¶
DataKit is complementary to orchestrators, not a replacement:
| Aspect | DataKit | Orchestrators |
|---|---|---|
| Focus | Packaging & deployment | Workflow scheduling |
| Unit of work | Data package | Task/DAG |
| Runtime | OCI containers | Python/containers |
| Lineage | Native OpenLineage | Plugin-based |
DataKit packages can be scheduled by any orchestrator.
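For example, any scheduler that can run a shell command can invoke the CLI directly. A minimal sketch using a cron entry (assuming dk run behaves the same non-interactively):

```shell
# Hypothetical cron entry: run the package nightly at 02:00
0 2 * * * dk run ./my-pipeline
```

An orchestrator task (Airflow, Dagster, etc.) would wrap the same command in its own shell/container operator.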
What languages/runtimes are supported?¶
DataKit supports any containerized runtime:
- Python (most common)
- Java/Scala (Spark, Flink)
- Go
- Node.js
- Any language that runs in a container
Development Questions¶
How do I start developing locally?¶
# 1. Install dk CLI
make build
export PATH=$PATH:$(pwd)/bin
# 2. Start local stack
dk dev up
# 3. Create a Transform package
dk init my-pipeline --runtime generic-python
# 4. Run locally
dk run ./my-pipeline
See the Quickstart for details.
What's included in the local development stack?¶
| Service | Purpose | Port |
|---|---|---|
| Kafka | Message streaming | 9092 |
| MinIO | S3-compatible storage | 9000, 9001 |
| Marquez | Lineage tracking | 5000 |
| PostgreSQL | Marquez database | 5432 |
Can I use my own Kafka/S3 locally?¶
Yes! Override connection details in your Store manifests:
apiVersion: datakit.infoblox.dev/v1alpha1
kind: Store
metadata:
  name: my-kafka
spec:
  connector: kafka
  connection:
    bootstrap-servers: my-kafka:9092
Assets reference the Store by name — no additional configuration is needed.
How do I add custom services to the dev stack?¶
Use dk config to customize chart versions and Helm values:
dk config set dev.charts.redpanda.version 25.2.0
dk config set dev.charts.postgres.values.primary.resources.limits.memory 1Gi
How do I persist data between runs?¶
Data is stored in Docker volumes. To reset:
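Using the same commands shown elsewhere in this guide:

```shell
# Stop the local stack and delete its Docker volumes
dk dev down --volumes

# Bring the stack back up with a clean state
dk dev up
```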
Package Questions¶
What files are in a data package?¶
| File/Directory | Purpose | Required |
|---|---|---|
| dk.yaml | Transform manifest (runtime, inputs, outputs, schedule) | Yes |
| connector/ | Connector definitions (technology types) | No |
| store/ | Store definitions (instances with connection details) | No |
| dataset/ | DataSet definitions (data contracts with schema) | No |
| dataset-group/ | DataSetGroup definitions (bundled DataSets) | No |
The dk.yaml file is a Transform manifest that references DataSets by name. DataSets reference Stores, and Stores reference Connectors.
What runtimes are available?¶
| Runtime | Description |
|---|---|
| cloudquery | CloudQuery SDK sync |
| generic-go | Go container |
| generic-python | Python container |
| dbt | dbt transformations |
How do I version packages?¶
Packages use semantic versioning:
# Build with version
dk build --tag v1.0.0
# Increment for changes
v1.0.0 → v1.0.1 # Bug fix
v1.0.0 → v1.1.0 # New feature
v1.0.0 → v2.0.0 # Breaking change
Can I publish private packages?¶
Yes! Push to a private registry:
# Use private registry
dk publish --registry ghcr.io/my-private-org
# Ensure authentication
docker login ghcr.io
Deployment Questions¶
How does promotion work?¶
DataKit uses GitOps for deployment:
- dk promote creates a PR in the GitOps repository
- PR is reviewed and approved
- After merge, ArgoCD syncs to Kubernetes
- Package runs in the target environment
What are the standard environments?¶
| Environment | Approval | Purpose |
|---|---|---|
| dev | Auto-merge | Development |
| int | 1 approval | Integration testing |
| prod | 2 approvals | Production |
Can I skip environments?¶
Not recommended, but possible:
# This will work but triggers a warning
dk promote my-pkg v1.0.0 --to prod
# Warning: Skipping dev and int environments
How do I rollback a deployment?¶
# Rollback to previous version
dk rollback my-pkg --env prod
# Rollback to specific version
dk rollback my-pkg --to v1.0.0 --env prod
How do I know what version is deployed?¶
Lineage Questions¶
What is data lineage?¶
Lineage tracks:
- Where data comes from (upstream)
- Where data goes to (downstream)
- What transformations were applied
- When the pipeline ran
How does DataKit track lineage?¶
DataKit automatically emits OpenLineage events:
- Reads inputs/outputs from dk.yaml
- Emits a START event when the pipeline begins
- Emits a COMPLETE/FAIL event when the pipeline ends
- Events are sent to Marquez (or a configured backend)
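Each event roughly follows the OpenLineage RunEvent shape. The values below are illustrative only (run ID, namespaces, and dataset names are made up):

```json
{
  "eventType": "START",
  "eventTime": "2024-01-01T00:00:00Z",
  "run": { "runId": "d46e465b-d358-4d32-83d4-df660ff614dd" },
  "job": { "namespace": "datakit", "name": "my-pipeline" },
  "inputs": [{ "namespace": "kafka://localhost:9092", "name": "raw-events" }],
  "outputs": [{ "namespace": "s3://warehouse", "name": "clean-events" }],
  "producer": "https://github.com/OpenLineage/OpenLineage"
}
```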
Where can I view lineage?¶
- Local: http://localhost:3000 (Marquez Web UI)
- CLI: dk lineage my-pipeline (not yet implemented)
- Production: your organization's lineage backend
Can I use a different lineage backend?¶
Yes! Configure in ~/.dk/config.yaml:
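A sketch of what that configuration might look like. The key names here are an assumption; verify them against your dk version's configuration reference:

```yaml
# ~/.dk/config.yaml — hypothetical keys, verify against your installation
lineage:
  backend: openlineage
  url: https://lineage.example.com/api/v1   # equivalent of OPENLINEAGE_URL
```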
Why isn't my lineage showing up?¶
Common causes:
- Marquez not running: check dk dev status
- Wrong endpoint: check OPENLINEAGE_URL
- Pipeline never completed successfully
Governance Questions¶
What data classification levels are available?¶
| Level | Description |
|---|---|
| internal | Internal use only |
| confidential | Limited access, may contain PII |
| restricted | Highly sensitive |
How do I mark data as containing PII?¶
Use the classification and pii fields on DataSet schemas:
apiVersion: datakit.infoblox.dev/v1alpha1
kind: DataSet
metadata:
  name: customer-data
spec:
  store: warehouse
  table: public.customers
  classification: confidential
  schema:
    - name: email
      type: string
      pii: true
    - name: name
      type: string
      pii: true
Are classification policies enforced?¶
Yes! dk lint enforces policies:
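For example, running the linter against a package before publishing (a sketch; the exact rule output depends on your policy configuration):

```shell
# Validate classification and policy rules for a package
dk lint ./my-pipeline
```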
How do I view what packages handle PII?¶
Troubleshooting Questions¶
dk command not found¶
Add the binary to your PATH:
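If you built the CLI from source with make build (as in the Quickstart), the binary lands in ./bin:

```shell
# Make the locally built dk binary available on PATH
export PATH=$PATH:$(pwd)/bin
```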
dk dev up fails¶
- Check that Docker is running: docker info
- Check for port conflicts: lsof -i :9092
- Clean up: dk dev down --volumes
Pipeline can't connect to Kafka¶
- Wait for Kafka to be ready: dk dev status
- Check Kafka logs: kubectl --context k3d-dk-local logs -l app=redpanda
- Verify the bootstrap server: localhost:9092
Push to registry fails¶
- Check authentication: docker login ghcr.io
- Check permissions on the registry
- Check network connectivity
Promotion PR not created¶
- Check that GITHUB_TOKEN is set
- Verify the GitOps repository URL in config
- Check network connectivity
More Questions?¶
If your question isn't answered here:
- Check Common Issues
- Search GitHub Issues
- Ask in #datakit-support on Slack
- Open a new issue on GitHub
See Also¶
- Common Issues - Detailed troubleshooting
- CLI Reference - Command documentation
- Quickstart - Getting started guide