DataKit Architecture¶
This document describes the high-level architecture of DataKit.
Overview¶
DataKit is a Kubernetes-native data pipeline platform that enables teams to contribute reusable, versioned "data packages" with a complete developer workflow.
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              DataKit Architecture                               │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│  ┌──────────────────┐                                                           │
│  │    Developer     │                                                           │
│  └────────┬─────────┘                                                           │
│           │                                                                     │
│           ▼                                                                     │
│  ┌──────────────────────────────────────────────────────────────────────────┐   │
│  │                                  DK CLI                                  │   │
│  │  ┌──────┐ ┌─────┐ ┌─────┐ ┌──────┐ ┌───────┐ ┌─────────┐ ┌─────────┐     │   │
│  │  │ init │ │ dev │ │ run │ │ lint │ │ build │ │ publish │ │ promote │     │   │
│  │  └──────┘ └─────┘ └─────┘ └──────┘ └───────┘ └─────────┘ └─────────┘     │   │
│  └──────────────────────────────────────────────────────────────────────────┘   │
│           │                           │                           │             │
│           │                           │                           │             │
│           ▼                           ▼                           ▼             │
│  ┌──────────────────┐        ┌──────────────────┐        ┌──────────────────┐   │
│  │       SDK        │        │   OCI Registry   │        │      GitOps      │   │
│  │  • Validate      │        │  • Store Pkgs    │        │  • Kustomize     │   │
│  │  • Lineage       │        │  • Immutability  │        │  • ArgoCD        │   │
│  │  • Registry      │        │  • Versioning    │        │  • Environments  │   │
│  │  • Runner        │        │                  │        │                  │   │
│  │  • Catalog       │        │                  │        │                  │   │
│  └──────────────────┘        └──────────────────┘        └──────────────────┘   │
│                                       │                                         │
│                                       ▼                                         │
│                  ┌──────────────────────────────────────────┐                   │
│                  │            Kubernetes Cluster            │                   │
│                  │  ┌────────────────────────────────────┐  │                   │
│                  │  │        Platform Controller         │  │                   │
│                  │  │  • PackageDeployment CRD           │  │                   │
│                  │  │  • Pull OCI Artifacts              │  │                   │
│                  │  │  • Create Jobs                     │  │                   │
│                  │  │  • Emit Metrics                    │  │                   │
│                  │  └────────────────────────────────────┘  │                   │
│                  └──────────────────────────────────────────┘                   │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘
Components¶
1. CLI (cli/)¶
The command-line interface for interacting with the platform.
Responsibilities:
- Package scaffolding (dk init)
- Local development (dk dev, dk run)
- Validation (dk lint, dk test)
- Publishing (dk build, dk publish)
- Promotion (dk promote)
- Observability (dk status, dk logs)
Technology: Go, Cobra
2. SDK (sdk/)¶
Core libraries used by the CLI and controller.
2.1 Validate (sdk/validate/)¶
- Manifest validation (dk.yaml, connector, store, DataSet manifests)
- PII classification validation
- Schema validation
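As a rough illustration of the checks listed above, the sketch below validates a simplified manifest and requires a PII classification on every output. The `Manifest` and `Output` types, their field names, and the error messages are assumptions made for this example; they are not the actual `sdk/validate` API.

```go
package main

import (
	"errors"
	"fmt"
)

// Output is a hypothetical, simplified view of one package output.
type Output struct {
	Name           string
	Classification string // PII classification label; free-form in this sketch
}

// Manifest is a hypothetical stand-in for a parsed dk.yaml.
type Manifest struct {
	APIVersion string
	Kind       string
	Name       string
	Outputs    []Output
}

// Validate reports every problem it finds rather than stopping at the first,
// so a lint-style command can show them all in one pass.
func Validate(m Manifest) error {
	var errs []error
	if m.APIVersion == "" {
		errs = append(errs, errors.New("apiVersion is required"))
	}
	if m.Kind == "" {
		errs = append(errs, errors.New("kind is required"))
	}
	if m.Name == "" {
		errs = append(errs, errors.New("metadata.name is required"))
	}
	for i, o := range m.Outputs {
		if o.Classification == "" {
			errs = append(errs, fmt.Errorf("outputs[%d] %q: PII classification is required", i, o.Name))
		}
	}
	// errors.Join returns nil when errs is empty.
	return errors.Join(errs...)
}

func main() {
	// "v1" and "Transform" are placeholder values for the example.
	m := Manifest{APIVersion: "v1", Kind: "Transform", Outputs: []Output{{Name: "orders"}}}
	if err := Validate(m); err != nil {
		fmt.Println(err)
	}
}
```

Collecting errors with `errors.Join` (Go 1.20 or later) keeps the validator usable from both a lint command and a build step, since callers receive either `nil` or one error that lists every violation.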
2.2 Lineage (sdk/lineage/)¶
- OpenLineage event types
- Marquez emitter implementation
- Event builder pattern
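The event builder pattern mentioned above could look roughly like the following. This is a minimal sketch: the JSON field names follow the OpenLineage run event (`eventType`, `eventTime`, `run.runId`, `job.namespace`, `job.name`, `producer`), but only a subset of the event is modelled, and the Go type names, the builder methods, and the producer URI are assumptions rather than the `sdk/lineage` API.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// RunEvent models a subset of an OpenLineage run event.
type RunEvent struct {
	EventType string    `json:"eventType"`
	EventTime time.Time `json:"eventTime"`
	Run       Run       `json:"run"`
	Job       Job       `json:"job"`
	Producer  string    `json:"producer"`
}

type Run struct {
	RunID string `json:"runId"` // OpenLineage expects a UUID here
}

type Job struct {
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
}

// EventBuilder accumulates fields and validates them once, in Build.
type EventBuilder struct {
	ev RunEvent
}

func NewEvent(eventType string) *EventBuilder {
	return &EventBuilder{ev: RunEvent{EventType: eventType}}
}

func (b *EventBuilder) WithJob(namespace, name string) *EventBuilder {
	b.ev.Job = Job{Namespace: namespace, Name: name}
	return b
}

func (b *EventBuilder) WithRun(id string) *EventBuilder {
	b.ev.Run = Run{RunID: id}
	return b
}

func (b *EventBuilder) WithProducer(uri string) *EventBuilder {
	b.ev.Producer = uri
	return b
}

// Build checks required fields and stamps the event time if none was set.
func (b *EventBuilder) Build() (RunEvent, error) {
	if b.ev.Job.Namespace == "" || b.ev.Job.Name == "" {
		return RunEvent{}, errors.New("job namespace and name are required")
	}
	if b.ev.Run.RunID == "" {
		return RunEvent{}, errors.New("run id is required")
	}
	if b.ev.EventTime.IsZero() {
		b.ev.EventTime = time.Now().UTC()
	}
	return b.ev, nil
}

func main() {
	ev, err := NewEvent("START").
		WithJob("dev", "orders-transform").
		WithRun("3f5e83fa-3480-44ff-99c5-ff943904e5e8").
		WithProducer("https://example.com/datakit"). // hypothetical producer URI
		Build()
	if err != nil {
		panic(err)
	}
	out, _ := json.MarshalIndent(ev, "", "  ")
	fmt.Println(string(out))
}
```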
2.3 Registry (sdk/registry/)¶
- OCI artifact management using ORAS
- Bundler for creating artifacts
- Client for push/pull operations
2.4 Runner (sdk/runner/)¶
- Local execution via Docker
- Lineage emission integration
- Run tracking
2.5 Catalog (sdk/catalog/)¶
- Data catalog record types
- Marquez integration
- Metadata management
3. Contracts (contracts/)¶
Shared types and schemas for the five manifest kinds.
// Transform is a unit of computation that reads/writes Assets.
type Transform struct {
	APIVersion string
	Kind       string
	Metadata   TransformMetadata
	Spec       TransformSpec
}

// DataSetRef is a reference to a named DataSet.
type DataSetRef struct {
	DataSet string            // DataSet name (mutually exclusive with Tags)
	Tags    map[string]string // Match DataSets by labels
	Version string            // Semver range constraint
	Cell    string            // Cell qualifier
}

// DataSetManifest represents a data contract in a Store.
type DataSetManifest struct {
	APIVersion string
	Kind       string
	Metadata   DataSetMetadata
	Spec       DataSetSpec
}
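The comment on `DataSetRef` states that `DataSet` and `Tags` are mutually exclusive. One way to enforce that is a small validation method, sketched below. The `Validate` method and its error messages are hypothetical, and the rule that at least one of the two selectors must be set is an assumption of this sketch, not something the contract above specifies.

```go
package main

import (
	"errors"
	"fmt"
)

// DataSetRef mirrors the struct shown above.
type DataSetRef struct {
	DataSet string            // DataSet name (mutually exclusive with Tags)
	Tags    map[string]string // Match DataSets by labels
	Version string            // Semver range constraint
	Cell    string            // Cell qualifier
}

// Validate is a hypothetical helper that enforces the selector rules.
func (r DataSetRef) Validate() error {
	hasName := r.DataSet != ""
	hasTags := len(r.Tags) > 0
	switch {
	case hasName && hasTags:
		return errors.New("dataset and tags are mutually exclusive")
	case !hasName && !hasTags:
		// Assumption: a reference that selects nothing is treated as invalid.
		return errors.New("one of dataset or tags is required")
	}
	return nil
}

func main() {
	byName := DataSetRef{DataSet: "orders", Version: ">=1.0.0"}
	both := DataSetRef{DataSet: "orders", Tags: map[string]string{"domain": "sales"}}
	fmt.Println(byName.Validate())
	fmt.Println(both.Validate())
}
```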
4. Platform Controller (platform/controller/)¶
Kubernetes controller for managing data packages.
CRDs:
- PackageDeployment: Represents a deployed data package
Reconciliation Loop:
1. Watch for PackageDeployment changes
2. Pull OCI artifact from registry
3. Extract pipeline configuration
4. Create/update Kubernetes Jobs
5. Monitor execution and emit metrics
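The reconciliation steps can be sketched with plain interfaces and in-memory fakes. This is an illustration only: it does not use a Kubernetes client or controller framework, and every type, interface, and result label below (`ArtifactPuller`, `JobApplier`, `"success"`, `"error"`) is an assumption rather than the controller's real code. The watch step is omitted; `Reconcile` stands for what runs on each change.

```go
package main

import (
	"errors"
	"fmt"
)

// PackageDeployment is a simplified stand-in for the CRD.
type PackageDeployment struct {
	Name     string
	Artifact string // OCI reference, ideally pinned by digest
}

// JobSpec is the pipeline configuration extracted from an artifact.
type JobSpec struct {
	Name  string
	Image string
}

// ArtifactPuller pulls an artifact and extracts its configuration (steps 2 and 3).
type ArtifactPuller interface {
	Pull(ref string) (JobSpec, error)
}

// JobApplier creates or updates the Kubernetes Job (step 4).
type JobApplier interface {
	Apply(job JobSpec) error
}

type Controller struct {
	Registry ArtifactPuller
	Jobs     JobApplier
	// Results counts outcomes by label, standing in for a reconcile counter (step 5).
	Results map[string]int
}

// Reconcile handles one PackageDeployment change and records the outcome.
func (c *Controller) Reconcile(pd PackageDeployment) error {
	err := c.reconcile(pd)
	if err != nil {
		c.Results["error"]++
	} else {
		c.Results["success"]++
	}
	return err
}

func (c *Controller) reconcile(pd PackageDeployment) error {
	spec, err := c.Registry.Pull(pd.Artifact)
	if err != nil {
		return fmt.Errorf("pull %s: %w", pd.Artifact, err)
	}
	spec.Name = pd.Name
	if err := c.Jobs.Apply(spec); err != nil {
		return fmt.Errorf("apply job %s: %w", spec.Name, err)
	}
	return nil
}

// In-memory fakes so the sketch runs without a cluster or a registry.
type fakeRegistry map[string]JobSpec

func (r fakeRegistry) Pull(ref string) (JobSpec, error) {
	spec, ok := r[ref]
	if !ok {
		return JobSpec{}, errors.New("artifact not found")
	}
	return spec, nil
}

type fakeJobs struct{ applied []JobSpec }

func (j *fakeJobs) Apply(job JobSpec) error {
	j.applied = append(j.applied, job)
	return nil
}

func main() {
	ref := "registry.example/orders@sha256:abc" // placeholder reference
	jobs := &fakeJobs{}
	c := &Controller{
		Registry: fakeRegistry{ref: {Image: "registry.example/runner:1"}},
		Jobs:     jobs,
		Results:  map[string]int{},
	}
	err := c.Reconcile(PackageDeployment{Name: "orders", Artifact: ref})
	fmt.Println(err, len(jobs.applied), c.Results)
}
```

Keeping the registry and the Job API behind interfaces lets the reconcile logic be unit tested without a cluster, which is why the sketch ships its own fakes.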
5. GitOps (gitops/)¶
Environment definitions using Kustomize.
gitops/
├── base/
│ ├── kustomization.yaml
│ └── crds/
├── environments/
│ ├── dev/
│ │ └── kustomization.yaml
│ ├── int/
│ │ └── kustomization.yaml
│ └── prod/
│ └── kustomization.yaml
└── argocd/
└── applicationset.yaml
Data Flow¶
Local Development¶
Developer → dk init   → Creates dk.yaml
          → dk dev up → Deploys embedded Helm charts to k3d
                        (Redpanda, LocalStack, PostgreSQL, Marquez)
                        Init jobs seed topics, buckets, schemas, namespaces
          → dk run    → Builds container, runs locally
                      → Lineage events → Marquez
Helm Chart Deployment Mechanism¶
The dk dev up command uses a uniform Helm chart deployment mechanism:
- Embedded Charts: All dev dependency charts are embedded in the CLI binary via Go's embed.FS (sdk/localdev/charts/)
- Chart Registry: A DefaultCharts registry defines each chart's port-forwarding rules, health labels, display endpoints, and timeouts
- Uniform Deployment: charts.DeployCharts() extracts charts to a temp directory and runs helm upgrade --install in parallel
- Init Jobs: Each chart includes Helm hook jobs (post-install/post-upgrade) that automatically create required resources (Kafka topics, S3 buckets, DB schemas, Marquez namespaces)
- Config Overrides: Users can override chart versions (dev.charts.<name>.version) or Helm values (dev.charts.<name>.values.<path>) via the hierarchical config system
- Upstream Subcharts: Redpanda and PostgreSQL wrap upstream Helm charts as subcharts, inheriting production-quality templates while providing dev-appropriate value overrides
Adding a new dev dependency requires only:
- Creating a chart directory under sdk/localdev/charts/<name>/
- Registering a ChartDefinition in the DefaultCharts slice in embed.go
No changes to deployment, health-checking, port-forwarding, or CLI code are needed.
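To make the registry idea concrete, here is a minimal sketch of a chart registry and a parallel deploy step. The names `ChartDefinition`, `DefaultCharts`, and `DeployCharts` come from the description above, but the struct fields, the function signature, the health label values, the timeouts, and the injected `install` callback are all assumptions for illustration. A real installer would extract the embedded chart and run `helm upgrade --install`; the sketch stubs that out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ChartDefinition fields are assumed; the real struct may differ.
type ChartDefinition struct {
	Name        string
	HealthLabel string        // label selector used to wait for readiness
	LocalPort   int           // port forwarded to the developer's machine
	Timeout     time.Duration // upper bound for install plus readiness
}

// DefaultCharts lists the dev dependencies; adding one is a single new entry.
var DefaultCharts = []ChartDefinition{
	{Name: "redpanda", HealthLabel: "app.kubernetes.io/name=redpanda", LocalPort: 9092, Timeout: 5 * time.Minute},
	{Name: "localstack", HealthLabel: "app.kubernetes.io/name=localstack", LocalPort: 4566, Timeout: 3 * time.Minute},
	{Name: "postgresql", HealthLabel: "app.kubernetes.io/name=postgresql", LocalPort: 5432, Timeout: 3 * time.Minute},
	{Name: "marquez", HealthLabel: "app.kubernetes.io/name=marquez", LocalPort: 5000, Timeout: 3 * time.Minute},
}

// errSimulated is used only to demonstrate the failure path.
var errSimulated = errors.New("simulated install failure")

// DeployCharts installs every chart concurrently and joins any failures.
func DeployCharts(charts []ChartDefinition, install func(ChartDefinition) error) error {
	errs := make([]error, len(charts)) // one slot per goroutine, so no mutex is needed
	var wg sync.WaitGroup
	for i, c := range charts {
		wg.Add(1)
		go func(i int, c ChartDefinition) {
			defer wg.Done()
			if err := install(c); err != nil {
				errs[i] = fmt.Errorf("deploy %s: %w", c.Name, err)
			}
		}(i, c)
	}
	wg.Wait()
	return errors.Join(errs...) // nil when every install succeeded
}

func main() {
	err := DeployCharts(DefaultCharts, func(c ChartDefinition) error {
		// Stub: a real installer would shell out to helm here.
		if c.Name == "marquez" {
			return errSimulated
		}
		return nil
	})
	fmt.Println(err)
}
```

Because deployment iterates over the slice and knows nothing about individual charts, registering a new `ChartDefinition` is enough for it to be picked up, which matches the extension story described above.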
Publish & Promote¶
Developer → dk build   → Validates & bundles artifact
          → dk publish → Pushes to OCI registry (digest-based)
          → dk promote → Creates PR to gitops repo
          → PR merged  → ArgoCD syncs
          → Controller → Pulls artifact, creates Job
Lineage Tracking¶
Runner emits OpenLineage events:
  START    → Job begins execution
  COMPLETE → Job finished successfully
  FAIL     → Job failed with error
Events sent to:
  Marquez (local) → http://localhost:5000/api/v1/lineage
  Marquez (prod)  → Configured via environment
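The START, COMPLETE, and FAIL sequence can be expressed as a thin wrapper around the job function. The sketch below assumes a hypothetical `Emitter` interface and records events in memory; a real emitter would POST OpenLineage events to the Marquez endpoint shown above. None of these names are taken from `sdk/runner`.

```go
package main

import (
	"errors"
	"fmt"
)

// Emitter is a hypothetical sink for lineage events.
type Emitter interface {
	Emit(eventType, job string)
}

// recorder keeps emitted event types in memory for inspection.
type recorder struct {
	events []string
}

func (r *recorder) Emit(eventType, job string) {
	r.events = append(r.events, eventType)
}

// errTransformFailed is used only to demonstrate the failure path.
var errTransformFailed = errors.New("transform failed")

// RunWithLineage emits START before fn runs, then COMPLETE or FAIL
// depending on whether fn returned an error. The job's error is returned unchanged.
func RunWithLineage(e Emitter, job string, fn func() error) error {
	e.Emit("START", job)
	if err := fn(); err != nil {
		e.Emit("FAIL", job)
		return err
	}
	e.Emit("COMPLETE", job)
	return nil
}

func main() {
	ok := &recorder{}
	_ = RunWithLineage(ok, "orders-transform", func() error { return nil })
	fmt.Println(ok.events)

	failed := &recorder{}
	err := RunWithLineage(failed, "orders-transform", func() error { return errTransformFailed })
	fmt.Println(failed.events, err)
}
```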
Key Design Decisions¶
1. OCI for Package Storage¶
Rationale:
- Immutable by design (content-addressable)
- Existing tooling (Docker registries, Harbor)
- Standard format with ecosystem support
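Content addressing is what makes published artifacts effectively immutable: the reference is derived from the bytes, so any change yields a different digest. The sketch below computes a digest in the `sha256:<hex>` form used by OCI registries. The `Digest` and `Verify` helpers are illustrative names, not part of the DataKit SDK.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Digest returns the content address of b in "sha256:<hex>" form.
func Digest(b []byte) string {
	sum := sha256.Sum256(b)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// Verify reports whether content still matches the digest it was published under.
func Verify(digest string, content []byte) bool {
	return Digest(content) == digest
}

func main() {
	artifact := []byte("pipeline bundle v1")
	d := Digest(artifact)
	fmt.Println(d)
	fmt.Println(Verify(d, artifact))                           // true
	fmt.Println(Verify(d, []byte("pipeline bundle, tampered"))) // false
}
```

Promoting by digest rather than by mutable tag means the bytes that were validated at build time are exactly the bytes the controller pulls later.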
2. GitOps for Promotion¶
Rationale:
- Auditable change history
- Declarative desired state
- Rollback = git revert
- No direct cluster access needed
3. OpenLineage for Lineage¶
Rationale:
- Industry standard
- Marquez integration
- Vendor neutral
4. Go Monorepo¶
Rationale:
- Independent versioning per module
- Shared contracts
- Single CI pipeline
- Clear dependency direction
Dependency Graph¶
Scaling Considerations¶
| Aspect | MVP | Scale Target |
|---|---|---|
| Packages | 10-50 | 500+ |
| Environments | 3 | 10+ |
| Concurrent runs | 10 | 100+ |
| OCI artifact size | <500MB | <1GB |
Security Model¶
- No Secrets in Code: All secrets via Kubernetes/external-secrets
- PII Metadata: Required classification on outputs
- Immutable Artifacts: No modification after publish
- PR-based Promotion: Audit trail for all changes
- RBAC: Kubernetes RBAC for controller
Observability¶
Metrics¶
- dk_run_total{status,package,namespace}
- dk_run_duration_seconds{package,namespace}
- dk_controller_reconcile_total{result}
Logging¶
- Structured JSON with slog
- Correlation IDs for tracing
- Levels: DEBUG, INFO, WARN, ERROR
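A minimal sketch of this logging setup with the standard library's `log/slog` package (Go 1.21 or later) follows. The attribute key `correlation_id`, the helper names, and the `package` attribute are assumptions for illustration; the source only states that logs are structured JSON with correlation IDs.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
)

// NewLogger returns a JSON logger that drops records below level.
func NewLogger(w io.Writer, level slog.Level) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, &slog.HandlerOptions{Level: level}))
}

// ForRun returns a child logger that stamps every record with the correlation ID.
func ForRun(l *slog.Logger, correlationID string) *slog.Logger {
	return l.With("correlation_id", correlationID)
}

// logOnce writes one INFO record to an in-memory buffer and decodes it,
// which makes the emitted JSON structure easy to inspect.
func logOnce(correlationID, msg string) (map[string]any, error) {
	var buf bytes.Buffer
	ForRun(NewLogger(&buf, slog.LevelInfo), correlationID).Info(msg, "package", "orders")
	var rec map[string]any
	err := json.Unmarshal(buf.Bytes(), &rec)
	return rec, err
}

func main() {
	rec, err := logOnce("run-42", "run started")
	if err != nil {
		panic(err)
	}
	fmt.Println(rec["level"], rec["msg"], rec["correlation_id"], rec["package"])
}
```

Attaching the correlation ID once with `With` avoids repeating it at every call site and ensures that all records for a run can be joined on the same key.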
Dashboards¶
- Pipeline health: dashboards/pipeline-health.json
- Controller metrics: dashboards/controller.json