Skip to content

Data Packages

A data package is the fundamental unit of work in DataKit. It's a self-contained, versioned bundle that includes everything needed to run a data pipeline.

What is a Data Package?

Think of a data package as a "container" for your data pipeline:

Aspect Description
Self-contained All configuration, code, and metadata in one place
Versioned Immutable versions tracked in an OCI registry
Portable Same package runs locally and in production
Observable Built-in lineage tracking and metadata

Core Concepts

The data platform uses four core concepts to separate concerns:

Concept What it represents Who creates it
Connector A technology type (Postgres, S3, Kafka) Platform team
Store A named instance of a Connector with connection details Infra / SRE
DataSet A data contract (table, S3 prefix, topic) in a Store Data engineer
Transform A unit of computation that reads/writes DataSets Data engineer

Package Structure

A Transform package — the deployable unit — follows this structure:

my-pipeline/
├── dk.yaml           # Transform manifest (required)
├── src/              # Source code (generic-go / generic-python)
│   └── main.py
└── tests/            # Tests (optional)
    └── test_pipeline.py

For CloudQuery runtimes, no source code is needed — the Connector's plugin images handle execution.

dk.yaml (Manifest)

The manifest is the heart of every package:

dk.yaml
apiVersion: datakit.infoblox.dev/v1alpha1
kind: Transform
metadata:
  name: my-kafka-pipeline
  namespace: analytics
  version: 1.0.0
  labels:
    team: data-engineering
    domain: events
spec:
  runtime: generic-python       # cloudquery | generic-go | generic-python | dbt
  mode: batch                   # batch | streaming
  image: myorg/my-pipeline:v1.0.0
  timeout: 30m

  inputs:
    - dataset: raw-events       # references a DataSet by name

  outputs:
    - dataset: processed-events # references a DataSet by name

See Also

Full manifest schema in Manifests Reference.

Runtime Configuration

The spec section of a Transform defines how the container runs:

dk.yaml (spec section)
spec:
  runtime: generic-python
  image: myorg/my-pipeline:v1.0.0       # Required: container image
  timeout: 30m                           # Max execution time
  env:                                   # Environment variables
    - name: LOG_LEVEL
      value: info
  resources:                             # Resource limits
    cpu: "1"
    memory: "2Gi"

Overriding at Runtime

You can override configuration values without modifying dk.yaml:

# Override image for local testing
dk run ./my-pipeline --set spec.image=local:dev

# Apply environment-specific overrides
dk run ./my-pipeline -f production.yaml

# Combine both (--set takes precedence)
dk run ./my-pipeline -f production.yaml --set spec.timeout=1h

# Preview merged configuration
dk show ./my-pipeline -f production.yaml --set spec.image=new:v2

Runtimes

The DK CLI supports the following runtimes:

Generic Python

A containerised Python pipeline. This is the default runtime.

dk init my-pipeline --runtime generic-python

Generic Go

A containerised Go pipeline.

dk init my-pipeline --runtime generic-go

CloudQuery

A CloudQuery sync that uses Connector plugin images to move data between Stores. No application code required.

dk init my-sync --runtime cloudquery

dbt

A dbt transformation project.

dk init my-transforms --runtime dbt

Package Lifecycle

┌─────────────────────────────────────────────────────────────────────┐
│                        Package Lifecycle                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌───────┐ │
│  │ Create  │ → │ Develop │ → │  Build  │ → │ Publish │ → │Promote│ │
│  │(dk init)│   │(dk dev) │   │(dk build│   │(dk push)│   │(dk ↑) │ │
│  └─────────┘   └─────────┘   └─────────┘   └─────────┘   └───────┘ │
│       │             │             │             │             │     │
│       ▼             ▼             ▼             ▼             ▼     │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌───────┐ │
│  │  Local  │   │  Local  │   │   OCI   │   │   OCI   │   │  K8s  │ │
│  │  Files  │   │  Stack  │   │Artifact │   │Registry │   │ Env   │ │
│  └─────────┘   └─────────┘   └─────────┘   └─────────┘   └───────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

1. Create

Initialize a new package with templates:

dk init analytics-pipeline --runtime generic-python

2. Develop

Iterate locally with the dev stack:

dk dev up          # Start local services
dk run ./package   # Test pipeline
dk dev down        # Stop services

3. Build

Package as an OCI artifact:

dk build ./package
# Output: analytics-pipeline:v1.0.0

4. Publish

Push to registry:

dk publish ./package
# Pushes to: ghcr.io/org/analytics-pipeline:v1.0.0

5. Promote

Deploy to an environment:

dk promote analytics-pipeline v1.0.0 --to dev

Versioning

Packages use semantic versioning:

Version Part When to Increment
Major (X.0.0) Breaking changes to inputs/outputs
Minor (0.X.0) New features, backward compatible
Patch (0.0.X) Bug fixes, no behavior change

Versions are immutable once published:

# Publish version 1.0.0
dk build --version v1.0.0
dk publish

# Cannot overwrite - must increment
dk build --version v1.0.1
dk publish

Inputs and Outputs

Declaring Inputs

Inputs declare which DataSets a Transform reads:

inputs:
  - dataset: users            # references a DataSet by name

Declaring Outputs

Outputs declare which DataSets a Transform produces:

outputs:
  - dataset: users-parquet    # references a DataSet by name
    classification:
      pii: false
      sensitivity: internal

At execution time, the runner resolves each DataSet → Store → Connector to obtain connection details and credentials.

Supported Runtimes

Runtime Description
cloudquery CloudQuery SDK sync
generic-go Go container
generic-python Python container
dbt dbt transformations

Next Steps