DataSets¶
A DataSet is a named data contract that describes a table, S3 prefix, or Kafka topic living in a Store. DataSets are declarative metadata — they define what data exists, where it lives, its schema, and how it is classified, without containing any runtime logic.
What is a DataSet?¶
| Aspect | Description |
|---|---|
| Declarative | No code — just a YAML file describing the data contract |
| Store-bound | Every DataSet references a Store that provides connection details |
| Schema-aware | Declares column names, types, PII flags, and lineage links |
| Classified | Carries data sensitivity classification (public, internal, confidential, restricted) |
| Versioned | Semantic version tracks contract evolution |
| Scoped | Each DataSet belongs to a namespace and is referenced by Transforms |
DataSet Structure¶
DataSets live in the `dataset/` directory (or any directory you choose) alongside other manifests:

```
my-pipeline/
├── connector/
│   ├── postgres.yaml
│   └── s3.yaml
├── store/
│   ├── warehouse.yaml
│   └── lake-raw.yaml
├── dataset/
│   ├── users.yaml
│   ├── users-parquet.yaml
│   ├── orders.yaml
│   └── orders-parquet.yaml
├── dataset-group/
│   └── pg-snapshot.yaml
└── dk.yaml            # Transform manifest
```
DataSet manifest¶
Every DataSet is defined by a YAML file with `kind: DataSet`:

```yaml
apiVersion: datakit.infoblox.dev/v1alpha1
kind: DataSet
metadata:
  name: users                    # DNS-safe, 1-63 characters
  namespace: default
  version: 1.0.0                 # Semantic version
  labels:
    team: data-engineering
    domain: identity
spec:
  store: warehouse               # References a Store by name
  table: public.users            # Table name (for relational stores)
  classification: confidential   # Data sensitivity level
  schema:
    - name: id
      type: integer
    - name: email
      type: string
      pii: true
    - name: created_at
      type: timestamp
```
Location Types¶
A DataSet must specify at least one of `table`, `prefix`, or `topic` to identify where the data lives within the Store:

| Field | Store type | Example |
|---|---|---|
| `table` | Relational databases (Postgres, Snowflake) | `public.users` |
| `prefix` | Object stores (S3) | `data/users/` |
| `topic` | Streaming platforms (Kafka) | `user-events` |
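For example, a streaming DataSet uses `topic` in place of `table`. This is a minimal sketch following the manifest structure shown above; the Store name `events-kafka` and the column names are hypothetical:

```yaml
apiVersion: datakit.infoblox.dev/v1alpha1
kind: DataSet
metadata:
  name: user-events
  namespace: default
spec:
  store: events-kafka        # hypothetical Kafka-backed Store
  topic: user-events         # Kafka topic instead of a table or prefix
  classification: internal
  schema:
    - name: user_id
      type: integer
    - name: event_type
      type: string
```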
Schema and Lineage¶
DataSets declare their columns in the `spec.schema` array. Each column can optionally carry a `from` field that links it to a column in another DataSet, establishing column-level lineage:
```yaml
apiVersion: datakit.infoblox.dev/v1alpha1
kind: DataSet
metadata:
  name: users-parquet
  namespace: default
spec:
  store: lake-raw
  prefix: data/users/
  format: parquet
  classification: confidential
  schema:
    - name: id
      type: integer
      from: users.id             # Lineage: came from users.id
    - name: email
      type: string
      pii: true
      from: users.email          # Lineage: came from users.email
    - name: created_at
      type: timestamp
      from: users.created_at
```
Classification¶
Every DataSet should declare a classification level:
| Level | Description |
|---|---|
| `public` | Non-sensitive, publicly shareable data |
| `internal` | Internal-only, no PII |
| `confidential` | Contains PII or sensitive business data |
| `restricted` | Highly regulated data (financial, health) |
When any column has `pii: true`, `classification` is required (enforced by `dk lint`).
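The lint rule is easy to reason about if you think of the manifest as a plain mapping. The following is a sketch of the equivalent check, not `dk lint`'s actual implementation; the function name and error text are illustrative:

```python
def check_pii_classification(manifest: dict) -> list[str]:
    """Return lint errors for DataSets that declare PII columns
    but omit spec.classification (mirrors the rule described above)."""
    errors = []
    spec = manifest.get("spec", {})
    has_pii = any(col.get("pii") for col in spec.get("schema", []))
    if has_pii and not spec.get("classification"):
        name = manifest.get("metadata", {}).get("name", "<unnamed>")
        errors.append(f"dataset {name}: pii column declared but spec.classification is missing")
    return errors


# A manifest like the users example above, with classification removed
manifest = {
    "metadata": {"name": "users"},
    "spec": {
        "store": "warehouse",
        "table": "public.users",
        "schema": [
            {"name": "id", "type": "integer"},
            {"name": "email", "type": "string", "pii": True},
        ],
    },
}
print(check_pii_classification(manifest))
```

Adding `"classification": "confidential"` back to `spec` makes the check pass with no errors.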
Seed Data for Local Development¶
DataSets can declare sample data in a `dev.seed` section. This data is loaded into the backing database during local development so that your pipeline has real rows to process without manual SQL or external fixtures.

```yaml
spec:
  store: warehouse
  table: example_table
  schema:
    - name: id
      type: integer
    - name: name
      type: string
dev:
  seed:
    inline:
      - { id: 1, name: "alice" }
      - { id: 2, name: "bob" }
```
How It Works¶
1. `dk dev seed` (or the auto-seed that runs before `dk run`) reads every input DataSet in the package.
2. For each DataSet with a `dev.seed` section, it generates `CREATE TABLE IF NOT EXISTS` + `INSERT` statements and executes them against the local PostgreSQL instance via `kubectl exec`.
3. A SHA-256 checksum of the seed data is stored in a `_dk_seed_meta` table. On subsequent runs, unchanged data is skipped automatically.
4. When data does change, the table is `TRUNCATE`d before inserting so the contents always match the seed spec: no duplicates, no stale rows.
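The checksum-and-skip logic in steps 3-4 can be sketched as follows. This is a simplified model, not dk's code: the in-memory `meta` dict stands in for the `_dk_seed_meta` table, and the exact canonicalisation dk uses before hashing is an assumption:

```python
import hashlib
import json


def seed_checksum(rows: list[dict]) -> str:
    """Stable SHA-256 digest of the seed rows (keys sorted for determinism)."""
    canonical = json.dumps(rows, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_seed(table: str, rows: list[dict], meta: dict) -> str:
    """Decide what a seed run would do for one table.

    `meta` maps table name -> checksum stored on the previous run,
    standing in for the _dk_seed_meta table."""
    digest = seed_checksum(rows)
    if meta.get(table) == digest:
        return "skip"            # unchanged since last run
    meta[table] = digest
    return "truncate+insert"     # re-seed so contents match the spec


meta = {}
rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
print(plan_seed("example_table", rows, meta))  # first run seeds the table
print(plan_seed("example_table", rows, meta))  # unchanged data is skipped
```

Truncating before every re-insert is what keeps the table idempotent: running the seed twice with the same spec always converges on the same rows.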
Seed Profiles¶
You can define named profiles for different test scenarios under `dev.seed.profiles`. Each profile has its own inline rows or seed file:

```yaml
dev:
  seed:
    inline:
      - { id: 1, name: "alice" }     # default profile
    profiles:
      large-dataset:
        file: testdata/large.csv     # file-based profile
      edge-cases:
        inline:
          - { id: -1, name: "" }
          - { id: 999, name: "O'Reilly" }
      empty: {}                      # empty table for testing
```
Activate a profile with `dk dev seed --profile <name>`:

```bash
# Default seed data
dk dev seed

# Switch to the edge-cases profile
dk dev seed --profile edge-cases

# Force re-seed even if unchanged
dk dev seed --force
```
**Seed files**

Seed files can be CSV or JSON. Place them in your package directory and reference them with a relative path (e.g., `testdata/data.csv`).
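For a file-based profile, the seed file's columns should line up with the declared schema. A minimal CSV sketch for the `id`/`name` schema above, assuming a conventional header row (the exact layout dk expects is not specified here):

```csv
id,name
1,alice
2,bob
```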
CLI Commands¶
| Command | Description |
|---|---|
| `dk dataset list` | List all DataSets in the project |
| `dk dataset show <name>` | Show full DataSet details |
| `dk dataset validate [path]` | Validate DataSet configuration |
| `dk dev seed` | Load seed data into local dev stores |
| `dk dev seed --profile <name>` | Use a named seed profile |
Relationship to Other Concepts¶
```
Connector (technology type)
        │
        ▼
Store (named instance with credentials)
        │
        ▼
DataSet (data contract: table/prefix/topic + schema)
        │
        ▼
Transform (dk.yaml — reads input DataSets, writes output DataSets)
        │
        ▼
Build → Publish → Promote
```
- Stores provide infrastructure-specific connection details (buckets, topics, databases)
- DataSets declare what data lives in a Store: the schema, classification, and lineage
- Transforms reference DataSets by name in their `spec.inputs` and `spec.outputs`
- DataSetGroups bundle multiple DataSets produced by a single materialisation
See Also¶
- Data Packages — The container for DataSets
- Manifests — dk.yaml and manifest reference
- CLI Reference — Complete CLI documentation
- Manifest Schema — Schema reference