DataSets¶
A DataSet is a named data contract that describes a table, S3 prefix, or Kafka topic living in a Store. DataSets are declarative metadata — they define what data exists, where it lives, its schema, and how it is classified, without containing any runtime logic.
What is a DataSet?¶
| Aspect | Description |
|---|---|
| Declarative | No code — just a YAML file describing the data contract |
| Store-bound | Every DataSet references a Store that provides connection details |
| Schema-aware | Declares column names, types, PII flags, and lineage links |
| Classified | Carries data sensitivity classification (public, internal, confidential, restricted) |
| Versioned | Semantic version tracks contract evolution |
| Scoped | Each DataSet belongs to a namespace and is referenced by Transforms |
DataSet Structure¶
DataSets live in the `dataset/` directory (or any directory you choose) alongside other manifests:

```
my-pipeline/
├── connector/
│   ├── postgres.yaml
│   └── s3.yaml
├── store/
│   ├── warehouse.yaml
│   └── lake-raw.yaml
├── dataset/
│   ├── users.yaml
│   ├── users-parquet.yaml
│   ├── orders.yaml
│   └── orders-parquet.yaml
├── dataset-group/
│   └── pg-snapshot.yaml
└── dk.yaml            # Transform manifest
```
DataSet manifest¶
Every DataSet is defined by a YAML file with `kind: DataSet`:

```yaml
apiVersion: datakit.infoblox.dev/v1alpha1
kind: DataSet
metadata:
  name: users                    # DNS-safe, 1-63 characters
  namespace: default
  version: 1.0.0                 # Semantic version
  labels:
    team: data-engineering
    domain: identity
spec:
  store: warehouse               # References a Store by name
  table: public.users            # Table name (for relational stores)
  classification: confidential   # Data sensitivity level
  schema:
    - name: id
      type: integer
    - name: email
      type: string
      pii: true
    - name: created_at
      type: timestamp
```
Location Types¶
A DataSet must specify at least one of `table`, `prefix`, or `topic` to identify where the data lives within the Store:

| Field | Store type | Example |
|---|---|---|
| `table` | Relational databases (Postgres, Snowflake) | `public.users` |
| `prefix` | Object stores (S3) | `data/users/` |
| `topic` | Streaming platforms (Kafka) | `user-events` |
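For example, a streaming DataSet uses `topic` in place of `table`. This is a minimal sketch following the manifest structure shown above; the Store name `events-kafka` and the column names are hypothetical:

```yaml
apiVersion: datakit.infoblox.dev/v1alpha1
kind: DataSet
metadata:
  name: user-events
  namespace: default
spec:
  store: events-kafka        # hypothetical Kafka-backed Store
  topic: user-events         # Kafka topic instead of a table or prefix
  classification: internal
  schema:
    - name: user_id
      type: integer
    - name: event_type
      type: string
```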
Schema and Lineage¶
DataSets declare their columns in the `spec.schema` array. Each column can optionally carry a `from` field that links it to a column in another DataSet, establishing column-level lineage:
```yaml
apiVersion: datakit.infoblox.dev/v1alpha1
kind: DataSet
metadata:
  name: users-parquet
  namespace: default
spec:
  store: lake-raw
  prefix: data/users/
  format: parquet
  classification: confidential
  schema:
    - name: id
      type: integer
      from: users.id             # Lineage: came from users.id
    - name: email
      type: string
      pii: true
      from: users.email          # Lineage: came from users.email
    - name: created_at
      type: timestamp
      from: users.created_at
```
Classification¶
Every DataSet should declare a classification level:
| Level | Description |
|---|---|
| `public` | Non-sensitive, publicly shareable data |
| `internal` | Internal-only, no PII |
| `confidential` | Contains PII or sensitive business data |
| `restricted` | Highly regulated data (financial, health) |
When any column has `pii: true`, `classification` is required (enforced by `dk lint`).
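The lint rule is easy to reason about if you think of the manifest as a plain mapping. The following is a sketch of the equivalent check, not `dk lint`'s actual implementation; the function name and error text are illustrative:

```python
def check_pii_classification(manifest: dict) -> list[str]:
    """Return lint errors for DataSets that declare PII columns
    but omit spec.classification (mirrors the rule described above)."""
    errors = []
    spec = manifest.get("spec", {})
    has_pii = any(col.get("pii") for col in spec.get("schema", []))
    if has_pii and not spec.get("classification"):
        name = manifest.get("metadata", {}).get("name", "<unnamed>")
        errors.append(f"dataset {name}: pii column declared but spec.classification is missing")
    return errors


# A manifest like the users example above, with classification removed
manifest = {
    "metadata": {"name": "users"},
    "spec": {
        "store": "warehouse",
        "table": "public.users",
        "schema": [
            {"name": "id", "type": "integer"},
            {"name": "email", "type": "string", "pii": True},
        ],
    },
}
print(check_pii_classification(manifest))
```

Adding `"classification": "confidential"` back to `spec` makes the check pass with no errors.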
Seed Data for Local Development¶
DataSets can declare sample data in a `dev.seed` section. This data is loaded into the backing database during local development so that your pipeline has real rows to process without manual SQL or external fixtures.

```yaml
spec:
  store: warehouse
  table: example_table
  schema:
    - name: id
      type: integer
    - name: name
      type: string
dev:
  seed:
    inline:
      - { id: 1, name: "alice" }
      - { id: 2, name: "bob" }
```
How It Works¶
1. `dk dev seed` (or the auto-seed that runs before `dk run`) reads every input DataSet in the package.
2. For each DataSet with a `dev.seed` section, it generates `CREATE TABLE IF NOT EXISTS` + `INSERT` statements and executes them against the local PostgreSQL instance via `kubectl exec`.
3. A SHA-256 checksum of the seed data is stored in a `_dk_seed_meta` table. On subsequent runs, unchanged data is skipped automatically.
4. When data does change, the table is `TRUNCATE`d before inserting so the contents always match the seed spec: no duplicates, no stale rows.
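The checksum-and-skip logic in steps 3-4 can be sketched as follows. This is a simplified model, not dk's code: the in-memory `meta` dict stands in for the `_dk_seed_meta` table, and the exact canonicalisation dk uses before hashing is an assumption:

```python
import hashlib
import json


def seed_checksum(rows: list[dict]) -> str:
    """Stable SHA-256 digest of the seed rows (keys sorted for determinism)."""
    canonical = json.dumps(rows, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_seed(table: str, rows: list[dict], meta: dict) -> str:
    """Decide what a seed run would do for one table.

    `meta` maps table name -> checksum stored on the previous run,
    standing in for the _dk_seed_meta table."""
    digest = seed_checksum(rows)
    if meta.get(table) == digest:
        return "skip"            # unchanged since last run
    meta[table] = digest
    return "truncate+insert"     # re-seed so contents match the spec


meta = {}
rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
print(plan_seed("example_table", rows, meta))  # first run seeds the table
print(plan_seed("example_table", rows, meta))  # unchanged data is skipped
```

Truncating before every re-insert is what keeps the table idempotent: running the seed twice with the same spec always converges on the same rows.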
Seed Profiles¶
You can define named profiles for different test scenarios under `dev.seed.profiles`. Each profile has its own inline rows or seed file:

```yaml
dev:
  seed:
    inline:
      - { id: 1, name: "alice" }     # default profile
    profiles:
      large-dataset:
        file: testdata/large.csv     # file-based profile
      edge-cases:
        inline:
          - { id: -1, name: "" }
          - { id: 999, name: "O'Reilly" }
      empty: {}                      # empty table for testing
```
Activate a profile with `dk dev seed --profile <name>`:

```bash
# Default seed data
dk dev seed

# Switch to the edge-cases profile
dk dev seed --profile edge-cases

# Force re-seed even if unchanged
dk dev seed --force
```
**Seed files**

Seed files can be CSV or JSON. Place them in your package directory and reference them with a relative path (e.g., `testdata/data.csv`).
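For a file-based profile, the seed file's columns should line up with the declared schema. A minimal CSV sketch for the `id`/`name` schema above, assuming a conventional header row (the exact layout dk expects is not specified here):

```csv
id,name
1,alice
2,bob
```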
CLI Commands¶
| Command | Description |
|---|---|
| `dk dataset list` | List all DataSets in the project |
| `dk dataset show <name>` | Show full DataSet details |
| `dk dataset validate [path]` | Validate DataSet configuration |
| `dk dev seed` | Load seed data into local dev stores |
| `dk dev seed --profile <name>` | Use a named seed profile |
Relationship to Other Concepts¶
```
Connector (technology type)
        │
        ▼
Store (named instance with credentials)
        │
        ▼
DataSet (data contract: table/prefix/topic + schema)
        │
        ▼
Transform (dk.yaml — reads input DataSets, writes output DataSets)
        │
        ▼
Build → Publish → Promote
```
- Stores provide infrastructure-specific connection details (buckets, topics, databases)
- DataSets declare what data lives in a Store: the schema, classification, and lineage
- Transforms reference DataSets by name in their `spec.inputs` and `spec.outputs`
- DataSetGroups bundle multiple DataSets produced by a single materialisation
See Also¶
- Data Packages — The container for DataSets
- Manifests — dk.yaml and manifest reference
- CLI Reference — Complete CLI documentation
- Manifest Schema — Schema reference