Concepts

Understand the core concepts that power DataKit. This section explains the fundamental building blocks you'll use when building data pipelines.

Core Concepts

🏗 Overview

High-level architecture and how the components fit together.

Architecture Overview →

📦 Data Packages

Self-contained units of data processing with metadata and code.

Learn about Data Packages →

📄 Manifests

Configuration files that define your package: Transform, DataSet, Connector, and Store manifests.

Understand Manifests →

🧩 DataSets

Data contracts (tables, S3 prefixes, topics) with schema and column-level lineage.

Learn about DataSets →

⚙ Pipelines

Reactive dependency graphs derived from Transform and DataSet manifests.

Pipelines →
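
As a rough illustration of a dependency graph derived from manifests, the sketch below links each transform to the transforms that produce its input DataSets. The manifest shape (`inputs`, `outputs`), the transform and DataSet names, and the helpers `derive_graph` and `run_order` are all hypothetical and are not DataKit's actual format or API. It only shows the general idea of deriving edges from declared inputs and outputs.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified Transform manifests (not DataKit's real schema):
# each transform declares the DataSets it reads and the DataSets it writes.
transforms = {
    "clean_orders": {"inputs": ["raw_orders"], "outputs": ["orders"]},
    "daily_revenue": {"inputs": ["orders"], "outputs": ["revenue_daily"]},
    "ingest_orders": {"inputs": [], "outputs": ["raw_orders"]},
}


def derive_graph(transforms):
    """Map each transform to the set of transforms that produce its inputs."""
    # Index every DataSet by the transform that writes it.
    producer = {}
    for name, spec in transforms.items():
        for dataset in spec["outputs"]:
            producer[dataset] = name

    # A transform depends on the producer of each of its input DataSets.
    # Inputs with no known producer are treated as external sources.
    return {
        name: {producer[ds] for ds in spec["inputs"] if ds in producer}
        for name, spec in transforms.items()
    }


def run_order(transforms):
    """Return one valid execution order for the derived graph."""
    return list(TopologicalSorter(derive_graph(transforms)).static_order())


if __name__ == "__main__":
    print(run_order(transforms))
```

Nothing in this sketch wires transforms together by hand: the edges follow entirely from which DataSets each transform reads and writes, so adding a transform only requires declaring its own inputs and outputs.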

🔗 Lineage

Track data flow and dependencies with OpenLineage and Marquez.

Explore Lineage →

🛡 Governance

PII classification, compliance metadata, and data protection.

Data Governance →

🌎 Environments

Development, integration, and production workflows.

Environment Workflow →

💠 Cells & Stores

Infrastructure contexts that separate what runs from where it runs.

Learn about Cells →

Learning Path

We recommend reading the concepts in this order:

  1. Overview - Start with the big picture
  2. Data Packages - Understand the core unit of work
  3. Manifests - Learn how to configure packages
  4. DataSets - Data contracts with schema and classification
  5. Pipelines - Understand the reactive dependency graph
  6. Lineage - Track data flow
  7. Governance - Classify and protect data
  8. Environments - Deploy across stages
  9. Cells & Stores - Understand the Package × Cell model

Key Principles

DataKit is built on these principles:

| Principle | Description |
| --- | --- |
| Developer Experience First | Simple happy path: bootstrap → run → validate → publish → promote |
| Immutability | Released artifacts cannot be modified; versions are permanent |
| Separation of Concerns | Connectors, Stores, DataSets, and Transforms have distinct ownership |
| Security by Default | PII classification is required, not optional |
| Observability | Every operation emits metrics and lineage events |