Data Lineage

Keel automatically tracks data provenance for every DataFrame. Each column records where it came from, what transformations were applied, and which parent DataFrames contributed to it. This lineage is available both in display output and through programmatic access.

Automatic Tracking

When you print a DataFrame, its lineage appears below the data: parent operations, column origins, and global operations. No extra code is required:

-- norun
-- tags: dataframe, lineage, provenance
-- Lineage appears automatically when printing a DataFrame
import DataFrame

let sales =
    DataFrame.fromRecords
        [ { product = "Laptop", revenue = 1200 }
        , { product = "Phone", revenue = 800 }
        ]

let result =
    sales
        |> DataFrame.filterGt "revenue" 500
        |> DataFrame.select ["product", "revenue"]

-- Printing the DataFrame shows data AND lineage:
--
-- shape: (2, 2)
-- ...
-- Lineage:
--   Derived from: df#... (select)
--   revenue: from records
--   product: from records
-- Global operations: 1
result

Source Paths

DataFrame.sourcePath returns the file path a DataFrame was read from, or Nothing for DataFrames created in memory:

-- DataFrame.sourcePath returns Nothing for fromRecords
import DataFrame

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

DataFrame.sourcePath df

For DataFrames read with readCsv, readJson, or readParquet, it returns the path wrapped in Just, for example Just "/path/to/file.csv".
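
As a minimal sketch of the file-backed case, the snippet below reads a CSV and asks for its source path. It assumes that readCsv is exposed as DataFrame.readCsv, takes the file path as its only argument, and returns a DataFrame directly. The file name sales.csv is a placeholder, not a file that ships with Keel.

```keel
-- norun
-- Sketch: sourcePath for a file-backed DataFrame
-- Assumes DataFrame.readCsv takes a path and returns a DataFrame;
-- "sales.csv" is a placeholder path
import DataFrame

let df = DataFrame.readCsv "sales.csv"

-- Per the description above, this should be a Just wrapping the path
DataFrame.sourcePath df
```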

Parent Tracking (DAG)

Every DataFrame is assigned a UUID. Derived DataFrames reference their parents, forming a directed acyclic graph (DAG). DataFrame.parents returns a list of records, each with id, name, operation, and lineage fields; the lineage field embeds the full lineage of the parent DataFrame.

Root DataFrames have no parents:

-- Root DataFrames have no parents
import DataFrame

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        ]

DataFrame.parents df

Derived DataFrames record which operation created them. You can count parents to verify the DAG structure:

-- Derived DataFrames track parent operations
import DataFrame
import List

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

let selected = df |> DataFrame.select ["name"]

-- Each parent record has id, name, operation, and lineage fields
List.length (DataFrame.parents selected)

Column Lineage

DataFrame.columnLineage returns lineage for a single column as Maybe Record. The record contains name, origin, transformations, and dependencies:

-- norun
-- tags: dataframe, lineage
-- DataFrame.columnLineage returns origin info for a column
import DataFrame

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        ]

-- Returns Just { name, origin, transformations, dependencies }
-- origin.type is "FromRecords" for columns from DataFrame.fromRecords
DataFrame.columnLineage "name" df

After a rename, the transformation history records the operation:

-- norun
-- tags: dataframe, lineage
-- After rename, the transformation tracks the operation
import DataFrame

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        ]

let renamed = df |> DataFrame.rename "name" "person"

-- The "person" column's lineage shows:
--   origin.type = "FromRecords" (original source)
--   transformations = [{ operation = "rename", description = "Renamed 'name' to 'person'" }]
DataFrame.columnLineage "person" renamed

Origin Types

Each column's origin describes where it came from. The type field identifies the origin kind.

File

Columns read from CSV, JSON, or Parquet files. Origin includes path and originalName.
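
A hedged sketch of inspecting a File origin follows. It assumes DataFrame.readCsv takes a path and returns a DataFrame; the file sales.csv and its revenue column are hypothetical placeholders.

```keel
-- norun
-- Sketch: column lineage for a file-backed column
-- Assumes DataFrame.readCsv takes a path and returns a DataFrame;
-- "sales.csv" and its "revenue" column are placeholders
import DataFrame

let df = DataFrame.readCsv "sales.csv"

-- Per the description above, the "revenue" column's lineage should show:
--   origin.type = "File"
--   origin.path and origin.originalName populated from the file
DataFrame.columnLineage "revenue" df
```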

FromRecords

Columns from DataFrame.fromRecords or DataFrame.fromLists. A simple marker with no additional fields.

Computed

Columns created by withColumn or expressions. Origin includes operation and sourceColumns.

Aggregated

Columns produced by groupBy followed by agg. Origin includes sourceColumn, aggregationFunc, and groupByColumns:

-- norun
-- tags: dataframe, lineage, aggregation
-- Aggregated columns track source and function
import DataFrame

let df =
    DataFrame.fromRecords
        [ { category = "A", value = 10 }
        , { category = "A", value = 20 }
        , { category = "B", value = 30 }
        ]

let grouped = df |> DataFrame.groupBy ["category"]
let specs = [("value", "mean")]
let agged = grouped |> DataFrame.agg specs

-- The "value" column's lineage shows:
--   origin.type = "Aggregated"
--   origin.aggregationFunc = "mean"
--   origin.groupByColumns = ["category"]
DataFrame.columnLineage "value" agged

JoinedFrom

Columns brought in from the right side of a join. Origin includes sourceDataFrame and originalName:

-- norun
-- tags: dataframe, lineage, join
-- Joined columns track their source DataFrame
import DataFrame

let users =
    DataFrame.fromRecords
        [ { id = 1, name = "Alice" }
        , { id = 2, name = "Bob" }
        ]

let scores =
    DataFrame.fromRecords
        [ { id = 1, score = 95 }
        , { id = 2, score = 87 }
        ]

let joined = DataFrame.join "id" "id" scores users

-- The "score" column's lineage shows:
--   origin.type = "JoinedFrom"
--   origin.sourceDataFrame = "right"
--   origin.originalName = "score"
DataFrame.columnLineage "score" joined

Transformations and Global Operations

Lineage separates per-column transformations from global operations.

Per-column transformations are recorded on each affected column: select, drop, rename, withColumn, agg, join, concat. Each transformation has an operation name and a description:

-- norun
-- tags: dataframe, lineage
-- Columns track their transformation history
import DataFrame

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

let selected = df |> DataFrame.select ["name", "age"]

-- Each column's transformations list records operations applied:
--   [{ operation = "select", description = "Selected columns: name, age" }]
DataFrame.columnLineage "name" selected

Global operations affect all rows without changing column structure: filter, sort, head, tail, unique, sample, groupBy. They are tracked in the top-level globalOperations list:

-- norun
-- tags: dataframe, lineage
-- Global operations (filter, sort) are tracked separately
import DataFrame

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        , { name = "Carol", age = 35 }
        ]

let result =
    df
        |> DataFrame.filterGt "age" 20
        |> DataFrame.sort "age"

-- The lineage record's globalOperations list contains:
--   [{ operation = "filterGt", description = "Filtered where age > 20" },
--    { operation = "sort", description = "Sorted by age (ascending)" }]
DataFrame.lineage result

Multi-Source Operations

Joins produce two parents and merge source paths from both DataFrames:

-- Join produces two parents in the DAG
import DataFrame
import List

let users =
    DataFrame.fromRecords
        [ { id = 1, name = "Alice" }
        , { id = 2, name = "Bob" }
        ]

let scores =
    DataFrame.fromRecords
        [ { id = 1, score = 95 }
        , { id = 2, score = 87 }
        ]

let joined = DataFrame.join "id" "id" scores users

List.length (DataFrame.parents joined)

DataFrame.concat produces N parents (one per input DataFrame) and deduplicates source paths.
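
The concat case can be sketched the same way as the join example. The snippet assumes DataFrame.concat takes a list of DataFrames, which this page does not confirm; the three input DataFrames are illustrative only.

```keel
-- norun
-- Sketch: concat records one parent per input DataFrame
-- Assumes DataFrame.concat takes a list of DataFrames
import DataFrame
import List

let q1 =
    DataFrame.fromRecords
        [ { product = "Laptop", revenue = 1200 }
        ]

let q2 =
    DataFrame.fromRecords
        [ { product = "Phone", revenue = 800 }
        ]

let q3 =
    DataFrame.fromRecords
        [ { product = "Tablet", revenue = 600 }
        ]

let combined = DataFrame.concat [q1, q2, q3]

-- With three inputs, the description above implies three parents
List.length (DataFrame.parents combined)
```

Because all three inputs are built in memory, none of them should contribute a source path, so the merged source paths would be expected to stay empty here.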

Full Lineage Record

DataFrame.lineage returns the complete lineage record with all fields:

-- norun
-- tags: dataframe, lineage
-- Full lineage record structure
import DataFrame

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

let result =
    df
        |> DataFrame.filterGt "age" 20
        |> DataFrame.select ["name"]

let lineage = DataFrame.lineage result

-- lineage is a Record with these fields:
--   id : String               -- unique UUID for this DataFrame
--   columns : Record          -- per-column lineage (keyed by column name)
--     name : Record               -- entry for the column named "name"
--       name : String             -- current column name
--       origin : Record           -- where column came from
--         type : String           -- "File", "FromRecords", "Computed", etc.
--         ...                     -- type-specific fields
--       transformations : [Record] -- list of operations applied
--         operation : String      -- e.g. "select", "rename"
--         description : String    -- human-readable description
--       dependencies : [String]   -- source column names
--   globalOperations : [Record]  -- operations affecting all rows
--   sourcePaths : [String]       -- file paths from read operations
--   parents : [Record]           -- parent DataFrames in DAG
--     id : String                -- parent UUID
--     name : String              -- e.g. "df#a1b2c3d4"
--     operation : String         -- e.g. "select", "filterGt"
--     lineage : Record           -- embedded parent lineage (recursive)
lineage

Next Steps