Data Lineage

Keel automatically tracks data provenance for every DataFrame. Each column records where it came from, what transformations were applied, and which parent DataFrames contributed to it. This lineage is available both in display output and through programmatic access.

Automatic Tracking

When you print a DataFrame, lineage appears below the data. It shows parent operations, column origins, and global operations — with no extra code required:

-- norun
-- tags: dataframe, lineage, provenance
-- Lineage appears automatically when printing a DataFrame
import DataFrame
import DataFrame.Expr as Expr
import Result

let sales =
    DataFrame.fromRecords
        [ { product = "Laptop", revenue = 1200 }
        , { product = "Phone", revenue = 800 }
        ]

let filtered =
    sales
        |> DataFrame.filter (@revenue |> Expr.gt 500)
        |> Result.withDefault sales

let result =
    filtered
        |> DataFrame.select [@product, @revenue]
        |> Result.withDefault filtered

-- Printing the DataFrame shows data AND lineage:
--
-- shape: (2, 2)
-- ...
-- Lineage:
-- Derived from: df#... (select)
-- revenue: from records
-- product: from records
-- Global operations: 1
result

Source Paths

DataFrame.sourcePath returns the file path a DataFrame was read from, or Nothing for DataFrames created in memory:

-- DataFrame.sourcePath returns Nothing for fromRecords
import DataFrame

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

DataFrame.sourcePath df

For DataFrames read with readCsv, readJson, or readParquet, this returns Just "/path/to/file.csv".

Parent Tracking (DAG)

Every DataFrame is assigned a UUID. Derived DataFrames reference their parents, forming a directed acyclic graph (DAG). DataFrame.parents returns a list of records, each with id, name, operation, and lineage fields; the lineage field embeds the parent DataFrame's full lineage record.

Root DataFrames have no parents:

-- Root DataFrames have no parents
import DataFrame

let df = DataFrame.fromRecords [{ name = "Alice", age = 30 }]

DataFrame.parents df

Derived DataFrames record which operation created them. You can count parents to verify the DAG structure:

-- Derived DataFrames track parent operations
import DataFrame
import List
import Result

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

let selected =
    case df |> DataFrame.select [@name] of
        Ok d -> d
        Err _ -> DataFrame.fromRecords []

-- Each parent record has id, name, operation, and lineage fields
List.length (DataFrame.parents selected)
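To inspect a parent record rather than count parents, the list can be indexed. The following is a minimal sketch that uses List.nth the way it appears elsewhere on this page; the exact shape of the value it returns is an assumption and is not confirmed here:

```keel
-- Sketch: look at the first parent record of a derived DataFrame
import DataFrame
import List

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

let selected =
    case df |> DataFrame.select [@name] of
        Ok d -> d
        Err _ -> DataFrame.fromRecords []

-- Index 0 is the first parent; its record carries the
-- id, name, operation, and lineage fields described above
DataFrame.parents selected |> List.nth 0
```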

Column Lineage

DataFrame.columnLineage returns lineage for a single column as Maybe Record. The record contains name, origin, transformations, and dependencies:

-- norun
-- tags: dataframe, lineage
-- DataFrame.columnLineage returns origin info for a column
import DataFrame

let df = DataFrame.fromRecords [{ name = "Alice", age = 30 }]

-- Returns Just { name, origin, transformations, dependencies }
-- origin.type is "FromRecords" for columns from DataFrame.fromRecords
DataFrame.columnLineage @name df

After a rename, the transformation history records the operation:

-- norun
-- tags: dataframe, lineage
-- After rename, the transformation tracks the operation
import DataFrame
import Result

let df = DataFrame.fromRecords [{ name = "Alice", age = 30 }]

let renamed =
    case (df |> DataFrame.rename @name "person") of
        Ok d -> d
        Err _ -> DataFrame.fromRecords [{ person = "Alice", age = 30 }]

-- The "person" column's lineage shows:
-- origin.type = "FromRecords" (original source)
-- transformations = [{ operation = "rename", description = "Renamed 'name' to 'person'" }]
DataFrame.columnLineage @person renamed

Origin Types

Each column's origin describes where it came from. The type field identifies the origin kind.

File

Columns read from CSV, JSON, or Parquet files. Origin includes path and originalName.

FromRecords

Columns from DataFrame.fromRecords or DataFrame.fromLists. A simple marker with no additional fields.

Computed

Columns created by withColumn or expressions. Origin includes operation and sourceColumns.

Aggregated

Columns produced by groupBy + agg. Origin includes sourceColumn, aggregationFunc, and groupByColumns:

import DataFrame
import DataFrame.Expr as Expr
import DataFrame.Expr exposing col
import List

let df =
    DataFrame.fromRecords
        [ { category = "A", value = 10 }
        , { category = "A", value = 20 }
        , { category = "B", value = 30 }
        ]

let avgExpr =
    col @value
        |> Expr.mean
        |> Expr.named "value"

let agged =
    df
        |> DataFrame.groupBy [@category]
        |> DataFrame.agg [avgExpr]

-- The "value" column exists in the aggregated result
DataFrame.columns agged |> List.nth 1
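To see the Aggregated origin itself, the same pipeline can end with DataFrame.columnLineage instead of DataFrame.columns. This is a sketch that reuses only calls shown on this page; the concrete field values are left out because they have not been verified:

```keel
-- Sketch: inspect the lineage of an aggregated column
import DataFrame
import DataFrame.Expr as Expr
import DataFrame.Expr exposing col

let df =
    DataFrame.fromRecords
        [ { category = "A", value = 10 }
        , { category = "A", value = 20 }
        , { category = "B", value = 30 }
        ]

let avgExpr =
    col @value
        |> Expr.mean
        |> Expr.named "value"

let agged =
    df
        |> DataFrame.groupBy [@category]
        |> DataFrame.agg [avgExpr]

-- The origin is expected to have type "Aggregated" along with
-- sourceColumn, aggregationFunc, and groupByColumns fields
DataFrame.columnLineage @value agged
```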

JoinedFrom

Columns brought in from the right side of a join. Origin includes sourceDataFrame and originalName:

-- norun
-- tags: dataframe, lineage, join
-- Joined columns track their source DataFrame
import DataFrame

let users =
    DataFrame.fromRecords
        [ { id = 1, name = "Alice" }
        , { id = 2, name = "Bob" }
        ]

let scores =
    DataFrame.fromRecords
        [ { id = 1, score = 95 }
        , { id = 2, score = 87 }
        ]

let joined =
    case (DataFrame.join [@id] [@id] JoinType::OneToOne scores users) of
        Ok df -> df
        Err _ -> DataFrame.fromRecords []

-- The "score" column's lineage shows:
-- origin.type = "JoinedFrom"
-- origin.sourceDataFrame = "right"
-- origin.originalName = "score"
DataFrame.columnLineage @score joined

Transformations and Global Operations

Lineage separates per-column transformations from global operations.

Per-column transformations are recorded on each affected column: select, drop, rename, withColumn, agg, join, concat. Each transformation has an operation name and a description:

-- norun
-- tags: dataframe, lineage
-- Columns track their transformation history
import DataFrame
import Result

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

let selected =
    df
        |> DataFrame.select [@name, @age]
        |> Result.withDefault df

-- Each column's transformations list records operations applied:
-- [{ operation = "select", description = "Selected columns: name, age" }]
DataFrame.columnLineage @name selected

Global operations affect all rows without changing column structure: filter, sort, head, tail, unique, sample, groupBy. They are tracked in the top-level globalOperations list:

-- norun
-- tags: dataframe, lineage
-- Global operations (filter, sort) are tracked separately
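import DataFrame
import DataFrame.Expr as Expr
import Result

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        , { name = "Carol", age = 35 }
        ]

-- Filter first so the lineage records a "filter" global operation
let filtered =
    df
        |> DataFrame.filter (@age |> Expr.gt 20)
        |> Result.withDefault df

-- Then sort the filtered rows, adding a "sort" global operation
let sorted =
    case (filtered |> DataFrame.sort [@age]) of
        Ok df2 -> df2
        Err _ -> DataFrame.fromRecords []

let result = sorted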

-- The lineage record's globalOperations list contains:
-- [{ operation = "filter", description = "Filtered via Expr" },
-- { operation = "sort", description = "Sorted by age (ascending)" }]
DataFrame.lineage result
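The globalOperations list can also be read directly from the lineage record. Here is a minimal sketch, assuming record fields are accessed the same way as (DataFrame.lineage df).id elsewhere on this page:

```keel
-- Sketch: count the global operations recorded on a DataFrame
import DataFrame
import List

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

let sorted =
    case (df |> DataFrame.sort [@age]) of
        Ok df2 -> df2
        Err _ -> DataFrame.fromRecords []

-- globalOperations is the top-level list described above
let ops = (DataFrame.lineage sorted).globalOperations

List.length ops
```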

Multi-Source Operations

Joins produce two parents and merge source paths from both DataFrames:

-- Join produces two parents in the DAG
import DataFrame
import List

let users =
    DataFrame.fromRecords
        [ { id = 1, name = "Alice" }
        , { id = 2, name = "Bob" }
        ]

let scores =
    DataFrame.fromRecords
        [ { id = 1, score = 95 }
        , { id = 2, score = 87 }
        ]

let joined =
    case (DataFrame.join [@id] [@id] JoinType::OneToOne scores users) of
        Ok df -> df
        Err _ -> DataFrame.fromRecords []

List.length (DataFrame.parents joined)

DataFrame.concat produces N parents (one per input DataFrame) and deduplicates source paths.

Full Lineage Record

DataFrame.lineage returns the complete lineage record with all fields:

-- norun
-- tags: dataframe, lineage
-- Full lineage record structure
import DataFrame
import DataFrame.Expr as Expr
import Result

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

let filtered =
    df
        |> DataFrame.filter (@age |> Expr.gt 20)
        |> Result.withDefault df

let result =
    case filtered |> DataFrame.select [@name] of
        Ok d -> d
        Err _ -> DataFrame.fromRecords []

let lineage = DataFrame.lineage result

-- lineage is a Record with these fields:
--   id : String                    -- unique UUID for this DataFrame
--   columns : Record               -- per-column lineage (keyed by column name)
--     name : Record
--       name : String              -- current column name
--       origin : Record            -- where column came from
--         type : String            -- "File", "FromRecords", "Computed", etc.
--         ...                      -- type-specific fields
--       transformations : [Record] -- list of operations applied
--         operation : String       -- e.g. "select", "rename"
--         description : String     -- human-readable description
--       dependencies : [String]    -- source column names
--   globalOperations : [Record]    -- operations affecting all rows
--   sourcePaths : [String]         -- file paths from read operations
--   parents : [Record]             -- parent DataFrames in DAG
--     id : String                  -- parent UUID
--     name : String                -- e.g. "df#a1b2c3d4"
--     operation : String           -- e.g. "select", "filter"
--     lineage : Record             -- embedded parent lineage (recursive)
lineage

Lineage Registry Lookups

Keel maintains a global lineage registry keyed by DataFrame UUID. Two functions let you query it programmatically.

lineageById

DataFrame.lineageById : String -> Maybe Record — looks up a DataFrame's lineage record by its unique UUID. Returns Nothing if the UUID is not in the registry:

-- lineageById looks up a DataFrame in the lineage registry by its UUID
import DataFrame
import Maybe

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

-- lineageById returns Maybe Record — Nothing if the id is not found
DataFrame.lineageById "nonexistent-id"

To get the UUID of a DataFrame you already hold, read (DataFrame.lineage df).id.
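As a sketch of that round trip, the snippet below reads the id field from a DataFrame's lineage record and passes it back to lineageById. It uses only functions shown on this page. Whether a freshly created DataFrame is already present in the registry is an assumption, so no result is stated:

```keel
-- Sketch: look up a DataFrame's own lineage by its UUID
import DataFrame

let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]

-- Read the UUID from the lineage record
let dfId = (DataFrame.lineage df).id

-- Returns Maybe Record; Nothing would mean df is not in the registry
DataFrame.lineageById dfId
```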

lineageByName

DataFrame.lineageByName : String -> [Record] — searches the registry by name prefix. Returns all matching lineage records as a list, or an empty list if there are no matches:

-- lineageByName searches the registry by name prefix, returns a list
import DataFrame
import List

-- lineageByName returns [Record] — empty list if no match
List.length (DataFrame.lineageByName "nonexistent-name")

DataFrame names in the registry take the form "df#<short-uuid>". Pass a prefix such as "df#a1b2" to narrow the search.
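A prefix search can be sketched as follows. Because every registry name starts with "df#", that prefix should match all registered DataFrames, while a longer prefix narrows the result. That DataFrames created with fromRecords are registered automatically is an assumption here:

```keel
-- Sketch: count registry entries whose name starts with "df#"
import DataFrame
import List

let df = DataFrame.fromRecords [{ name = "Alice", age = 30 }]

-- "df#" is the common prefix of registry names; lineageByName
-- returns a list, so List.length gives the number of matches
List.length (DataFrame.lineageByName "df#")
```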

Next Steps