Data Lineage
Keel automatically tracks data provenance for every DataFrame. Each column records where it came from, what transformations were applied, and which parent DataFrames contributed to it. This lineage is available both in display output and through programmatic access.
Automatic Tracking
When you print a DataFrame, lineage appears below the data. It shows parent operations, column origins, and global operations — with no extra code required:
-- norun
-- tags: dataframe, lineage, provenance
-- Lineage appears automatically when printing a DataFrame
import DataFrame
let sales =
    DataFrame.fromRecords
        [ { product = "Laptop", revenue = 1200 }
        , { product = "Phone", revenue = 800 }
        ]
let result =
    sales
        |> DataFrame.filterGt "revenue" 500
        |> DataFrame.select ["product", "revenue"]
-- Printing the DataFrame shows data AND lineage:
--
-- shape: (2, 2)
-- ...
-- Lineage:
-- Derived from: df#... (select)
-- revenue: from records
-- product: from records
-- Global operations: 1
result
Source Paths
DataFrame.sourcePath returns the file path a DataFrame was read from, or Nothing for DataFrames created in memory:
-- DataFrame.sourcePath returns Nothing for fromRecords
import DataFrame
let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]
DataFrame.sourcePath df
For DataFrames read with readCsv, readJson, or readParquet, this returns Just "/path/to/file.csv".
Parent Tracking (DAG)
Every DataFrame is assigned a UUID. Derived DataFrames reference their parents, forming a directed acyclic graph (DAG). DataFrame.parents returns a list of records, each with id, name, operation, and lineage fields. The lineage field of each parent record embeds the full lineage of that parent DataFrame.
Root DataFrames have no parents:
-- Root DataFrames have no parents
import DataFrame
let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        ]
DataFrame.parents df
Derived DataFrames record which operation created them. You can count parents to verify the DAG structure:
-- Derived DataFrames track parent operations
import DataFrame
import List
let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]
let selected = df |> DataFrame.select ["name"]
-- Each parent record has id, name, operation, and lineage fields
List.length (DataFrame.parents selected)
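The same check can be applied to a longer pipeline. The following is a minimal sketch that uses only functions shown elsewhere on this page (filterGt, select, parents). It is marked norun because the source does not state the resulting count, so no expected output is given.

```keel
-- norun
-- Sketch: count the direct parents of a two-step pipeline
import DataFrame
import List
let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]
let result =
    df
        |> DataFrame.filterGt "age" 20
        |> DataFrame.select ["name"]
-- Each parent record embeds its own lineage, so earlier ancestors
-- are reached through the lineage field rather than listed here
List.length (DataFrame.parents result)
```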
Column Lineage
DataFrame.columnLineage returns lineage for a single column as Maybe Record. The record contains name, origin, transformations, and dependencies:
-- norun
-- tags: dataframe, lineage
-- DataFrame.columnLineage returns origin info for a column
import DataFrame
let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        ]
-- Returns Just { name, origin, transformations, dependencies }
-- origin.type is "FromRecords" for columns from DataFrame.fromRecords
DataFrame.columnLineage "name" df
After a rename, the transformation history records the operation:
-- norun
-- tags: dataframe, lineage
-- After rename, the transformation tracks the operation
import DataFrame
let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        ]
let renamed = df |> DataFrame.rename "name" "person"
-- The "person" column's lineage shows:
-- origin.type = "FromRecords" (original source)
-- transformations = [{ operation = "rename", description = "Renamed 'name' to 'person'" }]
DataFrame.columnLineage "person" renamed
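To see how the history looks after more than one operation, a rename can be followed by a select. This is a sketch built only from calls used on this page. The exact contents and ordering of the resulting transformations list are not stated in the source, so the example shows how to inspect them rather than what they contain.

```keel
-- norun
-- Sketch: inspect column lineage after rename followed by select
import DataFrame
let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        ]
let result =
    df
        |> DataFrame.rename "name" "person"
        |> DataFrame.select ["person"]
-- Check the transformations field of the returned record to see
-- which operations were recorded for the "person" column
DataFrame.columnLineage "person" result
```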
Origin Types
Each column's origin describes where it came from. The type field identifies the origin kind.
File
Columns read from CSV, JSON, or Parquet files. Origin includes path and originalName.
FromRecords
Columns from DataFrame.fromRecords or DataFrame.fromLists. A simple marker with no additional fields.
Computed
Columns created by withColumn or expressions. Origin includes operation and sourceColumns.
Aggregated
Columns produced by groupBy + agg. Origin includes sourceColumn, aggregationFunc, and groupByColumns:
-- norun
-- tags: dataframe, lineage, aggregation
-- Aggregated columns track source and function
import DataFrame
let df =
    DataFrame.fromRecords
        [ { category = "A", value = 10 }
        , { category = "A", value = 20 }
        , { category = "B", value = 30 }
        ]
let grouped = df |> DataFrame.groupBy ["category"]
let specs = [("value", "mean")]
let agged = grouped |> DataFrame.agg specs
-- The "value" column's lineage shows:
-- origin.type = "Aggregated"
-- origin.aggregationFunc = "mean"
-- origin.groupByColumns = ["category"]
DataFrame.columnLineage "value" agged
JoinedFrom
Columns brought in from the right side of a join. Origin includes sourceDataFrame and originalName:
-- norun
-- tags: dataframe, lineage, join
-- Joined columns track their source DataFrame
import DataFrame
let users =
    DataFrame.fromRecords
        [ { id = 1, name = "Alice" }
        , { id = 2, name = "Bob" }
        ]
let scores =
    DataFrame.fromRecords
        [ { id = 1, score = 95 }
        , { id = 2, score = 87 }
        ]
let joined = DataFrame.join "id" "id" scores users
-- The "score" column's lineage shows:
-- origin.type = "JoinedFrom"
-- origin.sourceDataFrame = "right"
-- origin.originalName = "score"
DataFrame.columnLineage "score" joined
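For comparison, the same join can be queried for a column that was not brought in from the right side. This sketch treats users as the left side, which is an inference from the example above reporting score as coming from "right". The source does not say what origin a left-side column reports, so no expected output is shown.

```keel
-- norun
-- Sketch: inspect a column from the left side of a join
import DataFrame
let users =
    DataFrame.fromRecords
        [ { id = 1, name = "Alice" }
        , { id = 2, name = "Bob" }
        ]
let scores =
    DataFrame.fromRecords
        [ { id = 1, score = 95 }
        , { id = 2, score = 87 }
        ]
let joined = DataFrame.join "id" "id" scores users
-- "name" exists only in users; compare its origin with that of "score"
DataFrame.columnLineage "name" joined
```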
Transformations and Global Operations
Lineage separates per-column transformations from global operations.
Per-column transformations are recorded on each affected column: select, drop, rename, withColumn, agg, join, concat. Each transformation has an operation name and a description:
-- norun
-- tags: dataframe, lineage
-- Columns track their transformation history
import DataFrame
let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]
let selected = df |> DataFrame.select ["name", "age"]
-- Each column's transformations list records operations applied:
-- [{ operation = "select", description = "Selected columns: name, age" }]
DataFrame.columnLineage "name" selected
Global operations affect all rows without changing column structure: filter, sort, head, tail, unique, sample, groupBy. They are tracked in the top-level globalOperations list:
-- norun
-- tags: dataframe, lineage
-- Global operations (filter, sort) are tracked separately
import DataFrame
let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        , { name = "Carol", age = 35 }
        ]
let result =
    df
        |> DataFrame.filterGt "age" 20
        |> DataFrame.sort "age"
-- The lineage record's globalOperations list contains:
-- [{ operation = "filterGt", description = "Filtered where age > 20" },
-- { operation = "sort", description = "Sorted by age (ascending)" }]
DataFrame.lineage result
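One way to observe the separation is to query a single column of the same pipeline and compare it with the full lineage record. This is a sketch using only calls from this page. Whether the column's transformations list stays empty after filterGt and sort is implied by the description above but not shown in the source, so it is left for the reader to confirm.

```keel
-- norun
-- Sketch: per-column lineage after global operations only
import DataFrame
let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        , { name = "Carol", age = 35 }
        ]
let result =
    df
        |> DataFrame.filterGt "age" 20
        |> DataFrame.sort "age"
-- Compare this record's transformations field with the
-- globalOperations list returned by DataFrame.lineage result
DataFrame.columnLineage "age" result
```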
Multi-Source Operations
Joins produce two parents and merge source paths from both DataFrames:
-- Join produces two parents in the DAG
import DataFrame
import List
let users =
    DataFrame.fromRecords
        [ { id = 1, name = "Alice" }
        , { id = 2, name = "Bob" }
        ]
let scores =
    DataFrame.fromRecords
        [ { id = 1, score = 95 }
        , { id = 2, score = 87 }
        ]
let joined = DataFrame.join "id" "id" scores users
List.length (DataFrame.parents joined)
DataFrame.concat produces N parents (one per input DataFrame) and deduplicates source paths.
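The merged source paths and parent entries of a join can be inspected through the full lineage record. The sketch below reuses the join from this section and calls DataFrame.lineage on the result. Both inputs are built in memory, so this only demonstrates where to look; what sourcePaths contains in that case is not stated in the source.

```keel
-- norun
-- Sketch: inspect sourcePaths and parents on a joined DataFrame
import DataFrame
let users =
    DataFrame.fromRecords
        [ { id = 1, name = "Alice" }
        , { id = 2, name = "Bob" }
        ]
let scores =
    DataFrame.fromRecords
        [ { id = 1, score = 95 }
        , { id = 2, score = 87 }
        ]
let joined = DataFrame.join "id" "id" scores users
-- The returned record carries sourcePaths and parents fields
DataFrame.lineage joined
```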
Full Lineage Record
DataFrame.lineage returns the complete lineage record with all fields:
-- norun
-- tags: dataframe, lineage
-- Full lineage record structure
import DataFrame
let df =
    DataFrame.fromRecords
        [ { name = "Alice", age = 30 }
        , { name = "Bob", age = 25 }
        ]
let result =
    df
        |> DataFrame.filterGt "age" 20
        |> DataFrame.select ["name"]
let lineage = DataFrame.lineage result
-- lineage is a Record with these fields:
--   id : String -- unique UUID for this DataFrame
--   columns : Record -- per-column lineage (keyed by column name)
--     name : Record
--       name : String -- current column name
--       origin : Record -- where column came from
--         type : String -- "File", "FromRecords", "Computed", etc.
--         ... -- type-specific fields
--       transformations : [Record] -- list of operations applied
--         operation : String -- e.g. "select", "rename"
--         description : String -- human-readable description
--       dependencies : [String] -- source column names
--   globalOperations : [Record] -- operations affecting all rows
--   sourcePaths : [String] -- file paths from read operations
--   parents : [Record] -- parent DataFrames in DAG
--     id : String -- parent UUID
--     name : String -- e.g. "df#a1b2c3d4"
--     operation : String -- e.g. "select", "filterGt"
--     lineage : Record -- embedded parent lineage (recursive)
lineage
Next Steps
- Learn about DataFrame Expressions for composable column operations
- See the DataFrame stdlib reference for the complete function list