Data Contracts
Data contracts validate DataFrame column values against enum types at compile time, catching invalid data before runtime. This is especially useful for survey data, coded variables, and any dataset where columns should only contain specific values.
String Enum Contracts
The simplest form of data contract uses an enum type to validate string column values:
import DataFrame
enum City = Berlin | Munich | Hamburg
let data : DataFrame { name: String, city: City, .. } = DataFrame.readCsv "users.csv"
Try itIf the CSV file contains a city value that's not Berlin, Munich, or Hamburg, you get a compile-time error:
Error: Data contract violation in column 'city' (file: users.csv)
Invalid values: ["Cologne", "Frankfurt"]
Valid values: Berlin, Munich, Hamburg
This catches data quality issues before your program ever runs.
Display-Mapped Variants
Real-world data often contains values with spaces, special characters, or different formatting than Keel identifiers allow. Use display strings to map variant names to actual data values:
import DataFrame
enum Country
= Germany
| BurkinaFaso "Burkina Faso"
| UnitedKingdom "United Kingdom"
| CoteDIvoire "Côte d'Ivoire"
let data : DataFrame { country: Country, .. } = DataFrame.readCsv "countries.csv"
Try itThe variant names (BurkinaFaso, UnitedKingdom) are valid Keel identifiers, while the display strings ("Burkina Faso", "United Kingdom") match the actual values in your data.
Display matching is case-sensitive: "burkina faso" will not match "Burkina Faso".
Int+Label Contracts (Value Labels)
Survey data and statistical datasets often use integer codes with associated labels (like STATA value labels). Keel supports this with value-label contracts:
import DataFrame
enum Education = Primary | Secondary | Tertiary | Postgraduate
let survey : DataFrame {
respondent_id: Int,
education: [(1, Education::Primary), (2, Education::Secondary), (3, Education::Tertiary), (4, Education::Postgraduate)]
} = DataFrame.readCsv "survey.csv"
Try itThis validates that the education column contains only the integers 1, 2, 3, or 4, and provides semantic meaning to each code.
Type-Safe Value Labels with ValueLabel
Instead of repeating integer-to-label mappings in type annotations, you can define them once using the ValueLabel type in an enum definition:
import DataFrame
import ValueLabelSet
enum Education
= Primary (ValueLabel 1 "Primary")
| Secondary (ValueLabel 2 "Secondary")
| Tertiary (ValueLabel 3 "Tertiary")
| Postgraduate (ValueLabel 4 "Postgraduate")
let labels = ValueLabelSet.fromType Education
let survey = case DataFrame.readCsv "survey.csv" of
Ok df -> df
Err _ -> DataFrame.fromRecords []
let labeled = DataFrame.setValueLabels @education labels survey
Try itThis approach has several advantages over the inline contract syntax:
- The mapping is defined once and reused across multiple DataFrames
- The compiler validates that integer codes are unique
ValueLabelSet.fromTypeextracts the mapping at compile time- The enum variants are usable as values in pattern matching and functions
See Types: Variants with Value Labels for more on defining ValueLabel enums.
Shorthand Syntax for ValueLabel Enums
When an enum's variants all carry ValueLabel metadata, you can use the enum name directly as the column type — no need to write out the full mapping:
-- norun
import DataFrame
enum Education
= Primary (ValueLabel 1 "Primary")
| Secondary (ValueLabel 2 "Secondary")
| Tertiary (ValueLabel 3 "Tertiary")
| Postgraduate (ValueLabel 4 "Postgraduate")
let survey : DataFrame { education: Education } = DataFrame.readDta "survey.dta"
Try itThis is exactly equivalent to writing:
let survey : DataFrame {
education: [(1, Education::Primary), (2, Education::Secondary),
(3, Education::Tertiary), (4, Education::Postgraduate)]
} = DataFrame.readDta "survey.dta"
The shorthand requires a file with embedded value labels (Stata .dta). On CSV, JSON, or plain Parquet files — which have no embedded labels — both the shorthand and the explicit syntax produce the same compile-time error asking for a file with value labels.
A plain enum (no ValueLabel variants) used on an integer column is still a type error: only enums explicitly annotated with ValueLabel qualify for this shorthand.
Non-Exhaustive Value Labels
Use .. to allow additional values not in the mapping:
import DataFrame
enum YesNo = No | Yes
let data : DataFrame {
response: [(0, YesNo::No), (1, YesNo::Yes), ..]
} = DataFrame.readCsv "responses.csv"
Try itValues 0 and 1 are labeled, but other integers (e.g., -1 for "missing", 9 for "refused") are allowed without causing a contract violation.
Open vs Closed Schemas
Data contracts work with both open and closed schemas:
import DataFrame
enum Status = Active | Inactive | Pending
let exact : DataFrame { id: Int, name: String, status: Status } = DataFrame.readCsv "users.csv"
let flexible : DataFrame { status: Status, .. } = DataFrame.readCsv "users.csv"
Try itWith a closed schema, extra columns in the file cause an error. With an open schema (..), extra columns are ignored.
Multiple Contract Columns
You can apply contracts to multiple columns in the same schema:
import DataFrame
enum City = Berlin | Munich | Hamburg
enum Gender = Male | Female | Other
let data : DataFrame { city: City, gender: Gender, .. } = DataFrame.readCsv "demographics.csv"
Try itBoth columns are validated, and all violations are reported together.
Error Messages
Contract violations produce detailed error messages:
Error: Data contract violation in column 'city' (file: data.csv)
Invalid values: ["Cologne", "Frankfurt"]
Valid values: Berlin, Munich, Hamburg
Hint: Check that column values match the enum variants.
Note: Data contracts are validated at compile time for literal file paths.
For int+label contracts:
Error: Data contract violation in column 'education' (file: survey.csv)
Invalid values: [5, 6, 99]
Valid mappings: (1, Primary), (2, Secondary), (3, Tertiary), (4, Postgraduate)
When Validation Happens
Contract validation runs at compile time whenever the file path is evaluable at compile time — a string literal, a constant variable, or a constant concatenation. Validation is skipped only when the path is computed at runtime (for example, a function parameter).
For a full explanation of when compile-time checking applies, why runtime paths are treated differently, and how the two-pass process (schema inference + full contract scan) works, see DataFrame Type Safety: When Checking Happens and Performance Characteristics.
Use Cases
Survey Data Validation
import DataFrame
enum AgeGroup
= Under18 "Under 18"
| Young "18-34"
| Middle "35-54"
| Senior "55+"
enum Satisfaction = VeryDissatisfied | Dissatisfied | Neutral | Satisfied | VerySatisfied
let survey : DataFrame {
age_group: AgeGroup,
satisfaction: [(1, Satisfaction::VeryDissatisfied), (2, Satisfaction::Dissatisfied), (3, Satisfaction::Neutral), (4, Satisfaction::Satisfied), (5, Satisfaction::VerySatisfied)]
} = DataFrame.readCsv "survey_results.csv"
Try itGeographic Data
import DataFrame
enum Region
= NorthEast "North East"
| NorthWest "North West"
| Midlands
| SouthEast "South East"
| SouthWest "South West"
| London
let sales : DataFrame { region: Region, amount: Float, .. } = DataFrame.readCsv "regional_sales.csv"
Try itSTATA-Style Value Labels
-- norun
import DataFrame
enum Employment
= Employed (ValueLabel 1 "Employed")
| Unemployed (ValueLabel 2 "Unemployed")
| NotInLaborForce (ValueLabel 3 "Not in Labor Force")
| Retired (ValueLabel 4 "Retired")
let labor : DataFrame {
person_id: Int,
emp_status: Employment
} = DataFrame.readDta "labor_force.dta"
Contracts in Task Declarations
Data contract annotations work in task expecting and task exposing declarations, not just let bindings. This lets you enforce column types at the boundary between task files.
task exposing — a task that reads a Stata file and exposes the result with a ValueLabel enum column type uses the annotation as the authoritative type. Even though the file's column is stored as integers, the declared type (e.g., DataFrame { gender: Gender }) is what the caller sees. The compiler accepts this without a type mismatch error.
task expecting — a task that receives a DataFrame can declare enum column types in its expecting clause. The caller must pass a DataFrame whose column type is compatible with the declared enum type.
Newtypes are fully supported: type SurveyData = DataFrame { gender: Gender } can appear as both an expecting parameter type and an exposing output type.
See Tasks for the full task syntax reference.
Best Practices
- Define domain enums for all categorical variables in your data
- Use
ValueLabelannotations on integer-coded enums — then the column type is just the enum name, no mapping to repeat - Use display strings for values that don't map to valid identifiers
- Use open schemas (
..) during exploration, closed schemas for production - Document your contracts — they serve as schema documentation
- Use non-exhaustive contracts (
..in value-label lists) when you expect unknown codes
Next Steps
- DataFrame Type Safety — schema annotations, column type checking, and compile-time performance
- Learn about DataFrame operations for working with validated data
- See Types for more on enums and newtypes