Esc
Start typing to search...

Data Contracts

Data contracts validate DataFrame column values against enum types at compile time, catching invalid data before runtime. This is especially useful for survey data, coded variables, and any dataset where columns should only contain specific values.

String Enum Contracts

The simplest form of data contract uses an enum type to validate string column values:

import DataFrame

enum City = Berlin | Munich | Hamburg

let data : DataFrame { name: String, city: City, .. } = DataFrame.readCsv "users.csv"
Try it

If the CSV file contains a city value that's not Berlin, Munich, or Hamburg, you get a compile-time error:

Error: Data contract violation in column 'city' (file: users.csv)
       Invalid values: ["Cologne", "Frankfurt"]
       Valid values: Berlin, Munich, Hamburg

This catches data quality issues before your program ever runs.

Display-Mapped Variants

Real-world data often contains values with spaces, special characters, or different formatting than Keel identifiers allow. Use display strings to map variant names to actual data values:

import DataFrame

enum Country
    = Germany
    | BurkinaFaso "Burkina Faso"
    | UnitedKingdom "United Kingdom"
    | CoteDIvoire "Côte d'Ivoire"

let data : DataFrame { country: Country, .. } = DataFrame.readCsv "countries.csv"
Try it

The variant names (BurkinaFaso, UnitedKingdom) are valid Keel identifiers, while the display strings ("Burkina Faso", "United Kingdom") match the actual values in your data.

Display matching is case-sensitive: "burkina faso" will not match "Burkina Faso".

Int+Label Contracts (Value Labels)

Survey data and statistical datasets often use integer codes with associated labels (like STATA value labels). Keel supports this with value-label contracts:

import DataFrame

enum Education = Primary | Secondary | Tertiary | Postgraduate

let survey : DataFrame {
    respondent_id: Int,
    education: [(1, Education::Primary), (2, Education::Secondary), (3, Education::Tertiary), (4, Education::Postgraduate)]
} = DataFrame.readCsv "survey.csv"
Try it

This validates that the education column contains only the integers 1, 2, 3, or 4, and provides semantic meaning to each code.

Type-Safe Value Labels with ValueLabel

Instead of repeating integer-to-label mappings in type annotations, you can define them once using the ValueLabel type in an enum definition:

import DataFrame
import ValueLabelSet

enum Education
    = Primary (ValueLabel 1 "Primary")
    | Secondary (ValueLabel 2 "Secondary")
    | Tertiary (ValueLabel 3 "Tertiary")
    | Postgraduate (ValueLabel 4 "Postgraduate")

let labels = ValueLabelSet.fromType Education

let survey = case DataFrame.readCsv "survey.csv" of
    Ok df -> df
    Err _  -> DataFrame.fromRecords []
let labeled = DataFrame.setValueLabels @education labels survey
Try it

This approach has several advantages over the inline contract syntax:

  • The mapping is defined once and reused across multiple DataFrames
  • The compiler validates that integer codes are unique
  • ValueLabelSet.fromType extracts the mapping at compile time
  • The enum variants are usable as values in pattern matching and functions

See Types: Variants with Value Labels for more on defining ValueLabel enums.

Shorthand Syntax for ValueLabel Enums

When an enum's variants all carry ValueLabel metadata, you can use the enum name directly as the column type — no need to write out the full mapping:

-- norun

import DataFrame

enum Education
    = Primary     (ValueLabel 1 "Primary")
    | Secondary   (ValueLabel 2 "Secondary")
    | Tertiary    (ValueLabel 3 "Tertiary")
    | Postgraduate (ValueLabel 4 "Postgraduate")

let survey : DataFrame { education: Education } = DataFrame.readDta "survey.dta"
Try it

This is exactly equivalent to writing:

let survey : DataFrame {
    education: [(1, Education::Primary), (2, Education::Secondary),
                (3, Education::Tertiary), (4, Education::Postgraduate)]
} = DataFrame.readDta "survey.dta"

The shorthand requires a file with embedded value labels (Stata .dta). On CSV, JSON, or plain Parquet files — which have no embedded labels — both the shorthand and the explicit syntax produce the same compile-time error asking for a file with value labels.

A plain enum (no ValueLabel variants) used on an integer column is still a type error: only enums explicitly annotated with ValueLabel qualify for this shorthand.

Non-Exhaustive Value Labels

Use .. to allow additional values not in the mapping:

import DataFrame

enum YesNo = No | Yes

let data : DataFrame {
    response: [(0, YesNo::No), (1, YesNo::Yes), ..]
} = DataFrame.readCsv "responses.csv"
Try it

Values 0 and 1 are labeled, but other integers (e.g., -1 for "missing", 9 for "refused") are allowed without causing a contract violation.

Open vs Closed Schemas

Data contracts work with both open and closed schemas:

import DataFrame

enum Status = Active | Inactive | Pending

let exact : DataFrame { id: Int, name: String, status: Status } = DataFrame.readCsv "users.csv"

let flexible : DataFrame { status: Status, .. } = DataFrame.readCsv "users.csv"
Try it

With a closed schema, extra columns in the file cause an error. With an open schema (..), extra columns are ignored.

Multiple Contract Columns

You can apply contracts to multiple columns in the same schema:

import DataFrame

enum City = Berlin | Munich | Hamburg
enum Gender = Male | Female | Other

let data : DataFrame { city: City, gender: Gender, .. } = DataFrame.readCsv "demographics.csv"
Try it

Both columns are validated, and all violations are reported together.

Error Messages

Contract violations produce detailed error messages:

Error: Data contract violation in column 'city' (file: data.csv)
       Invalid values: ["Cologne", "Frankfurt"]
       Valid values: Berlin, Munich, Hamburg
Hint: Check that column values match the enum variants.
Note: Data contracts are validated at compile time for literal file paths.

For int+label contracts:

Error: Data contract violation in column 'education' (file: survey.csv)
       Invalid values: [5, 6, 99]
       Valid mappings: (1, Primary), (2, Secondary), (3, Tertiary), (4, Postgraduate)

When Validation Happens

Contract validation runs at compile time whenever the file path is evaluable at compile time — a string literal, a constant variable, or a constant concatenation. Validation is skipped only when the path is computed at runtime (for example, a function parameter).

For a full explanation of when compile-time checking applies, why runtime paths are treated differently, and how the two-pass process (schema inference + full contract scan) works, see DataFrame Type Safety: When Checking Happens and Performance Characteristics.

Use Cases

Survey Data Validation

import DataFrame

enum AgeGroup
    = Under18 "Under 18"
    | Young "18-34"
    | Middle "35-54"
    | Senior "55+"

enum Satisfaction = VeryDissatisfied | Dissatisfied | Neutral | Satisfied | VerySatisfied

let survey : DataFrame {
    age_group: AgeGroup,
    satisfaction: [(1, Satisfaction::VeryDissatisfied), (2, Satisfaction::Dissatisfied), (3, Satisfaction::Neutral), (4, Satisfaction::Satisfied), (5, Satisfaction::VerySatisfied)]
} = DataFrame.readCsv "survey_results.csv"
Try it

Geographic Data

import DataFrame

enum Region
    = NorthEast "North East"
    | NorthWest "North West"
    | Midlands
    | SouthEast "South East"
    | SouthWest "South West"
    | London

let sales : DataFrame { region: Region, amount: Float, .. } = DataFrame.readCsv "regional_sales.csv"
Try it

STATA-Style Value Labels

-- norun

import DataFrame

enum Employment
    = Employed        (ValueLabel 1 "Employed")
    | Unemployed      (ValueLabel 2 "Unemployed")
    | NotInLaborForce (ValueLabel 3 "Not in Labor Force")
    | Retired         (ValueLabel 4 "Retired")

let labor : DataFrame {
    person_id: Int,
    emp_status: Employment
} = DataFrame.readDta "labor_force.dta"

Contracts in Task Declarations

Data contract annotations work in task expecting and task exposing declarations, not just let bindings. This lets you enforce column types at the boundary between task files.

task exposing — a task that reads a Stata file and exposes the result with a ValueLabel enum column type uses the annotation as the authoritative type. Even though the file's column is stored as integers, the declared type (e.g., DataFrame { gender: Gender }) is what the caller sees. The compiler accepts this without a type mismatch error.

task expecting — a task that receives a DataFrame can declare enum column types in its expecting clause. The caller must pass a DataFrame whose column type is compatible with the declared enum type.

Newtypes are fully supported: type SurveyData = DataFrame { gender: Gender } can appear as both an expecting parameter type and an exposing output type.

See Tasks for the full task syntax reference.

Best Practices

  1. Define domain enums for all categorical variables in your data
  2. Use ValueLabel annotations on integer-coded enums — then the column type is just the enum name, no mapping to repeat
  3. Use display strings for values that don't map to valid identifiers
  4. Use open schemas (..) during exploration, closed schemas for production
  5. Document your contracts — they serve as schema documentation
  6. Use non-exhaustive contracts (.. in value-label lists) when you expect unknown codes

Next Steps