Frictionless Data Standard¶

Overview¶

Frictionless Data is an open-source toolkit and set of specifications designed to simplify data management, integration, and sharing. It brings simplicity to the data experience by standardizing how data is packaged, described, and distributed.

Purpose¶

Make data reproducible, processable, and standardizable
Handle everything from simple CSV files to complex data pipelines
Promote FAIR principles: Findable, Accessible, Interoperable, Reusable
Create reliable, repeatable, and automated data integration workflows

Design Philosophy¶

Approachable: Minimal core with simple concepts
Incrementally Adoptable: Start small and scale as needed
Progressive: Enhances existing tools and workflows
Simplicity: Easy to understand and implement
Extensibility: Can be extended for specific use cases
Cross-technology: Works across different platforms and languages

Target Users¶

Researchers
Data Scientists
Data Engineers
Anyone working with structured data

Core Concepts¶

1. Data Packaging¶

Bundle data files with metadata and schemas to provide clarity and context. This makes data self-describing and easier to understand and use.

2. Data Transformation¶

Clean and convert data between formats with standardized processes.

3. Data Storage/Integration¶

Push data into different platforms and applications using consistent formats and metadata.

Main Specifications¶

The Frictionless Data standard is composed of several modular specifications that work together:

1. Data Package¶

2. Data Resource¶

3. Table Schema¶

Additional specifications include CSV Dialect, Tabular Data Package, Fiscal Data Package, and more.

1. Data Package¶

A Data Package is a simple container format for describing and distributing a collection of data.

Structure¶

A Data Package is centered around a datapackage.json descriptor file placed in the top-level directory.

Required Properties¶

resources (array): List of data resources in the package (REQUIRED)

Recommended Properties¶

name: Unique, URL-friendly identifier (lowercase, alphanumeric, hyphens, underscores)
id: Globally unique identifier (UUID or DOI)
licenses: Licensing information
profile: Specification profile being used

Optional Properties¶

title: Human-readable title
description: Detailed description (supports Markdown)
version: Semantic version string
sources: Information about raw data origins
contributors: People or organizations involved
keywords: Array of tags for searchability
image: Representative image URL

Example: Minimal Data Package¶

{
  "name": "my-dataset",
  "resources": [
    {
      "path": "data.csv",
      "title": "My Data"
    }
  ]
}

Example: Complete Data Package¶

{
  "name": "global-temperature-data",
  "title": "Global Temperature Data 1880-2020",
  "version": "1.0.0",
  "description": "Historical global temperature measurements from weather stations worldwide.",
  "licenses": [
    {
      "name": "CC-BY-4.0",
      "path": "https://creativecommons.org/licenses/by/4.0/",
      "title": "Creative Commons Attribution 4.0"
    }
  ],
  "sources": [
    {
      "title": "NOAA Climate Data",
      "path": "https://www.noaa.gov/climate-data"
    }
  ],
  "contributors": [
    {
      "title": "Jane Doe",
      "role": "author",
      "email": "jane@example.com"
    }
  ],
  "keywords": ["climate", "temperature", "weather"],
  "resources": [
    {
      "name": "temperature-readings",
      "path": "data/temperatures.csv",
      "title": "Temperature Readings",
      "schema": {
        "fields": [
          {
            "name": "station_id",
            "type": "string"
          },
          {
            "name": "date",
            "type": "date"
          },
          {
            "name": "temperature",
            "type": "number"
          }
        ]
      }
    }
  ]
}

2. Data Resource¶

A Data Resource describes a single data file or data source (like an individual table or file).

Required Properties¶

name: Unique identifier (lowercase alphanumeric, periods, hyphens, underscores)

Data Location (choose one)¶

path: Path to file (local relative path or remote URL)
data: Inline data within the descriptor

Optional Properties¶

title: Human-readable name
description: Detailed description
format: File format (e.g., 'csv', 'json', 'xlsx')
mediatype: MIME type (e.g., 'text/csv')
encoding: Character encoding (e.g., 'utf-8')
schema: Data structure description (often a Table Schema)

Example: Resource with Path¶

{
  "name": "sales-data",
  "title": "Sales Data Q1 2024",
  "path": "data/sales-q1-2024.csv",
  "format": "csv",
  "mediatype": "text/csv",
  "encoding": "utf-8",
  "description": "Quarterly sales data including revenue, units sold, and region."
}

Example: Resource with Inline Data¶

{
  "name": "countries",
  "title": "Country Codes",
  "data": [
    {"code": "US", "name": "United States"},
    {"code": "GB", "name": "United Kingdom"},
    {"code": "NO", "name": "Norway"}
  ]
}

Example: Resource with Remote URL¶

{
  "name": "population-data",
  "path": "https://example.com/data/population-2024.csv",
  "format": "csv",
  "schema": {
    "fields": [
      {"name": "country", "type": "string"},
      {"name": "year", "type": "integer"},
      {"name": "population", "type": "integer"}
    ]
  }
}

3. Table Schema¶

Table Schema is a language-agnostic specification for defining the structure of tabular data. It provides detailed descriptions of fields, types, and constraints.

Structure¶

{
  "fields": [ ... ],
  "primaryKey": "field_name" or ["field1", "field2"],
  "foreignKeys": [ ... ],
  "missingValues": [""]
}

Field Properties¶

Each field in the fields array has:

name (required): Field name
type (recommended): Data type
format: Specific format for the type
title: Human-readable title
description: Field description
constraints: Validation rules

Field Types¶

Type	Description	Example
`string`	Text data	"Hello World"
`number`	Numeric data (int or float)	3.14, 42
`integer`	Whole numbers	42
`boolean`	True/false values	true, false
`date`	Date values	2024-01-15
`datetime`	Date and time	2024-01-15T10:30:00Z
`time`	Time values	10:30:00
`year`	Year values	2024
`object`	JSON object	{"key": "value"}
`array`	JSON array	[1, 2, 3]
`geopoint`	Geographic coordinates	[45.5231, -122.6765]
`geojson`	GeoJSON data	{...}

Field Constraints¶

required (boolean): Field cannot be null/missing
unique (boolean): All values must be unique
minLength / maxLength (integer): String length constraints
minimum / maximum (number): Numeric range constraints
pattern (string): Regular expression validation
enum (array): Restrict to specific allowed values

Example: Basic Table Schema¶

{
  "fields": [
    {
      "name": "id",
      "type": "integer",
      "title": "User ID",
      "constraints": {
        "required": true,
        "unique": true
      }
    },
    {
      "name": "name",
      "type": "string",
      "title": "Full Name",
      "constraints": {
        "required": true,
        "minLength": 2,
        "maxLength": 100
      }
    },
    {
      "name": "age",
      "type": "integer",
      "title": "Age",
      "constraints": {
        "minimum": 0,
        "maximum": 120
      }
    },
    {
      "name": "email",
      "type": "string",
      "format": "email",
      "constraints": {
        "required": true
      }
    },
    {
      "name": "status",
      "type": "string",
      "constraints": {
        "enum": ["active", "inactive", "pending"]
      }
    }
  ],
  "primaryKey": "id",
  "missingValues": ["", "N/A", "null"]
}

Example: Advanced Table Schema with Foreign Keys¶

{
  "fields": [
    {
      "name": "order_id",
      "type": "integer",
      "constraints": {
        "required": true,
        "unique": true
      }
    },
    {
      "name": "customer_id",
      "type": "integer",
      "constraints": {
        "required": true
      }
    },
    {
      "name": "order_date",
      "type": "date",
      "format": "default",
      "constraints": {
        "required": true
      }
    },
    {
      "name": "total_amount",
      "type": "number",
      "constraints": {
        "minimum": 0
      }
    }
  ],
  "primaryKey": "order_id",
  "foreignKeys": [
    {
      "fields": "customer_id",
      "reference": {
        "resource": "customers",
        "fields": "id"
      }
    }
  ]
}

Complete Example: Multi-Resource Data Package¶

Here's a complete example showing how all the specifications work together:

datapackage.json:

{
  "name": "ecommerce-sample-data",
  "title": "E-commerce Sample Dataset",
  "version": "1.0.0",
  "description": "Sample e-commerce data including customers, orders, and products",
  "licenses": [
    {
      "name": "CC-BY-4.0",
      "title": "Creative Commons Attribution 4.0"
    }
  ],
  "resources": [
    {
      "name": "customers",
      "path": "data/customers.csv",
      "format": "csv",
      "schema": {
        "fields": [
          {
            "name": "id",
            "type": "integer",
            "constraints": {"required": true, "unique": true}
          },
          {
            "name": "name",
            "type": "string",
            "constraints": {"required": true}
          },
          {
            "name": "email",
            "type": "string",
            "format": "email",
            "constraints": {"required": true, "unique": true}
          },
          {
            "name": "created_at",
            "type": "datetime"
          }
        ],
        "primaryKey": "id"
      }
    },
    {
      "name": "orders",
      "path": "data/orders.csv",
      "format": "csv",
      "schema": {
        "fields": [
          {
            "name": "order_id",
            "type": "integer",
            "constraints": {"required": true, "unique": true}
          },
          {
            "name": "customer_id",
            "type": "integer",
            "constraints": {"required": true}
          },
          {
            "name": "order_date",
            "type": "date"
          },
          {
            "name": "total",
            "type": "number",
            "constraints": {"minimum": 0}
          }
        ],
        "primaryKey": "order_id",
        "foreignKeys": [
          {
            "fields": "customer_id",
            "reference": {
              "resource": "customers",
              "fields": "id"
            }
          }
        ]
      }
    },
    {
      "name": "product_categories",
      "title": "Product Categories Lookup",
      "data": [
        {"id": 1, "name": "Electronics"},
        {"id": 2, "name": "Clothing"},
        {"id": 3, "name": "Books"}
      ],
      "schema": {
        "fields": [
          {"name": "id", "type": "integer"},
          {"name": "name", "type": "string"}
        ],
        "primaryKey": "id"
      }
    }
  ]
}

Benefits and Use Cases¶

Benefits¶

Self-describing data: Metadata travels with the data
Validation: Schemas enable automatic data validation
Interoperability: Standard format works across tools and platforms
Documentation: Built-in documentation through descriptions
Versioning: Track changes with version numbers
Reproducibility: Clear provenance and structure

Use Cases¶

Research Data Management: Package research datasets with complete metadata
Open Data Publishing: Share government or public data with clear schemas
Data Pipelines: Standardize data flowing through ETL processes
API Documentation: Describe API response structures
Data Catalogs: Build searchable data repositories
Data Quality: Validate data against defined schemas

Tools and Ecosystem¶

The Frictionless Data ecosystem includes:

Frictionless Framework (Python): Create, validate, and transform data packages
Data Package Creator: Web-based tool for creating data packages
Goodtables: Data validation tool
Libraries: Available in Python, JavaScript, R, and other languages

References¶

Official Website: https://frictionlessdata.io/
Specifications: https://specs.frictionlessdata.io/
GitHub: https://github.com/frictionlessdata