Data Package Standard (v2)¶

Overview¶

Data Package is a comprehensive data standard and data definition language (DDL) designed to improve data management, sharing, and interoperability. Version 2 of the standard provides a modernized framework for describing datasets that enhances the findability, accessibility, interoperability, and reusability (FAIR) of data.

Purpose¶

Create a standardized framework for describing datasets
Provide a simple yet extensible data definition language
Enable structured data packaging and delivery
Facilitate easier data discovery and integration
Support comprehensive dataset documentation

Design Philosophy¶

Simple: Easy to understand and implement
Extensible: Can be customized for specific use cases
Language-agnostic: Works across different programming languages
Interoperable: Compatible with existing data standards
FAIR-focused: Promotes best practices in data management

Adoption¶

The Data Package standard is used by major organizations including: - European Commission - GitHub - GBIF (Global Biodiversity Information Facility) - Dryad - And many research institutions worldwide

Software Support¶

Open-source tools and libraries are available for multiple languages including Python, JavaScript, R, and more.

Architecture¶

The Data Package v2 standard consists of four core specifications that work together:

Data Package: Container format for describing a coherent collection of data
Data Resource: Format for describing individual data files or sources
Table Schema: Detailed description of tabular data structure
Table Dialect: Specification for how tabular data is physically stored

Key Concepts¶

Descriptor¶

A descriptor is a JSON object that contains metadata about a data package, resource, or schema. Descriptors can: - Be stored in JSON files (e.g., datapackage.json) - Include custom properties using namespace:propertyName convention - Reference profiles for validation

Profile¶

A profile is a URL that: - Resolves to a valid JSON Schema descriptor - Is versioned and immutable - Serves as a metadata version identifier - Enables validation of descriptors

Tabular Data¶

Tabular data consists of: - Rows with consistent fields - Optional header row defining field names - Multiple possible formats (CSV, Excel, JSON, databases, etc.)

Data Representation¶

Physical representation: How data appears in files (text, binary, etc.)
Logical representation: "Ideal" data with defined types and structures

1. Data Package Specification¶

A Data Package is a simple container format for describing a coherent collection of data in a single package.

Requirements¶

MUST have a descriptor file (preferably named datapackage.json)
MUST include at least one resource
Descriptor contains metadata about the package and its resources

Required Properties¶

resources (array): List of data resources (REQUIRED)

Recommended Properties¶

name: Unique, human-readable identifier (lowercase alphanumeric)
id: Globally unique identifier (UUID, DOI, etc.)
licenses: Licensing information
profile: Profile URL for validation

Standard Properties¶

Property	Type	Description
`$schema`	string	Profile URL (default: `https://datapackage.org/profiles/2.0/datapackage.json`)
`name`	string	Unique identifier, invariant across updates
`id`	string	Globally unique identifier
`title`	string	Human-readable title
`description`	string	Detailed description (supports Markdown)
`version`	string	Semantic version
`created`	string	Creation datetime (ISO 8601)
`homepage`	string	URL to package homepage
`licenses`	array	License objects
`sources`	array	Raw data source objects
`contributors`	array	Contributor objects
`keywords`	array	Search keywords
`image`	string	Representative image URL
`resources`	array	Data resource objects (REQUIRED)

Example: Minimal Data Package¶

{
  "name": "my-package",
  "resources": [
    {
      "name": "my-data",
      "path": "data.csv"
    }
  ]
}

Example: Complete Data Package¶

{
  "$schema": "https://datapackage.org/profiles/2.0/datapackage.json",
  "name": "gdp-data",
  "title": "A nice title",
  "version": "1.0.0",
  "created": "1985-04-12T23:20:50.52Z",
  "licenses": [
    {
      "name": "ODC-PDDL-1.0",
      "path": "http://opendatacommons.org/licenses/pddl/",
      "title": "Open Data Commons Public Domain Dedication and License v1.0"
    }
  ],
  "sources": [
    {
      "title": "World Bank and OECD",
      "path": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
    }
  ],
  "contributors": [
    {
      "title": "Joe Bloggs",
      "email": "[email protected]",
      "path": "http://www.bloggs.com",
      "roles": ["creator"]
    }
  ],
  "resources": [
    {
      "name": "gdp",
      "path": "data/gdp.csv",
      "type": "table",
      "schema": {
        "fields": [
          {
            "name": "country",
            "type": "string"
          },
          {
            "name": "year",
            "type": "integer"
          },
          {
            "name": "gdp",
            "type": "number"
          }
        ]
      }
    }
  ]
}

File Structure¶

A typical Data Package directory structure:

my-package/
├── datapackage.json
├── README.md
├── data/
│   ├── mydata.csv
│   └── otherdata.csv
└── scripts/
    └── process.py

2. Data Resource Specification¶

A Data Resource describes and packages a single data resource, such as an individual table or file.

Purpose¶

Provide a locator for data with optional rich metadata.

Required Properties¶

name: Unique identifier within the package (lowercase alphanumeric)

Data Location (one required)¶

path: URL or file system path to data file(s)
data: Inline data representation

Recommended Properties¶

type: Resource type (e.g., table for tabular data)
title: Human-readable label
description: Detailed description
format: File extension (e.g., csv, json, xlsx)
mediatype: MIME type (e.g., text/csv)
encoding: Character encoding (e.g., utf-8)

Tabular Data Properties¶

When type is table: - schema: Table Schema describing structure - dialect: Table Dialect describing physical format

Path Property¶

The path property can be: - A single file: "path": "data.csv" - Multiple files: "path": ["data1.csv", "data2.csv"] - A URL: "path": "http://example.com/data.csv" - A relative POSIX path: "path": "data/myfile.csv"

Security restrictions: - Cannot use absolute paths (/path/to/file) - Cannot use parent paths (../file) - Cannot reference hidden folders

Example: Basic Resource¶

{
  "name": "solar-system",
  "path": "http://example.com/solar-system.csv",
  "title": "The Solar System",
  "format": "csv",
  "mediatype": "text/csv",
  "encoding": "utf-8"
}

Example: Tabular Resource with Schema¶

{
  "name": "population",
  "type": "table",
  "path": "data/population.csv",
  "title": "Population Data",
  "format": "csv",
  "schema": {
    "fields": [
      {
        "name": "country",
        "type": "string"
      },
      {
        "name": "population",
        "type": "integer"
      }
    ]
  }
}

Example: Inline Data¶

{
  "name": "countries",
  "data": [
    {"code": "US", "name": "United States"},
    {"code": "GB", "name": "United Kingdom"},
    {"code": "NO", "name": "Norway"}
  ]
}

Example: Multiple Files¶

{
  "name": "annual-reports",
  "path": [
    "data/2020.csv",
    "data/2021.csv",
    "data/2022.csv"
  ],
  "format": "csv"
}

3. Table Schema Specification¶

Table Schema is a format for declaring schemas for tabular data, providing type checking and validation capabilities.

Core Properties¶

Property	Type	Description
`fields`	array	Field descriptors (REQUIRED)
`$schema`	string	Profile URL
`missingValues`	array	Defines how missing values are represented
`primaryKey`	string/array	Unique row identifier(s)
`foreignKeys`	array	Links between tables

Field Properties¶

Each field in the fields array has:

Property	Type	Description
`name`	string	Field name (REQUIRED)
`type`	string	Data type (recommended)
`format`	string	Specific format for the type
`title`	string	Human-readable title
`description`	string	Field description
`example`	any	Example value
`constraints`	object	Validation rules
`categories`	array	Categorical values with labels
`categoriesOrdered`	boolean	Whether categories have order

Field Types¶

Type	Description	Example
`string`	Text data	"Hello World"
`number`	Decimal numbers	3.14, -42.5
`integer`	Whole numbers	42, -17
`boolean`	True/false	true, false
`date`	Calendar date	2024-01-15
`datetime`	Date and time	2024-01-15T10:30:00Z
`time`	Time of day	10:30:00
`year`	Year	2024
`yearmonth`	Year and month	2024-01
`duration`	Time duration	P1Y2M (ISO 8601)
`geopoint`	Geographic coordinates	[45.5231, -122.6765]
`geojson`	GeoJSON geometry	{...}
`object`	JSON object	{"key": "value"}
`array`	JSON array	[1, 2, 3]
`any`	Unspecified/mixed type	any value

Field Constraints¶

Constraint	Type	Description
`required`	boolean	Field cannot be null
`unique`	boolean	All values must be unique
`minLength`	integer	Minimum string length
`maxLength`	integer	Maximum string length
`minimum`	number	Minimum value
`maximum`	number	Maximum value
`pattern`	string	Regular expression validation
`enum`	array	Restrict to specific values

Example: Basic Table Schema¶

{
  "fields": [
    {
      "name": "id",
      "type": "integer"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "price",
      "type": "number"
    }
  ]
}

Example: Schema with Constraints¶

{
  "fields": [
    {
      "name": "id",
      "type": "integer",
      "constraints": {
        "required": true,
        "unique": true
      }
    },
    {
      "name": "email",
      "type": "string",
      "format": "email",
      "constraints": {
        "required": true
      }
    },
    {
      "name": "price",
      "type": "integer",
      "constraints": {
        "minimum": 100,
        "maximum": 150,
        "required": true
      }
    }
  ],
  "primaryKey": "id"
}

Example: Categories¶

Categories allow you to define labeled values, useful for coded data:

{
  "fields": [
    {
      "name": "fruit",
      "type": "integer",
      "categories": [
        {"value": 0, "label": "apple"},
        {"value": 1, "label": "orange"},
        {"value": 2, "label": "banana"}
      ],
      "categoriesOrdered": true
    }
  ]
}

Example: Foreign Keys¶

Foreign keys create relationships between tables:

{
  "resources": [
    {
      "name": "population-by-state",
      "schema": {
        "fields": [
          {"name": "state-code", "type": "string"},
          {"name": "population", "type": "integer"}
        ],
        "foreignKeys": [
          {
            "fields": ["state-code"],
            "reference": {
              "resource": "state-codes",
              "fields": ["code"]
            }
          }
        ]
      }
    },
    {
      "name": "state-codes",
      "schema": {
        "fields": [
          {"name": "code", "type": "string"},
          {"name": "name", "type": "string"}
        ],
        "primaryKey": "code"
      }
    }
  ]
}

Example: Missing Values¶

Define how missing or null values are represented:

{
  "fields": [
    {"name": "temperature", "type": "number"},
    {"name": "notes", "type": "string"}
  ],
  "missingValues": [
    {"value": "", "label": "OMITTED"},
    {"value": "-99", "label": "REFUSED"},
    {"value": "N/A", "label": "NOT_APPLICABLE"}
  ]
}

4. Table Dialect Specification¶

Table Dialect describes how tabular data is physically stored in a file, supporting various formats like CSV, TSV, JSON, Excel, and databases.

Purpose¶

Define parsing rules for tabular data files
Ensure data interoperability across different formats
Provide configuration for delimiters, quotes, line terminators, etc.

Standard Properties¶

Property	Type	Default	Description
`$schema`	string	-	Profile URL
`header`	boolean	`true`	Whether first row contains field names
`headerRows`	array	`[1]`	Row numbers containing headers
`delimiter`	string	`,`	Field delimiter character
`lineTerminator`	string	`\r\n`	Row separator
`quoteChar`	string	`"`	Quote character
`doubleQuote`	boolean	`true`	Whether quotes are escaped by doubling
`escapeChar`	string	-	Escape character (alternative to doubleQuote)
`skipInitialSpace`	boolean	`false`	Ignore whitespace after delimiter
`commentChar`	string	-	Comment line prefix
`caseSensitiveHeader`	boolean	`false`	Whether header names are case-sensitive

Supported Formats¶

The Table Dialect specification supports: - Delimited formats: CSV, TSV, pipe-delimited - Structured formats: JSON, YAML - Spreadsheets: Excel (.xlsx, .xls), OpenOffice (.ods) - Databases: SQL tables, query results

Example: Semicolon-Delimited CSV¶

{
  "delimiter": ";",
  "quoteChar": "'",
  "header": true
}

Example: Tab-Delimited with Comments¶

{
  "delimiter": "\t",
  "lineTerminator": "\n",
  "commentChar": "#",
  "header": true
}

Example: No Header Row¶

{
  "header": false,
  "delimiter": ",",
  "quoteChar": "\""
}

Example: Custom Quote Escaping¶

{
  "delimiter": ",",
  "quoteChar": "\"",
  "doubleQuote": false,
  "escapeChar": "\\"
}

Example: Multiline Headers¶

{
  "headerRows": [1, 2],
  "delimiter": ","
}

Important Notes¶

Table Dialect focuses on formatting, not data types or schema
Character encoding is handled separately (not part of dialect)
Backward compatible with the older "CSV Dialect" specification
Orthogonal to Table Schema (they work together but are independent)

Complete Example: Multi-Resource Package¶

Here's a comprehensive example showing all specifications working together:

datapackage.json:

{
  "$schema": "https://datapackage.org/profiles/2.0/datapackage.json",
  "name": "population-and-cities",
  "title": "Country Population and Major Cities",
  "version": "1.0.0",
  "description": "Population data by country with major cities information",
  "created": "2024-01-15T10:00:00Z",
  "licenses": [
    {
      "name": "CC-BY-4.0",
      "title": "Creative Commons Attribution 4.0",
      "path": "https://creativecommons.org/licenses/by/4.0/"
    }
  ],
  "contributors": [
    {
      "title": "Data Team",
      "email": "[email protected]",
      "roles": ["author", "maintainer"]
    }
  ],
  "keywords": ["population", "demographics", "cities"],
  "resources": [
    {
      "name": "countries",
      "type": "table",
      "path": "data/countries.csv",
      "title": "Country Population Data",
      "format": "csv",
      "mediatype": "text/csv",
      "encoding": "utf-8",
      "dialect": {
        "delimiter": ",",
        "header": true,
        "quoteChar": "\""
      },
      "schema": {
        "fields": [
          {
            "name": "country_code",
            "type": "string",
            "title": "ISO Country Code",
            "constraints": {
              "required": true,
              "unique": true,
              "pattern": "^[A-Z]{2}$"
            }
          },
          {
            "name": "country_name",
            "type": "string",
            "title": "Country Name",
            "constraints": {
              "required": true
            }
          },
          {
            "name": "population",
            "type": "integer",
            "title": "Total Population",
            "constraints": {
              "minimum": 0
            }
          },
          {
            "name": "area_km2",
            "type": "number",
            "title": "Area in km²"
          },
          {
            "name": "last_updated",
            "type": "date"
          }
        ],
        "primaryKey": "country_code",
        "missingValues": [
          {"value": "", "label": "NOT_PROVIDED"},
          {"value": "N/A", "label": "NOT_APPLICABLE"}
        ]
      }
    },
    {
      "name": "cities",
      "type": "table",
      "path": "data/cities.csv",
      "title": "Major Cities",
      "format": "csv",
      "schema": {
        "fields": [
          {
            "name": "city_id",
            "type": "integer",
            "constraints": {
              "required": true,
              "unique": true
            }
          },
          {
            "name": "city_name",
            "type": "string",
            "constraints": {
              "required": true
            }
          },
          {
            "name": "country_code",
            "type": "string",
            "constraints": {
              "required": true,
              "pattern": "^[A-Z]{2}$"
            }
          },
          {
            "name": "population",
            "type": "integer",
            "constraints": {
              "minimum": 0
            }
          },
          {
            "name": "coordinates",
            "type": "geopoint",
            "title": "City Coordinates"
          }
        ],
        "primaryKey": "city_id",
        "foreignKeys": [
          {
            "fields": ["country_code"],
            "reference": {
              "resource": "countries",
              "fields": ["country_code"]
            }
          }
        ]
      }
    },
    {
      "name": "regions",
      "title": "World Regions Lookup",
      "data": [
        {"code": "EU", "name": "Europe"},
        {"code": "AS", "name": "Asia"},
        {"code": "AF", "name": "Africa"},
        {"code": "NA", "name": "North America"},
        {"code": "SA", "name": "South America"},
        {"code": "OC", "name": "Oceania"}
      ],
      "schema": {
        "fields": [
          {"name": "code", "type": "string"},
          {"name": "name", "type": "string"}
        ],
        "primaryKey": "code"
      }
    }
  ]
}

This example demonstrates: - Multiple resources (CSV files and inline data) - Complete metadata (licenses, contributors, keywords) - Detailed table schemas with various field types - Constraints and validation rules - Primary and foreign keys for referential integrity - Table dialect specification for CSV parsing - Missing value definitions

Advanced Features¶

Custom Properties¶

Descriptors can include custom properties using namespace conventions:

{
  "name": "my-package",
  "myorg:internal_id": "12345",
  "myorg:department": "research",
  "resources": [ ]
}

Resource Caching¶

The _cache property provides fallback locations when primary data sources are unavailable:

{
  "name": "remote-data",
  "path": "https://example.com/data.csv",
  "_cache": "cache/data.csv"
}

Profiles¶

Specify custom profiles for validation:

{
  "profile": "https://example.com/profiles/custom-package.json",
  "name": "my-package",
  "resources": [ ]
}

Benefits and Use Cases¶

Benefits¶

FAIR Data Principles: Enhances findability, accessibility, interoperability, and reusability
Self-Describing: Metadata travels with data
Validation: Schemas enable automatic validation
Standardization: Consistent format across organizations
Documentation: Built-in comprehensive metadata
Versioning: Track changes and evolution
Relationships: Foreign keys enable data linking
Type Safety: Strong typing for data fields

Use Cases¶

Research Data Management: Package research datasets with complete provenance
Open Data Publishing: Government and public sector data sharing
Data Integration: ETL pipelines with standardized formats
Data Catalogs: Build searchable data repositories
Data Quality: Validate data against schemas
Collaborative Projects: Share data with clear documentation
Archival: Long-term data preservation with metadata

Tools and Software¶

The Data Package ecosystem includes:

Open Data Editor: Visual tool for creating and editing data packages
Frictionless Framework (Python): Create, validate, and transform packages
JavaScript libraries: Browser and Node.js support
R packages: Integration with R workflows
Command-line tools: Automation and scripting

Comparison with v1¶

Data Package v2 builds on v1 with improvements including:

Enhanced Table Schema with categories and ordered categories
More flexible missing values with labels
Table Dialect specification (evolved from CSV Dialect)
Better profile system for validation
Improved foreign key definitions
Extended metadata properties

References¶

Official Website: https://datapackage.org/
Specifications:
Data Package: https://datapackage.org/standard/data-package/
Data Resource: https://datapackage.org/standard/data-resource/
Table Schema: https://datapackage.org/standard/table-schema/
Table Dialect: https://datapackage.org/standard/table-dialect/
GitHub: https://github.com/frictionlessdata
Open Knowledge Foundation: https://okfn.org/

Data Package v2 is inspired by and compatible with: - DataCite - Zenodo - DCAT (Data Catalog Vocabulary) - CKAN - Schema.org

Data Package Standard (v2)¶

Overview¶

Purpose¶

Design Philosophy¶

Adoption¶

Software Support¶

Architecture¶

Key Concepts¶

Descriptor¶

Profile¶

Tabular Data¶

Data Representation¶

1. Data Package Specification¶

Requirements¶

Required Properties¶

Recommended Properties¶

Standard Properties¶

Example: Minimal Data Package¶

Example: Complete Data Package¶

File Structure¶

2. Data Resource Specification¶

Purpose¶

Required Properties¶

Data Location (one required)¶

Recommended Properties¶

Tabular Data Properties¶

Path Property¶

Example: Basic Resource¶

Example: Tabular Resource with Schema¶

Example: Inline Data¶

Example: Multiple Files¶

3. Table Schema Specification¶

Core Properties¶

Field Properties¶

Field Types¶

Field Constraints¶

Example: Basic Table Schema¶

Example: Schema with Constraints¶

Example: Categories¶

Example: Foreign Keys¶

Example: Missing Values¶

4. Table Dialect Specification¶

Purpose¶

Standard Properties¶

Supported Formats¶

Example: Semicolon-Delimited CSV¶

Example: Tab-Delimited with Comments¶

Example: No Header Row¶

Example: Custom Quote Escaping¶

Example: Multiline Headers¶

Important Notes¶

Complete Example: Multi-Resource Package¶

Advanced Features¶

Custom Properties¶

Resource Caching¶

Profiles¶

Benefits and Use Cases¶

Benefits¶

Use Cases¶

Tools and Software¶

Comparison with v1¶

References¶

Related Standards¶