Data Package Standard (v2)¶
Overview¶
Data Package is a comprehensive data standard and data definition language (DDL) designed to improve data management, sharing, and interoperability. Version 2 of the standard provides a modernized framework for describing datasets that enhances the findability, accessibility, interoperability, and reusability (FAIR) of data.
Purpose¶
- Create a standardized framework for describing datasets
- Provide a simple yet extensible data definition language
- Enable structured data packaging and delivery
- Facilitate easier data discovery and integration
- Support comprehensive dataset documentation
Design Philosophy¶
- Simple: Easy to understand and implement
- Extensible: Can be customized for specific use cases
- Language-agnostic: Works across different programming languages
- Interoperable: Compatible with existing data standards
- FAIR-focused: Promotes best practices in data management
Adoption¶
The Data Package standard is used by major organizations including: - European Commission - GitHub - GBIF (Global Biodiversity Information Facility) - Dryad - And many research institutions worldwide
Software Support¶
Open-source tools and libraries are available for multiple languages including Python, JavaScript, R, and more.
Architecture¶
The Data Package v2 standard consists of four core specifications that work together:
- Data Package: Container format for describing a coherent collection of data
- Data Resource: Format for describing individual data files or sources
- Table Schema: Detailed description of tabular data structure
- Table Dialect: Specification for how tabular data is physically stored
Key Concepts¶
Descriptor¶
A descriptor is a JSON object that contains metadata about a data package, resource, or schema. Descriptors can:
- Be stored in JSON files (e.g., datapackage.json)
- Include custom properties using namespace:propertyName convention
- Reference profiles for validation
Profile¶
A profile is a URL that: - Resolves to a valid JSON Schema descriptor - Is versioned and immutable - Serves as a metadata version identifier - Enables validation of descriptors
Tabular Data¶
Tabular data consists of: - Rows with consistent fields - Optional header row defining field names - Multiple possible formats (CSV, Excel, JSON, databases, etc.)
Data Representation¶
- Physical representation: How data appears in files (text, binary, etc.)
- Logical representation: "Ideal" data with defined types and structures
1. Data Package Specification¶
A Data Package is a simple container format for describing a coherent collection of data in a single package.
Requirements¶
- MUST have a descriptor file (preferably named
datapackage.json) - MUST include at least one resource
- Descriptor contains metadata about the package and its resources
Required Properties¶
resources(array): List of data resources (REQUIRED)
Recommended Properties¶
name: Unique, human-readable identifier (lowercase alphanumeric)id: Globally unique identifier (UUID, DOI, etc.)licenses: Licensing informationprofile: Profile URL for validation
Standard Properties¶
| Property | Type | Description |
|---|---|---|
$schema |
string | Profile URL (default: https://datapackage.org/profiles/2.0/datapackage.json) |
name |
string | Unique identifier, invariant across updates |
id |
string | Globally unique identifier |
title |
string | Human-readable title |
description |
string | Detailed description (supports Markdown) |
version |
string | Semantic version |
created |
string | Creation datetime (ISO 8601) |
homepage |
string | URL to package homepage |
licenses |
array | License objects |
sources |
array | Raw data source objects |
contributors |
array | Contributor objects |
keywords |
array | Search keywords |
image |
string | Representative image URL |
resources |
array | Data resource objects (REQUIRED) |
Example: Minimal Data Package¶
Example: Complete Data Package¶
{
"$schema": "https://datapackage.org/profiles/2.0/datapackage.json",
"name": "gdp-data",
"title": "A nice title",
"version": "1.0.0",
"created": "1985-04-12T23:20:50.52Z",
"licenses": [
{
"name": "ODC-PDDL-1.0",
"path": "http://opendatacommons.org/licenses/pddl/",
"title": "Open Data Commons Public Domain Dedication and License v1.0"
}
],
"sources": [
{
"title": "World Bank and OECD",
"path": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
}
],
"contributors": [
{
"title": "Joe Bloggs",
"email": "[email protected]",
"path": "http://www.bloggs.com",
"roles": ["creator"]
}
],
"resources": [
{
"name": "gdp",
"path": "data/gdp.csv",
"type": "table",
"schema": {
"fields": [
{
"name": "country",
"type": "string"
},
{
"name": "year",
"type": "integer"
},
{
"name": "gdp",
"type": "number"
}
]
}
}
]
}
File Structure¶
A typical Data Package directory structure:
my-package/
├── datapackage.json
├── README.md
├── data/
│ ├── mydata.csv
│ └── otherdata.csv
└── scripts/
└── process.py
2. Data Resource Specification¶
A Data Resource describes and packages a single data resource, such as an individual table or file.
Purpose¶
Provide a locator for data with optional rich metadata.
Required Properties¶
name: Unique identifier within the package (lowercase alphanumeric)
Data Location (one required)¶
path: URL or file system path to data file(s)data: Inline data representation
Recommended Properties¶
type: Resource type (e.g.,tablefor tabular data)title: Human-readable labeldescription: Detailed descriptionformat: File extension (e.g.,csv,json,xlsx)mediatype: MIME type (e.g.,text/csv)encoding: Character encoding (e.g.,utf-8)
Tabular Data Properties¶
When type is table:
- schema: Table Schema describing structure
- dialect: Table Dialect describing physical format
Path Property¶
The path property can be:
- A single file: "path": "data.csv"
- Multiple files: "path": ["data1.csv", "data2.csv"]
- A URL: "path": "http://example.com/data.csv"
- A relative POSIX path: "path": "data/myfile.csv"
Security restrictions:
- Cannot use absolute paths (/path/to/file)
- Cannot use parent paths (../file)
- Cannot reference hidden folders
Example: Basic Resource¶
{
"name": "solar-system",
"path": "http://example.com/solar-system.csv",
"title": "The Solar System",
"format": "csv",
"mediatype": "text/csv",
"encoding": "utf-8"
}
Example: Tabular Resource with Schema¶
{
"name": "population",
"type": "table",
"path": "data/population.csv",
"title": "Population Data",
"format": "csv",
"schema": {
"fields": [
{
"name": "country",
"type": "string"
},
{
"name": "population",
"type": "integer"
}
]
}
}
Example: Inline Data¶
{
"name": "countries",
"data": [
{"code": "US", "name": "United States"},
{"code": "GB", "name": "United Kingdom"},
{"code": "NO", "name": "Norway"}
]
}
Example: Multiple Files¶
{
"name": "annual-reports",
"path": [
"data/2020.csv",
"data/2021.csv",
"data/2022.csv"
],
"format": "csv"
}
3. Table Schema Specification¶
Table Schema is a format for declaring schemas for tabular data, providing type checking and validation capabilities.
Core Properties¶
| Property | Type | Description |
|---|---|---|
fields |
array | Field descriptors (REQUIRED) |
$schema |
string | Profile URL |
missingValues |
array | Defines how missing values are represented |
primaryKey |
string/array | Unique row identifier(s) |
foreignKeys |
array | Links between tables |
Field Properties¶
Each field in the fields array has:
| Property | Type | Description |
|---|---|---|
name |
string | Field name (REQUIRED) |
type |
string | Data type (recommended) |
format |
string | Specific format for the type |
title |
string | Human-readable title |
description |
string | Field description |
example |
any | Example value |
constraints |
object | Validation rules |
categories |
array | Categorical values with labels |
categoriesOrdered |
boolean | Whether categories have order |
Field Types¶
| Type | Description | Example |
|---|---|---|
string |
Text data | "Hello World" |
number |
Decimal numbers | 3.14, -42.5 |
integer |
Whole numbers | 42, -17 |
boolean |
True/false | true, false |
date |
Calendar date | 2024-01-15 |
datetime |
Date and time | 2024-01-15T10:30:00Z |
time |
Time of day | 10:30:00 |
year |
Year | 2024 |
yearmonth |
Year and month | 2024-01 |
duration |
Time duration | P1Y2M (ISO 8601) |
geopoint |
Geographic coordinates | [45.5231, -122.6765] |
geojson |
GeoJSON geometry | {...} |
object |
JSON object | {"key": "value"} |
array |
JSON array | [1, 2, 3] |
any |
Unspecified/mixed type | any value |
Field Constraints¶
| Constraint | Type | Description |
|---|---|---|
required |
boolean | Field cannot be null |
unique |
boolean | All values must be unique |
minLength |
integer | Minimum string length |
maxLength |
integer | Maximum string length |
minimum |
number | Minimum value |
maximum |
number | Maximum value |
pattern |
string | Regular expression validation |
enum |
array | Restrict to specific values |
Example: Basic Table Schema¶
{
"fields": [
{
"name": "id",
"type": "integer"
},
{
"name": "name",
"type": "string"
},
{
"name": "price",
"type": "number"
}
]
}
Example: Schema with Constraints¶
{
"fields": [
{
"name": "id",
"type": "integer",
"constraints": {
"required": true,
"unique": true
}
},
{
"name": "email",
"type": "string",
"format": "email",
"constraints": {
"required": true
}
},
{
"name": "price",
"type": "integer",
"constraints": {
"minimum": 100,
"maximum": 150,
"required": true
}
}
],
"primaryKey": "id"
}
Example: Categories¶
Categories allow you to define labeled values, useful for coded data:
{
"fields": [
{
"name": "fruit",
"type": "integer",
"categories": [
{"value": 0, "label": "apple"},
{"value": 1, "label": "orange"},
{"value": 2, "label": "banana"}
],
"categoriesOrdered": true
}
]
}
Example: Foreign Keys¶
Foreign keys create relationships between tables:
{
"resources": [
{
"name": "population-by-state",
"schema": {
"fields": [
{"name": "state-code", "type": "string"},
{"name": "population", "type": "integer"}
],
"foreignKeys": [
{
"fields": ["state-code"],
"reference": {
"resource": "state-codes",
"fields": ["code"]
}
}
]
}
},
{
"name": "state-codes",
"schema": {
"fields": [
{"name": "code", "type": "string"},
{"name": "name", "type": "string"}
],
"primaryKey": "code"
}
}
]
}
Example: Missing Values¶
Define how missing or null values are represented:
{
"fields": [
{"name": "temperature", "type": "number"},
{"name": "notes", "type": "string"}
],
"missingValues": [
{"value": "", "label": "OMITTED"},
{"value": "-99", "label": "REFUSED"},
{"value": "N/A", "label": "NOT_APPLICABLE"}
]
}
4. Table Dialect Specification¶
Table Dialect describes how tabular data is physically stored in a file, supporting various formats like CSV, TSV, JSON, Excel, and databases.
Purpose¶
- Define parsing rules for tabular data files
- Ensure data interoperability across different formats
- Provide configuration for delimiters, quotes, line terminators, etc.
Standard Properties¶
| Property | Type | Default | Description |
|---|---|---|---|
$schema |
string | - | Profile URL |
header |
boolean | true |
Whether first row contains field names |
headerRows |
array | [1] |
Row numbers containing headers |
delimiter |
string | , |
Field delimiter character |
lineTerminator |
string | \r\n |
Row separator |
quoteChar |
string | " |
Quote character |
doubleQuote |
boolean | true |
Whether quotes are escaped by doubling |
escapeChar |
string | - | Escape character (alternative to doubleQuote) |
skipInitialSpace |
boolean | false |
Ignore whitespace after delimiter |
commentChar |
string | - | Comment line prefix |
caseSensitiveHeader |
boolean | false |
Whether header names are case-sensitive |
Supported Formats¶
The Table Dialect specification supports: - Delimited formats: CSV, TSV, pipe-delimited - Structured formats: JSON, YAML - Spreadsheets: Excel (.xlsx, .xls), OpenOffice (.ods) - Databases: SQL tables, query results
Example: Semicolon-Delimited CSV¶
Example: Tab-Delimited with Comments¶
Example: No Header Row¶
Example: Custom Quote Escaping¶
Example: Multiline Headers¶
Important Notes¶
- Table Dialect focuses on formatting, not data types or schema
- Character encoding is handled separately (not part of dialect)
- Backward compatible with the older "CSV Dialect" specification
- Orthogonal to Table Schema (they work together but are independent)
Complete Example: Multi-Resource Package¶
Here's a comprehensive example showing all specifications working together:
datapackage.json:
{
"$schema": "https://datapackage.org/profiles/2.0/datapackage.json",
"name": "population-and-cities",
"title": "Country Population and Major Cities",
"version": "1.0.0",
"description": "Population data by country with major cities information",
"created": "2024-01-15T10:00:00Z",
"licenses": [
{
"name": "CC-BY-4.0",
"title": "Creative Commons Attribution 4.0",
"path": "https://creativecommons.org/licenses/by/4.0/"
}
],
"contributors": [
{
"title": "Data Team",
"email": "[email protected]",
"roles": ["author", "maintainer"]
}
],
"keywords": ["population", "demographics", "cities"],
"resources": [
{
"name": "countries",
"type": "table",
"path": "data/countries.csv",
"title": "Country Population Data",
"format": "csv",
"mediatype": "text/csv",
"encoding": "utf-8",
"dialect": {
"delimiter": ",",
"header": true,
"quoteChar": "\""
},
"schema": {
"fields": [
{
"name": "country_code",
"type": "string",
"title": "ISO Country Code",
"constraints": {
"required": true,
"unique": true,
"pattern": "^[A-Z]{2}$"
}
},
{
"name": "country_name",
"type": "string",
"title": "Country Name",
"constraints": {
"required": true
}
},
{
"name": "population",
"type": "integer",
"title": "Total Population",
"constraints": {
"minimum": 0
}
},
{
"name": "area_km2",
"type": "number",
"title": "Area in km²"
},
{
"name": "last_updated",
"type": "date"
}
],
"primaryKey": "country_code",
"missingValues": [
{"value": "", "label": "NOT_PROVIDED"},
{"value": "N/A", "label": "NOT_APPLICABLE"}
]
}
},
{
"name": "cities",
"type": "table",
"path": "data/cities.csv",
"title": "Major Cities",
"format": "csv",
"schema": {
"fields": [
{
"name": "city_id",
"type": "integer",
"constraints": {
"required": true,
"unique": true
}
},
{
"name": "city_name",
"type": "string",
"constraints": {
"required": true
}
},
{
"name": "country_code",
"type": "string",
"constraints": {
"required": true,
"pattern": "^[A-Z]{2}$"
}
},
{
"name": "population",
"type": "integer",
"constraints": {
"minimum": 0
}
},
{
"name": "coordinates",
"type": "geopoint",
"title": "City Coordinates"
}
],
"primaryKey": "city_id",
"foreignKeys": [
{
"fields": ["country_code"],
"reference": {
"resource": "countries",
"fields": ["country_code"]
}
}
]
}
},
{
"name": "regions",
"title": "World Regions Lookup",
"data": [
{"code": "EU", "name": "Europe"},
{"code": "AS", "name": "Asia"},
{"code": "AF", "name": "Africa"},
{"code": "NA", "name": "North America"},
{"code": "SA", "name": "South America"},
{"code": "OC", "name": "Oceania"}
],
"schema": {
"fields": [
{"name": "code", "type": "string"},
{"name": "name", "type": "string"}
],
"primaryKey": "code"
}
}
]
}
This example demonstrates: - Multiple resources (CSV files and inline data) - Complete metadata (licenses, contributors, keywords) - Detailed table schemas with various field types - Constraints and validation rules - Primary and foreign keys for referential integrity - Table dialect specification for CSV parsing - Missing value definitions
Advanced Features¶
Custom Properties¶
Descriptors can include custom properties using namespace conventions:
{
"name": "my-package",
"myorg:internal_id": "12345",
"myorg:department": "research",
"resources": [ ]
}
Resource Caching¶
The _cache property provides fallback locations when primary data sources are unavailable:
Profiles¶
Specify custom profiles for validation:
{
"profile": "https://example.com/profiles/custom-package.json",
"name": "my-package",
"resources": [ ]
}
Benefits and Use Cases¶
Benefits¶
- FAIR Data Principles: Enhances findability, accessibility, interoperability, and reusability
- Self-Describing: Metadata travels with data
- Validation: Schemas enable automatic validation
- Standardization: Consistent format across organizations
- Documentation: Built-in comprehensive metadata
- Versioning: Track changes and evolution
- Relationships: Foreign keys enable data linking
- Type Safety: Strong typing for data fields
Use Cases¶
- Research Data Management: Package research datasets with complete provenance
- Open Data Publishing: Government and public sector data sharing
- Data Integration: ETL pipelines with standardized formats
- Data Catalogs: Build searchable data repositories
- Data Quality: Validate data against schemas
- Collaborative Projects: Share data with clear documentation
- Archival: Long-term data preservation with metadata
Tools and Software¶
The Data Package ecosystem includes:
- Open Data Editor: Visual tool for creating and editing data packages
- Frictionless Framework (Python): Create, validate, and transform packages
- JavaScript libraries: Browser and Node.js support
- R packages: Integration with R workflows
- Command-line tools: Automation and scripting
Comparison with v1¶
Data Package v2 builds on v1 with improvements including:
- Enhanced Table Schema with categories and ordered categories
- More flexible missing values with labels
- Table Dialect specification (evolved from CSV Dialect)
- Better profile system for validation
- Improved foreign key definitions
- Extended metadata properties
References¶
- Official Website: https://datapackage.org/
- Specifications:
- Data Package: https://datapackage.org/standard/data-package/
- Data Resource: https://datapackage.org/standard/data-resource/
- Table Schema: https://datapackage.org/standard/table-schema/
- Table Dialect: https://datapackage.org/standard/table-dialect/
- GitHub: https://github.com/frictionlessdata
- Open Knowledge Foundation: https://okfn.org/
Related Standards¶
Data Package v2 is inspired by and compatible with: - DataCite - Zenodo - DCAT (Data Catalog Vocabulary) - CKAN - Schema.org