Sunstone Python Library

A Python library for managing datasets with lineage tracking in data science projects.

Python 3.12+ · License: MIT

Overview

sunstone-py helps data scientists and researchers build reproducible data pipelines with automatic lineage tracking. It provides a pandas-compatible API that tracks where your data comes from, what operations you perform, and maintains a complete audit trail—all with minimal changes to your existing code.

Key Features

  • W3C PROV-O Lineage: Every transformation is recorded using the W3C provenance standard—know exactly where your data came from, what happened to it, and which fields derived from which sources
  • Dataset Management: Centralized datasets.yaml configuration for all data inputs and outputs
  • DataFrame Metadata: Unified metadata container with per-field annotations (descriptions, units, source tracking)
  • Plugin System: Extensible URL handlers (local, HTTP, GCS, S3/R2) and format handlers (CSV, JSON, Excel, Parquet, TSV) with entry point discovery
  • Unit-Aware Arithmetic: Pint integration for column-level unit tracking with automatic compatibility checks and QUDT round-tripping
  • Semantic Metadata: RDF triple support with automatic prefix expansion for rich dataset descriptions
  • Command-Line Tools: Validate, lock, and publish datasets with the sunstone CLI
  • Pandas-Compatible: Familiar API via from sunstone import pandas as pd—supports CSV, Excel, JSON, and Parquet
  • Strict/Relaxed Modes: Choose between automatic registration (exploratory) or enforced pre-registration (production)
  • Data Package Publishing: Build standards-compliant data packages and push to cloud storage
  • Full Type Hints: Complete type annotation support for better IDE integration and type safety

Why sunstone-py?

Problem: In data science projects, it's hard to track:

  • Where did this dataset come from?
  • What transformations were applied?
  • Which outputs are derived from which inputs?
  • Is this analysis reproducible?

Solution: sunstone-py automatically tracks all of this as you work, storing metadata in a human-readable datasets.yaml file and maintaining lineage through your pandas operations.

Installation

# Using uv (recommended)
uv add sunstone-py

# Using pip
pip install sunstone-py

Development Installation

For local development or contributing:

git clone https://github.com/sunstoneinstitute/sunstone-py.git
cd sunstone-py
uv venv
uv sync
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Installing from Git

To use the latest development version:

# pyproject.toml
dependencies = [
    "sunstone-py @ git+https://github.com/sunstoneinstitute/sunstone-py.git",
]

Or for local development with live changes:

dependencies = [
    "sunstone-py @ file://${HOME}/git/sunstone-py"
]

Quick Example

from sunstone import pandas as pd
from pathlib import Path

PROJECT_PATH = Path.cwd()

# Read data - lineage automatically tracked
schools = pd.read_csv('data/schools.csv', project_path=PROJECT_PATH)

# Transform using familiar pandas operations
summary = schools[schools['enrollment'] > 100].groupby('district').sum().reset_index()

# Save with automatic lineage tracking
summary.to_csv(
    'outputs/summary.csv',
    slug='school-summary',
    name='School Enrollment Summary',
    index=False
)

# Check what went into this dataset
print(summary.metadata.lineage.sources)      # Source datasets
print(summary.metadata.lineage.get_licenses())  # Source licenses

That's it! The lineage is automatically tracked and saved to datasets.yaml.

What Gets Tracked?

For Every Dataset

  • Metadata: Name, description, location, schema
  • Source Attribution: Where the data came from, when acquired, license
  • Lineage: Which datasets were used to create this one
  • Operations: What transformations were applied
  • Versioning: Content hash and timestamps (only updates when data changes)
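The change-detection step can be illustrated with a generic content hash. This is a sketch using only the standard library; sunstone-py's actual hashing scheme and algorithm choice may differ:

```python
import hashlib

def content_hash(path: str) -> str:
    """Hash a file's bytes so re-saves with identical content are detected."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large datasets don't need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```

Because identical bytes always produce the same digest, version timestamps only need to change when the hash changes.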

Automatically Generated

When you save a DataFrame, sunstone-py:

  1. Registers the dataset in datasets.yaml (in relaxed mode)
  2. Infers the schema from your data
  3. Records all source datasets in the lineage
  4. Calculates a content hash (for change detection)
  5. Saves operation descriptions
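The resulting registration might look something like the entry below. This is an illustrative sketch only; the exact field names and layout of datasets.yaml may differ:

```yaml
datasets:
  school-summary:
    name: School Enrollment Summary
    path: outputs/summary.csv
    schema:
      fields:
        - {name: district, type: string}
        - {name: enrollment, type: integer}
    lineage:
      sources: [schools]
    hash: sha256:<content-hash>  # updated only when the data changes
```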

Getting Started

Ready to dive in? Here's your learning path:

  1. Quick Start - Get up and running in 5 minutes
  2. Core Concepts - Understand lineage tracking and strict/relaxed modes
  3. Data Packages How-To - Configure inputs, outputs, and packages for publishing
  4. CLI Guide - Learn the command-line tools for dataset management
  5. API Reference - Complete API documentation
  6. Examples - Real-world usage patterns and workflows
  7. Migration: datasets.lock.yaml - Upgrade guide for v1.7 lock file split

Common Use Cases

Research & Academia

  • Track data provenance for reproducible research
  • Document data sources and transformations for publications
  • Share datasets with complete lineage metadata
  • Validate analyses before publication

Production Pipelines

  • Enforce dataset registration with strict mode
  • Build and publish data packages to cloud storage
  • Validate datasets in CI/CD pipelines
  • Maintain audit trails for compliance

Data Science Teams

  • Centralized dataset catalog in datasets.yaml
  • Automatic schema inference and validation
  • Track which analyses depend on which datasets
  • Share work with complete documentation

Command-Line Tools

The sunstone CLI provides tools for dataset management:

# List all datasets
sunstone dataset list

# Validate datasets.yaml structure
sunstone dataset validate

# Enable strict mode for production
sunstone dataset strict

# Build a Data Package
sunstone package build

# Push to Google Cloud Storage
sunstone package push --env prod

See the CLI Guide for complete documentation.

Key Concepts

Lineage Tracking (W3C PROV-O)

Every DataFrame automatically tracks:

  • Sources: Which datasets were read
  • Activities: Script/notebook executions with agents and timestamps
  • Field Derivations: Which output columns came from which source datasets
  • Attribution: Licenses and source information

Lineage propagates through operations like merge, join, concat, and custom transformations. Field-level derivations are auto-populated on read.
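The propagation rule can be sketched generically: the lineage of a combined result is the union of its operands' sources plus the activity that combined them. This is a simplified model for illustration, not sunstone-py's internal representation:

```python
from dataclasses import dataclass, field

@dataclass
class Lineage:
    """Minimal PROV-style lineage: source datasets and recorded activities."""
    sources: set[str] = field(default_factory=set)
    activities: list[str] = field(default_factory=list)

    def merged_with(self, other: "Lineage", activity: str) -> "Lineage":
        # A binary operation (merge/join/concat) unions the sources
        # and appends the activity that combined them.
        return Lineage(
            sources=self.sources | other.sources,
            activities=self.activities + other.activities + [activity],
        )

schools = Lineage(sources={"schools.csv"})
districts = Lineage(sources={"districts.csv"})
joined = schools.merged_with(districts, "merge on district_id")
# joined now records both source files and the merge activity
```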

Strict vs Relaxed Mode

  • Relaxed Mode (default): Auto-registers new datasets, perfect for exploration
  • Strict Mode: Requires pre-registration, enforces documentation for production

Switch between modes per-operation, globally, or via CLI.

Dataset Management

All datasets live in datasets.yaml:

  • Inputs: External data sources with attribution
  • Outputs: Generated datasets with lineage
  • Schemas: Field names and types
  • Metadata: Publishing config, strict mode flags

Learn more in Core Concepts.

Integration with Data Package Standard

sunstone-py builds on the Data Package v2 standard, an open specification for data distribution. You can:

  • Build standards-compliant datapackage.json files
  • Add RDF triples with automatic prefix expansion (DCAT, Dublin Core, schema.org, custom vocabularies)
  • Publish to cloud storage (GCS, S3, etc.)
  • Integrate with tools that consume Data Packages
  • Share data with complete metadata

See the Data Packages How-To for a practical guide to configuring packages, or the Data Package Standard for the full specification.
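A minimal descriptor looks roughly like the sketch below, following the Data Package standard's required shape (a `name`, a `resources` array with paths and Table Schema fields). The files sunstone-py actually emits may carry additional metadata:

```json
{
  "name": "school-summary",
  "resources": [
    {
      "name": "summary",
      "path": "outputs/summary.csv",
      "format": "csv",
      "schema": {
        "fields": [
          {"name": "district", "type": "string"},
          {"name": "enrollment", "type": "integer"}
        ]
      }
    }
  ],
  "licenses": [{"name": "MIT"}]
}
```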

Development

Running Tests

uv run pytest

Type Checking

uv run mypy

Linting and Formatting

uv run ruff check
uv run ruff format

Documentation

This documentation is built with MkDocs and the Material theme:

uv run mkdocs serve  # Preview locally
uv run mkdocs build  # Build static site

Support & Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

About Sunstone Institute

Sunstone Institute is a philanthropy-funded organization using data and AI to show the world as it really is, and inspire action everywhere.

License

MIT License - see LICENSE file for details.