Sunstone Python Library¶
A Python library for managing datasets with lineage tracking in data science projects.
Overview¶
sunstone-py helps data scientists and researchers build reproducible data pipelines with automatic lineage tracking. It provides a pandas-compatible API that tracks where your data comes from, what operations you perform, and maintains a complete audit trail—all with minimal changes to your existing code.
Key Features¶
- W3C PROV-O Lineage: Every transformation is recorded using the W3C provenance standard—know exactly where your data came from, what happened to it, and which fields derived from which sources
- Dataset Management: Centralized datasets.yaml configuration for all data inputs and outputs
- DataFrame Metadata: Unified metadata container with per-field annotations (descriptions, units, source tracking)
- Plugin System: Extensible URL handlers (local, HTTP, GCS, S3/R2) and format handlers (CSV, JSON, Excel, Parquet, TSV) with entry point discovery
- Unit-Aware Arithmetic: Pint integration for column-level unit tracking with automatic compatibility checks and QUDT round-tripping
- Semantic Metadata: RDF triple support with automatic prefix expansion for rich dataset descriptions
- Command-Line Tools: Validate, lock, and publish datasets with the sunstone CLI
- Pandas-Compatible: Familiar API via from sunstone import pandas as pd—supports CSV, Excel, JSON, and Parquet
- Strict/Relaxed Modes: Choose between automatic registration (exploratory) or enforced pre-registration (production)
- Data Package Publishing: Build standards-compliant data packages and push to cloud storage
- Full Type Hints: Complete type annotation support for better IDE integration and type safety
Why sunstone-py?¶
Problem: In data science projects, it's hard to track:
- Where did this dataset come from?
- What transformations were applied?
- Which outputs are derived from which inputs?
- Is this analysis reproducible?
Solution: sunstone-py automatically tracks all of this as you work, storing metadata in a human-readable datasets.yaml file and maintaining lineage through your pandas operations.
Installation¶
Development Installation¶
For local development or contributing:
git clone https://github.com/sunstoneinstitute/sunstone-py.git
cd sunstone-py
uv venv
uv sync
source .venv/bin/activate # On Windows: .venv\Scripts\activate
Installing from Git¶
To use the latest development version:
# pyproject.toml
dependencies = [
"sunstone-py @ git+https://github.com/sunstoneinstitute/sunstone-py.git",
]
Or for local development with live changes:
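One way to do this with uv is a local path dependency marked editable, so changes in your checkout take effect immediately. The path below is illustrative; point it at wherever your clone lives:

```toml
# pyproject.toml — editable path dependency via uv (path is an example)
[tool.uv.sources]
sunstone-py = { path = "../sunstone-py", editable = true }
```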
Quick Example¶
from sunstone import pandas as pd
from pathlib import Path
PROJECT_PATH = Path.cwd()
# Read data - lineage automatically tracked
schools = pd.read_csv('data/schools.csv', project_path=PROJECT_PATH)
# Transform using familiar pandas operations
summary = schools[schools['enrollment'] > 100].groupby('district').sum()
# Save with automatic lineage tracking
summary.to_csv(
'outputs/summary.csv',
slug='school-summary',
name='School Enrollment Summary',
index=False
)
# Check what went into this dataset
print(summary.metadata.lineage.sources) # Source datasets
print(summary.metadata.lineage.get_licenses()) # Source licenses
That's it! The lineage is automatically tracked and saved to datasets.yaml.
What Gets Tracked?¶
For Every Dataset¶
- Metadata: Name, description, location, schema
- Source Attribution: Where the data came from, when acquired, license
- Lineage: Which datasets were used to create this one
- Operations: What transformations were applied
- Versioning: Content hash and timestamps (only updates when data changes)
Automatically Generated¶
When you save a DataFrame, sunstone-py:
- Registers the dataset in datasets.yaml (in relaxed mode)
- Infers the schema from your data
- Records all source datasets in the lineage
- Calculates a content hash (for change detection)
- Saves operation descriptions
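The change-detection step can be sketched with a plain hash over the serialized bytes. This is an illustration of the idea only, not sunstone-py's actual hashing scheme, which may normalize or hash data differently:

```python
import hashlib

def content_hash(serialized: bytes) -> str:
    # Hash the serialized dataset; identical bytes yield an identical digest,
    # so metadata only needs updating when the digest changes.
    return hashlib.sha256(serialized).hexdigest()

v1 = b"district,enrollment\nnorth,120\n"
v2 = b"district,enrollment\nnorth,125\n"

assert content_hash(v1) == content_hash(v1)  # stable across re-saves
assert content_hash(v1) != content_hash(v2)  # changes when the data changes
```

Because the digest is deterministic, re-saving unchanged data leaves the recorded hash (and the dataset's version metadata) untouched.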
Getting Started¶
Ready to dive in? Here's your learning path:
- Quick Start - Get up and running in 5 minutes
- Core Concepts - Understand lineage tracking and strict/relaxed modes
- Data Packages How-To - Configure inputs, outputs, and packages for publishing
- CLI Guide - Learn the command-line tools for dataset management
- API Reference - Complete API documentation
- Examples - Real-world usage patterns and workflows
- Migration: datasets.lock.yaml - Upgrade guide for v1.7 lock file split
Common Use Cases¶
Research & Academia¶
- Track data provenance for reproducible research
- Document data sources and transformations for publications
- Share datasets with complete lineage metadata
- Validate analyses before publication
Production Pipelines¶
- Enforce dataset registration with strict mode
- Build and publish data packages to cloud storage
- Validate datasets in CI/CD pipelines
- Maintain audit trails for compliance
Data Science Teams¶
- Centralized dataset catalog in datasets.yaml
- Automatic schema inference and validation
- Track which analyses depend on which datasets
- Share work with complete documentation
Command-Line Tools¶
The sunstone CLI provides tools for dataset management:
# List all datasets
sunstone dataset list
# Validate datasets.yaml structure
sunstone dataset validate
# Enable strict mode for production
sunstone dataset strict
# Build a Data Package
sunstone package build
# Push to Google Cloud Storage
sunstone package push --env prod
See the CLI Guide for complete documentation.
Key Concepts¶
Lineage Tracking (W3C PROV-O)¶
Every DataFrame automatically tracks:
- Sources: Which datasets were read
- Activities: Script/notebook executions with agents and timestamps
- Field Derivations: Which output columns came from which source datasets
- Attribution: Licenses and source information
Lineage propagates through operations like merge, join, concat, and custom transformations. Field-level derivations are auto-populated on read.
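As a conceptual illustration of how lineage propagates through a merge (a toy model, not the sunstone API), the result's source set is the union of both inputs' sources:

```python
from dataclasses import dataclass, field

@dataclass
class TrackedFrame:
    # Toy stand-in for a lineage-aware DataFrame; not the real sunstone types.
    data: dict
    sources: set = field(default_factory=set)

def merge(left: TrackedFrame, right: TrackedFrame) -> TrackedFrame:
    # Lineage propagation: the output inherits every source of both inputs.
    combined = {**left.data, **right.data}
    return TrackedFrame(combined, left.sources | right.sources)

schools = TrackedFrame({"id": [1, 2]}, {"data/schools.csv"})
districts = TrackedFrame({"district": ["n", "s"]}, {"data/districts.csv"})

result = merge(schools, districts)
assert result.sources == {"data/schools.csv", "data/districts.csv"}
```

In sunstone-py this bookkeeping happens for you inside the pandas-compatible operations, so the merged DataFrame's metadata already lists both source datasets.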
Strict vs Relaxed Mode¶
- Relaxed Mode (default): Auto-registers new datasets, perfect for exploration
- Strict Mode: Requires pre-registration, enforces documentation for production
Switch between modes per-operation, globally, or via CLI.
Dataset Management¶
All datasets live in datasets.yaml:
- Inputs: External data sources with attribution
- Outputs: Generated datasets with lineage
- Schemas: Field names and types
- Metadata: Publishing config, strict mode flags
Learn more in Core Concepts.
Integration with Data Package Standard¶
sunstone-py builds on the Data Package v2 standard, an open specification for data distribution. You can:
- Build standards-compliant datapackage.json files
- Add RDF triples with automatic prefix expansion (DCAT, Dublin Core, schema.org, custom vocabularies)
- Publish to cloud storage (GCS, S3, etc.)
- Integrate with tools that consume Data Packages
- Share data with complete metadata
See the Data Packages How-To for a practical guide to configuring packages, or the Data Package Standard for the full specification.
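For orientation, a minimal datapackage.json per the Data Package standard looks like the sketch below; the names and paths are illustrative, and sunstone-py's generated output will carry additional metadata:

```json
{
  "name": "school-summary",
  "resources": [
    {
      "name": "summary",
      "path": "outputs/summary.csv",
      "format": "csv"
    }
  ]
}
```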
Development¶
Running Tests¶
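Assuming the project uses pytest as its test runner (consistent with the uv-based setup above):

```shell
uv run pytest
```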
Type Checking¶
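Assuming mypy is the project's type checker (the library ships full type hints); the target path may differ in your checkout:

```shell
uv run mypy .
```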
Linting and Formatting¶
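Assuming ruff handles both linting and formatting, which is common in uv-based Python projects:

```shell
uv run ruff check .
uv run ruff format .
```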
Documentation¶
This documentation is built with MkDocs and the Material theme:
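To preview the docs locally with live reload (standard MkDocs workflow):

```shell
uv run mkdocs serve
```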
Support & Contributing¶
- Documentation: https://sunstoneinstitute.github.io/sunstone-py/
- Issues: GitHub Issues
- Source Code: GitHub Repository
- PyPI: sunstone-py
Contributions are welcome! Please feel free to submit issues or pull requests.
About Sunstone Institute¶
Sunstone Institute is a philanthropy-funded organization using data and AI to show the world as it really is and to inspire action everywhere.
License¶
MIT License - see LICENSE file for details.