Quick Start

Get started with sunstone-py in minutes.

1. Set Up Your Project with datasets.yaml

Create a datasets.yaml file in your project directory:

publish:
  enabled: true
  to: gs://my-bucket/datasets/schools/

inputs:
  - name: School Data
    slug: school-data
    location: data/schools.csv
    source:
      name: Ministry of Education
      location:
        data: https://example.com/schools.csv
      attributedTo: Ministry of Education
      acquiredAt: 2025-01-15
      acquisitionMethod: manual-download
      license: CC-BY-4.0
    fields:
      - name: school_id
        type: string
      - name: enrollment
        type: integer
      - name: district
        type: string

outputs: []
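The fields list above declares column names and types. As a rough illustration of what such a declaration lets you verify, here is a minimal standard-library sketch that checks CSV text against declared fields. Everything in it is an assumption made for illustration: the check_declared_fields helper, the PARSERS mapping, and the sample rows are invented, and sunstone-py may validate data quite differently (or not at this point at all). The district column is included because step 2 groups by it.

```python
import csv
import io

# Declarations in the style of the `fields` list in datasets.yaml.
DECLARED_FIELDS = {"school_id": "string", "district": "string", "enrollment": "integer"}

# Invented sample rows, for illustration only.
SAMPLE_CSV = """school_id,district,enrollment
S001,North,250
S002,North,80
S003,South,410
"""

# Hypothetical mapping from declared type names to Python parsers.
PARSERS = {"string": str, "integer": int}


def check_declared_fields(csv_text, declared):
    """Return a list of human-readable problems; an empty list means OK."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in declared if name not in header]
    if problems:
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for name, type_name in declared.items():
            try:
                PARSERS[type_name](row[name])
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: {name}={row[name]!r} is not {type_name}")
    return problems


if __name__ == "__main__":
    print(check_declared_fields(SAMPLE_CSV, DECLARED_FIELDS))
```

Running a check like this before registering a file catches a mismatched header or a non-numeric enrollment value early, before it surfaces in a downstream transformation.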

2. Use Pandas-Like API with Lineage Tracking

import sunstone
from sunstone import pandas as pd
from pathlib import Path

# Set the project path once. read_csv/read_excel/read_dataset and
# the DataFrame constructor will pick this up automatically.
sunstone.set_project_path(Path.cwd())

# Read data - lineage automatically tracked
df = pd.read_csv('data/schools.csv')

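# Transform using familiar pandas operations.
# as_index=False keeps 'district' as a regular column, so it is not
# dropped when the result is written with index=False.
result = df[df['enrollment'] > 100].groupby('district', as_index=False)['enrollment'].sum()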

# Save with automatic lineage tracking and dataset registration
result.to_csv(
    'outputs/summary.csv',
    slug='school-summary',
    name='School Enrollment Summary',
    index=False
)

You can still pass project_path= explicitly on any individual call, or wrap calls in a with sunstone.use_project_path(...) block for a scoped override.
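To make the three levels (default, scoped override, explicit argument) concrete, here is a minimal sketch of how such a lookup can be built with the standard library's contextvars. This is not sunstone-py's implementation. The names set_default_path, scoped_path, and resolve_path are hypothetical, and the precedence shown (explicit argument first, then the innermost scoped override, then the default) is an assumption about how a design like this typically behaves.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from pathlib import Path

# Holds the current default project path; None means "not configured".
_project_path = ContextVar("project_path", default=None)


def set_default_path(path):
    """Set the default for the current context (hypothetical helper)."""
    _project_path.set(Path(path))


@contextmanager
def scoped_path(path):
    """Temporarily override the default, restoring the previous value on exit."""
    token = _project_path.set(Path(path))
    try:
        yield
    finally:
        _project_path.reset(token)


def resolve_path(explicit=None):
    """An explicit argument wins; otherwise fall back to the current default."""
    if explicit is not None:
        return Path(explicit)
    current = _project_path.get()
    if current is None:
        raise RuntimeError("no project path set")
    return current


if __name__ == "__main__":
    set_default_path("/projects/a")
    with scoped_path("/projects/b"):
        print(resolve_path())               # scoped override applies here
        print(resolve_path("/projects/c"))  # explicit argument applies here
    print(resolve_path())                   # default is restored after the block
```

The token-based reset means nested overrides unwind correctly even if the body of the with block raises.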

3. Check Lineage Metadata

# View lineage information (via metadata container)
print(result.metadata.lineage.sources)      # Source datasets
print(result.metadata.lineage.get_licenses())  # All source licenses

# Check field-level provenance
for fd in result.metadata.lineage.field_derivations:
    print(f"  {fd.output_field} <- {fd.source_entity}")
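For readers who want a mental model of what the calls above return, here is a small sketch of a lineage record built from dataclasses. The attribute names sources, field_derivations, output_field, source_entity, and get_licenses mirror those used above. Everything else is an assumption for illustration: the class shapes, the slug and license attributes on Source, and the merge method are hypothetical, and the actual sunstone-py classes may look different.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Source:
    """A source dataset (hypothetical shape)."""
    slug: str
    license: str


@dataclass(frozen=True)
class FieldDerivation:
    """Records which source entity an output field came from."""
    output_field: str
    source_entity: str


@dataclass
class Lineage:
    sources: list = field(default_factory=list)
    field_derivations: list = field(default_factory=list)

    def get_licenses(self):
        """Return the distinct licenses of all sources, in first-seen order."""
        return list(dict.fromkeys(s.license for s in self.sources))

    def merge(self, other):
        """Combine two lineages, e.g. when two frames are joined."""
        sources = list(dict.fromkeys(self.sources + other.sources))
        return Lineage(sources, self.field_derivations + other.field_derivations)


if __name__ == "__main__":
    schools = Lineage(
        [Source("school-data", "CC-BY-4.0")],
        [FieldDerivation("enrollment", "school-data")],
    )
    # "district-data" and its license are invented for this example.
    districts = Lineage([Source("district-data", "ODbL-1.0")])
    combined = schools.merge(districts)
    print(combined.get_licenses())
```

Deduplicating with dict.fromkeys keeps the first-seen order, so the output stays stable when the same source feeds several inputs.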

4. Annotate Columns (Optional)

# Add descriptions and units to columns
result.set_field_metadata('enrollment', description='Total enrolled students', unit='students')

# Save as Parquet instead of CSV
result.to_parquet(
    'outputs/summary.parquet',
    slug='school-summary',
    name='School Enrollment Summary'
)
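As a rough picture of what column annotation involves, the sketch below keeps a description and a unit per column and exports only the columns that were actually annotated. It is a standalone illustration, not sunstone-py code. The FieldMetadata and ColumnAnnotations classes and the annotate method are hypothetical names, and how the library stores or serializes this information is not covered here.

```python
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FieldMetadata:
    """Optional annotations for one column (hypothetical shape)."""
    description: Optional[str] = None
    unit: Optional[str] = None


class ColumnAnnotations:
    def __init__(self, columns):
        # Every known column starts with empty metadata.
        self._meta = {name: FieldMetadata() for name in columns}

    def annotate(self, column, **updates):
        """Update only the given attributes, keeping earlier ones."""
        if column not in self._meta:
            raise KeyError(f"unknown column: {column}")
        self._meta[column] = replace(self._meta[column], **updates)

    def to_dict(self):
        """Export annotated columns only, skipping untouched ones."""
        empty = FieldMetadata()
        return {name: asdict(meta) for name, meta in self._meta.items() if meta != empty}


if __name__ == "__main__":
    cols = ColumnAnnotations(["district", "enrollment"])
    cols.annotate("enrollment", unit="students")
    cols.annotate("enrollment", description="Total enrolled students")
    print(cols.to_dict())
```

Rejecting unknown column names up front catches typos in annotations that would otherwise be silently ignored.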

Next Steps

Learn about the CLI tools for dataset management
Understand core concepts like strict mode and lineage tracking
Browse the API reference for detailed documentation
Check out examples for real-world usage patterns