API Reference¶
Complete API documentation for sunstone-py.
pandas Module¶
Drop-in replacement for pandas with lineage tracking.
Functions¶
read_dataset(slug, project_path=None, strict=None, fetch_from_url=True, format=None, **kwargs)¶
Read a dataset by slug with automatic format detection.
Parameters:
- slug (str): Dataset slug to look up in datasets.yaml
- project_path (str | Path | None): Path to project directory. Defaults to Path.cwd()
- strict (bool | None): Enable strict mode. If None, reads from SUNSTONE_DATAFRAME_STRICT env var
- fetch_from_url (bool): If True and dataset has a source URL but no local file, fetch automatically
- format (str | None): Format override ('csv', 'json', 'excel', 'parquet', 'tsv'). Auto-detected from extension if not provided
- **kwargs: Additional arguments passed to the underlying pandas reader
Returns: DataFrame with lineage tracking
Example:
df = pd.read_dataset('official-un-member-states')
df = pd.read_dataset('my-data', format='json', project_path='/path/to/project')
read_csv(filepath, project_path=None, strict=None, **kwargs)¶
Read CSV file with lineage tracking.
Parameters:
- filepath (str | Path): Path to CSV file, URL, or dataset slug
- project_path (str | Path | None): Path to project directory containing datasets.yaml. Defaults to Path.cwd()
- strict (bool | None): If True, dataset must be pre-registered in datasets.yaml. If None, reads from SUNSTONE_DATAFRAME_STRICT env var
- **kwargs: Additional arguments passed to pandas.read_csv()
Returns: DataFrame with lineage tracking
Raises:
- DatasetNotFoundError: If dataset not found in datasets.yaml
- StrictModeError: If strict=True and dataset not pre-registered
Example:
df = pd.read_csv(
'data/schools.csv',
project_path='/path/to/project',
strict=True,
encoding='utf-8'
)
read_excel(filepath, project_path=None, strict=None, fetch_from_url=True, **kwargs)¶
Read Excel file (.xlsx/.xls) with lineage tracking.
Parameters:
- filepath (str | Path): Path to Excel file or dataset slug
- project_path (str | Path | None): Path to project directory containing datasets.yaml. Defaults to Path.cwd()
- strict (bool | None): If True, dataset must be pre-registered. If None, reads from SUNSTONE_DATAFRAME_STRICT env var
- fetch_from_url (bool): If True and dataset has a source URL but no local file, automatically fetch from URL
- **kwargs: Additional arguments passed to pandas.read_excel()
Returns: DataFrame with lineage tracking
Raises:
- DatasetNotFoundError: If dataset not found in datasets.yaml
- FileNotFoundError: If datasets.yaml doesn't exist
Example:
# Load by slug (recommended)
df = pd.read_excel('my-excel-data', project_path='/path/to/project')
# Load by file path
df = pd.read_excel('data/schools.xlsx', project_path='/path/to/project', sheet_name='Sheet1')
read_json(filepath, project_path=None, strict=None, **kwargs)¶
Read JSON file with lineage tracking.
Parameters:
- filepath (str | Path): Path to JSON file or dataset slug
- project_path (str | Path | None): Path to project directory. Defaults to Path.cwd()
- strict (bool | None): Enable strict mode. If None, reads from SUNSTONE_DATAFRAME_STRICT env var
- **kwargs: Additional arguments passed to pandas.read_json()
Returns: DataFrame with lineage tracking
Example:
# Read a JSON file
df = pd.read_json('data/records.json', project_path=PROJECT_PATH)
# With pandas options
df = pd.read_json('data/records.json', orient='records', lines=True)
merge(left, right, **kwargs)¶
Merge DataFrames with combined lineage.
Parameters:
- left (DataFrame): Left DataFrame
- right (DataFrame): Right DataFrame
- **kwargs: Arguments passed to pandas.merge()
Returns: DataFrame with lineage from both sources
Example:
result = pd.merge(schools, teachers, on='school_id', how='inner')
print(len(result.lineage.sources)) # 2
concat(dfs, **kwargs)¶
Concatenate DataFrames with combined lineage.
Parameters:
- dfs (list[DataFrame]): List of DataFrames to concatenate
- **kwargs: Arguments passed to pandas.concat()
Returns: DataFrame with lineage from all sources
Example:
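A minimal sketch; df_2023 and df_2024 are placeholder DataFrames read with pd.read_csv:
combined = pd.concat([df_2023, df_2024], ignore_index=True)
print(len(combined.metadata.lineage.sources))  # lineage combines all inputs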
DataFrame Class¶
Main class for working with data and lineage.
Class Methods¶
read_csv(filepath, project_path, strict=False, **kwargs)¶
Read CSV file and return DataFrame.
Parameters: Same as read_csv() in the pandas module above
Returns: DataFrame instance
read_excel(filepath, project_path, strict=False, fetch_from_url=True, **kwargs)¶
Read Excel file and return DataFrame.
Parameters: Same as read_excel() in the pandas module above
Returns: DataFrame instance
Instance Methods¶
to_csv(path, slug, name, **kwargs)¶
Write DataFrame to CSV and register in datasets.yaml.
Parameters:
- path (str | Path): Output file path
- slug (str): Machine-readable identifier
- name (str): Human-readable name
- **kwargs: Arguments passed to pandas.DataFrame.to_csv()
Returns: None
Example:
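A minimal sketch; the path, slug, and name are placeholders:
df.to_csv(
    'outputs/results.csv',
    slug='analysis-results',
    name='Analysis Results',
    index=False  # extra kwargs pass through to pandas
)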
Note: Publishing is controlled by the top-level publish configuration in datasets.yaml, not per-dataset.
to_parquet(path, slug=None, name=None, track=True, **kwargs)¶
Write DataFrame to Parquet file and register in datasets.yaml.
Parameters:
- path (str | Path): Output file path
- slug (str | None): Machine-readable identifier (required in relaxed mode if not registered)
- name (str | None): Human-readable name (required in relaxed mode if not registered)
- track (bool): If False, write without lineage tracking or dataset registration
- **kwargs: Arguments passed to pandas.DataFrame.to_parquet()
Returns: None
Example:
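A minimal sketch; paths and identifiers are placeholders:
df.to_parquet('outputs/results.parquet', slug='analysis-results', name='Analysis Results')
# Scratch write without lineage tracking or registration
df.to_parquet('scratch/tmp.parquet', track=False)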
set_field_metadata(column, *, description, unit, source, type, constraints)¶
Set metadata for a column. Returns self for method chaining.
Parameters:
- column (str): Column name to annotate
- description (str, optional): Human-readable description of the field
- unit (str, optional): Unit of measure (e.g., 'kg', 'students', '%')
- source (str, optional): Slug of the input dataset this field comes from
- type (str, optional): Data type override. If None, inferred from dtype at write time
- constraints (dict, optional): Validation constraints (e.g., enum values)
Returns: DataFrame (self, for chaining)
Example:
df.set_field_metadata('population', description='Total population', unit='people')
df.set_field_metadata('gdp', description='Gross domestic product', unit='USD')
# Method chaining
df = (df
.set_field_metadata('area', unit='km^2')
.set_field_metadata('density', unit='people / km^2')
)
merge(right, **kwargs)¶
Merge with another DataFrame.
Parameters:
- right (DataFrame): DataFrame to merge with
- **kwargs: Arguments passed to pandas.merge()
Returns: New DataFrame with combined lineage
join(other, **kwargs)¶
Join with another DataFrame.
Parameters:
- other (DataFrame): DataFrame to join with
- **kwargs: Arguments passed to pandas.DataFrame.join()
Returns: New DataFrame with combined lineage
concat(others, **kwargs)¶
Concatenate with other DataFrames.
Parameters:
- others (list[DataFrame]): DataFrames to concatenate
- **kwargs: Arguments passed to pandas.concat()
Returns: New DataFrame with combined lineage
apply_operation(operation, description)¶
Apply transformation with lineage tracking.
Parameters:
- operation (callable): Function that takes a pandas DataFrame and returns a pandas DataFrame
- description (str): Human-readable description of the operation
Returns: New DataFrame with operation recorded in lineage
Example:
def adjust_enrollment(df):
return df.assign(adjusted=df['enrollment'] * 1.1)
result = df.apply_operation(
adjust_enrollment,
description="Apply 10% enrollment adjustment factor"
)
Instance Attributes¶
data¶
Access the underlying pandas DataFrame.
Type: pandas.DataFrame
Example:
# Get numpy array
values = df.data.values
# Use pandas methods not wrapped
styled = df.data.style.highlight_max()
metadata¶
Access the unified metadata container.
Type: Metadata
Example:
# Lineage is accessed through metadata
print(df.metadata.lineage.sources)
print(df.metadata.lineage.get_licenses())
# Dataset identity
df.metadata.slug = 'my-dataset'
df.metadata.name = 'My Dataset'
df.metadata.description = 'A description of this dataset'
# RDF prefixes and custom properties
df.metadata.rdf_prefixes = {'schema': 'http://schema.org/'}
df.metadata.custom_properties = {'schema:about': 'Education'}
# Per-field metadata (see set_field_metadata)
print(df.metadata.field_metadata)
lineage (deprecated)¶
Access lineage metadata directly. Use df.metadata.lineage instead.
Type: LineageMetadata
Example:
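A minimal sketch showing the deprecated and preferred access paths:
print(df.lineage.sources)           # deprecated
print(df.metadata.lineage.sources)  # preferred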
DatasetsManager Class¶
Manage datasets.yaml files programmatically.
Constructor¶
DatasetsManager(project_path, datasets_file=None)¶
Create a datasets manager.
Parameters:
- project_path (str | Path): Path to project directory containing datasets.yaml
- datasets_file (str | Path | None): Path to a specific datasets YAML file (relative to project_path or absolute). Defaults to "datasets.yaml"
Example:
manager = DatasetsManager('/path/to/project')
# Use a custom datasets file
manager = DatasetsManager('/path/to/project', datasets_file='config/my-datasets.yaml')
Methods¶
find_dataset_by_location(location, dataset_type=None)¶
Find dataset by file path.
Parameters:
- location (str): File path to search for
- dataset_type (str, optional): Filter by 'input' or 'output'
Returns: DatasetMetadata | None
Example:
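A minimal sketch; the location is a placeholder:
dataset = manager.find_dataset_by_location('data/schools.csv', dataset_type='input')
if dataset:
    print(dataset.slug)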
find_dataset_by_slug(slug, dataset_type=None)¶
Find dataset by slug identifier.
Parameters:
- slug (str): Slug to search for
- dataset_type (str, optional): Filter by 'input' or 'output'
Returns: DatasetMetadata | None
Example:
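A minimal sketch; the slug is a placeholder:
dataset = manager.find_dataset_by_slug('school-data')
if dataset:
    print(dataset.location)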
get_all_inputs()¶
Get all input datasets.
Returns: list[DatasetMetadata]
get_all_outputs()¶
Get all output datasets.
Returns: list[DatasetMetadata]
get_publish_config()¶
Get the top-level publish configuration.
Returns: PublishConfig | None
Example:
publish_config = manager.get_publish_config()
if publish_config and publish_config.enabled:
print(f"Publishing to: {publish_config.to}")
print(f"Flatten: {publish_config.flatten}")
add_output_dataset(name, slug, location, fields)¶
Register new output dataset.
Parameters:
- name (str): Human-readable name
- slug (str): Machine-readable identifier
- location (str): File path
- fields (list[FieldSchema]): Field definitions
Returns: None
Example:
from sunstone import FieldSchema
manager.add_output_dataset(
name='Analysis Results',
slug='analysis-results',
location='outputs/results.csv',
fields=[
FieldSchema(name='category', type='string'),
FieldSchema(name='count', type='integer'),
FieldSchema(name='avg_value', type='number')
]
)
Note: Use the top-level publish configuration in datasets.yaml to enable publishing for all outputs.
update_output_dataset(slug, **kwargs)¶
Update existing output dataset.
Parameters:
- slug (str): Dataset slug to update
- **kwargs: Fields to update (name, location, fields, etc.)
Returns: None
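Example (a minimal sketch; the slug and new values are placeholders):
manager.update_output_dataset(
    'analysis-results',
    name='Analysis Results (v2)',
    location='outputs/results_v2.csv'
)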
set_dataset_strict(slug, strict, dataset_type=None)¶
Enable or disable strict mode for a dataset.
Parameters:
- slug (str): Dataset slug
- strict (bool): True to enable strict mode, False to disable
- dataset_type (str, optional): Filter by 'input' or 'output'
Returns: None
Raises: DatasetNotFoundError if dataset not found
Example:
# Enable strict mode
manager.set_dataset_strict('school-data', True)
# Disable strict mode
manager.set_dataset_strict('school-data', False)
update_output_lineage(slug, lineage, content_hash, strict=False)¶
Update lineage metadata for an output dataset.
Parameters:
- slug (str): Output dataset slug
- lineage (LineageMetadata): Lineage metadata to write
- content_hash (str): Hash of the file content
- strict (bool): If True, validates without modifying
Returns: None
Raises:
- DatasetNotFoundError: If dataset not found
- DatasetValidationError: In strict mode, if lineage differs
Note: Timestamp only updates when content_hash changes.
get_absolute_path(location)¶
Convert relative path to absolute project path.
Parameters:
- location (str): Relative or absolute path
Returns: Path
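Example (a minimal sketch; assumes the manager created above):
abs_path = manager.get_absolute_path('data/schools.csv')
print(abs_path)  # e.g. /path/to/project/data/schools.csv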
Validation Functions¶
check_notebook_imports(notebook_path)¶
Validate a single notebook's imports.
Parameters:
- notebook_path (str | Path): Path to notebook file
Returns: ValidationResult
Example:
result = check_notebook_imports('analysis.ipynb')
if result.is_valid:
print("✓ Notebook uses sunstone imports")
else:
print(result.summary())
validate_project_notebooks(project_path)¶
Validate all notebooks in a project.
Parameters:
- project_path (str | Path): Path to project directory
Returns: dict[Path, ValidationResult]
Example:
results = validate_project_notebooks('/path/to/project')
for path, result in results.items():
if not result.is_valid:
print(f"\n{path}:")
print(result.summary())
Data Classes¶
FieldSchema¶
Field definition for datasets.
Attributes:
- name (str): Field name
- type (str | None): Field type (string, number, integer, boolean, date, datetime). If None, inferred from dtype at write time
- description (str, optional): Field description
- unit (str, optional): Unit of measure (e.g., 'kg', '%', 'people')
- source (str, optional): Slug of the input dataset this field's data comes from
- constraints (dict, optional): Validation constraints
Example:
from sunstone import FieldSchema
field = FieldSchema(
name='enrollment',
type='integer',
description='Number of enrolled students',
unit='students',
constraints={'minimum': 0}
)
# type can be omitted — it's inferred at write time
field = FieldSchema(name='ratio', description='Student-teacher ratio')
DatasetMetadata¶
Dataset metadata from datasets.yaml.
Attributes:
- name (str): Human-readable name
- slug (str): Machine-readable identifier
- location (str): File path
- fields (list[FieldSchema]): Field definitions
- source (SourceMetadata | None): Source attribution (inputs only)
- strict (bool): Strict mode enabled
- dataset_type (str): 'input' or 'output'
PublishConfig¶
Top-level publishing configuration.
Attributes:
- enabled (bool): Whether publishing is enabled
- to (str | None): Destination URL or path
- flatten (bool): Whether to flatten directory structure (default: False)
Path Resolution:
- If to ends with .json, it is used as the datapackage filename: gs://bucket/countries.json → datapackage at that exact path
- If to doesn't end with .json, it is treated as a directory: gs://bucket/datasets/project/ → /datapackage.json is appended
Example:
from sunstone import PublishConfig
config = PublishConfig(
enabled=True,
to='gs://my-bucket/datasets/project/',
flatten=False
)
LineageMetadata¶
Lineage tracking information. Aligned with W3C PROV-O.
Attributes:
- sources (list[DatasetMetadata]): Source datasets that contributed to this data
- created_at (datetime | None): Timestamp when lineage was last updated (content changed)
- content_hash (str | None): SHA256 hash of the DataFrame content
- activity (Activity | None): The PROV-O Activity that generated this data
- field_derivations (list[FieldDerivation] | None): Field-level derivation detail (prov:qualifiedDerivation)
Methods:
- get_licenses(): Return list of all source licenses
- add_source(source): Add source dataset
- populate_field_derivations(columns, slug): Auto-populate field derivations for columns from a source
- merge(other): Merge lineage from another DataFrame, combining sources and field derivations
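Example (a minimal sketch; df is a tracked DataFrame):
lineage = df.metadata.lineage
print(lineage.get_licenses())              # licenses from all sources
print([s.slug for s in lineage.sources])   # contributing dataset slugs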
Activity¶
A W3C PROV-O Activity representing a script or notebook execution.
Attributes:
- id (str): Unique identifier (e.g., 'exec-{timestamp}-{hash}')
- used (list[UsageRecord]): Input entities consumed by this activity
- generated (list[EntityRef]): Output entities produced
- was_associated_with (list[Agent]): Agents involved in this activity
- started_at (datetime | None): When the activity started
- ended_at (datetime | None): When the activity ended
- script_path (str | None): Path to the executed Python script
- git_commit (str | None): Git commit hash at time of execution
Agent¶
A W3C PROV-O Agent: something that bears responsibility for an activity.
Attributes:
- id (str): Unique identifier (username, org name, software name)
- type (AgentType): One of PERSON, SOFTWARE, ORGANIZATION
- label (str | None): Human-readable label
- version (str | None): Version string (for SoftwareAgent)
FieldDerivation¶
Records that an output field was derived from a source entity. Maps to prov:qualifiedDerivation at the field level.
Attributes:
- output_field (str): Name of the output column
- source_entity (str): Slug of the source dataset
- source_field (str | None): Name of the source field, if known
EntityRef¶
Lightweight reference to a PROV Entity (dataset).
Attributes:
- slug (str): Dataset slug identifier
- namespace (str | None): Optional namespace URI for external entities
UsageRecord¶
Records how an Activity used an Entity. Maps to prov:qualifiedUsage.
Attributes:
- entity (EntityRef): Which entity was used
- columns (list[str] | None): Which columns were selected (None means all)
- filters (dict | None): Filters applied during read
Metadata Class¶
Unified metadata container for DataFrames.
Attributes:
- lineage (LineageMetadata): Lineage metadata tracking data provenance
- description (str | None): Human-readable description of the dataset
- rdf_prefixes (dict | None): RDF namespace prefixes for custom properties
- custom_properties (dict | None): Custom properties including RDF triples
- field_metadata (dict[str, FieldSchema]): Per-column metadata, keyed by column name
- slug (str | None): Dataset slug, used at write time
- name (str | None): Human-readable dataset name, used at write time
Plugin System¶
The plugin system handles reading, writing, and URL resolution through a registry of handlers.
PluginRegistry¶
Central registry for auth providers, URL handlers, and format handlers.
PluginRegistry.get(project_path=None)¶
Return a cached registry instance. If project_path is provided, the registry is scoped to that project and loads project-specific plugin configuration.
Example:
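from sunstone.plugins import PluginRegistry
# Global registry
registry = PluginRegistry.get()
# Project-scoped registry (loads that project's plugin configuration)
registry = PluginRegistry.get(project_path='/path/to/project')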
registry.fetch(url, dest)¶
Download a URL to a local file using the appropriate URL handler.
Parameters:
- url (str): URL to download (supports http://, https://, gs://, s3://, r2://, local paths)
- dest (Path): Local destination file path
Returns: Path to the downloaded file
Example:
from pathlib import Path
from sunstone.plugins import PluginRegistry
registry = PluginRegistry.get()
registry.fetch('gs://my-bucket/data.csv', Path('data/local.csv'))
Note: DatasetsManager.fetch_from_url() is deprecated. Use PluginRegistry.get().fetch() instead.
Plugin Protocols¶
Plugins implement one or more of these protocols:
- AuthProvider: Provides authentication headers for HTTP requests
- URLHandler: Resolves URLs to readable/writable streams via open(url, mode)
- FormatHandler: Reads and writes data formats (CSV, JSON, Excel, Parquet, TSV)
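A hypothetical URLHandler sketch; only open(url, mode) is documented above, so the class name and scheme check are assumptions:
import io

class MyURLHandler:
    # Hypothetical handler for a made-up myproto:// scheme
    def open(self, url: str, mode: str = 'rb'):
        assert url.startswith('myproto://')
        return io.BytesIO(b'example bytes')  # return a readable stream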
Plugin Discovery¶
External plugins are discovered via the sunstone.plugins entry point group:
# In your plugin's pyproject.toml
[project.entry-points."sunstone.plugins"]
my-plugin = "my_package:MyPlugin"
Plugin Configuration¶
Configuration is loaded with cascading precedence:
- datasets.yaml → plugins.<name> section (highest priority)
- pyproject.toml → [tool.sunstone.plugins.<name>] section
- Environment variables → SUNSTONE_PLUGIN_<NAME>_<KEY>
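A hypothetical configuration sketch; the plugin name and keys are placeholders:
# datasets.yaml
plugins:
  my-plugin:
    endpoint: https://example.com/api

# Equivalent environment variable:
# SUNSTONE_PLUGIN_MY_PLUGIN_ENDPOINT=https://example.com/api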
Built-in URL Handlers¶
| Scheme | Handler | Extra |
|---|---|---|
| Local files | LocalFileHandler | Built-in |
| http://, https:// | HttpURLHandler | Built-in (with SSRF protection) |
| gs:// | GcsURLHandler | Requires sunstone-py[gcs] |
| s3://, r2:// | S3URLHandler | Requires sunstone-py[s3] |
Unit-Aware Arithmetic¶
sunstone-py integrates with Pint for unit-aware column arithmetic.
Unit Modes¶
Set via SUNSTONE_UNIT_MODE environment variable or programmatically:
from sunstone.units import set_unit_mode
set_unit_mode('strict') # Raise on unit mismatch
set_unit_mode('auto') # Auto-convert compatible units
set_unit_mode('relaxed') # No unit validation (default)
| Mode | Add/Sub mismatch | Mul/Div | Unknown units |
|---|---|---|---|
| relaxed | Allowed | Allowed | Allowed |
| strict | Error | Computes result unit | Error |
| auto | Auto-converts if compatible | Computes result unit | Warning |
Setting Units on Columns¶
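Units are attached per column with set_field_metadata (documented above); a minimal sketch:
df = (df
    .set_field_metadata('mass', unit='kg')
    .set_field_metadata('distance', unit='km')
)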
Unit Tracking Through Operations¶
When columns with units are used in merge, join, or concat operations, sunstone validates unit compatibility and (in auto mode) applies conversions automatically.
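A minimal sketch; the column and key names are placeholders, and behavior follows the mode table above:
from sunstone.units import set_unit_mode
set_unit_mode('auto')
# If left['distance'] is in km and right['distance'] is in m,
# auto mode converts the compatible units instead of raising
merged = left.merge(right, on='site_id')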
QUDT Round-Tripping¶
Units stored as QUDT URIs in datasets.yaml are preserved through read/write cycles via the unit_source field on FieldSchema.
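A hypothetical datasets.yaml field entry; the exact key layout is an assumption, with unit_source holding the QUDT URI:
fields:
  - name: area
    type: number
    unit: km^2
    unit_source: http://qudt.org/vocab/unit/KiloM2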
Exceptions¶
from sunstone.exceptions import (
SunstoneError,
DatasetNotFoundError,
StrictModeError,
DatasetValidationError,
LineageError
)
SunstoneError¶
Base exception for all sunstone-py errors.
DatasetNotFoundError¶
Raised when dataset not found in datasets.yaml.
Example:
try:
df = pd.read_csv('missing.csv', project_path=PROJECT_PATH)
except DatasetNotFoundError as e:
print(f"Dataset not registered: {e}")
StrictModeError¶
Raised when operation blocked in strict mode.
Example:
try:
df.to_csv('new.csv', slug='new', name='New', strict=True)
except StrictModeError as e:
print(f"Strict mode violation: {e}")
DatasetValidationError¶
Raised when dataset validation fails.
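Example (a minimal sketch using update_output_lineage in strict mode; arguments are placeholders):
try:
    manager.update_output_lineage('analysis-results', lineage, content_hash, strict=True)
except DatasetValidationError as e:
    print(f"Validation failed: {e}")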
LineageError¶
Raised when lineage tracking encounters an error.
Type Hints¶
sunstone-py includes complete type hints for IDE support:
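A minimal sketch; the annotations use classes documented on this page:
from sunstone import DatasetsManager, FieldSchema

manager: DatasetsManager = DatasetsManager('/path/to/project')
field: FieldSchema = FieldSchema(name='count', type='integer')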