dataio.scripts.sync_dataset_documentation

Sync dataset documentation (README.md and metadata.json) from the S3 file server to the database.

This script fetches README.md and metadata.json files from the S3 filestore and caches their contents in the datasets table for faster access.

Usage:

# Sync all datasets
uv run python -m dataio.scripts.sync_dataset_documentation

# Sync specific dataset
uv run python -m dataio.scripts.sync_dataset_documentation --dataset DS_EXAMPLE01

# Dry run (show what would be synced)
uv run python -m dataio.scripts.sync_dataset_documentation --dry-run
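The two flags in the usage examples could be parsed along the following lines. This is a minimal sketch, not the script's actual parser: only the flag names `--dataset` and `--dry-run` come from the usage above, while the function name `build_parser`, the help strings, and the defaults are assumptions for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        prog="sync_dataset_documentation",
        description="Sync README.md and metadata.json from S3 into the database.",
    )
    # Assumption: omitting --dataset means "sync all datasets".
    parser.add_argument("--dataset", default=None, help="Dataset ID, e.g. DS_EXAMPLE01")
    # Assumption: --dry-run is a boolean switch that suppresses database writes.
    parser.add_argument("--dry-run", action="store_true", help="Show what would be synced")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--dataset", "DS_EXAMPLE01", "--dry-run"])
print(args.dataset, args.dry_run)
```

Note that argparse exposes `--dry-run` as the attribute `dry_run`, which lines up with the `dry_run` parameter of `sync_dataset_documentation`.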

Module Contents

Functions

get_database_url

Build database URL from environment variables.

get_s3_client

Initialize S3 client.

fetch_file_from_s3

Fetch a file from S3 for a dataset.

sync_dataset_documentation

Sync documentation for a single dataset.

main

Data

logger

API

dataio.scripts.sync_dataset_documentation.logger

‘getLogger(…)’

dataio.scripts.sync_dataset_documentation.get_database_url() -> str

Build database URL from environment variables.
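The page does not say which environment variables are read or which database driver is used. As a rough sketch of the idea, assuming hypothetical variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`, `DB_NAME`), hypothetical defaults, and a PostgreSQL-style URL, a builder might look like this:

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote


def build_database_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Illustrative only: variable names, defaults, and URL scheme are assumptions."""
    env = os.environ if env is None else env
    # Percent-encode credentials so characters like '@' or ' ' cannot break the URL.
    user = quote(env.get("DB_USER", "postgres"), safe="")
    password = quote(env.get("DB_PASSWORD", ""), safe="")
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = env.get("DB_NAME", "dataio")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


# Pass a plain dict instead of os.environ to keep the example deterministic.
print(build_database_url({"DB_USER": "alice", "DB_PASSWORD": "p@ss word", "DB_HOST": "db"}))
```

Accepting the environment as a mapping keeps the function easy to test without mutating `os.environ`.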

dataio.scripts.sync_dataset_documentation.get_s3_client()

Initialize S3 client.

dataio.scripts.sync_dataset_documentation.fetch_file_from_s3(bucket, dataset_id: str, filename: str) -> Optional[str]

Fetch a file from S3 for a dataset.

Looks in both STANDARDISED and PREPROCESSED versions. Returns the file content as string, or None if not found.
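The lookup-with-fallback behaviour described above can be illustrated without S3. In the sketch below, a plain dict stands in for the bucket, and both the key layout (`<dataset_id>/<version>/<filename>`) and the search order (STANDARDISED first) are assumptions; the real function may organise keys differently.

```python
from typing import Mapping, Optional

# Version names come from the description above; the search order is an assumption.
VERSIONS = ("STANDARDISED", "PREPROCESSED")


def fetch_file_sketch(bucket: Mapping[str, bytes], dataset_id: str, filename: str) -> Optional[str]:
    """Return the first matching file as text, or None if no version has it."""
    for version in VERSIONS:
        # Hypothetical key layout; the actual S3 prefix scheme is not documented here.
        key = f"{dataset_id}/{version}/{filename}"
        body = bucket.get(key)
        if body is not None:
            return body.decode("utf-8")
    return None


# An in-memory stand-in for the bucket, holding only a PREPROCESSED README.
fake_bucket = {"DS_EXAMPLE01/PREPROCESSED/README.md": b"# Example dataset"}
print(fetch_file_sketch(fake_bucket, "DS_EXAMPLE01", "README.md"))
print(fetch_file_sketch(fake_bucket, "DS_EXAMPLE01", "metadata.json"))
```

Returning None for a missing file, rather than raising, lets the caller treat absent documentation as a normal outcome.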

dataio.scripts.sync_dataset_documentation.sync_dataset_documentation(db_session, bucket, dataset_id: str, dry_run: bool = False) -> dict

Sync documentation for a single dataset.

Returns dict with sync results.
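The keys of the returned dict are not documented on this page. The following sketch shows one plausible shape for a per-dataset sync with a dry-run switch. The names `sync_sketch`, `store`, and `fetch`, and the result keys `found`, `missing`, and `written`, are all hypothetical; a dict replaces the database session and a callable replaces the S3 fetch.

```python
from typing import Callable, Dict, MutableMapping, Optional

# The two documentation files named in the module description.
DOC_FILES = ("README.md", "metadata.json")

Fetcher = Callable[[str, str], Optional[str]]


def sync_sketch(
    store: MutableMapping[str, Dict[str, str]],
    fetch: Fetcher,
    dataset_id: str,
    dry_run: bool = False,
) -> dict:
    """Illustrative sync: collect available files, write them unless dry_run is set."""
    # Hypothetical result shape; the real function's keys are not documented.
    result = {"dataset_id": dataset_id, "dry_run": dry_run,
              "found": [], "missing": [], "written": False}
    contents: Dict[str, str] = {}
    for filename in DOC_FILES:
        text = fetch(dataset_id, filename)
        if text is None:
            result["missing"].append(filename)
        else:
            result["found"].append(filename)
            contents[filename] = text
    # A dry run reports what would be synced but leaves the store untouched.
    if contents and not dry_run:
        store[dataset_id] = contents
        result["written"] = True
    return result


docs = {("DS_EXAMPLE01", "README.md"): "# Example dataset"}
cache: Dict[str, Dict[str, str]] = {}
print(sync_sketch(cache, lambda ds, fn: docs.get((ds, fn)), "DS_EXAMPLE01", dry_run=True))
```

Running the same call without `dry_run=True` would populate `cache` for `DS_EXAMPLE01` and report `written` as true.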

dataio.scripts.sync_dataset_documentation.main()