EXPERIMENTAL: This repository contains the Python module pydex for connecting to the Riverscapes Data Exchange API. It also includes a collection of scripts that use pydex classes to perform useful tasks.
This project is designed to simplify interaction with the Riverscapes GraphQL API. It uses modern Python packaging standards, including a pyproject.toml file for configuration and dependency management.
This project uses uv to manage Python virtual environments and dependencies. uv is an alternative to tools like pipenv and poetry.
- Install `uv` by following the installation instructions for your operating system.
- Ensure you have Python 3.9 or higher installed.
We have started using Spatialite for some operations. It is a binary that sits on top of SQLite and provides several powerful geospatial operations as `ST_` functions, similar to PostGIS on top of Postgres.
Spatialite is distributed as an extension to SQLite, but unfortunately the core `sqlite3` Python package is not compiled to allow extensions to be loaded (presumably for security reasons). We therefore use a package called APSW (Another Python SQLite Wrapper) that does. APSW can be installed with uv, and then you load the extension with the following code, where `spatialite_path` is the path to the Spatialite binary. macOS users can install Spatialite using Homebrew and then search for the file `mod_spatialite.8.dylib`. Windows users can download Spatialite binaries from the Gaia GIS site. Our Python scripts that use Spatialite should all allow you to specify this path in the `launch.json` file.
```python
import apsw

# Open the GeoPackage and load the Spatialite extension so ST_ functions are available
conn = apsw.Connection(rme_gpkg)
conn.enable_load_extension(True)
conn.load_extension(spatialite_path)
curs = conn.cursor()
```

To set up the project, follow these steps:
```bash
# Clone the repository
git clone https://github.com/Riverscapes/data-exchange-scripts.git
cd data-exchange-scripts

# Sync the environment using uv
uv sync
```

This will create a `.venv` folder in the root of the repository with the correct Python environment and dependencies installed.
- Open the repository in VSCode.
- If the `.venv` environment is not automatically detected, reload the window or restart VSCode.
- Select the Python interpreter located at `.venv/bin/python` (on macOS/Linux) or `.venv\Scripts\python.exe` (on Windows).
The best way to run a script is to use the "Run and Debug" feature in VSCode. This ensures that the correct virtual environment is activated and that the script runs in the correct context.
Click the "Run and Debug" button and select the dropdown item that best fits. If you're just trying to run a file without a dedicated launch item, you can use 🚀 Python: Run/Debug Current File (with .env). This will run the script and set you up with a server environment context (production or staging).
Running scripts this way will also allow you to drop breakpoints in your code and debug it.
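To see what that server environment context amounts to inside a script, here is a minimal sketch that reads the selection from an environment variable. The variable name `RS_API_STAGE` and the endpoint URLs are hypothetical placeholders for illustration, not the actual names used by pydex or the launch configurations.

```python
import os

# Hypothetical variable name -- the real .env files in this repo may use a different key.
stage = os.environ.get("RS_API_STAGE", "production")

# Placeholder endpoints for illustration only; check the pydex source for the real URLs.
endpoints = {
    "production": "https://example.com/api/production",
    "staging": "https://example.com/api/staging",
}

print(f"Running against the {stage} environment: {endpoints.get(stage, endpoints['production'])}")
```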
This project includes optional dependencies for geospatial functionality. To install these dependencies, run:
```bash
uv sync --extra geo
```

This will install packages like `gdal` and `shapely`. Note that `gdal` may require additional system-level dependencies. On macOS, you can install it using Homebrew:
```bash
brew install gdal
```

- Open the codespace "Riverscapes API Codespace."
- In VSCode, load the `RiverscapesAPI.code-workspace` workspace.
- Ensure the appropriate Python version is selected (e.g., `3.12.9 ('.venv')`).
NOTE: The codespace environment does not currently support scripts requiring GDAL (e.g. project merging). Run those locally.
- Dependency Management: Use `uv sync` to ensure your environment is always up-to-date with the dependencies specified in `pyproject.toml`.
This project uses port 4721 to authenticate locally and 4723 when used inside a codespace. This may conflict with other codespaces (such as riverscapes-tools).
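For illustration only, here is a hedged sketch of how a script might pick the local authentication callback port depending on whether it is running inside a codespace (GitHub sets the `CODESPACES` environment variable there). The actual pydex authentication code may handle this differently.

```python
import os

# GitHub Codespaces sets CODESPACES=true in the environment.
# The port numbers come from the note above; the selection logic is illustrative.
IN_CODESPACE = os.environ.get("CODESPACES", "").lower() == "true"
AUTH_CALLBACK_PORT = 4723 if IN_CODESPACE else 4721

print(f"OAuth callback will listen on port {AUTH_CALLBACK_PORT}")
```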
If you want to use this repository as a dependency in another project you can do so by adding it to your pyproject.toml file. For example:
```toml
[tool.uv.sources]
pydex = { git = "https://github.com/Riverscapes/data-exchange-scripts.git", branch = "main" }
```

For legacy projects that use pip you can install it directly from the repository:

```bash
pip install git+https://github.com/Riverscapes/data-exchange-scripts.git
```

Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch for your feature or bugfix.
- Submit a pull request with a clear description of your changes.
This project is licensed under the MIT License. See the LICENSE file for details.
The repository includes an automated pipeline for publishing tool/layer column definitions to Amazon Athena as an external table.
- Each tool now publishes a single unified `layer_definitions.json` containing both descriptive metadata and column definitions (no separate per-layer files or `def_path` indirection).
- These live beside the tool's code (e.g. under `pipelines/<tool_name>/`).
The `export_layer_definitions_for_s3.py` script in the RiverscapesXML repo scans the repo for every `layer_definitions.json` and produces partitioned Parquet (default) output.
We've loaded it as a requirement in `pyproject.toml`: https://github.com/Riverscapes/RiverscapesXML.git@master#subdirectory=riverscapes_metadata
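For a sense of what that scan involves, here is a minimal sketch (not the actual export script) that finds and loads every `layer_definitions.json` beneath a repository root using only the standard library:

```python
import json
from pathlib import Path

def find_layer_definitions(repo_root: str) -> list[dict]:
    """Collect every layer_definitions.json beneath repo_root (illustrative only)."""
    results = []
    for path in Path(repo_root).rglob("layer_definitions.json"):
        with path.open(encoding="utf-8") as fp:
            results.append({"path": str(path), "definition": json.load(fp)})
    return results

if __name__ == "__main__":
    for item in find_layer_definitions("."):
        print(item["path"])
```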
Partition hierarchy (three levels):
- `authority` – repository root name (e.g. `data-exchange-scripts`). Derived automatically from the git repo folder name.
- `authority_name` – the tool / package authority publishing the layer definitions (from JSON).
- `tool_schema_version` – semantic version of the tool's layer definition schema (from JSON).
Output pattern:
```
dist/metadata/
  authority=<repo>/authority_name=<authority>/tool_schema_version=<version>/layer_metadata.parquet
  index.json
```
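To make the layout concrete, here is a hedged sketch of how one partition directory could be assembled from the three keys. The JSON field names (`authority_name`, `tool_schema_version`) follow the list above; the helper itself is illustrative, not the export script's actual code.

```python
from pathlib import Path

def partition_path(out_dir: str, repo_name: str, definition: dict) -> Path:
    """Build the Hive-style partition directory for one layer_definitions.json (illustrative)."""
    return (
        Path(out_dir)
        / f"authority={repo_name}"
        / f"authority_name={definition['authority_name']}"
        / f"tool_schema_version={definition['tool_schema_version']}"
    )

# Example: dist/metadata/authority=data-exchange-scripts/authority_name=rme_to_athena/tool_schema_version=1.0.0
print(partition_path("dist/metadata", "data-exchange-scripts", {
    "authority_name": "rme_to_athena",
    "tool_schema_version": "1.0.0",
}))
```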
Default behavior:
- Output format: Parquet (use `--format csv` for CSV).
- Partition columns (`authority`, `authority_name`, `tool_schema_version`) are NOT inside the Parquet/CSV files unless `--include-partition-cols` is passed.
- A `commit_sha` (current HEAD) is written into every row and stored again in `index.json` with a run timestamp (see the sketch after this list).
- Schema validation is enforced; any validation error causes a loud failure (non-zero exit code). An `index.json` with `status: validation_failed` and the collected `validation_errors` is still written for diagnostics, but no partition files are produced.
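As a rough illustration of that `index.json` bookkeeping (not the export script's actual implementation; field names other than `commit_sha`, `status`, and `validation_errors` are assumptions), the commit SHA and run timestamp could be captured like this:

```python
import json
import subprocess
from datetime import datetime, timezone

def write_index(out_path: str, status: str, validation_errors: list[str]) -> None:
    """Write an index.json with the current HEAD sha and a run timestamp (illustrative)."""
    commit_sha = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    index = {
        "commit_sha": commit_sha,
        "run_timestamp": datetime.now(timezone.utc).isoformat(),  # assumed field name
        "status": status,                                         # e.g. "ok" or "validation_failed"
        "validation_errors": validation_errors,
    }
    with open(out_path, "w", encoding="utf-8") as fp:
        json.dump(index, fp, indent=2)
```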
Run locally:
```bash
python scripts/metadata/export_layer_definitions_for_s3.py
```

Optional flags:

```bash
python scripts/metadata/export_layer_definitions_for_s3.py --format csv                 # CSV instead of Parquet
python scripts/metadata/export_layer_definitions_for_s3.py --include-partition-cols     # Embed partition columns in each file
```

We publish to: `s3://riverscapes-athena/metadata/layer_definitions/`
This gets turned into the Athena table `default.layer_definitions`.
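As an illustration of querying the published table programmatically (the workflow's own sample queries may differ), here is a minimal boto3 sketch, assuming your AWS credentials are configured and substituting your own Athena results bucket and region:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-west-2")  # region is an assumption

# Kick off a simple row count against the published table.
execution = athena.start_query_execution(
    QueryString="SELECT count(*) FROM layer_definitions",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},  # placeholder bucket
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the single result row.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows[1]["Data"][0]["VarCharValue"])  # first data row after the header
```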
Add new partitions (after upload):
```sql
-- auto-discover
MSCK REPAIR TABLE layer_definitions;

-- OR manual:
ALTER TABLE layer_definitions
ADD IF NOT EXISTS PARTITION (
  authority='data-exchange-scripts',
  authority_name='rme_to_athena',
  tool_schema_version='1.0.0'
)
LOCATION 's3://riverscapes-athena/metadata/layer_definitions/authority=data-exchange-scripts/authority_name=rme_to_athena/tool_schema_version=1.0.0/';
```

Workflow file: `.github/workflows/metadata-catalog.yml`
Steps performed:
- Checkout code.
- Assume AWS IAM role via OIDC (secret `METADATA_PUBLISH_ROLE_ARN`).
- Install dependencies (Python 3.12 + `uv sync`).
- Run the flatten script -> partitioned Parquet.
- Sync `dist/metadata` to the S3 bucket prefix.
- Run `MSCK REPAIR TABLE` to load partitions.
- Perform sample queries (partition listing / row count).
The role must allow:
- S3 List/Get/Put/Delete under `metadata/layer_definitions/` and the query result prefix.
- Athena: StartQueryExecution, GetQueryExecution, GetQueryResults.
- Glue: Get/Create/Update table & partitions for the database/table.
- Validate layer schemas (dtype whitelist, required fields & semantic checks).
- Explicit partition adds instead of MSCK for faster updates.
- Historical snapshots (extra partition like `snapshot_date`).
- Glue Catalog integration (automated table & partition registration without MSCK).
- Data quality profile summary (row counts, distinct key coverage) in `index.json`.
| Symptom | Likely Cause | Fix |
|---|---|---|
| Empty Athena table | Partitions not loaded | Run MSCK REPAIR TABLE or add partitions manually |
| Wrong data types | Created table before column rename | Drop & recreate external table with new DDL |
| Missing new version | Workflow didn’t run or lacked perms | Check Actions logs & IAM role policies |
| Zero rows for authority | Upload sync failed | Inspect S3 prefix & re-run workflow |
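For the last two troubleshooting rows, a quick way to inspect the S3 prefix is a short boto3 listing. A minimal sketch, assuming your AWS credentials can read the bucket:

```python
import boto3

s3 = boto3.client("s3")
prefix = "metadata/layer_definitions/authority=data-exchange-scripts/"

# List what the workflow actually uploaded for this authority.
response = s3.list_objects_v2(Bucket="riverscapes-athena", Prefix=prefix)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```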