## Attribution

This repository makes use of and builds on external code from the following sources:

- `datasets/mosaiks`: Adapted from the Global Policy Lab's mosaiks-paper repository, which provides code for feature extraction and dataset handling using MOSAIKS features.
- `sampling/`: Adapted from TypiClust's deep-al module, which includes implementations of active learning and sampling methods.

We thank the authors of these repositories for making their code available.
## Overview

This repository contains a complete data processing pipeline for optimized sampling analysis across three datasets:
- USAVars
- India SECC
- Togo soil fertility (Note: not yet publicly released)
## Contents

- Data Download
- Featurization
- Train-Test Split
- Generate GeoDataFrames
- Group Creation
- Initial Sampling
- Running Sampling
- Bash Scripts
- Results and Analysis
- Contributing
## Data Download

### USAVars

Download the dataset using torchgeo.

- Docs: TorchGeo USAVars Dataset
### India SECC

Process/download from this repository: https://github.com/emilylaiken/satellite-fairness-replication

Required files:

- `mosaiks_features_by_shrug_condensed_regions_25_max_tiles_100_india.csv`
  - Description: precomputed MOSAIKS features (4000-dim)
  - Columns:
    - `condensed_shrug_id`: unique ID per unit
    - `Feature0` to `Feature3999`: satellite features
- `grouped.csv`
  - Description: contains labels
  - Columns:
    - `condensed_shrug_id` (matching above)
    - `secc_cons_pc_combined`: target variable
- `villages_with_regions.shp`
  - Description: shapefile with spatial polygons
  - Columns:
    - `condensed_`: will be renamed to `condensed_shrug_id`
    - `geometry`: polygon geometries
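These India files all join on `condensed_shrug_id` (after renaming the shapefile's truncated `condensed_` column). A minimal pandas sketch of that join, using toy stand-in data with the schemas listed above since the real files are not bundled here:

```python
import pandas as pd

# Toy stand-ins for the real files (schemas taken from the list above).
features = pd.DataFrame({
    "condensed_shrug_id": [101, 102],
    "Feature0": [0.1, 0.2],
    "Feature3999": [0.3, 0.4],
})
labels = pd.DataFrame({
    "condensed_shrug_id": [101, 102],
    "secc_cons_pc_combined": [1200.0, 950.0],
})
# In the shapefile the ID column is truncated to "condensed_";
# rename it to match the CSVs before merging.
villages = pd.DataFrame({"condensed_": [101, 102]})
villages = villages.rename(columns={"condensed_": "condensed_shrug_id"})

merged = features.merge(labels, on="condensed_shrug_id").merge(
    villages, on="condensed_shrug_id"
)
```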
### Togo

Not yet available. Will be released by the Togolese Ministry of Agriculture.
## Featurization

### USAVars

Run:

```bash
cd datasets
python featurization.py \
    --dataset_name USAVars \
    --data_root /path/to/your/data/root \
    --labels treecover,population \
    --num_features 4096
```

### India SECC

Follow the instructions at satellite-fairness-replication.

- Save features as a `.pkl` file (dict format) with keys: `'X'`, `'ids_X'`, and `'latlon'`.
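A minimal sketch of the expected `.pkl` layout; the array shapes here are illustrative, not the real dataset sizes:

```python
import pickle
import numpy as np

n, d = 5, 4000  # illustrative: 5 units, 4000 MOSAIKS features
features = {
    "X": np.random.rand(n, d),       # feature matrix, one row per unit
    "ids_X": np.arange(n),           # unit IDs aligned with the rows of X
    "latlon": np.random.rand(n, 2),  # (lat, lon) per unit
}

with open("india_features.pkl", "wb") as f:
    pickle.dump(features, f)

# Round-trip check: the file should contain exactly these three keys.
with open("india_features.pkl", "rb") as f:
    loaded = pickle.load(f)
```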
## Train-Test Split

Run `format_data.py`:

```bash
cd datasets
python format_data.py \
    --save \
    --label population \
    --feature_path /path/to/featurized/data/CONTUS_UAR_torchgeo4096.pkl
```

Use `--label` to select another label, and point `--feature_path` at the India features to run on India SECC. This creates an 80/20 split, saved as a `.pkl` file with keys `{X, y, latlon}_{train, test}`.
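The resulting file holds six arrays. A hedged sketch of an equivalent 80/20 split in numpy (the actual script may differ in shuffling and seeding; the array sizes here are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.random((n, 4096))
y = rng.random(n)
latlon = rng.random((n, 2))

# Shuffle once, then take the first 80% for train and the rest for test.
idx = rng.permutation(n)
cut = int(0.8 * n)
train, test = idx[:cut], idx[cut:]

split = {
    "X_train": X[train], "X_test": X[test],
    "y_train": y[train], "y_test": y[test],
    "latlon_train": latlon[train], "latlon_test": latlon[test],
}
```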
## Generate GeoDataFrames

GeoDataFrames are used for clusters and region-based sampling strategies.

### USAVars

First, download US county shapefiles from census.gov (here we use the 2015 shapefiles):

```bash
cd groups
python usavars_generate_gdfs.py \
    --labels population,treecover \
    --input_folder ../0_data/features/usavars \
    --year 2015 \
    --county_shp ../0_data/boundaries/us/us_county_2015 \
    --output_dir ./admin_gdfs/usavars
```

### India SECC

Admin levels are pre-included in the shapefile. No processing needed.
### Togo

Use:

```bash
python togo_generate_gdfs.py
```

## Group Creation

Three group types are used:

- Admin Groups: States, regions
- Image Groups: Feature-based KMeans clustering
- NLCD Groups: Land cover classes (U.S. only)
### Admin Groups

Generate county-level groups for a single dataset (US population):

```bash
python generate_groups.py \
    --datasets "usavars_pop" \
    --gdf_paths "../0_data/admin_gdfs/usavars/gdf_counties_population_2015.geojson" \
    --id_cols "id" \
    --shape_files "../0_data/boundaries/us/us_county_2015" \
    --group_type "counties" \
    --group_cols "combined_county_id"
```

To run for other datasets (`usavars_tc`, `india_secc`, `togo`), replace the `--datasets`, `--gdf_paths`, and `--id_cols` arguments accordingly.
Note: the following groups should be made:

- USAVars: `state`, `county`
- India SECC: `state`, `district`
- Togo: `region`, `cantons`
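The sampling step below consumes these groups through `*_assignments.pkl` files. The exact format is defined by `generate_groups.py`; this hedged sketch only illustrates the idea of mapping each sample ID to a group label (all IDs and labels here are hypothetical):

```python
import pickle
import numpy as np

# Hypothetical: map each sample ID to an admin group label.
ids = np.array(["p1", "p2", "p3", "p4"])
groups = np.array(["Idaho", "Idaho", "Louisiana", "Mississippi"])
assignments = dict(zip(ids, groups))

with open("states_assignments.pkl", "wb") as f:
    pickle.dump(assignments, f)

# Downstream code can then look up any sample's group.
with open("states_assignments.pkl", "rb") as f:
    loaded = pickle.load(f)
```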
### Image Groups

The `image_clusters.py` script generates clusters from image features:

```bash
python image_clusters.py \
    --dataset USAVars \
    --num_clusters 8 \
    --feature_path /path/to/features.pkl \
    --output_path /path/to/save/clusters_{num_clusters}.pkl
```

Set `--dataset` to `India SECC` or `Togo` as needed, and use `--num_clusters 3` where appropriate.
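Conceptually, the image groups come from KMeans over the feature matrix. A hedged sketch of the core step (the real script adds I/O and CLI handling; the feature matrix here is a random stand-in):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((200, 16))  # stand-in for the (n_units, n_features) matrix

# 8 clusters for USAVars; the command above uses 3 for other datasets.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_  # one cluster label per row of X
```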
### NLCD Groups

- Download the 2016 NLCD TIFF from: https://earthexplorer.usgs.gov/
- Run `nlcd_groups.py` with the following required arguments:
  - `--input_dir`: directory containing NAIP image files
  - `--nlcd_path`: path to an existing NLCD `.tif` raster file
  - `--output_dir`: directory to save output files
  - `--dataset_name`: dataset name for output filenames (default: `"usavars"`)
  - `--file_pattern`: file pattern to match in the input directory (default: `"*.tif"`)
  - `--k_min`: minimum number of clusters to test (default: 2)
  - `--k_max`: maximum number of clusters to test (default: 10)

Specifically, make sure `k = 8`.
## Initial Sampling

Use the Jupyter notebooks:

- `usavars_initial_sample.ipynb`
- `india_secc_initial_sample.ipynb`
- `togo_initial_sample.ipynb`

Save the outputs as `.pkl` files to `0_data/initial_samples/{dataset}/...`.
## Running Sampling

Attention: to solve the optimization problem, cvxpy is run with the MOSEK solver. You need a license to use MOSEK, which can be requested from https://www.mosek.com/products/academic-licenses/.
Edit the config files in `sampling/configs/{dataset}/*.yaml` and set the correct paths, especially `DATASET.ROOT_DIR`.
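A hedged sketch of the shape such a config might take; only `DATASET.ROOT_DIR` is named above, and every other key would come from the actual files in `sampling/configs/`:

```yaml
# Illustrative fragment; key names other than DATASET.ROOT_DIR are assumptions.
DATASET:
  ROOT_DIR: /path/to/your/0_data
```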
Example usage:

```bash
python train.py \
    --cfg ../configs/usavars/RIDGE_POP.yaml \
    --exp-name experiment_1 \
    --sampling_fn greedycost \
    --budget 1000 \
    --initial_set_str test \
    --seed 42 \
    --unit_assignment_path ../../0_data/groups/usavars_pop/counties_assignments.pkl \
    --id_path '../../0_data/initial_samples/usavars/population/cluster_sampling/fixedstrata_Idaho_16-Louisiana_22-Mississippi_28-New Mexico_35-Pennsylvania_42/sample_state_combined_county_id_5ppc_150_size_seed_1.pkl' \
    --cost_func uniform \
    --cost_name uniform \
    --group_assignment_path ../../0_data/groups/usavars_pop/states_assignments.pkl \
    --group_type states
```

| Argument | Description |
|---|---|
| `--cost_array_path` | Path to a NumPy cost array |
| `--unit_assignment_path` | Path to unit assignment file |
| `--region_assignment_path` | Path to region assignment file |
| `--util_lambda` | Utility lambda for optimization |
| `--alpha` | Alpha parameter for cost-based sampling |
| `--points_per_unit` | For unit-based sampling |
## Bash Scripts

Baseline runs:

- `run_india.sh`
- `run_togo.sh`
- `run_usavars_population.sh`
- `run_usavars_treecover.sh`

Representative by states/regions:

- `run_india_rep_states.sh`
- `run_togo_rep_regions.sh`
- `run_usavars_population_rep_states.sh`
- `run_usavars_treecover_rep_states.sh`

Representative by image clusters (8 clusters):

- `run_india_rep_image_8.sh`
- `run_togo_cluster_rep_image_8.sh`
- `run_usavars_population_rep_image_8.sh`
- `run_usavars_treecover_rep_image_8.sh`

Representative by NLCD classes (U.S. only):

- `run_usavars_population_rep_nlcd.sh`
- `run_usavars_treecover_rep_nlcd.sh`

Multiple initial sets:

- `run_india_cluster_multiple.sh`
- `run_togo_cluster_multiple_initial_set.sh`

Cost differences:

- `run_togo_cost_diff.sh`
## Results and Analysis

Go to the `summarize/` directory and run:

```bash
python parse_out_log.py --multiple True   # or False
python generate_latex_table.py
python plot_multiple_initial_set.py
python plot_alpha.py
```

Before running, check that:

- All data paths in the config files are set correctly.
- The required `.pkl` files (features, splits, groups) exist.
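A small stdlib sketch of the kind of preflight check described above; the listed paths are hypothetical and should be adjusted to your layout:

```python
from pathlib import Path

# Hypothetical list of artifacts the pipeline expects; adjust to your layout.
required = [
    Path("0_data/features/usavars/CONTUS_UAR_torchgeo4096.pkl"),
    Path("0_data/groups/usavars_pop/states_assignments.pkl"),
]

# Report anything missing before kicking off a long run.
missing = [p for p in required if not p.exists()]
if missing:
    print("Missing required files:")
    for p in missing:
        print(f"  {p}")
```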