diff --git a/docs/runorder.md b/docs/runorder.md
index 8e50717..ca43f64 100644
--- a/docs/runorder.md
+++ b/docs/runorder.md
@@ -71,12 +71,6 @@ Firstly we have access to healthcare data:
 python3 -m methods.inputs.download_accessibility /data/tmf/access/raw.tif
 ```
 
-Which must then be turned into smaller tiles:
-
-```shark-run:gdalenv
-python3 -m methods.inputs.generate_access_tiles /data/tmf/access/raw.tif /data/tmf/jrc/tif /data/tmf/access
-```
-
-## Country boarders
+## Country borders
 
 We use OSM data for country borders data. This requires an API key to access the downloads. To get an API key you must follow these steps:
@@ -134,8 +128,8 @@ DATA_PATH="/data/tmf/gedi"
 Once you have this the download script to pull the raw GEDI data is:
 
 ```shark-run:gdalenv
-python3 -m methods.inputs.locate_gedi_data /data/tmf/123/buffer.geojson /data/tmf/gedi/granule/info/
-python3 -m methods.inputs.download_gedi_data /data/tmf/gedi/granule/info/* /data/tmf/gedi/granule/
+python3 -m methods.inputs.locate_gedi_data /data/buffer.geojson /data/gedi/info/
+python3 -m methods.inputs.download_gedi_data /data/gedi/info/* /data/gedi/granule/
 ```
 
 Each GEDI granule is a complete sweep from the ISS along the path that crossed the buffer zone, and so contains
@@ -143,17 +137,21 @@ much more data than we need. We thus need to filter this to give us just the GED
 same time we filter based on time and quality as per the methodology document:
 
 ```shark-run:gdalenv
-python3 -m methods.inputs.filter_gedi_data --buffer /data/tmf/123/buffer.geojson \
-    --granules /data/tmf/gedi/granule/ \
-    --output /data/tmf/123/gedi.geojson
+python3 -m methods.inputs.filter_gedi_data --buffer /data/buffer.geojson \
+    --granules /data/gedi/granule/ \
+    --output /data/gedi.geojson
 ```
 
 Once filtered we can then combine the GEDI data with the JRC data we have to work out the carbon density per land usage class:
 
 ```shark-run:gdalenv
-python3 -m methods.inputs.generate_carbon_density --jrc /data/tmf/jrc/tif \
-    --gedi /data/tmf/123/gedi.geojson \
-    --output /data/tmf/123/carbon-density.csv
+python3 -m methods.inputs.generate_carbon_density --jrc /data/jrc/AnnualChange/ \
+    --gedi /data/gedi.geojson \
+    --output /data/tmf/123/agb.csv
+```
+
+```shark-publish
+/data/tmf/123/agb.csv
 ```
 
 ## Generate matching area
@@ -185,7 +183,6 @@ We use the SRTM elevation data, which we download to cover the matching area:
 ```shark-run:gdalenv
 python3 -m methods.inputs.download_srtm_data --project /data/tmf/project_boundaries/123.geojson \
     --matching /data/tmf/project_boundaries/123/matching-area.geojson \
-    --zips /data/tmf/srtm/zip \
     --tifs /data/tmf/srtm/tif
 ```
 
@@ -206,9 +203,11 @@ python3 -m methods.inputs.rescale_tiles_to_jrc --jrc /data/tmf/jrc/tif \
     --output /data/tmf/rescaled-slopes
 ```
 
-## Country raster
+## Project baselayer rasters
 
-Again, rather than repeatedly dynamically rasterize the country vectors, we rasterise them once and re-use them:
+For several of the source inputs we generate a single raster layer, at the correct pixel scale, that contains everything we need, so it can be re-used rather than recomputed on every run.
+
+First a map of countries:
 
 ```shark-run:gdalenv
 python3 -m methods.inputs.generate_country_raster --jrc /data/tmf/jrc/tif \
@@ -217,6 +216,24 @@ python3 -m methods.inputs.generate_country_raster --jrc /data/tmf/jrc/tif \
     --output /data/tmf/123/countries.tif
 ```
 
+Then a raster map of ecoregions:
+
+```shark-run:gdalenv
+python3 -m methods.inputs.generate_ecoregion_rasters --jrc /data/jrc/AnnualChange/ \
+    --matching /data/tmf/project_boundaries/123/matching-area.geojson \
+    --ecoregions /data/tmf/ecoregions/ecoregions.geojson \
+    --output /data/tmf/ecoregions/ecoregion.tif
+```
+
+Finally, a raster map of the access to healthcare data:
+
+```shark-run:gdalenv
+python3 -m methods.inputs.generate_access_raster --jrc /data/jrc/AnnualChange/ \
+    --matching /data/tmf/project_boundaries/123/matching-area.geojson \
+    --access /data/tmf/access/raw.tif \
+    --output /data/tmf/123/access.tif
+```
+
 # Pixel matching
 
 Pixel matching is split into three main stages: Calculating K, then M, and then finding pairs between them.
@@ -229,39 +246,52 @@ First we make K.
 python3 -m methods.matching.calculate_k --project /data/tmf/project_boundaries/123.geojson \
     --start_year 2012 \
     --evaluation_year 2021 \
-    --jrc /data/tmf/jrc/tif \
-    --cpc /data/tmf/fcc-cpcs \
-    --ecoregions /data/tmf/ecoregions \
-    --elevation /data/tmf/rescaled-elevation \
-    --slope /data/tmf/rescaled-slopes \
-    --access /data/tmf/access \
+    --jrc /data/jrc/AnnualChange/ \
+    --cpc /data/jrc/cpc/ \
+    --ecoregions /data/tmf/ecoregions/ecoregion.tif \
+    --elevation /data/tmf/rescaled-elevation/ \
+    --slope /data/tmf/rescaled-slopes/ \
+    --access /data/tmf/123/access.tif \
     --countries-raster /data/tmf/123/countries.tif \
    --output /data/tmf/123/k.parquet
 ```
 
+```shark-publish
+/data/tmf/123/k.parquet
+```
+
+
 ## Calculate set M
 
 Calculating the set M is a three step process. First we generate a raster per K that has the matches for that particular pixel:
 
 ```shark-run:gdalenv
 python3 -m methods.matching.find_potential_matches --k /data/tmf/123/k.parquet \
-    --matching /data/tmf/123/matching-area.geojson \
+    --matching /data/tmf/project_boundaries/123/matching-area.geojson \
     --start_year 2012 \
     --evaluation_year 2021 \
-    --jrc /data/tmf/jrc/tif \
-    --cpc /data/tmf/fcc-cpcs \
-    --ecoregions /data/tmf/ecoregions \
-    --elevation /data/tmf/rescaled-elevation \
-    --slope /data/tmf/rescaled-slopes \
-    --access /data/tmf/access \
-    --countries-raster /data/tmf/123/countries.tif \
-    --output /data/tmf/123/matches
+    --jrc /data/jrc/AnnualChange/ \
+    --cpc /data/jrc/cpc/ \
+    --ecoregions /data/tmf/ecoregions/ecoregion.tif \
+    --elevation /data/tmf/rescaled-elevation/ \
+    --slope /data/tmf/rescaled-slopes/ \
+    --access /data/tmf/123/access.tif \
+    --countries-raster /data/tmf/123/countries.tif \
+    --output /data/tmf/123/matches/
 ```
 
 Then these rasters get combined into a single raster that is all the potential matches as one. The j parameter controls how many concurrent processes the script can use, which is bounded mostly by how much memory you have available. The value 20 is good for our server, but may not match yours.
 
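+If you are unsure what value to pass for `-j`, a reasonable starting point is to divide the memory you can spare by the peak per-worker footprint you observe on a test run. A minimal sketch on Linux, assuming roughly 4GB of peak memory per worker (an illustrative figure; measure your own runs rather than trusting it):
+
+```
+# MemAvailable is reported in kB, so 4GB per worker is 4 * 1024 * 1024 kB
+J=$(( $(awk '/MemAvailable/ {print $2}' /proc/meminfo) / (4 * 1024 * 1024) ))
+echo "Will pass -j $J to build_m_raster"
+```
+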
 ```shark-run:gdalenv
-python3 -m methods.matching.build_m_raster --rasters_directory /data/tmf/123/matches \
+python3 -m methods.matching.build_m_raster --rasters_directory /data/tmf/123/matches/ \
     --output /data/tmf/123/matches.tif \
     -j 20
 ```
 
@@ -270,19 +292,25 @@ We then convert that raster into a table of pixels plus the data we need:
 ```shark-run:gdalenv
 python3 -m methods.matching.build_m_table --raster /data/tmf/123/matches.tif \
-    --matching /data/tmf/123/matching-area.geojson \
+    --matching /data/tmf/project_boundaries/123/matching-area.geojson \
     --start_year 2012 \
     --evaluation_year 2021 \
-    --jrc /data/tmf/jrc/tif \
-    --cpc /data/tmf/fcc-cpcs \
-    --ecoregions /data/tmf/ecoregions \
-    --elevation /data/tmf/rescaled-elevation \
-    --slope /data/tmf/rescaled-slopes \
-    --access /data/tmf/access \
-    --countries-raster /data/tmf/123/countries.tif \
+    --jrc /data/jrc/AnnualChange/ \
+    --cpc /data/jrc/cpc/ \
+    --ecoregions /data/tmf/ecoregions/ecoregion.tif \
+    --elevation /data/tmf/rescaled-elevation/ \
+    --slope /data/tmf/rescaled-slopes/ \
+    --access /data/tmf/123/access.tif \
+    --countries-raster /data/tmf/123/countries.tif \
     --output /data/tmf/123/matches.parquet
 ```
+
+```shark-publish
+/data/tmf/123/matches.parquet
+```
+
+
 
 ## Find pairs
 
 Finally we can find the 100 sets of matched pairs. The seed is used to control the random number generator, and using the same seed on each run ensures consistency despite randomness being part of the selection process.
 
@@ -291,9 +319,8 @@ Finally we can find the 100 sets of matched pairs. The seed is used to control t
 python3 -m methods.matching.find_pairs --k /data/tmf/123/k.parquet \
     --m /data/tmf/123/matches.parquet \
     --start_year 2012 \
-    --output /data/tmf/123/pairs \
-    --seed 42 \
-    -j 1
+    --output /data/tmf/123/pairs/ \
+    --seed 42
 ```
 
 # Calculate additionality
@@ -304,9 +331,25 @@ Finally this script calculates additionality:
 ```shark-run:gdalenv
 python3 -m methods.outputs.calculate_additionality --project /data/tmf/project_boundaries/123.geojson \
     --project_start 2012 \
     --evaluation_year 2021 \
-    --density /data/tmf/123/carbon-density.csv \
-    --matches /data/tmf/123/pairs \
+    --density /data/tmf/123/agb.csv \
+    --matches /data/tmf/123/pairs/ \
     --output /data/tmf/123/additionality.csv
+cat /data/tmf/123/additionality.csv
 ```
 By running the additionality step with the environment variable `TMF_PARTIALS` set to some directory, this step will also generate GeoJSON files for visualising the pairs and the standardised mean difference calculations for the matching variables. You can add `TMF_PARTIALS=/some/dir` before `python3` to set this just for a specific run of `calculate_additionality`.
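+
+For example (here `/data/tmf/123/partials` is just an illustrative choice of directory for the partials):
+
+```
+TMF_PARTIALS=/data/tmf/123/partials python3 -m methods.outputs.calculate_additionality --project /data/tmf/project_boundaries/123.geojson \
+    --project_start 2012 \
+    --evaluation_year 2021 \
+    --density /data/tmf/123/agb.csv \
+    --matches /data/tmf/123/pairs/ \
+    --output /data/tmf/123/additionality.csv
+```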
+
+```shark-publish
+/data/tmf/123/additionality.csv
+```
\ No newline at end of file
diff --git a/methods/inputs/download_accessibility.py b/methods/inputs/download_accessibility.py
index e7ca2a1..68e5e71 100644
--- a/methods/inputs/download_accessibility.py
+++ b/methods/inputs/download_accessibility.py
@@ -1,14 +1,14 @@
+import io
 import os
 import sys
 import tempfile
 import traceback
+import zipfile
 from shutil import move
 from glob import glob
 
 import requests
 
-from biomassrecovery.utils.unzip import unzip  # type: ignore
-
 # As taken from https://malariaatlas.org/open-access-policy/
 # Under "Creative Commons Attribution 3.0 Unported License"
 ACCESS_DATA = "https://data.malariaatlas.org/geoserver/Accessibility/ows?service=CSW&version=2.0.1" \
@@ -22,20 +22,13 @@ def is_tif(fname : str) -> bool:
 
 def download_accessibility_tif(source_url: str, target_path: str) -> None:
     with tempfile.TemporaryDirectory() as tmpdir:
-        download_path = os.path.join(tmpdir, "accessibility.zip")
-        response = requests.get(source_url, stream=True, timeout=60)
-        if not response.ok:
-            raise DownloadError(f'{response.status_code}: {response.reason}')
-        with open(download_path, 'wb') as output_file:
-            print("Downloading accessibility data...")
-            for chunk in response.iter_content(chunk_size=1024*1024):
-                output_file.write(chunk)
-        print(f"Unzipping from {download_path}")
-        unzip(
-            download_path,
-            tmpdir,
-            filter_func=is_tif
-        )
+        with requests.get(source_url, stream=True, timeout=60) as response:
+            if not response.ok:
+                raise DownloadError(response.status_code, response.reason, source_url)
+            with zipfile.ZipFile(io.BytesIO(response.content)) as zipf:
+                members = zipf.namelist()
+                for member in members:
+                    _ = zipf.extract(member, path=tmpdir)
 
         # There should only be a single TIF file
         tif = glob("*.tif", root_dir=tmpdir)
diff --git a/methods/inputs/download_jrc_data.py b/methods/inputs/download_jrc_data.py
index e8b0cde..e6c12b4 100644
--- a/methods/inputs/download_jrc_data.py
+++ b/methods/inputs/download_jrc_data.py
@@ -1,5 +1,6 @@
 import argparse
 import io
+import logging
 import multiprocessing
 import os
 import shutil
@@ -13,13 +14,19 @@
 import requests
 import shapely  # type: ignore
 
+logging.basicConfig(
+    level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
+)
+
 URL_FORMAT = "https://ies-ows.jrc.ec.europa.eu/iforce/tmf_v1/download.py?type=tile&dataset=AnnualChange&lat=%s&lon=%s"
 
 def download_jrc_dataset(tempdir: str, output_dir: str, tile_url: str) -> None:
     with requests.get(tile_url, stream=True, timeout=60) as response:
         if response.status_code == HTTPStatus.NOT_FOUND:
+            logging.warning("No file found at %s", tile_url)
             return
         if response.status_code != HTTPStatus.OK:
+            logging.warning("Bad status for %s: %d", tile_url, response.status_code)
             raise ValueError(response.status_code)
         try:
             with zipfile.ZipFile(io.BytesIO(response.content)) as zzip:
@@ -32,8 +39,10 @@ def download_jrc_dataset(tempdir: str, output_dir: str, tile_url: str) -> None:
                     os.rename(target, final_path)
                 except OSError:
                     shutil.move(target, final_path)
+            logging.info("Successfully fetched %s", tile_url)
         except zipfile.BadZipFile:
-            pass
+            logging.error("Bad zip file %s: %s", tile_url, response.content)
+            raise
 
 def download_jrc_data(
     target_tif_directory: str,
@@ -64,6 +73,7 @@ def download_jrc_data(
             url = URL_FORMAT % (slat, slng)
             tile_urls.append(url)
 
+    logging.info("Found %s tiles", len(tile_urls))
     with tempfile.TemporaryDirectory() as tempdir:
         core_count = int(multiprocessing.cpu_count() / 4)
         n_workers = min(len(tile_urls), core_count)
diff --git a/methods/inputs/download_shapefiles.py b/methods/inputs/download_shapefiles.py
index 4ac2c39..e2de3f2 100644
--- a/methods/inputs/download_shapefiles.py
+++ b/methods/inputs/download_shapefiles.py
@@ -1,13 +1,13 @@
+import io
 import os
 import sys
 import tempfile
+import zipfile
 from glob import glob
 
 import geopandas  # type: ignore
 import requests
 
-from biomassrecovery.utils.unzip import unzip  # type: ignore
-
 from ..common import DownloadError
 
 # As taken from the World Bank data catalog https://datacatalog.worldbank.org/search/dataset/0038272
@@ -32,26 +32,27 @@ class UnpackError(Exception):
 # 4: convert it to something more useful (i.e., geojson/gpkg)
 def download_country_polygons(source_url: str, target_filename: str) -> None:
     with tempfile.TemporaryDirectory() as tmpdir:
-        download_path = os.path.join(tmpdir, "countries.zip")
-        response = requests.get(source_url, stream=True, timeout=60)
-        if not response.ok:
-            raise DownloadError(response.status_code, response.reason, source_url)
-        with open(download_path, 'wb') as output_file:
-            for chunk in response.iter_content(chunk_size=1024*1024):
-                output_file.write(chunk)
-        unzip_path = os.path.join(tmpdir, "countries")
-        unzip(download_path, unzip_path)
+        with requests.get(source_url, stream=True, timeout=60) as response:
+            if not response.ok:
+                raise DownloadError(response.status_code, response.reason, source_url)
+            try:
+                with zipfile.ZipFile(io.BytesIO(response.content)) as zipf:
+                    members = zipf.namelist()
+                    for member in members:
+                        _ = zipf.extract(member, path=tmpdir)
+            except zipfile.BadZipFile as exc:
+                print(response.content)
+                raise exc
 
         # find the shape file and convert it to a geojson
-
-        shape_files = glob("*.shp", root_dir=unzip_path)
+        shape_files = glob("*.shp", root_dir=tmpdir)
         matches = len(shape_files)
 
         if matches == 0:
             raise UnpackError("No shape files found in archive")
         elif matches > 1:
             raise UnpackError("Too many shape files found")
-        shape_file_path = os.path.join(unzip_path, shape_files[0])
+        shape_file_path = os.path.join(tmpdir, shape_files[0])
         shape_file_data = geopandas.read_file(shape_file_path)
         shape_file_data.to_file(target_filename, driver='GeoJSON')
diff --git a/methods/inputs/download_srtm_data.py b/methods/inputs/download_srtm_data.py
index ba34314..e3d7f4f 100644
--- a/methods/inputs/download_srtm_data.py
+++ b/methods/inputs/download_srtm_data.py
@@ -1,17 +1,16 @@
 import argparse
-import errno
+import io
+import logging
 import math
 import os
 import shutil
-import sys
 import tempfile
-import logging
+import zipfile
 
 import requests
 import shapely  # type: ignore
 from geopandas import gpd  # type: ignore
 
-from biomassrecovery.utils.unzip import unzip  # type: ignore
 
 logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
 
@@ -21,10 +20,8 @@ def download_srtm_data(
     project_boundaries_filename: str,
     pixel_matching_boundaries_filename: str,
-    destination_zip_folder: str,
     destintation_tiff_folder: str
 ) -> None:
-    os.makedirs(destination_zip_folder, exist_ok=True)
     os.makedirs(destintation_tiff_folder, exist_ok=True)
 
     project = gpd.read_file(project_boundaries_filename)
 
@@ -44,92 +41,56 @@ def download_srtm_data(
         for xoffset in range(min_x, max_x + 1):
             url = URL_TEMPLATE % (xoffset, yoffset)
             logging.info("Fetching SRTM tile %s", url)
-            target_filename = url.split('/')[-1]
-            target_path = os.path.join(destination_zip_folder, target_filename)
-            if not os.path.exists(target_path):
+            tiff_target_name = TIFF_NAME_TEMPLATE % (xoffset, yoffset)
+            tiff_target_path = os.path.join(destintation_tiff_folder, tiff_target_name)
+
+            if not os.path.exists(tiff_target_path):
                 with tempfile.TemporaryDirectory() as tempdir:
-                    download_target = os.path.join(tempdir, target_filename)
                     with requests.get(url, stream=True, timeout=60) as response:
-                        # We are pretty coarse with our max x and y and so we might
-                        # create a big rectangle that includes a tile that is all ocean
-                        # and it is easier to just drop 404s.
                         if response.status_code == 404:
                             logging.warning("URL %s not found", url)
                             continue
                         response.raise_for_status()
-                        with open(download_target, 'wb') as output_file:
-                            for chunk in response.iter_content(chunk_size=1024*1024):
-                                output_file.write(chunk)
-                shutil.move(download_target, target_path)
-
-            tiff_target_name = TIFF_NAME_TEMPLATE % (xoffset, yoffset)
-            tiff_target_path = os.path.join(destintation_tiff_folder, tiff_target_name)
-            if not os.path.exists(tiff_target_path):
-                with tempfile.TemporaryDirectory() as tempdir:
-                    unzip(target_path, tempdir)
-                    unziped_tiff_path = os.path.join(tempdir, tiff_target_name)
-                    if not os.path.exists(unziped_tiff_path):
-                        raise FileNotFoundError(errno.ENOENT, "Zip contents not as expected", tiff_target_name)
-                    shutil.move(unziped_tiff_path, tiff_target_path)
+                        with zipfile.ZipFile(io.BytesIO(response.content)) as zipf:
+                            members = zipf.namelist()
+                            for member in members:
+                                unziped_path = zipf.extract(member, path=tempdir)
+                                # The zip file has metadata files in it alongside the TIF we care about
+                                if unziped_path.endswith(".tif"):
+                                    shutil.move(unziped_path, tiff_target_path)
 
 def main() -> None:
-    # To not break the pipeline, I need to support the old non-named arguments and the new named version.
-    # Unfortunately, AFAICT ArgumentParser's exit_on_error doesn't seem to be honoured if required arguments
-    # aren't present, so we need to check sys.argv first before deciding how to handle args in the old or
-    # new style.
-    new_style = all(x in sys.argv for x in ["--project", "--matching", "--zips", "--tifs"])
-
-    if new_style:
-        parser = argparse.ArgumentParser(description="Download SRTM data for a given set of areas.")
-        parser.add_argument(
-            "--project",
-            type=str,
-            required=True,
-            dest="project",
-            help="GeoTIFF file containing pixels in set M as generated by build_m_raster.py"
-        )
-        parser.add_argument(
-            "--matching",
-            type=str,
-            required=True,
-            dest="matching",
-            help="GeoTIFF file containing pixels in set M as generated by build_m_raster.py"
-        )
-        parser.add_argument(
-            "--zips",
-            type=str,
-            required=True,
-            dest="zip_folder_path",
-            help="GeoTIFF file containing pixels in set M as generated by build_m_raster.py"
-        )
-        parser.add_argument(
-            "--tifs",
-            type=str,
-            required=True,
-            dest="tif_folder_path",
-            help="GeoTIFF file containing pixels in set M as generated by build_m_raster.py"
-        )
-        args = parser.parse_args()
-        project_boundaries_filename = args.project
-        pixel_matching_boundaries_filename = args.matching
-        destination_zip_folder = args.zip_folder_path
-        destination_tiff_folder = args.tif_folder_path
-    else:
-        try:
-            project_boundaries_filename = sys.argv[1]
-            pixel_matching_boundaries_filename = sys.argv[2]
-            destination_zip_folder = sys.argv[3]
-            destination_tiff_folder = sys.argv[4]
-        except IndexError:
-            print(f"Usage: {sys.argv[0]} PROJECT_BOUNDARIES PIXEL_MATCHING_AREA "
-                "DESTINATION_ZIP_FOLDER DESTINATION_TIFF_FOLDER", file=sys.stderr)
-            sys.exit(1)
+    parser = argparse.ArgumentParser(description="Download SRTM data for a given set of areas.")
+    parser.add_argument(
+        "--project",
+        type=str,
+        required=True,
+        dest="project",
+        help="GeoJSON file of the project boundaries."
+    )
+    parser.add_argument(
+        "--matching",
+        type=str,
+        required=True,
+        dest="matching",
+        help="GeoJSON file describing the area from which matching pixels may be selected."
+    )
+    parser.add_argument(
+        "--tifs",
+        type=str,
+        required=True,
+        dest="tif_folder_path",
+        help="Destination directory for the SRTM GeoTIFF tiles."
+    )
+    args = parser.parse_args()
+    project_boundaries_filename = args.project
+    pixel_matching_boundaries_filename = args.matching
+    destination_tiff_folder = args.tif_folder_path
 
     download_srtm_data(
         project_boundaries_filename,
         pixel_matching_boundaries_filename,
-        destination_zip_folder,
         destination_tiff_folder
     )
diff --git a/methods/inputs/generate_access_raster.py b/methods/inputs/generate_access_raster.py
new file mode 100644
index 0000000..07e812c
--- /dev/null
+++ b/methods/inputs/generate_access_raster.py
@@ -0,0 +1,84 @@
+import argparse
+import glob
+import os
+import shutil
+import tempfile
+
+from yirgacheffe.layers import GroupLayer, RasterLayer, VectorLayer  # type: ignore
+
+def generate_access_raster(
+    jrc_directory_path: str,
+    matching_zone_filename: str,
+    access_geotif_filename: str,
+    output_filename: str
+) -> None:
+    output_folder, _ = os.path.split(output_filename)
+    os.makedirs(output_folder, exist_ok=True)
+
+    jrc_layer = GroupLayer([
+        RasterLayer.layer_from_file(os.path.join(jrc_directory_path, filename))
+        for filename in glob.glob("*2020*.tif", root_dir=jrc_directory_path)
+    ])
+
+    matching = VectorLayer.layer_from_file(
+        matching_zone_filename,
+        None,
+        jrc_layer.pixel_scale,
+        jrc_layer.projection,
+    )
+
+    access_data = RasterLayer.layer_from_file(access_geotif_filename)
+    access_data.set_window_for_intersection(matching.area)
+
+    temp = RasterLayer.empty_raster_layer_like(access_data)
+    access_data.save(temp)
+
+    with tempfile.TemporaryDirectory() as tempdir:
+        temp_output = os.path.join(tempdir, "rescaled.tif")
+
+        scaled = RasterLayer.scaled_raster_from_raster(temp, jrc_layer.pixel_scale, filename=temp_output)
+        scaled.close()
+
+        shutil.move(temp_output, output_filename)
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Generate a GeoTIFF of access to healthcare data")
+    parser.add_argument(
+        "--jrc",
+        type=str,
+        required=True,
+        dest="jrc_directory_path",
+        help="Directory containing JRC AnnualChange GeoTIFF tiles for all years."
+    )
+    parser.add_argument(
+        "--matching",
+        type=str,
+        required=True,
+        dest="matching_zone_filename",
+        help="Filename of GeoJSON file describing area from which matching pixels may be selected."
+    )
+    parser.add_argument(
+        "--access",
+        type=str,
+        required=True,
+        dest="access_geotif_filename",
+        help="GeoTIFF of travel time to healthcare, as fetched by download_accessibility."
+    )
+    parser.add_argument(
+        "--output",
+        type=str,
+        required=True,
+        dest="output_filename",
+        help="Destination raster file."
+    )
+    args = parser.parse_args()
+
+    generate_access_raster(
+        args.jrc_directory_path,
+        args.matching_zone_filename,
+        args.access_geotif_filename,
+        args.output_filename
+    )
+
+if __name__ == "__main__":
+    main()
diff --git a/methods/inputs/generate_access_tiles.py b/methods/inputs/generate_access_tiles.py
deleted file mode 100644
index 1e72c28..0000000
--- a/methods/inputs/generate_access_tiles.py
+++ /dev/null
@@ -1,72 +0,0 @@
-import glob
-import os
-import re
-import shutil
-import sys
-import tempfile
-from functools import partial
-from multiprocessing import Pool
-
-from yirgacheffe.layers import RasterLayer  # type: ignore
-
-def process_file(
-    access_geotif_filename: str,
-    output_folder: str,
-    jrc_filename: str,
-) -> None:
-    matches = re.match(r".*_([NS]\d+)_([WE]\d+).tif", jrc_filename)
-    assert matches is not None
-    result_filename = f"access_{matches[1]}_{matches[2]}.tif"
-    result_path = os.path.join(output_folder, result_filename)
-    if os.path.exists(result_path):
-        return
-
-    access_data = RasterLayer.layer_from_file(access_geotif_filename)
-
-    with tempfile.TemporaryDirectory() as tempdir:
-
-        jrc_raster = RasterLayer.layer_from_file(jrc_filename)
-        access_data.set_window_for_intersection(jrc_raster.area)
-
-        temp = RasterLayer.empty_raster_layer_like(access_data)
-        access_data.save(temp)
-
-        temp_output = os.path.join(tempdir, result_filename)
-
-        scaled = RasterLayer.scaled_raster_from_raster(temp, jrc_raster.pixel_scale, filename=temp_output)
-        del scaled._dataset
-
-        shutil.move(temp_output, result_path)
-
-def generate_access_tiles(
-    access_geotif_filename: str,
-    jrc_folder: str,
-    output_folder: str
-) -> None:
-    os.makedirs(output_folder, exist_ok=True)
-
-    jrc_files = [os.path.join(jrc_folder, filename) for filename in glob.glob("*2020*.tif", root_dir=jrc_folder)]
-
-    with Pool(processes=50) as pool:
-        pool.map(
-            partial(
-                process_file,
-                access_geotif_filename,
-                output_folder,
-            ),
-            jrc_files
-        )
-
-def main() -> None:
-    try:
-        access_geotif_filename = sys.argv[1]
-        jrc_folder = sys.argv[2]
-        output_folder = sys.argv[3]
-    except IndexError:
-        print(f"Usage: {sys.argv[0]} ACCESS_GEOTIFF JRC_FOLDER OUTPUT_FOLDER", file=sys.stderr)
-        sys.exit(1)
-
-    generate_access_tiles(access_geotif_filename, jrc_folder, output_folder)
-
-if __name__ == "__main__":
-    main()
diff --git a/methods/inputs/generate_country_raster.py b/methods/inputs/generate_country_raster.py
index 45f1590..195aa48 100644
--- a/methods/inputs/generate_country_raster.py
+++ b/methods/inputs/generate_country_raster.py
@@ -37,7 +37,7 @@
 
 def main() -> None:
-    parser = argparse.ArgumentParser(description="Calculates sample pixels in project, aka set K")
+    parser = argparse.ArgumentParser(description="Generate a GeoTIFF from country boundaries")
     parser.add_argument(
         "--jrc",
         type=str,
diff --git a/methods/inputs/generate_ecoregion_rasters.py b/methods/inputs/generate_ecoregion_rasters.py
index 695d619..d2f2f97 100644
--- a/methods/inputs/generate_ecoregion_rasters.py
+++ b/methods/inputs/generate_ecoregion_rasters.py
@@ -1,14 +1,11 @@
+import argparse
 import glob
 import os
-import re
 import shutil
-import sys
 import tempfile
-from functools import partial
-from multiprocessing import Pool
 
 from osgeo import gdal  # type: ignore
-from yirgacheffe.layers import RasterLayer, VectorLayer  # type: ignore
+from yirgacheffe.layers import GroupLayer, RasterLayer, VectorLayer  # type: ignore
 
 # The real cost here is repeated re-drawing of the ecoregions, so cut down on that
 import yirgacheffe.operators  # type: ignore
@@ -18,59 +15,88 @@
 # will throw an error when it hits 200MB unless you override the limit thus
 os.environ["OGR_GEOJSON_MAX_OBJ_SIZE"] = "0"
 
-def process_tile(result_path: str, ecoregions_filename: str, jrc_path: str, ) -> None:
-    matches = re.match(r".*_([NS]\d+)_([WE]\d+).tif", jrc_path)
-    assert matches is not None
-    filename = f"ecoregion_{matches[1]}_{matches[2]}.tif"
-    final_filename = os.path.join(result_path, filename)
+def generate_ecoregion_rasters(
+    jrc_directory_path: str,
+    matching_zone_filename: str,
+    ecoregions_filename: str,
+    output_filename: str
+) -> None:
+    output_folder, _ = os.path.split(output_filename)
+    os.makedirs(output_folder, exist_ok=True)
+
+    jrc_layer = GroupLayer([
+        RasterLayer.layer_from_file(os.path.join(jrc_directory_path, filename))
+        for filename in glob.glob("*2020*.tif", root_dir=jrc_directory_path)
+    ])
+
+    matching = VectorLayer.layer_from_file(
+        matching_zone_filename,
+        None,
+        jrc_layer.pixel_scale,
+        jrc_layer.projection,
+    )
 
-    try:
-        raster = RasterLayer.layer_from_file(final_filename)
-        if raster.sum() > 0:
-            return
-    except (FileNotFoundError, TypeError):
-        # File doesn't exist or is corrupt, so we need to do the work
-        pass
+    ecoregions = VectorLayer.layer_from_file(
+        ecoregions_filename,
+        None,
+        jrc_layer.pixel_scale,
+        jrc_layer.projection,
+        datatype=gdal.GDT_UInt16,
+        burn_value="ECO_ID"
+    )
+
+    layers = [jrc_layer, matching, ecoregions]
+    intersection = RasterLayer.find_intersection(layers)
+    for layer in layers:
+        layer.set_window_for_intersection(intersection)
 
     with tempfile.TemporaryDirectory() as tempdir:
-        jrc_raster = RasterLayer.layer_from_file(jrc_path)
-        ecoregions = VectorLayer.layer_from_file(
-            ecoregions_filename,
-            None,
-            jrc_raster.pixel_scale,
-            jrc_raster.projection,
-            datatype=gdal.GDT_UInt16,
-            burn_value="ECO_ID"
-        )
+        target_filename = os.path.join(tempdir, "ecoregion.tif")
 
-        target_filename = os.path.join(tempdir, filename)
-        result = RasterLayer.empty_raster_layer_like(jrc_raster, filename=target_filename,
+        result = RasterLayer.empty_raster_layer_like(jrc_layer, filename=target_filename,
                                                      datatype=ecoregions.datatype, compress=False)
-        ecoregions.set_window_for_intersection(jrc_raster.area)
         ecoregions.save(result)
-        del result._dataset
-
-        shutil.move(target_filename, final_filename)
-
-def generate_ecoregion_rasters(
-    ecoregions_filename: str,
-    jrc_folder: str,
-    output_folder: str,
-) -> None:
-    os.makedirs(output_folder, exist_ok=True)
-    jrc_files = [os.path.join(jrc_folder, filename) for filename in glob.glob("*2020*.tif", root_dir=jrc_folder)]
-    with Pool(processes=50) as pool:
-        pool.map(partial(process_tile, output_folder, ecoregions_filename), jrc_files)
+        result.close()
+        shutil.move(target_filename, output_filename)
 
 def main() -> None:
-    try:
-        ecoregions_filename = sys.argv[1]
-        jrc_folder = sys.argv[2]
-        output_folder = sys.argv[3]
-    except IndexError:
-        print(f"Usage: {sys.argv[0]} ECOREGIONS_GEOJSON JRC_FOLDER OUTPUT_FOLDER", file=sys.stderr)
-        sys.exit(1)
+    parser = argparse.ArgumentParser(description="Generate a GeoTIFF from ecoregion data")
+    parser.add_argument(
+        "--jrc",
+        type=str,
+        required=True,
+        dest="jrc_directory_path",
+        help="Directory containing JRC AnnualChange GeoTIFF tiles for all years."
+    )
+    parser.add_argument(
+        "--matching",
+        type=str,
+        required=True,
+        dest="matching_zone_filename",
+        help="Filename of GeoJSON file describing area from which matching pixels may be selected."
+    )
+    parser.add_argument(
+        "--ecoregions",
+        type=str,
+        required=True,
+        dest="ecoregions_filename",
+        help="GeoJSON of ecoregions."
+    )
+    parser.add_argument(
+        "--output",
+        type=str,
+        required=True,
+        dest="output_filename",
+        help="Destination raster file."
+    )
+    args = parser.parse_args()
 
-    generate_ecoregion_rasters(ecoregions_filename, jrc_folder, output_folder)
+    generate_ecoregion_rasters(
+        args.jrc_directory_path,
+        args.matching_zone_filename,
+        args.ecoregions_filename,
+        args.output_filename,
+    )
 
 if __name__ == "__main__":
     main()
diff --git a/methods/inputs/generate_fine_circular_coverage.py b/methods/inputs/generate_fine_circular_coverage.py
index d98f9f8..5194c91 100644
--- a/methods/inputs/generate_fine_circular_coverage.py
+++ b/methods/inputs/generate_fine_circular_coverage.py
@@ -151,7 +151,9 @@ def generate_fine_circular_coverage(
     filesets = []
     for year in years:
         for area in areas:
-            filesets.append([os.path.join(jrc_directory, x) for x in jrc_filenames if year in x and area in x])
+            fileset = [os.path.join(jrc_directory, x) for x in jrc_filenames if year in x and area in x]
+            if fileset:
+                filesets.append(fileset)
 
     total_files = sum(len(files) for files in filesets)
 
diff --git a/methods/matching/calculate_k.py b/methods/matching/calculate_k.py
index dfecd05..1df4153 100644
--- a/methods/matching/calculate_k.py
+++ b/methods/matching/calculate_k.py
@@ -66,10 +66,7 @@ def build_layer_collection(
 
     # ecoregions is such a heavy layer it pays to just rasterize it once - we should possibly do this once
     # as part of import of the ecoregions data
-    ecoregions = GroupLayer([
-        RasterLayer.layer_from_file(os.path.join(ecoregions_directory_path, filename)) for filename in
-        glob.glob("*.tif", root_dir=ecoregions_directory_path)
-    ], name="ecoregions")
+    ecoregions = RasterLayer.layer_from_file(ecoregions_directory_path)
 
     elevation = GroupLayer([
         RasterLayer.layer_from_file(os.path.join(elevation_directory_path, filename)) for filename in
@@ -80,10 +77,7 @@
         glob.glob("slope*.tif", root_dir=slope_directory_path)
     ], name="slopes")
 
-    access = GroupLayer([
-        RasterLayer.layer_from_file(os.path.join(access_directory_path, filename)) for filename in
-        glob.glob("*.tif", root_dir=access_directory_path)
-    ], name="access")
+    access = RasterLayer.layer_from_file(access_directory_path)
 
     countries = RasterLayer.layer_from_file(countries_raster_filename)
 
diff --git a/scripts/tmfpython.sh b/scripts/tmfpython.sh
index beb8178..c1f2397 100755
--- a/scripts/tmfpython.sh
+++ b/scripts/tmfpython.sh
@@ -107,7 +107,6 @@ echo "--Matching area created.--"
 #Download SRTM data
 tmfpython3 -m methods.inputs.download_srtm_data --project "${input_dir}/${proj}.geojson" \
 --matching "${output_dir}/${proj}/matching-area.geojson" \
---zips "${output_dir}/srtm/zip" \
 --tifs "${output_dir}/srtm/tif"
 echo "--SRTM downloaded.--"
 
@@ -132,6 +131,20 @@ tmfpython3 -m methods.inputs.generate_country_raster --jrc /maps/forecol/data/JR
 --output "${output_dir}/${proj}/countries.tif"
 echo "--Country raster created.--"
 
+#Create ecoregion raster
+tmfpython3 -m methods.inputs.generate_ecoregion_rasters --jrc /maps/forecol/data/JRC/v1_2022/AnnualChange/tifs \
+--matching "${output_dir}/${proj}/matching-area.geojson" \
+--ecoregions /maps/4C/ecoregions/ecoregions.geojson \
+--output "${output_dir}/${proj}/ecoregion.tif"
+echo "--Ecoregions raster created.--"
+
+#Create access raster
+tmfpython3 -m methods.inputs.generate_access_raster --jrc /maps/forecol/data/JRC/v1_2022/AnnualChange/tifs \
+--matching "${output_dir}/${proj}/matching-area.geojson" \
+--access /maps/4C/access/raw/202001_Global_Walking_Only_Travel_Time_To_Healthcare_2019.tif \
+--output "${output_dir}/${proj}/access.tif"
+echo "--Access raster created.--"
+
 #Matching: calculate set K
 tmfpython3 -m methods.matching.calculate_k \
 --project "${input_dir}/${proj}.geojson" \
@@ -139,10 +152,10 @@ tmfpython3 -m methods.matching.calculate_k \
 --evaluation_year "$eval_year" \
 --jrc /maps/forecol/data/JRC/v1_2022/AnnualChange/tifs \
 --cpc /maps/rhm31/fine_circular_coverage/forecol_complete/ \
---ecoregions /maps/4C/ecoregions/ \
+--ecoregions "${output_dir}/${proj}/ecoregion.tif" \
 --elevation "${output_dir}/rescaled-elevation" \
 --slope "${output_dir}/rescaled-slopes" \
---access /maps/4C/access \
+--access "${output_dir}/${proj}/access.tif" \
 --countries-raster "${output_dir}/${proj}/countries.tif" \
 --output "${output_dir}/${proj}/k.parquet"
 echo "--Set K created.--"
 
@@ -155,10 +168,10 @@ tmfpython3 -m methods.matching.find_potential_matches \
 --evaluation_year "$eval_year" \
 --jrc /maps/forecol/data/JRC/v1_2022/AnnualChange/tifs \
 --cpc /maps/rhm31/fine_circular_coverage/forecol_complete/ \
---ecoregions /maps/4C/ecoregions/ \
+--ecoregions "${output_dir}/${proj}/ecoregion.tif" \
 --elevation "${output_dir}/rescaled-elevation" \
 --slope "${output_dir}/rescaled-slopes" \
---access /maps/4C/access \
+--access "${output_dir}/${proj}/access.tif" \
 --countries-raster "${output_dir}/${proj}/countries.tif" \
 --output "${output_dir}/${proj}/matches"
 tmfpython3 -m methods.matching.build_m_raster \
@@ -172,10 +185,10 @@ tmfpython3 -m methods.matching.build_m_table \
 --evaluation_year "$eval_year" \
 --jrc /maps/forecol/data/JRC/v1_2022/AnnualChange/tifs \
 --cpc /maps/rhm31/fine_circular_coverage/forecol_complete/ \
---ecoregions /maps/4C/ecoregions/ \
+--ecoregions "${output_dir}/${proj}/ecoregion.tif" \
 --elevation "${output_dir}/rescaled-elevation" \
 --slope "${output_dir}/rescaled-slopes" \
---access /maps/4C/access \
+--access "${output_dir}/${proj}/access.tif" \
 --countries-raster "${output_dir}/${proj}/countries.tif" \
"${output_dir}/${proj}/countries.tif" \ --output "${output_dir}/${proj}/matches.parquet" echo "--Set M created.--"