Download the Rainfall Rescue data
The images used are the 10-year rainfall sheets: loose-leaf forms, each recording monthly rainfall at a single UK station over a period of 10 years. The UK Meteorological Archive holds about 65,000 such sheets, covering 1677-1960 (though the early years include very few stations).
A key advantage of these documents is that they were manually digitized in spring 2020 by the Rainfall Rescue citizen science project, and Ed Hawkins, PI of that project, is sharing the transcriptions through a GitHub repository. So we can get image::transcription pairs from that repository.
Download the repository:
#!/bin/bash
# Get Ed Hawkins' Rainfall Rescue github repository
# (contains all the images and csv files with transcribed data).
# Note - 17GB zip file, 19GB unpacked
wget -O "$SCRATCH/rainfall-rescue.zip" https://github.com/ed-hawkins/rainfall-rescue/archive/master.zip
unzip -d "$SCRATCH" "$SCRATCH/rainfall-rescue.zip"
# Reset all the access times (or $SCRATCH will delete them as too old).
find "$SCRATCH/rainfall-rescue-master" -type f -exec touch {} +
# Copy the images from Ed's GDrive (shared with me).
rclone copy gdrive:IMAGES --drive-shared-with-me /data/scratch/philip.brohan/rainfall-rescue-master/IMAGES
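Before running the extraction step, it can be worth confirming that the download unpacked as expected. The following is a minimal sketch, not part of the original scripts: it assumes the tree sits under `$SCRATCH/rainfall-rescue-master` as in the script above, and the function name `count_by_extension` is hypothetical.

```python
import os
from collections import Counter


def count_by_extension(root):
    """Count the files under root, grouped by lower-cased extension."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            ext = os.path.splitext(filename)[1].lower()
            counts[ext] += 1
    return counts


if __name__ == "__main__":
    # Assumed location: the directory created by the unzip step above.
    root = os.path.join(os.getenv("SCRATCH", "."), "rainfall-rescue-master")
    for ext, n in sorted(count_by_extension(root).items()):
        print("%-8s %d" % (ext or "(none)", n))
```

A large count for `.jpg` and `.csv` suggests both the repository and the image copy completed; a missing or very small `.jpg` count would point at the rclone step.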
Extract the available image::transcription pairs:
#!/usr/bin/env python
# Get all the error-free jpg::csv pairs from Ed's Rainfall Rescue dataset
import sys
import os
import shutil
import argparse
parser = argparse.ArgumentParser()
parser.add_argument(
    "--ind",
    help="CSV directory",
    type=str,
    required=False,
    default="%s/rainfall-rescue-master/DATA/" % os.getenv("SCRATCH"),
)
parser.add_argument(
    "--imd",
    help="Image directory",
    type=str,
    required=False,
    default="%s/rainfall-rescue-master/IMAGES/" % os.getenv("SCRATCH"),
)
parser.add_argument(
    "--outd",
    help="Output directory",
    type=str,
    required=False,
    default="%s/Robot_Rainfall_Rescue/from_Ed/" % os.getenv("SCRATCH"),
)
args = parser.parse_args()
# Make the output directory (on SCRATCH - lots of Gb of images)
for subd in ("images", "csvs"):
    sd = "%s/%s" % (args.outd, subd)
    if not os.path.isdir(sd):
        os.makedirs(sd)
# Find all the images
images = {}
for dirpath, dirnames, filenames in os.walk(args.imd):
    for filename in filenames:
        if filename[-4:] != ".jpg":
            continue
        images[filename[:-4]] = dirpath + "/" + filename
# Find all the csvs
csvs = {}
for dirpath, dirnames, filenames in os.walk(args.ind):
    for filename in filenames:
        if filename[-4:] != ".csv":
            continue
        csvs[filename[:-4]] = dirpath + "/" + filename
# For each jpg, if there is a csv with the same name, copy both.
for image in images.keys():
    if image not in csvs:
        print("No csv for %s" % image)
        continue
    shutil.copy(images[image], "%s/%s" % (args.outd, "images"))
    shutil.copy(csvs[image], "%s/%s" % (args.outd, "csvs"))
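Once the copy has run, the output directory can be checked for consistency. The sketch below is an illustration rather than part of the original workflow: it assumes the `images` and `csvs` subdirectories created by the script above, and the helper names `stems` and `unpaired` are hypothetical. It matches files by name with the extension removed, the same key the script uses.

```python
import os


def stems(directory, extension):
    """Return the names of files in directory ending in extension, extension removed."""
    return {
        name[: -len(extension)]
        for name in os.listdir(directory)
        if name.endswith(extension)
    }


def unpaired(outd):
    """Return (images without a csv, csvs without an image) under outd."""
    image_stems = stems(os.path.join(outd, "images"), ".jpg")
    csv_stems = stems(os.path.join(outd, "csvs"), ".csv")
    return sorted(image_stems - csv_stems), sorted(csv_stems - image_stems)
```

Calling `unpaired` on the `--outd` directory should return two empty lists if the script completed, since it copies a file only when its partner exists. A non-empty list would suggest an interrupted run or files added by hand.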