Download the Rainfall Rescue data

The images used are the 10-year rainfall sheets: loose-leaf forms, each recording monthly rainfall at a single UK station over a period of 10 years. The UK Meteorological Archive holds about 65,000 such sheets covering 1677-1960 (though the early years include very few stations).

A key advantage of these documents is that they were manually digitized in spring 2020 by the Rainfall Rescue citizen science project. Ed Hawkins, PI of that project, is sharing the transcriptions through a GitHub repository, so we can get image::transcription pairs from that repository.

Download the repository:

#!/bin/bash

# Get Ed Hawkins' Rainfall Rescue github repository
#  (contains all the images and csv files with transcribed data).
#  Note - 17 GB zip file, 19 GB unpacked

wget -O "$SCRATCH/rainfall-rescue.zip" https://github.com/ed-hawkins/rainfall-rescue/archive/master.zip
unzip -d "$SCRATCH" "$SCRATCH/rainfall-rescue.zip"

# Reset all the access times (or $SCRATCH will delete them as too old).
find "$SCRATCH/rainfall-rescue-master" -type f -exec touch {} +

# Copy the images from Ed's GDrive (shared with me).
rclone copy gdrive:IMAGES --drive-shared-with-me /data/scratch/philip.brohan/rainfall-rescue-master/IMAGES
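
Because the download is large and arrives in two steps (the zip file, then the images from GDrive), a quick count of what actually arrived can be useful before moving on. The following is a minimal sketch, not part of the original workflow: the helper name `count_by_extension` is hypothetical, and the `DATA` and `IMAGES` subdirectory names are taken from the defaults of the extraction script below.

```python
import os
from collections import Counter


def count_by_extension(root):
    """Count files under root, keyed by lower-cased file extension."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            counts[os.path.splitext(filename)[1].lower()] += 1
    return counts


if __name__ == "__main__":
    # Assumes $SCRATCH is set as in the download script; falls back to "."
    repo = os.path.join(os.getenv("SCRATCH", "."), "rainfall-rescue-master")
    for subdir in ("DATA", "IMAGES"):
        print(subdir, dict(count_by_extension(os.path.join(repo, subdir))))
```

If the `IMAGES` count of `.jpg` files is zero or much smaller than the `DATA` count of `.csv` files, the `rclone` step probably did not complete.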

Extract the available image::transcription pairs:

#!/usr/bin/env python

# Get all the error-free jpg::csv pairs from Ed's Rainfall Rescue dataset

import os
import shutil
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--ind",
    help="CSV directory",
    type=str,
    required=False,
    default="%s/rainfall-rescue-master/DATA/" % os.getenv("SCRATCH"),
)
parser.add_argument(
    "--imd",
    help="Image directory",
    type=str,
    required=False,
    default="%s/rainfall-rescue-master/IMAGES/" % os.getenv("SCRATCH"),
)
parser.add_argument(
    "--outd",
    help="Output directory",
    type=str,
    required=False,
    default="%s/Robot_Rainfall_Rescue/from_Ed/" % os.getenv("SCRATCH"),
)
args = parser.parse_args()

# Make the output directories (on SCRATCH - many GB of images)
for subd in ("images", "csvs"):
    os.makedirs(os.path.join(args.outd, subd), exist_ok=True)

# Find all the images, indexed by file name without the extension
images = {}
for dirpath, dirnames, filenames in os.walk(args.imd):
    for filename in filenames:
        if not filename.endswith(".jpg"):
            continue
        images[filename[:-4]] = os.path.join(dirpath, filename)

# Find all the csvs, indexed the same way
csvs = {}
for dirpath, dirnames, filenames in os.walk(args.ind):
    for filename in filenames:
        if not filename.endswith(".csv"):
            continue
        csvs[filename[:-4]] = os.path.join(dirpath, filename)

# For each jpg, if there is a csv with the same name, copy both.
for image in images:
    if image not in csvs:
        print("No csv for %s" % image)
        continue
    shutil.copy(images[image], os.path.join(args.outd, "images"))
    shutil.copy(csvs[image], os.path.join(args.outd, "csvs"))
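
The script above reports images that have no csv, but not csvs that have no image. The matching step can be sketched with set operations, which give both lists at once. This is an illustrative sketch only: `index_by_stem` and `pair_report` are hypothetical helper names, and unlike the script above the extension match here is case-insensitive.

```python
import os


def index_by_stem(root, extension):
    """Map file stem -> full path for files under root with this extension."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            stem, ext = os.path.splitext(filename)
            if ext.lower() == extension:
                index[stem] = os.path.join(dirpath, filename)
    return index


def pair_report(images, csvs):
    """Return sorted stems that are paired, image-only, and csv-only."""
    paired = sorted(images.keys() & csvs.keys())
    images_only = sorted(images.keys() - csvs.keys())
    csvs_only = sorted(csvs.keys() - images.keys())
    return paired, images_only, csvs_only
```

Run against the output directory (`images` and `csvs` under `--outd`), both unmatched lists should be empty, since only complete pairs are copied there.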