How to reproduce and extend this work

This project is designed to be easy to reproduce and extend. Everything involved is kept under version control in a git repository hosted on GitHub at https://github.com/philip-brohan/Robot_Rainfall_Rescue (the documentation is made with GitHub Pages). That repository contains everything you need to reproduce or extend this work.

If you are familiar with GitHub, you already know what to do (fork or clone the repository). If you’d prefer not to bother with that, you can download the whole thing as a zip file.

As well as downloading the software, some setup is necessary to run it successfully:

These scripts will only work in a Python environment with the appropriate Python version and libraries available. I use conda to manage the required Python environment, which is specified in a yaml file:

name: rrr-spice
channels:
  - conda-forge
dependencies:
# Basics
  - python=3.10
  - matplotlib=3.10
  - black  # Code formatter
  - sphinx  # Documentation
# Huggingface transformers library
  - huggingface_hub>=0.33
  - transformers>=4.51.3
#  - diffusers>=0.35.0
  
# Download images from Google Drive
  - rclone=1.69

# Azure
  - marshmallow=3.26  # V4 breaks azure login
  - azure-ai-ml = 1.23
  - azure-identity = 1.21
  - azure-storage-file-datalake = 12.19
  - azure-keyvault = 4.3

# Torch backend for transformers
  - pip
  - pip:
      - torch>=2.4.0
      - torchvision
      - hf_xet
      - accelerate==1.4.0
      - compressed-tensors
      - evaluate==0.4.3
      - bitsandbytes>=0.45.3
      - trl==0.15.2
      - peft>=0.15.0
      - pillow==11.1.0
      - protobuf
      - sentencepiece
      - flash-attention
      - datasets==3.3.2
      - tensorboardX
      - tensorboard
      - opencensus-ext-logging
      - git+https://github.com/huggingface/diffusers
     

variables:
# Tell python to look for modules in the root directory of the project
# (A hack, needs to be edited for every installation, but makes code
#  management much easier.)
# Replace with the path to your project directory root.
  PYTHONPATH: /home/users/philip.brohan/Projects/Robot_Rainfall_Rescue

# Project data path
  PDIR: /data/scratch/philip.brohan/Robot_Rainfall_Rescue

# Tell huggingface where to store model weights
  HF_HOME: /data/users/philip.brohan/huggingface

# Azure ML subscription, workspace and resource group
  AZML_SUBSCRIPTION_ID: 79c7890c-2a30-44ef-aa8d-419d25b7bb8e
  AZML_WORKSPACE_NAME: mlw-llmdatarescue-uksouth-01
  AZML_RESOURCE_GROUP: rg-climate-llmdatarescue
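
The paths and IDs in the variables section above are specific to my installation and need editing for yours. As a quick sanity check after activating the environment, something like the following sketch can report which of those variables are unset or point at missing directories. The function name check_environment is hypothetical (it is not part of the repository), and treating the Azure variables as optional is my assumption, based on the training and analysis code not being Azure-specific.

```python
import os
from pathlib import Path

# Variable names taken from the `variables:` section of the environment file.
PATH_VARS = ("PYTHONPATH", "PDIR", "HF_HOME")
AZURE_VARS = ("AZML_SUBSCRIPTION_ID", "AZML_WORKSPACE_NAME", "AZML_RESOURCE_GROUP")


def check_environment(environ=None):
    """Return a list of human-readable problems; an empty list means all is well.

    Hypothetical helper for illustration, not part of the repository.
    """
    env = os.environ if environ is None else environ
    problems = []
    for name in PATH_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
            continue
        # PYTHONPATH may hold several entries separated by os.pathsep
        for entry in value.split(os.pathsep):
            if entry and not Path(entry).is_dir():
                problems.append(f"{name}: directory does not exist: {entry}")
    for name in AZURE_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set (only needed for Azure ML)")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Run it inside the activated conda environment; no output means every variable is set and every path exists.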

As always with ML work, you will need access to a GPU. I use a dedicated MS Azure ML workspace for this, so this repository contains some scripts and configuration files specific to that. But the actual training and analysis code is not specific to Azure, and should work in any suitable Python environment with access to a GPU. I mostly used 80GB H100s; 60GB A100s will do instead, and all the scripts here will run on a single 60GB A100. I have not tested any of this on smaller GPUs.
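
Before launching anything expensive, it can help to confirm which GPUs the environment can actually see and how much memory each has. The sketch below is one way to do that with the torch package from the environment file; the function name describe_gpus is hypothetical and not part of the repository, and it deliberately returns an empty list (rather than failing) when torch is not installed or no CUDA device is visible.

```python
import importlib.util


def describe_gpus():
    """Return a list of (name, memory_in_GB) tuples, one per visible CUDA device.

    Hypothetical helper for illustration. Returns an empty list if torch
    is not installed or no CUDA device is visible.
    """
    if importlib.util.find_spec("torch") is None:
        return []
    import torch

    if not torch.cuda.is_available():
        return []
    gpus = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        # total_memory is reported in bytes; convert to decimal gigabytes
        gpus.append((props.name, props.total_memory / 1e9))
    return gpus


if __name__ == "__main__":
    found = describe_gpus()
    if not found:
        print("No CUDA GPU visible from this Python environment")
    for name, memory_gb in found:
        print(f"{name}: {memory_gb:.0f} GB")
```

Compare the reported memory against the GPU sizes mentioned above to judge whether the scripts are likely to fit.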

The project documentation (these web pages) is included in the repository (in the docs directory). It is written in reStructuredText and built with the Sphinx documentation generator.