Prompts used to control the extraction

The same pair of prompts are used for all models.

By convention, there are two prompts:

  1. The system prompt, which sets the overall task and context

  2. The user prompt, which provides the specific instructions and input data

I’ve done a bit of experimentation with the prompts, but not much. A major challenge is getting the models to output valid JSON in exactly the format specified. There is probably room for improvement with these prompts, but they work well enough.

# Prompts for SmolVLM to extract rainfall data from tables in historical documents.

# System prompt
s_prompt = (
    "You are a climate scientist. Your task is to extract climate data from pages containing historical observations. "
    + "The page you are working on is a record of monthly rainfall from one UK weather station. "
    + "At the top of each page is the name of a weather station, and the number of the station. "
    + "The station name will follow the words 'RAINFALL at' in the top centre of the page. "
    + "The station number will be in the top-right corner of the page. "
    + "The page contains a table with monthly rainfall data for ten years,  "
    + "The first row of the table gives the years. There will be 10 years."
    + "The first column of the table is the month name, starting with January at the top and December at the bottom. "
    + "The bulk of the table gives values for each calendar month, in each of the ten years. "
    + "Each column is the data for one year, January at the top, December at the bottom. "
    + "There is sometimes an extra column on the right (after the last year) - ignore this column. "
    + "Each row is the data for one calendar month, with the first year on the left and the last year on the right. "
    + "At the bottom of the table is an extra row with totals for each year. "
    + "Sometimes some of the data values are missing, left blank. For missing values, return 'null'."
    + "Return exactly one JSON object that follows the schema in the user message. Do NOT output any explanation or extra text."
)

u_prompt = (
    "Fill the following JSON template. Keep the exact keys and structure."
    ' Use strings for numbers and "null" for missing entries.\n\n'
    "{\n"
    '  "Name": "<name>",\n'
    '  "Number": "<number>",\n'
    '  "Years": ["<Y1>","<Y2>","<Y3>","<Y4>","<Y5>","<Y6>","<Y7>","<Y8>","<Y9>","<Y10>"],\n'
    '  "January": ["<v1>","<v2>","<v3>","<v4>","<v5>","<v6>","<v7>","<v8>","<v9>","<v10>"],\n'
    '  "February": ["<v1>","<v2>","<v3>","<v4>","<v5>","<v6>","<v7>","<v8>","<v9>","<v10>"],\n'
    '  "March": ["<v1>","<v2>","<v3>","<v4>","<v5>","<v6>","<v7>","<v8>","<v9>","<v10>"],\n'
    '  "April": ["<v1>","<v2>","<v3>","<v4>","<v5>","<v6>","<v7>","<v8>","<v9>","<v10>"],\n'
    '  "May": ["<v1>","<v2>","<v3>","<v4>","<v5>","<v6>","<v7>","<v8>","<v9>","<v10>"],\n'
    '  "June": ["<v1>","<v2>","<v3>","<v4>","<v5>","<v6>","<v7>","<v8>","<v9>","<v10>"],\n'
    '  "July": ["<v1>","<v2>","<v3>","<v4>","<v5>","<v6>","<v7>","<v8>","<v9>","<v10>"],\n'
    '  "August": ["<v1>","<v2>","<v3>","<v4>","<v5>","<v6>","<v7>","<v8>","<v9>","<v10>"],\n'
    '  "September": ["<v1>","<v2>","<v3>","<v4>","<v5>","<v6>","<v7>","<v8>","<v9>","<v10>"],\n'
    '  "October": ["<v1>","<v2>","<v3>","<v4>","<v5>","<v6>","<v7>","<v8>","<v9>","<v10>"],\n'
    '  "November": ["<v1>","<v2>","<v3>","<v4>","<v5>","<v6>","<v7>","<v8>","<v9>","<v10>"],\n'
    '  "December": ["<v1>","<v2>","<v3>","<v4>","<v5>","<v6>","<v7>","<v8>","<v9>","<v10>"],\n'
    '  "Totals": ["<v1>","<v2>","<v3>","<v4>","<v5>","<v6>","<v7>","<v8>","<v9>","<v10>"]\n'
    "}\n\n"
    "Replace placeholders in angle brackets <> with values. "
    "Arrays must have 10 entries for Years and months. "
    "If a value is missing, put null (without extra text). "
    "Output only the JSON object, nothing else."
)