Page locations¶

The Gemini documentation describes a way to get page bounding boxes for items on the page. This is not a vital feature, but it is useful for debugging and for understanding how Gemini is interpreting the page.

So we modify the existing code to request a bounding box for each item.

#!/usr/bin/env python3

# Basic test of the Gemini API - get the station metadata as
#  structured output.

import os
import PIL.Image
import google.generativeai as genai
import typing_extensions as typing

# You will need an API key get it from https://ai.google.dev/gemini-api/docs/api-key

# I keep my API key in the .gemini_api file in my home directory.
with open("%s/.gemini_api" % os.getenv("HOME"), "r") as file:
    api_key = file.read().strip()

# Default protocol is 'GRPC' - but that is blocked by the Office firewall.
#  Use 'REST' instead.
genai.configure(api_key=api_key, transport="rest")


# Specify a structure for the desired output
class MetaData(typing.TypedDict):
    Year: int
    Year_bounding_box: list[int]
    StationNumber: int
    StationNumber_bounding_box: list[int]
    Location: str
    Location_bounding_box: list[int]
    County: str
    County_bounding_box: list[int]
    Sea_level_height: int
    Sea_level_height_bounding_box: list[int]
    Gauge_diameter: int
    Gauge_diameter_bounding_box: list[int]
    Gauge_height_feet: int
    Gauge_height_feet_bounding_box: list[int]
    Gauge_height_inches: int
    Gauge_height_inches_bounding_box: list[int]


# Load the sample image
img = PIL.Image.open(
    "../../images/jpgs_300dpi/Devon_1941-1950_RainNos_1651-1689-293.jpg"
)

# Pick an AI to use - this one is the latest as of 2025-01-29
model = genai.GenerativeModel("gemini-2.0-flash-exp")

# Ask a question about the image
result = model.generate_content(
    [
        img,
        "\n\n",
        "List the station metadata. Also give the bounding box of each item in [xmin,ymin,xmax,ymax] format.",
    ],
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json", response_schema=MetaData
    ),
)
# Structured data as JSON
with open("locations.json", "w") as file:
    file.write(result.text)
with open("rest.txt", "w") as file:
    file.write(str(result))

And bounding boxes are indeed added to the JSON output

{
"County": "DEVON",
"County_bounding_box": [
108,
211,
122,
254
],
"Gauge_diameter": 5,
"Gauge_diameter_bounding_box": [
122,
218,
136,
255
],
"Gauge_height_feet": 1,
"Gauge_height_feet_bounding_box": [
135,
267,
149,
292
],
 "Gauge_height_inches": 3,
"Gauge_height_inches_bounding_box": [
135,
301,
148,
321
],
"Location": "BADWORTHY COTTAGE, S. BRENT",
"Location_bounding_box": [
80,
86,
109,
398
],
"Sea_level_height": 550,
"Sea_level_height_bounding_box": [
147,
288,
162,
381
],
 "StationNumber": 1678,
"StationNumber_bounding_box": [
45,
759,
66,
828
],
"Year": 1947,
"Year_bounding_box": [
64,
514,
88,
603
]
}

Unfortunately it doesn’t work - the locations are not correct. It sort-of-works, and it’s not clear exactly what is going wrong, but at the moment this feature is not useable.

Image showing bounding boxes for extracted data

Program to plot the bounding box diagnostic

Page locations¶

Table Of Contents

Get the code

Found a bug, or have a suggestion?