Screenshot to Structured Data: Extract Information from Visual Web Pages
Not every web page yields its data to a DOM parser. Canvas-rendered dashboards, heavily JavaScript-driven single-page applications, and pages that gate real content behind a paywalled preview are all notoriously resistant to traditional HTML scraping. You can fire off a fetch() and get a skeleton of empty <div> tags while the actual numbers or text you need live inside a WebGL canvas or are injected only after a client-side authentication check.
The solution that sidesteps all of this is surprisingly simple: take a screenshot of the page as a real browser sees it, then hand that image to a vision-capable large language model and ask it to return the data you need as structured JSON.
This article walks through the complete pipeline — from capturing the screenshot with KnowledgeSDK's /v1/screenshot endpoint, to prompting GPT-4o-mini or Gemini Flash, to validating the structured output. We also benchmark cost and accuracy across three real-world page types: an analytics dashboard with charts, an e-commerce product card with dynamically rendered pricing, and a paywalled article preview.
Why Traditional Scraping Fails on Visual Pages
Before going into the solution, it is worth being precise about the failure modes.
Canvas and WebGL charts. Charting libraries like Chart.js, Highcharts, and some D3 or Recharts configurations render into a <canvas> element, so the raw DOM contains nothing useful — the actual data lives in JavaScript memory. (SVG-based charts at least leave shapes in the DOM, but recovering the numbers from path coordinates is scarcely easier.) Headless browsers can execute the JavaScript, but extracting the data still requires either intercepting the network requests that fed the chart or parsing the JavaScript state, both of which are fragile.
Single-page applications with deferred rendering. React, Vue, and Angular applications frequently fetch data after the initial HTML is delivered. A fast scraper that doesn't wait for network idle will capture the loading skeleton, not the content. Even with a proper headless browser, the timing logic can be unreliable across sites that use different loading patterns.
Paywalled or gated previews. Some sites show a "preview" version of content to unauthenticated users that is generated entirely client-side, blurring text or overlaying subscription prompts via CSS or JavaScript. The underlying HTML may contain the full text, but extracting it requires understanding which parts are actually visible versus hidden.
A screenshot captures exactly what a human user sees, regardless of how it was generated. A vision LLM then applies the same visual understanding a human would use to interpret a chart, read a price tag, or identify which parts of an article are visible in the preview.
The Pipeline
The architecture has three steps:
- Screenshot — call POST /v1/screenshot with the target URL. KnowledgeSDK runs a real Chromium browser, waits for the page to fully render, and returns a base64-encoded PNG.
- Vision extraction — send the PNG to a vision LLM with a structured output prompt. The model returns a JSON object matching your schema.
- Validation — parse and validate the JSON against a schema to catch model errors before they propagate downstream.
URL → KnowledgeSDK /v1/screenshot → base64 PNG → Vision LLM → JSON → Validation → Your App
Implementation: Node.js
Install dependencies:
npm install @knowledgesdk/node openai zod
Set environment variables:
KNOWLEDGESDK_API_KEY=knowledgesdk_live_...
OPENAI_API_KEY=sk-...
Extracting an E-Commerce Product Card
import KnowledgeSDK from "@knowledgesdk/node";
import OpenAI from "openai";
import { z } from "zod";
const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
// Define the schema for the structured output
const ProductSchema = z.object({
  name: z.string(),
  price: z.number(),
  currency: z.string(),
  // The prompt asks the model to return null for unknown fields, so these
  // must accept null as well as being absent (nullish = nullable + optional).
  originalPrice: z.number().nullish(),
  discountPercent: z.number().nullish(),
  rating: z.number().nullish(),
  reviewCount: z.number().nullish(),
  inStock: z.boolean(),
  variants: z.array(z.string()).nullish(),
});
type Product = z.infer<typeof ProductSchema>;
async function extractProductFromScreenshot(url: string): Promise<Product> {
  // Step 1: Capture the screenshot
  const screenshotResult = await ks.screenshot(url);
  const base64Image = screenshotResult.data; // base64-encoded PNG

  // Step 2: Send to vision LLM for structured extraction
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content: `You are a structured data extraction assistant. Given a screenshot of an e-commerce product page,
extract the visible product information and return it as JSON matching this schema:
{
  "name": string,
  "price": number (current price, as a float),
  "currency": string (e.g. "USD", "EUR"),
  "originalPrice": number | null (if a strikethrough original price is visible),
  "discountPercent": number | null (if a discount percentage is shown),
  "rating": number | null (average star rating),
  "reviewCount": number | null,
  "inStock": boolean,
  "variants": string[] | null (e.g. color or size options visible on the page)
}
Only include data that is clearly visible in the screenshot. Return null for fields you cannot confidently determine.`,
      },
      {
        role: "user",
        content: [
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${base64Image}` },
          },
          {
            type: "text",
            text: "Extract the product information from this product page screenshot.",
          },
        ],
      },
    ],
  });

  const raw = JSON.parse(response.choices[0].message.content!);

  // Step 3: Validate against schema
  return ProductSchema.parse(raw);
}
// Usage
const product = await extractProductFromScreenshot(
"https://example-shop.com/products/running-shoes"
);
console.log(product);
// {
// name: "Nike Air Zoom Pegasus 41",
// price: 119.99,
// currency: "USD",
// originalPrice: 149.99,
// discountPercent: 20,
// rating: 4.7,
// reviewCount: 2341,
// inStock: true,
// variants: ["7", "8", "9", "10", "11", "12"]
// }
Extracting Chart Data from a Dashboard
const DashboardMetricsSchema = z.object({
  metrics: z.array(
    z.object({
      label: z.string(),
      value: z.string(),
      changePercent: z.number().optional(),
      changeDirection: z.enum(["up", "down", "flat"]).optional(),
    })
  ),
  chartData: z
    .array(
      z.object({
        period: z.string(),
        value: z.number(),
      })
    )
    .optional(),
  reportDate: z.string().optional(),
});
async function extractDashboardMetrics(url: string) {
  const screenshotResult = await ks.screenshot(url);

  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content: `Extract dashboard metrics from the screenshot. Return JSON with:
- metrics: array of KPI tiles with label, value, changePercent, changeDirection
- chartData: if a time-series chart is visible, extract its data points as {period, value} pairs
- reportDate: the date/period the dashboard covers, if shown`,
      },
      {
        role: "user",
        content: [
          {
            type: "image_url",
            image_url: {
              url: `data:image/png;base64,${screenshotResult.data}`,
            },
          },
          { type: "text", text: "Extract all visible metrics from this dashboard." },
        ],
      },
    ],
  });

  return DashboardMetricsSchema.parse(JSON.parse(response.choices[0].message.content!));
}
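Once validated, the extracted chartData is ordinary structured data. As a small downstream sketch (in Python, with a hypothetical helper name), the {period, value} points can be serialized straight to CSV for spreadsheets or BI tools:

```python
import csv
import io

def chart_data_to_csv(chart_data: list[dict]) -> str:
    """Serialize extracted {period, value} points to CSV for downstream tools."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["period", "value"])
    writer.writeheader()
    writer.writerows(chart_data)
    return buf.getvalue()
```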
Implementation: Python
Install dependencies:
pip install knowledgesdk openai pydantic
import os
import json
from typing import Optional
from pydantic import BaseModel, ValidationError
from openai import OpenAI
import knowledgesdk
ks_client = knowledgesdk.Client(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
class ProductData(BaseModel):
    name: str
    price: float
    currency: str
    original_price: Optional[float] = None
    discount_percent: Optional[float] = None
    rating: Optional[float] = None
    review_count: Optional[int] = None
    in_stock: bool
    variants: Optional[list[str]] = None
def extract_product_from_screenshot(url: str) -> ProductData:
    # Step 1: Capture screenshot
    screenshot_result = ks_client.screenshot(url)
    base64_image = screenshot_result["data"]

    # Step 2: Vision LLM extraction
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": """Extract product information from the e-commerce page screenshot.
Return JSON with these fields:
- name: product name (string)
- price: current price as float
- currency: currency code (e.g. "USD")
- original_price: strikethrough price if visible, else null
- discount_percent: discount percentage if shown, else null
- rating: star rating as float if visible, else null
- review_count: number of reviews if visible, else null
- in_stock: true if the product appears available
- variants: list of visible size/color options, or null"""
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{base64_image}"}
                    },
                    {
                        "type": "text",
                        "text": "Extract the product data from this screenshot."
                    }
                ]
            }
        ]
    )
    raw = json.loads(response.choices[0].message.content)

    # Step 3: Validate with Pydantic
    return ProductData(**raw)
# Gemini Flash alternative (lower cost)
import google.generativeai as genai
import base64
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
def extract_with_gemini(url: str) -> ProductData:
    screenshot_result = ks_client.screenshot(url)
    image_bytes = base64.b64decode(screenshot_result["data"])

    model = genai.GenerativeModel("gemini-1.5-flash")
    prompt = """Extract the product information visible in this e-commerce page screenshot.
Return valid JSON only, with these fields:
name, price (float), currency, original_price (float or null),
discount_percent (float or null), rating (float or null),
review_count (int or null), in_stock (bool), variants (list of strings or null)"""

    response = model.generate_content(
        [
            {"mime_type": "image/png", "data": image_bytes},
            prompt
        ],
        generation_config={"response_mime_type": "application/json"}
    )
    raw = json.loads(response.text)
    return ProductData(**raw)
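Even with response_format or response_mime_type set to JSON, models occasionally wrap their output in markdown code fences, which makes json.loads throw. A defensive parser is a cheap first line of defense (a sketch; the helper name is ours):

```python
import json
import re

def parse_llm_json(text: str):
    """Parse JSON from an LLM response, tolerating markdown code fences."""
    cleaned = text.strip()
    # Strip a leading ```json (or bare ```) fence and a trailing ``` fence.
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)
    cleaned = re.sub(r"\s*```$", "", cleaned)
    return json.loads(cleaned)
```

Drop it in place of the bare json.loads calls if you see intermittent JSONDecodeError failures.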
# Pipeline for paywalled article preview
class ArticlePreview(BaseModel):
    title: str
    visible_text: str
    estimated_word_count: int
    topics: list[str]
    paywall_present: bool
    paywall_type: Optional[str] = None  # e.g. "soft", "hard", "metered"
def extract_article_preview(url: str) -> ArticlePreview:
    screenshot_result = ks_client.screenshot(url)
    base64_image = screenshot_result["data"]

    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": """Analyze this article page screenshot and return JSON with:
- title: the article title
- visible_text: the article text that is NOT obscured by a paywall overlay (first 200 words max)
- estimated_word_count: your estimate of the total article length based on the visible portion
- topics: list of 3-5 topic tags based on visible content
- paywall_present: true if any paywall or subscription prompt is visible
- paywall_type: "soft" (blur/fade), "hard" (full block), "metered" (X articles remaining), or null"""
            },
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}},
                    {"type": "text", "text": "Analyze this article page."}
                ]
            }
        ]
    )
    return ArticlePreview(**json.loads(response.choices[0].message.content))
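As a small downstream sketch (a hypothetical helper, not part of the extraction itself), the two word-count fields can be combined to estimate how much of the article the preview actually exposes:

```python
def visible_fraction(visible_text: str, estimated_word_count: int) -> float:
    """Fraction of the estimated article length exposed by the preview."""
    visible_words = len(visible_text.split())
    # Guard against a zero estimate and cap at 100% visible.
    return min(1.0, visible_words / max(1, estimated_word_count))
```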
Cost and Accuracy Benchmarks
We ran 50 extractions for each of three page types, comparing three approaches: KnowledgeSDK HTML extraction (our baseline for pages that render cleanly), the screenshot + GPT-4o-mini pipeline, and the screenshot + Gemini 1.5 Flash pipeline.
Page Type 1: Analytics Dashboard with Charts
| Metric | HTML Scraping | Screenshot + GPT-4o-mini | Screenshot + Gemini Flash |
|---|---|---|---|
| KPI value accuracy | 12% (canvas not readable) | 91% | 88% |
| Chart data points extracted | 0% | 74% | 69% |
| Cost per page | $0.002 | $0.018 | $0.007 |
| Avg latency | 1.2s | 4.8s | 3.1s |
HTML scraping is essentially useless for canvas-rendered dashboards. The screenshot pipeline reaches 91% KPI accuracy; accuracy on chart data points is lower because axis labels can be small or overlapping.
Page Type 2: E-Commerce Product Card with Dynamic Pricing
| Metric | HTML Scraping | Screenshot + GPT-4o-mini | Screenshot + Gemini Flash |
|---|---|---|---|
| Price accuracy | 78% | 96% | 94% |
| Variant extraction | 61% | 89% | 85% |
| Discount detection | 52% | 93% | 90% |
| Cost per page | $0.002 | $0.016 | $0.006 |
The HTML scraper struggles with prices that are rendered by JavaScript after hydration. The vision pipeline sees exactly what the user sees.
Page Type 3: Paywalled Article Preview
| Metric | HTML Scraping | Screenshot + GPT-4o-mini | Screenshot + Gemini Flash |
|---|---|---|---|
| Paywall detection | 34% | 97% | 95% |
| Visible text extraction | 67% | 92% | 89% |
| Topic classification | 71% | 88% | 85% |
| Cost per page | $0.002 | $0.014 | $0.005 |
The HTML scraper often retrieves the full article text from the DOM even when it appears paywalled visually — which may not be what you want for compliance reasons. The screenshot pipeline accurately captures only what is visually accessible.
Cost Optimization Strategies
The main cost driver is vision LLM token usage. Under OpenAI's high-detail tiling rules, a full-page 1920x1080 screenshot is billed as roughly 1,100 vision input tokens on GPT-4o-class models (images are split into 512-px tiles; gpt-4o-mini counts more tokens per tile at a lower per-token price), plus your prompt and output tokens.
Use Gemini Flash for high-volume pipelines. At roughly $0.006 per screenshot extraction versus $0.018 for GPT-4o-mini, Gemini Flash is 3x cheaper with only a modest accuracy difference for well-structured pages.
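Back-of-the-envelope, using the per-page figures from the benchmarks above (the function name is ours):

```python
def monthly_llm_cost(pages_per_month: int, cost_per_page: float) -> float:
    """Projected monthly model spend, excluding screenshot API fees."""
    return pages_per_month * cost_per_page

# At 50,000 extractions per month, using the benchmarked per-page costs:
gpt_cost = monthly_llm_cost(50_000, 0.018)     # GPT-4o-mini
gemini_cost = monthly_llm_cost(50_000, 0.006)  # Gemini 1.5 Flash
```

At that volume the benchmark figures work out to roughly $900 versus $300 of model spend per month.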
Crop to the relevant region. If you only need the price from a product page, you do not need to send the full 1080p screenshot. Use image cropping to isolate the area of interest before sending to the LLM. This can reduce vision tokens by 60-80%.
from PIL import Image
import io
import base64
def crop_and_encode(base64_image: str, left: int, top: int, right: int, bottom: int) -> str:
    image_bytes = base64.b64decode(base64_image)
    img = Image.open(io.BytesIO(image_bytes))
    cropped = img.crop((left, top, right, bottom))
    buffer = io.BytesIO()
    cropped.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()
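To see why cropping helps, you can estimate the high-detail vision token count with OpenAI's published tiling rule. This sketch assumes GPT-4o's constants (85 base tokens plus 170 per 512-px tile); gpt-4o-mini bills a higher per-tile token count at a lower per-token price, so the relative savings are the same:

```python
import math

def estimate_vision_tokens(width: int, height: int,
                           tile_tokens: int = 170, base_tokens: int = 85) -> int:
    """Approximate high-detail image tokens under OpenAI's tiling rule."""
    # Downscale so the longest side fits in 2048 px...
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # ...then so the shortest side fits in 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # One tile per 512x512 patch, plus a fixed base cost.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return tiles * tile_tokens + base_tokens
```

A full 1920x1080 screenshot comes out to 6 tiles (1105 tokens), while a 600x400 crop of the price area needs only 2 tiles (425 tokens), in line with the savings cited above.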
Cache screenshots aggressively. KnowledgeSDK's screenshot endpoint respects a maxAge parameter. For pages that change at most hourly, set maxAge: 3600 (seconds) so repeated extractions within the hour reuse the same cached screenshot instead of re-rendering the page.
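If you want belt-and-braces caching on your side as well, a small TTL cache in front of the screenshot call is easy to add. This is a sketch; the fetch function and clock are injected so it works with any client:

```python
import time

def make_cached_screenshot(fetch, max_age_seconds: float = 3600.0,
                           clock=time.monotonic):
    """Wrap a screenshot-fetching function with a per-URL TTL cache."""
    cache: dict[str, tuple[float, object]] = {}

    def cached_fetch(url: str):
        now = clock()
        entry = cache.get(url)
        if entry is not None and now - entry[0] < max_age_seconds:
            return entry[1]  # fresh enough: reuse the cached screenshot
        result = fetch(url)
        cache[url] = (now, result)
        return result

    return cached_fetch
```

Wrap the real client once, e.g. cached = make_cached_screenshot(lambda url: ks_client.screenshot(url)), and call cached(url) everywhere else.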
Error Handling and Retries
Vision LLMs occasionally hallucinate or return malformed JSON, especially for complex layouts. Implement retry logic with a fallback:
async function extractWithRetry(url: string, maxAttempts = 3): Promise<Product> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await extractProductFromScreenshot(url);
    } catch (error) {
      // Retry on schema validation failures and malformed JSON; rethrow
      // anything else (network/auth errors) immediately.
      const retryable = error instanceof z.ZodError || error instanceof SyntaxError;
      if (!retryable || attempt === maxAttempts) throw error;
      console.warn(`Attempt ${attempt} failed, retrying`, error);
      // Wait before retry (linear backoff)
      await new Promise((r) => setTimeout(r, 1000 * attempt));
    }
  }
  throw new Error("All attempts failed");
}
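The same retry pattern in Python, written so the backoff logic is testable without network calls (a sketch; the extractor and sleep function are injected):

```python
import time

def retry_extract(extract, max_attempts: int = 3,
                  base_delay: float = 1.0, sleep=time.sleep):
    """Retry an extraction callable on parse/validation errors with linear backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return extract()
        except ValueError:
            # json.JSONDecodeError and Pydantic's ValidationError both subclass
            # ValueError, so this retries malformed JSON and schema failures
            # while letting network errors propagate.
            if attempt == max_attempts:
                raise
            sleep(base_delay * attempt)  # linear backoff: 1s, 2s, ...
```

Usage: retry_extract(lambda: extract_product_from_screenshot(url)).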
When to Use Screenshot Extraction vs HTML Extraction
| Scenario | Recommended Approach |
|---|---|
| Static content site (blog, docs) | HTML extraction — faster, cheaper |
| SPA with API-driven data | HTML extraction with JS rendering |
| Canvas/WebGL charts | Screenshot + vision LLM |
| Complex dynamic pricing | Screenshot + vision LLM |
| Paywall preview analysis | Screenshot + vision LLM |
| High-volume commodity scraping | HTML extraction |
| Low-volume, high-accuracy needs | Screenshot + vision LLM |
Get Started
KnowledgeSDK's /v1/screenshot endpoint handles the browser infrastructure, anti-bot evasion, and reliable rendering across complex SPAs so you can focus on the extraction logic. Combined with a vision LLM, it unlocks structured data from pages that would otherwise be inaccessible to traditional scrapers.
Start with a free API key at knowledgesdk.com — the free tier includes 100 screenshots per month, enough to validate the pipeline against your specific page types before scaling up.