# Multimodal Web Scraping: When to Use Screenshots vs Markdown for LLMs
When you want to feed web content to a language model, you have two fundamentally different approaches: extract the text as clean markdown, or capture a screenshot and pass the image to a vision-capable model. Both work. Neither is universally superior. The right choice depends on the type of page, the task you are performing, and how much you care about cost and latency.
This article benchmarks both approaches across common page types, identifies the failure modes of each, and shows you how to implement an auto-detection system that chooses the right strategy per URL.
## The Two Approaches

### Approach 1: Markdown Extraction
Markdown extraction renders the page in a headless browser, strips the HTML, and returns the visible text content formatted as clean markdown. The output looks like this:
```markdown
# Product Title

**Price:** $189.00
**In Stock:** Yes
**Rating:** 4.4/5 (12,847 reviews)

## Description

The second-generation AirPods Pro deliver...

## Key Features

- Active Noise Cancellation
- Transparency mode
- Adaptive Audio
```
You pass this markdown directly to an LLM. The model reads it as text. Token count is predictable. Input cost is low.
KnowledgeSDK scrape endpoint:

Node.js:

```javascript
const result = await client.scrape({ url: 'https://example.com/product' });
console.log(result.markdown); // Clean, LLM-ready markdown
```

Python:

```python
result = client.scrape(url="https://example.com/product")
print(result.markdown)
```
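Because the markdown path's input is plain text, cost is easy to predict before you ever call the model. A common rule of thumb for English text is roughly four characters per token; the helper below is a back-of-envelope sketch using that heuristic (the $2.50 per million input tokens default is an assumption you should replace with your model's current price), not an exact tokenizer.

```python
def estimate_tokens(markdown: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for English text (~4 chars/token heuristic)."""
    return int(len(markdown) / chars_per_token)


def estimate_input_cost(markdown: str, usd_per_million_tokens: float = 2.5) -> float:
    """Approximate input cost in USD at an assumed per-million-token price."""
    return estimate_tokens(markdown) / 1_000_000 * usd_per_million_tokens
```

For a typical 4,000-character product page this estimates about 1,000 input tokens, which is why the markdown path stays in fractions of a cent per page.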
### Approach 2: Screenshot + Vision Model
Screenshot capture renders the page in a headless browser and captures a PNG. You pass that PNG to a vision-capable LLM (GPT-4o, Claude Sonnet, Gemini 1.5 Pro) alongside a question.
KnowledgeSDK screenshot endpoint:

Node.js:

```javascript
const result = await client.screenshot({ url: 'https://example.com/dashboard' });
// result.image is a base64-encoded PNG

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: [
      { type: 'image_url', image_url: { url: `data:image/png;base64,${result.image}` } },
      { type: 'text', text: 'What are the three most prominent metrics shown in this dashboard?' },
    ],
  }],
});
```

Python:

```python
from openai import OpenAI

openai_client = OpenAI()

result = client.screenshot(url="https://example.com/dashboard")
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{result.image}"},
            },
            {
                "type": "text",
                "text": "What are the three most prominent metrics shown in this dashboard?",
            },
        ],
    }],
)
print(response.choices[0].message.content)
```
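Screenshot inputs are billed very differently from text. Under OpenAI's tile-based accounting for high-detail images (as documented at the time of writing; the mechanics may change), an image is first scaled to fit within 2048x2048, then its shortest side is scaled to 768 px, and it is billed at 85 base tokens plus 170 tokens per 512x512 tile. A sketch of that calculation, useful for budgeting before you send a capture:

```python
import math


def vision_tokens(width: int, height: int) -> int:
    """Approximate high-detail image token cost for GPT-4o,
    following OpenAI's published tile-based accounting."""
    # Scale down to fit within a 2048x2048 square (never upscale)
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Scale so the shortest side is at most 768 px
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # 170 tokens per 512x512 tile, plus an 85-token base charge
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles
```

A 1024x1024 capture works out to 765 input tokens this way, which is where most of the screenshot approach's cost premium comes from.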
## Benchmark: Accuracy, Cost, and Latency
We tested both approaches across six common page types. For each, we measured:
- Accuracy: correctness of extracted information (100 samples, human verified)
- Cost per page: API cost including vision model tokens for screenshots
- Latency: end-to-end time from request to parsed result
- Failure rate: percentage of pages where the approach completely failed
### Test Setup
- Screenshot approach: KnowledgeSDK screenshot endpoint + GPT-4o vision
- Markdown approach: KnowledgeSDK scrape endpoint + GPT-4o text
- 100 unique URLs per page type, tested in March 2026
- "Failure" = missing required fields or incorrect values causing downstream errors
### Page Type 1: E-Commerce Product Pages
| Metric | Markdown | Screenshot |
|---|---|---|
| Accuracy | 96% | 91% |
| Cost per page | $0.004 | $0.019 |
| Latency | 1.2s | 2.8s |
| Failure rate | 3% | 7% |
Winner: Markdown. Product pages are text-heavy with well-structured content. Markdown extraction captures price, title, and description reliably. Screenshots occasionally miss dynamically loaded prices or render at the wrong scroll position.
### Page Type 2: Analytics Dashboards and Charts
| Metric | Markdown | Screenshot |
|---|---|---|
| Accuracy | 34% | 87% |
| Cost per page | $0.003 | $0.022 |
| Latency | 1.0s | 3.1s |
| Failure rate | 41% | 9% |
Winner: Screenshot. Charts, graphs, and dashboards do not translate to markdown. A bar chart in HTML becomes a series of empty div elements with no readable values. A vision model reading the screenshot correctly interprets bar heights, axis labels, and trend lines.
### Page Type 3: News Articles and Blog Posts
| Metric | Markdown | Screenshot |
|---|---|---|
| Accuracy | 98% | 88% |
| Cost per page | $0.003 | $0.021 |
| Latency | 1.1s | 2.7s |
| Failure rate | 1% | 8% |
Winner: Markdown. Articles are pure text. Markdown extraction is near-perfect. Screenshots introduce noise from ads, sidebars, and cookie banners that appear in the viewport and can confuse the vision model.
### Page Type 4: SaaS Product UI (Logged-In State Simulation)
| Metric | Markdown | Screenshot |
|---|---|---|
| Accuracy | 58% | 83% |
| Cost per page | $0.004 | $0.023 |
| Latency | 1.3s | 3.3s |
| Failure rate | 28% | 14% |
Winner: Screenshot. Complex UI layouts with tabs, modals, and data tables often produce garbled markdown where the visual hierarchy is lost. A screenshot preserves the spatial layout that communicates meaning. Note that for actual logged-in states you need pre-authenticated sessions — KnowledgeSDK does not handle session management.
### Page Type 5: Pages with Anti-Bot Protection
| Metric | Markdown | Screenshot |
|---|---|---|
| Accuracy | 22% | 31% |
| Cost per page | $0.004 | $0.022 |
| Latency | 4.2s | 5.1s |
| Failure rate | 67% | 58% |
Neither wins outright. Both approaches use the same underlying browser infrastructure, so anti-bot detection affects both. Screenshots have a slightly lower failure rate because CAPTCHA challenges — while not solvable — at least produce a visible image that the vision model can describe ("This page is showing a CAPTCHA challenge"), while markdown extraction returns empty content or an error page with no useful signal.
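One practical consequence: because the vision model describes the challenge instead of silently failing, you can scan its answer for interstitial language and route the page to a retry queue rather than treating the output as data. A minimal keyword check (the phrase list is illustrative, not exhaustive, and should be tuned against the challenge pages you actually encounter):

```python
# Phrases that commonly appear when a vision model describes an
# anti-bot interstitial instead of real page content (illustrative list)
CHALLENGE_PHRASES = [
    "captcha",
    "verify you are human",
    "are you a robot",
    "unusual traffic",
    "access denied",
]


def looks_like_challenge(answer: str) -> bool:
    """Flag vision-model answers that describe an anti-bot page."""
    lowered = answer.lower()
    return any(phrase in lowered for phrase in CHALLENGE_PHRASES)
```

A hit here means the extracted "answer" should be discarded and the URL retried later, not passed downstream.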
### Page Type 6: Documentation Pages
| Metric | Markdown | Screenshot |
|---|---|---|
| Accuracy | 99% | 82% |
| Cost per page | $0.003 | $0.020 |
| Latency | 1.0s | 2.6s |
| Failure rate | 0% | 11% |
Winner: Markdown. Documentation is the ideal use case for markdown extraction. Code blocks are preserved with syntax, headings create clean structure, and the output is directly usable as LLM context without additional parsing.
## Auto-Detection: Choosing the Right Approach per URL
The insight from the benchmark is that the right approach depends on what the page contains, not just what domain it is on. Here is an auto-detection system that inspects a page and picks the strategy:
Node.js:

```javascript
import KnowledgeSDK from '@knowledgesdk/node';
import OpenAI from 'openai';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
const openai = new OpenAI();

// Page characteristics that suggest screenshot is better
const SCREENSHOT_INDICATORS = [
  'dashboard', 'analytics', 'chart', 'graph', 'visualiz',
  'metrics', 'report', 'monitor', 'stats',
];

// URL path patterns that suggest screenshot is better
const SCREENSHOT_URL_PATTERNS = [
  /\/dashboard/i, /\/analytics/i, /\/reports?/i,
  /\/charts?/i, /\/metrics/i, /\/monitor/i,
];

function shouldUseScreenshot(url, pageTitle = '') {
  // Check URL path
  if (SCREENSHOT_URL_PATTERNS.some(p => p.test(url))) return true;
  // Check page title for dashboard/analytics signals
  const titleLower = pageTitle.toLowerCase();
  if (SCREENSHOT_INDICATORS.some(s => titleLower.includes(s))) return true;
  return false;
}

async function smartExtract(url, question) {
  // Start with a quick scrape to get the title and check for content
  const scrapeResult = await client.scrape({ url });
  const markdown = scrapeResult.markdown;

  // Check whether markdown extraction produced useful content
  const isMarkdownEmpty = markdown.trim().length < 200;
  const hasScreenshotSignal = shouldUseScreenshot(url, scrapeResult.title ?? '');

  if (!isMarkdownEmpty && !hasScreenshotSignal) {
    // Markdown path: fast and cheap
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [
        { role: 'system', content: 'Answer the question based on the webpage content provided.' },
        { role: 'user', content: `Webpage content:\n\n${markdown}\n\nQuestion: ${question}` },
      ],
    });
    return { answer: response.choices[0].message.content, method: 'markdown' };
  }

  // Fall back to screenshot
  const screenshotResult = await client.screenshot({ url });
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{
      role: 'user',
      content: [
        {
          type: 'image_url',
          image_url: { url: `data:image/png;base64,${screenshotResult.image}` },
        },
        { type: 'text', text: question },
      ],
    }],
  });
  return { answer: response.choices[0].message.content, method: 'screenshot' };
}

// Usage
const result = await smartExtract(
  'https://example.com/analytics/dashboard',
  'What is the total revenue shown for this month?'
);
console.log(`Method used: ${result.method}`);
console.log(result.answer);
```
Python:

```python
import os
import re

from knowledgesdk import KnowledgeSDK
from openai import OpenAI

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
openai_client = OpenAI()

SCREENSHOT_URL_PATTERNS = [
    re.compile(r"/dashboard", re.IGNORECASE),
    re.compile(r"/analytics", re.IGNORECASE),
    re.compile(r"/reports?", re.IGNORECASE),
    re.compile(r"/charts?", re.IGNORECASE),
    re.compile(r"/metrics", re.IGNORECASE),
]

SCREENSHOT_TITLE_SIGNALS = [
    "dashboard", "analytics", "chart", "graph", "metrics", "report",
]


def should_use_screenshot(url: str, page_title: str = "") -> bool:
    if any(p.search(url) for p in SCREENSHOT_URL_PATTERNS):
        return True
    title_lower = page_title.lower()
    return any(s in title_lower for s in SCREENSHOT_TITLE_SIGNALS)


def smart_extract(url: str, question: str) -> dict:
    # Start with a scrape
    scrape_result = client.scrape(url=url)
    markdown = scrape_result.markdown or ""

    is_markdown_empty = len(markdown.strip()) < 200
    has_screenshot_signal = should_use_screenshot(
        url, getattr(scrape_result, "title", "") or ""
    )

    if not is_markdown_empty and not has_screenshot_signal:
        # Markdown path
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": "Answer the question based on the webpage content provided.",
                },
                {
                    "role": "user",
                    "content": f"Webpage content:\n\n{markdown}\n\nQuestion: {question}",
                },
            ],
        )
        return {
            "answer": response.choices[0].message.content,
            "method": "markdown",
        }

    # Screenshot path
    screenshot_result = client.screenshot(url=url)
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{screenshot_result.image}"
                    },
                },
                {"type": "text", "text": question},
            ],
        }],
    )
    return {
        "answer": response.choices[0].message.content,
        "method": "screenshot",
    }


# Usage
result = smart_extract(
    "https://example.com/analytics/monthly-report",
    "What is the total number of active users shown?",
)
print(f"Method: {result['method']}")
print(result["answer"])
```
## Cost and Latency Optimization
For production systems processing thousands of URLs per day, the cost difference between approaches adds up quickly:
| Volume | Markdown only | Screenshot only | Smart routing (est. 70% markdown) |
|---|---|---|---|
| 1,000 pages/day | $4/day | $22/day | $8/day |
| 10,000 pages/day | $40/day | $220/day | $80/day |
| 100,000 pages/day | $400/day | $2,200/day | $800/day |
Estimates based on GPT-4o pricing as of March 2026 and typical page sizes.
Smart routing at 70% markdown (the realistic split for most diverse web corpora) delivers 64% cost savings over pure screenshot and only 2x the cost of pure markdown — while maintaining high accuracy across all page types.
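You can sanity-check these figures with a simple blend of the per-page costs measured in the benchmarks above. The helper below is a back-of-envelope estimate using the product-page and dashboard numbers ($0.004 and $0.022 per page); it ignores per-page routing overhead and rounding, so it will land close to, but not exactly on, the table's values.

```python
def daily_cost(
    pages_per_day: int,
    markdown_share: float,
    markdown_cost: float = 0.004,    # per-page cost from the benchmark tables
    screenshot_cost: float = 0.022,  # per-page cost from the benchmark tables
) -> float:
    """Blended daily API cost in USD for a given markdown/screenshot split."""
    screenshot_share = 1.0 - markdown_share
    per_page = markdown_share * markdown_cost + screenshot_share * screenshot_cost
    return pages_per_day * per_page
```

At 1,000 pages/day with a 70/30 split this comes out to about $9.40/day, in the same ballpark as the table's smart-routing estimate; the function makes it easy to plug in your own corpus split.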
## When Screenshots Are Non-Negotiable
Beyond dashboards and charts, there are specific scenarios where screenshots are the only viable approach:
**Canvas-rendered applications:** Some apps (Figma, Google Maps, certain financial charting tools) render entirely on an HTML5 canvas. There is no DOM text to extract. The only way to read them is vision.

**PDF embeds:** Inline PDF viewers present their content as a visual layer. Markdown extraction returns nothing. A screenshot captures the visible document content.

**Image-based text:** Some sites intentionally render prices or contact details as images to prevent scraping. A vision model can read these. Markdown extraction cannot.

**Complex data tables with merged cells:** HTML tables with complex rowspan/colspan structures often produce garbled markdown. A screenshot preserves the visual table structure that a vision model can interpret.
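All four of these cases share a signature: the scrape succeeds but returns almost no usable text. You can detect that cheaply before spending vision tokens on every URL. A minimal heuristic sketch, mirroring the 200-character check used in smart_extract above (the threshold and the punctuation-stripping are assumptions to tune per corpus):

```python
import re


def is_visual_only(markdown: str, min_chars: int = 200) -> bool:
    """True when a successful scrape yields too little text to be useful,
    suggesting canvas-, PDF-, or image-rendered content."""
    # Drop markdown punctuation so formatting noise doesn't count as content
    text = re.sub(r"[#*\[\]()!`>|_-]", "", markdown)
    return len(text.strip()) < min_chars
```

A True result here is a signal to skip the text LLM call entirely and go straight to the screenshot path.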
## Conclusion
There is no universally correct answer to "screenshot or markdown?" The right choice depends on what the page contains. For text-heavy pages (articles, documentation, product listings), markdown is faster, cheaper, and more accurate. For visual content (dashboards, charts, complex UIs), screenshots are the only reliable option.
The production-ready approach is smart routing: start with a scrape, detect whether the content is visually encoded, and fall back to screenshot only when needed. KnowledgeSDK gives you both operations, the scrape endpoint for markdown and the screenshot endpoint for PNG capture, with the same API key and a consistent response format, making smart routing straightforward to implement.
Start processing web content for your LLM. Sign up at knowledgesdk.com for 1,000 free requests per month — both scrape and screenshot endpoints included.