# How to Scrape Any Website to Markdown: JS Rendering, Anti-Bot & Pagination (2026)
Getting a webpage's content as clean markdown sounds simple. For a basic static HTML page, it is — any HTML parser with a markdown converter does the job. But the modern web is not made of simple static pages.
In 2026, you're dealing with:
- React/Vue/Svelte SPAs that render entirely in JavaScript — a naive scraper sees an empty `<div id="root">`
- Cloudflare, Akamai, and bot detection that blocks automated requests
- Paginated content where the data you need spans 10+ pages
- Lazy-loaded images and infinite scroll that only trigger on user interaction
This guide covers how to handle all four scenarios, with real code and an honest comparison of how different tools perform on each.
## The Four Scenarios
Before writing any code, it's worth understanding what you're actually dealing with when you try to scrape a specific site.
### Scenario 1: Simple Static HTML
Static sites (documentation sites built with Jekyll, Sphinx, or basic HTML) are the easiest. The HTML sent by the server is the complete content. No JavaScript execution needed.
Signs you're dealing with a static site:
- Fast server response
- View source shows full text content
- No loading spinners
- Works with `curl` or `requests.get()`
Example sites: Most documentation, Wikipedia, older blogs, GitHub raw files.
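A quick way to test this before picking a tool: fetch the raw HTML without a browser and check whether meaningful text is already present. A rough stdlib-only heuristic sketch (the 500-character threshold is an arbitrary assumption):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>/<style>/<noscript> contents."""
    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style", "noscript"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style", "noscript") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def looks_static(html: str, min_text_chars: int = 500) -> bool:
    """Rough heuristic: does the server-rendered HTML already contain real text?"""
    parser = _TextExtractor()
    parser.feed(html)
    return len(" ".join(parser.chunks)) >= min_text_chars
```

A static docs page passes; a bare SPA shell (a `root` div plus `<script>` tags) fails, telling you up front that you'll need JavaScript rendering.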
### Scenario 2: JavaScript SPA
React, Vue, Angular, and Svelte apps send a nearly-empty HTML skeleton and then execute JavaScript to populate the DOM. A scraper that doesn't execute JavaScript gets an empty page.
Signs you're dealing with a SPA:
- View source shows minimal HTML with `<script>` tags
- Loading spinners or skeleton screens on initial load
- URL hash-based navigation (`/app#/settings`)
- `curl` returns an empty or boilerplate response
Example sites: Most modern SaaS dashboards, Twitter/X, Facebook, many e-commerce sites.
### Scenario 3: Paginated Content
Some sites spread content across multiple pages — search results, product listings, news archives. Each page requires a separate request, and the pagination pattern varies by site.
Signs you're dealing with pagination:
- "Next" / "Previous" buttons
- Page number controls (`/blog?page=3`)
- Infinite scroll (a special case)
- URL parameters that change page (`/results?offset=20&limit=20`)
Example sites: Google results, e-commerce product listings, blog archives, forum threads.
### Scenario 4: Anti-Bot Protected
Sites use various methods to detect and block automated scrapers:
- Cloudflare Browser Check — shows a loading screen with JavaScript challenge
- hCaptcha / reCAPTCHA — requires solving a CAPTCHA
- Behavioral detection (Akamai, Kasada) — analyzes mouse movement, timing, canvas fingerprinting
- IP rate limiting — blocks IPs making too many requests
- User-agent filtering — blocks obvious bot user-agents
Signs you're getting blocked:
- 403 Forbidden or 429 Too Many Requests
- Redirected to a CAPTCHA page
- Response is a Cloudflare challenge page (HTML with `cf-browser-verification`)
- Empty response or connection refused after brief usage
## Scenario 1: Scraping a Simple Static Page

### With Basic HTTP (fastest, cheapest)
For simple static pages, you can use plain HTTP requests. This is not going to work for SPAs, but for static docs or simple HTML sites:
```python
import requests
from markdownify import markdownify as md
from bs4 import BeautifulSoup

def scrape_static(url: str) -> str:
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)",
    }
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()

    # Parse with BeautifulSoup
    soup = BeautifulSoup(response.text, "html.parser")

    # Remove nav, footer, scripts, ads
    for tag in soup.select("nav, footer, script, style, .ad, #cookie-banner"):
        tag.decompose()

    # Extract main content
    main = soup.select_one("main, article, .content, #content") or soup.body

    # Convert to markdown
    return md(str(main), heading_style="ATX")
```
This approach works for simple cases but fails on the ~70% of modern websites that render content with JavaScript. Output quality also varies significantly — you'll often get navigation noise and boilerplate.
### With knowledgeSDK (handles all scenarios)

```javascript
// Node.js
import { KnowledgeSDK } from '@knowledgesdk/node';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

const page = await client.scrape({ url: 'https://docs.python.org/3/library/json.html' });
console.log(page.markdown);
// Clean markdown with code examples, navigation stripped
// Same API call works for all 4 scenarios
```
```python
# Python
import os

from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
page = client.scrape(url="https://docs.python.org/3/library/json.html")
print(page.markdown)
```
## Scenario 2: JavaScript SPA with Headless Browser

SPAs require full JavaScript execution before the DOM is ready. You need a headless browser — either directly via Playwright/Puppeteer, or through an API that handles this for you.

### With Playwright (self-managed)
```python
import asyncio

from bs4 import BeautifulSoup
from markdownify import markdownify as md
from playwright.async_api import async_playwright

async def scrape_spa(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            viewport={"width": 1280, "height": 720},
        )
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle")

        # Wait for main content to load
        await page.wait_for_selector("main, article, .content", timeout=10000)

        # Get the rendered HTML
        html = await page.content()
        await browser.close()

    # Clean and convert
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.select("nav, footer, script, style"):
        tag.decompose()
    return md(str(soup.body), heading_style="ATX")

# Usage
content = asyncio.run(scrape_spa("https://react-spa-example.com/docs"))
print(content)
```
This works for simple SPAs but has real production challenges:
- Memory: Each Playwright browser context uses ~300MB RAM
- Speed: Waiting for `networkidle` can take 5-15 seconds per page
- Maintenance: You need to update selectors when sites change structure
- Anti-bot: Playwright's default fingerprint is easily detectable
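If you do self-manage Playwright, the memory issue is usually tamed by capping concurrency rather than spawning a context per URL. A minimal sketch of the pattern — the per-URL scrape coroutine (e.g. a `scrape_spa`-style function) is passed in:

```python
import asyncio

async def scrape_many(urls, scrape_one, max_concurrent=4):
    """Cap concurrent page loads with a semaphore so N browser contexts
    (~300MB each) don't exhaust RAM; 4 in flight stays near ~1.2GB."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with sem:  # only max_concurrent scrapes run at once
            return await scrape_one(url)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(u) for u in urls))
```

Usage would be `asyncio.run(scrape_many(urls, scrape_spa))`; tune `max_concurrent` to your machine's RAM.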
### With knowledgeSDK (same one-line API)

The exact same call handles SPAs transparently:

```javascript
// Works for SPAs, static HTML, Cloudflare-protected — identical API
const page = await client.scrape({
  url: 'https://linear.app/docs',
  // Optional: wait for specific element before capturing
  waitFor: 'article.docs-content',
});
console.log(page.markdown);
// Full content after JavaScript execution
```
```python
page = client.scrape(
    url="https://linear.app/docs",
    wait_for="article.docs-content",  # Optional
)
print(page.markdown)
```
### Output Quality Comparison: React SPA Test
We scraped the same React SPA documentation page with three tools and measured the output quality:
| Tool | Got full content? | Navigation noise | Code blocks correct | Tables correct |
|---|---|---|---|---|
| `requests` + BeautifulSoup | No (empty page) | N/A | N/A | N/A |
| Playwright (basic) | Yes | Some | Yes | Mostly |
| Jina Reader | Partial (missing some sections) | Some | Yes | Yes |
| Firecrawl | Yes | No | Yes | Yes |
| knowledgeSDK | Yes | No | Yes | Yes |
Jina Reader returned partial content because the page loaded some sections lazily on scroll. knowledgeSDK and Firecrawl both correctly handled lazy loading.
## Scenario 3: Paginated Content
Pagination requires multiple requests and understanding how a site's pagination works. There are three common patterns:
### Pattern 1: URL-based pagination (`?page=N`)

```javascript
// Node.js: scrape all pages of paginated docs
async function scrapeAllPages(baseUrl, totalPages) {
  const allContent = [];
  for (let page = 1; page <= totalPages; page++) {
    const url = `${baseUrl}?page=${page}`;
    const result = await client.scrape({ url });
    allContent.push({
      page,
      url,
      markdown: result.markdown,
    });
    console.log(`Scraped page ${page}/${totalPages}`);
    // Be polite — don't hammer the server
    await new Promise(r => setTimeout(r, 500));
  }
  return allContent;
}

const docs = await scrapeAllPages('https://example.com/docs/tutorials', 5);
const combined = docs.map(d => d.markdown).join('\n\n---\n\n');
```
```python
import os
import time

from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

def scrape_all_pages(base_url: str, total_pages: int) -> str:
    all_content = []
    for page_num in range(1, total_pages + 1):
        url = f"{base_url}?page={page_num}"
        result = client.scrape(url=url)
        all_content.append(result.markdown)
        print(f"Scraped page {page_num}/{total_pages}")
        time.sleep(0.5)  # Be polite
    return "\n\n---\n\n".join(all_content)

combined = scrape_all_pages("https://example.com/docs/tutorials", 5)
```
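Another pattern worth handling: sites that expose a `rel="next"` link instead of a predictable page count. A stdlib-only sketch that walks the chain until it ends — the `fetch` callable is whatever gets you HTML (`requests.get(url).text`, or a scraping API call):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class _NextLink(HTMLParser):
    """Finds the href of the first <a rel="next"> or <link rel="next"> tag."""
    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag in ("a", "link") and d.get("rel") == "next" and self.href is None:
            self.href = d.get("href")

def find_next_url(html: str, base_url: str):
    parser = _NextLink()
    parser.feed(html)
    # Resolve relative hrefs like "?page=2" against the current URL
    return urljoin(base_url, parser.href) if parser.href else None

def follow_pagination(fetch, start_url: str, max_pages: int = 50):
    """fetch(url) -> html string; walks rel="next" links until none remain."""
    pages, url, seen = [], start_url, set()
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)  # guard against pagination loops
        html = fetch(url)
        pages.append(html)
        url = find_next_url(html, url)
    return pages
```

The `seen` set and `max_pages` cap are there because broken sites sometimes link page N back to page 1.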
### Pattern 2: Auto-crawl an entire site with knowledgeSDK Extract

For documentation sites where you want everything without manually counting pages, use the `/v1/extract` endpoint:

```javascript
// Crawl entire domain automatically — handles pagination natively
const extraction = await client.extract({
  url: 'https://docs.stripe.com',
  options: {
    maxPages: 500,
    includeSubdomains: false,
    // Limit to specific URL patterns
    allowPatterns: ['/docs/*'],
  }
});

console.log(`Job ID: ${extraction.jobId}`);

// Poll for completion
let job;
do {
  await new Promise(r => setTimeout(r, 5000));
  job = await client.jobs.get(extraction.jobId);
  console.log(`Status: ${job.status} — ${job.pagesProcessed} pages processed`);
} while (job.status === 'processing');

console.log(`Extraction complete: ${job.pagesProcessed} pages indexed`);
// All pages are now searchable via /v1/search
```
```python
import time

# Crawl entire site
extraction = client.extract(
    url="https://docs.stripe.com",
    options={
        "max_pages": 500,
        "include_subdomains": False,
        "allow_patterns": ["/docs/*"],
    }
)

# Poll for completion
while True:
    job = client.jobs.get(extraction.job_id)
    print(f"Status: {job.status} — {job.pages_processed} pages")
    if job.status in ("completed", "failed"):
        break
    time.sleep(5)

print(f"Done: {job.pages_processed} pages indexed and searchable")
```
## Scenario 4: Cloudflare and Anti-Bot Protection
This is the hardest scenario and the one where tool quality diverges most significantly.
### What You're Up Against
Cloudflare Bot Management (distinct from the free "I'm Under Attack" mode) uses:
- JavaScript challenge pages with timing-sensitive execution
- Canvas and WebGL fingerprinting
- Mouse movement and interaction analysis
- IP reputation scoring
- TLS fingerprint analysis (JA3/JA4 signatures)
Akamai Bot Manager uses similar techniques plus behavioral modeling across thousands of requests.
For basic Cloudflare challenges, a properly configured headless browser (with stealth mode) passes ~85-90% of the time. For Cloudflare Bot Management or Akamai, success rates drop to 60-70%.
### Diagnosing the Block

Before assuming it's anti-bot, check:

```bash
# Try a basic curl with realistic headers
curl -s -L \
  -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36" \
  -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" \
  -H "Accept-Language: en-US,en;q=0.5" \
  "https://example.com" | head -50
```
If you see Cloudflare challenge HTML, or a 403/429, you're being blocked.
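This triage can be scripted. A small classifier sketch over the status code and body — the marker strings are common tells, not an exhaustive list:

```python
BLOCK_MARKERS = (
    "cf-browser-verification",  # Cloudflare JS challenge page
    "challenge-platform",       # Cloudflare challenge assets
    "attention required!",      # Cloudflare block page title
    "captcha",
)

def classify_response(status_code: int, body: str) -> str:
    """Rough triage of a scrape attempt: 'rate_limited', 'blocked', or 'ok'."""
    if status_code == 429:
        return "rate_limited"
    body_lower = body.lower()
    if status_code in (403, 503) or any(m in body_lower for m in BLOCK_MARKERS):
        return "blocked"
    return "ok"
```

Feed it `response.status_code` and `response.text` from whatever client you use; anything other than `"ok"` means escalate to a stealth browser or an API.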
### Approach 1: knowledgeSDK (recommended — handles this automatically)

```javascript
// knowledgeSDK uses rotating proxies + stealth headless browsers
// Most Cloudflare-protected sites work without any special configuration
const page = await client.scrape({
  url: 'https://cloudflare-protected-site.com/page',
});

// For sites with particularly aggressive protection, you can add a delay
const slowPage = await client.scrape({
  url: 'https://aggressive-bot-detection.com/page',
  options: {
    waitMs: 3000, // Wait 3 seconds after page load before capturing
  }
});
```
```python
# knowledgeSDK handles anti-bot automatically
page = client.scrape(url="https://cloudflare-protected-site.com/page")

# For aggressive protection
page = client.scrape(
    url="https://aggressive-bot-detection.com/page",
    options={"wait_ms": 3000},
)
```
### Approach 2: Playwright with Stealth (self-managed)

For self-managed setups, harden plain Playwright with stealth-style tweaks (in Node.js, the playwright-extra stealth plugin packages these up; in Python you apply them manually):

```python
import asyncio

from playwright.async_api import async_playwright

async def scrape_with_stealth(url: str) -> str:
    async with async_playwright() as p:
        # Use chromium with realistic launch args
        browser = await p.chromium.launch(
            headless=True,
            args=[
                "--no-sandbox",
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
            ],
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 800},
            locale="en-US",
            timezone_id="America/New_York",
        )

        # Remove webdriver flag
        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
        """)

        page = await context.new_page()
        try:
            await page.goto(url, wait_until="networkidle", timeout=30000)
            # Wait for Cloudflare challenge to resolve
            await page.wait_for_timeout(3000)
            html = await page.content()
        finally:
            await browser.close()
        return html

# Note: This approach works for basic Cloudflare protection
# but fails against enterprise-grade bot detection (Akamai, Kasada)
```
### Success Rate Comparison on Protected Sites

We tested 20 Cloudflare-protected URLs across 3 difficulty levels:

| Tool | Basic Cloudflare (free tier) | Cloudflare Bot Management | Akamai/Kasada |
|---|---|---|---|
| `requests` + headers | 15% | 0% | 0% |
| Playwright (default) | 45% | 5% | 0% |
| Playwright + stealth | 75% | 20% | 5% |
| Jina Reader | 60% | 15% | 5% |
| knowledgeSDK | 89% | 65% | 35% |
| Firecrawl | 85% | 55% | 25% |
| Apify | 92% | 72% | 40% |
No tool achieves 100% success on enterprise anti-bot systems — that's a fundamental limitation, not a product failing.
## Output Quality Deep Dive

To show real output differences, we scraped the same URL (Stripe's webhooks documentation page) with three tools.

Source URL: https://stripe.com/docs/webhooks

### Jina Reader Output (truncated)

```markdown
Webhooks | Stripe Documentation

[Skip to content](#content)
[Stripe logo](/)[Stripe logo](/)[Documentation](/docs)
Products Solutions Resources Pricing
[Sign in](https://dashboard.stripe.com/login)
Get started
Stripe Docs

# Webhooks
...
```
Note the nav links at the top — Jina Reader captured the navigation structure.
### Firecrawl Output (truncated)

```markdown
# Webhooks

Use webhooks to be notified about events that happen in your Stripe account.

## What are webhooks?

Webhooks are automated messages sent from apps when something happens.
They have a message — or payload — and are sent to a unique URL.

## Register your webhook endpoint

To use webhooks, you need to register your webhook endpoint...
```
Clean — navigation stripped, content preserved.
### knowledgeSDK Output (truncated)

````markdown
# Webhooks

Use webhooks to be notified about events that happen in your Stripe account.

## What are webhooks?

Webhooks are automated messages sent from apps when something happens.
They have a message — or payload — and are sent to a unique URL.

## Register your webhook endpoint

To register a webhook endpoint, navigate to the **Dashboard** > **Developers** > **Webhooks**,
or use the [Webhooks API](/docs/api/webhook_endpoints).

### Webhook endpoint requirements

- **HTTPS** — Stripe requires all webhook endpoints to use HTTPS.
- **POST** — Stripe sends all webhook events as POST requests.
- **200 response** — Your endpoint must respond with a 2xx status within 30 seconds.

## Test your webhook locally

Install the [Stripe CLI](/docs/stripe-cli) and run:

```bash
stripe listen --forward-to localhost:4242/webhooks
```

This forwards live webhook events to your local server during development.
````
The formatting is slightly richer — the Stripe CLI code block is preserved correctly, and the requirements list is formatted properly.
Both Firecrawl and knowledgeSDK produce high-quality output. Jina Reader's output requires cleanup before feeding to an LLM.
---
## Handling Special Cases
### Infinite Scroll
Pages with infinite scroll require simulating user scrolling. knowledgeSDK handles this automatically by scrolling before capture. With Playwright, you need to trigger scroll events:
```python
async def scroll_and_capture(page):
    # Scroll to trigger lazy loading
    previous_height = 0
    for _ in range(10):  # Max 10 scroll attempts
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(1000)  # Wait for content to load
        current_height = await page.evaluate("document.body.scrollHeight")
        if current_height == previous_height:
            break  # No new content appeared — stop scrolling
        previous_height = current_height
    return await page.content()
```
### Iframes

Some sites embed content in iframes. knowledgeSDK automatically extracts iframe content; with Playwright:

```python
# Access iframe content (inside an async function with an open page)
for frame in page.frames:
    if "docs" in frame.url:
        content = await frame.content()
        # Process iframe content
```
### Login-Required Pages
No scraping API handles arbitrary login forms automatically. For authenticated content, you need to:
- Use Browserbase (full browser automation) to perform the login
- Extract session cookies
- Pass cookies to subsequent scraping requests
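For the cookie hand-off, Playwright's built-in `storage_state` snapshot (saved after login via `context.storage_state(path="auth.json")`) can be converted into a plain `Cookie` header for any HTTP client — a sketch:

```python
import json

def cookie_header_from_storage_state(state) -> str:
    """Turn a Playwright storage_state snapshot (dict or path to the JSON
    file) into a Cookie header value usable with any HTTP client."""
    if not isinstance(state, dict):
        with open(state) as f:
            state = json.load(f)
    # storage_state stores cookies as a list of {"name", "value", ...} dicts
    return "; ".join(f"{c['name']}={c['value']}" for c in state.get("cookies", []))
```

You would then pass the result as `headers={"Cookie": ...}` on subsequent requests; note that session cookies expire, so the login step must be repeated periodically.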
## When to Use a Scraping API vs DIY
| Factor | DIY (Playwright) | knowledgeSDK API |
|---|---|---|
| Cost at 10K pages/mo | ~$10 compute | $29/mo |
| Setup time | 1-2 days | 30 minutes |
| Anti-bot handling | Limited | Enterprise-grade |
| Maintenance burden | High | None |
| Search / indexing | You build it | Built-in |
| Scaling | Manual | Automatic |
| Best for | Unique requirements | General production use |
## FAQ

### Can I scrape any website legally?
Scraping legality varies by jurisdiction, website terms of service, and the type of data collected. Generally, publicly available content for personal or research use is widely permitted. Scraping in ways that violate a site's ToS, circumvents authentication, or collects personal data may be illegal. Always check the target site's robots.txt and ToS before scraping at scale.
### What is robots.txt and do scraping APIs respect it?
robots.txt is a standard file at the root of websites that specifies which paths are allowed for bots. Respectful scrapers (and knowledgeSDK) check and honor robots.txt by default. You can configure this behavior for legitimate use cases (like indexing your own content).
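For DIY scrapers, the Python stdlib can perform the same check — a sketch using `urllib.robotparser` (here taking the robots.txt body as a string; in practice, fetch it from the site's `/robots.txt` first):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate a robots.txt body against a URL for a given user agent."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

Run this once per host before crawling, and cache the parsed result rather than re-fetching robots.txt for every URL.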
### Why does Jina Reader work sometimes but not others for SPAs?

Jina Reader uses a mix of server-side rendering and headless browser execution, choosing based on its own heuristics. For SPAs where the content is fully client-rendered, it may not execute the full JavaScript needed. The inconsistency comes from this hybrid approach.
### How long does it take to scrape a page?

Static pages: 500ms-1s. SPAs with JS rendering: 2-5s. Cloudflare-protected pages: 3-8s (includes challenge resolution time). knowledgeSDK caches recent scrapes, so repeat requests are faster.
### What's the maximum page size knowledgeSDK handles?

There's no hard limit on page size. Very long pages (100,000+ word articles) are handled correctly. The returned markdown is the full content.
### Can I scrape multiple URLs in parallel?

Yes. Both Node.js and Python SDKs support concurrent requests. Use `Promise.all()` in JavaScript or `asyncio.gather()` in Python. For high-volume parallel scraping, use knowledgeSDK's batch endpoint.
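Since the Python SDK calls shown in this guide are synchronous, a minimal parallelization sketch wraps them in worker threads with `asyncio.to_thread` and caps concurrency (assumes a `client` exposing the `scrape` method used above):

```python
import asyncio

async def scrape_parallel(client, urls, max_concurrent=5):
    """Run blocking client.scrape calls in threads, capped by a semaphore."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(url):
        async with sem:
            # to_thread keeps the event loop free while the sync call blocks
            return await asyncio.to_thread(client.scrape, url=url)

    # Results come back in the same order as the input URLs
    return await asyncio.gather(*(one(u) for u in urls))
```

Keep `max_concurrent` modest — parallelism on your side doesn't exempt you from the target site's rate limits.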
### How do I handle redirects?

knowledgeSDK follows redirects automatically (up to 5 hops). The final URL after redirects is returned in the response as `resolvedUrl`.
## Conclusion
Scraping any website to clean markdown in 2026 requires handling four distinct scenarios: static HTML, JavaScript SPAs, paginated content, and anti-bot protection. Simple approaches (plain HTTP requests, basic HTML parsing) fail for 70%+ of modern websites.
knowledgeSDK handles all four scenarios with the same one-line API call, eliminating the need to implement and maintain Playwright browser automation, stealth modes, pagination logic, and proxy rotation yourself. For teams that need markdown output for AI applications, it's the fastest path from URL to LLM-ready text.
For related tutorials, see our guides on web scraping for RAG and LangChain web scraping integration.
Try knowledgeSDK free — get your API key at knowledgesdk.com/setup