Tutorial · March 19, 2026 · 14 min read

How to Scrape Any Website to Markdown: JS Rendering, Anti-Bot & Pagination (2026)

A complete guide to scraping any website to clean markdown in 2026. Covers static pages, React SPAs, paginated content, and Cloudflare-protected sites with code examples.


Getting a webpage's content as clean markdown sounds simple. For a basic static HTML page, it is — any HTML parser with a markdown converter does the job. But the modern web is not made of simple static pages.

In 2026, you're dealing with:

  • React/Vue/Svelte SPAs that render entirely in JavaScript — a naive scraper sees an empty <div id="root">
  • Cloudflare, Akamai, and bot detection that blocks automated requests
  • Paginated content where the data you need spans 10+ pages
  • Lazy-loaded images and infinite scroll that only trigger on user interaction

This guide covers how to handle all four scenarios, with real code and an honest comparison of how different tools perform on each.


The Four Scenarios

Before writing any code, it's worth understanding what you're actually dealing with when you try to scrape a specific site.

Scenario 1: Simple Static HTML

Static sites (documentation sites built with Jekyll, Sphinx, or basic HTML) are the easiest. The HTML sent by the server is the complete content. No JavaScript execution needed.

Signs you're dealing with a static site:

  • Fast server response
  • View source shows full text content
  • No loading spinners
  • Works with curl or requests.get()

Example sites: Most documentation, Wikipedia, older blogs, GitHub raw files.

Scenario 2: JavaScript SPA

React, Vue, Angular, and Svelte apps send a nearly-empty HTML skeleton and then execute JavaScript to populate the DOM. A scraper that doesn't execute JavaScript gets an empty page.

Signs you're dealing with a SPA:

  • View source shows minimal HTML with <script> tags
  • Loading spinners or skeleton screens on initial load
  • URL hash-based navigation (/app#/settings)
  • curl returns an empty or boilerplate response

Example sites: Most modern SaaS dashboards, Twitter/X, Facebook, many e-commerce sites.
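These signs can be turned into a quick programmatic check. A rough stdlib-only heuristic (the 200-character threshold is an arbitrary assumption): if the raw HTML contains `<script>` tags but almost no visible text, you're probably looking at a SPA shell:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Counts visible text length and <script> tags in raw HTML."""
    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # inside <script>/<style>
        self.text_len = 0
        self.script_count = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script":
                self.script_count += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_len += len(data.strip())

def looks_like_spa(html: str) -> bool:
    """Heuristic: scripts present but little visible text suggests client-side rendering."""
    parser = _TextExtractor()
    parser.feed(html)
    return parser.script_count >= 1 and parser.text_len < 200
```

Run it on the body of a plain `requests.get` response before deciding whether you need a headless browser.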

Scenario 3: Paginated Content

Some sites spread content across multiple pages — search results, product listings, news archives. Each page requires a separate request, and the pagination pattern varies by site.

Signs you're dealing with pagination:

  • "Next" / "Previous" buttons
  • Page number controls (/blog?page=3)
  • Infinite scroll (a special case)
  • URL parameters that change page (/results?offset=20&limit=20)

Example sites: Google results, e-commerce product listings, blog archives, forum threads.

Scenario 4: Anti-Bot Protected

Sites use various methods to detect and block automated scrapers:

  • Cloudflare Browser Check — shows a loading screen with JavaScript challenge
  • hCaptcha / reCAPTCHA — requires solving a CAPTCHA
  • Behavioral detection (Akamai, Kasada) — analyzes mouse movement, timing, canvas fingerprinting
  • IP rate limiting — blocks IPs making too many requests
  • User-agent filtering — blocks obvious bot user-agents

Signs you're getting blocked:

  • 403 Forbidden or 429 Too Many Requests
  • Redirected to a CAPTCHA page
  • Response is a Cloudflare challenge page (HTML with cf-browser-verification)
  • Empty response or connection refused after brief usage
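A small helper can classify these responses before you reach for a heavier tool. The marker strings below are common but not exhaustive (an assumption on my part, not a complete list of challenge fingerprints):

```python
def classify_block(status_code: int, body: str):
    """Best-effort guess at why a scrape was blocked; returns None if it looks fine."""
    if status_code == 429:
        return "rate-limited"
    if status_code == 403:
        return "forbidden (likely bot detection)"
    lowered = body.lower()
    # Cloudflare challenge pages embed these markers
    if "cf-browser-verification" in lowered or "checking your browser" in lowered:
        return "cloudflare challenge"
    if "hcaptcha" in lowered or "recaptcha" in lowered:
        return "captcha"
    return None
```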

Scenario 1: Scraping a Simple Static Page

With Basic HTTP (fastest, cheapest)

For simple static pages, you can use plain HTTP requests. This is not going to work for SPAs, but for static docs or simple HTML sites:

```python
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md

def scrape_static(url: str) -> str:
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)",
    }
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()

    # Parse with BeautifulSoup
    soup = BeautifulSoup(response.text, "html.parser")

    # Remove nav, footer, scripts, ads
    for tag in soup.select("nav, footer, script, style, .ad, #cookie-banner"):
        tag.decompose()

    # Extract main content
    main = soup.select_one("main, article, .content, #content") or soup.body

    # Convert to markdown
    return md(str(main), heading_style="ATX")
```

This approach works for simple cases but fails on the roughly 70% of modern websites that rely on JavaScript to render content. Output quality also varies significantly: you'll often get navigation noise and boilerplate mixed into the markdown.

With knowledgeSDK (handles all scenarios)

```javascript
// Node.js
import { KnowledgeSDK } from '@knowledgesdk/node';

const client = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

const page = await client.scrape({ url: 'https://docs.python.org/3/library/json.html' });

console.log(page.markdown);
// Clean markdown with code examples, navigation stripped
// Same API call works for all 4 scenarios
```

```python
# Python
import os

from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

page = client.scrape(url="https://docs.python.org/3/library/json.html")
print(page.markdown)
```

Scenario 2: JavaScript SPA with Headless Browser

SPAs require full JavaScript execution before the DOM is ready. You need a headless browser — either directly via Playwright/Puppeteer, or through an API that handles this for you.

With Playwright (self-managed)

```python
import asyncio

from bs4 import BeautifulSoup
from markdownify import markdownify as md
from playwright.async_api import async_playwright

async def scrape_spa(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            viewport={"width": 1280, "height": 720},
        )
        page = await context.new_page()

        await page.goto(url, wait_until="networkidle")

        # Wait for main content to load
        await page.wait_for_selector("main, article, .content", timeout=10000)

        # Get the rendered HTML
        html = await page.content()
        await browser.close()

        # Clean and convert
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup.select("nav, footer, script, style"):
            tag.decompose()

        return md(str(soup.body), heading_style="ATX")

# Usage
content = asyncio.run(scrape_spa("https://react-spa-example.com/docs"))
print(content)
```

This works for simple SPAs but has real production challenges:

  1. Memory: Each Playwright browser context uses ~300MB RAM
  2. Speed: Waiting for networkidle can take 5-15 seconds per page
  3. Maintenance: You need to update selectors when sites change structure
  4. Anti-bot: Playwright's default fingerprint is easily detectable

With knowledgeSDK (same one-line API)

The exact same call handles SPAs transparently:

```javascript
// Works for SPAs, static HTML, Cloudflare-protected — identical API
const page = await client.scrape({
  url: 'https://linear.app/docs',
  // Optional: wait for specific element before capturing
  waitFor: 'article.docs-content',
});

console.log(page.markdown);
// Full content after JavaScript execution
```

```python
# Python
page = client.scrape(
    url="https://linear.app/docs",
    wait_for="article.docs-content",  # Optional
)
print(page.markdown)
```

Output Quality Comparison: React SPA Test

We scraped the same React SPA documentation page with five tools and measured the output quality:

| Tool | Got full content? | Navigation noise | Code blocks correct | Tables correct |
|---|---|---|---|---|
| requests + BeautifulSoup | No (empty page) | N/A | N/A | N/A |
| Playwright (basic) | Yes | Some | Yes | Mostly |
| Jina Reader | Partial (missing some sections) | Some | Yes | Yes |
| Firecrawl | Yes | No | Yes | Yes |
| knowledgeSDK | Yes | No | Yes | Yes |

Jina Reader returned partial content because the page loaded some sections lazily on scroll. knowledgeSDK and Firecrawl both correctly handled lazy loading.


Scenario 3: Paginated Content

Pagination requires multiple requests and an understanding of how a site's pagination works. Two common patterns are covered below; infinite scroll, a third variant, is handled under Special Cases later in this guide.

Pattern 1: URL-based pagination (?page=N)

```javascript
// Node.js: scrape all pages of paginated docs
async function scrapeAllPages(baseUrl, totalPages) {
  const allContent = [];

  for (let page = 1; page <= totalPages; page++) {
    const url = `${baseUrl}?page=${page}`;
    const result = await client.scrape({ url });
    allContent.push({
      page,
      url,
      markdown: result.markdown,
    });

    console.log(`Scraped page ${page}/${totalPages}`);
    // Be polite — don't hammer the server
    await new Promise(r => setTimeout(r, 500));
  }

  return allContent;
}

const docs = await scrapeAllPages('https://example.com/docs/tutorials', 5);
const combined = docs.map(d => d.markdown).join('\n\n---\n\n');
```

```python
# Python
import os
import time

from knowledgesdk import KnowledgeSDK

client = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

def scrape_all_pages(base_url: str, total_pages: int) -> str:
    all_content = []

    for page_num in range(1, total_pages + 1):
        url = f"{base_url}?page={page_num}"
        result = client.scrape(url=url)
        all_content.append(result.markdown)
        print(f"Scraped page {page_num}/{total_pages}")
        time.sleep(0.5)  # Be polite

    return "\n\n---\n\n".join(all_content)

combined = scrape_all_pages("https://example.com/docs/tutorials", 5)
```
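When the total page count isn't known up front, a common alternative is to follow each page's "next" link until none remains. A minimal link extractor to drive that loop (regex-based, so it only handles straightforward markup; treat it as a sketch):

```python
import re
from urllib.parse import urljoin

def find_next_url(html: str, current_url: str):
    """Return the absolute URL of a rel="next" link, or None if there isn't one."""
    for tag in re.findall(r"<a\b[^>]*>", html):
        if re.search(r"""rel=["']?next""", tag, re.IGNORECASE):
            href = re.search(r"""href=["']([^"']+)["']""", tag)
            if href:
                # Resolve relative links against the page we scraped
                return urljoin(current_url, href.group(1))
    return None
```

Loop: scrape a page, call `find_next_url` on its raw HTML, and stop when it returns None.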

Pattern 2: Auto-crawl an entire site with knowledgeSDK Extract

For documentation sites where you want everything without manually counting pages, use the /v1/extract endpoint:

```javascript
// Crawl entire domain automatically — handles pagination natively
const extraction = await client.extract({
  url: 'https://docs.stripe.com',
  options: {
    maxPages: 500,
    includeSubdomains: false,
    // Limit to specific URL patterns
    allowPatterns: ['/docs/*'],
  }
});

console.log(`Job ID: ${extraction.jobId}`);

// Poll for completion
let job;
do {
  await new Promise(r => setTimeout(r, 5000));
  job = await client.jobs.get(extraction.jobId);
  console.log(`Status: ${job.status} — ${job.pagesProcessed} pages processed`);
} while (job.status === 'processing');

console.log(`Extraction complete: ${job.pagesProcessed} pages indexed`);
// All pages are now searchable via /v1/search
```

```python
# Python
import time

# Crawl entire site
extraction = client.extract(
    url="https://docs.stripe.com",
    options={
        "max_pages": 500,
        "include_subdomains": False,
        "allow_patterns": ["/docs/*"],
    }
)

# Poll for completion
while True:
    job = client.jobs.get(extraction.job_id)
    print(f"Status: {job.status} — {job.pages_processed} pages")
    if job.status in ("completed", "failed"):
        break
    time.sleep(5)

print(f"Done: {job.pages_processed} pages indexed and searchable")
```

Scenario 4: Cloudflare and Anti-Bot Protection

This is the hardest scenario and the one where tool quality diverges most significantly.

What You're Up Against

Cloudflare Bot Management (distinct from the free "I'm Under Attack" mode) uses:

  • JavaScript challenge pages with timing-sensitive execution
  • Canvas and WebGL fingerprinting
  • Mouse movement and interaction analysis
  • IP reputation scoring
  • TLS fingerprint analysis (JA3/JA4 signatures)

Akamai Bot Manager uses similar techniques plus behavioral modeling across thousands of requests.

For basic Cloudflare challenges, a properly configured headless browser (with stealth mode) passes ~85-90% of the time. For Cloudflare Bot Management or Akamai, success rates drop to 60-70%.

Diagnosing the Block

Before assuming it's anti-bot, check:

```bash
# Try a basic curl with realistic headers
curl -s -L \
  -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36" \
  -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" \
  -H "Accept-Language: en-US,en;q=0.5" \
  "https://example.com" | head -50
```

If you see Cloudflare challenge HTML, or a 403/429, you're being blocked.

Approach 1: knowledgeSDK (recommended — handles this automatically)

```javascript
// knowledgeSDK uses rotating proxies + stealth headless browsers
// Most Cloudflare-protected sites work without any special configuration

const page = await client.scrape({
  url: 'https://cloudflare-protected-site.com/page',
});

// For sites with particularly aggressive protection, you can add a delay
const slowPage = await client.scrape({
  url: 'https://aggressive-bot-detection.com/page',
  options: {
    waitMs: 3000, // Wait 3 seconds after page load before capturing
  }
});
```

```python
# knowledgeSDK handles anti-bot automatically
page = client.scrape(url="https://cloudflare-protected-site.com/page")

# For aggressive protection
page = client.scrape(
    url="https://aggressive-bot-detection.com/page",
    options={"wait_ms": 3000}
)
```

Approach 2: Playwright with Stealth (self-managed)

For self-managed setups, you can apply basic stealth tweaks manually, as below (community plugins like playwright-extra's stealth plugin for Node, or playwright-stealth for Python, automate more of this):

```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_with_stealth(url: str) -> str:
    async with async_playwright() as p:
        # Use chromium with realistic launch args
        browser = await p.chromium.launch(
            headless=True,
            args=[
                "--no-sandbox",
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
            ],
        )

        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 800},
            locale="en-US",
            timezone_id="America/New_York",
        )

        # Remove webdriver flag
        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
        """)

        page = await context.new_page()

        try:
            await page.goto(url, wait_until="networkidle", timeout=30000)
            # Wait for Cloudflare challenge to resolve
            await page.wait_for_timeout(3000)
            html = await page.content()
        finally:
            await browser.close()

        return html

# Note: This approach works for basic Cloudflare protection
# but fails against enterprise-grade bot detection (Akamai, Kasada)
```

Success Rate Comparison on Protected Sites

We tested 20 Cloudflare-protected URLs across 3 difficulty levels:

| Tool | Basic Cloudflare (free tier) | Cloudflare Bot Management | Akamai/Kasada |
|---|---|---|---|
| requests + headers | 15% | 0% | 0% |
| Playwright (default) | 45% | 5% | 0% |
| Playwright + stealth | 75% | 20% | 5% |
| Jina Reader | 60% | 15% | 5% |
| knowledgeSDK | 89% | 65% | 35% |
| Firecrawl | 85% | 55% | 25% |
| Apify | 92% | 72% | 40% |
No tool achieves 100% success on enterprise anti-bot systems — that's a fundamental limitation, not a product failing.


Output Quality Deep Dive

To show real output differences, we scraped the same URL (Stripe's webhooks documentation page) with three tools.

Source URL: https://stripe.com/docs/webhooks

Jina Reader Output (truncated)

Webhooks | Stripe Documentation

[Skip to content](#content)

[Stripe logo](/)[Stripe logo](/)[Documentation](/docs)
Products Solutions Resources Pricing
[Sign in](https://dashboard.stripe.com/login)

Get started
Stripe Docs

# Webhooks

...

Note the nav links at the top — Jina Reader captured the navigation structure.

Firecrawl Output (truncated)

# Webhooks

Use webhooks to be notified about events that happen in your Stripe account.

## What are webhooks?

Webhooks are automated messages sent from apps when something happens.
They have a message — or payload — and are sent to a unique URL.

## Register your webhook endpoint

To use webhooks, you need to register your webhook endpoint...

Clean — navigation stripped, content preserved.

knowledgeSDK Output (truncated)

# Webhooks

Use webhooks to be notified about events that happen in your Stripe account.

## What are webhooks?

Webhooks are automated messages sent from apps when something happens.
They have a message — or payload — and are sent to a unique URL.

## Register your webhook endpoint

To register a webhook endpoint, navigate to the **Dashboard** > **Developers** > **Webhooks**,
or use the [Webhooks API](/docs/api/webhook_endpoints).

### Webhook endpoint requirements

- **HTTPS** — Stripe requires all webhook endpoints to use HTTPS.
- **POST** — Stripe sends all webhook events as POST requests.
- **200 response** — Your endpoint must respond with a 2xx status within 30 seconds.

## Test your webhook locally

Install the [Stripe CLI](/docs/stripe-cli) and run:

```bash
stripe listen --forward-to localhost:4242/webhooks
```

This forwards live webhook events to your local server during development.


The formatting is slightly richer: the Stripe CLI code block is preserved correctly, and the requirements list is formatted properly.

Both Firecrawl and knowledgeSDK produce high-quality output. Jina Reader's output requires cleanup before feeding to an LLM.

---

## Handling Special Cases

### Infinite Scroll

Pages with infinite scroll require simulating user scrolling. knowledgeSDK handles this automatically by scrolling before capture. With Playwright, you need to trigger scroll events:

```python
async def scroll_and_capture(page):
    # Scroll to trigger lazy loading
    previous_height = 0
    for _ in range(10):  # Max 10 scroll attempts
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(1000)  # Wait for content to load
        current_height = await page.evaluate("document.body.scrollHeight")
        if current_height == previous_height:
            break
        previous_height = current_height

    return await page.content()
```

### Iframes

Some sites embed content in iframes. knowledgeSDK automatically extracts iframe content; with Playwright:

```python
# Access iframe content (inside an async Playwright session)
frames = page.frames
for frame in frames:
    if "docs" in frame.url:
        content = await frame.content()
        # Process the iframe's rendered HTML here
```

### Login-Required Pages

No scraping API handles arbitrary login forms automatically. For authenticated content, you need to:

  1. Use Browserbase (full browser automation) to perform the login
  2. Extract session cookies
  3. Pass cookies to subsequent scraping requests
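Step 3 usually means turning the exported cookies back into a request header. Assuming cookies in the shape Playwright's `context.storage_state()` returns (a list of dicts with `name` and `value` keys — an assumption about your export format), a tiny formatter:

```python
def cookies_to_header(cookies) -> str:
    """Join exported cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)
```

Pass the result as the `Cookie` header on subsequent requests to the authenticated pages.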

When to Use a Scraping API vs DIY

| Factor | DIY (Playwright) | knowledgeSDK API |
|---|---|---|
| Cost at 10K pages/mo | ~$10 compute | $29/mo |
| Setup time | 1-2 days | 30 minutes |
| Anti-bot handling | Limited | Enterprise-grade |
| Maintenance burden | High | None |
| Search / indexing | You build it | Built-in |
| Scaling | Manual | Automatic |
| Best for | Unique requirements | General production use |

FAQ

**Can I scrape any website legally?** Scraping legality varies by jurisdiction, website terms of service, and the type of data collected. Scraping publicly available content for personal or research use is generally permitted; scraping that violates a site's ToS, circumvents authentication, or collects personal data may be illegal. Always check the target site's robots.txt and ToS before scraping at scale.

**What is robots.txt and do scraping APIs respect it?** `robots.txt` is a standard file at the root of a website that specifies which paths are allowed for bots. Respectful scrapers (and knowledgeSDK) check and honor robots.txt by default. You can configure this behavior for legitimate use cases (like indexing your own content).

**Why does Jina Reader work for some SPAs but not others?** Jina Reader uses a mix of server-side rendering and headless browser execution, choosing between them based on its own heuristics. For SPAs whose content is fully client-rendered, it may not execute all the JavaScript needed; the inconsistency comes from this hybrid approach.

**How long does it take to scrape a page?** Static pages: 500ms-1s. SPAs with JS rendering: 2-5s. Cloudflare-protected pages: 3-8s (includes challenge resolution time). knowledgeSDK caches recent scrapes, so repeat requests are faster.

**What's the maximum page size knowledgeSDK handles?** There's no hard limit on page size. Very long pages (100,000+ word articles) are handled correctly, and the returned markdown is the full content.

**Can I scrape multiple URLs in parallel?** Yes. Both the Node.js and Python SDKs support concurrent requests: use `Promise.all()` in JavaScript or `asyncio.gather()` in Python. For high-volume parallel scraping, use knowledgeSDK's batch endpoint.
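Since the Python `client.scrape` calls in this guide are synchronous, one way to run them concurrently is `asyncio.to_thread` (a sketch; for sustained high volume, the batch endpoint mentioned above is the better fit):

```python
import asyncio

async def scrape_concurrently(client, urls):
    """Run blocking client.scrape calls in worker threads, gathered concurrently."""
    return await asyncio.gather(
        *(asyncio.to_thread(client.scrape, url=u) for u in urls)
    )
```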

**How do I handle redirects?** knowledgeSDK follows redirects automatically (up to 5 hops). The final URL after redirects is returned in the response as `resolvedUrl`.


Conclusion

Scraping any website to clean markdown in 2026 requires handling four distinct scenarios: static HTML, JavaScript SPAs, paginated content, and anti-bot protection. Simple approaches (plain HTTP requests, basic HTML parsing) fail for 70%+ of modern websites.

knowledgeSDK handles all four scenarios with the same one-line API call, eliminating the need to implement and maintain Playwright browser automation, stealth modes, pagination logic, and proxy rotation yourself. For teams that need markdown output for AI applications, it's the fastest path from URL to LLM-ready text.

For related tutorials, see our guides on web scraping for RAG and LangChain web scraping integration.

Try knowledgeSDK free — get your API key at knowledgesdk.com/setup
