Technical · March 19, 2026 · 13 min read

How to Scrape JavaScript-Rendered Pages in 2026 (SPA, React, Vue)

Why JS-rendered scraping is hard in 2026, how headless browsers work under the hood, and when to use a managed API vs rolling your own Playwright setup.


In 2018, roughly a third of web pages required JavaScript execution to render meaningful content. In 2026, that number is over 70%. React, Next.js, Vue, Nuxt, Angular, Remix, Astro — the modern web is built on frameworks that render content client-side, often after multiple API calls, lazy loading events, and framework hydration cycles.

For developers building web scrapers, this shift is the single biggest source of complexity and failure. This article explains exactly what happens when you try to scrape a JavaScript-rendered page without a headless browser, how headless browsers solve the problem, and when it makes sense to manage one yourself versus using a managed API.

What "JavaScript Rendering" Actually Means

When a browser loads a page, it does multiple things in sequence:

  1. HTTP request — Fetches the HTML document
  2. HTML parsing — Builds the initial DOM tree
  3. CSS loading — Applies stylesheets
  4. JavaScript execution — Runs scripts that may modify the DOM
  5. API calls — JavaScript may fetch additional data
  6. Framework hydration — React/Vue/Angular attach to the DOM and potentially replace it
  7. Lazy loading triggers — Images, components, and data load as the user scrolls or interacts
  8. Dynamic content — Infinite scroll, tab panels, accordion sections load on interaction

A curl request or a basic HTTP client only gets step 1 — the raw HTML. For a server-side rendered page (old-school PHP, Rails, Django), that's enough. For a React SPA, that HTML might be:

<div id="root"></div>
<script src="/static/js/main.abc123.js"></script>

That's it. The actual content doesn't exist until steps 4-8 complete. Your scraper gets an empty shell.

The Five Failure Modes of JS-Rendered Scraping

Understanding why scrapers fail on JS-rendered pages helps you diagnose problems faster:

1. Empty Content

The most obvious failure: you get the HTML shell but no content. The DOM is empty because React hasn't run yet. This happens with pure client-side rendered (CSR) applications where all content is fetched and rendered by JavaScript.

// This returns empty content for CSR pages
const response = await fetch('https://example-spa.com/products/123');
const html = await response.text();
// html contains: <div id="app"></div>
// Products data: undefined
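One practical guard is to detect this failure mode before parsing: check whether the returned HTML looks like an unrendered SPA shell. The helper below is a hypothetical sketch, not a library API, and the root-element ids it checks (root, app, __next) are common framework conventions rather than guarantees:

```javascript
// Heuristic: flag HTML that is probably an unrendered SPA shell.
function looksLikeEmptyShell(html) {
  // Strip scripts, styles, and tags to measure visible text.
  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, '')
    .trim();

  // A bare framework root div with no children is a strong signal.
  const hasBareRoot = /<div[^>]*id=["'](root|app|__next)["'][^>]*>\s*<\/div>/i.test(html);

  return hasBareRoot && visibleText.length < 200;
}
```

If this returns true, fall back to a headless browser (or a managed API) rather than parsing the shell.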

2. Partial Content

Some pages use server-side rendering (SSR) or static site generation (SSG) for the initial HTML, but load additional content asynchronously — reviews, recommendations, dynamic pricing, personalized sections. You get some content but miss the important parts.
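For Next.js sites specifically, the server-rendered portion of the data is usually embedded in the initial HTML as a JSON blob inside a `<script id="__NEXT_DATA__">` tag, which you can sometimes read without a browser at all. A hedged sketch (the tag is a Next.js convention; other hybrid frameworks may have no equivalent):

```javascript
// Extract the embedded Next.js page data from raw HTML, if present.
// Returns the parsed object, or null when the tag is missing or invalid.
function extractNextData(html) {
  const match = html.match(
    /<script\s+id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/i
  );
  if (!match) return null;
  try {
    return JSON.parse(match[1]);
  } catch {
    return null;
  }
}
```

This only covers what the server rendered; the asynchronously loaded sections still need a browser.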

3. Hydration Race Conditions

With Next.js and similar hybrid rendering frameworks, the server sends pre-rendered HTML, but then the client "hydrates" — React takes over the DOM and may temporarily replace content. If you scrape too early (before hydration), you might catch a temporary state where the DOM doesn't match what a real user would see.
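One general defense against hydration races is to wait until the DOM stops changing: take repeated snapshots of the rendered content and only proceed once two consecutive snapshots match. A framework-agnostic sketch of that polling logic (the helper is hypothetical; in Playwright, `getSnapshot` would be something like `() => page.content()`):

```javascript
// Resolve once two consecutive snapshots are identical, i.e. the page
// has (probably) finished hydrating and settled.
async function waitForStableSnapshot(getSnapshot, { intervalMs = 500, maxTries = 10 } = {}) {
  let previous = await getSnapshot();
  for (let i = 0; i < maxTries; i++) {
    await new Promise(resolve => setTimeout(resolve, intervalMs));
    const current = await getSnapshot();
    if (current === previous) return current; // settled
    previous = current;
  }
  throw new Error('Page did not settle within the allotted tries');
}
```

The tradeoff is latency: each stability check costs at least one polling interval, so keep the interval short for pages you know hydrate quickly.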

4. Lazy-Loaded Content

Infinite scroll pages, product carousels, and collapsible sections don't load their content until a user event triggers them. A scroll to the bottom of the page, a click on a "Show more" button, or a hover on a tab can all reveal content that a static scraper never sees.

5. Dynamic Data Fetching

Some pages make their own API calls after loading — fetching prices, inventory levels, personalized recommendations, or user-specific content. These appear in the rendered page but aren't in the initial HTML and aren't served by the page's static files.

How Headless Browsers Solve This

A headless browser (Chromium without a display) runs the full browser engine, including V8 JavaScript execution. When you point it at a URL, it:

  1. Fetches the HTML (same as fetch())
  2. Parses it and builds the DOM
  3. Downloads and executes all JavaScript
  4. Waits for network requests to complete
  5. Runs the full framework initialization cycle
  6. Returns the final, fully rendered DOM

The result is identical to what a real user sees in Chrome. Every React component renders, every API call completes, every lazy-loaded element populates.

The challenge is managing this at scale.

Playwright vs Puppeteer vs a Managed API

Playwright (Self-Managed)

Playwright is the current standard for programmatic browser automation:

import { chromium } from 'playwright';

async function scrapeWithPlaywright(url) {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle' });

  // Wait for specific content to appear
  await page.waitForSelector('.product-price', { timeout: 10000 });

  const content = await page.content();
  await browser.close();

  return content;
}

This works well for one-off scripts and development. At production scale, it breaks in a dozen ways:

Memory leaks. Each browser instance consumes 100-300MB RAM. Running 10 concurrent scrapers requires 1-3GB just for browsers. Memory leaks in long-running browser processes are common and hard to detect.

Crash recovery. Chromium crashes. Pages hang. Network timeouts leave browser processes as zombies. You need retry logic, process watchdogs, and crash recovery at every level.

Version management. Playwright bundles Chromium. Keeping versions in sync with your production environment, managing binary downloads in Docker, and handling OS-level dependencies is a continuous maintenance burden.

IP rotation. A single IP running Playwright at scale gets blocked quickly. You need residential proxy rotation, which adds cost and complexity.

Scaling. Horizontal scaling of headless browsers requires a browser pool, load balancing, health checking, and warm-up management. This is a full infrastructure project.
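The pooling logic itself is the easy part; the operational pain lives in everything around it. A minimal round-robin pool sketch, generic over a `factory` so it could wrap `browser.newContext()` in Playwright (the class and its API are illustrative, not from any library):

```javascript
// Minimal round-robin pool. Production use would add health checks,
// recycling after N uses, and crash recovery on top of this.
class RoundRobinPool {
  constructor(factory, size) {
    this.factory = factory;
    this.size = size;
    this.items = [];
    this.next = 0;
  }

  async init() {
    for (let i = 0; i < this.size; i++) {
      this.items.push(await this.factory());
    }
  }

  acquire() {
    const item = this.items[this.next];
    this.next = (this.next + 1) % this.size;
    return item;
  }
}
```

With Playwright you would typically pass `() => browser.newContext()` as the factory and hand out contexts rather than whole browsers, since contexts are far cheaper than separate processes.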

Puppeteer (Self-Managed)

Puppeteer is older and slightly simpler, but the same operational concerns apply:

import puppeteer from 'puppeteer';

async function scrapeWithPuppeteer(url) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle2' });
  const content = await page.content();

  await browser.close();
  return content;
}

Managed APIs (KnowledgeSDK)

A managed scraping API handles the browser infrastructure for you:

import KnowledgeSDK from '@knowledgesdk/node';

const sdk = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

async function scrapeWithKnowledgeSDK(url) {
  const result = await sdk.scrape({ url });
  return result.markdown; // Clean text, not raw HTML
}

The API call is backed by a managed fleet of headless browsers with:

  • Automatic retry on crash or timeout
  • IP rotation across residential and datacenter proxies
  • Automatic wait logic (waits for the right content before returning)
  • Anti-bot bypass at the infrastructure level
  • Clean markdown output rather than raw HTML
  • No infrastructure to maintain

The tradeoff is cost per request versus infrastructure ownership. At low volume (under 10k requests/month), a managed API is almost always cheaper when you account for engineering time. At very high volume with predictable patterns, self-managed infrastructure can be more cost-efficient.

Specific Edge Cases and How They're Handled

Auth Walls and Cookie Consent

Many sites show cookie consent banners before rendering content. Headless browsers see these banners and — if your automation doesn't dismiss them — may scrape the banner text instead of the page content. The page also often renders differently before consent is given.

KnowledgeSDK automatically handles common cookie consent patterns. For sites with custom consent implementations, you can pass pre-accepted cookie headers:

const result = await sdk.scrape({
  url: 'https://example.com/product',
  headers: {
    'Cookie': 'consent=accepted; gdpr_accepted=1',
  },
});

Infinite Scroll and Pagination

Infinite scroll presents a specific challenge: content loads as the user scrolls, not all at once. A headless browser needs to trigger scroll events to load additional content.

For single-page content (product pages, articles), this usually isn't an issue — the above-fold content is what you need. For collection pages (search results, category listings), you may need to handle scroll-triggered loading:

// With Playwright: scroll to trigger lazy loading
async function scrapeInfiniteScrollPage(url, maxScrolls = 5) {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle' });

  for (let i = 0; i < maxScrolls; i++) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(1500); // Wait for new content to load
  }

  const content = await page.content();
  await browser.close();
  return content;
}

With KnowledgeSDK's extract endpoint, multi-page pagination is handled automatically — the API follows "Next page" links and collects content across pages.

Single-Page Applications with Hash Routing

Some older SPAs use hash-based routing (/#/products/123). These are technically the same URL as the root (/) with a client-side route. This can confuse scrapers that treat the hash fragment as irrelevant.

Proper headless browsers handle this correctly — they navigate to the full URL including the hash and execute the JavaScript routing. HTTP clients that strip hash fragments will always land on the root page.
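You can see why HTTP clients fail here without touching a browser: the hash fragment is a purely client-side construct and is never sent in the HTTP request. A quick demonstration with Node's built-in `URL` (the domain is illustrative):

```javascript
// The fragment lives only in the browser; the server sees just "/".
const url = new URL('https://legacy-spa.example.com/#/products/123');

console.log(url.pathname); // "/"
console.log(url.hash);     // "#/products/123"
// An HTTP client requests url.pathname. The "/products/123" route only
// exists once client-side JavaScript runs the router.
```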

Dynamic CSS Class Names

Build pipelines like webpack, CSS Modules, and styled-components generate hashed class names: .product-price-abc123. These change on every build, so any scraper that relies on CSS selectors to find specific content, like document.querySelector('.product-price-abc123'), breaks whenever the site deploys.

This is why extracting semantic meaning (markdown text, structured content) rather than DOM scraping is more resilient. KnowledgeSDK returns clean text content, not HTML with fragile selectors.
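If you do need to target the DOM directly, prefer attributes that survive rebuilds (data attributes, ARIA roles, semantic tags) over generated class names. A sketch of the idea; the HTML and attribute names here are illustrative:

```javascript
// Choose the most rebuild-resistant selector the page exposes.
function pickSelector(html) {
  // Best: a data attribute the site's own tests depend on,
  // which is unlikely to churn between deploys.
  if (/data-testid="product-price"/.test(html)) {
    return '[data-testid="product-price"]';
  }
  // Fallback: match the hashed class by its stable prefix.
  const match = html.match(/class="(product-price-[a-z0-9]+)"/);
  return match ? `.${match[1]}` : null;
}
```

Even this fallback breaks if the prefix itself changes, which is why extracting semantic content beats selector-based scraping for long-lived pipelines.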

Shadow DOM Components

Web components using Shadow DOM encapsulate their markup in a separate DOM tree that isn't accessible via standard querySelector. Libraries like Lit, Shoelace, and native web components use this pattern.

Piercing the shadow DOM requires browser-level access — you can't do it with HTTP clients. Headless browsers have full access to shadow DOM via the browser's DevTools Protocol.

Performance Considerations

JavaScript rendering is inherently slower than static HTML parsing. Here's a realistic performance comparison:

Approach | Requests/second | Notes
fetch() (static HTML) | 50-200 | Fastest; no JS execution
Playwright (self-managed, single process) | 3-8 | Limited by per-process browser overhead
Playwright (browser pool, 10 instances) | 15-40 | Improves with pooling
Managed API (KnowledgeSDK) | 5-20 | Throughput scales with plan

For high-throughput scraping, design your pipeline to be async and queue-based rather than synchronous. Fire scrape requests and process results as they arrive rather than waiting for each request before starting the next.

// Queue-based approach for high throughput
import pLimit from 'p-limit';

const CONCURRENCY = 10;
const limit = pLimit(CONCURRENCY);

async function scrapeUrlsParallel(urls) {
  const tasks = urls.map(url =>
    limit(() => sdk.scrape({ url }).catch(err => ({ url, error: err.message })))
  );

  const results = await Promise.all(tasks);
  return results;
}
The same pattern in Python (assuming the SDK exposes an awaitable scrape method):

# Python: async parallel scraping
import asyncio
from typing import List, Dict

async def scrape_url(sdk, url: str) -> Dict:
    try:
        # Assumes sdk.scrape is a coroutine; drop the await for a sync client
        return await sdk.scrape(url=url)
    except Exception as e:
        return {"url": url, "error": str(e)}

async def scrape_urls_parallel(sdk, urls: List[str], concurrency: int = 10) -> List[Dict]:
    semaphore = asyncio.Semaphore(concurrency)

    async def limited_scrape(url):
        async with semaphore:
            return await scrape_url(sdk, url)

    return await asyncio.gather(*[limited_scrape(url) for url in urls])

Choosing the Right Wait Strategy

The most common source of incomplete scraping is not waiting long enough for content to load. Different wait strategies have different tradeoffs:

waitUntil: 'load' — Waits for the initial page load event. Misses most async content.

waitUntil: 'networkidle' — Waits until there have been no network connections for at least 500ms. Good for most pages, slow for pages with background polling.

waitForSelector — Waits for a specific element to appear. Most reliable when you know what you're looking for.

Fixed timeout — Waits a set number of milliseconds. Unreliable — slow networks need more time, fast networks waste time waiting.
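A robust middle ground is to combine strategies: wait for the content-critical selector, but cap the wait with a hard deadline so one slow page can't stall the whole pipeline. A generic sketch of that race (the helper is hypothetical; any promise-returning wait, such as `() => page.waitForSelector('.product-price')`, works in place of `wait`):

```javascript
// Race a wait against a hard deadline.
// Resolves { ok: true, value } on success, { ok: false } on timeout.
async function waitWithDeadline(wait, deadlineMs) {
  let timer;
  const timeout = new Promise(resolve => {
    timer = setTimeout(() => resolve({ ok: false }), deadlineMs);
  });
  try {
    return await Promise.race([
      wait().then(value => ({ ok: true, value })),
      timeout,
    ]);
  } finally {
    clearTimeout(timer); // don't keep the process alive after resolution
  }
}
```

On timeout you can still salvage whatever content has rendered so far, or route the URL to a retry queue.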

KnowledgeSDK automatically selects the appropriate wait strategy based on the page type, reducing the need to tune this manually.

FAQ

My page works in Chrome but not in Playwright. Why? The most common cause is that Playwright's headless mode is detectable — some sites return different content to detected automation tools. Another cause is user-agent mismatch: Playwright's default user agent may trigger different page behavior. KnowledgeSDK's browser pool uses hardened browser configurations that minimize detection.

Can I scrape content behind authentication? Yes — pass session cookies or authorization headers in your request. KnowledgeSDK supports custom headers for this purpose. Only access content you are authorized to access.

Why does the same URL return different content each time? Dynamic content (personalization, A/B tests, server-side user data) can vary per request. Some sites serve different content based on geolocation. Use a consistent proxy location if geographic consistency matters.

Is there a way to speed up scraping without compromising accuracy? For known page types, you can often skip waiting for low-priority resources (images, ads, analytics scripts) and only wait for the content-critical elements. Playwright's page.route() lets you block specific resource types. KnowledgeSDK does this automatically for most page types.
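The filtering decision in that pattern is a small pure function. A sketch of the classifier, with the Playwright wiring shown as a comment (the blocked types listed are a common starting point, not a universal rule; blocking stylesheets, for example, can break layout-dependent lazy loading):

```javascript
// Resource types that rarely affect extracted text content.
const BLOCKED_TYPES = new Set(['image', 'media', 'font']);

function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Playwright wiring (sketch):
// await page.route('**/*', route =>
//   shouldBlock(route.request().resourceType()) ? route.abort() : route.continue()
// );
```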

Does JavaScript rendering affect SEO-friendliness of scraped data? Not as an end user of the scraper. The content you get from a headless browser is identical to what Google's crawler sees (Google runs a full Chromium rendering engine). If anything, headless-rendered content is more accurate than SEO-oriented meta tag extraction.


Stop fighting JavaScript rendering. Let a managed browser fleet handle the hard parts at knowledgesdk.com/setup.
