# Scraping JavaScript SPAs: React, Vue, and Angular Without Running a Browser
If you've tried to scrape a modern web application with `curl` or a basic HTTP client, you've seen the problem: the response is an empty HTML shell with a `<div id="root"></div>` and a bundle of JavaScript references. The actual content — the product listings, the pricing tiers, the article text — is nowhere to be found.
This is the JavaScript rendering problem, and it affects the majority of modern websites built with React, Vue, Angular, or any other SPA framework. Understanding why this happens and what your options are is essential for any developer building AI applications that need current web data.
## Why SPAs Break Traditional Scrapers
A traditional website sends fully-rendered HTML from the server. You make an HTTP GET request, you receive a complete document with all the content visible in the source. This is how the web worked for its first two decades.
Modern SPAs flip this model. The server sends a minimal HTML document — typically just a root element and some JavaScript bundle references. The browser downloads the JavaScript, executes it, makes XHR or fetch requests to APIs, and then renders the content into the DOM. By the time a human user sees anything meaningful, several hundred milliseconds of client-side execution have happened.
When a scraper makes the same HTTP request the browser made, it gets the pre-execution HTML — the empty shell. There's no JavaScript execution, no API calls, no rendered content.
```shell
# What you hope to get:
curl https://example-spa.com/products

# What you actually get:
# <!DOCTYPE html><html><head>...</head>
# <body><div id="root"></div>
# <script src="/static/js/main.8f3d2.js"></script>
# </body></html>
```
## How to Know If a Page Needs JS Rendering
Before choosing your scraping approach, check whether the target actually needs JS rendering. Not all sites are SPAs:
**Method 1: curl test.**

```shell
curl -s https://example.com/pricing | grep -i "price\|plan\|per month"
```

If you get matches, the content is server-rendered. If you get nothing, it's likely JS-rendered.
**Method 2: View source vs. inspect element.** Right-click the page and select "View Page Source" (not "Inspect"). If the content you see in the browser isn't in the page source, it's being rendered by JavaScript.
**Method 3: Disable JavaScript.** In Chrome DevTools, open Settings → Preferences → Debugger → Disable JavaScript, then reload the page. If the content disappears, it's JS-dependent.
**Method 4: Check the network tab.** In DevTools, look for XHR/fetch requests made after page load. If the content comes from a JSON API call rather than the initial HTML response, you're dealing with a SPA.
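The curl test can also be automated when you're triaging many targets. The sketch below is a rough heuristic, not a method from any particular library: it strips scripts, styles, and tags from the raw HTML and flags the page as likely client-rendered when almost no visible text remains. The 200-character threshold is an arbitrary assumption.

```typescript
// Heuristic check (sketch): does the raw HTML look like an empty SPA shell?
// Strips scripts, styles, and tags, then measures how much visible text remains.
function looksClientRendered(html: string): boolean {
  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  // 200 chars is an illustrative cutoff, not a standard
  return visibleText.length < 200;
}
```

This won't catch every case (a shell page with a long "Loading…" message could fool it), but it's a cheap first filter before committing to the rendering cost.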
## Option 1: Find the Underlying API (Ideal, But Rare)
The cleanest solution is to skip the rendered HTML entirely and call the same JSON API the SPA calls. The data is there — the browser is fetching it from somewhere.
Open the Network tab in DevTools, filter for XHR/Fetch requests, and look for API calls that return the data you want. You'll often find calls to `/api/products`, `/api/pricing`, or similar endpoints that return clean JSON.
```typescript
// If you find the underlying API, just call it directly
const response = await fetch('https://example.com/api/pricing', {
  headers: {
    'Accept': 'application/json',
    'Referer': 'https://example.com/pricing',
  },
});
const data = await response.json();
```
The catch: this approach is fragile. APIs change without notice, often require authentication tokens that expire, and may have rate limits or bot detection of their own. For production systems monitoring sites you don't control, it's not a reliable foundation.
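One way to soften that fragility is to validate the payload's shape before trusting it, so a silent API change fails loudly instead of corrupting your data downstream. A minimal sketch, using a hypothetical pricing payload with a `plans` array (the field names are illustrative, not from any real API):

```typescript
interface PricingPlan {
  name: string;
  pricePerMonth: number;
}

// Runtime guard (sketch): verify the JSON still has the shape we expect.
// If the site redesigns its API, this returns false instead of letting
// malformed data flow into your pipeline.
function isPricingPayload(data: unknown): data is { plans: PricingPlan[] } {
  if (typeof data !== 'object' || data === null) return false;
  const plans = (data as { plans?: unknown }).plans;
  return (
    Array.isArray(plans) &&
    plans.every(
      (p) =>
        typeof p === 'object' &&
        p !== null &&
        typeof (p as PricingPlan).name === 'string' &&
        typeof (p as PricingPlan).pricePerMonth === 'number',
    )
  );
}
```

In production you'd likely reach for a schema library instead of hand-written guards, but the principle is the same: treat an undocumented API's response as untrusted input.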
## Option 2: Reverse-Engineer the JSON API (Fragile)
This approach extends Option 1: when APIs require auth tokens, you can sometimes extract them from the page's initial HTML or from JavaScript bundle globals. Frameworks like Next.js inject pre-loaded data into `window.__NEXT_DATA__`, which is embedded in the initial HTML and accessible without executing any JavaScript.
```typescript
const response = await fetch('https://nextjs-site.com/pricing');
const html = await response.text();

// Extract Next.js pre-loaded data from the __NEXT_DATA__ script tag
const match = html.match(/<script id="__NEXT_DATA__"[^>]*>(.*?)<\/script>/s);
if (match) {
  const nextData = JSON.parse(match[1]);
  // Navigate the props structure
  const pricingData = nextData.props.pageProps;
}
```
This works for some Next.js sites, but it's specific to the framework and breaks when the data structure changes. It's useful for quick scripts but not for production monitoring.
## Option 3: Headless Browser (Slow, Ops Burden)
Running a headless browser like Playwright or Puppeteer is the reliable way to handle JS rendering — it executes the JavaScript exactly as a real browser would. But it comes with significant trade-offs:
```typescript
import { chromium } from 'playwright';

async function scrapeReactApp(url: string): Promise<string> {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext({
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
  });
  const page = await context.newPage();
  await page.goto(url, { waitUntil: 'networkidle' });

  // Wait for specific content to appear
  await page.waitForSelector('[data-testid="pricing-table"]', {
    timeout: 10000,
  });

  const content = await page.evaluate(() => document.body.innerText);

  await browser.close();
  return content;
}
```
The problem in production:

- Browser launches add 2-5 seconds of overhead.
- Memory usage runs 200-500MB per browser instance.
- You need a server running persistent browser processes.
- Headless Chrome fingerprinting can trigger Cloudflare and similar bot detection.
- Browsers crash and need restart logic.

For AI applications that need sub-second responses or high concurrency, this is a meaningful infrastructure burden.
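If you do run your own browsers, the crash problem at least has a cheap mitigation: wrap each scrape in a retry loop with backoff, so a dead browser process costs you a retry rather than a failed job. A minimal sketch (the attempt counts and delays are illustrative defaults, not recommendations):

```typescript
// Retry wrapper (sketch) for flaky browser sessions.
// `task` is any async function that may throw, e.g. a scrape that
// launches a browser, extracts content, and closes it.
async function withRetries<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Exponential backoff: 500ms, 1s, 2s, ...
        const delay = baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

Usage would look like `await withRetries(() => scrapeReactApp(url))`. Note that this only papers over crashes; it doesn't reduce the memory footprint or launch latency.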
## Option 4: Scraping API with JS Rendering (Easiest for AI Use Cases)
Modern scraping APIs run browser infrastructure on your behalf. You make a simple HTTP call, and the API handles JS rendering, waiting for content to load, and returning clean output.
Different APIs handle this differently:
- Scrape.do uses a `render=true` parameter to enable headless Chrome rendering
- ScrapingBee uses `render_js=true` and supports custom JavaScript execution
- Firecrawl uses its proprietary Fire-engine for rendering — you don't configure it, it decides automatically
- KnowledgeSDK handles JS rendering automatically for every request — no flag required
```typescript
import KnowledgeSDK from '@knowledgesdk/node';

const ks = new KnowledgeSDK({ apiKey: 'knowledgesdk_live_...' });

// JS rendering is automatic — no configuration needed
const result = await ks.extract({
  url: 'https://react-app.com/pricing',
});

// Returns clean markdown, even from React/Vue/Angular SPAs
console.log(result.markdown);
```
The key difference from running your own Playwright: you get clean markdown output instead of raw HTML, error handling and retries are built in, and there's no browser infrastructure to manage.
## How Different APIs Handle SPA Rendering
| API | JS Rendering Approach | Configuration | Output Format |
|---|---|---|---|
| Scrape.do | Headless Chrome via `render=true` | Explicit flag required | Raw HTML |
| ScrapingBee | Chrome via `render_js=true` | Explicit flag required | Raw HTML or text |
| Firecrawl (Fire-engine) | Proprietary, automatic | No flag needed | Markdown |
| KnowledgeSDK | Automatic for all requests | No flag needed | Clean markdown |
The automatic approach matters for AI applications: you don't need to know in advance whether a page is server-rendered or client-rendered. The API handles both, and returns consistent markdown output either way.
## The Speed vs. JS Support Trade-off
JS rendering is inherently slower than raw HTTP requests. A server-rendered page can be fetched and parsed in 200-500ms. A JS-rendered page needs 1-3 seconds for the browser to load, execute JavaScript, and wait for async data fetches to complete.
For AI applications, this trade-off is usually acceptable:
- Batch extraction (indexing a documentation site): Speed matters less than completeness. A 2-second per-page overhead is fine.
- On-demand agent requests: 1-3 second latency is acceptable within a conversational AI flow.
- Real-time streaming: If you need sub-second web data in a streaming context, JS rendering is a problem.
If latency is critical, check whether the page actually needs JS rendering before paying the rendering cost. Many pages that look like SPAs actually use server-side rendering or static generation (Next.js SSG, for example) and return full content in the initial response.
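One way to avoid paying the rendering cost unnecessarily is a two-step fetch: try a plain HTTP request first, and fall back to a JS-rendering fetcher only when the response looks like an empty shell. A sketch with both fetchers injected as parameters, so the decision logic stays testable; the shell check here is a crude heuristic, and `renderFallback` stands in for whatever rendering path you use:

```typescript
// Fast-path sketch: prefer the cheap raw fetch, escalate only when needed.
async function fetchWithFallback(
  url: string,
  fetchRaw: (url: string) => Promise<string>,
  renderFallback: (url: string) => Promise<string>,
): Promise<{ html: string; rendered: boolean }> {
  const raw = await fetchRaw(url);

  // Crude shell check: strip scripts and tags, see if any text survives.
  // A production version would use a larger threshold than "any text at all".
  const text = raw
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .trim();

  if (text.length > 0) {
    return { html: raw, rendered: false }; // server-rendered: skip the slow path
  }
  return { html: await renderFallback(url), rendered: true };
}
```

For a site you scrape repeatedly, you can also cache the per-domain answer: once you know a host needs rendering, skip the probe next time.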
## Putting It Together: A Reliable SPA Extraction Pattern
```typescript
import KnowledgeSDK from '@knowledgesdk/node';

const ks = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY! });

async function extractSpaKnowledge(url: string) {
  // Single call handles both static and JS-rendered pages
  const page = await ks.extract({ url });

  if (!page.markdown || page.markdown.length < 100) {
    throw new Error(`Insufficient content extracted from ${url}`);
  }

  return {
    url,
    markdown: page.markdown,
    title: page.title,
    extractedAt: new Date().toISOString(),
  };
}

// Works on React, Vue, Angular, Next.js, or static HTML — same API
const examples = [
  'https://react-dashboard.com/features',
  'https://vue-storefront.com/products',
  'https://angular-enterprise.com/pricing',
  'https://nextjs-blog.com/post/latest',
];

const results = await Promise.all(examples.map(extractSpaKnowledge));
```
## The Bottom Line
JavaScript SPA scraping is genuinely hard to do correctly with custom infrastructure. The options — finding hidden APIs, reverse-engineering data structures, or running headless browsers — all have significant maintenance costs and failure modes.
For AI applications, the right answer is almost always a scraping API that handles JS rendering automatically. You get consistent results across static and dynamic pages, clean markdown output ready for LLM consumption, and no browser infrastructure to maintain.
KnowledgeSDK's 1,000 free monthly requests are enough to test against your target SPAs before committing to a paid plan. Start there — if the content comes back clean, the hardest part is already solved.