What Is API Scraping?
API scraping is the practice of extracting data by calling a website's internal, private, or undocumented HTTP APIs — typically JSON endpoints — rather than parsing the visible HTML of the page. Modern web applications often load their data from a backend API and render it client-side with JavaScript. By intercepting or reverse-engineering those API calls, a scraper can retrieve clean, structured JSON directly, bypassing the need to parse HTML at all.
How It Differs from HTML Scraping
| Approach | Data Source | Output Format | Brittleness |
|---|---|---|---|
| HTML scraping | Rendered DOM | Raw HTML → parsed fields | High (breaks on redesign) |
| API scraping | Internal API endpoints | JSON / structured data | Medium (breaks on API changes) |
API scraping produces cleaner, more structured data with less parsing effort. The trade-off is that the API endpoints are undocumented and may change or require authentication tokens.
How to Find Internal API Endpoints
The standard technique is to use browser developer tools to inspect network requests while browsing the target site:
- Open Chrome DevTools → Network tab
- Filter by Fetch/XHR to see only API requests
- Browse the site normally — search, scroll, click
- Identify JSON-returning requests that carry the data you need
- Copy the request URL, headers, and cookies
For example, a social media site's feed might be loaded by:
GET https://api.example.com/v2/feed?user_id=12345&limit=20
Authorization: Bearer eyJhbGc...
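Once a request like this has been captured, it can usually be replayed outside the browser. The following is a minimal sketch, assuming Node.js 18 or later (for the global `fetch`) and reusing the example endpoint above. The function names `buildFeedRequest` and `fetchFeed`, the `Accept` header, and the injectable `fetchImpl` parameter are illustrative assumptions, not part of any real API.

```javascript
// Sketch: rebuild the captured feed request so it can be replayed from a script.
// The endpoint and query parameters come from the example above; the token is
// whatever value was copied from DevTools.
function buildFeedRequest(userId, limit, token) {
  const url = new URL('https://api.example.com/v2/feed');
  url.searchParams.set('user_id', String(userId));
  url.searchParams.set('limit', String(limit));
  return {
    url: url.toString(),
    options: {
      method: 'GET',
      headers: {
        Authorization: `Bearer ${token}`,
        Accept: 'application/json', // assumption: the endpoint returns JSON
      },
    },
  };
}

// fetchImpl is injectable (hypothetical parameter) so the function can be
// exercised without network access; it defaults to the global fetch.
async function fetchFeed(userId, limit, token, fetchImpl = fetch) {
  const { url, options } = buildFeedRequest(userId, limit, token);
  const response = await fetchImpl(url, options);
  if (!response.ok) {
    throw new Error(`Feed request failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Keeping the URL construction in its own function makes it easy to compare the rebuilt request against the one shown in DevTools before sending anything.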
Common Patterns in API Scraping
- Pagination tokens: APIs often return a next_cursor or page_token for paginated results; your scraper must follow these to collect all records
- Authentication headers: many internal APIs require session cookies or bearer tokens obtained by simulating a login
- Rate limiting: internal APIs usually enforce rate limits; pace your requests with delays, or your token or IP address may be throttled or revoked
- GraphQL endpoints: some sites use GraphQL, which lets you request exactly the fields you need
- WebSocket streams: real-time data (prices, scores, feeds) may arrive over a WebSocket rather than REST
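The pagination and rate-limiting patterns above can be sketched together. This is a minimal example under stated assumptions: the response body is assumed to look like `{ items: [...], next_cursor: "..." }`, and the cursor is assumed to be sent back as a `cursor` query parameter. Real internal APIs vary, so check the actual requests in DevTools. The names `fetchAllPages`, `delayMs`, `maxPages`, and `fetchImpl` are hypothetical.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Sketch: follow a cursor until the API stops returning one, pausing between
// requests. ASSUMPTIONS: body shape { items, next_cursor } and a `cursor`
// query parameter; both are illustrative, not taken from a real API.
async function* fetchAllPages(
  baseUrl,
  headers,
  { delayMs = 1000, maxPages = 100, fetchImpl = fetch } = {}
) {
  let cursor = null;
  for (let page = 0; page < maxPages; page++) {
    const url = new URL(baseUrl);
    if (cursor) url.searchParams.set('cursor', cursor);

    const response = await fetchImpl(url.toString(), { headers });
    if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);

    const body = await response.json();
    yield* (body.items ?? []);

    cursor = body.next_cursor ?? null;
    if (!cursor) return; // no further pages
    await sleep(delayMs); // fixed delay between pages to stay under rate limits
  }
}
```

A caller would consume it with `for await (const item of fetchAllPages(url, headers)) { ... }`. The `maxPages` cap is a safety net in case the API keeps returning a cursor indefinitely.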
Example: Collecting Product Data via Internal API
// Intercepted endpoint from browser DevTools
const response = await fetch(
  'https://api.shop.example.com/products?category=electronics&page=1',
  {
    headers: {
      'Authorization': 'Bearer <session-token>',
      'x-client-version': '3.14.0',
    }
  }
);
const { products, next_page } = await response.json();
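The snippet above assumes the request succeeds. A slightly more defensive variant might check the status code and back off when the server answers 429 (Too Many Requests). The sketch below is one way to do that, assuming Node.js 18 or later. The helper name `fetchJsonWithRetry` and its options are hypothetical, and whether a given internal API sends a `Retry-After` header at all is an assumption.

```javascript
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Sketch: fetch JSON, retrying when the server answers 429.
// ASSUMPTION: the server may send Retry-After in seconds; if the header is
// missing or not a positive number, fall back to exponential backoff.
async function fetchJsonWithRetry(
  url,
  options = {},
  { maxRetries = 3, baseDelayMs = 1000, fetchImpl = fetch } = {}
) {
  for (let attempt = 0; ; attempt++) {
    const response = await fetchImpl(url, options);

    if (response.status === 429 && attempt < maxRetries) {
      const retryAfter = Number(response.headers.get('retry-after'));
      const delayMs = Number.isFinite(retryAfter) && retryAfter > 0
        ? retryAfter * 1000
        : baseDelayMs * 2 ** attempt;
      await wait(delayMs);
      continue;
    }

    // Any other non-2xx status (or a 429 after the last retry) is an error.
    if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
    return response.json();
  }
}
```

With this helper, the product request could be written as `const { products, next_page } = await fetchJsonWithRetry(url, { headers })`, using the same URL and headers as in the example.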
When to Use API Scraping vs. HTML Scraping
Use API scraping when:
- The site loads data via clearly identifiable XHR/fetch requests
- You need large volumes of records (JSON responses are usually smaller than full pages and need no HTML parsing)
- The page is a complex SPA where HTML scraping is unreliable
Use HTML scraping (or KnowledgeSDK's POST /v1/scrape / POST /v1/extract) when:
- Content is server-rendered and lives directly in the HTML
- No clear API endpoint is discoverable
- You need the full rendered page including text, metadata, and structure
Legal and Ethical Considerations
Accessing undocumented private APIs without authorization may violate a site's terms of service and, in the United States, potentially the Computer Fraud and Abuse Act (CFAA); other jurisdictions have comparable computer misuse laws. Always review the site's terms before proceeding, and prefer official public APIs when available.