What Is DOM Parsing?
The Document Object Model (DOM) is the browser's in-memory, tree-structured representation of an HTML document. DOM parsing, in the context of web scraping, refers to navigating this tree with selector queries to locate and extract specific elements and their text or attribute values.
Virtually every modern web scraping tool, from Cheerio and BeautifulSoup to Puppeteer and Playwright, relies on DOM parsing as its core extraction mechanism.
The DOM Tree Structure
An HTML document is parsed into a hierarchy of nodes:
```
document
└── html
    ├── head
    │   ├── title
    │   └── meta
    └── body
        ├── h1 "Product Name"
        ├── div.price
        │   └── span "$49.99"
        └── ul.features
            ├── li "Feature A"
            └── li "Feature B"
```
DOM parsers let you navigate this tree by element type, class, ID, attribute, or position.
CSS Selectors
CSS selectors are the most commonly used DOM traversal method in web scraping:
```javascript
// Select by element type
document.querySelectorAll('h1')

// Select by class
document.querySelector('.product-price .amount')

// Select by ID
document.querySelector('#main-content')

// Select by attribute
document.querySelector('meta[name="description"]')

// Select by relationship (child combinator)
document.querySelectorAll('ul.features > li')

// Select by nth position
document.querySelector('table tr:nth-child(2) td:first-child')
```
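Outside a browser, the same class-based matching can be sketched with nothing but Python's standard library. The snippet below is a minimal illustration, not a real selector engine: the `PriceExtractor` class is hypothetical, and it approximates the descendant selector `.price span` by tracking the stack of open tags.

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collect text from <span> elements inside an element with class 'price'."""

    def __init__(self):
        super().__init__()
        self.stack = []    # (tag, class list) for each currently open element
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text node whose element is a <span> with an ancestor of class "price"
        if self.stack and self.stack[-1][0] == "span":
            if any("price" in classes for _, classes in self.stack):
                self.prices.append(data.strip())

parser = PriceExtractor()
parser.feed('<div class="price"><span>$49.99</span></div>')
print(parser.prices)  # ['$49.99']
```

Real libraries like Cheerio or BeautifulSoup implement full selector grammars on top of essentially this kind of tag-stack traversal.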
XPath
XPath is a more powerful alternative that allows traversal by text content and complex relational paths:
```
// Select a heading containing specific text
//h2[contains(text(), "Pricing")]

// Select the next sibling element
//label[text()="Price"]/following-sibling::span

// Select elements by attribute prefix
//*[starts-with(@class, "product-")]
```
XPath is especially useful when CSS classes are dynamic or obfuscated (common in React/Vue applications that hash class names).
Server-Side DOM Parsing Libraries
| Language | Library |
|---|---|
| JavaScript | Cheerio (static), Playwright / Puppeteer (live browser) |
| Python | BeautifulSoup, lxml, Playwright |
| Ruby | Nokogiri |
| Go | goquery |
| Java | Jsoup |
Cheerio and BeautifulSoup work on raw HTML strings. Playwright and Puppeteer operate on a live browser's DOM after JavaScript has executed.
The Fragility Problem
CSS selector-based scrapers are notoriously brittle. A typical failure scenario:
- You write a selector: `.product-card__price`
- The site's frontend team ships a redesign
- The new class is `.ProductCard_price__3kX9a` (hashed by a CSS module)
- Your selector matches nothing; extraction fails silently
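One cheap mitigation is to make misses loud instead of silent. The guard below is a hypothetical helper, sketched in Python; wrap it around whatever parser library you actually use.

```python
def require_match(matches, selector):
    """Raise instead of silently returning nothing when a selector misses."""
    if not matches:
        raise ValueError(
            f"selector {selector!r} matched nothing; "
            "the page structure may have changed"
        )
    return matches

# Simulate the redesign: the page now carries only the hashed class.
page_classes = {"ProductCard_price__3kX9a"}
matches = [c for c in page_classes if c == "product-card__price"]

try:
    require_match(matches, ".product-card__price")
except ValueError as exc:
    print(exc)  # surfaces the break instead of emitting empty data
```

An alert raised here is far cheaper to debug than a downstream pipeline quietly filling with nulls.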
Why AI Extraction Supersedes DOM Parsing
Tools like KnowledgeSDK's `POST /v1/extract` use AI to understand page semantics instead of relying on fragile selectors. The model can locate a price even when the class name changes, because it understands that a currency-formatted number near the product title is likely the price.
```http
POST /v1/extract
Authorization: Bearer knowledgesdk_live_...

{
  "url": "https://shop.example.com/products/widget",
  "schema": { "price": "number", "product_name": "string" }
}
```
This produces reliable output without a single CSS selector or XPath expression.
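For reference, the request above can be assembled with only Python's standard library. The API host below is an assumption for illustration; substitute your account's actual endpoint and key.

```python
import json
import urllib.request

payload = {
    "url": "https://shop.example.com/products/widget",
    "schema": {"price": "number", "product_name": "string"},
}

req = urllib.request.Request(
    "https://api.knowledgesdk.example/v1/extract",  # hypothetical host
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": "Bearer knowledgesdk_live_...",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Uncomment to actually send the request:
# result = json.load(urllib.request.urlopen(req))
```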
When DOM Parsing Still Makes Sense
- You fully control the target site and its HTML structure is stable
- You need maximum speed with no LLM latency
- You are building a browser extension that runs in a live browser context
- The extraction logic is simple enough that selectors will not break frequently