What Is DOM Parsing?
The Document Object Model (DOM) is the browser's in-memory, tree-structured representation of an HTML document. DOM parsing, in the context of web scraping, refers to navigating this tree with selector queries to locate and extract specific elements and their text or attribute values.
Virtually every modern web scraping tool, from Cheerio and BeautifulSoup to Puppeteer and Playwright, relies on DOM parsing as its core extraction mechanism.
The DOM Tree Structure
An HTML document is parsed into a hierarchy of nodes:
```
document
└── html
    ├── head
    │   ├── title
    │   └── meta
    └── body
        ├── h1 "Product Name"
        ├── div.price
        │   └── span "$49.99"
        └── ul.features
            ├── li "Feature A"
            └── li "Feature B"
```
DOM parsers let you navigate this tree by element type, class, ID, attribute, or position.
CSS Selectors
CSS selectors are the most commonly used DOM traversal method in web scraping:
```javascript
// Select by element type
document.querySelectorAll('h1')

// Select by class
document.querySelector('.product-price .amount')

// Select by ID
document.querySelector('#main-content')

// Select by attribute
document.querySelector('meta[name="description"]')

// Select by relationship (child combinator)
document.querySelectorAll('ul.features > li')

// Select by nth position
document.querySelector('table tr:nth-child(2) td:first-child')
```
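Outside a browser, the same class-based matching can be sketched with nothing but Python's standard library. The snippet below is a minimal illustration, not a real selector engine: the `PriceExtractor` class is hypothetical, and it approximates the descendant selector `.price span` by tracking the stack of open tags.

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collect text from <span> elements inside an element with class 'price'."""

    def __init__(self):
        super().__init__()
        self.stack = []    # (tag, class list) for each currently open element
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text node whose element is a <span> with an ancestor of class "price"
        if self.stack and self.stack[-1][0] == "span":
            if any("price" in classes for _, classes in self.stack):
                self.prices.append(data.strip())

parser = PriceExtractor()
parser.feed('<div class="price"><span>$49.99</span></div>')
print(parser.prices)  # ['$49.99']
```

Real libraries like Cheerio or BeautifulSoup implement full selector grammars on top of essentially this kind of tag-stack traversal.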
XPath
XPath is a more powerful alternative that allows traversal by text content and complex relational paths:
```
// Select a heading containing specific text
//h2[contains(text(), "Pricing")]

// Select the next sibling element
//label[text()="Price"]/following-sibling::span

// Select elements by attribute prefix
//*[starts-with(@class, "product-")]
```
XPath is especially useful when CSS classes are dynamic or obfuscated (common in React/Vue applications that hash class names).
Server-Side DOM Parsing Libraries
| Language | Library |
|---|---|
| JavaScript | Cheerio (static), Playwright / Puppeteer (live browser) |
| Python | BeautifulSoup, lxml, Playwright |
| Ruby | Nokogiri |
| Go | goquery |
| Java | Jsoup |
Cheerio and BeautifulSoup work on raw HTML strings. Playwright and Puppeteer operate on a live browser's DOM after JavaScript has executed.
The Fragility Problem
CSS selector-based scrapers are notoriously brittle. A typical failure scenario:
- You write a selector: `.product-card__price`
- The site's frontend team ships a redesign
- The new class is `.ProductCard_price__3kX9a` (hashed by a CSS module)
- Your selector matches nothing; extraction fails silently
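One cheap mitigation is to make misses loud instead of silent. The guard below is a hypothetical helper, sketched in Python; wrap it around whatever parser library you actually use.

```python
def require_match(matches, selector):
    """Raise instead of silently returning nothing when a selector misses."""
    if not matches:
        raise ValueError(
            f"selector {selector!r} matched nothing; "
            "the page structure may have changed"
        )
    return matches

# Simulate the redesign: the page now carries only the hashed class.
page_classes = {"ProductCard_price__3kX9a"}
matches = [c for c in page_classes if c == "product-card__price"]

try:
    require_match(matches, ".product-card__price")
except ValueError as exc:
    print(exc)  # surfaces the break instead of emitting empty data
```

An alert raised here is far cheaper to debug than a downstream pipeline quietly filling with nulls.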
Why AI Extraction Supersedes DOM Parsing
Tools like KnowledgeSDK's `POST /v1/extract` use AI to understand page semantics instead of relying on fragile selectors. The model can locate a price even when the class name changes, because it understands that a currency-formatted number near the product title is likely the price.
```http
POST /v1/extract
Authorization: Bearer knowledgesdk_live_...

{
  "url": "https://shop.example.com/products/widget",
  "schema": { "price": "number", "product_name": "string" }
}
```
This produces reliable output without a single CSS selector or XPath expression.
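For reference, the request above can be assembled with only Python's standard library. The API host below is an assumption for illustration; substitute your account's actual endpoint and key.

```python
import json
import urllib.request

payload = {
    "url": "https://shop.example.com/products/widget",
    "schema": {"price": "number", "product_name": "string"},
}

req = urllib.request.Request(
    "https://api.knowledgesdk.example/v1/extract",  # hypothetical host
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": "Bearer knowledgesdk_live_...",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Uncomment to actually send the request:
# result = json.load(urllib.request.urlopen(req))
```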
When DOM Parsing Still Makes Sense
- You fully control the target site and its HTML structure is stable
- You need maximum speed with no LLM latency
- You are building a browser extension that runs in a live browser context
- The extraction logic is simple enough that selectors will not break frequently