Web Scraping with Python for LLMs: From BeautifulSoup to Knowledge APIs
Python is the language of AI. Most LLM frameworks — LangChain, LlamaIndex, DSPy, Haystack — are Python-first. Your model training and fine-tuning pipelines are probably Python. Your vector database client is Python. It makes sense, then, that Python is where most teams start when they need to feed web data into their LLM applications.
The problem is that web scraping in Python spans a wide spectrum of approaches, and the right one for an LLM application is not necessarily the most commonly Googled one. BeautifulSoup tutorials dominate search results, but BeautifulSoup alone cannot see content that is rendered with JavaScript, which rules out a large share of the modern web. At the other extreme, running Selenium or Playwright for every page you want to read is slower and more expensive than it needs to be.
This guide walks through the full evolution — from DIY requests to production knowledge APIs — with working code for each approach and honest assessments of where each fits.
Approach 1: requests + BeautifulSoup
The classic. requests fetches the HTML, BeautifulSoup parses it. Fast to write, zero cost to run, and completely breaks on JavaScript-rendered sites.
```python
import requests
from bs4 import BeautifulSoup

def scrape_static(url: str) -> str:
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')

    # Remove noise
    for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
        tag.decompose()

    # Get main content
    main = soup.find('main') or soup.find('article') or soup.body
    return main.get_text(separator='\n', strip=True) if main else ''

text = scrape_static('https://en.wikipedia.org/wiki/Python_(programming_language)')
print(text[:500])
```
Where this works: Wikipedia, government sites, static blogs, any site that serves full HTML on first load. You can check whether a site is a candidate by disabling JavaScript in your browser and reloading the page: if the content still appears, static scraping will work.
Where this fails: React/Vue/Angular SPAs, any site that loads content asynchronously, sites with anti-bot measures that block requests without a real browser fingerprint.
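A rough programmatic version of that disable-JavaScript check: fetch the raw HTML and see whether it already contains a meaningful amount of visible text. The helper and the 500-character threshold below are my own heuristic, not a hard rule.

```python
import requests
from bs4 import BeautifulSoup

def looks_static(url: str, min_chars: int = 500) -> bool:
    # Heuristic: an SPA shell usually ships a nearly empty <body>, so very
    # little visible text in the raw HTML suggests client-side rendering
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)'}
    html = requests.get(url, headers=headers, timeout=10).text

    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style', 'noscript']):
        tag.decompose()

    visible_text = soup.get_text(separator=' ', strip=True)
    return len(visible_text) >= min_chars
```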
For LLM applications specifically, the text extraction is also crude — you lose heading structure, code block formatting, and table structure. What you get is a flat wall of text, which is not ideal context.
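If you stay with this approach, one low-effort improvement (a sketch on my part, not part of the original recipe) is to hand the cleaned HTML to markdownify, the library used again in Approach 2, instead of flattening it to text:

```python
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify

def scrape_static_markdown(url: str) -> str:
    # Same fetch-and-clean steps as scrape_static above
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')
    for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
        tag.decompose()

    main = soup.find('main') or soup.find('article') or soup.body
    # Convert the element's HTML to markdown instead of flat text,
    # which keeps headings, lists, tables, and code blocks
    return markdownify(str(main), heading_style='ATX') if main else ''
```

This does nothing for JavaScript-rendered content, but it preserves the structure that the flat-text version throws away.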
Approach 2: Selenium / Playwright
When static scraping fails, most developers reach for a headless browser. Selenium is the older standard; Playwright is the modern replacement with better async support and a cleaner API.
```python
from playwright.sync_api import sync_playwright
import markdownify

def scrape_with_playwright(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Set a realistic user agent
        page.set_extra_http_headers({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })

        page.goto(url, wait_until='networkidle')

        # Remove nav, footer, ads
        page.evaluate("""
            ['nav', 'footer', 'header', '.ads', '#sidebar'].forEach(sel => {
                document.querySelectorAll(sel).forEach(el => el.remove());
            });
        """)

        html = page.content()
        browser.close()

    # Convert to markdown
    return markdownify.markdownify(html, heading_style='ATX')

markdown = scrape_with_playwright('https://docs.python.org/3/library/asyncio.html')
print(markdown[:1000])
```
Dependencies: `pip install playwright markdownify && playwright install chromium`
Where this works: JavaScript-rendered sites, SPAs, pages that require interaction to load content.
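Interaction is where a headless browser earns its keep. The sketch below shows the general pattern; the selector, URL handling, and fixed waits are hypothetical and would need to match the target site.

```python
from playwright.sync_api import sync_playwright

def scrape_with_interaction(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')

        # Scroll to the bottom to trigger lazy-loaded content
        page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
        page.wait_for_timeout(1000)

        # Click a "Load more" button if the page has one (selector is hypothetical)
        load_more = page.locator('button:has-text("Load more")')
        if load_more.count() > 0:
            load_more.first.click()
            page.wait_for_load_state('networkidle')

        html = page.content()
        browser.close()
    return html
```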
Where this fails: Sites with sophisticated bot detection (Cloudflare challenge, PerimeterX, Akamai Bot Manager) that fingerprint browser environments and detect automation. Playwright's browser fingerprint is detectable by these systems. It also fails at scale — running 100 browser instances is a significant infrastructure investment.
For LLM use: Playwright gets you rendered HTML, but converting HTML to LLM-ready markdown is still your problem. The markdownify library helps, but you'll still get boilerplate — navigation links, cookie notices, footer content — that wastes tokens.
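A common stopgap is to post-process the markdown before it reaches the model. The filter below is purely my own heuristic (markdownify has no such feature): drop lines that mention cookie consent or that consist only of links, which is what navigation bars collapse into.

```python
import re

BOILERPLATE_PATTERNS = [
    re.compile(r'accept (all )?cookies', re.IGNORECASE),      # cookie notices
    re.compile(r'^\s*(\[[^\]]*\]\([^)]*\)\s*\|?\s*)+$'),      # lines that are only links (nav bars)
]

def strip_boilerplate(markdown: str) -> str:
    kept = []
    for line in markdown.splitlines():
        if any(p.search(line) for p in BOILERPLATE_PATTERNS):
            continue
        kept.append(line)
    # Collapse runs of blank lines left behind by the removals
    return re.sub(r'\n{3,}', '\n\n', '\n'.join(kept)).strip()

# Usage: clean = strip_boilerplate(scrape_with_playwright(url))
```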
Approach 3: Scrapy for Bulk Crawling
If you need to crawl an entire site rather than individual pages, Scrapy is the right tool. It handles request queuing, rate limiting, deduplication, and parallel crawling out of the box.
```python
import scrapy
from markdownify import markdownify

class KnowledgeSpider(scrapy.Spider):
    name = 'knowledge'
    start_urls = ['https://docs.example.com']

    def parse(self, response):
        # Extract main content
        content_html = response.css('article, main, .content').get('')
        markdown = markdownify(content_html, heading_style='ATX')

        yield {
            'url': response.url,
            'title': response.css('h1::text').get(''),
            'markdown': markdown,
        }

        # Follow links to other docs pages
        for href in response.css('a[href]::attr(href)').getall():
            if href.startswith('/') or 'docs.example.com' in href:
                yield response.follow(href, self.parse)
```
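The queuing and politeness behavior mentioned above lives in Scrapy settings rather than spider code. These are standard setting names; the values are illustrative starting points, not recommendations:

```python
# settings.py (or custom_settings on the spider above)
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # be polite to a single host
DOWNLOAD_DELAY = 0.5                 # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True          # back off automatically when responses slow down
ROBOTSTXT_OBEY = True                # respect robots.txt
```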
Where this works: Large-scale crawls of static or server-rendered sites, building datasets, indexing entire domains.
Where this fails: JavaScript-rendered sites (Scrapy uses raw HTTP, no browser). You can add Scrapy-Splash or Scrapy-Playwright middleware, but at that point you're running significant infrastructure. Also, Scrapy's output is structured for data pipelines, not LLM consumption — you still need to clean and format the content yourself.
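For reference, wiring in scrapy-playwright looks roughly like this; treat it as a sketch and check the project's README for the current settings:

```python
# settings.py: route downloads through Playwright via scrapy-playwright
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

# In the spider, opt individual requests into browser rendering:
#     yield scrapy.Request(url, meta={'playwright': True})
```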
Approach 4: API-Based Extraction
This is where most production LLM applications should land: call an API, get back LLM-ready content, pipe it into your chain. No browser management, no proxy rotation, no HTML cleaning logic in your codebase.
```python
import requests

KNOWLEDGESDK_API_KEY = 'knowledgesdk_live_...'

def scrape_to_markdown(url: str) -> str:
    response = requests.post(
        'https://api.knowledgesdk.com/v1/extract',
        headers={
            'x-api-key': KNOWLEDGESDK_API_KEY,
            'Content-Type': 'application/json',
        },
        json={'url': url},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()['markdown']

# Clean, LLM-ready markdown in one call
markdown = scrape_to_markdown('https://docs.example.com/api-reference')
print(markdown)
```
The response is clean markdown: proper heading hierarchy, code blocks with language tags, tables formatted as markdown tables, no navigation or footer content. Ready to pass directly to an LLM.
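In production you will want some resilience around that call. A minimal sketch, assuming transient failures show up as timeouts, 429s, or 5xx responses (the exact status codes are my assumption, not documented behavior):

```python
import time
import requests

def scrape_to_markdown_with_retries(url: str, max_attempts: int = 3) -> str:
    # Simple exponential backoff around the extract endpoint;
    # KNOWLEDGESDK_API_KEY is defined as in the example above
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post(
                'https://api.knowledgesdk.com/v1/extract',
                headers={'x-api-key': KNOWLEDGESDK_API_KEY, 'Content-Type': 'application/json'},
                json={'url': url},
                timeout=30,
            )
            if response.status_code in (429, 500, 502, 503) and attempt < max_attempts:
                time.sleep(2 ** attempt)  # back off: 2s, 4s, ...
                continue
            response.raise_for_status()
            return response.json()['markdown']
        except requests.Timeout:
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)
    raise RuntimeError(f'Failed to extract {url} after {max_attempts} attempts')
```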
Putting It All Together: LangChain RAG Pipeline
Here's a complete example showing how to build a RAG pipeline from web sources using the API approach with LangChain:
```python
import requests
from langchain_text_splitters import MarkdownTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

KNOWLEDGESDK_API_KEY = 'knowledgesdk_live_...'

def fetch_markdown(url: str) -> str:
    response = requests.post(
        'https://api.knowledgesdk.com/v1/extract',
        headers={'x-api-key': KNOWLEDGESDK_API_KEY, 'Content-Type': 'application/json'},
        json={'url': url},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()['markdown']

# URLs to index
urls = [
    'https://docs.example.com/getting-started',
    'https://docs.example.com/authentication',
    'https://docs.example.com/api-reference',
]

# Fetch and split
splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)
all_docs = []
for url in urls:
    markdown = fetch_markdown(url)
    chunks = splitter.create_documents([markdown], metadatas=[{'source': url}])
    all_docs.extend(chunks)

print(f'Indexed {len(all_docs)} chunks from {len(urls)} pages')

# Build vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(all_docs, embeddings)

# Create RAG chain
llm = ChatOpenAI(model='gpt-4o', temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={'k': 4}),
    return_source_documents=True,
)

# Query
result = qa_chain.invoke({'query': 'How do I authenticate API requests?'})
print(result['result'])
```
This is a working RAG pipeline over web sources in under 50 lines of Python. The API approach eliminates the HTML parsing, boilerplate removal, and markdown conversion work you'd otherwise write and maintain.
Approach Comparison
| Approach | JS Support | Anti-bot | LLM-Ready Output | Maintenance Burden | Cost |
|---|---|---|---|---|---|
| requests + BeautifulSoup | No | Minimal | Poor (flat text) | Low | Free |
| Playwright | Yes | Limited | Medium (needs cleaning) | Medium | Infrastructure |
| Scrapy | No (without addon) | Minimal | Poor | Medium | Infrastructure |
| Scraping API | Yes | Good-Excellent | Excellent | None | Per-request |
When Each Approach Makes Sense
Use requests + BeautifulSoup if: You're scraping a specific static site you control or know well, you're doing one-off data collection, or you're prototyping before investing in a more robust solution.
Use Playwright/Selenium if: You need to interact with pages (login, click, scroll), the sites you're targeting require it, and you're running at moderate scale where infrastructure cost is acceptable.
Use Scrapy if: You're building a crawler for a large static site, you need fine-grained control over crawl behavior, and you have backend infrastructure to run it.
Use an API if: You're building a production LLM application, you need clean markdown reliably across diverse sites, you want to avoid operational overhead, and you're extracting content rather than automating interactions.
For most teams building LLM applications in 2026, the API approach is the right default. The operational burden of maintaining your own scraping infrastructure is significant, and the quality of managed scraping APIs has improved to the point where DIY rarely pays off unless you have very specific requirements.