Web Scraping with Python for LLMs: From BeautifulSoup to Knowledge APIs
Python is the language of AI. Most LLM frameworks — LangChain, LlamaIndex, DSPy, Haystack — are Python-first. Your model training and fine-tuning pipelines are probably Python. Your vector database client is Python. It makes sense, then, that Python is where most teams start when they need to feed web data into their LLM applications.
The problem is that web scraping in Python spans a wide spectrum of approaches, and the right one for an LLM application is not necessarily the most commonly Googled one. BeautifulSoup tutorials dominate search results, but BeautifulSoup alone cannot see content that is rendered with JavaScript, which rules out a large share of the modern web. At the other extreme, running Selenium or Playwright for every page you want to read is slower and more expensive than it needs to be.
This guide walks through the full evolution — from DIY requests to production knowledge APIs — with working code for each approach and honest assessments of where each fits.
Approach 1: requests + BeautifulSoup
The classic. requests fetches the HTML, BeautifulSoup parses it. Fast to write, zero cost to run, and completely breaks on JavaScript-rendered sites.
```python
import requests
from bs4 import BeautifulSoup

def scrape_static(url: str) -> str:
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')

    # Remove noise
    for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
        tag.decompose()

    # Get main content
    main = soup.find('main') or soup.find('article') or soup.body
    return main.get_text(separator='\n', strip=True) if main else ''

text = scrape_static('https://en.wikipedia.org/wiki/Python_(programming_language)')
print(text[:500])
```
Where this works: Wikipedia, government sites, static blogs, any site that serves full HTML on first load. You can check whether a site is a candidate by disabling JavaScript in your browser and reloading the page: if the content still appears, static scraping will work.
Where this fails: React/Vue/Angular SPAs, any site that loads content asynchronously, sites with anti-bot measures that block requests without a real browser fingerprint.
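A rough programmatic version of that disable-JavaScript check: fetch the raw HTML and see whether it already contains a meaningful amount of visible text. The helper and the 500-character threshold below are my own heuristic, not a hard rule.

```python
import requests
from bs4 import BeautifulSoup

def looks_static(url: str, min_chars: int = 500) -> bool:
    # Heuristic: an SPA shell usually ships a nearly empty <body>, so very
    # little visible text in the raw HTML suggests client-side rendering
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)'}
    html = requests.get(url, headers=headers, timeout=10).text

    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style', 'noscript']):
        tag.decompose()

    visible_text = soup.get_text(separator=' ', strip=True)
    return len(visible_text) >= min_chars
```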
For LLM applications specifically, the text extraction is also crude — you lose heading structure, code block formatting, and table structure. What you get is a flat wall of text, which is not ideal context.
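If you stay with this approach, one low-effort improvement (a sketch on my part, not part of the original recipe) is to hand the cleaned HTML to markdownify, the library used again in Approach 2, instead of flattening it to text:

```python
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify

def scrape_static_markdown(url: str) -> str:
    # Same fetch-and-clean steps as scrape_static above
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, 'html.parser')
    for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
        tag.decompose()

    main = soup.find('main') or soup.find('article') or soup.body
    # Convert the element's HTML to markdown instead of flat text,
    # which keeps headings, lists, tables, and code blocks
    return markdownify(str(main), heading_style='ATX') if main else ''
```

This does nothing for JavaScript-rendered content, but it preserves the structure that the flat-text version throws away.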
Approach 2: Selenium / Playwright
When static scraping fails, most developers reach for a headless browser. Selenium is the older standard; Playwright is the modern replacement with better async support and a cleaner API.
```python
from playwright.sync_api import sync_playwright
import markdownify

def scrape_with_playwright(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Set a realistic user agent
        page.set_extra_http_headers({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })

        page.goto(url, wait_until='networkidle')

        # Remove nav, footer, ads
        page.evaluate("""
            ['nav', 'footer', 'header', '.ads', '#sidebar'].forEach(sel => {
                document.querySelectorAll(sel).forEach(el => el.remove());
            });
        """)

        html = page.content()
        browser.close()

    # Convert to markdown
    return markdownify.markdownify(html, heading_style='ATX')

markdown = scrape_with_playwright('https://docs.python.org/3/library/asyncio.html')
print(markdown[:1000])
```
Dependencies: `pip install playwright markdownify && playwright install chromium`
Where this works: JavaScript-rendered sites, SPAs, pages that require interaction to load content.
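Interaction is where a headless browser earns its keep. The sketch below shows the general pattern; the selector, URL handling, and fixed waits are hypothetical and would need to match the target site.

```python
from playwright.sync_api import sync_playwright

def scrape_with_interaction(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')

        # Scroll to the bottom to trigger lazy-loaded content
        page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
        page.wait_for_timeout(1000)

        # Click a "Load more" button if the page has one (selector is hypothetical)
        load_more = page.locator('button:has-text("Load more")')
        if load_more.count() > 0:
            load_more.first.click()
            page.wait_for_load_state('networkidle')

        html = page.content()
        browser.close()
    return html
```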
Where this fails: Sites with sophisticated bot detection (Cloudflare challenge, PerimeterX, Akamai Bot Manager) that fingerprint browser environments and detect automation. Playwright's browser fingerprint is detectable by these systems. It also fails at scale — running 100 browser instances is a significant infrastructure investment.
For LLM use: Playwright gets you rendered HTML, but converting HTML to LLM-ready markdown is still your problem. The markdownify library helps, but you'll still get boilerplate — navigation links, cookie notices, footer content — that wastes tokens.
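A common stopgap is to post-process the markdown before it reaches the model. The filter below is purely my own heuristic (markdownify has no such feature): drop lines that mention cookie consent or that consist only of links, which is what navigation bars collapse into.

```python
import re

BOILERPLATE_PATTERNS = [
    re.compile(r'accept (all )?cookies', re.IGNORECASE),      # cookie notices
    re.compile(r'^\s*(\[[^\]]*\]\([^)]*\)\s*\|?\s*)+$'),      # lines that are only links (nav bars)
]

def strip_boilerplate(markdown: str) -> str:
    kept = []
    for line in markdown.splitlines():
        if any(p.search(line) for p in BOILERPLATE_PATTERNS):
            continue
        kept.append(line)
    # Collapse runs of blank lines left behind by the removals
    return re.sub(r'\n{3,}', '\n\n', '\n'.join(kept)).strip()

# Usage: clean = strip_boilerplate(scrape_with_playwright(url))
```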
Approach 3: Scrapy for Bulk Crawling
If you need to crawl an entire site rather than individual pages, Scrapy is the right tool. It handles request queuing, rate limiting, deduplication, and parallel crawling out of the box.
```python
import scrapy
from markdownify import markdownify

class KnowledgeSpider(scrapy.Spider):
    name = 'knowledge'
    start_urls = ['https://docs.example.com']

    def parse(self, response):
        # Extract main content
        content_html = response.css('article, main, .content').get('')
        markdown = markdownify(content_html, heading_style='ATX')

        yield {
            'url': response.url,
            'title': response.css('h1::text').get(''),
            'markdown': markdown,
        }

        # Follow links to other docs pages
        for href in response.css('a[href]::attr(href)').getall():
            if href.startswith('/') or 'docs.example.com' in href:
                yield response.follow(href, self.parse)
```
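The queuing and politeness behavior mentioned above lives in Scrapy settings rather than spider code. These are standard setting names; the values are illustrative starting points, not recommendations:

```python
# settings.py (or custom_settings on the spider above)
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # be polite to a single host
DOWNLOAD_DELAY = 0.5                 # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True          # back off automatically when responses slow down
ROBOTSTXT_OBEY = True                # respect robots.txt
```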
Where this works: Large-scale crawls of static or server-rendered sites, building datasets, indexing entire domains.
Where this fails: JavaScript-rendered sites (Scrapy uses raw HTTP, no browser). You can add Scrapy-Splash or Scrapy-Playwright middleware, but at that point you're running significant infrastructure. Also, Scrapy's output is structured for data pipelines, not LLM consumption — you still need to clean and format the content yourself.
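For reference, wiring in scrapy-playwright looks roughly like this; treat it as a sketch and check the project's README for the current settings:

```python
# settings.py: route downloads through Playwright via scrapy-playwright
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

# In the spider, opt individual requests into browser rendering:
#     yield scrapy.Request(url, meta={'playwright': True})
```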
Approach 4: API-Based Extraction
This is where most production LLM applications should land: call an API, get back LLM-ready content, pipe it into your chain. No browser management, no proxy rotation, no HTML cleaning logic in your codebase.
```python
import requests

KNOWLEDGESDK_API_KEY = 'knowledgesdk_live_...'

def scrape_to_markdown(url: str) -> str:
    response = requests.post(
        'https://api.knowledgesdk.com/v1/extract',
        headers={
            'x-api-key': KNOWLEDGESDK_API_KEY,
            'Content-Type': 'application/json',
        },
        json={'url': url},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()['markdown']

# Clean, LLM-ready markdown in one call
markdown = scrape_to_markdown('https://docs.example.com/api-reference')
print(markdown)
```
The response is clean markdown: proper heading hierarchy, code blocks with language tags, tables formatted as markdown tables, no navigation or footer content. Ready to pass directly to an LLM.
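In production you will want some resilience around that call. A minimal sketch, assuming transient failures show up as timeouts, 429s, or 5xx responses (the exact status codes are my assumption, not documented behavior):

```python
import time
import requests

def scrape_to_markdown_with_retries(url: str, max_attempts: int = 3) -> str:
    # Simple exponential backoff around the extract endpoint;
    # KNOWLEDGESDK_API_KEY is defined as in the example above
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post(
                'https://api.knowledgesdk.com/v1/extract',
                headers={'x-api-key': KNOWLEDGESDK_API_KEY, 'Content-Type': 'application/json'},
                json={'url': url},
                timeout=30,
            )
            if response.status_code in (429, 500, 502, 503) and attempt < max_attempts:
                time.sleep(2 ** attempt)  # back off: 2s, 4s, ...
                continue
            response.raise_for_status()
            return response.json()['markdown']
        except requests.Timeout:
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)
    raise RuntimeError(f'Failed to extract {url} after {max_attempts} attempts')
```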
Putting It All Together: LangChain RAG Pipeline
Here's a complete example showing how to build a RAG pipeline from web sources using the API approach with LangChain:
```python
import requests
from langchain_text_splitters import MarkdownTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

KNOWLEDGESDK_API_KEY = 'knowledgesdk_live_...'

def fetch_markdown(url: str) -> str:
    response = requests.post(
        'https://api.knowledgesdk.com/v1/extract',
        headers={'x-api-key': KNOWLEDGESDK_API_KEY, 'Content-Type': 'application/json'},
        json={'url': url},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()['markdown']

# URLs to index
urls = [
    'https://docs.example.com/getting-started',
    'https://docs.example.com/authentication',
    'https://docs.example.com/api-reference',
]

# Fetch and split
splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)
all_docs = []
for url in urls:
    markdown = fetch_markdown(url)
    chunks = splitter.create_documents([markdown], metadatas=[{'source': url}])
    all_docs.extend(chunks)

print(f'Indexed {len(all_docs)} chunks from {len(urls)} pages')

# Build vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(all_docs, embeddings)

# Create RAG chain
llm = ChatOpenAI(model='gpt-4o', temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={'k': 4}),
    return_source_documents=True,
)

# Query
result = qa_chain.invoke({'query': 'How do I authenticate API requests?'})
print(result['result'])
```
This is a working RAG pipeline over web sources in under 50 lines of Python. The API approach eliminates the HTML parsing, boilerplate removal, and markdown conversion work you'd otherwise write and maintain.
Approach Comparison
| Approach | JS Support | Anti-bot | LLM-Ready Output | Maintenance Burden | Cost |
|---|---|---|---|---|---|
| requests + BeautifulSoup | No | Minimal | Poor (flat text) | Low | Free |
| Playwright | Yes | Limited | Medium (needs cleaning) | Medium | Infrastructure |
| Scrapy | No (without addon) | Minimal | Poor | Medium | Infrastructure |
| Scraping API | Yes | Good-Excellent | Excellent | None | Per-request |
When Each Approach Makes Sense
Use requests + BeautifulSoup if: You're scraping a specific static site you control or know well, you're doing one-off data collection, or you're prototyping before investing in a more robust solution.
Use Playwright/Selenium if: You need to interact with pages (login, click, scroll), the sites you're targeting require it, and you're running at moderate scale where infrastructure cost is acceptable.
Use Scrapy if: You're building a crawler for a large static site, you need fine-grained control over crawl behavior, and you have backend infrastructure to run it.
Use an API if: You're building a production LLM application, you need clean markdown reliably across diverse sites, you want to avoid operational overhead, and you're extracting content rather than automating interactions.
For most teams building LLM applications in 2026, the API approach is the right default. The operational burden of maintaining your own scraping infrastructure is significant, and the quality of managed scraping APIs has improved to the point where DIY rarely pays off unless you have very specific requirements.