Tutorial · March 19, 2026 · 13 min read

Python Web Scraping for AI: Complete KnowledgeSDK Tutorial (2026)

Learn to scrape URLs to clean markdown, build a semantic search index, and subscribe to webhooks using the KnowledgeSDK Python SDK with async support.


Python is the dominant language for AI and data engineering. It's where most RAG pipelines, LLM applications, and data workflows live. But web scraping in Python has historically been painful: BeautifulSoup handles static HTML but chokes on JavaScript-heavy sites, Scrapy is powerful but complex to deploy, and Playwright/Selenium require managing browser infrastructure you'd rather not own.

KnowledgeSDK's Python SDK — knowledgesdk on PyPI — gives Python developers a clean, three-line path from URL to AI-ready data. This tutorial covers everything: installation, synchronous and asynchronous usage, building a semantic search index, webhook subscriptions, and a realistic async pipeline that processes 50 URLs concurrently.

Why Not Just Use BeautifulSoup or Scrapy?

Before diving in, it's worth being direct about the trade-offs.

BeautifulSoup is excellent for parsing HTML you've already fetched. It does not render JavaScript, so any page that loads content dynamically (React, Vue, or Angular apps, infinite scroll, lazy-loaded images) returns empty or incomplete data. A large share of modern websites render their meaningful content client-side.

Scrapy is a full crawling framework. It's excellent for large-scale crawls of sites you control or have permission to crawl aggressively. It requires significant setup: Spider classes, middleware configuration, and its own deployment infrastructure. For AI use cases where you need 10-100 specific URLs processed reliably, it's overkill.

Playwright/Selenium render JavaScript but require running a Chrome or Firefox binary. Managing headless browser infrastructure — memory limits, crashes, anti-bot detection, rotating proxies — is a full-time engineering problem.

KnowledgeSDK handles all of the above as a managed service. You get JavaScript rendering, anti-bot bypass, and clean markdown output with a single API call. The trade-off is that you're calling an external API rather than running locally — which for AI applications is almost always the right call given that you're already calling OpenAI, Anthropic, or another external API.

| Feature | BeautifulSoup | Scrapy | Playwright | KnowledgeSDK |
| --- | --- | --- | --- | --- |
| JS rendering | No | No | Yes | Yes |
| Anti-bot bypass | No | No | Limited | Yes |
| Clean markdown output | Manual | Manual | Manual | Built-in |
| Semantic search | No | No | No | Built-in |
| Async/concurrent | Manual | Built-in | Built-in | Built-in |
| Setup complexity | Low | High | Medium | Low |
| Infrastructure to maintain | None | Medium | High | None |
| Best for | Static HTML parsing | Large-scale crawls | Browser automation | AI data pipelines |

Installation

pip install knowledgesdk

For async support (recommended for production):

pip install knowledgesdk aiohttp

Set your API key as an environment variable:

export KNOWLEDGESDK_API_KEY=sk_ks_your_key_here

Or pass it directly in code — though environment variables are preferred for production.
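A common pattern is to read the key from the environment with an explicit fallback for local experiments only (the placeholder value below is purely illustrative, not a real key format guarantee beyond the sk_ks prefix shown above):

```python
import os

# Read the API key from the environment; fall back to a placeholder for
# local experiments only. Never ship a hardcoded key to production.
api_key = os.environ.get("KNOWLEDGESDK_API_KEY", "sk_ks_dev_placeholder")
print(api_key.startswith("sk_ks"))  # True for real keys and the placeholder alike
```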

Basic Usage: Scrape a URL

from knowledgesdk import KnowledgeSDK

ks = KnowledgeSDK(api_key="sk_ks_your_key")

result = ks.scrape("https://example.com")

print(result.title)
print(result.markdown[:500])
print(f"Word count: {len(result.markdown.split())}")

The result object has:

  • result.markdown — the page as clean markdown, with navigation, ads, and boilerplate stripped
  • result.title — page title
  • result.url — canonical URL after redirects
  • result.links — list of outbound links (if include_links=True)

Structured Extraction

When you need specific fields rather than raw text:

from knowledgesdk import KnowledgeSDK

ks = KnowledgeSDK(api_key="sk_ks_your_key")

result = ks.extract(
    url="https://startup.com",
    schema={
        "company_name": "string",
        "founding_year": "number",
        "pricing_plans": "array",
        "tech_stack": "array",
        "team_size": "number",
        "headquarters": "string",
    }
)

data = result.data
print(f"Company: {data['company_name']}")
print(f"Plans: {data['pricing_plans']}")

The extraction AI reads the entire page (and linked pages if needed) and returns a Python dict matching your schema. No XPath, no CSS selectors, no fragile scraper maintenance.
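Because result.data is a plain dict, a quick type check before downstream use catches extraction misses early. The validator below is one way to do it; the schema-name-to-Python-type mapping is an assumption for illustration, not part of the SDK:

```python
# Map KnowledgeSDK schema type names to Python types (assumed mapping).
TYPE_MAP = {"string": str, "number": (int, float), "array": list}

def validate_extraction(data: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the dict matches the schema."""
    problems = []
    for field, kind in schema.items():
        if field not in data or data[field] is None:
            problems.append(f"missing: {field}")
        elif not isinstance(data[field], TYPE_MAP[kind]):
            problems.append(f"wrong type: {field}")
    return problems

schema = {"company_name": "string", "founding_year": "number", "pricing_plans": "array"}
good = {"company_name": "Acme", "founding_year": 2019, "pricing_plans": ["Free", "Pro"]}
bad = {"company_name": "Acme", "founding_year": "2019"}

print(validate_extraction(good, schema))  # []
print(validate_extraction(bad, schema))   # ['wrong type: founding_year', 'missing: pricing_plans']
```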

Semantic Search

KnowledgeSDK maintains a knowledge base per API key. Every URL you scrape or extract from is automatically indexed for semantic search.

# First, scrape some pages to build the knowledge base
urls = [
    "https://competitor-a.com/pricing",
    "https://competitor-b.com/pricing",
    "https://competitor-c.com/pricing",
]

for url in urls:
    ks.scrape(url)
    print(f"Indexed: {url}")

# Now search semantically
results = ks.search(
    query="enterprise pricing unlimited seats",
    limit=5,
    hybrid=True  # combines semantic + keyword ranking
)

for hit in results.hits:
    print(f"Score: {hit.score:.3f} | {hit.url}")
    print(f"  {hit.content[:200]}")
    print()

Search returns hits ranked by relevance. The hybrid mode combines dense vector search (semantic similarity) with BM25 keyword matching, giving you the best of both worlds: results that match the meaning of your query and results that contain exact keywords.
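Under the hood, hybrid ranking has to merge two differently-scored result lists. A common fusion technique is reciprocal rank fusion; whether KnowledgeSDK uses exactly this method is not documented here, but the toy version below shows the idea:

```python
def reciprocal_rank_fusion(rankings, k: int = 60) -> list:
    """Fuse multiple ranked lists: each document earns 1/(k + rank) per list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["doc_a", "doc_b", "doc_c"]  # ranked by vector similarity
keyword_hits = ["doc_b", "doc_d", "doc_a"]   # ranked by BM25
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

doc_b wins because it ranks well in both lists, which is exactly the behavior you want from hybrid=True.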

Webhook Subscriptions

Webhooks let you monitor URLs for changes without polling. KnowledgeSDK checks your subscribed URLs periodically and sends an HTTP POST to your endpoint when content changes.

from knowledgesdk import KnowledgeSDK

ks = KnowledgeSDK(api_key="sk_ks_your_key")

# Subscribe to changes on competitor pricing pages
webhook = ks.webhooks.create(
    url="https://your-app.com/webhooks/knowledgesdk",
    watch_urls=[
        "https://competitor.com/pricing",
        "https://competitor.com/features",
    ],
    events=["content.changed"],
)

print(f"Webhook ID: {webhook.id}")
print(f"Secret token: {webhook.token}")  # Use this to verify incoming webhooks

When a change is detected, your endpoint receives:

# Flask handler for incoming webhooks
from flask import Flask, request
import hmac, hashlib, os

app = Flask(__name__)

# The secret returned as webhook.token by ks.webhooks.create()
WEBHOOK_TOKEN = os.environ["KNOWLEDGESDK_WEBHOOK_TOKEN"]

@app.route("/webhooks/knowledgesdk", methods=["POST"])
def handle_change():
    # Verify the webhook signature; treat a missing header as a failed check
    signature = request.headers.get("X-KnowledgeSDK-Signature", "")
    expected = hmac.new(
        WEBHOOK_TOKEN.encode(),
        request.data,
        hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return "Unauthorized", 401

    payload = request.json
    print(f"URL changed: {payload['url']}")
    print(f"Added content: {payload['diff']['added']}")
    print(f"Removed content: {payload['diff']['removed']}")

    # Process the change — re-index, alert, etc.
    return "OK", 200
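To exercise the handler locally before wiring up real webhooks, you can compute a signature the same way the verification code above expects it (HMAC-SHA256 over the raw request body, hex-encoded):

```python
import hmac, hashlib, json

def sign_payload(token: str, body: bytes) -> str:
    """Produce the hex HMAC-SHA256 signature for a webhook body."""
    return hmac.new(token.encode(), body, hashlib.sha256).hexdigest()

token = "test-token"  # stand-in for webhook.token
body = json.dumps({"url": "https://competitor.com/pricing"}).encode()
signature = sign_payload(token, body)

# The handler's check should accept this signature and reject a tampered body
assert hmac.compare_digest(signature, sign_payload(token, body))
assert not hmac.compare_digest(signature, sign_payload(token, body + b"x"))
print("signature checks pass")
```

Send that body with the X-KnowledgeSDK-Signature header set to this value against your local Flask server to confirm both the 200 and 401 paths.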

Full Async Pipeline

For production workloads processing many URLs, synchronous calls are too slow. Here is a complete async pipeline using Python's asyncio:

import asyncio
import os
from knowledgesdk import AsyncKnowledgeSDK

async def scrape_url(ks: AsyncKnowledgeSDK, url: str) -> dict:
    """Scrape a single URL and return structured result."""
    try:
        result = await ks.scrape(url)
        return {
            "url": url,
            "title": result.title,
            "markdown": result.markdown,
            "word_count": len(result.markdown.split()),
            "error": None,
        }
    except Exception as e:
        return {
            "url": url,
            "title": None,
            "markdown": None,
            "word_count": 0,
            "error": str(e),
        }

async def process_urls_concurrent(
    urls: list[str],
    concurrency: int = 5,
) -> list[dict]:
    """Process URLs with controlled concurrency."""
    ks = AsyncKnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded_scrape(url: str) -> dict:
        async with semaphore:
            return await scrape_url(ks, url)

    results = await asyncio.gather(
        *[bounded_scrape(url) for url in urls],
        return_exceptions=False,
    )
    return list(results)

async def build_knowledge_base(urls: list[str]) -> None:
    """Scrape all URLs and then run a test search."""
    print(f"Processing {len(urls)} URLs...")
    results = await process_urls_concurrent(urls, concurrency=5)

    successful = [r for r in results if not r["error"]]
    failed = [r for r in results if r["error"]]

    print(f"Scraped: {len(successful)} | Failed: {len(failed)}")
    for r in failed:
        print(f"  FAILED: {r['url']} — {r['error']}")

    total_words = sum(r["word_count"] for r in successful)
    print(f"Total content: {total_words:,} words indexed")

    # Run a test search
    ks = AsyncKnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
    search_results = await ks.search(query="pricing plans enterprise", limit=3)
    print(f"\nTop search results for 'pricing plans enterprise':")
    for hit in search_results.hits:
        print(f"  [{hit.score:.3f}] {hit.url}")

if __name__ == "__main__":
    urls = [
        "https://competitor-a.com/pricing",
        "https://competitor-b.com/pricing",
        "https://competitor-b.com/features",
        "https://competitor-c.com",
        "https://competitor-c.com/docs",
        # Add more URLs here
    ]
    asyncio.run(build_knowledge_base(urls))

The asyncio.Semaphore(5) limits concurrency to 5 simultaneous requests. Adjust based on your KnowledgeSDK plan's rate limits. With a semaphore of 5, you can process 50 URLs in roughly the time it takes to process 10 sequentially.
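You can see the batching effect without touching the network by swapping the SDK call for a short sleep (fake_scrape below is a stand-in for ks.scrape, not part of the SDK):

```python
import asyncio
import time

async def fake_scrape(url: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for one network round trip
    return url

async def run(n_urls: int, concurrency: int):
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> str:
        async with sem:
            return await fake_scrape(url)

    start = time.monotonic()
    results = await asyncio.gather(*[bounded(f"https://example.com/{i}") for i in range(n_urls)])
    return results, time.monotonic() - start

results, elapsed = asyncio.run(run(50, 5))
print(len(results), round(elapsed, 1))  # 50 tasks finish in roughly 1.0s: 10 waves of 5
```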

Integrating with OpenAI for RAG

Here is a complete retrieval-augmented generation example:

import os
from openai import OpenAI
from knowledgesdk import KnowledgeSDK

ks = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def answer_with_web_context(question: str) -> str:
    """Answer a question using live web data as context."""

    # Search the knowledge base
    search_results = ks.search(query=question, limit=5)

    # Build context from search hits
    context_parts = []
    for hit in search_results.hits:
        context_parts.append(f"Source: {hit.url}\n\n{hit.content[:1000]}")

    context = "\n\n---\n\n".join(context_parts)

    if not context_parts:
        context = "No relevant content found in knowledge base."

    # Call the LLM with context
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer questions using the provided "
                    "web content. Always cite the source URL. If the context doesn't "
                    "contain the answer, say so."
                ),
            },
            {
                "role": "user",
                "content": f"Context from web:\n\n{context}\n\nQuestion: {question}",
            },
        ],
    )

    return response.choices[0].message.content

# Example usage
answer = answer_with_web_context(
    "What are the differences in enterprise pricing between the main competitors?"
)
print(answer)

Periodic Refresh Pipeline

For AI applications that need fresh data, schedule periodic re-scrapes:

import asyncio
import os
import schedule
import time
from knowledgesdk import AsyncKnowledgeSDK

MONITORED_URLS = [
    "https://competitor.com/pricing",
    "https://competitor.com/blog",
    "https://industry-news.com/latest",
]

async def refresh_knowledge_base():
    """Re-scrape all monitored URLs to update the knowledge base."""
    ks = AsyncKnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
    print(f"Refreshing {len(MONITORED_URLS)} URLs...")

    tasks = [ks.scrape(url) for url in MONITORED_URLS]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    success_count = sum(1 for r in results if not isinstance(r, Exception))
    print(f"Refreshed: {success_count}/{len(MONITORED_URLS)}")

def run_refresh():
    asyncio.run(refresh_knowledge_base())

# Schedule weekly refresh
schedule.every().monday.at("08:00").do(run_refresh)

if __name__ == "__main__":
    run_refresh()  # Run immediately on start
    while True:
        schedule.run_pending()
        time.sleep(60)

For production, replace schedule with a proper task queue (Celery, APScheduler, or a cron job). KnowledgeSDK's webhook system is a better alternative for change detection — only re-process when content actually changes rather than on a fixed schedule.
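If you go the cron route, the equivalent of the weekly schedule above is a single crontab line (the script path and log path here are illustrative):

```shell
# min hour day-of-month month day-of-week: run every Monday at 08:00
0 8 * * 1 /usr/bin/python3 /opt/pipelines/refresh_knowledge_base.py >> /var/log/ks_refresh.log 2>&1
```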

Error Handling and Retries

import os
import time

from knowledgesdk import KnowledgeSDK
from knowledgesdk.exceptions import (
    RateLimitError,
    ScrapingError,
    InvalidURLError,
    AuthenticationError,
)
from knowledgesdk.types import ScrapeResult

ks = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

def scrape_with_retry(url: str, max_retries: int = 3) -> "ScrapeResult | None":
    """Scrape with exponential backoff retry logic."""
    for attempt in range(max_retries):
        try:
            result = ks.scrape(url)
            return result
        except RateLimitError:
            wait = 2 ** attempt  # 1s, 2s, 4s
            print(f"Rate limited. Waiting {wait}s before retry {attempt + 1}/{max_retries}")
            time.sleep(wait)
        except InvalidURLError as e:
            print(f"Invalid URL {url}: {e}")
            return None  # Don't retry invalid URLs
        except ScrapingError as e:
            print(f"Scraping failed for {url}: {e}")
            if attempt == max_retries - 1:
                return None
            time.sleep(1)
        except AuthenticationError:
            raise  # Don't retry auth errors — fix the key

    return None
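One refinement worth noting: fixed powers of two can synchronize retries across many workers that hit the rate limit at the same moment, so adding jitter spreads them out. A small illustrative helper (not part of the SDK):

```python
import random

def backoff_delays(max_retries: int = 3, base: float = 1.0, jitter: float = 0.25) -> list:
    """Exponential delays (base * 2**attempt), each perturbed by up to +/-25% jitter."""
    delays = []
    for attempt in range(max_retries):
        delay = base * (2 ** attempt)
        delays.append(delay * (1 + random.uniform(-jitter, jitter)))
    return delays

print(backoff_delays())  # e.g. [0.93, 2.21, 3.87]: roughly the 1s, 2s, 4s schedule above
```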

Using the SDK with Type Hints

The KnowledgeSDK Python SDK is fully typed. If you're using mypy or Pyright, you get full autocompletion and type checking:

from knowledgesdk import KnowledgeSDK
from knowledgesdk.types import ScrapeResult, SearchResult, ExtractResult

ks = KnowledgeSDK(api_key="sk_ks_your_key")

scrape_result: ScrapeResult = ks.scrape("https://example.com")
search_result: SearchResult = ks.search(query="pricing", limit=5)
extract_result: ExtractResult = ks.extract(
    url="https://example.com",
    schema={"name": "string", "price": "number"}
)

Type stubs are included in the package — no separate types-knowledgesdk package needed.

FAQ

Does the Python SDK support Python 3.8? The async client requires Python 3.11+ due to asyncio.TaskGroup (which was added in 3.11). The synchronous client works with Python 3.8+. For Python 3.8/3.9 async support, use asyncio.gather instead of task groups.
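The gather-based pattern (the same one used in the pipeline above) works unchanged on older Pythons; a minimal shape, with fetch standing in for an awaited SDK call:

```python
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(0)  # stand-in for an awaited SDK call
    return url.upper()

async def main(urls):
    # gather preserves input order, which keeps results easy to zip back to URLs
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(main(["a", "b", "c"]))
print(results)  # ['A', 'B', 'C']
```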

How does the Python SDK compare to using httpx directly? Using httpx directly gives you full control but requires handling authentication, error parsing, retry logic, and response deserialization yourself. The SDK handles all of this and provides typed response objects.

Can I use the SDK in a Django or FastAPI application? Yes. For FastAPI, use the AsyncKnowledgeSDK client with await in your route handlers. For Django, use the synchronous KnowledgeSDK client or run async operations with asyncio.run() from a synchronous view (not ideal — prefer FastAPI or Django async views for async workloads).

Is there a way to inspect the raw API response? Every result object exposes a ._raw attribute with the full API response as a dict. Useful for debugging or accessing fields not yet exposed in the typed API.

How do I handle pages that require authentication? Pass a cookies dict to the scrape call:

result = ks.scrape(
    url="https://app.example.com/dashboard",
    cookies={"session_id": "abc123", "auth_token": "xyz"}
)

Can I self-host the scraping infrastructure? KnowledgeSDK is a managed API service — there's no self-hosted option. This is by design: managing browser fleets, proxy rotation, and anti-bot detection requires significant infrastructure investment that most teams don't want to own.


Start building your Python AI data pipeline today. Get your API key at knowledgesdk.com/setup.
