Extract
Extract knowledge from any URL and make it searchable.
The extract endpoint is the core of KnowledgeSDK. It takes a URL, crawls the site, classifies the business, and returns structured knowledge items that are automatically indexed for search. Three variants are available depending on your use case.
Sync Extract
`POST /v1/extract` (requires `x-api-key` header)

Synchronously extract knowledge from a URL. The request blocks until extraction is complete, typically 1-3 minutes depending on the number of pages.
Request Parameters
- `url` (string, required): The URL to extract knowledge from. Can be any page on the site; KnowledgeSDK will discover and crawl related pages automatically via sitemap and link analysis.
- `maxPages` (number): Maximum number of pages to scrape. Higher values give more comprehensive results but take longer. Minimum: 1, maximum: 20.
- `singlePage` (boolean): When true, only scrape the provided URL without crawling additional pages. Useful for extracting knowledge from a single page such as a pricing or features page.
- `includePatterns` (string[]): Only scrape URLs matching at least one of these glob patterns. Supports `*` (single segment) and `**` (any depth). Example: `["/pricing*", "/features/**"]`.
- `excludePatterns` (string[]): Skip URLs matching any of these glob patterns. Takes priority over `includePatterns`. Example: `["/blog/**", "/careers*"]`.
- `urls` (string[]): Explicit list of URLs to scrape. When provided, KnowledgeSDK skips discovery and scrapes these exact URLs; the other filtering parameters (`singlePage`, `includePatterns`, `excludePatterns`) are ignored. Maximum: 20 URLs. Useful for re-indexing specific pages or processing suggested URLs from a previous extraction.
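The documented glob semantics (`*` stays within one path segment, `**` crosses any depth, excludes win over includes) can be illustrated with a small matcher. This is an approximation written for this page, not KnowledgeSDK's actual implementation:

```javascript
// Illustrative approximation of the documented glob semantics.
// "*" matches within one path segment; "**" matches across any depth.
// Not KnowledgeSDK's actual matcher.
function globToRegExp(pattern) {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\*\*/g, "\u0000")            // placeholder for "**"
    .replace(/\*/g, "[^/]*")               // "*": anything except "/"
    .replace(/\u0000/g, ".*");             // "**": anything, including "/"
  return new RegExp(`^${escaped}$`);
}

function isAllowed(path, includePatterns, excludePatterns) {
  // excludePatterns take priority over includePatterns
  if (excludePatterns.some((p) => globToRegExp(p).test(path))) return false;
  if (includePatterns.length === 0) return true;
  return includePatterns.some((p) => globToRegExp(p).test(path));
}

console.log(isAllowed("/pricing", ["/pricing*", "/features/**"], ["/blog/**"])); // true
console.log(isAllowed("/features/ai/search", ["/features/**"], []));             // true
console.log(isAllowed("/pricing/enterprise", ["/pricing*"], []));                // false ("*" stops at "/")
console.log(isAllowed("/blog/launch", ["/blog/**"], ["/blog/**"]));              // false (exclude wins)
```

Note that `/pricing*` matches `/pricing` and `/pricing-faq` but not `/pricing/enterprise`; use `/pricing/**` to include sub-pages.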
Response
- `business` (object): Classified business information including name, domain, category, description, and logo.
- `knowledgeItems` (array): Array of extracted knowledge items. Each item contains title, description, content, category, and source (the URL it was extracted from).
- `pagesScraped` (number): Total number of pages successfully scraped during extraction.
- `urlsDiscovered` (number): Total number of unique URLs discovered via sitemap and link analysis (after locale-aware deduplication).
- `durationMs` (number): Total extraction time in milliseconds.
- `startedAt` (string): ISO 8601 timestamp of when extraction started.
- `finishedAt` (string): ISO 8601 timestamp of when extraction completed.
- `suggestedUrls` (array): URLs discovered during scraping that the AI considers worth indexing but that were not scraped due to the page budget. Each item contains `url` (the page URL) and `reason` (why the AI thinks it is valuable). Pass these URLs back via the `urls` parameter to extract them.
Example Response
```json
{
  "business": {
    "name": "Linear",
    "domain": "linear.app",
    "category": "Project Management",
    "description": "Linear is a modern project management tool built for software teams.",
    "logo": "https://linear.app/static/logo.png"
  },
  "knowledgeItems": [
    {
      "title": "Issue Tracking",
      "description": "Create, assign, and track issues across your team with real-time sync.",
      "content": "Linear provides fast issue tracking with keyboard shortcuts, automated workflows, and real-time collaboration. Issues can be organized into projects and cycles...",
      "category": "FEATURE",
      "source": "https://linear.app/features"
    },
    {
      "title": "Pro Plan",
      "description": "For growing teams that need advanced features and integrations.",
      "content": "The Pro plan costs $8 per user per month and includes unlimited issues, custom workflows, GitHub and GitLab integrations, Slack integration...",
      "category": "PRICING",
      "source": "https://linear.app/pricing"
    }
  ],
  "pagesScraped": 10,
  "urlsDiscovered": 34,
  "durationMs": 68420,
  "startedAt": "2026-03-20T10:00:00.000Z",
  "finishedAt": "2026-03-20T10:01:08.420Z",
  "suggestedUrls": [
    {
      "url": "https://linear.app/changelog",
      "reason": "Product changelog with recent feature updates"
    },
    {
      "url": "https://linear.app/integrations/slack",
      "reason": "Integration details for a popular tool"
    }
  ]
}
```

Async Extract
`POST /v1/extract/async` (requires `x-api-key` header)

Start an extraction job in the background. Returns a `jobId` immediately so your application does not need to wait. Poll `/v1/jobs/{jobId}` or provide a `callbackUrl` to receive the result via webhook.
Request Parameters
- `url` (string, required): The URL to extract knowledge from.
- `maxPages` (number): Maximum number of pages to scrape. Minimum: 1, maximum: 20.
- `singlePage` (boolean): Only scrape the provided URL without crawling additional pages.
- `includePatterns` (string[]): Only scrape URLs matching at least one of these glob patterns.
- `excludePatterns` (string[]): Skip URLs matching any of these glob patterns. Takes priority over `includePatterns`.
- `urls` (string[]): Explicit list of URLs to scrape, skipping discovery. Maximum: 20.
- `callbackUrl` (string): A URL to receive a POST request when extraction completes. The request body contains the full extraction result, identical to the sync response format.
Response
```json
{
  "jobId": "job_abc123def456",
  "status": "pending",
  "message": "Extraction job started. Poll /v1/jobs/job_abc123def456 for status."
}
```

The callback request includes an `X-KnowledgeSDK-Signature` header you can use to verify that the request originated from KnowledgeSDK.
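This page does not specify the signing scheme, but webhook signatures of this kind are commonly an HMAC of the raw request body. The sketch below assumes an HMAC-SHA256 hex digest keyed with a shared webhook secret; both the scheme and the `WEBHOOK_SECRET` value are assumptions to confirm against your KnowledgeSDK dashboard:

```javascript
import { createHmac, timingSafeEqual } from "node:crypto";

// ASSUMPTION: the signature is an HMAC-SHA256 hex digest of the raw request
// body, keyed with a webhook secret. Confirm the real scheme with KnowledgeSDK.
const WEBHOOK_SECRET = "whsec_example"; // hypothetical secret

function verifySignature(rawBody, signatureHeader) {
  const expected = createHmac("sha256", WEBHOOK_SECRET)
    .update(rawBody, "utf8")
    .digest("hex");
  const a = Buffer.from(expected, "hex");
  const b = Buffer.from(signatureHeader, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first
  return a.length === b.length && timingSafeEqual(a, b);
}

// Example: compute a signature the way a sender would, then verify it.
const body = JSON.stringify({ jobId: "job_abc123def456", status: "completed" });
const signature = createHmac("sha256", WEBHOOK_SECRET).update(body, "utf8").digest("hex");
console.log(verifySignature(body, signature)); // true
```

Whatever the exact scheme turns out to be, verify against the raw bytes of the request body (before JSON parsing) and use a constant-time comparison.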
Polling Example
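The job status schema beyond `"status": "pending"` is not shown on this page, so the polling sketch below assumes the `/v1/jobs/{jobId}` response has a `status` field that eventually becomes `"completed"` (or `"failed"`) plus a `result` field carrying the extraction; adjust the field names to the real API. The fetcher is injected so the loop itself can be exercised without a network:

```javascript
// Poll a job until it finishes. `fetchJob` is injected; in production it would
// call GET https://api.knowledgesdk.com/v1/jobs/{jobId} with the x-api-key header.
// ASSUMPTION: the job response has `status` ("pending" | "completed" | "failed")
// and, on completion, a `result` field. Verify against the real API.
async function pollJob(jobId, fetchJob, { intervalMs = 5000, maxAttempts = 60 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchJob(jobId);
    if (job.status === "completed") return job.result;
    if (job.status === "failed") throw new Error(`Job ${jobId} failed`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} did not finish after ${maxAttempts} attempts`);
}

// Example with a stub fetcher that completes on the third poll.
let calls = 0;
const stubFetchJob = async () =>
  ++calls < 3
    ? { status: "pending" }
    : { status: "completed", result: { pagesScraped: 10 } };

const result = await pollJob("job_abc123def456", stubFetchJob, { intervalMs: 1 });
console.log(result.pagesScraped); // 10
```

In a real integration, keep `intervalMs` in the several-seconds range, since extractions typically take 1-3 minutes.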
Callback Example
```bash
curl -X POST https://api.knowledgesdk.com/v1/extract/async \
  -H "Content-Type: application/json" \
  -H "x-api-key: sk_ks_your_api_key" \
  -d '{
    "url": "https://linear.app",
    "callbackUrl": "https://your-app.com/webhooks/extraction-complete"
  }'
```

Stream Extract
`POST /v1/extract/stream` (requires `x-api-key` header)

Stream extraction progress via Server-Sent Events (SSE). Ideal for building real-time UIs that show extraction progress as it happens.
Request Parameters
- `url` (string, required): The URL to extract knowledge from.
- `maxPages` (number): Maximum number of pages to scrape. Minimum: 1, maximum: 20.
- `singlePage` (boolean): Only scrape the provided URL without crawling additional pages.
- `includePatterns` (string[]): Only scrape URLs matching at least one of these glob patterns.
- `excludePatterns` (string[]): Skip URLs matching any of these glob patterns. Takes priority over `includePatterns`.
- `urls` (string[]): Explicit list of URLs to scrape, skipping discovery. Maximum: 20.
Use `/v1/extract/stream` for real-time progress in your UI. It sends granular events for each step of the extraction process, so you can show users exactly what is happening.
SSE Events
| Event | Description | Data |
|---|---|---|
| `connected` | Connection established, extraction starting | `{ "message": "Connected" }` |
| `progress` | General progress update | `{ "step": "sitemap", "message": "Discovering pages..." }` |
| `business_classified` | Business has been identified and classified | `{ "business": { "name", "domain", "category", ... } }` |
| `pages_planned` | Pages to be scraped have been determined | `{ "pages": ["url1", "url2", ...], "count": 10 }` |
| `page_scraped` | A single page has been scraped successfully | `{ "url": "...", "index": 3, "total": 10 }` |
| `urls_triaged` | URL triage complete, suggested URLs available | `{ "suggestedUrls": [{ "url": "...", "reason": "..." }] }` |
| `complete` | Extraction finished, full result available | Full extraction result (same as the sync response) |
| `error` | An error occurred during extraction | `{ "message": "Error description" }` |
Stream Example
```javascript
const response = await fetch("https://api.knowledgesdk.com/v1/extract/stream", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": "sk_ks_your_api_key",
  },
  body: JSON.stringify({
    url: "https://linear.app",
    maxPages: 10,
  }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop(); // keep the incomplete trailing line in the buffer

  for (const line of lines) {
    if (line.startsWith("event: ")) {
      const eventType = line.slice(7);
      console.log("Event:", eventType);
    }
    if (line.startsWith("data: ")) {
      const data = JSON.parse(line.slice(6));
      switch (data.type) {
        case "business_classified":
          console.log("Business:", data.business.name);
          break;
        case "page_scraped":
          console.log(`Scraped ${data.index}/${data.total}: ${data.url}`);
          break;
        case "complete":
          console.log("Done!", data.knowledgeItems.length, "items extracted");
          break;
        case "error":
          console.error("Error:", data.message);
          break;
      }
    }
  }
}
```

Knowledge Item Categories
Extracted knowledge items are automatically categorized into the following types:
| Category | Description |
|---|---|
| `PRODUCT` | Core product information and descriptions |
| `FEATURE` | Specific features and capabilities |
| `PRICING` | Plans, pricing tiers, and billing information |
| `FAQ` | Frequently asked questions and answers |
| `SUPPORT` | Help articles, troubleshooting, and contact info |
| `COMPANY` | About the company, team, mission, and values |
| `LEGAL` | Terms of service, privacy policy, and compliance |
| `OTHER` | Content that does not fit other categories |
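Since every knowledge item carries a `category` field, a common post-processing step is grouping items by category before rendering or filtering them. A minimal sketch:

```javascript
// Group extracted knowledge items by their category field.
function groupByCategory(knowledgeItems) {
  const groups = {};
  for (const item of knowledgeItems) {
    (groups[item.category] ??= []).push(item);
  }
  return groups;
}

const items = [
  { title: "Issue Tracking", category: "FEATURE" },
  { title: "Pro Plan", category: "PRICING" },
  { title: "Cycles", category: "FEATURE" },
];

const grouped = groupByCategory(items);
console.log(Object.keys(grouped));   // ["FEATURE", "PRICING"]
console.log(grouped.FEATURE.length); // 2
```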
Smart URL Discovery
KnowledgeSDK automatically discovers and triages URLs during extraction. After scraping the planned pages, all links found across every scraped page are collected and analyzed by AI to determine which additional pages are worth indexing.
How It Works
- During scraping, links are collected from every page (not just the root)
- After scraping, unvisited URLs are computed by removing already-scraped pages
- An AI model triages the unvisited URLs, ranking them by knowledge value
- If page budget remains, top-ranked URLs are automatically scraped
- Remaining valuable URLs are returned as `suggestedUrls` in the response
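The budgeting in the last two steps is plain set arithmetic. In the sketch below, `rankByValue` is a stand-in for the server-side AI triage (which this page does not expose), and suggested URLs are simplified to bare strings rather than `{ url, reason }` objects:

```javascript
// Sketch of the post-scrape budgeting step. `rankByValue` stands in for the
// server-side AI triage, which this sketch cannot reproduce.
function planFollowUp(discovered, scraped, maxPages, rankByValue) {
  const scrapedSet = new Set(scraped);
  const unvisited = discovered.filter((url) => !scrapedSet.has(url));
  const ranked = rankByValue(unvisited);                 // best-first order
  const budget = Math.max(0, maxPages - scraped.length); // remaining page budget
  return {
    scrapeNext: ranked.slice(0, budget),  // scraped automatically
    suggestedUrls: ranked.slice(budget),  // returned to the caller
  };
}

const discovered = ["/a", "/b", "/c", "/d"];
const scraped = ["/a", "/b"];
const plan = planFollowUp(discovered, scraped, 3, (urls) => [...urls].sort());
console.log(plan); // { scrapeNext: ["/c"], suggestedUrls: ["/d"] }
```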
Extracting Suggested URLs
Use the `urls` parameter to extract knowledge from suggested URLs:

```javascript
import KnowledgeSDK from "@knowledgesdk/node";

const ks = new KnowledgeSDK("sk_ks_your_api_key");

// First extraction
const result = await ks.extract({ url: "https://linear.app", maxPages: 5 });
console.log(result.suggestedUrls);
// [{ url: "https://linear.app/changelog", reason: "Product changelog" }, ...]

// Extract suggested URLs
if (result.suggestedUrls.length > 0) {
  const followUp = await ks.extract({
    url: "https://linear.app",
    urls: result.suggestedUrls.map((s) => s.url),
  });
}
```

The `urls` parameter skips all URL discovery and scrapes exactly the URLs you provide. Business classification still runs against the base `url` field.