Extract

Extract knowledge from any URL and make it searchable.

Endpoints

The extract endpoint is the core of KnowledgeSDK. It takes a URL, crawls the site, classifies the business, and returns structured knowledge items that are automatically indexed for search. Three variants are available depending on your use case.

Sync Extract

POST /v1/extract (requires the x-api-key header)

Synchronously extract knowledge from a URL. The request blocks until extraction is complete, typically 1-3 minutes depending on the number of pages.

Request Parameters

url (string, required)

The URL to extract knowledge from. Can be any page on the site — KnowledgeSDK will discover and crawl related pages automatically via sitemap and link analysis.

maxPages (number)

Maximum number of pages to scrape. Higher values give more comprehensive results but take longer. Minimum: 1, Maximum: 20.

singlePage (boolean)

When true, only scrape the provided URL without crawling additional pages. Useful for extracting knowledge from a single page like a pricing or features page.

includePatterns (string[])

Only scrape URLs matching at least one of these glob patterns. Supports * (single segment) and ** (any depth). Example: ["/pricing*", "/features/**"].

excludePatterns (string[])

Skip URLs matching any of these glob patterns. Takes priority over includePatterns. Example: ["/blog/**", "/careers*"].

urls (string[])

Explicit list of URLs to scrape. When provided, KnowledgeSDK skips discovery and scrapes these exact URLs. Other filtering params (singlePage, includePatterns, excludePatterns) are ignored. Maximum: 20 URLs. Useful for re-indexing specific pages or processing suggested URLs from a previous extraction.
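Putting these parameters together, a sync extract request might look like the following sketch. The pattern values are illustrative, not taken from this page; the endpoint URL and header names match the examples later in this document.

```typescript
// Sketch of a sync extract request body combining a page budget with
// glob filters. The specific patterns here are illustrative only.
const body = {
  url: "https://linear.app",
  maxPages: 10,
  includePatterns: ["/features/**", "/pricing*"],
  excludePatterns: ["/blog/**"],
};

async function extractSync(apiKey: string): Promise<unknown> {
  const res = await fetch("https://api.knowledgesdk.com/v1/extract", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-api-key": apiKey,
    },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Extract failed with status ${res.status}`);
  return res.json(); // resolves to the response shape documented below
}
```

Because the request blocks for the full extraction (typically 1-3 minutes), make sure your HTTP client's timeout is long enough, or use the async variant instead.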

Response

business (object)

Classified business information including name, domain, category, description, and logo.

knowledgeItems (array)

Array of extracted knowledge items. Each item contains title, description, content, category, and source (the URL it was extracted from).

pagesScraped (number)

Total number of pages successfully scraped during extraction.

urlsDiscovered (number)

Total number of unique URLs discovered via sitemap and link analysis (after locale-aware deduplication).

durationMs (number)

Total extraction time in milliseconds.

startedAt (string)

ISO 8601 timestamp of when extraction started.

finishedAt (string)

ISO 8601 timestamp of when extraction completed.

suggestedUrls (array)

URLs discovered during scraping that the AI considers worth indexing but were not scraped due to page budget. Each item contains url (the page URL) and reason (why the AI thinks it's valuable). Pass these URLs back via the urls parameter to extract them.

Example Response

{
  "business": {
    "name": "Linear",
    "domain": "linear.app",
    "category": "Project Management",
    "description": "Linear is a modern project management tool built for software teams.",
    "logo": "https://linear.app/static/logo.png"
  },
  "knowledgeItems": [
    {
      "title": "Issue Tracking",
      "description": "Create, assign, and track issues across your team with real-time sync.",
      "content": "Linear provides fast issue tracking with keyboard shortcuts, automated workflows, and real-time collaboration. Issues can be organized into projects and cycles...",
      "category": "FEATURE",
      "source": "https://linear.app/features"
    },
    {
      "title": "Pro Plan",
      "description": "For growing teams that need advanced features and integrations.",
      "content": "The Pro plan costs $8 per user per month and includes unlimited issues, custom workflows, GitHub and GitLab integrations, Slack integration...",
      "category": "PRICING",
      "source": "https://linear.app/pricing"
    }
  ],
  "pagesScraped": 10,
  "urlsDiscovered": 34,
  "durationMs": 68420,
  "startedAt": "2026-03-20T10:00:00.000Z",
  "finishedAt": "2026-03-20T10:01:08.420Z",
  "suggestedUrls": [
    {
      "url": "https://linear.app/changelog",
      "reason": "Product changelog with recent feature updates"
    },
    {
      "url": "https://linear.app/integrations/slack",
      "reason": "Integration details for a popular tool"
    }
  ]
}

Async Extract

POST /v1/extract/async (requires the x-api-key header)

Start an extraction job in the background. Returns a jobId immediately so your application does not need to wait. Poll with /v1/jobs/{jobId} or provide a callbackUrl to receive the result via webhook.

Request Parameters

url (string, required)

The URL to extract knowledge from.

maxPages (number)

Maximum number of pages to scrape. Minimum: 1, Maximum: 20.

singlePage (boolean)

Only scrape the provided URL without crawling additional pages.

includePatterns (string[])

Only scrape URLs matching at least one of these glob patterns.

excludePatterns (string[])

Skip URLs matching any of these glob patterns. Takes priority over includePatterns.

urls (string[])

Explicit list of URLs to scrape, skipping discovery. Maximum: 20.

callbackUrl (string)

A URL to receive a POST request when extraction completes. The request body will contain the full extraction result, identical to the sync response format.

Response

{
  "jobId": "job_abc123def456",
  "status": "pending",
  "message": "Extraction job started. Poll /v1/jobs/job_abc123def456 for status."
}

The callback request includes an X-KnowledgeSDK-Signature header that you can use to verify the request originated from KnowledgeSDK.
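This page does not specify the signature scheme, so the sketch below assumes a common convention: an HMAC-SHA256 hex digest of the raw request body, keyed by a webhook secret. Treat the algorithm, encoding, and secret source as assumptions to confirm before relying on this.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// ASSUMPTION: signature = HMAC-SHA256(secret, rawBody) as a hex string.
// Verify the actual scheme with KnowledgeSDK before using in production.
function verifySignature(rawBody: string, signature: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  // Constant-time comparison; timingSafeEqual requires equal lengths.
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Compare against the raw body bytes as received, before any JSON parsing or re-serialization, since re-encoding can change the bytes and break verification.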

Callback Example

curl -X POST https://api.knowledgesdk.com/v1/extract/async \
  -H "Content-Type: application/json" \
  -H "x-api-key: sk_ks_your_api_key" \
  -d '{
    "url": "https://linear.app",
    "callbackUrl": "https://your-app.com/webhooks/extraction-complete"
  }'
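If you prefer polling over a callback, a loop against /v1/jobs/{jobId} might look like this sketch. The job payload fields (status values beyond "pending", plus result and error) are assumptions, not documented on this page.

```typescript
// ASSUMED job payload shape: { status, result?, error? }.
type JobStatus = "pending" | "processing" | "completed" | "failed";

function isTerminal(status: JobStatus): boolean {
  return status === "completed" || status === "failed";
}

async function pollJob(jobId: string, apiKey: string, intervalMs = 5000): Promise<unknown> {
  while (true) {
    const res = await fetch(`https://api.knowledgesdk.com/v1/jobs/${jobId}`, {
      headers: { "x-api-key": apiKey },
    });
    const job = await res.json();
    if (isTerminal(job.status)) {
      if (job.status === "failed") throw new Error(job.error ?? "Extraction failed");
      return job.result; // assumed to match the sync response format
    }
    // Wait before the next poll to avoid hammering the API.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```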

Stream Extract

POST /v1/extract/stream (requires the x-api-key header)

Stream extraction progress via Server-Sent Events (SSE). Ideal for building real-time UIs that show extraction progress as it happens.

Request Parameters

url (string, required)

The URL to extract knowledge from.

maxPages (number)

Maximum number of pages to scrape. Minimum: 1, Maximum: 20.

singlePage (boolean)

Only scrape the provided URL without crawling additional pages.

includePatterns (string[])

Only scrape URLs matching at least one of these glob patterns.

excludePatterns (string[])

Skip URLs matching any of these glob patterns. Takes priority over includePatterns.

urls (string[])

Explicit list of URLs to scrape, skipping discovery. Maximum: 20.

Use /v1/extract/stream for real-time progress in your UI. It sends granular events for each step of the extraction process, so you can show users exactly what is happening.

SSE Events

| Event | Description | Data |
| --- | --- | --- |
| connected | Connection established, extraction starting | { "message": "Connected" } |
| progress | General progress update | { "step": "sitemap", "message": "Discovering pages..." } |
| business_classified | Business has been identified and classified | { "business": { "name", "domain", "category", ... } } |
| pages_planned | Pages to be scraped have been determined | { "pages": ["url1", "url2", ...], "count": 10 } |
| page_scraped | A single page has been scraped successfully | { "url": "...", "index": 3, "total": 10 } |
| urls_triaged | URL triage complete, suggested URLs available | { "suggestedUrls": [{ "url": "...", "reason": "..." }] } |
| complete | Extraction finished, full result available | Full extraction result (same as sync response) |
| error | An error occurred during extraction | { "message": "Error description" } |

Stream Example

const response = await fetch("https://api.knowledgesdk.com/v1/extract/stream", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": "sk_ks_your_api_key",
  },
  body: JSON.stringify({
    url: "https://linear.app",
    maxPages: 10,
  }),
});

const reader = response.body!.getReader(); // body is non-null for a successful fetch
const decoder = new TextDecoder();

let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop() ?? ""; // keep incomplete line in buffer

  for (const line of lines) {
    if (line.startsWith("event: ")) {
      const eventType = line.slice(7);
      console.log("Event:", eventType);
    }
    if (line.startsWith("data: ")) {
      const data = JSON.parse(line.slice(6));

      switch (data.type) {
        case "business_classified":
          console.log("Business:", data.business.name);
          break;
        case "page_scraped":
          console.log(`Scraped ${data.index}/${data.total}: ${data.url}`);
          break;
        case "complete":
          console.log("Done!", data.knowledgeItems.length, "items extracted");
          break;
        case "error":
          console.error("Error:", data.message);
          break;
      }
    }
  }
}

Knowledge Item Categories

Extracted knowledge items are automatically categorized into the following types:

| Category | Description |
| --- | --- |
| PRODUCT | Core product information and descriptions |
| FEATURE | Specific features and capabilities |
| PRICING | Plans, pricing tiers, and billing information |
| FAQ | Frequently asked questions and answers |
| SUPPORT | Help articles, troubleshooting, and contact info |
| COMPANY | About the company, team, mission, and values |
| LEGAL | Terms of service, privacy policy, and compliance |
| OTHER | Content that does not fit other categories |
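If you consume knowledge items in TypeScript, the category set above can be narrowed into a union type with a runtime guard. This is a typing convenience on the client side, not part of the API itself.

```typescript
// Union type mirroring the documented category set.
type KnowledgeCategory =
  | "PRODUCT" | "FEATURE" | "PRICING" | "FAQ"
  | "SUPPORT" | "COMPANY" | "LEGAL" | "OTHER";

const CATEGORIES: readonly KnowledgeCategory[] = [
  "PRODUCT", "FEATURE", "PRICING", "FAQ", "SUPPORT", "COMPANY", "LEGAL", "OTHER",
];

// Narrows an untyped string from an API response to KnowledgeCategory.
function isKnowledgeCategory(value: string): value is KnowledgeCategory {
  return (CATEGORIES as readonly string[]).includes(value);
}
```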

Smart URL Discovery

KnowledgeSDK automatically discovers and triages URLs during extraction. After scraping the planned pages, all links found across every scraped page are collected and analyzed by AI to determine which additional pages are worth indexing.

How It Works

  1. During scraping, links are collected from every page (not just the root)
  2. After scraping, unvisited URLs are computed by removing already-scraped pages
  3. An AI model triages the unvisited URLs, ranking them by knowledge value
  4. If page budget remains, top-ranked URLs are automatically scraped
  5. Remaining valuable URLs are returned as suggestedUrls in the response

Extracting Suggested URLs

Use the urls parameter to extract knowledge from suggested URLs:

import KnowledgeSDK from "@knowledgesdk/node";

const ks = new KnowledgeSDK("sk_ks_your_api_key");

// First extraction
const result = await ks.extract({ url: "https://linear.app", maxPages: 5 });

console.log(result.suggestedUrls);
// [{ url: "https://linear.app/changelog", reason: "Product changelog" }, ...]

// Extract suggested URLs
if (result.suggestedUrls.length > 0) {
  const followUp = await ks.extract({
    url: "https://linear.app",
    urls: result.suggestedUrls.map(s => s.url),
  });
}

The urls parameter skips all URL discovery and scrapes exactly the URLs you provide. Business classification still runs against the base url field.