n8n has become the automation platform of choice for developers who want visual workflows without losing the ability to drop into code when needed. KnowledgeSDK turns any URL into clean, AI-ready data. Together they unlock a category of workflows that used to require a dedicated scraping infrastructure team: scheduled competitive intelligence, document monitoring pipelines, AI-powered news digests, and more.
This guide walks through four concrete n8n workflows you can import and run today, covers the webhook-trigger pattern that lets KnowledgeSDK push changes to n8n instead of polling, and includes a full JSON workflow snippet you can paste directly into your n8n instance.
Why n8n for Web Data Workflows?
n8n sits in a sweet spot: it has native HTTP Request nodes that can call any REST API, a rich library of destination integrations (Slack, Gmail, Notion, Airtable, Postgres), and a self-hostable open-source edition. When you pair it with KnowledgeSDK's API — which handles JavaScript rendering, anti-bot evasion, and pagination — you get a complete no-code scraping stack.
The alternative is writing and hosting your own scraping scripts, managing Puppeteer/Playwright infrastructure, dealing with Cloudflare blocks, and building the downstream delivery logic yourself. That's weeks of work. The n8n + KnowledgeSDK combination collapses it to an afternoon.
Prerequisites
- An n8n instance (cloud at app.n8n.cloud or self-hosted via Docker)
- A KnowledgeSDK API key — get one at knowledgesdk.com/setup
- A Slack webhook URL (optional, for the notification step)
Store your KnowledgeSDK API key as an n8n credential: Settings → Credentials → New → Header Auth. Set the header name to x-api-key and the value to your sk_ks_* key.
Workflow 1: Scheduled URL Scraper → Slack Digest
This is the simplest starting point. Every morning at 8 AM, scrape a list of URLs and post the extracted content to a Slack channel.
Nodes:
- Schedule Trigger — runs at 0 8 * * * (8 AM daily)
- Code Node — defines your URL list
- HTTP Request (loop) — calls POST /v1/scrape for each URL
- Aggregate — collects all markdown results
- Slack — posts a formatted digest
The HTTP Request node configuration for scraping
Set the HTTP Request node to:
- Method: POST
- URL: https://api.knowledgesdk.com/v1/scrape
- Authentication: Header Auth (your stored credential)
- Body (JSON):
{
"url": "{{ $json.url }}",
"includeLinks": false
}
The response contains a markdown field with clean, stripped content — no nav bars, no cookie banners, no ad clutter. Pass this into subsequent nodes.
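If you prefer to shape the digest in a Code node rather than an inline expression, here is a minimal sketch. It assumes each scrape response item carries url and markdown fields, as described above:

```javascript
// Build a Slack-ready digest from scraped pages.
// Assumes each item has { url, markdown } from the scrape response.
function buildDigest(items, maxChars = 500) {
  return items
    // Keep the first maxChars characters per page so the digest stays readable.
    .map(i => `${i.url}\n${(i.markdown || '').slice(0, maxChars)}`)
    .join('\n\n---\n\n');
}
```

In an n8n Code node you would call this on the aggregated items and return the result as a single field for the Slack node to read.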
Code node to define your URL list
return [
{ json: { url: "https://example.com/blog" } },
{ json: { url: "https://competitor.com/pricing" } },
{ json: { url: "https://news.ycombinator.com" } }
];
Use a Split In Batches node after this to process each URL through the HTTP Request node individually.
Workflow 2: Scrape → Search → Email Report
This workflow is useful for competitive intelligence. It scrapes a set of pages, indexes them, then runs a semantic search query and emails the matching results.
Nodes:
- Schedule Trigger — weekly on Monday
- HTTP Request: Scrape — call /v1/scrape for each URL
- HTTP Request: Extract — call /v1/extract to get structured AI output
- HTTP Request: Search — call /v1/search with a semantic query
- Gmail / SendGrid — email the search results
Search node configuration
{
"query": "pricing tiers enterprise discount",
"limit": 10,
"hybrid": true
}
The search endpoint returns results ranked by relevance with a score field. You can filter in n8n's Filter node to only keep results above a certain confidence threshold before emailing.
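If you'd rather do the threshold check in a Code node than a Filter node, a sketch (assuming each result object carries a numeric score field, per the description above):

```javascript
// Keep only search results at or above a confidence threshold.
// Assumes each result object has a numeric `score` field.
function filterByScore(results, minScore = 0.7) {
  return results.filter(r => typeof r.score === 'number' && r.score >= minScore);
}
```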
Workflow 3: Full Site Extraction to Airtable
The /v1/extract endpoint returns structured JSON — product names, prices, contact info, team members, whatever the page contains — rather than raw markdown. This makes it ideal for populating databases.
Nodes:
- Schedule Trigger
- HTTP Request: Extract
- Set Node — map extraction fields to Airtable column names
- Airtable: Create/Update Record
Extract request body
{
"url": "https://startup.com",
"schema": {
"companyName": "string",
"founded": "number",
"pricingPlans": "array",
"founderNames": "array",
"techStack": "array"
}
}
KnowledgeSDK's AI extraction reads the entire site and returns a clean JSON object matching your schema. No XPath selectors, no CSS selectors that break when the site redesigns.
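The Set node's field mapping can also be done in a Code node. Here is a sketch using the schema fields from the example above; the Airtable column names are illustrative assumptions, so substitute your own:

```javascript
// Map KnowledgeSDK extract output (schema fields from the example above)
// to Airtable column names. Column names here are placeholders.
function toAirtableFields(extracted) {
  return {
    'Company Name': extracted.companyName,
    'Founded': extracted.founded,
    // Airtable text columns take strings, so join array fields.
    'Pricing Plans': (extracted.pricingPlans || []).join(', '),
    'Founders': (extracted.founderNames || []).join(', '),
    'Tech Stack': (extracted.techStack || []).join(', ')
  };
}
```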
Workflow 4: Webhook-Triggered Change Alert
Instead of polling on a schedule, KnowledgeSDK can notify your n8n instance the moment a page changes. This is the most efficient pattern for monitoring: zero unnecessary API calls, near-real-time alerts.
Step 1: Create a Webhook node in n8n
Add a Webhook node as your trigger. n8n will give you a URL like:
https://your-n8n.app.n8n.cloud/webhook/abc123
Copy this URL.
Step 2: Register the webhook with KnowledgeSDK
Call the KnowledgeSDK webhooks API once to subscribe:
curl -X POST https://api.knowledgesdk.com/v1/webhooks \
-H "x-api-key: sk_ks_your_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://your-n8n.app.n8n.cloud/webhook/abc123",
"watchUrls": [
"https://competitor.com/pricing",
"https://competitor.com/features"
],
"events": ["content.changed"]
}'
You can also register this inside n8n itself using an HTTP Request node: run the workflow manually once to set up the subscription, then disable or remove that node.
Step 3: Process the webhook payload
When KnowledgeSDK detects a change, it sends a POST to your n8n webhook URL with this payload:
{
"event": "content.changed",
"url": "https://competitor.com/pricing",
"changedAt": "2026-03-19T14:22:00Z",
"diff": {
"added": ["Enterprise plan now $299/month"],
"removed": ["Enterprise plan $249/month"]
},
"newContent": "...full markdown of updated page..."
}
Wire the webhook output into a Slack node, a Gmail node, or a Postgres insert. Your team gets alerted within seconds of any pricing change, product update, or competitor announcement.
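For the Slack branch, a small Code-node sketch that turns the payload shape shown above into a readable alert message:

```javascript
// Format the change-alert payload (shape shown above) as a Slack message.
function formatChangeAlert(payload) {
  const removed = payload.diff.removed.map(l => `- ${l}`).join('\n');
  const added = payload.diff.added.map(l => `+ ${l}`).join('\n');
  return `Change detected on ${payload.url} at ${payload.changedAt}\n${removed}\n${added}`;
}
```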
Full JSON Workflow Export (Scrape + Slack)
Here is a minimal n8n workflow JSON you can import via the workflow menu's Import from File option, or paste directly onto the canvas:
{
"name": "KnowledgeSDK Daily Scrape → Slack",
"nodes": [
{
"id": "schedule-1",
"name": "Daily Schedule",
"type": "n8n-nodes-base.scheduleTrigger",
"parameters": {
"rule": { "interval": [{ "field": "cronExpression", "expression": "0 8 * * *" }] }
},
"position": [240, 300]
},
{
"id": "code-1",
"name": "URL List",
"type": "n8n-nodes-base.code",
"parameters": {
"jsCode": "return [{json:{url:'https://competitor.com/pricing'}},{json:{url:'https://competitor.com/features'}}];"
},
"position": [460, 300]
},
{
"id": "split-1",
"name": "Split URLs",
"type": "n8n-nodes-base.splitInBatches",
"parameters": { "batchSize": 1 },
"position": [680, 300]
},
{
"id": "http-scrape",
"name": "Scrape URL",
"type": "n8n-nodes-base.httpRequest",
"parameters": {
"method": "POST",
"url": "https://api.knowledgesdk.com/v1/scrape",
"authentication": "headerAuth",
"body": { "url": "={{ $json.url }}" },
"sendBody": true,
"bodyContentType": "json"
},
"position": [900, 300]
},
{
"id": "aggregate-1",
"name": "Aggregate Results",
"type": "n8n-nodes-base.aggregate",
"parameters": { "aggregate": "aggregateAllItemData" },
"position": [1120, 300]
},
{
"id": "slack-1",
"name": "Post to Slack",
"type": "n8n-nodes-base.slack",
"parameters": {
"channel": "#competitive-intel",
        "text": "={{ $json.data.map(i => i.url + '\\n' + i.markdown.slice(0,500)).join('\\n\\n---\\n\\n') }}"
},
"position": [1340, 300]
}
],
"connections": {
"Daily Schedule": { "main": [[{ "node": "URL List", "type": "main", "index": 0 }]] },
"URL List": { "main": [[{ "node": "Split URLs", "type": "main", "index": 0 }]] },
    "Split URLs": { "main": [[{ "node": "Aggregate Results", "type": "main", "index": 0 }], [{ "node": "Scrape URL", "type": "main", "index": 0 }]] },
    "Scrape URL": { "main": [[{ "node": "Split URLs", "type": "main", "index": 0 }]] },
    "Aggregate Results": { "main": [[{ "node": "Post to Slack", "type": "main", "index": 0 }]] }
}
}
Import this, connect your credentials, and activate. Done.
KnowledgeSDK vs. n8n's Built-in Scraping Nodes
| Feature | n8n HTTP Request (raw) | n8n + KnowledgeSDK |
|---|---|---|
| JavaScript rendering | No | Yes |
| Anti-bot / Cloudflare bypass | No | Yes |
| Pagination handling | Manual | Automatic |
| Clean markdown output | No (raw HTML) | Yes |
| AI-structured extraction | No | Yes |
| Semantic search over scraped data | No | Yes |
| Change detection webhooks | No | Yes |
| Setup time | Low | Low |
The raw HTTP Request node works for static HTML pages that don't require authentication or JavaScript. The moment you hit a React SPA, a Cloudflare-protected site, or a page that lazy-loads content on scroll, it fails quietly: you get a partial or empty response with no error to flag it. KnowledgeSDK handles all of that transparently.
Advanced Patterns
AI Summarization after Scraping
After the scrape node, add an OpenAI node (n8n has a native integration). Pass the markdown field as the prompt context:
Summarize the following web page content in 3 bullet points:
{{ $json.markdown }}
This gives you a daily AI-powered briefing on competitor pages, industry news, or documentation updates.
Error Handling and Retries
n8n's Error Trigger node can catch failed scrape requests. Wire it to a Slack alert so you know when a URL becomes unreachable. KnowledgeSDK returns standard HTTP status codes: 200 for success, 422 for invalid URLs, 429 for rate limit exceeded, 503 for temporarily unreachable pages.
Add a Wait node between batches to respect rate limits — 1 second between requests is a safe default.
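Beyond the Wait node, you can also retry transient failures (429s and 503s) inside a Code node. A generic sketch, where requestFn stands in for your actual HTTP call:

```javascript
// Minimal retry helper: calls requestFn up to maxAttempts times,
// waiting delayMs * attempt between tries (simple linear backoff).
// requestFn is a placeholder for your real HTTP call.
async function withRetry(requestFn, maxAttempts = 3, delayMs = 1000) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await requestFn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        await new Promise(r => setTimeout(r, delayMs * attempt));
      }
    }
  }
  throw lastError;
}
```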
Storing Results in Postgres
Instead of emailing or Slacking, insert scraped content into a Postgres table for long-term storage and diffing:
INSERT INTO scraped_pages (url, markdown, scraped_at)
VALUES ($1, $2, NOW())
ON CONFLICT (url) DO UPDATE
SET markdown = EXCLUDED.markdown,
scraped_at = EXCLUDED.scraped_at;
Use n8n's Postgres node with the Execute Query operation. Now you can diff current vs. previous content inside n8n using a Code node.
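A naive line-level diff is often enough for that Code node. This sketch compares the stored markdown against the fresh scrape; for anything more nuanced, reach for a real diff library:

```javascript
// Naive line-level diff between stored markdown and a fresh scrape.
// Good enough for "what changed" alerts; ignores moved or reordered lines.
function diffLines(oldText, newText) {
  const oldLines = new Set(oldText.split('\n'));
  const newLines = new Set(newText.split('\n'));
  return {
    added: [...newLines].filter(l => !oldLines.has(l)),
    removed: [...oldLines].filter(l => !newLines.has(l))
  };
}
```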
Webhook Pattern Deep Dive
The webhook trigger pattern deserves extra attention because it's the most scalable approach at volume. Instead of running 100 scrape calls every hour, you register subscriptions once and KnowledgeSDK monitors those URLs continuously. Your n8n workflow only runs when something actually changes.
This is especially valuable for:
- Legal monitoring — terms of service or privacy policy changes
- Pricing intelligence — competitor price updates
- Inventory tracking — product availability changes
- News monitoring — new articles on specific pages
The webhook payload includes a diff object with added and removed text arrays, so you can build sophisticated change-analysis logic in your n8n Code node without storing or diffing the full page content yourself.
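As an example of that change-analysis logic, here is a sketch that pulls dollar amounts out of the diff arrays. The field names follow the payload shown earlier; the price regex is an illustrative assumption that only handles simple $N or $N.NN amounts:

```javascript
// Flag price changes in a webhook diff payload.
// Matches simple dollar amounts like $249 or $249.99.
const priceRe = /\$(\d+(?:\.\d{2})?)/;

function priceChanges(diff) {
  const extract = (lines) => lines
    .map(l => { const m = l.match(priceRe); return m ? parseFloat(m[1]) : null; })
    .filter(p => p !== null);
  const oldPrices = extract(diff.removed);
  const newPrices = extract(diff.added);
  return { oldPrices, newPrices, changed: oldPrices.join() !== newPrices.join() };
}
```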
Production Considerations
Rate limits: KnowledgeSDK's default rate limits depend on your plan. Add a Wait node between scrape calls in batch workflows — 500ms to 1000ms between requests keeps you well within limits.
Deduplication: If you're running both scheduled scrapes and webhook triggers, you may process the same URL twice. Use n8n's If node to check a timestamp field against a Postgres or Airtable record to skip already-processed content.
Secret management: Never hardcode your sk_ks_* key in n8n workflow JSON. Always use n8n's credential store. If you export and share workflows, credentials are automatically excluded.
Cost awareness: The /v1/extract endpoint uses AI processing and is billed per extraction. The /v1/scrape endpoint is cheaper for cases where you only need raw markdown. Use extract only when you need structured JSON output.
FAQ
Can I use KnowledgeSDK with n8n Cloud?
Yes. n8n Cloud can reach the KnowledgeSDK API. Use n8n's Header Auth credential type with x-api-key as the key name.
How do I scrape pages that require login?
KnowledgeSDK handles cookie-based sessions. Pass cookies in the scrape request body as a key-value object. For OAuth-protected pages, you'll need to obtain a session cookie from the site first.
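A sketch of what such a request body could look like. The cookies field name follows the description above, but check the API reference for the exact key, and the cookie name and value here are placeholders:

```json
{
  "url": "https://app.example.com/dashboard",
  "cookies": {
    "session_id": "your-session-cookie-value"
  }
}
```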
Can n8n receive webhooks from KnowledgeSDK on a self-hosted instance? Yes, as long as your n8n instance is publicly reachable. If it's behind a NAT, expose the webhook port or use a tunneling service like ngrok for testing.
What's the maximum number of URLs I can watch with webhooks? This depends on your KnowledgeSDK plan. Check current limits at knowledgesdk.com/setup.
Can I chain multiple scrapes in one workflow? Absolutely. Use the Split In Batches → HTTP Request → Merge pattern to fan out and fan back in.
Does KnowledgeSDK handle pagination automatically? Yes. When you scrape a URL with paginated content (blog listing pages, product catalogs), KnowledgeSDK follows pagination links and returns all pages combined into one markdown document.
Ready to build your first n8n + KnowledgeSDK workflow? Get your API key and start scraping in minutes at knowledgesdk.com/setup.