knowledgesdk.com/blog/lead-enrichment-web-scraping
Use case · March 19, 2026 · 12 min read

Enrich CRM Leads with Real-Time Web Data Using AI

Build a lead enrichment pipeline that scrapes company websites, extracts structured data—description, pricing, tech stack—and feeds it directly into your CRM.

Every sales team has the same problem: a CRM full of company domains with almost no context. You know a prospect's name and email. You don't know if they're a 10-person startup or a 500-person enterprise, what product they sell, whether they're actively hiring, or what their pricing looks like.

Manual research takes 10-15 minutes per lead. Multiply that by hundreds of leads and it's clear why most CRMs stay shallow.

Automated lead enrichment using web scraping solves this at scale. Take a list of company domains, scrape their websites, extract structured intelligence with an LLM, and write it back to your CRM — all without touching a keyboard. This guide builds that pipeline.

What We're Extracting

For each company domain, the enrichment pipeline collects:

  • Company description — what they do, their market, their customer
  • Product summary — key features, use case, value proposition
  • Pricing signals — pricing page detection, plan tiers, price points if public
  • Team size signals — headcount indicators from about pages, job posts
  • Tech stack — inferred from job postings, stack mentions, and metadata
  • Target customer — B2B vs B2C, SMB vs enterprise, industry focus
  • Recent news — press releases, blog announcements, funding
  • Hiring signals — whether they're actively hiring and in which teams
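
Concretely, one enriched record produced by the pipeline below might look like this — every value here is illustrative, including the domain:

```javascript
// Illustrative shape of a single enriched record (all values invented)
const example = {
  domain: 'acme.dev',
  enriched: true,
  scrapedAt: '2026-03-19T10:00:00.000Z',
  pagesFound: ['homepage', 'pricing', 'careers'],
  companyName: 'Acme Dev Tools',
  oneLiner: 'CI/CD pipelines for small engineering teams.',
  product: 'Hosted build and deploy automation with GitHub integration.',
  targetCustomer: { segment: 'Developer', industries: ['DevTools'], companySize: '11-50' },
  pricingModel: 'Subscription',
  lowestPrice: 29,
  techStack: ['Node.js', 'React', 'PostgreSQL'],
  teamSizeSignal: '11-50',
  isHiring: true,
  hiringTeams: ['Engineering', 'Sales'],
  fundingStatus: 'Seed',
};
```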

Architecture

┌──────────────┐     ┌────────────────────┐     ┌──────────────────┐
│ Domain List  │────▶│ Enrichment Pipeline│────▶│ Structured Data  │
│ (CSV / CRM)  │     │                    │     │ (JSON / company) │
└──────────────┘     └────────────────────┘     └──────────────────┘
                              │                          │
                    ┌─────────▼──────────┐               ▼
                    │ KnowledgeSDK       │     ┌──────────────────┐
                    │ /v1/scrape         │     │ CRM Write-Back   │
                    │ /v1/extract        │     │ (HubSpot, SFDC,  │
                    └────────────────────┘     │ Pipedrive, etc.) │
                                               └──────────────────┘

Step 1: The Core Enrichment Function

Node.js

import KnowledgeSDK from '@knowledgesdk/node';
import OpenAI from 'openai';

const sdk = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });
const openai = new OpenAI();

async function enrichCompany(domain) {
  console.log(`Enriching: ${domain}`);

  const baseUrl = `https://${domain}`;
  const pagesToScrape = [
    { url: baseUrl, type: 'homepage' },
    { url: `${baseUrl}/about`, type: 'about' },
    { url: `${baseUrl}/pricing`, type: 'pricing' },
    { url: `${baseUrl}/careers`, type: 'careers' },
  ];

  const scrapedPages = {};

  for (const page of pagesToScrape) {
    try {
      const result = await sdk.scrape({ url: page.url });
      scrapedPages[page.type] = {
        url: page.url,
        title: result.title,
        content: result.markdown,
        wordCount: result.markdown.split(/\s+/).length,
      };
    } catch {
      // Page doesn't exist or failed — skip silently
      scrapedPages[page.type] = null;
    }
  }

  // Build context for LLM extraction
  const context = Object.entries(scrapedPages)
    .filter(([, page]) => page && page.wordCount > 50)
    .map(([type, page]) => `=== ${type.toUpperCase()} PAGE ===\n${page.content.slice(0, 3000)}`)
    .join('\n\n');

  if (!context) {
    return { domain, error: 'No content scraped', enriched: false };
  }

  // Extract structured data
  const extraction = await extractStructuredData(domain, context);

  return {
    domain,
    enriched: true,
    scrapedAt: new Date().toISOString(),
    pagesFound: Object.keys(scrapedPages).filter(k => scrapedPages[k] !== null),
    ...extraction,
  };
}

async function extractStructuredData(domain, context) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are a B2B sales intelligence analyst. Extract structured company data from the provided website content.

Return a JSON object with these fields:
- companyName: string (official company name)
- oneLiner: string (1 sentence description, max 20 words)
- description: string (3-5 sentence detailed description)
- product: string (what they sell, key features)
- targetCustomer: object { segment: "SMB"|"Mid-Market"|"Enterprise"|"Consumer"|"Developer", industries: string[], companySize: string }
- pricingModel: "Free"|"Freemium"|"Usage-Based"|"Subscription"|"Enterprise"|"Contact Sales"|"Unknown"
- lowestPrice: number|null (lowest public price per month in USD, null if unknown)
- techStack: string[] (technologies mentioned or inferred, max 8)
- teamSizeSignal: "1-10"|"11-50"|"51-200"|"201-500"|"500+"|"Unknown"
- isHiring: boolean
- hiringTeams: string[] (teams actively hiring)
- fundingStatus: "Bootstrapped"|"Seed"|"Series A"|"Series B+"|"Public"|"Unknown"
- recentNews: string[] (notable recent announcements, max 3)
- competitors: string[] (mentioned competitors or similar tools, max 5)

Return only valid JSON.`,
      },
      {
        role: 'user',
        content: `Domain: ${domain}\n\n${context}`,
      },
    ],
    response_format: { type: 'json_object' },
    max_tokens: 1000,
  });

  try {
    return JSON.parse(response.choices[0].message.content);
  } catch {
    return { error: 'Failed to parse LLM response' };
  }
}
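
`response_format: { type: 'json_object' }` guarantees syntactically valid JSON, but not that enum fields stay within the allowed values. A defensive normalization pass can protect downstream code — a sketch; `normalizeExtraction` is a hypothetical helper, not part of any SDK:

```javascript
// Coerce out-of-range LLM enum values and malformed arrays to safe defaults
const PRICING_MODELS = new Set([
  'Free', 'Freemium', 'Usage-Based', 'Subscription', 'Enterprise', 'Contact Sales', 'Unknown',
]);

function normalizeExtraction(data) {
  return {
    ...data,
    pricingModel: PRICING_MODELS.has(data.pricingModel) ? data.pricingModel : 'Unknown',
    techStack: Array.isArray(data.techStack) ? data.techStack.slice(0, 8) : [],
    recentNews: Array.isArray(data.recentNews) ? data.recentNews.slice(0, 3) : [],
    isHiring: Boolean(data.isHiring),
  };
}
```

Calling this on the parsed response before returning means CRM fields never receive values outside the schema you defined.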

Python

import os
import json
from datetime import datetime, timezone
from typing import Dict, List, Optional
from knowledgesdk import KnowledgeSDK
from openai import OpenAI

sdk = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])
openai_client = OpenAI()

EXTRACTION_PROMPT = """You are a B2B sales intelligence analyst. Extract structured company data from the provided website content.

Return a JSON object with these fields:
- company_name: string
- one_liner: string (max 20 words)
- description: string (3-5 sentences)
- product: string (what they sell)
- target_customer: object with segment, industries, company_size
- pricing_model: "Free"|"Freemium"|"Usage-Based"|"Subscription"|"Enterprise"|"Contact Sales"|"Unknown"
- lowest_price: number or null (USD per month)
- tech_stack: list of strings (max 8)
- team_size_signal: "1-10"|"11-50"|"51-200"|"201-500"|"500+"|"Unknown"
- is_hiring: boolean
- hiring_teams: list of strings
- funding_status: "Bootstrapped"|"Seed"|"Series A"|"Series B+"|"Public"|"Unknown"
- recent_news: list of strings (max 3)
- competitors: list of strings (max 5)

Return only valid JSON."""

def scrape_company_pages(domain: str) -> Dict[str, Optional[Dict]]:
    base_url = f"https://{domain}"
    pages = {
        "homepage": base_url,
        "about": f"{base_url}/about",
        "pricing": f"{base_url}/pricing",
        "careers": f"{base_url}/careers",
    }

    scraped = {}
    for page_type, url in pages.items():
        try:
            result = sdk.scrape(url=url)
            word_count = len(result["markdown"].split())
            if word_count > 50:
                scraped[page_type] = {
                    "url": url,
                    "title": result.get("title", ""),
                    "content": result["markdown"],
                    "word_count": word_count,
                }
        except Exception:
            scraped[page_type] = None

    return scraped

def extract_structured_data(domain: str, context: str) -> Dict:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": f"Domain: {domain}\n\n{context}"},
        ],
        response_format={"type": "json_object"},
        max_tokens=1000,
    )

    try:
        return json.loads(response.choices[0].message.content)
    except Exception:
        return {"error": "Failed to parse response"}

def enrich_company(domain: str) -> Dict:
    print(f"Enriching: {domain}")

    scraped_pages = scrape_company_pages(domain)

    context_parts = []
    for page_type, page in scraped_pages.items():
        if page:
            context_parts.append(f"=== {page_type.upper()} PAGE ===\n{page['content'][:3000]}")

    if not context_parts:
        return {"domain": domain, "error": "No content scraped", "enriched": False}

    context = "\n\n".join(context_parts)
    extraction = extract_structured_data(domain, context)

    from datetime import datetime, timezone

    return {
        "domain": domain,
        "enriched": True,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        "pages_found": [k for k, v in scraped_pages.items() if v],
        **extraction,
    }

Step 2: Batch Processing a Lead List

Process hundreds of leads with controlled concurrency:

import fs from 'fs';

async function enrichLeadList(inputCsvPath, outputJsonPath) {
  // Read domains from CSV
  const csv = fs.readFileSync(inputCsvPath, 'utf8');
  const domains = csv
    .split('\n')
    .slice(1) // Skip header
    .map(row => row.split(',')[0].trim().replace(/^https?:\/\//, '').replace(/\/$/, ''))
    .filter(Boolean);

  console.log(`Processing ${domains.length} domains...`);

  const results = [];
  const CONCURRENCY = 3; // 3 concurrent enrichments
  const DELAY_MS = 2000; // 2 second delay between batches

  for (let i = 0; i < domains.length; i += CONCURRENCY) {
    const batch = domains.slice(i, i + CONCURRENCY);

    const batchResults = await Promise.allSettled(
      batch.map(domain => enrichCompany(domain))
    );

    batchResults.forEach((result, j) => {
      if (result.status === 'fulfilled') {
        results.push(result.value);
      } else {
        // Index j maps the settled result back to its domain within this batch
        results.push({ domain: batch[j], enriched: false, error: result.reason.message });
      }
    });

    const completed = Math.min(i + CONCURRENCY, domains.length);
    console.log(`Progress: ${completed}/${domains.length}`);

    // Save intermediate results
    fs.writeFileSync(outputJsonPath, JSON.stringify(results, null, 2));

    if (i + CONCURRENCY < domains.length) {
      await new Promise(r => setTimeout(r, DELAY_MS));
    }
  }

  const successCount = results.filter(r => r.enriched).length;
  console.log(`\nComplete: ${successCount}/${domains.length} enriched successfully`);
  return results;
}

// Run
await enrichLeadList('./leads.csv', './enriched-leads.json');
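
The inline `.replace` chain above handles protocols and trailing slashes but not `www.` prefixes or paths, which real CRM exports often contain. A slightly more defensive normalizer (a sketch using the standard `URL` class) could be swapped in:

```javascript
// Normalize a raw CSV cell to a bare domain (handles URLs, www., paths, casing)
function normalizeDomain(raw) {
  let value = raw.trim().toLowerCase();
  if (!value) return null;
  // Prepend a scheme so URL() can parse bare domains like "acme.io"
  if (!/^https?:\/\//.test(value)) value = `https://${value}`;
  try {
    return new URL(value).hostname.replace(/^www\./, '');
  } catch {
    // Unparseable cell — caller should skip it
    return null;
  }
}
```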

Step 3: Writing Back to Your CRM

HubSpot Integration

async function writeToHubspot(enrichedCompany) {
  const hubspotToken = process.env.HUBSPOT_ACCESS_TOKEN;

  // Search for existing company
  const searchResponse = await fetch('https://api.hubapi.com/crm/v3/objects/companies/search', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${hubspotToken}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      filterGroups: [{
        filters: [{
          propertyName: 'domain',
          operator: 'EQ',
          value: enrichedCompany.domain,
        }],
      }],
    }),
  });

  const searchData = await searchResponse.json();
  const existingCompany = searchData.results?.[0];

  const properties = {
    domain: enrichedCompany.domain,
    name: enrichedCompany.companyName,
    description: enrichedCompany.description,
    hs_lead_status: 'CONNECTED',
    // Custom properties (create these in HubSpot first)
    enrichment_one_liner: enrichedCompany.oneLiner,
    enrichment_pricing_model: enrichedCompany.pricingModel,
    enrichment_team_size: enrichedCompany.teamSizeSignal,
    enrichment_target_segment: enrichedCompany.targetCustomer?.segment,
    enrichment_tech_stack: enrichedCompany.techStack?.join(', '),
    enrichment_scraped_at: enrichedCompany.scrapedAt,
  };

  if (existingCompany) {
    // Update existing company
    await fetch(`https://api.hubapi.com/crm/v3/objects/companies/${existingCompany.id}`, {
      method: 'PATCH',
      headers: { 'Authorization': `Bearer ${hubspotToken}`, 'Content-Type': 'application/json' },
      body: JSON.stringify({ properties }),
    });
    return { action: 'updated', id: existingCompany.id };
  } else {
    // Create new company
    const createResponse = await fetch('https://api.hubapi.com/crm/v3/objects/companies', {
      method: 'POST',
      headers: { 'Authorization': `Bearer ${hubspotToken}`, 'Content-Type': 'application/json' },
      body: JSON.stringify({ properties }),
    });
    const newCompany = await createResponse.json();
    return { action: 'created', id: newCompany.id };
  }
}
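
CRM APIs rate-limit batch write-backs, so transient failures (429s, timeouts) are normal at volume. A minimal retry wrapper — a sketch with illustrative retry counts and delays — can wrap the search/create/update calls above:

```javascript
// Retry an async operation with exponential backoff on failure
// (maxRetries and baseDelayMs are illustrative defaults)
async function withRetry(fn, maxRetries = 3, baseDelayMs = 1000) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      // Exponential backoff: 1s, 2s, 4s, ...
      await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}

// Usage (assuming writeToHubspot from above):
// await withRetry(() => writeToHubspot(enrichedCompany));
```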

Generic CSV Export for Any CRM

# Python: export enriched data to CRM-importable CSV
import csv
import json
from typing import Dict, List

def _first_present(company: Dict, *keys, default=""):
    """Return the first key present in the record, even if its value is falsy.

    A plain `a or b` chain would silently drop legitimate falsy values
    like is_hiring=False or lowest_price=0.
    """
    for key in keys:
        if key in company and company[key] is not None:
            return company[key]
    return default

def enriched_to_csv_row(company: Dict) -> Dict:
    return {
        "domain": company.get("domain", ""),
        "company_name": _first_present(company, "companyName", "company_name"),
        "one_liner": _first_present(company, "oneLiner", "one_liner"),
        "description": company.get("description", ""),
        "pricing_model": _first_present(company, "pricingModel", "pricing_model"),
        "lowest_price": _first_present(company, "lowestPrice", "lowest_price"),
        "team_size": _first_present(company, "teamSizeSignal", "team_size_signal"),
        "target_segment": (company.get("targetCustomer") or company.get("target_customer") or {}).get("segment", ""),
        "tech_stack": ", ".join(_first_present(company, "techStack", "tech_stack", default=[])),
        "funding_status": _first_present(company, "fundingStatus", "funding_status"),
        "is_hiring": _first_present(company, "isHiring", "is_hiring"),
        "competitors": ", ".join(company.get("competitors") or []),
        "enriched_at": _first_present(company, "scrapedAt", "scraped_at"),
    }

def export_to_csv(enriched_companies: List[Dict], output_path: str):
    if not enriched_companies:
        return

    rows = [enriched_to_csv_row(c) for c in enriched_companies if c.get("enriched")]
    fieldnames = list(rows[0].keys())

    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    print(f"Exported {len(rows)} enriched companies to {output_path}")

# Usage
with open("enriched-leads.json") as f:
    enriched = json.load(f)

export_to_csv(enriched, "enriched-leads-for-crm.csv")

Step 4: Scoring Leads Based on Enriched Data

Once you have structured data, you can score leads automatically:

function scoreLead(enrichedCompany) {
  let score = 0;
  const signals = [];

  // Company size fit
  const sizeFit = {
    '11-50': 15,
    '51-200': 20,
    '201-500': 15,
    '500+': 5,
    '1-10': 10,
  };
  score += sizeFit[enrichedCompany.teamSizeSignal] || 0;
  if (sizeFit[enrichedCompany.teamSizeSignal]) {
    signals.push(`Team size: ${enrichedCompany.teamSizeSignal}`);
  }

  // Target segment fit (customize for your ICP)
  if (enrichedCompany.targetCustomer?.segment === 'Developer') {
    score += 20;
    signals.push('Developer-focused product');
  } else if (enrichedCompany.targetCustomer?.segment === 'SMB') {
    score += 15;
  }

  // Hiring signal (growing company)
  if (enrichedCompany.isHiring) {
    score += 10;
    signals.push('Actively hiring');
  }

  // Tech stack match (customize for your integration ecosystem)
  const relevantTech = ['Node.js', 'Python', 'React', 'TypeScript', 'Next.js'];
  const techMatches = (enrichedCompany.techStack || []).filter(t =>
    relevantTech.some(rt => t.toLowerCase().includes(rt.toLowerCase()))
  );
  if (techMatches.length > 0) {
    score += techMatches.length * 5;
    signals.push(`Tech stack match: ${techMatches.join(', ')}`);
  }

  // Pricing model (shows willingness to pay for tools)
  if (['Subscription', 'Usage-Based', 'Enterprise'].includes(enrichedCompany.pricingModel)) {
    score += 10;
    signals.push(`Pays for software: ${enrichedCompany.pricingModel}`);
  }

  return {
    score: Math.min(score, 100),
    grade: score >= 70 ? 'A' : score >= 50 ? 'B' : score >= 30 ? 'C' : 'D',
    signals,
  };
}
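
With a score and grade attached to each record, the list can be bucketed into outreach tiers. A minimal sketch — it assumes each lead object carries the `score` and `grade` fields that `scoreLead` returns:

```javascript
// Group scored leads into tiers by grade, highest score first within each tier
function bucketLeads(scoredLeads) {
  const tiers = { A: [], B: [], C: [], D: [] };
  for (const lead of scoredLeads) {
    if (!tiers[lead.grade]) tiers[lead.grade] = [];
    tiers[lead.grade].push(lead);
  }
  for (const grade of Object.keys(tiers)) {
    tiers[grade].sort((a, b) => b.score - a.score);
  }
  return tiers;
}
```

Reps work the A tier first; C and D tiers can route to a nurture sequence instead of outbound.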

Real-World Performance Benchmarks

For planning purposes, here's what to expect from this pipeline:

  • Scraping speed: ~2-4 seconds per domain (homepage + 3 additional pages)
  • LLM extraction: ~2-3 seconds per domain with gpt-4o-mini; expect somewhat longer with the gpt-4o model used in the examples above
  • Total per domain: ~8-12 seconds end-to-end
  • Batch of 100 leads: ~15-20 minutes with 3 concurrent workers
  • Success rate: ~85-90% (10-15% of domains have no scrapeable content or block requests)

FAQ

What if a company website is mostly images or uses unusual frameworks? KnowledgeSDK runs a full headless browser, so it handles React, Vue, Angular, and most JS frameworks. For sites that are truly image-heavy (marketing agencies, portfolio sites), extraction accuracy will be lower — the LLM will have less text to work with.

How do I handle companies with multiple domains or subsidiaries? Process each domain separately. The LLM often detects parent company relationships from the content and will include that context in the extracted data.

Is it worth enriching every lead, or just qualified ones? For most sales teams, enriching at the moment of lead creation (or on inbound form submit) is most efficient. You get context exactly when you need it without wasting API calls on leads that will never be worked.

How often should I re-enrich existing accounts? Enrich new leads immediately. Re-enrich existing accounts quarterly, or when you detect a trigger event (new funding, hiring surge, product launch). The pricing and team size data changes slowly — monthly re-enrichment is overkill for most accounts.
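
One way to implement that trigger detection is to diff the previous enrichment snapshot against a fresh one. A sketch — field names follow the Node extraction schema above; the specific checks and messages are illustrative:

```javascript
// Compare two enrichment snapshots and return human-readable trigger events
function detectTriggers(prev, curr) {
  const triggers = [];
  if (!prev.isHiring && curr.isHiring) triggers.push('Started hiring');
  if (prev.fundingStatus !== curr.fundingStatus) {
    triggers.push(`Funding changed: ${prev.fundingStatus} -> ${curr.fundingStatus}`);
  }
  if (prev.teamSizeSignal !== curr.teamSizeSignal) {
    triggers.push(`Team size changed: ${prev.teamSizeSignal} -> ${curr.teamSizeSignal}`);
  }
  if (prev.pricingModel !== curr.pricingModel) triggers.push('Pricing model changed');
  return triggers;
}
```

Any non-empty result can schedule an immediate re-enrichment and a task for the account owner.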

Can I extract social proof and customer logos? Yes. Companies often list customer logos on their homepage. The LLM can extract these from the scraped content, giving you a "customers include..." field that's useful for sales context.


Stop sending cold emails to companies you know nothing about. Enrich your CRM in minutes at knowledgesdk.com/setup.
