Use case · March 19, 2026 · 12 min read

Scrape Financial Data for AI Agents: Earnings, Press Releases, Filings

Build a financial monitoring agent that scrapes IR pages, earnings press releases, and public filings to alert on new disclosures and extract key metrics.


Investor relations pages are some of the most information-dense websites on the internet. Earnings press releases, 8-K filings, annual reports, and management commentary — all published publicly, all scrape-able, all valuable for AI agents that need to reason about business performance.

The challenge is that this content is scattered, inconsistently structured, and often published as PDFs. This guide walks through building a financial monitoring agent that handles the scrape-able portion well: HTML-based IR pages, press releases, and SEC filing index pages. We'll also be clear about PDF limitations so you can make an informed architectural decision.

Important compliance note: This guide covers only the scraping of publicly available financial information. Do not use this for insider information, trading on scraped data in violation of securities law, or accessing any data that is not publicly disclosed. All techniques described here apply to publicly accessible, investor-facing disclosures.

What Financial Data Is Available to Scrape

Understanding the data landscape before building the scraper:

Investor relations pages — Every public company maintains an IR page (e.g., investor.company.com or company.com/investors). These list press releases, SEC filings, webcasts, and financial calendars. They're HTML pages, fully scrape-able.

Earnings press releases — Published as HTML pages on IR sites before or immediately after earnings calls. These contain headline revenue, EPS, guidance, and management commentary. Excellent for automated summarization.

SEC EDGAR filings — The SEC's EDGAR system publishes all required filings (10-K, 10-Q, 8-K, etc.) in HTML and XBRL format. The filing index pages are HTML and list individual documents. The documents themselves vary: 10-Ks are often long HTML documents with inline CSS; 8-Ks (current reports) are usually shorter and more scrape-friendly.

Press releases via PRNewswire / BusinessWire — Many companies also publish press releases through wire services. These are fully HTML-rendered and consistently structured.

PDF-heavy content — Annual reports, investor presentations, and proxy statements are commonly PDF-only. KnowledgeSDK's scrape endpoint returns the text content of PDFs, but complex financial tables, charts, and multi-column layouts may not parse cleanly. For heavy PDF workflows, consider a dedicated PDF extraction tool (like Reducto or Adobe PDF Extract API) alongside KnowledgeSDK for HTML content.

Architecture: Financial Monitoring Agent

┌─────────────────────┐     ┌─────────────────┐     ┌──────────────────┐
│ IR Pages + EDGAR    │────▶│ KnowledgeSDK    │────▶│ Disclosure DB    │
│ (monitored list)    │     │ Scrape + Index  │     │ (Postgres)       │
└─────────────────────┘     └─────────────────┘     └──────────────────┘
                                                              │
                                                              ▼
                                                    ┌─────────────────┐
                                                    │ AI Extraction   │
                                                    │ (metrics, KPIs) │
                                                    └─────────────────┘
                                                              │
                                                              ▼
                                                    ┌─────────────────┐
                                                    │ Alert System    │
                                                    │ (new filing     │
                                                    │  detected)      │
                                                    └─────────────────┘

Step 1: Database Schema

CREATE TABLE monitored_companies (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  ticker TEXT NOT NULL UNIQUE,
  name TEXT,
  ir_url TEXT NOT NULL,
  sec_cik TEXT, -- SEC Central Index Key, for EDGAR queries
  active BOOLEAN DEFAULT true,
  last_checked_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE financial_disclosures (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  company_id UUID REFERENCES monitored_companies(id) ON DELETE CASCADE,
  title TEXT NOT NULL,
  url TEXT,
  disclosure_type TEXT, -- 'earnings', '8-K', '10-Q', '10-K', 'press-release', 'other'
  published_at TIMESTAMPTZ,
  content TEXT,
  content_hash TEXT,
  -- Extracted metrics (nullable)
  revenue BIGINT, -- in thousands USD
  revenue_growth_pct DECIMAL(6, 2),
  eps DECIMAL(8, 4),
  guidance_revenue_low BIGINT,
  guidance_revenue_high BIGINT,
  raw_metrics JSONB, -- full extracted JSON from LLM
  first_seen_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(company_id, content_hash)
);

CREATE INDEX idx_disclosures_company ON financial_disclosures(company_id);
CREATE INDEX idx_disclosures_type ON financial_disclosures(disclosure_type);
CREATE INDEX idx_disclosures_published ON financial_disclosures(published_at DESC);

CREATE TABLE disclosure_alerts (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  disclosure_id UUID REFERENCES financial_disclosures(id),
  alert_type TEXT,
  notified_at TIMESTAMPTZ DEFAULT NOW()
);

Step 2: Scraping IR Pages

Node.js

import KnowledgeSDK from '@knowledgesdk/node';
import crypto from 'crypto';

const sdk = new KnowledgeSDK({ apiKey: process.env.KNOWLEDGESDK_API_KEY });

async function scrapeIRPage(company) {
  console.log(`Scanning IR page for ${company.ticker}...`);

  try {
    const result = await sdk.scrape({ url: company.ir_url });
    const markdown = result.markdown;

    // Extract press release and filing links
    const linkPattern = /\[([^\]]+)\]\((https?:\/\/[^\)]+)\)/g;
    const disclosureLinks = [];
    let match;

    const earningsPatterns = [
      /earnings/i, /quarterly results/i, /q[1-4] \d{4}/i,
      /annual results/i, /financial results/i, /revenue/i,
    ];

    const filingPatterns = [
      /10-[kq]/i, /8-k/i, /proxy/i, /def 14a/i, /annual report/i,
    ];

    while ((match = linkPattern.exec(markdown)) !== null) {
      const [, title, url] = match;

      const isEarnings = earningsPatterns.some(p => p.test(title));
      const isFiling = filingPatterns.some(p => p.test(title));

      if (isEarnings || isFiling) {
        disclosureLinks.push({
          title: title.trim(),
          url,
          type: isEarnings ? 'earnings' : 'filing',
        });
      }
    }

    return disclosureLinks;
  } catch (err) {
    console.error(`Failed to scan IR page for ${company.ticker}:`, err.message);
    return [];
  }
}

async function scrapeDisclosure(link, company) {
  try {
    const result = await sdk.scrape({ url: link.url });
    const contentHash = crypto
      .createHash('sha256')
      .update(result.markdown.slice(0, 3000))
      .digest('hex')
      .slice(0, 16);

    // Detect disclosure type more precisely
    const disclosureType = detectDisclosureType(link.title, result.markdown);

    return {
      companyId: company.id,
      title: result.title || link.title,
      url: link.url,
      disclosureType,
      content: result.markdown,
      contentHash,
    };
  } catch (err) {
    console.error(`Failed to scrape disclosure ${link.url}:`, err.message);
    return null;
  }
}

function detectDisclosureType(title, content) {
  const combined = `${title} ${content.slice(0, 500)}`.toLowerCase();

  if (/10-k|annual report/i.test(combined)) return '10-K';
  if (/10-q|quarterly report/i.test(combined)) return '10-Q';
  if (/8-k/i.test(combined)) return '8-K';
  if (/earnings|quarterly results|q[1-4] \d{4}/i.test(combined)) return 'earnings';
  if (/press release/i.test(combined)) return 'press-release';
  return 'other';
}
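Titles on IR pages usually encode the fiscal period ("Q4 2025 Financial Results", "FY 2024 Annual Report"). A small, hypothetical helper (not part of the pipeline above) can normalize that into structured fields worth storing alongside `disclosure_type`:

```javascript
// Parse a fiscal period like "Q4 2025" or "FY 2024" out of a title.
// Returns { period, year } or null when no period is found.
function parsePeriod(title) {
  const q = title.match(/\bQ([1-4])\s*(?:FY)?\s*(\d{4})\b/i);
  if (q) return { period: `Q${q[1]}`, year: Number(q[2]) };

  const fy = title.match(/\b(?:FY|fiscal year)\s*(\d{4})\b/i);
  if (fy) return { period: 'FY', year: Number(fy[1]) };

  return null;
}

console.log(parsePeriod('Acme Reports Q4 2025 Financial Results'));
// → { period: 'Q4', year: 2025 }
console.log(parsePeriod('Annual Report FY 2024'));
// → { period: 'FY', year: 2024 }
```

Companies with non-calendar fiscal years will still need per-company mapping if you want calendar quarters.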

Python

import os
import re
import hashlib
from typing import List, Dict, Optional
from knowledgesdk import KnowledgeSDK

sdk = KnowledgeSDK(api_key=os.environ["KNOWLEDGESDK_API_KEY"])

EARNINGS_PATTERNS = [
    r'earnings', r'quarterly results', r'q[1-4] \d{4}',
    r'annual results', r'financial results',
]

FILING_PATTERNS = [r'10-[kq]', r'8-k', r'proxy', r'annual report']

def scrape_ir_page(company: Dict) -> List[Dict]:
    print(f"Scanning IR page for {company['ticker']}...")

    try:
        result = sdk.scrape(url=company["ir_url"])
        markdown = result["markdown"]

        link_pattern = r'\[([^\]]+)\]\((https?://[^\)]+)\)'
        disclosure_links = []

        for match in re.finditer(link_pattern, markdown):
            title, url = match.group(1), match.group(2)

            is_earnings = any(re.search(p, title, re.IGNORECASE) for p in EARNINGS_PATTERNS)
            is_filing = any(re.search(p, title, re.IGNORECASE) for p in FILING_PATTERNS)

            if is_earnings or is_filing:
                disclosure_links.append({
                    "title": title.strip(),
                    "url": url,
                    "type": "earnings" if is_earnings else "filing",
                })

        return disclosure_links
    except Exception as e:
        print(f"Failed to scan {company['ticker']}: {e}")
        return []

Step 3: Scraping SEC EDGAR

The SEC's EDGAR system is publicly accessible and reliable. Its company browse pages list recent filings by CIK, and since they're plain HTML you can scrape them like any other page:

async function getRecentSECFilings(cik, formTypes = ['8-K', '10-Q', '10-K']) {
  const paddedCIK = cik.padStart(10, '0');
  const filings = [];

  for (const formType of formTypes) {
    const edgarUrl = `https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=${paddedCIK}&type=${encodeURIComponent(formType)}&dateb=&owner=include&count=5&search_text=`;

    try {
      const result = await sdk.scrape({ url: edgarUrl });
      const markdown = result.markdown;

      // Extract filing index URLs
      const linkPattern = /\[([^\]]+)\]\((https?:\/\/www\.sec\.gov\/Archives\/[^\)]+)\)/g;
      let match;

      while ((match = linkPattern.exec(markdown)) !== null) {
        const [, title, url] = match;
        filings.push({
          title: title.trim(),
          url,
          formType,
          source: 'sec-edgar',
        });
      }
    } catch (err) {
      console.error(`EDGAR query failed for CIK ${cik}, form ${formType}:`, err.message);
    }
  }

  return filings;
}

// Get the actual filing document from an EDGAR filing index page
async function scrapeEDGARFiling(filingIndexUrl) {
  try {
    // First, scrape the index page to find the main document
    const indexResult = await sdk.scrape({ url: filingIndexUrl });

    // Find the primary document link in the index
    const primaryDocPattern = /\[(.*?)\]\((https:\/\/www\.sec\.gov\/Archives\/.*?\.htm)\)/;
    const primaryMatch = indexResult.markdown.match(primaryDocPattern);

    if (!primaryMatch) {
      return { url: filingIndexUrl, content: indexResult.markdown };
    }

    // Scrape the actual filing document
    const docResult = await sdk.scrape({ url: primaryMatch[2] });
    return { url: primaryMatch[2], content: docResult.markdown };
  } catch (err) {
    console.error(`Failed to scrape EDGAR filing ${filingIndexUrl}:`, err.message);
    return null;
  }
}
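The EDGAR URL construction above is easy to get wrong (CIKs must be zero-padded to 10 digits, and form types like "10-Q" should be URL-encoded), so it's worth pulling into a pure helper you can unit-test without network access. A sketch, using an arbitrary example CIK:

```javascript
// Build the EDGAR company-browse URL for a given CIK and form type.
function buildEDGARBrowseUrl(cik, formType, count = 5) {
  const paddedCIK = String(cik).padStart(10, '0'); // EDGAR expects 10 digits
  return (
    'https://www.sec.gov/cgi-bin/browse-edgar' +
    `?action=getcompany&CIK=${paddedCIK}` +
    `&type=${encodeURIComponent(formType)}` +
    `&dateb=&owner=include&count=${count}`
  );
}

console.log(buildEDGARBrowseUrl('320193', '10-K'));
// → https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&type=10-K&dateb=&owner=include&count=5
```

Note that the SEC asks automated clients to identify themselves via a descriptive User-Agent; check whether your scraping layer lets you set one.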

Step 4: AI Extraction of Financial Metrics

Extract structured financial data from earnings press releases:

import OpenAI from 'openai';

const openai = new OpenAI();

async function extractFinancialMetrics(disclosure) {
  if (!['earnings', 'press-release', '10-Q'].includes(disclosure.disclosureType)) {
    return null;
  }

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are a financial analyst extracting key metrics from earnings disclosures.

Extract the following from the provided financial disclosure and return as JSON:
- quarter: string (e.g., "Q4 2025" or "FY 2025")
- revenue_usd_millions: number or null (total revenue in millions USD)
- revenue_growth_yoy_pct: number or null (year-over-year growth %)
- gross_profit_usd_millions: number or null
- gross_margin_pct: number or null
- operating_income_usd_millions: number or null
- net_income_usd_millions: number or null
- eps_diluted: number or null (earnings per share, diluted)
- eps_vs_consensus: "beat" | "met" | "miss" | null
- guidance_next_quarter_revenue_low: number or null (millions USD)
- guidance_next_quarter_revenue_high: number or null (millions USD)
- guidance_raised: boolean or null (did they raise guidance?)
- key_highlights: string[] (top 3-5 business highlights from the report)
- management_tone: "positive" | "cautious" | "negative" (overall management commentary tone)
- risks_mentioned: string[] (key risks mentioned, max 3)

Return only valid JSON. Use null for fields not found.`,
      },
      {
        role: 'user',
        content: `Company: ${disclosure.ticker}\nTitle: ${disclosure.title}\n\n${disclosure.content.slice(0, 6000)}`,
      },
    ],
    response_format: { type: 'json_object' },
    max_tokens: 800,
  });

  try {
    return JSON.parse(response.choices[0].message.content);
  } catch {
    return null;
  }
}

Python
import json
from openai import OpenAI

openai_client = OpenAI()

METRICS_PROMPT = """You are a financial analyst extracting key metrics from earnings disclosures.

Extract the following and return as JSON:
- quarter: string (e.g., "Q4 2025")
- revenue_usd_millions: number or null
- revenue_growth_yoy_pct: number or null
- gross_margin_pct: number or null
- eps_diluted: number or null
- eps_vs_consensus: "beat" | "met" | "miss" | null
- guidance_next_quarter_low: number or null (millions USD)
- guidance_next_quarter_high: number or null (millions USD)
- guidance_raised: boolean or null
- key_highlights: list of strings (max 5)
- management_tone: "positive" | "cautious" | "negative"
- risks_mentioned: list of strings (max 3)

Return only valid JSON."""

def extract_financial_metrics(disclosure: Dict) -> Optional[Dict]:
    if disclosure.get("disclosure_type") not in ["earnings", "press-release", "10-Q"]:
        return None

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": METRICS_PROMPT},
            {
                "role": "user",
                "content": f"Title: {disclosure['title']}\n\n{disclosure['content'][:6000]}",
            },
        ],
        response_format={"type": "json_object"},
        max_tokens=800,
    )

    try:
        return json.loads(response.choices[0].message.content)
    except Exception:
        return None
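LLM output is untrusted input: numbers can come back as strings ("1,234.5"), and enum fields can drift outside the allowed values. Before writing extracted metrics to Postgres, it's prudent to run them through a defensive sanitizer. A sketch (field names match the prompt above; the coercion rules are our own assumptions):

```javascript
// Coerce a value to a finite number, stripping common formatting
// ("1,234.5", "$12"); return null for anything unparseable.
function toNumber(v) {
  if (typeof v === 'number' && Number.isFinite(v)) return v;
  if (typeof v === 'string') {
    const n = Number(v.replace(/[,$\s]/g, ''));
    return Number.isFinite(n) ? n : null;
  }
  return null;
}

// Validate and normalize the LLM's JSON before it reaches the DB.
function sanitizeMetrics(raw) {
  const tones = ['positive', 'cautious', 'negative'];
  return {
    quarter: typeof raw.quarter === 'string' ? raw.quarter : null,
    revenue_usd_millions: toNumber(raw.revenue_usd_millions),
    revenue_growth_yoy_pct: toNumber(raw.revenue_growth_yoy_pct),
    eps_diluted: toNumber(raw.eps_diluted),
    guidance_raised: typeof raw.guidance_raised === 'boolean' ? raw.guidance_raised : null,
    management_tone: tones.includes(raw.management_tone) ? raw.management_tone : null,
  };
}

console.log(sanitizeMetrics({
  quarter: 'Q4 2025',
  revenue_usd_millions: '1,234.5',
  eps_diluted: 2.13,
  management_tone: 'bullish',
}));
// revenue parses to 1234.5; the unknown tone becomes null
```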

Step 5: Alert on New Filings

The monitoring agent runs on a schedule and fires alerts when new disclosures are detected:

import pg from 'pg';

const db = new pg.Pool({ connectionString: process.env.DATABASE_URL });

async function runFinancialMonitorAgent() {
  console.log(`[${new Date().toISOString()}] Financial monitor starting...`);

  const { rows: companies } = await db.query(
    `SELECT * FROM monitored_companies WHERE active = true ORDER BY last_checked_at ASC NULLS FIRST LIMIT 20`
  );

  const newDisclosures = [];

  for (const company of companies) {
    // Get links from IR page
    const irLinks = await scrapeIRPage(company);

    // Also check SEC EDGAR if CIK is known
    let edgarLinks = [];
    if (company.sec_cik) {
      edgarLinks = await getRecentSECFilings(company.sec_cik, ['8-K', '10-Q']);
    }

    const allLinks = [...irLinks, ...edgarLinks];

    for (const link of allLinks.slice(0, 10)) { // Limit per company
      const disclosure = await scrapeDisclosure(link, company);
      if (!disclosure) continue;

      // Try to insert — if content hash already exists, it's not new
      const result = await db.query(`
        INSERT INTO financial_disclosures
          (company_id, title, url, disclosure_type, content, content_hash)
        VALUES ($1, $2, $3, $4, $5, $6)
        ON CONFLICT (company_id, content_hash) DO NOTHING
        RETURNING id
      `, [
        disclosure.companyId,
        disclosure.title,
        disclosure.url,
        disclosure.disclosureType,
        disclosure.content,
        disclosure.contentHash,
      ]);

      if (result.rows.length > 0) {
        // New disclosure found
        const disclosureId = result.rows[0].id;

        // Extract financial metrics if it's an earnings report
        const metrics = await extractFinancialMetrics({
          ...disclosure,
          ticker: company.ticker,
        });

        if (metrics) {
          await db.query(`
            UPDATE financial_disclosures
            SET revenue = $2, revenue_growth_pct = $3, eps = $4, raw_metrics = $5
            WHERE id = $1
          `, [
            disclosureId,
            metrics.revenue_usd_millions ? Math.round(metrics.revenue_usd_millions * 1000) : null,
            metrics.revenue_growth_yoy_pct,
            metrics.eps_diluted,
            JSON.stringify(metrics),
          ]);
        }

        newDisclosures.push({
          company,
          disclosure: { ...disclosure, id: disclosureId },
          metrics,
        });
      }
    }

    await db.query(
      `UPDATE monitored_companies SET last_checked_at = NOW() WHERE id = $1`,
      [company.id]
    );

    // Brief pause between companies
    await new Promise(r => setTimeout(r, 2000));
  }

  // Fire alerts for new disclosures
  for (const item of newDisclosures) {
    await fireDisclosureAlert(item);
  }

  console.log(`Agent complete. ${newDisclosures.length} new disclosures found.`);
}

async function fireDisclosureAlert({ company, disclosure, metrics }) {
  const webhookUrl = process.env.ALERT_WEBHOOK_URL;
  if (!webhookUrl) return;

  const payload = {
    event: 'financial_disclosure.new',
    ticker: company.ticker,
    companyName: company.name,
    disclosure: {
      title: disclosure.title,
      url: disclosure.url,
      type: disclosure.disclosureType,
    },
    metrics: metrics ? {
      quarter: metrics.quarter,
      revenue: metrics.revenue_usd_millions,
      revenueGrowth: metrics.revenue_growth_yoy_pct,
      eps: metrics.eps_diluted,
      guidanceRaised: metrics.guidance_raised,
      managementTone: metrics.management_tone,
    } : null,
    timestamp: new Date().toISOString(),
  };

  await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
}
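The `disclosure_alerts` table has an `alert_type` column the code above never populates. One way to derive it from the disclosure and its extracted metrics (the type names here are illustrative, not a fixed taxonomy):

```javascript
// Map a disclosure + extracted metrics to an alert_type string,
// most-significant signal first.
function classifyAlert(disclosure, metrics) {
  if (metrics && metrics.guidance_raised === true) return 'guidance_raised';
  if (metrics && metrics.eps_vs_consensus === 'miss') return 'earnings_miss';
  if (disclosure.disclosureType === '8-K') return 'new_8k';
  if (disclosure.disclosureType === 'earnings') return 'new_earnings';
  return 'new_filing';
}

console.log(classifyAlert({ disclosureType: 'earnings' }, { guidance_raised: true }));
// → guidance_raised
```

Downstream consumers can then route on `alert_type` (e.g., page someone for `earnings_miss`, but only log `new_filing`).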

A Note on PDFs

Many IR page filings — particularly 10-K annual reports, investor presentations, and proxy statements — are published as PDFs. KnowledgeSDK can extract text content from PDFs via the scrape endpoint, and this works well for simple, text-heavy PDFs like 8-K press releases and short earnings releases.

However, complex PDF layouts present challenges:

  • Multi-column financial tables — columns may merge incorrectly in text extraction
  • Charts and graphs — visual data is not extracted
  • Footnotes in running text — footnote references may interrupt sentence flow
  • Scanned documents — older filings are image-based PDFs with no text layer

For workflows that depend heavily on structured financial tables from 10-K or 10-Q filings, consider complementing KnowledgeSDK with a dedicated PDF extraction service. For HTML-based earnings press releases and 8-K current reports, KnowledgeSDK is fully sufficient.
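If you do route PDFs to a separate extractor, a cheap URL pre-check can decide the path before scraping. This is a heuristic, not foolproof (some PDF URLs have no extension, and some are only identifiable by the Content-Type response header):

```javascript
// Heuristic: does the URL path end in .pdf?
function looksLikePDF(url) {
  try {
    return new URL(url).pathname.toLowerCase().endsWith('.pdf');
  } catch {
    return false; // malformed URL
  }
}

console.log(looksLikePDF('https://ir.example.com/annual-report-2025.pdf')); // → true
console.log(looksLikePDF('https://ir.example.com/press-release.htm'));      // → false
```

For ambiguous links, an HTTP HEAD request checking for `application/pdf` is a more reliable fallback.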

Search Across Historical Disclosures

Use KnowledgeSDK's search endpoint to query financial disclosures semantically:

async function searchFinancialData(query, ticker) {
  // Scope the query to a specific ticker when one is provided
  const fullQuery = ticker ? `${ticker} ${query}` : query;

  const results = await sdk.search({
    query: fullQuery,
    limit: 10,
  });

  return results.results.map(r => ({
    title: r.title,
    url: r.url,
    excerpt: r.content.slice(0, 300),
  }));
}

// Example queries
const cloudResults = await searchFinancialData('cloud revenue growth AWS Azure', 'MSFT');
const guidanceResults = await searchFinancialData('raised full-year guidance fiscal 2026', null);
const riskResults = await searchFinancialData('AI competition risk mentioned earnings', null);

FAQ

Is scraping earnings press releases legal? Yes. Earnings press releases are public disclosures published by companies for investors. Scraping publicly available financial disclosures is legal. Always respect robots.txt and never access non-public information.

Can I use this for trading decisions? We make no representations about the suitability of this data for trading. Financial data extracted from web scraping may contain errors, delays, or misparses. Always verify material data against official sources before making investment decisions.

How do I get a company's SEC CIK number? Use the SEC's company search at https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany, or download the ticker-to-CIK mapping the SEC publishes at https://www.sec.gov/files/company_tickers.json.

How quickly after publication are filings available to scrape? EDGAR filings are typically available within minutes of acceptance. IR pages are updated at the time of publication. Running the monitor agent every 2-4 hours is sufficient for most use cases.

What about real-time market data (stock prices, options)? Real-time market data requires exchange licenses and is not something to scrape from retail finance sites. Use official market data providers (Polygon.io, Alpaca, Interactive Brokers API) for that use case.


Monitor public financial disclosures automatically and never miss an earnings release again. Get started at knowledgesdk.com/setup.

Try it now

Scrape, search, and monitor any website with one API.

Get your API key in 30 seconds. First 1,000 requests free.

GET API KEY →