Infrastructure & DevOps · Beginner

Also known as: response time, API latency

Latency

The time delay between sending an API request and receiving the response, a critical metric for real-time AI applications.

What Is Latency?

Latency is the elapsed time between initiating an API request and receiving the first byte (or full response) from the server. It is measured in milliseconds (ms) and is one of the most visible performance characteristics of any API-driven application.

High latency translates directly to slow user experiences: a search box that takes two seconds to return results, or an AI agent that pauses noticeably between tool calls.

Components of API Latency

Total latency is the sum of several distinct delays:

  • Network latency — the time for packets to travel between client and server (speed of light over physical distance).
  • DNS resolution — looking up the server's IP address (typically 5–100 ms, cacheable).
  • TLS handshake — negotiating an encrypted connection (adds 1–2 round trips on first connection).
  • Server processing time — the API executing its logic: database queries, AI inference, web scraping, etc.
  • Response serialization — converting the result to JSON and streaming it back.

For KnowledgeSDK endpoints like POST /v1/search, most of the latency budget is spent on vector similarity search in Typesense, not on the network.
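To build intuition, the component delays above can be summed into a rough budget. The numbers below are illustrative placeholders for the arithmetic, not measured KnowledgeSDK figures:

```typescript
// Illustrative latency budget for a first-time request.
// All values are assumptions chosen for the arithmetic, not measurements.
const budgetMs = {
  dnsResolution: 30,     // cacheable after the first lookup
  tlsHandshake: 60,      // 1–2 extra round trips on a fresh connection
  networkTransit: 30,    // one request/response round trip
  serverProcessing: 120, // e.g. a vector similarity search
  serialization: 10,     // JSON encoding and streaming
};

const totalMs = Object.values(budgetMs).reduce((sum, ms) => sum + ms, 0);
console.log(`Estimated first-request latency: ${totalMs} ms`); // 250 ms
```

On a warm connection the DNS and TLS terms drop out entirely, which is why connection reuse is one of the cheapest latency wins.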

Latency Percentiles

Averages are misleading. A single slow request can skew the mean dramatically. Instead, track percentiles:

  Metric   Meaning
  p50      Median — half of requests are faster than this
  p95      95% of requests are faster than this
  p99      99% of requests are faster than this
  p99.9    The "worst" 1-in-1,000 request

A well-tuned API might have p50 = 80 ms and p99 = 500 ms. If your p99 is 10 seconds, users will notice — even if the average looks fine.
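The gap between the mean and the percentiles is easy to demonstrate. The sample durations below are made up: nine fast requests and one 5-second outlier:

```typescript
// Nearest-rank percentile over a sample of request durations (ms).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[rank];
}

// Made-up sample: nine fast requests and one 5-second outlier.
const latencies = [80, 85, 90, 95, 100, 120, 150, 200, 450, 5000];

const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length;
console.log(`mean = ${mean} ms`);                      // 637 — dominated by the outlier
console.log(`p50  = ${percentile(latencies, 50)} ms`); // 100 — what a typical user sees
console.log(`p99  = ${percentile(latencies, 99)} ms`); // 5000 — the tail users complain about
```

One slow request moved the mean to six times the median; the percentiles keep the typical case and the tail visible separately.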

Latency vs. Throughput

These two metrics are related but distinct:

  • Latency — how long a single request takes.
  • Throughput — how many requests the system can handle per second.

You can often trade one for the other. Batching multiple requests together increases throughput but adds latency to individual items. Processing requests individually minimizes latency but may reduce throughput.
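A toy simulation makes the distinction concrete — `fakeRequest` below is a stand-in for any ~100 ms API call, not a real endpoint:

```typescript
// Simulated API call with ~100 ms latency (a stand-in, not a real endpoint).
const fakeRequest = () => new Promise<void>((resolve) => setTimeout(resolve, 100));

async function measure(fn: () => Promise<unknown>): Promise<number> {
  const start = performance.now();
  await fn();
  return performance.now() - start;
}

async function main() {
  // One at a time: each request still takes ~100 ms, but wall time is ~1,000 ms.
  const sequentialMs = await measure(async () => {
    for (let i = 0; i < 10; i++) await fakeRequest();
  });

  // All ten in flight at once: same per-request latency, ~10x the throughput.
  const concurrentMs = await measure(() =>
    Promise.all(Array.from({ length: 10 }, fakeRequest))
  );

  console.log(`sequential: ${sequentialMs.toFixed(0)} ms`);
  console.log(`concurrent: ${concurrentMs.toFixed(0)} ms`);
  return { sequentialMs, concurrentMs };
}

main();
```

Note that per-request latency is identical in both runs; only the wall-clock time for the batch — the throughput — changes.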

Reducing Latency in Practice

For synchronous endpoints (POST /v1/scrape, POST /v1/search)

  • Colocate your server with the API. If KnowledgeSDK runs in us-east-1, deploy your backend there too.
  • Reuse HTTP connections. Use a connection pool or HTTP/2 to avoid per-request TLS handshake overhead.
  • Cache responses. KnowledgeSDK caches extraction results — if you request the same URL twice, the second call returns faster.
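Connection reuse is often the cheapest of these wins. A sketch using Node's built-in `https` module — the keep-alive agent is standard Node API, while the endpoint and request shape simply follow this page's examples:

```typescript
import https from "node:https";

// A keep-alive agent holds TCP connections (and their TLS sessions) open
// between requests, so only the first call pays the handshake cost.
const keepAliveAgent = new https.Agent({
  keepAlive: true, // reuse sockets instead of closing them after each request
  maxSockets: 10,  // cap concurrent connections per host
});

function search(query: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const req = https.request(
      {
        hostname: "api.knowledgesdk.com",
        path: "/v1/search",
        method: "POST",
        agent: keepAliveAgent, // connection reuse happens here
        headers: {
          "x-api-key": process.env.KNOWLEDGE_API_KEY ?? "",
          "Content-Type": "application/json",
        },
      },
      (res) => {
        let body = "";
        res.setEncoding("utf8");
        res.on("data", (chunk) => (body += chunk));
        res.on("end", () => resolve(body));
      }
    );
    req.on("error", reject);
    req.end(JSON.stringify({ query }));
  });
}
```

With the agent in place, the second and subsequent calls skip DNS resolution and the TLS handshake entirely.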

For long-running operations (POST /v1/extract)

  • Switch to async. POST /v1/extract/async returns a jobId in milliseconds, then delivers the result via webhook when the 1–3 minute extraction is complete. Your application remains responsive while waiting.
  • Stream results where possible. For operations that produce incremental output, streaming reduces time-to-first-token dramatically.
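A sketch of the async pattern described above. The endpoint path comes from this page; the `webhookUrl` request field and the `{ jobId }` response shape are assumptions, not a documented contract:

```typescript
// Start a long-running extraction without blocking the caller.
// ASSUMPTION: the `webhookUrl` field and `{ jobId }` response shape are
// inferred from the description above, not taken from API documentation.
async function startExtraction(url: string): Promise<string> {
  const res = await fetch("https://api.knowledgesdk.com/v1/extract/async", {
    method: "POST",
    headers: {
      "x-api-key": process.env.KNOWLEDGE_API_KEY!,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      webhookUrl: "https://example.com/hooks/extract-done", // hypothetical receiver
    }),
  });
  const { jobId } = (await res.json()) as { jobId: string };
  return jobId; // returned in milliseconds; the result arrives later via webhook
}
```

The caller's perceived latency is the time to get the `jobId`, not the 1–3 minutes the extraction itself takes.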

Measuring Latency

// Measure wall-clock latency of a single search call.
// Note: this times the full response, including JSON parsing —
// not time-to-first-byte.
const start = performance.now();
const res = await fetch("https://api.knowledgesdk.com/v1/search", {
  method: "POST",
  headers: {
    "x-api-key": process.env.KNOWLEDGE_API_KEY!,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ query: "machine learning pipelines" }),
});
const data = await res.json();
const latencyMs = performance.now() - start;
console.log(`Search latency: ${latencyMs.toFixed(1)} ms`);

Track this metric over time in your observability stack to detect regressions before users notice them.
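A minimal in-process sketch of that tracking. Real deployments would export these numbers to Prometheus, Datadog, or similar; the window size and p99 budget here are illustrative:

```typescript
// Rolling window of recent latencies with a p99 budget check.
// Window size and budget are illustrative defaults, not recommendations.
class LatencyRecorder {
  private samples: number[] = [];

  constructor(
    private readonly windowSize = 1000, // keep the last N samples
    private readonly p99BudgetMs = 500  // alert threshold
  ) {}

  record(ms: number): void {
    this.samples.push(ms);
    if (this.samples.length > this.windowSize) this.samples.shift();
  }

  p99(): number {
    if (this.samples.length === 0) return 0;
    const sorted = [...this.samples].sort((a, b) => a - b);
    return sorted[Math.max(0, Math.ceil(sorted.length * 0.99) - 1)];
  }

  overBudget(): boolean {
    return this.p99() > this.p99BudgetMs;
  }
}

const recorder = new LatencyRecorder();
[80, 90, 110, 95, 2000].forEach((ms) => recorder.record(ms));
console.log(`p99 = ${recorder.p99()} ms, over budget: ${recorder.overBudget()}`);
```

Wire `recorder.record(latencyMs)` into the measurement snippet above and alert when `overBudget()` flips to true.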

Related Terms

Throughput (Infrastructure & DevOps · Beginner)
The number of requests or operations a system can process per unit of time, a key performance metric for scraping and search APIs.

Rate Limiting (Infrastructure & DevOps · Beginner)
A control mechanism that restricts how many API requests a client can make within a given time window.

Async API (Infrastructure & DevOps · Intermediate)
An API design pattern where long-running operations return a job ID immediately and deliver results via polling or webhook when complete.
