Infrastructure & DevOps · Beginner

Also known as: response time, API latency

Latency

The time delay between sending an API request and receiving the response, a critical metric for real-time AI applications.

What Is Latency?

Latency is the elapsed time between initiating an API request and receiving the first byte (or full response) from the server. It is measured in milliseconds (ms) and is one of the most visible performance characteristics of any API-driven application.

High latency translates directly to slow user experiences: a search box that takes two seconds to return results, or an AI agent that pauses noticeably between tool calls.

Components of API Latency

Total latency is the sum of several distinct delays:

  • Network latency — the time for packets to travel between client and server (speed of light over physical distance).
  • DNS resolution — looking up the server's IP address (typically 5–100 ms, cacheable).
  • TLS handshake — negotiating an encrypted connection (adds 1–2 round trips on first connection).
  • Server processing time — the API executing its logic: database queries, AI inference, web scraping, etc.
  • Response serialization — converting the result to JSON and streaming it back.

For KnowledgeSDK endpoints like POST /v1/search, most of the latency budget is spent on vector similarity search in Typesense, not on the network.
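To build intuition, the component delays above can be summed into a rough budget. The numbers below are illustrative placeholders for the arithmetic, not measured KnowledgeSDK figures:

```typescript
// Illustrative latency budget for a first-time request.
// All values are assumptions chosen for the arithmetic, not measurements.
const budgetMs = {
  dnsResolution: 30,     // cacheable after the first lookup
  tlsHandshake: 60,      // 1–2 extra round trips on a fresh connection
  networkTransit: 30,    // one request/response round trip
  serverProcessing: 120, // e.g. a vector similarity search
  serialization: 10,     // JSON encoding and streaming
};

const totalMs = Object.values(budgetMs).reduce((sum, ms) => sum + ms, 0);
console.log(`Estimated first-request latency: ${totalMs} ms`); // 250 ms
```

On a warm connection the DNS and TLS terms drop out entirely, which is why connection reuse is one of the cheapest latency wins.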

Latency Percentiles

Averages are misleading. A single slow request can skew the mean dramatically. Instead, track percentiles:

  Metric   Meaning
  p50      Median — half of requests are faster than this
  p95      95% of requests are faster than this
  p99      99% of requests are faster than this
  p99.9    The "worst" 1-in-1,000 request

A well-tuned API might have p50 = 80 ms and p99 = 500 ms. If your p99 is 10 seconds, users will notice — even if the average looks fine.
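The gap between the mean and the percentiles is easy to demonstrate. The sample durations below are made up: nine fast requests and one 5-second outlier:

```typescript
// Nearest-rank percentile over a sample of request durations (ms).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[rank];
}

// Made-up sample: nine fast requests and one 5-second outlier.
const latencies = [80, 85, 90, 95, 100, 120, 150, 200, 450, 5000];

const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length;
console.log(`mean = ${mean} ms`);                      // 637 — dominated by the outlier
console.log(`p50  = ${percentile(latencies, 50)} ms`); // 100 — what a typical user sees
console.log(`p99  = ${percentile(latencies, 99)} ms`); // 5000 — the tail users complain about
```

One slow request moved the mean to six times the median; the percentiles keep the typical case and the tail visible separately.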

Latency vs. Throughput

These two metrics are related but distinct:

  • Latency — how long a single request takes.
  • Throughput — how many requests the system can handle per second.

You can often trade one for the other. Batching multiple requests together increases throughput but adds latency to individual items. Processing requests individually minimizes latency but may reduce throughput.
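A toy simulation makes the distinction concrete — `fakeRequest` below is a stand-in for any ~100 ms API call, not a real endpoint:

```typescript
// Simulated API call with ~100 ms latency (a stand-in, not a real endpoint).
const fakeRequest = () => new Promise<void>((resolve) => setTimeout(resolve, 100));

async function measure(fn: () => Promise<unknown>): Promise<number> {
  const start = performance.now();
  await fn();
  return performance.now() - start;
}

async function main() {
  // One at a time: each request still takes ~100 ms, but wall time is ~1,000 ms.
  const sequentialMs = await measure(async () => {
    for (let i = 0; i < 10; i++) await fakeRequest();
  });

  // All ten in flight at once: same per-request latency, ~10x the throughput.
  const concurrentMs = await measure(() =>
    Promise.all(Array.from({ length: 10 }, fakeRequest))
  );

  console.log(`sequential: ${sequentialMs.toFixed(0)} ms`);
  console.log(`concurrent: ${concurrentMs.toFixed(0)} ms`);
  return { sequentialMs, concurrentMs };
}

main();
```

Note that per-request latency is identical in both runs; only the wall-clock time for the batch — the throughput — changes.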

Reducing Latency in Practice

For synchronous endpoints (POST /v1/scrape, POST /v1/search)

  • Colocate your server with the API. If KnowledgeSDK runs in us-east-1, deploy your backend there too.
  • Reuse HTTP connections. Use a connection pool or HTTP/2 to avoid per-request TLS handshake overhead.
  • Cache responses. KnowledgeSDK caches extraction results — if you request the same URL twice, the second call returns faster.
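Connection reuse is often the cheapest of these wins. A sketch using Node's built-in `https` module — the keep-alive agent is standard Node API, while the endpoint and request shape simply follow this page's examples:

```typescript
import https from "node:https";

// A keep-alive agent holds TCP connections (and their TLS sessions) open
// between requests, so only the first call pays the handshake cost.
const keepAliveAgent = new https.Agent({
  keepAlive: true, // reuse sockets instead of closing them after each request
  maxSockets: 10,  // cap concurrent connections per host
});

function search(query: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const req = https.request(
      {
        hostname: "api.knowledgesdk.com",
        path: "/v1/search",
        method: "POST",
        agent: keepAliveAgent, // connection reuse happens here
        headers: {
          "x-api-key": process.env.KNOWLEDGE_API_KEY ?? "",
          "Content-Type": "application/json",
        },
      },
      (res) => {
        let body = "";
        res.setEncoding("utf8");
        res.on("data", (chunk) => (body += chunk));
        res.on("end", () => resolve(body));
      }
    );
    req.on("error", reject);
    req.end(JSON.stringify({ query }));
  });
}
```

With the agent in place, the second and subsequent calls skip DNS resolution and the TLS handshake entirely.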

For long-running operations (POST /v1/extract)

  • Switch to async. POST /v1/extract/async returns a jobId in milliseconds, then delivers the result via webhook when the 1–3 minute extraction is complete. Your application remains responsive while waiting.
  • Stream results where possible. For operations that produce incremental output, streaming reduces time-to-first-token dramatically.
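A sketch of the async pattern described above. The endpoint path comes from this page; the `webhookUrl` request field and the `{ jobId }` response shape are assumptions, not a documented contract:

```typescript
// Start a long-running extraction without blocking the caller.
// ASSUMPTION: the `webhookUrl` field and `{ jobId }` response shape are
// inferred from the description above, not taken from API documentation.
async function startExtraction(url: string): Promise<string> {
  const res = await fetch("https://api.knowledgesdk.com/v1/extract/async", {
    method: "POST",
    headers: {
      "x-api-key": process.env.KNOWLEDGE_API_KEY!,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url,
      webhookUrl: "https://example.com/hooks/extract-done", // hypothetical receiver
    }),
  });
  const { jobId } = (await res.json()) as { jobId: string };
  return jobId; // returned in milliseconds; the result arrives later via webhook
}
```

The caller's perceived latency is the time to get the `jobId`, not the 1–3 minutes the extraction itself takes.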

Measuring Latency

// Measure wall-clock latency of a single search call.
// Note: this times the full response, including JSON parsing —
// not time-to-first-byte.
const start = performance.now();
const res = await fetch("https://api.knowledgesdk.com/v1/search", {
  method: "POST",
  headers: {
    "x-api-key": process.env.KNOWLEDGE_API_KEY!,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ query: "machine learning pipelines" }),
});
const data = await res.json();
const latencyMs = performance.now() - start;
console.log(`Search latency: ${latencyMs.toFixed(1)} ms`);

Track this metric over time in your observability stack to detect regressions before users notice them.
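A minimal in-process sketch of that tracking. Real deployments would export these numbers to Prometheus, Datadog, or similar; the window size and p99 budget here are illustrative:

```typescript
// Rolling window of recent latencies with a p99 budget check.
// Window size and budget are illustrative defaults, not recommendations.
class LatencyRecorder {
  private samples: number[] = [];

  constructor(
    private readonly windowSize = 1000, // keep the last N samples
    private readonly p99BudgetMs = 500  // alert threshold
  ) {}

  record(ms: number): void {
    this.samples.push(ms);
    if (this.samples.length > this.windowSize) this.samples.shift();
  }

  p99(): number {
    if (this.samples.length === 0) return 0;
    const sorted = [...this.samples].sort((a, b) => a - b);
    return sorted[Math.max(0, Math.ceil(sorted.length * 0.99) - 1)];
  }

  overBudget(): boolean {
    return this.p99() > this.p99BudgetMs;
  }
}

const recorder = new LatencyRecorder();
[80, 90, 110, 95, 2000].forEach((ms) => recorder.record(ms));
console.log(`p99 = ${recorder.p99()} ms, over budget: ${recorder.overBudget()}`);
```

Wire `recorder.record(latencyMs)` into the measurement snippet above and alert when `overBudget()` flips to true.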

Related Terms

Throughput (Infrastructure & DevOps · Beginner)
The number of requests or operations a system can process per unit of time, a key performance metric for scraping and search APIs.

Rate Limiting (Infrastructure & DevOps · Beginner)
A control mechanism that restricts how many API requests a client can make within a given time window.

Async API (Infrastructure & DevOps · Intermediate)
An API design pattern where long-running operations return a job ID immediately and deliver results via polling or webhook when complete.
