architecture · March 20, 2026 · 8 min read

Building LLM-Agnostic RAG: Switch Between OpenAI, Anthropic, and Gemini Freely

Avoid LLM vendor lock-in in your RAG pipeline. Design your knowledge extraction and search layer to work with any LLM provider — and switch without rewriting.


Here's how most RAG systems get built: a developer chooses OpenAI because it's familiar, wires the extraction, embedding, and generation layers together around the OpenAI API, and ships. Six months later, Anthropic releases a model with better reasoning for their use case, or the company wants to move to a self-hosted model for data privacy reasons. The migration estimate comes back at three to four weeks.

The root cause is almost always the same: the retrieval and generation layers are coupled. Switching providers requires touching almost everything.

LLM-agnostic RAG is not a difficult architectural goal — but it requires making one deliberate decision early: keep the knowledge layer independent of the generation layer.

The Vendor Lock-In Risk

The problem is not using OpenAI (or Anthropic, or Gemini). All of these are solid providers. The problem is designing your system such that swapping one of them out requires significant rework.

Common coupling points:

  • Embedding models tied to the LLM provider: If you use text-embedding-ada-002 or text-embedding-3-small for indexing, you must use the same model for query embedding. If you later want to use Anthropic as your generator, your retrieval layer is still tied to OpenAI (see the sketch after this list).
  • Prompt formats hardcoded: OpenAI uses a messages array with role: "user" and role: "assistant". Anthropic uses a nearly identical format but with different system prompt handling. Gemini uses a contents array. Small differences, but enough to break things.
  • Output parsing tied to provider SDK: If you parse streaming responses using the OpenAI Python SDK's streaming helpers, none of that code works with Anthropic or Gemini clients.
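
To make the first coupling point concrete, here's a sketch of what that anti-pattern looks like (the embedQuery helper is illustrative, and the vector-search step is elided):

// Anti-pattern sketch: retrieval hardwired to OpenAI's embedding API.
// Every query must be embedded with the same model the index was built with.
import OpenAI from "openai";

const openai = new OpenAI();

async function embedQuery(query: string): Promise<number[]> {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  return data[0].embedding;
}

Even if generation moves to Anthropic tomorrow, this function (and your OpenAI account) stays in the hot path of every search.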

The solution is a three-layer architecture with a clean interface boundary between retrieval and generation.

The Three-Layer Architecture

Layer 1: Knowledge layer — everything involved in acquiring, storing, and searching documents. This layer does not care about which LLM will consume its output.

Layer 2: Context assembly layer — takes search results from the knowledge layer and formats them as a context string suitable for injection into a prompt. This layer is LLM-aware in terms of token budget management, but not in terms of provider-specific APIs.

Layer 3: Generation layer — the swappable LLM. Takes a query and a context string, returns an answer. This is where provider-specific code lives, and it's isolated behind an interface.

The critical insight: Layers 1 and 2 never change when you switch LLM providers. Only Layer 3 changes.
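
In TypeScript, that boundary can be a single interface (a sketch; the name is illustrative, not from any SDK):

// Hypothetical contract between Layers 2 and 3. Everything above it is
// provider-neutral; everything below it is a swappable adapter.
interface Generator {
  generate(context: string, query: string): Promise<string>;
}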

The Knowledge Layer Stays Stable

When you use KnowledgeSDK for the knowledge layer, this separation is already enforced. POST /v1/extract takes a URL and returns structured markdown. POST /v1/search takes a query and returns ranked chunks. Neither endpoint cares what happens to that content next — whether it goes to GPT-4o, Claude Sonnet, or Gemini Pro is irrelevant.

// This code never changes regardless of which LLM you use
interface SearchResult {
  title: string;
  content: string;
}

async function retrieveContext(query: string): Promise<string> {
  const response = await fetch("https://api.knowledgesdk.com/v1/search", {
    method: "POST",
    headers: {
      "x-api-key": process.env.KNOWLEDGESDK_API_KEY!,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ query, limit: 5 }),
  });

  if (!response.ok) {
    throw new Error(`Search request failed: ${response.status}`);
  }

  const { results } = (await response.json()) as { results: SearchResult[] };

  return results
    .map((r) => `[${r.title}]\n${r.content}`)
    .join("\n\n---\n\n");
}

This function is the entire retrieval layer. It has no OpenAI imports. It will work identically whether you switch to Anthropic tomorrow or Gemini next year.

The Generation Layer: Provider Adapters

The generation layer is where provider differences live. The cleanest approach is a simple adapter pattern — one function per provider, all sharing the same interface:

type Provider = "openai" | "anthropic" | "gemini";

async function generateAnswer(
  context: string,
  query: string,
  provider: Provider
): Promise<string> {
  switch (provider) {
    case "openai":
      return generateWithOpenAI(context, query);
    case "anthropic":
      return generateWithAnthropic(context, query);
    case "gemini":
      return generateWithGemini(context, query);
  }
}

Each adapter handles the provider-specific message format:

import OpenAI from "openai";

async function generateWithOpenAI(context: string, query: string): Promise<string> {
  const openai = new OpenAI();
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Answer questions using only the provided context.\n\nContext:\n${context}`,
      },
      { role: "user", content: query },
    ],
  });
  return response.choices[0].message.content ?? "";
}

import Anthropic from "@anthropic-ai/sdk";

async function generateWithAnthropic(context: string, query: string): Promise<string> {
  const anthropic = new Anthropic();
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: `Answer questions using only the provided context.\n\nContext:\n${context}`,
    messages: [{ role: "user", content: query }],
  });
  return response.content[0].type === "text" ? response.content[0].text : "";
}

import { GoogleGenerativeAI } from "@google/generative-ai";

async function generateWithGemini(context: string, query: string): Promise<string> {
  const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);
  const model = genAI.getGenerativeModel({ model: "gemini-pro" });
  const result = await model.generateContent(
    `Context:\n${context}\n\nQuestion: ${query}`
  );
  return result.response.text();
}

The system prompt text is identical across all three adapters. Only the API call structure differs.
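
Since the instruction text is shared, it's worth hoisting it into one place so the adapters can't drift apart (a small refactor sketch):

// Single source of truth for the RAG instruction, reused by every adapter
const systemPrompt = (context: string): string =>
  `Answer questions using only the provided context.\n\nContext:\n${context}`;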

The Full Pipeline

Putting the layers together:

async function rag(query: string, provider: Provider = "openai"): Promise<string> {
  // Layer 1: Knowledge retrieval (never changes)
  const context = await retrieveContext(query);

  // Layer 2: Context assembly (format check, token budget)
  const trimmedContext = context.slice(0, 8000); // crude budget: characters, not tokens

  // Layer 3: Generation (swappable)
  return generateAnswer(trimmedContext, query, provider);
}

// Usage
const answer = await rag("What are the pricing plans?", "anthropic");

Switching providers is a one-argument change. The retrieval, indexing, and context assembly are completely unaffected.
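
The character slice above is deliberately crude. If you want a real token budget, a tokenizer-based trim is a small extension (a sketch using the js-tiktoken package; the cl100k_base encoding and 8,000-token limit are assumptions, since exact tokenization varies by provider):

import { getEncoding } from "js-tiktoken";

// Trim context to an approximate token budget instead of a character count.
// cl100k_base is an assumption; each provider tokenizes slightly differently.
function trimToTokens(context: string, maxTokens = 8000): string {
  const enc = getEncoding("cl100k_base");
  const tokens = enc.encode(context);
  if (tokens.length <= maxTokens) return context;
  return enc.decode(tokens.slice(0, maxTokens));
}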

The Embedding Coupling Problem

If you're building your own embedding layer instead of using KnowledgeSDK's managed search, there's one more coupling point to address: the embedding model.

If you index your documents with OpenAI's text-embedding-3-small, you must also embed queries with text-embedding-3-small at search time. This doesn't stop you from using Anthropic as the generator — but it does mean your infrastructure still touches the OpenAI API even when you've "switched" to Anthropic.

The cleanest solution for true provider independence: use an open-source embedding model (BGE-M3 or Nomic Embed Text V2) hosted on your own infrastructure. Your embedding layer becomes independent of all LLM providers. The generator can be swapped freely. The only external dependency in your retrieval stack is your own vector database.

This matters most for organizations with strict data residency requirements, where sending document content to OpenAI's embedding API is itself the constraint.
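
As a sketch of what that looks like in practice, here's query embedding against a self-hosted Ollama instance (assumes Ollama is running locally with the nomic-embed-text model pulled; the helper name is illustrative):

// Self-hosted embedding: no LLM provider in the retrieval path.
// Assumes a local Ollama instance with nomic-embed-text pulled.
async function embedQueryLocally(query: string): Promise<number[]> {
  const response = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: query }),
  });
  const { embedding } = await response.json();
  return embedding;
}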

Prompt Format Differences

For completeness, here's what actually differs between providers at the API level:

OpenAI: messages: [{role: "system", content: "..."}, {role: "user", content: "..."}]

Anthropic: Separate system parameter at the top level; messages array contains only user and assistant turns.

Gemini: contents: [{role: "user", parts: [{text: "..."}]}] with system instructions via systemInstruction.

These are small differences, but they're enough to cause runtime errors if you try to use one provider's SDK with another's format. The adapter pattern above handles all of these correctly.
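
For example, the Gemini adapter earlier inlined the instructions into the user prompt. Using systemInstruction instead keeps it structurally parallel to the other two adapters (a sketch against the @google/generative-ai SDK, assuming a model version that supports system instructions):

import { GoogleGenerativeAI } from "@google/generative-ai";

async function generateWithGeminiSystem(context: string, query: string): Promise<string> {
  const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);
  const model = genAI.getGenerativeModel({
    model: "gemini-pro",
    // Assumes the target model version supports systemInstruction
    systemInstruction: `Answer questions using only the provided context.\n\nContext:\n${context}`,
  });
  const result = await model.generateContent(query);
  return result.response.text();
}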

Cost Optimization via Provider Switching

An underappreciated benefit of LLM-agnostic design is the ability to route different request types to cost-appropriate providers:

  • Interactive user queries → premium model (GPT-4o, Claude Sonnet) for quality
  • Batch document summarization → cheaper model (GPT-4o-mini, Claude Haiku) for throughput
  • Offline indexing tasks → cheapest model available, or local Ollama

This routing logic is trivial to layer on top of generateAnswer:

function selectProvider(task: "interactive" | "batch" | "offline"): Provider {
  switch (task) {
    case "interactive": return "anthropic"; // best reasoning
    case "batch": return "openai";          // cost-efficient
    case "offline": return "gemini";        // flexible pricing
  }
}
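
Wiring it into the pipeline is then a single call (illustrative usage):

// Route a low-priority summarization job to the batch-tier provider
const summary = await rag("Summarize this week's docs", selectProvider("batch"));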

You get the quality-cost trade-off you want, without any changes to the retrieval layer.

The Practical Upshot

LLM-agnostic RAG is not a theoretical ideal — it's a practical decision that saves real engineering time when (not if) you need to switch providers. The architectural requirements are minimal: isolate provider-specific code behind a simple interface, keep your retrieval layer free of LLM dependencies, and use an open-source embedding model if full independence matters.

The knowledge layer — extraction, indexing, and search — is the most stable part of your system. Build it to be permanent. Keep the generation layer thin, swappable, and behind a clear interface. You'll thank yourself when the next model releases.

