HyDE — Hypothetical Document Embeddings

Overview

An overview of HyDE (Hypothetical Document Embeddings), a technique for improving retrieval quality in RAG.

Summary

1. Retrieval Limitations in Standard RAG

RAG (Retrieval-Augmented Generation) converts a query into an embedding, searches a vector database for similar documents, and passes those documents as context to an LLM.

The problem is the semantic gap between user queries and documents. User queries are typically short and abstract, while documents contain long, detailed explanations. For example, a question like “How do I handle async in Python?” and a document thoroughly explaining the asyncio library cover the same topic but may be far apart in embedding space.

Because of this mismatch, the right document may exist but fail to appear at the top of search results.
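The gap can be illustrated with a deliberately simple sketch. Here a bag-of-words vector with cosine similarity stands in for a real embedding model (the `embed` and `cosine` functions below are toy illustrations, not production code): a short query shares far less vocabulary with a detailed document than a document-like paragraph does.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" -- a stand-in for a real encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

document = (
    "the asyncio library provides coroutines and an event loop "
    "for asynchronous programming in python using async and await"
)
query = "how do i handle async in python"
hypothetical = (
    "to handle async code in python you use the asyncio library "
    "with async and await coroutines running on an event loop"
)

doc_vec = embed(document)
print(cosine(embed(query), doc_vec))         # short query: low overlap
print(cosine(embed(hypothetical), doc_vec))  # document-like text: higher overlap
```

The same topic yields a noticeably higher similarity when the search text looks like a document rather than a question, which is exactly the effect HyDE exploits with real embedding models.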

2. What is HyDE

HyDE (Hypothetical Document Embeddings) addresses this problem. The core idea is that instead of embedding the user query directly, an LLM first generates a hypothetical answer, and that answer is embedded for retrieval.

The hypothetical answer generated by the LLM may not be factually accurate, but it tends to use tone, structure, and terminology similar to those of real documents, placing it closer to actual documents in embedding space. HyDE leverages the fact that a document-like text is more effective for retrieval than a short query.

3. HyDE Pipeline

The flows of standard RAG and HyDE can be compared as follows.

Standard RAG:

Query → Encoder → Vector Search → Relevant Docs → LLM → Final Answer

HyDE:

Query → LLM (Generate Hypothetical Answer) → Encoder → Vector Search → Relevant Docs → LLM → Final Answer

The difference is an additional step where the LLM generates a hypothetical answer before retrieval. This hypothetical answer serves as the search query instead.
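The two flows can be sketched as plain functions. The stubs below (`generate_hypothetical`, `encode`, `vector_search`, `answer_with_context`) are hypothetical stand-ins for the LLM, the encoder, the vector database, and the final generation step; the structure of the two pipelines, not the internals, is the point.

```python
def generate_hypothetical(query: str) -> str:      # LLM call
    return f"A passage that answers: {query}"

def encode(text: str) -> list[float]:              # embedding model
    return [float(len(text))]

def vector_search(vec: list[float]) -> list[str]:  # vector DB lookup
    return ["doc about asyncio"]

def answer_with_context(query: str, docs: list[str]) -> str:  # final LLM call
    return f"Answer to {query!r} using {docs}"

def standard_rag(query: str) -> str:
    docs = vector_search(encode(query))
    return answer_with_context(query, docs)

def hyde_rag(query: str) -> str:
    hypothetical = generate_hypothetical(query)    # the one extra step
    docs = vector_search(encode(hypothetical))     # search with it, not the query
    return answer_with_context(query, docs)

print(hyde_rag("How do I handle async in Python?"))
```

Everything after retrieval is identical; HyDE only changes what gets encoded and searched.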

In LangChain, this can be implemented as follows.

from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama
from langchain_core.output_parsers import StrOutputParser

# Prompt for generating hypothetical answer
hyde_prompt = ChatPromptTemplate.from_template(
    "Please write a passage to answer the question.\n"
    "Question: {question}\n"
    "Passage:"
)

llm = ChatOllama(model="qwen3:8b")

# Hypothetical answer generation chain
hyde_chain = hyde_prompt | llm | StrOutputParser()

# Generate hypothetical answer
hypothetical_doc = hyde_chain.invoke({"question": "How do I handle async in Python?"})

# Embed the hypothetical answer and use it for vector search
# embeddings.embed_query(hypothetical_doc)

4. Advantages and Disadvantages

Advantages:

  • Improves retrieval accuracy in zero-shot settings. Works without labeled training data
  • Can be used with existing retrievers and embedding models. No changes to search infrastructure required
  • Simple to implement. Only one additional LLM call before retrieval

Disadvantages:

  • Adds latency due to the extra LLM call
  • If the LLM generates an inaccurate hypothetical answer (hallucination), retrieval quality may actually decrease
  • Increases token usage and cost
  • Not equally effective for all query types. Works better for explanatory questions than fact-checking questions

5. Paper

HyDE was proposed in the following paper.

  • Title: Precise Zero-Shot Dense Retrieval without Relevance Labels
  • Authors: Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan
  • Venue: ACL 2023
  • arXiv: 2212.10496

The key contribution of the paper is demonstrating that dense retrieval performance can be improved without relevance labels. In experiments using InstructGPT to generate hypothetical documents and Contriever for embedding, significant performance improvements over existing zero-shot methods were achieved.
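In the paper, several hypothetical documents are sampled and their embeddings are averaged, together with the query embedding, to form the final search vector. A minimal sketch of that averaging step (with `encode` as a hypothetical encoder standing in for Contriever):

```python
def encode(text: str) -> list[float]:
    # Hypothetical encoder stand-in (the paper uses Contriever).
    return [float(len(text)), float(text.count(" "))]

def hyde_vector(query: str, hypothetical_docs: list[str]) -> list[float]:
    # Average the embeddings of the sampled hypothetical documents
    # together with the query embedding.
    vecs = [encode(d) for d in hypothetical_docs] + [encode(query)]
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

docs = ["passage one about asyncio", "passage two about event loops"]
print(hyde_vector("How do I handle async in Python?", docs))
```

Averaging over multiple samples smooths out individual hallucinated passages, so no single bad generation dominates the search vector.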
