
Ephemeral reranking: why retrieval doesn't end at the vector

Vector search is great at fetching candidates and mediocre at ordering them. That's why serious retrieval has two stages: one fast and approximate, one precise and expensive. We cover why reranking more than doubles ranking quality (with data), the hidden cost almost nobody counts, how the landscape stacks up (Cohere, Voyage, Pinecone, ELSER, ColBERT, Jina), and how we build a reranking index that lives only as long as a query — and why that makes us better.

BiVelio Research · 8 min read
[Figure: Two-stage retrieval. A cloud of candidates is reordered, through an ephemeral spark, into a ranked stack.]

When a BiVelio agent answers a question about a company's documents, the first thing that happens isn't generation — it's retrieval. And the quality of the answer is bounded by the quality of the chunks the agent gets to see. If the context is noisy or badly ordered, no model fixes it afterward (Liu et al., 2023).

The naive intuition is that embedding similarity is enough. In practice, vector search is excellent at recall — quickly pulling a reasonable candidate set — and mediocre at precision — ordering those candidates by true relevance. So serious retrieval happens in two stages. And the second one, reranking, is where the answer is won.

The problems developers actually hit

The first instinct when a RAG fails is to raise top_k: fetch more chunks. It's a trap. With small k you leave out the right document; with large k you flood the context window — and models lose the information that lands in the middle (Liu et al., 2023). More context isn't better context.

The second instinct is to trust dense embeddings for everything. Another trap: in zero-shot evaluation on new domains, dense retrievers can perform below old-school BM25 (Thakur et al., 2021). Vector similarity alone doesn't generalize as well as people assume.

  • −22 pp: accuracy lost when the relevant fact sits "in the middle" of the context
  • −47.7%: dense retriever alone vs BM25, zero-shot (DPR on BEIR)
  • ~65 h: cost of ordering well, with a cross-encoder run over every pair among 10,000 sentences
Source: Liu et al. 2023; Thakur et al. 2021 (BEIR); Reimers & Gurevych 2019

The underlying cause? A cross-encoder — the model that actually judges relevance by looking at the query and the document together — is precise but hugely expensive: you must run it once per (query, document) pair. Reimers and Gurevych measured that exhaustively comparing 10,000 sentences with a cross-encoder takes about 65 hours; with independent embeddings, about 5 seconds (Reimers & Gurevych, 2019). You can't run the cross-encoder over the whole corpus. But its precision is exactly what you need. The answer to that tension is two stages.
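The gap follows from pair counting. A worked version, assuming the exhaustive all-pairs setting that Reimers & Gurevych describe; the per-pair time is simply the quotient of the two quoted figures, not a separately measured number:

```latex
\binom{10\,000}{2} = \frac{10\,000 \cdot 9\,999}{2} = 49\,995\,000 \ \text{pairs}
\qquad
\frac{65\ \text{h} \times 3600\ \text{s/h}}{49\,995\,000\ \text{pairs}} \approx 4.7\ \text{ms per pair}
```

At that rate, scoring 50 candidates for one query would take roughly a quarter of a second on comparable hardware. That is an extrapolation rather than a figure from the paper, but it shows why bounding the number of pairs matters.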

Two-stage retrieval

The first stage approximately retrieves the $k$ chunks most aligned with the query. It works over dense embeddings (Karpukhin et al., 2020) and measures closeness with cosine similarity:

$$\text{sim}(q, d) = \frac{\mathbf{e}_q \cdot \mathbf{e}_d}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_d \rVert}$$

It's cheap and scales to millions of chunks, but the embedding compresses each document into a single vector: it loses nuances that only surface when you compare the query and the document word by word.
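As a minimal sketch of that first stage, the cosine formula can be written directly. The `Chunk` shape and the brute-force `topK` helper are assumptions for illustration: a production system would use an approximate nearest-neighbor index rather than scanning every chunk.

```typescript
// Cosine similarity between two dense vectors of equal dimension.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0; // lengths match, so the fallback is never used
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vector: treat as no similarity
}

// Hypothetical chunk shape, for illustration only.
interface Chunk {
  id: string;
  embedding: number[];
}

// Exact top-k by linear scan: O(N) per query, a stand-in for an ANN index.
function topK(
  queryEmbedding: number[],
  chunks: Chunk[],
  k: number
): { id: string; score: number }[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Each document contributes exactly one vector here, which is the compression the paragraph above describes: the score cannot see which words in the query matched which words in the chunk.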

The second stage, the reranker, reorders only those $k$ candidates with a cross-encoder that does look at the pair jointly (Nogueira & Cho, 2019). The quality jump is large: on the standard MS MARCO benchmark, reranking with BERT more than doubles MRR@10 over BM25 (Nguyen et al., 2016; Nogueira & Cho, 2019).

MS MARCO: ranking quality (MRR@10)
  • BM25 (lexical retrieval): 16.7
  • Cross-encoder reranking (BERT-large): 36.5

Same candidate set, different orderer. Cross-encoder reranking more than doubles the official ranking metric.

Source: Nogueira & Cho, 2019 (Passage Re-ranking with BERT); dataset: Nguyen et al. 2016

The reranker assigns each candidate a score $s_i$ and turns it into a distribution over the retrieved set:

$$p_i = \frac{e^{s_i}}{\sum_{j=1}^{k} e^{s_j}}$$
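A minimal sketch of that normalization, assuming the reranker returns raw, unbounded scores. Subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the resulting distribution.

```typescript
// Softmax over reranker scores: returns probabilities that sum to 1.
function softmax(scores: number[]): number[] {
  if (scores.length === 0) return [];
  const max = Math.max(...scores); // shift so the largest exponent is 0
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((acc, e) => acc + e, 0);
  return exps.map((e) => e / total);
}
```

Because softmax is monotonic, it never changes the ordering of the candidates. It only makes scores comparable within the retrieved set of one query.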

And the pattern generalizes: on BEIR, the BM25 + cross-encoder combination is the best on average versus lexical or dense retrieval alone, winning on 16 of 18 domains (Thakur et al., 2021). When there are several recall sources (lexical and dense), they're fused by rank with Reciprocal Rank Fusion before reranking (Cormack et al., 2009).
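Reciprocal Rank Fusion needs only the rank positions from each source, not their scores. A minimal sketch follows. The function name and the input shape (one array of document ids per recall source, best first) are assumptions for illustration; the default constant of 60 is the value used by Cormack et al.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over sources of 1 / (k + rank(d)),
// where rank is 1-based and documents missing from a source contribute nothing.
function reciprocalRankFusion(
  rankings: string[][],
  k: number = 60
): { id: string; score: number }[] {
  const scores: Record<string, number> = {};
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores[id] = (scores[id] ?? 0) + 1 / (k + rank);
    });
  }
  return Object.keys(scores)
    .map((id) => ({ id, score: scores[id] ?? 0 }))
    .sort((a, b) => b.score - a.score);
}

// Usage: fuse a lexical ranking and a dense ranking before reranking.
const fused = reciprocalRankFusion([
  ["a", "b", "c"], // e.g. BM25 order
  ["b", "c", "d"], // e.g. dense order
]);
```

Documents that appear in both lists ("b" and "c" here) rise above documents that only one source found, which is the behavior you want from a recall stage that feeds a reranker.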

The key asymmetry

The first stage decides what gets in the deck; the second decides the order. Good recall with bad ordering wastes the context window; perfect ordering over bad recall can't recover what was never fetched. You need both — and you measure each with its own metric (recall@k and MRR/nDCG).
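The two metric families can be sketched as follows, assuming binary relevance labels and unique document ids in each ranking. The function names are illustrative. MRR is the mean of the reciprocal rank over all evaluation queries.

```typescript
// recall@k: fraction of relevant documents that appear in the top k.
// Measures the first stage: did the right chunks get in at all?
function recallAtK(ranked: string[], relevant: string[], k: number): number {
  if (relevant.length === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.indexOf(id) !== -1).length;
  return hits / relevant.length;
}

// Reciprocal rank: 1 / position of the first relevant document in the top k, else 0.
function reciprocalRankAtK(ranked: string[], relevant: string[], k: number): number {
  const firstHit = ranked
    .slice(0, k)
    .map((id) => relevant.indexOf(id) !== -1)
    .indexOf(true);
  return firstHit === -1 ? 0 : 1 / (firstHit + 1);
}

// nDCG@k with binary gains: DCG / ideal DCG, discount log2(rank + 1).
// Measures the second stage: are the relevant chunks near the top?
function ndcgAtK(ranked: string[], relevant: string[], k: number): number {
  const log2 = (x: number) => Math.log(x) / Math.LN2;
  let dcg = 0;
  ranked.slice(0, k).forEach((id, i) => {
    if (relevant.indexOf(id) !== -1) dcg += 1 / log2(i + 2);
  });
  let ideal = 0;
  for (let i = 0; i < Math.min(k, relevant.length); i++) {
    ideal += 1 / log2(i + 2);
  }
  return ideal === 0 ? 0 : dcg / ideal;
}
```

Tracking recall@k on the candidate set and MRR or nDCG on the reranked list separately tells you which stage to fix when answers degrade.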

The hidden cost of reranking

If reranking is so good, why doesn't everyone max it out? Cost. A cross-encoder runs one model forward pass per pair, so reordering is expensive and slow. ColBERT's numbers make it clear: reordering one query's candidates takes about 33 seconds with a BERT-large cross-encoder and about 10.7 seconds with BERT-base, while ColBERT's late interaction reaches comparable quality in 61 milliseconds (Khattab & Zaharia, 2020).

  • 10,700 ms: cross-encoder per query (BERT-base reordering candidates)
  • 61 ms: late interaction (ColBERT, comparable quality)
  • 154 → 16 GiB: specialized index (ColBERT → ColBERTv2 on MS MARCO)
Source: Khattab & Zaharia 2020 (ColBERT); Santhanam et al. 2022 (ColBERTv2)

The usual way out is to keep a specialized, hot reranking index running permanently — a GPU served 24/7, or a huge multi-vector index (ColBERT needed 154 GiB for MS MARCO, cut to 16–25 GiB only with compression (Santhanam et al., 2022)). The problem: that index is expensive to maintain and mostly idle, waiting for queries that arrive in bursts.

Ephemeral reranking

Our bet — what we internally call Turbovec — is to flip the equation: build the reranking index on demand, right when the query arrives, and discard it when it's done.

async function retrieve(query: string, projectId: string) {
  // Stage 1 — cheap recall over the persistent vector index
  const candidates = await vectorSearch(query, projectId, { k: 50 })
 
  // Stage 2 — ephemeral reranking index, alive only for this query
  const boost = await buildEphemeralIndex(candidates)
  const scored = await boost.rerank(query, candidates)
 
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, 8) // only the best makes it into the agent's window
}

The ephemeral index doesn't compete with persistent storage: it complements it. Vector search's job is to get the right candidate into the 50; the reranker's job is to lift it into the top-8 the agent actually reads. And because reranking is applied only to those 50 candidates, not the whole corpus, the cost scales with $k$ (controllable), not with the corpus size $N$ (intractable). You pay for precision only when there's a question, not for a GPU sitting idle.
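To make the lifecycle concrete, here is a hedged sketch of a query-scoped index. It is not Turbovec's implementation: `EphemeralIndex`, `PairScorer`, `overlapScorer` and `rerankOnce` are hypothetical names, the sketch is synchronous for brevity, and the token-overlap scorer is a toy stand-in for a real cross-encoder. Only the shape matters: build from the candidates, score $k$ pairs, release.

```typescript
interface Candidate { id: string; text: string; }
interface Scored extends Candidate { score: number; }

// Hypothetical pair scorer; a real system would call a cross-encoder here.
type PairScorer = (queryTokens: string[], docTokens: string[]) => number;

const tokenize = (s: string): string[] =>
  s.toLowerCase().split(/\W+/).filter((t) => t.length > 0);

// Toy stand-in: fraction of query tokens that occur in the document.
const overlapScorer: PairScorer = (q, d) =>
  q.length === 0 ? 0 : q.filter((t) => d.indexOf(t) !== -1).length / q.length;

class EphemeralIndex {
  private docs: { candidate: Candidate; tokens: string[] }[];
  private scorer: PairScorer;

  // Build cost is O(k): only this query's candidates are preprocessed.
  constructor(candidates: Candidate[], scorer: PairScorer) {
    this.scorer = scorer;
    this.docs = candidates.map((c) => ({ candidate: c, tokens: tokenize(c.text) }));
  }

  // One scorer call per (query, candidate) pair, then sort by score descending.
  rerank(query: string): Scored[] {
    const q = tokenize(query);
    return this.docs
      .map((d) => ({ ...d.candidate, score: this.scorer(q, d.tokens) }))
      .sort((a, b) => b.score - a.score);
  }

  // Drop all per-query state so nothing outlives the request.
  dispose(): void {
    this.docs = [];
  }
}

function rerankOnce(
  query: string,
  candidates: Candidate[],
  topN: number,
  scorer: PairScorer = overlapScorer
): Scored[] {
  const index = new EphemeralIndex(candidates, scorer);
  try {
    return index.rerank(query).slice(0, topN);
  } finally {
    index.dispose(); // released even if scoring throws
  }
}
```

The `try`/`finally` is the design point: the index has no lifetime beyond the call, so there is nothing to keep warm or keep fresh between queries.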

How the landscape stacks up

Reranking has become an excellent commodity, but almost every option assumes a permanent service or index:

| Solution | Mechanism | Cost / limit |
| --- | --- | --- |
| Cohere Rerank | Managed cross-encoder | Pay-per-use API; ~4096-token window/doc, chunks long docs |
| Voyage rerank | Managed cross-encoder | Token budget per request; truncates by default |
| Pinecone rerank | Hosted cross-encoder | 512-token context on its own model; per request |
| Elastic ELSER + Rerank | Learned sparse + cross-encoder | 512 tokens; English-recommended; per-query inference cost |
| sentence-transformers | Self-hosted cross-encoder | You keep the GPU hot; doesn't scale to the corpus |
| ColBERT / RAGatouille | Late interaction (multi-vector) | Large, persistent specialized index |
| Jina Reranker | Cross-encoder / listwise | Per-pair inference; service or self-hosted weights |

These are superb building blocks. But almost all of them ask for something running all the time: an API you pay per query, a served GPU, or a specialized index that must be kept fresh. Per-query ephemeral reranking isn't the model they were designed for.

How we plan to be the best

We don't compete on having the biggest cross-encoder: we compete on delivering cross-encoder precision without paying to keep it hot. That's where we focus the edge:

The usual approach

A reranking index or a GPU running permanently — expensive and idle most of the time, sized for the peak.

BiVelio

An ephemeral reranking index: built from the query's candidates, it orders them, then is discarded. Cross-encoder precision, cost proportional to real use.
  1. Cost proportional to use. No idle GPU: the precision index exists only during the query. You pay for questions, not for waiting time.
  2. Hybrid recall, focused precision. We combine lexical (Robertson & Zaragoza, 2009) and dense (Karpukhin et al., 2020) signals, fused by rank (Cormack et al., 2009), and only then apply reranking — the pattern the evidence rewards (Thakur et al., 2021).
  3. Less context, better ordered. We hand the agent the top-8, not the top-50: we attack "lost in the middle" (Liu et al., 2023) head-on instead of flooding the window.
  4. Reranking with operational context. We don't reorder chunks blindly: we combine it with the company's knowledge graph, the idea we develop in The graph as ambient context.
  5. Traceability. Every reranked result keeps its origin and its score: you can audit why a chunk reached the agent.

Note: the figures in this article come from the cited literature (Reimers & Gurevych, Nogueira & Cho, Khattab & Zaharia, Santhanam et al., Thakur et al., Liu et al.) and describe reranking in general. They are the motivation for our design, not a fixed product benchmark.

References

Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 758–759. https://doi.org/10.1145/1571941.1572114
Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense Passage Retrieval for Open-Domain Question Answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 6769–6781.
Khattab, O., & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). https://arxiv.org/abs/2004.12832
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics (TACL). https://arxiv.org/abs/2307.03172
Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng, L. (2016). MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv Preprint arXiv:1611.09268. https://arxiv.org/abs/1611.09268
Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv Preprint arXiv:1901.04085. https://arxiv.org/abs/1901.04085
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP). https://arxiv.org/abs/1908.10084
Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389.
Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., & Zaharia, M. (2022). ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). https://arxiv.org/abs/2112.01488
Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track. https://arxiv.org/abs/2104.08663