Ephemeral reranking: why retrieval doesn't end at the vector
Vector search is great at fetching candidates and mediocre at ordering them. That's why serious retrieval has two stages: one fast and approximate, one precise and expensive. We cover why reranking more than doubles ranking quality (with data), the hidden cost almost nobody counts, how the landscape stacks up (Cohere, Voyage, Pinecone, ELSER, ColBERT, Jina), and how we build a reranking index that lives only as long as a query — and why that makes us better.

When a BiVelio agent answers a question about a company's documents, the first thing that happens isn't generation — it's retrieval. And the quality of the answer is bounded by the quality of the chunks the agent gets to see. If the context is noisy or badly ordered, no model fixes it afterward (Liu et al., 2023).
The naive intuition is that embedding similarity is enough. In practice, vector search is excellent at recall — quickly pulling a reasonable candidate set — and mediocre at precision — ordering those candidates by true relevance. So serious retrieval happens in two stages. And the second one, reranking, is where the answer is won.
The problems developers actually hit
The first instinct when a RAG pipeline fails is to raise top_k: fetch more chunks. It's a trap. With small k you leave out the right document; with large k you flood the context window, and models lose the information that lands in the middle (Liu et al., 2023). More context isn't better context.
The second instinct is to trust dense embeddings for everything. Another trap: in zero-shot evaluation on new domains, dense retrievers can perform below old-school BM25 (Thakur et al., 2021). Vector similarity alone doesn't generalize as well as people assume.
The underlying cause? A cross-encoder — the model that actually judges relevance by looking at the query and the document together — is precise but hugely expensive: you must run it once per (query, document) pair. Reimers and Gurevych measured that exhaustively comparing 10,000 sentences with a cross-encoder takes about 65 hours; with independent embeddings, about 5 seconds (Reimers & Gurevych, 2019). You can't run the cross-encoder over the whole corpus. But its precision is exactly what you need. The answer to that tension is two stages.
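To make the 65-hour figure concrete, here is a small back-of-the-envelope sketch. It assumes "exhaustively comparing" means scoring every unordered pair of the 10,000 sentences; the per-pair latency it derives is only what the quoted total implies, not a measured number.

```typescript
// Number of unordered pairs among n items: n * (n - 1) / 2.
function pairCount(n: number): number {
  return (n * (n - 1)) / 2;
}

const sentences = 10_000;

// A cross-encoder needs one forward pass per pair.
const pairs = pairCount(sentences); // 49,995,000 pairs

// Per-pair latency implied by a total of about 65 hours.
const totalSeconds = 65 * 3600;
const msPerPair = (totalSeconds / pairs) * 1000; // roughly 4.7 ms per pair

// Independent embeddings need only one encoder pass per sentence.
const encoderPasses = sentences;

console.log({ pairs, msPerPair: msPerPair.toFixed(2), encoderPasses });
```

A few milliseconds per pair is not slow in isolation. The cost comes from the quadratic number of pairs, which is why the cross-encoder only ever sees a short candidate list.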
Two-stage retrieval
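The first stage approximately retrieves the chunks most aligned with the query. It works over dense embeddings (Karpukhin et al., 2020) and measures closeness with cosine similarity between the query embedding $\mathbf{e}_q$ and the chunk embedding $\mathbf{e}_d$:

$$\mathrm{sim}(q, d) = \frac{\mathbf{e}_q \cdot \mathbf{e}_d}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_d \rVert}$$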
It's cheap and scales to millions of chunks, but the embedding compresses each document into a single vector: it loses nuances that only surface when you compare the query and the document word by word.
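As a minimal sketch of this first stage, the snippet below scores chunks by cosine similarity and keeps the top k. It is a brute-force scan for illustration only; a production system would use an approximate nearest-neighbor index. The `EmbeddedChunk` type and the `topK` helper are hypothetical names, not part of any library.

```typescript
// Hypothetical chunk shape: an id plus a precomputed embedding.
type EmbeddedChunk = { id: string; embedding: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vectors get similarity 0
}

// Brute-force stage 1: score every chunk, sort descending, keep k.
function topK(queryEmbedding: number[], chunks: EmbeddedChunk[], k: number) {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Note that each chunk contributes exactly one vector to the comparison, which is the compression the paragraph above describes.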
The second stage — the reranker — reorders only those candidates with a cross-encoder that does look at the pair jointly (Nogueira & Cho, 2019). The quality jump is large: on the standard MS MARCO benchmark, reranking with BERT more than doubles MRR@10 over BM25 (Nguyen et al., 2016; Nogueira & Cho, 2019).
Same candidate set, different orderer. Cross-encoder reranking more than doubles the official ranking metric.
Source: Nogueira & Cho, 2019 (Passage Re-ranking with BERT); dataset: Nguyen et al., 2016
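The reranker assigns each candidate $d_i$ a score $s(q, d_i)$ and turns it into a distribution over the $k$ retrieved candidates:

$$p(d_i \mid q) = \frac{\exp s(q, d_i)}{\sum_{j=1}^{k} \exp s(q, d_j)}$$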
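A common way to turn raw relevance scores into a distribution over the candidate set is a softmax. The sketch below assumes that choice; the source does not state which normalization the reranker uses. It subtracts the maximum score before exponentiating, a standard trick to avoid overflow.

```typescript
// Softmax over reranker scores for one query's candidate set.
function softmax(scores: number[]): number[] {
  if (scores.length === 0) return [];
  const max = Math.max(...scores);
  // Shifting by the max leaves the result unchanged but keeps exp() bounded.
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((acc, e) => acc + e, 0);
  return exps.map((e) => e / total);
}

// Hypothetical raw scores for three candidates.
const probabilities = softmax([2.1, 0.3, -1.0]);
console.log(probabilities);
```

The normalization never changes the ordering, only the scale, so sorting by raw score and sorting by probability give the same ranking.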
And the pattern generalizes: on BEIR, the BM25 + cross-encoder combination is the best on average versus lexical or dense retrieval alone, winning on 16 of 18 domains (Thakur et al., 2021). When there are several recall sources (lexical and dense), they're fused by rank with Reciprocal Rank Fusion before reranking (Cormack et al., 2009).
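Reciprocal Rank Fusion is simple enough to sketch in a few lines: each document earns 1 / (k + rank) from every list it appears in, and the contributions are summed. The constant k = 60 is the value used by Cormack et al.; the function name and the example ids are illustrative.

```typescript
// Fuse several ranked id lists (best first) into one ranking.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60
): Array<{ id: string; score: number }> {
  const fused = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      fused.set(id, (fused.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(fused.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical recall lists from a lexical and a dense retriever.
const lexical = ["a", "b", "c"];
const dense = ["b", "d", "a"];
console.log(reciprocalRankFusion([lexical, dense]).map((r) => r.id));
```

Because only ranks are used, the lexical and dense scores never need to be calibrated against each other, which is the main reason to fuse by rank rather than by score.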
The key asymmetry
The first stage decides what gets in the deck; the second decides the order. Good recall with bad ordering wastes the context window; perfect ordering over bad recall can't recover what was never fetched. You need both — and you measure each with its own metric (recall@k and MRR/nDCG).
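The two metrics can be sketched as follows. The example assumes each ranked list contains unique ids and uses binary relevance; the function names are illustrative.

```typescript
// Fraction of the relevant documents that appear in the top k.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// 1 / rank of the first relevant document in the top k, or 0 if none.
function reciprocalRankAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const index = ranked.slice(0, k).findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// MRR@k is the mean of the reciprocal ranks over a set of queries.
function meanReciprocalRank(
  runs: Array<{ ranked: string[]; relevant: Set<string> }>,
  k: number
): number {
  if (runs.length === 0) return 0;
  const total = runs.reduce((acc, r) => acc + reciprocalRankAtK(r.ranked, r.relevant, k), 0);
  return total / runs.length;
}

const relevantIds = new Set(["a", "b"]);
const beforeRerank = ["x", "a", "y", "b"];
const afterRerank = ["a", "b", "x", "y"]; // same candidates, new order

console.log(recallAtK(beforeRerank, relevantIds, 4), recallAtK(afterRerank, relevantIds, 4));
console.log(reciprocalRankAtK(beforeRerank, relevantIds, 10), reciprocalRankAtK(afterRerank, relevantIds, 10));
```

Reordering the same four candidates leaves recall@4 untouched and changes only the reciprocal rank, which is the asymmetry in numbers: stage one moves recall, stage two moves MRR.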
The hidden cost of reranking
If reranking is so good, why doesn't everyone max it out? Cost. A cross-encoder runs one model forward pass per pair, so reordering is expensive and slow. ColBERT's numbers make it clear: reordering one query's candidates with a BERT-large cross-encoder takes about 33 seconds; ColBERT's late interaction reaches comparable quality in 61 milliseconds (Khattab & Zaharia, 2020).
The usual way out is to keep a specialized, hot reranking index running permanently — a GPU served 24/7, or a huge multi-vector index (ColBERT needed 154 GiB for MS MARCO, cut to 16–25 GiB only with compression (Santhanam et al., 2022)). The problem: that index is expensive to maintain and mostly idle, waiting for queries that arrive in bursts.
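For readers unfamiliar with late interaction, the scoring rule behind ColBERT can be sketched briefly: every query token vector is matched to its most similar document token vector, and those maxima are summed. The sketch assumes the token vectors are already computed and normalized, so a dot product acts as cosine similarity; the function names are illustrative and this is not ColBERT's actual code.

```typescript
function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// MaxSim: sum over query tokens of the best match among document tokens.
function maxSimScore(queryTokens: number[][], docTokens: number[][]): number {
  if (docTokens.length === 0) return 0;
  let total = 0;
  for (const q of queryTokens) {
    let best = -Infinity;
    for (const d of docTokens) {
      best = Math.max(best, dotProduct(q, d));
    }
    total += best;
  }
  return total;
}

// Toy 2-dimensional token vectors (real models use far more dimensions).
const queryTokenVectors = [[1, 0], [0, 1]];
const docTokenVectors = [[1, 0], [0.6, 0.8]];
console.log(maxSimScore(queryTokenVectors, docTokenVectors));
```

The document side can be precomputed, which is where the speed comes from, but it means storing one vector per token rather than one per chunk. That is why the persistent index grows so large.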
Ephemeral reranking
Our bet — what we internally call Turbovec — is to flip the equation: build the reranking index on demand, right when the query arrives, and discard it when it's done.
async function retrieve(query: string, projectId: string) {
  // Stage 1: cheap recall over the persistent vector index
  const candidates = await vectorSearch(query, projectId, { k: 50 })
  // Stage 2: ephemeral reranking index, alive only for this query
  const boost = await buildEphemeralIndex(candidates)
  const scored = await boost.rerank(query, candidates)
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, 8) // only the best makes it into the agent's window
}

The ephemeral index doesn't compete with persistent storage: it complements it. Vector search is responsible for getting the right candidate into the 50; the reranker is responsible for lifting it into the top 8 the agent actually reads. And because reranking is applied only to those 50 candidates, not the whole corpus, the cost scales with the number of candidates k (controllable), not with the corpus size N (intractable). You pay for precision only when there's a question, not for a GPU sitting idle.
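To illustrate the lifecycle only, here is a self-contained sketch of a per-query index that is built from the candidates, used once, and discarded. This is not Turbovec's implementation: `EphemeralRerankIndex`, `rerankOnce`, and `toyPairScore` are hypothetical names, and the scorer is a toy term-overlap measure standing in for a real relevance model.

```typescript
type Candidate = { id: string; text: string };
type ScoredCandidate = Candidate & { score: number };

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Toy stand-in for a relevance model: fraction of query terms found in the text.
// This is NOT a cross-encoder; it only makes the example runnable.
function toyPairScore(query: string, text: string): number {
  const queryTerms = tokenize(query);
  if (queryTerms.length === 0) return 0;
  const docTerms = new Set(tokenize(text));
  return queryTerms.filter((t) => docTerms.has(t)).length / queryTerms.length;
}

class EphemeralRerankIndex {
  private entries: Candidate[];
  pairsScored = 0; // counts scoring calls, to show cost tracks candidate count

  constructor(candidates: Candidate[]) {
    this.entries = candidates.slice();
  }

  rerank(query: string): ScoredCandidate[] {
    const scored = this.entries.map((c) => {
      this.pairsScored += 1;
      return { ...c, score: toyPairScore(query, c.text) };
    });
    return scored.sort((a, b) => b.score - a.score);
  }

  dispose(): void {
    this.entries = []; // release the per-query state
  }
}

// Build, use, and discard the index within a single call.
function rerankOnce(query: string, candidates: Candidate[], topN: number): ScoredCandidate[] {
  const index = new EphemeralRerankIndex(candidates);
  try {
    return index.rerank(query).slice(0, topN);
  } finally {
    index.dispose(); // nothing outlives the query
  }
}
```

The `try`/`finally` is the point of the sketch: the index is released even if scoring throws, and the number of scoring calls equals the number of candidates regardless of how large the corpus is.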
How the landscape stacks up
Reranking has become a commodity, and a good one, but almost every option assumes a permanent service or index:
| Solution | Mechanism | Cost / limit |
|---|---|---|
| Cohere Rerank | Managed cross-encoder | Pay-per-use API; ~4096-token window/doc, chunks long docs |
| Voyage rerank | Managed cross-encoder | Token budget per request; truncates by default |
| Pinecone rerank | Hosted cross-encoder | 512-token context on its own model; per request |
| Elastic ELSER + Rerank | Learned sparse + cross-encoder | 512 tokens; English-recommended; per-query inference cost |
| sentence-transformers | Self-hosted cross-encoder | You keep the GPU hot; doesn't scale to the corpus |
| ColBERT / RAGatouille | Late interaction (multi-vector) | Large, persistent specialized index |
| Jina Reranker | Cross-encoder / listwise | Per-pair inference; service or self-hosted weights |
These are superb building blocks. But almost all of them ask for something running all the time: an API you pay per query, a served GPU, or a specialized index that must be kept fresh. Per-query ephemeral reranking isn't the model they were designed for.
How we plan to be the best
We don't compete on having the biggest cross-encoder: we compete on delivering cross-encoder precision without paying to keep it hot. That's where we focus the edge:
- Cost proportional to use. No idle GPU: the precision index exists only during the query. You pay for questions, not for waiting time.
- Hybrid recall, focused precision. We combine lexical (Robertson & Zaragoza, 2009) and dense (Karpukhin et al., 2020) signals, fused by rank (Cormack et al., 2009), and only then apply reranking — the pattern the evidence rewards (Thakur et al., 2021).
- Less context, better ordered. We hand the agent the top-8, not the top-50: we attack "lost in the middle" (Liu et al., 2023) head-on instead of flooding the window.
- Reranking with operational context. We don't reorder chunks blindly: we combine it with the company's knowledge graph, the idea we develop in The graph as ambient context.
- Traceability. Every reranked result keeps its origin and its score: you can audit why a chunk reached the agent.
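The traceability point can be pictured as a record that travels with each chunk through both stages. The sketch below is purely illustrative: the `TracedChunk` fields and the `auditTrail` helper are hypothetical and do not describe BiVelio's actual data model.

```typescript
// Hypothetical record carrying a chunk's provenance through both stages.
type TracedChunk = {
  chunkId: string;
  sourceDocument: string; // where the chunk came from
  recallSource: "lexical" | "dense" | "both"; // which stage 1 signal surfaced it
  recallRank: number; // 1-based position after fusion
  rerankScore: number; // stage 2 score
};

// Keep the top N by rerank score and emit one human-readable line per chunk.
function auditTrail(chunks: TracedChunk[], topN: number): string[] {
  return chunks
    .slice()
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .slice(0, topN)
    .map(
      (c, i) =>
        `#${i + 1} ${c.chunkId} (${c.sourceDocument}): ` +
        `recall=${c.recallSource}@${c.recallRank}, rerank=${c.rerankScore.toFixed(3)}`
    );
}
```

Keeping the stage 1 rank next to the stage 2 score makes it easy to see how far the reranker moved a chunk, which is often the first question when auditing an answer.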
Note: the figures in this article come from the cited literature (Reimers & Gurevych, Nogueira & Cho, Khattab & Zaharia, Santhanam et al., Thakur et al., Liu et al.) and describe reranking in general. They are the motivation for our design, not a fixed product benchmark.
References
- #retrieval
- #reranking
- #embeddings
- #rag
- #cross-encoder