The knowledge graph as ambient context for agents

An agent without context is just a model. Classic vector retrieval treats knowledge as a bag of independent chunks — which is why it fails on the questions that matter most to a company. We walk through the real problems RAG developers hit (with data), why graph structure wins where vectors fall short, how the landscape stacks up (Pinecone, LangChain, LlamaIndex, Neo4j, Microsoft GraphRAG), and how we plan to be the best at the context that actually counts: a company's operation.

BiVelio Research · 8 min read

The promise of RAG (retrieval-augmented generation) is simple: feed the model the right documents and it will answer well (Lewis et al., 2020). The reality for anyone shipping it is rougher. A company's knowledge isn't a stack of loose texts: it's a network of cases, customers, invoices, tasks and agents connected to each other. Flattening that into independent chunks loses exactly what gives it meaning — and the numbers confirm it.

This article is our technical thesis: why we treat knowledge as a graph and use it as the ambient context of agents, not as a bag of chunks.

The problems developers actually hit

Anyone who has built a real RAG has hit the same wall: adding more context doesn't improve the answer — sometimes it hurts it. That's not anecdote, it's measured. Liu et al. showed that models use information well at the start and end of the context, but lose it when it lands in the middle (Liu et al., 2023).

QA accuracy by where the relevant fact sits (GPT-3.5, 20 documents)

| Position of the relevant fact | QA accuracy |
| --- | --- |
| Start of the context | 75.8% |
| Middle of the context | 53.8% |

Same fact, same question: only the position of the relevant document within the context changes. The mid-context drop exceeds 20 points.

Source: Liu et al., 2023, Lost in the Middle (arXiv:2307.03172)

With 30 documents, placing the fact in the middle scores 50.5%, worse than the 56.1% the model reaches with no documents at all; the 53.8% measured with 20 documents is already below that closed-book baseline. Badly ordered context does not just fail to help, it subtracts (Liu et al., 2023). And this is just one of several failure modes the literature documents.

| Figure | What it measures | Context |
| --- | --- | --- |
| −22 pp | Accuracy drop | Relevant fact "lost in the middle" |
| 7 | RAG failure points | Documented in real systems |
| 27% | Responses with hallucination | GPT-4 on data-to-text tasks |

Source: Liu et al., 2023; Barnett et al., 2024; Wu et al., 2024 (RAGTruth)

Barnett et al. cataloged seven recurring failure points when taking a RAG to production (Barnett et al., 2024): missing content, the relevant document missing the top-k, dropped during prompt consolidation, not extracted despite being present, wrong format, wrong specificity and incomplete answer. And RAGTruth measured that even with retrieval, a non-trivial fraction of responses hallucinate — up to 27% on data-to-text tasks with GPT-4 (Wu et al., 2024).

The common root

Almost all of these failures share a cause: similarity retrieval brings chunks similar to the question, but blind to one another. If the answer requires connecting several pieces (multi-hop) or synthesizing a whole corpus, vector similarity has no way to see it (Tang & Yang, 2024).

On top of that comes chunking fragmentation: splitting documents into fixed-size chunks cuts a single fact across two chunks, and neither holds the complete answer (Gao et al., 2023).
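A minimal sketch of that fragmentation, assuming character-based fixed-size chunks and a made-up sentence (the case, invoice number and customer are hypothetical). With a 40-character window the boundary falls inside the invoice identifier, so no single chunk contains the whole fact:

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical record, invented for illustration only.
doc = "Case 17 is blocked because invoice INV-204 for Acme is 30 days overdue."
chunks = fixed_size_chunks(doc, size=40)

# The fact we care about needs both the invoice id and its status.
fact_terms = ("INV-204", "overdue")
complete = [c for c in chunks if all(term in c for term in fact_terms)]

for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
print("chunks holding the complete fact:", len(complete))
```

Real pipelines usually chunk by tokens and add overlap, which reduces this effect but does not remove it: a fact longer than the overlap can still straddle a boundary.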

Three ways to retrieve (and why structure matters)

Not all retrieval architectures are equal. It helps to separate three paradigms:

| Paradigm | How it retrieves | Strong at | Blind spot |
| --- | --- | --- | --- |
| Vector RAG | k nearest neighbors by embedding similarity | Meaning, synonyms, speed | Multi-hop, relationships, global synthesis |
| Hybrid (BM25 + vector) | Fuses exact lexical + semantic (e.g. RRF) | Exact terms (codes, names) + semantics | Still ranking of disconnected passages |
| Graph RAG | Traverses explicit relationships + diffusion | Multi-hop, relational context, sensemaking | Cost of building the graph |
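To make the hybrid row concrete, here is a small sketch of reciprocal rank fusion (RRF), where each document scores the sum of 1 / (k + rank) over the rankings it appears in. The document ids and the two input rankings are hypothetical, and k = 60 is a commonly used default rather than a value taken from this article:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists; higher fused score comes first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a lexical (BM25) and a vector retriever.
bm25_ranking = ["inv-204", "case-17", "faq-3"]
vector_ranking = ["case-17", "note-9", "inv-204"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

The result is still a flat list of passages: nothing in the fused ranking records that an invoice belongs to a case.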

Hybrid fixes "vectors miss the exact term"; it does not fix "retrieval ignores how facts connect." That takes structure. And that's where the graph changes the rules.

The graph as ambient context

We model the operation as a directed graph $G = (V, E)$, where the nodes $V$ are entities (cases, documents, customers, tasks, agents) and the edges $E$ are the real relationships between them. Some nodes concentrate a huge number of connections; we call them god nodes, and they tend to be the points the whole operation flows through.
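As an illustration of that model, the sketch below stores a tiny directed graph as typed edges and picks out the most connected nodes. Everything here is an assumption for the example: the node ids, the relation names, and the choice of total degree (in plus out) as the yardstick for a god node, which the article does not specify.

```python
from collections import Counter

# Hypothetical edges as (source, relation, target) triples.
edges = [
    ("case:17", "belongs_to", "customer:acme"),
    ("invoice:204", "billed_to", "customer:acme"),
    ("task:88", "part_of", "case:17"),
    ("agent:billing", "assigned_to", "case:17"),
    ("doc:contract", "signed_by", "customer:acme"),
    ("invoice:204", "relates_to", "case:17"),
]


def total_degree(edges: list[tuple[str, str, str]]) -> Counter:
    """Count incoming plus outgoing edges per node."""
    degree: Counter = Counter()
    for src, _relation, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return degree


def god_nodes(edges: list[tuple[str, str, str]], top: int = 2) -> list[str]:
    """Return the `top` nodes by total degree (ties broken by id)."""
    ranked = sorted(total_degree(edges).items(), key=lambda kv: (-kv[1], kv[0]))
    return [node for node, _ in ranked[:top]]


print(god_nodes(edges))
```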

To measure a node's importance we use PageRank (Page et al., 1999), defined recursively: a node is important if important nodes point to it.

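$$PR(v) = \frac{1 - d}{|V|} + d \sum_{u \in B(v)} \frac{PR(u)}{L(u)}$$

Here $d$ is the damping factor, $B(v)$ is the set of nodes that link to $v$, and $L(u)$ is the number of outgoing links of $u$.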
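The recursion can be solved by repeated substitution (power iteration). The sketch below is one way to do it under stated assumptions: the graph and node ids are hypothetical, d = 0.85 is the conventional default, and rank held by nodes without outgoing links is spread uniformly, a common convention that the formula itself does not cover.

```python
def pagerank(out_links: dict[str, list[str]], d: float = 0.85,
             iters: int = 100) -> dict[str, float]:
    """Power iteration for PR(v) = (1 - d)/|V| + d * sum(PR(u)/L(u))."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Assumption: dangling nodes (no out-links) share their rank with everyone.
        dangling = sum(pr[u] for u in nodes if not out_links.get(u))
        nxt = {v: (1.0 - d) / n + d * dangling / n for v in nodes}
        for u, targets in out_links.items():
            for v in targets:
                nxt[v] += d * pr[u] / len(targets)  # PR(u) / L(u)
        pr = nxt
    return pr


# Hypothetical operation graph: who points to whom.
out_links = {
    "task:88": ["case:17"],
    "agent:billing": ["case:17"],
    "invoice:204": ["case:17", "customer:acme"],
    "case:17": ["customer:acme"],
    "customer:acme": [],
}

ranks = pagerank(out_links)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{node:15s} {score:.3f}")
```

In this toy graph the customer ends up on top because the case, itself well linked, passes all of its rank to it.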

When an agent needs context we don't just fire a similarity search: we seed the graph with the nodes most aligned to the query and let relevance diffuse to their neighbors through the normalized adjacency $\tilde{A}$:

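$$\mathbf{r}_{t+1} = (1 - \alpha)\,\mathbf{s} + \alpha\, \tilde{A}\,\mathbf{r}_{t}$$

Here $\mathbf{s}$ is the seed vector concentrated on the query-aligned nodes, $\mathbf{r}_{t}$ is the relevance vector after $t$ steps, and $\alpha$ plays the role of the damping factor $d$ in PageRank: it sets how much relevance flows along edges versus returning to the seeds.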
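A minimal sketch of this update, not a description of any production system. Assumptions: the graph is treated as undirected neighbor lists, the normalization divides each node's relevance equally among its neighbors (a column-stochastic choice), α = 0.5 and 50 steps are arbitrary, every node has at least one neighbor, and all node ids are hypothetical.

```python
def diffuse(neighbors: dict[str, list[str]], seeds: dict[str, float],
            alpha: float = 0.5, steps: int = 50) -> dict[str, float]:
    """Iterate r <- (1 - alpha) * s + alpha * A_norm @ r from r = s."""
    nodes = sorted(neighbors)
    total = sum(seeds.values())
    s = {v: seeds.get(v, 0.0) / total for v in nodes}  # normalized seed vector
    r = dict(s)
    for _ in range(steps):
        nxt = {v: (1.0 - alpha) * s[v] for v in nodes}
        for u in nodes:
            # Each node passes an equal share of its relevance to every neighbor.
            share = alpha * r[u] / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        r = nxt
    return r


# Hypothetical graph with two disconnected customers.
neighbors = {
    "case:17": ["customer:acme", "invoice:204", "task:88"],
    "customer:acme": ["case:17", "invoice:204", "doc:contract"],
    "invoice:204": ["case:17", "customer:acme"],
    "task:88": ["case:17"],
    "doc:contract": ["customer:acme"],
    "case:99": ["customer:zeta"],
    "customer:zeta": ["case:99"],
}

# Suppose the query aligns best with the invoice node.
relevance = diffuse(neighbors, seeds={"invoice:204": 1.0})
for node, score in sorted(relevance.items(), key=lambda kv: -kv[1]):
    print(f"{node:15s} {score:.3f}")
```

Relevance reaches the case, its customer and their direct neighbors, while the unrelated customer stays at zero because no path connects it to the seed.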

This isn't a hunch: it's exactly the mechanism HippoRAG demonstrated to solve multi-hop questions in a single retrieval step, using Personalized PageRank over a knowledge graph (Gutiérrez et al., 2024). The evidence is striking.

Recall@5 on 2WikiMultiHopQA (multi-hop questions)

| Retriever | Recall@5 (%) |
| --- | --- |
| Dense vector RAG (ColBERTv2) | 68 |
| Graph + PageRank (HippoRAG) | 89 |

Dense vector retrieval (ColBERTv2) vs graph + Personalized PageRank on the same benchmark. Structure recovers about 21 points more of the useful evidence (roughly 30% more in relative terms) on questions that require chaining facts.

Source: Gutiérrez et al., 2024, HippoRAG (arXiv:2405.14831)

And for global questions — "what are the themes that run through the whole operation?" — which have no single answer passage, Microsoft GraphRAG showed that detecting communities in the graph and summarizing them systematically beats vector RAG when an LLM judge scores comprehensiveness and diversity (Edge et al., 2024). Communities come from optimizing modularity — Louvain (Blondel et al., 2008) and its successor Leiden (Traag et al., 2019), which is the one GraphRAG uses:

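$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$$

Here $A_{ij}$ is the adjacency matrix, $k_i$ is the degree of node $i$, $m$ is the number of edges (unweighted case), $c_i$ is the community of node $i$, and $\delta(c_i, c_j)$ is 1 when both nodes share a community and 0 otherwise.
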
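To see what the modularity score rewards, the sketch below evaluates Q directly for a given partition. It assumes an undirected, unweighted graph without self-loops and uses a hypothetical toy graph (two triangles joined by one edge). It only scores a partition; it is not an implementation of Louvain or Leiden, which search for partitions that maximize this quantity.

```python
def modularity(edges: list[tuple[str, str]], community: dict[str, str]) -> float:
    """Q for an undirected, unweighted graph without self-loops."""
    m = len(edges)
    degree: dict[str, int] = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # sum_ij A_ij * delta: every intra-community edge appears as (i, j) and (j, i).
    intra = 2 * sum(1 for u, v in edges if community[u] == community[v])
    # sum_ij k_i k_j * delta / 2m = sum over communities of (total degree)^2 / 2m.
    degree_by_community: dict[str, int] = {}
    for node, k in degree.items():
        c = community[node]
        degree_by_community[c] = degree_by_community.get(c, 0) + k
    expected = sum(k * k for k in degree_by_community.values()) / (2 * m)
    return (intra - expected) / (2 * m)


# Hypothetical graph: two triangles connected by the single edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

split = {"a": "A", "b": "A", "c": "A", "d": "B", "e": "B", "f": "B"}
lumped = {node: "A" for node in split}

q_split = modularity(edges, split)
q_lumped = modularity(edges, lumped)
print(f"two communities: {q_split:.3f}, one community: {q_lumped:.3f}")
```

Splitting along the bridge scores 5/14, while putting everything in one community scores 0, which is the sense in which the measure favors densely connected groups.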
| Figure | What it measures | Context |
| --- | --- | --- |
| 72–83% | Comprehensiveness wins | Graph vs vector RAG (LLM judge) |
| up to +20% | Multi-hop improvement | Graph + PageRank vs SOTA |
| 10–30× | Cheaper | Than iterative RAG, in a single step |

Source: Edge et al., 2024 (GraphRAG); Gutiérrez et al., 2024 (HippoRAG)

Intellectual honesty

The graph doesn't always win. On single-hop questions, or when literal conciseness is valued, vector RAG is enough and sometimes better (Edge et al., 2024). That's why we don't replace the vector index: we combine it with the graph and a precision reranker. Structure is used where it helps: relationships, multi-hop and the big picture.

How the landscape stacks up

The ecosystem is excellent at what it does, but almost all of it is built around the passage, not the relationship:

| Tool | What it is | Retrieval mechanism | Relational blind spot |
| --- | --- | --- | --- |
| Pinecone | Managed vector database | Vector similarity (+ hybrid) | No native notion of relationships |
| Weaviate | Vector database (not a graph DB) | Vector + BM25F | Cross-refs discouraged for deep traversal |
| LangChain | Orchestration framework | Delegates to the backend you plug in | No native relational retrieval of its own |
| LlamaIndex | Data framework for RAG | Vector + PropertyGraphIndex | The graph depends on LLM extraction |
| Neo4j | Graph database | Cypher + vector index | You must build and model the graph first |
| Microsoft GraphRAG | Graph pipeline | Graph + communities (Leiden) | Expensive, LLM-intensive indexing |
| Elastic / OpenSearch | Search engines | BM25 + kNN (RRF) | No relationship traversal across documents |

The point isn't that these tools are bad: they're superb building blocks. It's that serving a knowledge graph as the live ambient context of an operation is not the use case most of them were designed for.

How we plan to be the best

We don't compete on having the best vector index: we compete on understanding a company's operation better than anyone. That is where we focus our edge:

The usual approach

Index text. Knowledge is a corpus of documents; the graph, when it exists, is extracted after the fact with an LLM and goes stale.

BiVelio

The graph is the operation. Cases, tasks, customers and agents are already connected by real, live relationships — no need to reconstruct them with an LLM.
  1. Operational context, not just documents. Our graph isn't born from chunking PDFs: it's born from how the company actually works. That yields precise, up-to-date relationships, not inferred ones.
  2. Multi-hop and the big picture, built in. Personalized PageRank to retrieve coherent neighborhoods (Gutiérrez et al., 2024) and communities to reason at the right granularity (Edge et al., 2024) — the two modes the evidence rewards.
  3. Coherent context, not fragments. We retrieve the case plus its customer plus its related invoices, not three chunks that share a word. We attack "lost in the middle" (Liu et al., 2023) head-on by delivering less context but better connected.
  4. Precision and cost. We pair the graph with ephemeral reranking so only the best reaches the agent's window — the idea we develop in Ephemeral reranking.
  5. Governance and traceability. Every piece of context traces back to the graph: you can audit where a decision came from. In enterprise operations, that's not a nice-to-have, it's a requirement.
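
Points 3 and 5 above can be illustrated with a small sketch. It is a hypothetical example, not BiVelio's implementation: the edge triples, the `ContextItem` type and the `neighborhood_context` helper are all invented here. It gathers the one-hop neighborhood of a case and keeps, for every item, the edge that pulled it in, so each piece of context can be traced back to the graph.

```python
from dataclasses import dataclass
from typing import Optional

Edge = tuple[str, str, str]  # (source, relation, target)


@dataclass(frozen=True)
class ContextItem:
    node: str
    via: Optional[Edge]  # edge that justified including this node; None for the seed


def neighborhood_context(edges: list[Edge], seed: str,
                         max_items: int = 5) -> list[ContextItem]:
    """Collect the seed plus its direct neighbors, each with its provenance edge."""
    items = [ContextItem(seed, None)]
    seen = {seed}
    for src, relation, dst in edges:
        if len(items) >= max_items:
            break
        if src == seed and dst not in seen:
            other = dst
        elif dst == seed and src not in seen:
            other = src
        else:
            continue  # edge does not touch the seed, or neighbor already included
        seen.add(other)
        items.append(ContextItem(other, (src, relation, dst)))
    return items


# Hypothetical operational edges.
edges = [
    ("case:17", "belongs_to", "customer:acme"),
    ("invoice:204", "relates_to", "case:17"),
    ("invoice:204", "billed_to", "customer:acme"),
    ("case:99", "belongs_to", "customer:zeta"),
]

context = neighborhood_context(edges, seed="case:17")
for item in context:
    print(item.node, "<-", item.via)
```

The agent receives the case, its customer and the related invoice as one connected unit, and the `via` field answers the audit question of why each item was included.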

Note: the figures in this article come from the cited literature (Liu et al., Edge et al., Gutiérrez et al., Barnett et al., Wu et al.) and describe graph approaches in general. They are the motivation for our design, not a fixed product benchmark.

References

Barnett, S., Kurniawan, S., Thudumu, S., Brannelly, Z., & Abdelrazek, M. (2024). Seven Failure Points When Engineering a Retrieval Augmented Generation System. Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering (CAIN). https://arxiv.org/abs/2401.05856
Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008. https://doi.org/10.1088/1742-5468/2008/10/P10008
Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., Metropolitansky, D., Ness, R. O., & Larson, J. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv Preprint arXiv:2404.16130. https://arxiv.org/abs/2404.16130
Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., & Wang, H. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv Preprint arXiv:2312.10997. https://arxiv.org/abs/2312.10997
Gutiérrez, B. J., Shu, Y., Gu, Y., Yasunaga, M., & Su, Y. (2024). HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2405.14831
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems (NeurIPS), 33, 9459–9474.
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics (TACL). https://arxiv.org/abs/2307.03172
Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank Citation Ranking: Bringing Order to the Web (Techreport SIDL-WP-1999-0120). Stanford InfoLab. http://ilpubs.stanford.edu:8090/422/
Tang, Y., & Yang, Y. (2024). MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries. Conference on Language Modeling (COLM). https://arxiv.org/abs/2401.15391
Traag, V. A., Waltman, L., & van Eck, N. J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9, 5233. https://doi.org/10.1038/s41598-019-41695-z
Wu, Y., Zhu, J., Xu, S., Shum, K., Niu, C., Zhong, R., Song, J., & Zhang, T. (2024). RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). https://aclanthology.org/2024.acl-long.585/