Knowledge graphs combine structured data with LLM reasoning, enabling powerful question-answering and discovery capabilities. This chapter covers building and querying knowledge graphs with AI.
Entity and Relationship Extraction
Basic Entity Extraction
Coreference Resolution
Neo4j Integration
Building a Knowledge Graph
GraphRAG Pattern
Entity Linking and Resolution
Graph-Based Question Answering
Incremental Graph Building
Knowledge Graph Best Practices
- Define clear entity and relationship schemas upfront
- Use entity linking to avoid duplicates
- Combine graph queries with vector search for best results
- Decompose complex questions into graph-traversable steps
- Regularly merge and clean duplicate entities
Practice Exercise
Build a knowledge graph system that:
- Extracts entities and relationships from documents
- Links mentions to canonical entities
- Supports natural language queries
- Combines graph and vector retrieval
- Handles incremental updates
- Accurate entity extraction
- Proper relationship typing
- Efficient graph queries
- Clear answer synthesis from graph data
Interview Deep-Dive
When would you use a knowledge graph plus LLM (GraphRAG) instead of standard vector-based RAG? What problem does the graph solve that embeddings cannot?
Strong Answer:
- Standard vector RAG excels at finding semantically similar text chunks, but it has no representation of the relationships between entities. If you ask “Which companies that Microsoft invested in are headquartered in San Francisco?”, vector RAG retrieves chunks that mention Microsoft, chunks about investments, and chunks about San Francisco. It cannot traverse the actual relationship chain from Microsoft through investment relationships to companies and then filter by headquarters location. It is doing approximate semantic matching, not relational reasoning.
- A knowledge graph stores entities and their explicit relationships as structured triples (subject-predicate-object). When you layer an LLM on top to generate Cypher or SPARQL queries from natural language, you get precise multi-hop reasoning. The graph traversal guarantees you follow actual documented relationships, not inferred similarity. This is critical for questions involving: multi-hop reasoning (A invested in B, B is located in C), aggregation (how many companies did X acquire), temporal reasoning (what happened before/after event Y), and negation (which entities are NOT connected to X).
- In practice, I use GraphRAG when the domain has rich entity relationships that users need to explore: organizational hierarchies, supply chain networks, compliance and regulatory relationships, medical knowledge bases (drug-gene-disease interactions). I use standard vector RAG when the primary need is finding relevant passages from unstructured text without needing to reason about entity relationships.
- The best production systems combine both. The graph provides structured relational context (entity properties, verified relationships), while vector search provides relevant text passages that add nuance and detail the graph does not capture. I build the final prompt from both sources, and the LLM synthesizes them into a coherent answer.
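The multi-hop question from the answer above can be sketched with a tiny in-memory triple store. This is only an illustration of graph traversal, not a substitute for Neo4j or Cypher; the company names other than Microsoft, the predicate names, and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); not real-world data.
TRIPLES = [
    ("Microsoft", "INVESTED_IN", "AcmeAI"),
    ("Microsoft", "INVESTED_IN", "BetaRobotics"),
    ("AcmeAI", "HEADQUARTERED_IN", "San Francisco"),
    ("BetaRobotics", "HEADQUARTERED_IN", "Seattle"),
    ("GammaSoft", "HEADQUARTERED_IN", "San Francisco"),
]


def build_index(triples):
    """Index triples as subject -> predicate -> set of objects."""
    index = defaultdict(lambda: defaultdict(set))
    for subj, pred, obj in triples:
        index[subj][pred].add(obj)
    return index


def investments_in_city(index, investor, city):
    """Two-hop query: investor -INVESTED_IN-> company -HEADQUARTERED_IN-> city."""
    results = []
    for company in sorted(index[investor]["INVESTED_IN"]):
        if city in index[company]["HEADQUARTERED_IN"]:
            results.append(company)
    return results


index = build_index(TRIPLES)
print(investments_in_city(index, "Microsoft", "San Francisco"))
```

GammaSoft is headquartered in San Francisco but is excluded because no investment edge connects it to Microsoft. A similarity search over text chunks has no equivalent of that exclusion, since it never follows an edge.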
You are building a knowledge graph from 10,000 documents. How do you handle entity resolution -- the same entity appearing with different names across documents?
Strong Answer:
- Entity resolution (also called entity deduplication or record linkage) is the problem of recognizing that “Tim Cook,” “Timothy Cook,” “Apple CEO,” and “Cook” all refer to the same person across different documents. This is arguably harder than the initial extraction because it requires global reasoning across the entire corpus, not just within a single document.
- My pipeline has three stages. Stage one is coreference resolution within each document: before extracting entities, I run the text through an LLM-based coreference resolver that replaces pronouns and references (“he,” “the company,” “its CEO”) with the actual entity names. This dramatically improves extraction quality because the entity extractor sees unambiguous references.
- Stage two is canonical name assignment during ingestion. I maintain a knowledge base of known entities with aliases. When a new entity is extracted, I check it against existing entries using both string similarity (fuzzy matching with Levenshtein distance) and semantic similarity (embedding-based). If a match exceeds 0.9 similarity, I link to the existing entity. If it is between 0.7 and 0.9, I flag it for review. Below 0.7, it gets created as a new entity.
- Stage three is periodic deduplication. After ingesting a batch of documents, I run a merge pass that uses an LLM to evaluate potential duplicates in bulk. I present groups of similar entity names and ask the model to identify which ones refer to the same real-world entity. This catches cases like “JPMorgan” and “JP Morgan Chase” that string similarity might miss. The merge operation in Neo4j transfers all relationships from the duplicate to the canonical entity, then deletes the duplicate node.
- The key lesson I learned is that entity resolution is never “done.” Every new document batch can introduce new aliases. I run the deduplication pass weekly and track merge counts as a quality metric — if merges spike after an ingestion batch, the extraction prompt or the alias list needs updating.
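The three-way decision in stage two (link, review, or create) might look like the sketch below. It makes several assumptions: `difflib.SequenceMatcher` from the standard library stands in for Levenshtein distance, the embedding-based similarity is omitted entirely, and the function names and the shape of `known_entities` are hypothetical.

```python
from difflib import SequenceMatcher

LINK_THRESHOLD = 0.9    # above this: link to the existing entity
REVIEW_THRESHOLD = 0.7  # from here up to LINK_THRESHOLD: flag for review


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (a stand-in for Levenshtein)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(mention, known_entities):
    """Return (decision, canonical_name, score) for a newly extracted mention.

    known_entities maps each canonical name to a list of known aliases.
    """
    best_name, best_score = None, 0.0
    for canonical, aliases in known_entities.items():
        for candidate in [canonical, *aliases]:
            score = name_similarity(mention, candidate)
            if score > best_score:
                best_name, best_score = canonical, score
    if best_score > LINK_THRESHOLD:
        return "link", best_name, best_score
    if best_score >= REVIEW_THRESHOLD:
        return "review", best_name, best_score
    return "create", None, best_score


known = {"Tim Cook": ["Timothy Cook"]}
for mention in ["Timothy Cook", "T. Cook", "Zebra Corp"]:
    decision, canonical, score = resolve(mention, known)
    print(mention, decision, canonical, round(score, 2))
```

A mention such as “Apple CEO” shares almost no characters with “Tim Cook”, so string similarity alone would create a new entity for it. That gap is why the pipeline described above pairs string matching with embedding similarity and a later LLM-based merge pass.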
An LLM generates a Cypher query from a user's natural language question, but the query returns wrong results. How do you debug this?
Strong Answer:
- There are three failure points in the text-to-Cypher pipeline: the LLM misunderstood the question, the LLM generated syntactically incorrect Cypher, or the Cypher is valid but queries the wrong part of the graph because the model has an incorrect mental model of the schema.
- First, I log everything: the original question, the generated Cypher, the query results, and the final synthesized answer. This lets me pinpoint which stage failed. If the Cypher is syntactically wrong (parsing error from Neo4j), the fix is usually adding more schema context to the generation prompt. I include not just node labels and relationship types but also sample property names and example queries. I found that including 3-5 example question-to-Cypher pairs in the prompt (few-shot) reduces syntax errors by about 60% compared to zero-shot.
- If the Cypher is syntactically valid but returns the wrong results, I compare the query logic against the question. Common mistakes: the model confuses relationship direction (uses `(a)-[:WORKS_FOR]->(b)` when the graph stores it as `(b)-[:EMPLOYS]->(a)`), uses the wrong relationship type (the model guesses `INVESTED_IN` but the graph uses `HAS_INVESTMENT`), or applies the wrong filter (matches on entity type instead of a property value).
- My fix for schema confusion is to include the actual schema dump in the prompt: not just labels but `CALL db.schema.visualization()` output showing exactly which relationships connect which node types, and sample property values. I also implement a validation step: after generating the Cypher, I parse it and check that all referenced labels and relationship types actually exist in the schema before executing. If they do not, I send the error back to the LLM with the list of valid types and let it retry.
- For production systems, I maintain a curated set of question-to-Cypher examples that cover the main query patterns and use them as few-shot examples. When a new failure pattern emerges, I add the corrected example to the set.
- I also pass user-supplied values as query parameters (`$name` instead of string interpolation). This does not fully prevent injection, since the query structure itself is LLM-generated, but it blocks the most common injection vectors.
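The pre-execution validation step described in this answer could be approximated as follows. This is a regex-based heuristic rather than a real Cypher parser, so it can miss or misread unusual syntax. The `SCHEMA` contents and both function names are hypothetical; in a live system the valid names would more likely be read from the database (for example via `CALL db.labels()` and `CALL db.relationshipTypes()`).

```python
import re

# Hypothetical schema for illustration only.
SCHEMA = {
    "labels": {"Person", "Company", "City"},
    "relationship_types": {"EMPLOYS", "HAS_INVESTMENT", "HEADQUARTERED_IN"},
}


def extract_schema_refs(cypher):
    """Heuristically pull node labels and relationship types out of a Cypher string."""
    labels, rel_types = set(), set()
    # Relationship patterns live in square brackets: [:TYPE], [r:A|B], [:A*1..3]
    for body in re.findall(r"\[([^\]]*)\]", cypher):
        head = re.split(r"[{*]", body, maxsplit=1)[0]  # drop properties / hop ranges
        rel_types.update(re.findall(r"[:|]\s*:?\s*([A-Za-z_]\w*)", head))
    # Node patterns live in parentheses: (n:Label), (:Label {prop: $value})
    for body in re.findall(r"\(([^()]*)\)", cypher):
        head = body.split("{", 1)[0]  # drop the property map
        if re.fullmatch(r"\s*\w*\s*(?::\s*[A-Za-z_]\w*\s*)+", head):
            labels.update(re.findall(r":\s*([A-Za-z_]\w*)", head))
    return labels, rel_types


def validate(cypher, schema):
    """Return error messages for labels or relationship types missing from the schema."""
    labels, rel_types = extract_schema_refs(cypher)
    errors = [f"Unknown label: {name}" for name in sorted(labels - schema["labels"])]
    errors += [
        f"Unknown relationship type: {name}"
        for name in sorted(rel_types - schema["relationship_types"])
    ]
    return errors


generated = "MATCH (p:Person)-[:WORKS_FOR]->(c:Company {name: $name}) RETURN p.name"
print(validate(generated, SCHEMA))
```

An empty list means the query only references known names and can be executed. A non-empty list can be sent back to the model together with the valid labels and relationship types for a retry.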