Automating Knowledge Graph Creation and Validation with Large Language Models
Knowledge Graphs (KGs) are key to structuring and reasoning over complex data. They enable applications like semantic search, recommendation systems, and automated decision-making across industries such as healthcare, legal systems, and e-commerce. However, building KGs involves two critical steps:
- Ontology creation: Defining the schema for entities, attributes, and relationships.
- Population and Validation: Extracting entities and relationships from unstructured text and ensuring they conform to the ontology.
Traditionally, these tasks require domain expertise and are labor-intensive. But with Large Language Models (LLMs), we can now automate:
• Ontology generation based on domain descriptions.
• RDF triple extraction from text.
• Validation and refinement to ensure data aligns with the ontology.
Inspired by the research on Accelerating Knowledge Graph and Ontology Engineering with Large Language Models, this post outlines an end-to-end pipeline for automating KG creation using LLMs, including:
- Generating ontologies dynamically.
- Importing and validating ontology schemas.
- Populating knowledge graphs from unstructured text.
- Refining errors iteratively to ensure quality.
Step 1: Generating Ontologies with LLMs
Ontologies define the structure of a KG by specifying:
• Classes: Categories of entities (e.g., Patient, Doctor).
• Data Properties: Attributes of classes (e.g., hasName, hasAge).
• Object Properties: Relationships between classes (e.g., treatedBy).
Instead of manually defining these, we can use LLMs to dynamically generate ontologies from a natural language description of the domain.
Workflow
1. System Prompt: Instructs the LLM to generate an ontology in OWL format.
2. User Input: Domain-specific requirements.
3. Output: An OWL ontology file in XML or Turtle format.
Code: Ontology Generation
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    # api_key="...",  # if you prefer to pass the API key directly instead of using env vars
    # base_url="...",
    # organization="...",
    # other params...
)
# Define system prompt
system_prompt = """
You are an expert in ontology engineering. Generate an OWL ontology based on the following domain description:
Define classes, data properties, and object properties.
Include domain and range for each property.
Provide the output in OWL (XML) format."""
# Function to generate an ontology from a domain description
def generate_ontology(domain_description):
    prompt = f"Domain description: {domain_description}\nGenerate OWL ontology."
    response = llm.invoke([
        ("system", system_prompt),
        ("human", prompt),
    ])
    return response.content
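To wire this into the rest of the pipeline, the returned OWL string can be saved to disk so that Step 2 can load it. A minimal usage sketch (the file name healthcare.owl is our choice, matching the import step below):

# Generate the ontology for the healthcare example and save it for Step 2
domain_description = """A healthcare domain where:
- Patients have a name, age, and gender.
- Patients can have conditions and are treated by doctors.
- Conditions have a name.
- Doctors have a name and specialty."""

owl_xml = generate_ontology(domain_description)
with open("healthcare.owl", "w") as f:
    f.write(owl_xml)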
Example
Input: Domain Description
A healthcare domain where:
- Patients have a name, age, and gender.
- Patients can have conditions and are treated by doctors.
- Conditions have a name.
- Doctors have a name and specialty.
Output: OWL Ontology (Excerpt)
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <owl:Class rdf:ID="Patient"/>
  <owl:Class rdf:ID="Condition"/>
  <owl:Class rdf:ID="Doctor"/>
  <owl:DatatypeProperty rdf:ID="hasName">
    <rdfs:domain rdf:resource="#Patient"/>
    <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
  </owl:DatatypeProperty>
  <owl:ObjectProperty rdf:ID="treatedBy">
    <rdfs:domain rdf:resource="#Patient"/>
    <rdfs:range rdf:resource="#Doctor"/>
  </owl:ObjectProperty>
</rdf:RDF>
Step 2: Importing and Validating Ontology Schemas
Once the ontology is generated, it can be loaded programmatically so that its classes and properties define the constraints used for validation later in the pipeline.
Code: Importing an OWL Ontology
from owlready2 import get_ontology

# Load the dynamically generated ontology
ontology_path = "healthcare.owl"  # Replace with the path to your OWL file
ontology = get_ontology(ontology_path).load()

# Print the ontology structure
print("Classes:")
for cls in ontology.classes():
    print(cls)

print("\nProperties:")
for prop in ontology.properties():
    print(f"{prop}: Domain={prop.domain}, Range={prop.range}")
Example Output:
Classes:
healthcare.owl.Patient
healthcare.owl.Condition
healthcare.owl.Doctor
Properties:
healthcare.owl.hasName: Domain=[healthcare.owl.Patient], Range=[xsd:string]
healthcare.owl.treatedBy: Domain=[healthcare.owl.Patient], Range=[healthcare.owl.Doctor]
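To feed this schema to the LLM in Step 3, it helps to flatten the loaded ontology into plain text. A small sketch, assuming the ontology object from above (summarize_schema is our own helper, not part of owlready2):

# Hypothetical helper: render the loaded ontology as a compact text schema
def summarize_schema(ontology):
    class_lines = [str(cls) for cls in ontology.classes()]
    prop_lines = [
        f"{prop.name}: domain={prop.domain}, range={prop.range}"
        for prop in ontology.properties()
    ]
    return "Classes:\n" + "\n".join(class_lines) + "\nProperties:\n" + "\n".join(prop_lines)

ontology_schema = summarize_schema(ontology)  # passed to the LLM prompt in Step 3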
Step 3: Populating the Knowledge Graph
The next step involves extracting entities and relationships from unstructured text and converting them into RDF triples that conform to the ontology.
Workflow
1. Input: Unstructured text describing entities and relationships.
2. LLM Output: RDF triples in Turtle format.
3. Validation: Check triples against the ontology schema.
Code: Extracting RDF Triples
# Function to generate RDF triples using the LLM
def generate_rdf_triples(text_input, ontology_schema):
    system_prompt = f"""
    Extract RDF triples from the following text in Turtle format, adhering to the ontology:
    - Patient: hasName, hasAge, hasGender, hasCondition, treatedBy.
    - Doctor: hasName.
    - Condition: hasName.
    Ontology: {ontology_schema}
    """
    user_prompt = f"Text: {text_input}\nGenerate RDF triples in Turtle format."
    response = llm.invoke([
        ("system", system_prompt),
        ("human", user_prompt),
    ])
    return response.content
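Calling the function with the schema summary from Step 2 might look like this (variable names follow the sketches above):

text_input = "Patient John Doe, aged 45, was diagnosed with diabetes and treated by Dr. Smith."
rdf_output = generate_rdf_triples(text_input, ontology_schema)
print(rdf_output)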
Example
Input:
- Patient John Doe, aged 45, was diagnosed with diabetes and treated by Dr. Smith.
- Ontology Schema.
Output RDF (Turtle Format):
@prefix : <http://example.org/healthcare.owl#> .

:JohnDoe a :Patient ;
    :hasName "John Doe" ;
    :hasAge 45 ;
    :hasCondition :Diabetes ;
    :treatedBy :DrSmith .

:Diabetes a :Condition .

:DrSmith a :Doctor ;
    :hasName "Dr. Smith" .
Step 4: Validating and Refining RDF Data
The generated RDF triples must be validated against the ontology to ensure:
• Domain constraints: The subject of a property matches its defined class.
• Range constraints: The object of a property matches its defined class or datatype.
Code: Validating RDF Triples
from rdflib import Graph, Literal
from rdflib.namespace import RDF

def validate_rdf(rdf_data, ontology):
    g = Graph()
    g.parse(data=rdf_data, format="turtle")
    errors = []
    for s, p, o in g:
        if p == RDF.type:  # class assertions are not ontology properties
            continue
        prop_name = p.split("#")[-1]
        ontology_prop = getattr(ontology, prop_name, None)
        if not ontology_prop:
            errors.append(f"Property '{prop_name}' not found in ontology.")
        elif isinstance(o, Literal) and o.datatype is None and str not in ontology_prop.range:
            # A plain string literal where the ontology expects a different range
            errors.append(f"Range Error: {p} expects {ontology_prop.range}, but found a string.")
    return errors
# If validation fails, refine the triples
def refine_rdf(rdf_data, feedback):
    refinement_prompt = f"""
    The following RDF output has errors:
    {rdf_data}
    Errors: {feedback}
    Refine the RDF triples to fix these issues while adhering to the ontology schema.
    """
    response = llm.invoke([
        # A refinement-specific system message; the ontology-generation prompt above is not suited here
        ("system", "You are an expert in RDF and ontology-conformant knowledge graphs."),
        ("human", refinement_prompt),
    ])
    return response.content
Iterative Refinement Workflow
1. Validate the generated RDF triples.
2. If errors are found, pass feedback to the LLM for refinement.
3. Repeat until validation passes (a minimal loop is sketched below).
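Putting the helpers together, a minimal version of this loop might look like the following (MAX_ITERATIONS is an assumed cap, not part of the original workflow):

MAX_ITERATIONS = 3  # assumed cap so refinement cannot loop forever

rdf_output = generate_rdf_triples(text_input, ontology_schema)
for _ in range(MAX_ITERATIONS):
    errors = validate_rdf(rdf_output, ontology)
    if not errors:
        break
    rdf_output = refine_rdf(rdf_output, errors)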
Wrapping up
This workflow demonstrates how LLMs can automate every stage of knowledge graph creation:
1. Ontology Generation: Automatically define schemas for entities and relationships.
2. Knowledge Graph Population: Extract entities and relationships as RDF triples from text.
3. Validation and Refinement: Dynamically validate and refine triples against ontology constraints.
Automating these traditionally manual tasks accelerates KG creation and ensures consistency across datasets. By combining LLMs with dynamic validation, organizations can deploy high-quality knowledge graphs at scale.