From Unstructured Text to Causal Graphs: AI’s Role in Decoding Policy Impacts

Anna Alexandra Grigoryan
6 min read · Nov 2, 2024


When new policies are introduced, they often trigger chains of direct and indirect effects across various sectors. Take, for example, an environmental regulation mandating emissions reduction: this policy might directly increase compliance costs for manufacturers, but it could also indirectly influence supply chains, product pricing, and even employment rates. For policymakers, business leaders, and industry stakeholders, understanding these interconnected impacts is essential, yet analyzing them is far from straightforward.

In fields like epidemiology, Summary Causal Graphs (SCGs) are used to analyze exactly this kind of intricate system. SCGs allow researchers to model direct and indirect effects, such as how a vaccination might reduce infection rates while also influencing public health policies and behavior.

In this blog post, we will explore how SCGs can be adapted to analyze the impacts of policy changes on economic sectors. Parsing unstructured policy data and incorporating it into SCGs can reveal nuanced causal relationships that help clarify the broader effects of regulatory changes.

Why Summary Causal Graphs Are Ideal for Policy Impact Analysis

SCGs offer unique advantages for policy impact analysis by handling complexity, capturing feedback loops, and modeling effects that evolve over time. Here’s why SCGs are particularly valuable for understanding policy impacts:

1. Handling Partial Information: Unlike fully specified causal graphs, which require detailed knowledge of every variable and interaction, SCGs can function with partial data, an essential feature in policy analysis, where evidence is scattered across multiple sources and often locked away in thousands of pages of unstructured text.

2. Capturing Cyclic Dependencies: Policy impacts are rarely linear and often involve feedback loops. For example, a regulation that raises energy prices might initially increase manufacturing costs, prompting further regulatory adjustments as industries adapt. SCGs allow us to model these cyclical relationships.

3. Temporal Structure: Policies often have delayed or phased effects. SCGs support time-dependent analysis, enabling us to see both immediate and long-term policy outcomes, as sketched below.
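To make these properties concrete, here is a minimal sketch of how a summary causal graph with a feedback loop and time lags might be represented. It uses networkx, and the node names and lag values are purely illustrative assumptions, not drawn from any real policy model.

import networkx as nx

# Minimal illustrative SCG: nodes are policy and economic variables,
# and each edge carries an assumed time lag (e.g., in quarters).
scg = nx.DiGraph()
scg.add_edge("energy price regulation", "manufacturing costs", lag=1)
scg.add_edge("manufacturing costs", "regulatory adjustment", lag=4)
scg.add_edge("regulatory adjustment", "energy price regulation", lag=2)  # feedback loop

# The cycle printed here is exactly the dependency a strictly acyclic graph could not express.
print(list(nx.simple_cycles(scg)))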

Adapting SCG Techniques for Policy Analysis: Why Large Language Models Help

To apply SCGs to policy analysis, we need to transform unstructured policy text into structured data by identifying key entities (e.g., “emissions cap,” “compliance cost”), themes, and relationships. This is where LLMs become essential. Here’s why LLMs can be particularly effective for parsing policy documents:

1. Contextual Interpretation: LLMs capture both explicit statements and broader implications in policy text, interpreting directives and impacts. For example, they can identify that a mandated “emissions reduction” implies increased compliance costs for manufacturers.

2. Identifying Implicit Relationships: Policies often contain implied relationships that are critical for SCG construction. LLMs can understand these nuances, such as recognizing that a “20% emissions reduction” mandate will likely affect “compliance costs” for the manufacturing sector.

3. Maintaining Thematic Consistency: LLMs can parse entities and relationships within their thematic context, ensuring data consistency, which is essential for building a cohesive SCG.

Using the Chat Completions API to Parse Policy Text

To leverage the LLM’s full potential, we use a system prompt and a user prompt via the Chat Completions API to guide the model towards producing structured, consistent output. Here’s how we set up these prompts.

System Prompt

The system prompt provides context, instructing the model to extract entities, themes, and relationships in a consistent structure:

system_prompt = """You are a sophisticated language model designed to analyze complex regulatory documents. Your task is to read paragraphs from policy texts and extract entities, themes, and relationships between them in a structured format.

Focus on identifying:
- Entities (e.g., policies, sectors, economic impacts)
- Themes (e.g., regulatory goals, compliance requirements)
- Relationships (e.g., causal or implied links between entities, including directives or mandates)

Please respond using the following JSON structure to maintain consistency:

{
  "entities": [
    {"name": "<entity_name>", "type": "<entity_type>"},
    ...
  ],
  "themes": [
    "<theme_1>",
    ...
  ],
  "relationships": [
    {"source": "<source_entity>", "target": "<target_entity>", "type": "<relationship_type>"}
  ]
}

Provide accurate and contextually relevant details based on the input paragraph."""

User Prompt

The user prompt contains the specific policy text to analyze and instructs the LLM to apply the system prompt’s structure to this text.

user_prompt = """Please analyze the following policy paragraph and extract entities, themes, and relationships as defined. Make sure to follow the structure outlined in the system prompt. Here is the paragraph:

'The new emissions regulation mandates a 20% reduction in carbon output across the manufacturing sector. Compliance costs are expected to increase, and companies failing to meet the standard by 2025 will face substantial fines. This policy aims to align manufacturing practices with international environmental standards and may influence production capacities in the near term.'"""

API Call with Prompts

Here’s how we would set up the Chat Completions API call with these prompts:

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
)

print(completion.choices[0].message.content)

Sample Output from the LLM

With this API call, here’s an example of the structured output we receive from the LLM:

  "entities": [
{"name": "emissions regulation", "type": "policy"},
{"name": "carbon output", "type": "environmental metric"},
{"name": "manufacturing sector", "type": "industry"},
{"name": "compliance costs", "type": "economic impact"},
{"name": "substantial fines", "type": "penalty"},
{"name": "international environmental standards", "type": "benchmark"}
],
"themes": [
"emissions reduction mandate",
"compliance by 2025",
"alignment with international standards",
"production capacity impacts"
],
"relationships": [
{"source": "emissions regulation", "target": "manufacturing sector", "type": "mandate"},
{"source": "emissions regulation", "target": "compliance costs", "type": "implied increase"},
{"source": "manufacturing sector", "target": "production capacities", "type": "potential reduction"},
{"source": "emissions regulation", "target": "substantial fines", "type": "penalty for non-compliance"}
]
}

This structured data provides all necessary components for SCG construction: entities (nodes), themes (context), and relationships (edges).

Building an SCG for Policy Analysis: Step-by-Step

With structured output from the LLM, we can construct an SCG that models direct and indirect policy impacts.
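Before building the graph, the model’s JSON reply from the API call above needs to be parsed into a Python dictionary. A minimal sketch, assuming the completion object from the earlier call and that the model returned valid JSON in the requested structure:

import json

# Parse the LLM's JSON reply into the dictionary used in the steps below.
structured_output = json.loads(completion.choices[0].message.content)

print(structured_output["entities"][0])  # e.g. {"name": "emissions regulation", "type": "policy"}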

Step 1: Vectorizing Entities and Relationships for Contextual Similarity

To capture semantic meaning, each extracted entity, theme, and relationship is vectorized using embeddings. This enables semantic clustering, helping us group thematically similar nodes in the SCG.

Example Code for Vectorization:

import json

def vectorize_text(text):
    # Embed a piece of text with the OpenAI embeddings endpoint.
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text
    )
    return response.data[0].embedding

# Vectorize entities and relationships from the parsed LLM output
entities_vector = vectorize_text(json.dumps(structured_output['entities']))
relationships_vector = vectorize_text(json.dumps(structured_output['relationships']))

print("Entities Vector:", entities_vector)
print("Relationships Vector:", relationships_vector)

Step 2: Defining Nodes and Creating Causal Edges

1. Define Nodes: Nodes represent key elements such as policies and economic impacts. For example, “emissions regulation” and “compliance costs” from the LLM output become nodes.

2. Create Causal Edges: Based on vector similarity scores and the LLM-defined relationships, we add edges between nodes with significant causal relationships. For instance, if “emissions regulation” and “compliance costs” have high vector similarity, we create an edge between them, as sketched below.
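Here is a minimal sketch of this step. It assumes the structured_output dictionary and the vectorize_text helper from Step 1, and the 0.8 similarity threshold is an illustrative choice rather than a recommended value:

import numpy as np
import networkx as nx

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scg = nx.DiGraph()

# Nodes come directly from the LLM-extracted entities.
for entity in structured_output["entities"]:
    scg.add_node(entity["name"], type=entity["type"])

# Edges come from the LLM-defined relationships, filtered by vector similarity.
for rel in structured_output["relationships"]:
    src_vec = vectorize_text(rel["source"])
    tgt_vec = vectorize_text(rel["target"])
    if cosine_similarity(src_vec, tgt_vec) > 0.8:
        scg.add_edge(rel["source"], rel["target"], type=rel["type"])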

Step 3: Adding Temporal Structure for Phased Analysis

Policies often have phased impacts, so we add temporal nodes to capture short-term and long-term effects.
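Continuing the networkx sketch from the previous step, one way to encode this is to annotate edges with a phase and a lag; the labels and lag values below are illustrative assumptions:

# Annotate edges with an assumed phase and lag so short-term and
# long-term effects can be queried separately.
scg.add_edge("emissions regulation", "compliance costs", phase="short_term", lag=0)
scg.add_edge("emissions regulation", "production capacities", phase="long_term", lag=2)

long_term_effects = [
    (u, v) for u, v, d in scg.edges(data=True) if d.get("phase") == "long_term"
]
print(long_term_effects)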

Step 4: Simulating Interventions Using Do-Calculus

With our SCG, we can use do-calculus to simulate policy interventions, such as activating or adjusting a policy, to measure direct and indirect outcomes.

Example Code: Causal Modeling with dowhy

import pandas as pd
from dowhy import CausalModel

# Example (toy) data for the SCG
data = pd.DataFrame({
    "policy": [True, False, True, False],        # Policy active / inactive (boolean treatment, as dowhy's propensity-score estimators expect)
    "energy_price": [52, 48, 53, 50],            # Mediating factor (energy prices)
    "manufacturing_output": [95, 105, 98, 110]   # Target outcome (manufacturing output)
})

# Define Causal Model (energy prices sit on a path between the policy and the outcome)
model = CausalModel(
    data=data,
    treatment="policy",
    outcome="manufacturing_output",
    graph="digraph {policy -> energy_price; policy -> manufacturing_output; energy_price -> manufacturing_output;}"
)

# Identify and estimate the policy's effect on manufacturing output
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.propensity_score_matching"
)

print("Estimated effect of the policy:", estimate.value)

Wrapping up

By adapting SCGs to analyze policy impacts, we gain a powerful framework for understanding how regulatory changes affect economic sectors.

LLMs play a critical role in parsing unstructured policy text, allowing us to extract entities, themes, and relationships with precision.

By combining LLMs, vector embeddings, and causal inference techniques, this approach provides insights into the direct and indirect effects of policies, enabling stakeholders to make data-informed decisions.
