From Proactive Agents in Text-to-Image Generation to Conversational AI: A Technical Insight
Current AI-powered systems often fail when prompts are ambiguous or incomplete. The Google DeepMind paper “Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty” (Hahn et al., 2024) addresses this issue with proactive question-asking agents powered by belief graphs. In this post, we dissect their approach and extend it to a broader, more technical application: Conversational AI involving multiple agents, API orchestration, and context tracking.
Summary of the Paper: Proactive Agents for T2I
Core Problem: Prompt Underspecification
Text-to-image models struggle with incomplete prompts. Descriptions like “A rabbit wearing a hat in a field” leave key details unspecified (e.g., the color of the rabbit, the style of the hat). Standard models fill in these gaps on their own, often producing images that miss the user’s intent.
Proposed Solution: Proactive Agents with Belief Graphs
The solution is an agent that proactively asks clarifying questions, building and updating a belief graph, which represents the agent’s internal understanding of entities, attributes, and relationships extracted from prompts.
Technical Breakdown: How It Works
1. Belief Graph Initialization: Extract entities (e.g., “rabbit,” “hat”), assign attributes (e.g., “color,” “size”), define relations (e.g., “wearing”), and attach uncertainty scores where details are unclear (a code sketch of this step follows the list).
2. Question Generation: Select the most uncertain elements and ask targeted questions like “What color is the rabbit?”, using an LLM to formulate the questions (or a rule-based system for simpler prompts).
3. Prompt Expansion: As user responses arrive, update the belief graph and fold the new details into an expanded prompt.
4. Image Generation: Generate the final prompt from the completed belief graph and submit it to a T2I model like Imagen.
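To make steps 1 and 2 concrete, here is a minimal Python sketch of how such a belief graph might be represented and queried. The dataclasses and the `most_uncertain` selection heuristic are illustrative assumptions, not the paper’s actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Attribute:
    name: str                 # e.g., "color"
    value: str | None = None  # stays None until the user clarifies
    uncertainty: float = 1.0  # 1.0 = fully unknown, 0.0 = resolved

@dataclass
class Entity:
    name: str
    attributes: dict[str, Attribute] = field(default_factory=dict)

def init_belief_graph():
    """Initial graph for 'A rabbit wearing a hat in a field'."""
    entities = {
        "rabbit": Entity("rabbit", {"color": Attribute("color")}),
        "hat": Entity("hat", {"style": Attribute("style")}),
        "field": Entity("field"),
    }
    relations = [("rabbit", "wearing", "hat"), ("rabbit", "in", "field")]
    return entities, relations

def most_uncertain(entities):
    """Pick the unresolved (entity, attribute) pair with the highest uncertainty."""
    candidates = [
        (attr.uncertainty, entity.name, attr.name)
        for entity in entities.values()
        for attr in entity.attributes.values()
        if attr.value is None
    ]
    if not candidates:
        return None
    _, entity_name, attr_name = max(candidates)
    return entity_name, attr_name

entities, relations = init_belief_graph()
if (target := most_uncertain(entities)):
    entity_name, attr_name = target
    print(f"What {attr_name} is the {entity_name}?")  # "What color is the rabbit?"
```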
Key Insights from the Paper
- Belief Graphs as a Modular Framework: Because the graph decouples state tracking from the underlying models, individual components (the question generator, the T2I backbone) can be upgraded without redesigning the system.
- Explainability and Interpretability: Transparent decision-making through graph representation.
- Significant Improvements in Performance: T2I generation improves when users interact through multi-turn question-asking agents.
Extending the Model to Conversational AI
While the original system targets T2I generation, its principles — belief tracking, uncertainty-driven questioning, and modular task execution — apply directly to Conversational AI involving complex workflows and API calls.
Adapting Core Technical Concepts
1. Belief Graph for Context Tracking
In Conversational AI, a belief graph can potentially track (see the node sketch after this list):
- Entities: Tasks, user requests, API functions.
- Attributes: Parameters like priority, deadlines, user preferences.
- Relations: Dependencies between tasks (e.g., “send email after generating report”).
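As a sketch of what one node in such a graph might look like, the following `TaskNode` schema is a hypothetical design, assuming resolved parameters, unresolved gaps, and dependencies are all first-class fields:

```python
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    """One node in a conversational belief graph (illustrative schema)."""
    name: str
    params: dict = field(default_factory=dict)      # resolved parameters
    missing: set = field(default_factory=set)       # parameters still uncertain
    depends_on: list = field(default_factory=list)  # relations to other tasks

    @property
    def ready(self) -> bool:
        """A task is executable only once nothing is missing."""
        return not self.missing

# "Send email after generating report": the dependency is an explicit relation.
graph = {
    "generate_report": TaskNode("generate_report", params={"period": "Q4"}),
    "send_email": TaskNode(
        "send_email",
        params={"recipients": "team"},
        missing={"body"},                # the body comes from the report
        depends_on=["generate_report"],
    ),
}
print(graph["send_email"].ready)  # False -> the orchestrator must wait
```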
2. Multi-Agent System for Task Execution
Agent-Orchestrator Model: Each task-specific agent performs an atomic task, while an Orchestrator manages dependencies based on the belief graph (a minimal dispatch sketch follows the example agents below).
Example Agents:
- Scheduling Agent: Books meetings.
- Email Summarization Agent: Summarizes email threads.
- Data Retrieval Agent: Fetches knowledge base results.
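A minimal sketch of the dispatch pattern, assuming a hypothetical `Agent` contract and stubbed-out agents in place of real calendar and email APIs:

```python
from typing import Protocol

class Agent(Protocol):
    """Minimal contract every task-specific agent satisfies (assumed interface)."""
    name: str
    def run(self, params: dict) -> dict: ...

class SchedulingAgent:
    name = "scheduling"
    def run(self, params: dict) -> dict:
        # A real agent would call a calendar API here; we return a stub result.
        return {"meeting": f"{params['participant']} at {params['time']}"}

class EmailSummarizationAgent:
    name = "email_summarization"
    def run(self, params: dict) -> dict:
        # Stub: a real agent would fetch and summarize the named thread.
        return {"summary": f"Summary of thread '{params['thread']}'"}

class Orchestrator:
    """Routes atomic tasks to agents; sequencing comes from the belief graph."""
    def __init__(self, agents):
        self.agents = {agent.name: agent for agent in agents}

    def dispatch(self, agent_name: str, params: dict) -> dict:
        result = self.agents[agent_name].run(params)
        # In a full system, the result would be written back to the belief graph.
        return result

orc = Orchestrator([SchedulingAgent(), EmailSummarizationAgent()])
print(orc.dispatch("scheduling", {"participant": "Anna", "time": "15:00"}))
```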
3. Proactive Question-Asking
Ask clarifying questions when the belief graph is uncertain (a template-based sketch follows this list):
- “What time should the meeting be scheduled?”
- “Which email thread should I summarize?”
- Use dynamic queries derived from incomplete belief graph nodes.
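One plausible way to derive those queries, assuming a hypothetical template table with a generic fallback (a production system might prompt an LLM to phrase the question instead):

```python
# Hypothetical question templates keyed by (task, parameter).
QUESTION_TEMPLATES = {
    ("schedule_meeting", "time"): "What time should the meeting be scheduled?",
    ("summarize_email", "thread"): "Which email thread should I summarize?",
}

def next_question(task: str, missing_params: set) -> str | None:
    """Turn the first unresolved belief-graph field into a clarifying question."""
    for param in sorted(missing_params):
        # Prefer a curated template; otherwise fall back to generic phrasing.
        return QUESTION_TEMPLATES.get(
            (task, param), f"Could you specify the {param} for '{task}'?"
        )
    return None  # nothing missing: no question needed

print(next_question("schedule_meeting", {"time"}))
# -> "What time should the meeting be scheduled?"
```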
4. State Transitions and API Orchestration
- State Updates: After every agent action, the belief graph updates.
- API Call Management: Agents call external APIs only when relevant data is complete. If gaps remain, the system asks follow-up questions or suggests fixes.
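A sketch of this gating logic, with a hypothetical `maybe_call_api` helper standing in for a real endpoint call:

```python
def maybe_call_api(task: str, params: dict, required: set) -> dict:
    """Gate the external call on belief-graph completeness (illustrative)."""
    gaps = required - params.keys()
    if gaps:
        # Incomplete state: surface follow-up questions instead of calling out.
        return {"status": "needs_input", "ask_about": sorted(gaps)}
    # Complete state: a real system would invoke the external API here.
    return {"status": "called", "endpoint": f"/api/{task}", "payload": params}

print(maybe_call_api("schedule_meeting",
                     {"participant": "Anna"},
                     {"participant", "time"}))
# -> {'status': 'needs_input', 'ask_about': ['time']}
```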
Technical Example: Personal AI Assistant
User Prompt:
“Schedule a meeting with Anna, summarize the Q4 email thread, and notify the team.”
Belief Graph Initialization
Entities:
- Meeting (Participants: Anna, Time: Unknown)
- Email Summary (Subject: Q4 Report, Date: Last Week)
- Notification (Recipients: Team)
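For illustration, here is the same initial graph serialized as a plain Python dict; the field names are assumptions, and `None` marks exactly the gaps the agents will ask about:

```python
# Hypothetical serialization of the initial belief graph for this request.
belief_graph = {
    "meeting": {
        "participants": ["Anna"],
        "time": None,          # unknown -> the Scheduling Agent will ask
    },
    "email_summary": {
        "subject": "Q4 Report",
        "date": "last week",
        "thread_id": None,     # unknown -> the Summarization Agent will ask
    },
    "notification": {
        "recipients": "team",
        "depends_on": ["meeting", "email_summary"],  # must run last
    },
}
```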
Multi-Agent Workflow:
1. Scheduling Agent: Clarifies the meeting time (“What time should the meeting be scheduled?”) and schedules it once the user responds.
2. Email Summarization Agent: Summarizes the correct thread after asking “Which email thread should I summarize?”
3. Notification Agent: Sends notifications after tasks are complete.
Orchestrator Logic: If an agent’s task depends on another agent’s result (e.g., summarizing emails before sending notifications), the orchestrator ensures proper sequencing based on belief graph updates.
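That sequencing is just a topological ordering over the dependency relations, which the standard library’s graphlib can compute directly; the task names below are the hypothetical ones from this example:

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9

# Map each task to the tasks it depends on, mirroring the belief graph.
dependencies = {
    "schedule_meeting": set(),
    "summarize_email": set(),
    "notify_team": {"schedule_meeting", "summarize_email"},
}

for task in TopologicalSorter(dependencies).static_order():
    print(task)
# schedule_meeting and summarize_email run first (in either order),
# then notify_team
```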
Wrapping Up
Belief-driven AI systems that proactively clarify uncertain requests can reduce user friction, enhance task efficiency, and enable complex task orchestration. The Proactive Agent Model from the DeepMind paper provides a modular blueprint that can extend well beyond text-to-image tasks.