From Proactive Agents in Text-to-Image Generation to Conversational AI: A Technical Insight

Anna Alexandra Grigoryan
3 min read · Dec 15, 2024


Current AI-powered systems often fail when prompts are ambiguous or incomplete. Google DeepMind’s paper by Hahn et al. (2024) on Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty addresses this issue using proactive question-asking agents powered by belief graphs. In this post, we dissect their approach and extend it to a broader, more technical application: Conversational AI involving multiple agents, API orchestration, and context tracking.


Summary of the Paper: Proactive Agents for T2I

Core Problem: Prompt Underspecification

Text-to-image models struggle with incomplete prompts. Descriptions like “A rabbit wearing a hat in a field” leave key details unspecified (e.g., color of the rabbit, style of the hat). Standard models guess, often generating irrelevant results.

Proposed Solution: Proactive Agents with Belief Graphs

The solution is an agent that proactively asks clarifying questions, building and updating a belief graph, which represents the agent’s internal understanding of entities, attributes, and relationships extracted from prompts.

Technical Breakdown: How It Works

1. Belief Graph Initialization: Extract entities (e.g., “rabbit,” “hat”), assign attributes (e.g., “color,” “size”), and define relations (e.g., “wearing”), assigning uncertainty scores wherever details are unclear.

2. Question Generation: Select the most uncertain elements and ask targeted questions like “What color is the rabbit?”, using an LLM for question formulation or a rule-based system for simpler prompts.

3. Prompt Expansion: Refine the belief graph as user responses come in.

4. Image Generation: Generate the final prompt from the completed belief graph and submit it to a T2I model like Imagen.

Key Insights from the Paper

  • Belief Graphs as a Modular Framework: Because the graph is decoupled from question-asking and generation, individual components can be swapped or upgraded independently.
  • Explainability and Interpretability: Transparent decision-making through graph representation.
  • Significant Improvements in Performance: T2I generation improves when users interact through multi-turn question-asking agents.

Extending the Model to Conversational AI

While the original system targets T2I generation, its principles — belief tracking, uncertainty-driven questioning, and modular task execution — apply directly to Conversational AI involving complex workflows and API calls.

Adapting Core Technical Concepts

1. Belief Graph for Context Tracking

In Conversational AI, a belief graph can potentially track:

  • Entities: Tasks, user requests, API functions.
  • Attributes: Parameters like priority, deadlines, user preferences.
  • Relations: Dependencies between tasks (e.g., “send email after generating report”).
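As a rough sketch of how those three node types might be held in a conversational belief graph (all task and parameter names here are illustrative, and `None` marks an unresolved parameter):

```python
# Tasks as entities, parameters as attributes, edges as dependencies.
belief_graph = {
    "tasks": {
        "generate_report": {"format": "pdf", "quarter": "Q4"},
        "send_email": {"recipient": None},  # unresolved parameter
    },
    # "send email after generating report"
    "dependencies": [("generate_report", "send_email")],
}

def unresolved(graph: dict) -> list[tuple[str, str]]:
    """Return (task, parameter) pairs that still need clarification."""
    return [(task, param)
            for task, params in graph["tasks"].items()
            for param, value in params.items() if value is None]
```

The `unresolved` helper is what later drives proactive question-asking: each pair it returns is a candidate clarifying question.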

2. Multi-Agent System for Task Execution

Agent-Orchestrator Model: Each task-specific agent performs atomic tasks, while an Orchestrator manages dependencies based on the belief graph.

Example Agents:

  • Scheduling Agent: Books meetings.
  • Email Summarization Agent: Summarizes email threads.
  • Data Retrieval Agent: Fetches knowledge base results.
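One way to make these agents interchangeable under a single orchestrator is a shared interface. The protocol and the `SchedulingAgent` below are hypothetical sketches, not an API from the paper:

```python
from typing import Protocol

class Agent(Protocol):
    """Minimal interface an orchestrator could expect from every agent."""
    name: str
    def missing_inputs(self, graph: dict) -> list[str]: ...
    def execute(self, graph: dict) -> dict: ...

class SchedulingAgent:
    name = "scheduler"

    def missing_inputs(self, graph: dict) -> list[str]:
        meeting = graph["tasks"]["meeting"]
        return [k for k, v in meeting.items() if v is None]

    def execute(self, graph: dict) -> dict:
        # Stand-in for a real calendar API call.
        graph["tasks"]["meeting"]["status"] = "booked"
        return graph
```

An orchestrator can then treat every agent uniformly: ask `missing_inputs` first, and only call `execute` when the list comes back empty.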

3. Proactive Question-Asking

Ask clarifying questions when uncertain:

  • “What time should the meeting be scheduled?”
  • “Which email thread should I summarize?”
  • Use dynamic queries derived from incomplete belief graph nodes.

4. State Transitions and API Orchestration

  • State Updates: After every agent action, the belief graph updates.
  • API Call Management: Agents call external APIs only when relevant data is complete. If gaps remain, the system asks follow-up questions or suggests fixes.
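The gating logic in the second bullet can be sketched as a small guard function; `book_meeting` is an illustrative stand-in for a real external API:

```python
def maybe_call_api(task: str, params: dict, api_fn) -> str:
    """Call the external API only when every parameter is resolved;
    otherwise return a follow-up question instead of a failing call."""
    missing = [k for k, v in params.items() if v is None]
    if missing:
        return f"Before I can {task}, I still need: {', '.join(missing)}"
    return api_fn(**params)

def book_meeting(participants: str, time: str) -> str:
    # Hypothetical calendar endpoint.
    return f"Meeting with {participants} booked for {time}"
```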

Technical Example: Personal AI Assistant

User Prompt:

“Schedule a meeting with Anna, summarize the Q4 email thread, and notify the team.”

Belief Graph Initialization

Entities:
- Meeting (Participants: Anna, Time: Unknown)
- Email Summary (Subject: Q4 Report, Date: Last Week)
- Notification (Recipients: Team)

Multi-Agent Workflow:

1. Scheduling Agent: Asks “What time should the meeting be scheduled?” and books the meeting once the user responds.

2. Email Summarization Agent: Summarizes the correct thread after asking “Which email thread should I summarize?”

3. Notification Agent: Sends notifications after tasks are complete.

Orchestrator Logic: If an agent’s task depends on another agent’s result (e.g., summarizing emails before sending notifications), the orchestrator ensures proper sequencing based on belief graph updates.
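In Python, this sequencing falls out of a standard topological sort over the dependency edges in the belief graph. The task names below mirror the example above; mapping each task to its prerequisites is an illustrative choice:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
deps = {
    "notify_team": {"summarize_q4_thread", "schedule_meeting"},
    "summarize_q4_thread": set(),
    "schedule_meeting": set(),
}

# Prerequisites always come before the tasks that need them.
order = list(TopologicalSorter(deps).static_order())
```

Here `notify_team` is guaranteed to run last, while the scheduling and summarization tasks may run in either order (or in parallel).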

Wrapping up

Belief-driven AI systems that proactively clarify uncertain requests can reduce user friction, enhance task efficiency, and enable complex task orchestration. The Proactive Agent Model from the DeepMind paper provides a modular blueprint that can extend well beyond text-to-image tasks.
