From Proactive Agents in Text-to-Image Generation to Conversational AI: A Technical Insight
Current AI-powered systems often fail when prompts are ambiguous or incomplete. The Google DeepMind paper “Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty” (Hahn et al., 2024) addresses this issue with proactive question-asking agents powered by belief graphs. In this post, we dissect their approach and extend it to a broader, more technical application: Conversational AI involving multiple agents, API orchestration, and context tracking.
Summary of the Paper: Proactive Agents for T2I
Core Problem: Prompt Underspecification
Text-to-image models struggle with incomplete prompts. Descriptions like “A rabbit wearing a hat in a field” leave key details unspecified (e.g., the color of the rabbit, the style of the hat). Standard models fill in these gaps on their own, often producing images that miss the user’s intent.
Proposed Solution: Proactive Agents with Belief Graphs
The solution is an agent that proactively asks clarifying questions, building and updating a belief graph, which represents the agent’s internal understanding of entities, attributes, and relationships extracted from prompts.
Technical Breakdown: How It Works
1. Belief Graph Initialization: Extract entities (e.g., “rabbit,” “hat”), assign attributes (e.g., “color,” “size”), define relations (e.g., “wearing”), and attach uncertainty scores where details are unclear (a code sketch of this step follows the list).
2. Question Generation: Select the most uncertain elements and ask targeted questions like “What color is the rabbit?”, using an LLM to formulate the questions (or a rule-based system for simpler prompts).
3. Prompt Expansion: As user responses arrive, update the belief graph and fold the new details into an expanded prompt.
4. Image Generation: Generate the final prompt from the completed belief graph and submit it to a T2I model like Imagen.
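To make steps 1 and 2 concrete, here is a minimal Python sketch of how such a belief graph might be represented and queried. The dataclasses and the `most_uncertain` selection heuristic are illustrative assumptions, not the paper’s actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Attribute:
    name: str                 # e.g., "color"
    value: str | None = None  # stays None until the user clarifies
    uncertainty: float = 1.0  # 1.0 = fully unknown, 0.0 = resolved

@dataclass
class Entity:
    name: str
    attributes: dict[str, Attribute] = field(default_factory=dict)

def init_belief_graph():
    """Initial graph for 'A rabbit wearing a hat in a field'."""
    entities = {
        "rabbit": Entity("rabbit", {"color": Attribute("color")}),
        "hat": Entity("hat", {"style": Attribute("style")}),
        "field": Entity("field"),
    }
    relations = [("rabbit", "wearing", "hat"), ("rabbit", "in", "field")]
    return entities, relations

def most_uncertain(entities):
    """Pick the unresolved (entity, attribute) pair with the highest uncertainty."""
    candidates = [
        (attr.uncertainty, entity.name, attr.name)
        for entity in entities.values()
        for attr in entity.attributes.values()
        if attr.value is None
    ]
    if not candidates:
        return None
    _, entity_name, attr_name = max(candidates)
    return entity_name, attr_name

entities, relations = init_belief_graph()
if (target := most_uncertain(entities)):
    entity_name, attr_name = target
    print(f"What {attr_name} is the {entity_name}?")  # "What color is the rabbit?"
```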
Key Insights from the Paper
- Belief Graphs as a Modular Framework: Because the graph decouples state tracking from the underlying models, individual components (the question generator, the T2I backbone) can be upgraded without redesigning the system.
- Explainability and Interpretability: Transparent decision-making through graph representation.
- Significant Improvements in Performance: T2I generation improves when users interact through multi-turn question-asking agents.
Extending the Model to Conversational AI
While the original system targets T2I generation, its principles — belief tracking, uncertainty-driven questioning, and modular task execution — apply directly to Conversational AI involving complex workflows and API calls.
Adapting Core Technical Concepts
1. Belief Graph for Context Tracking
In Conversational AI, a belief graph can potentially track (see the node sketch after this list):
- Entities: Tasks, user requests, API functions.
- Attributes: Parameters like priority, deadlines, user preferences.
- Relations: Dependencies between tasks (e.g., “send email after generating report”).
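As a sketch of what one node in such a graph might look like, the following `TaskNode` schema is a hypothetical design, assuming resolved parameters, unresolved gaps, and dependencies are all first-class fields:

```python
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    """One node in a conversational belief graph (illustrative schema)."""
    name: str
    params: dict = field(default_factory=dict)      # resolved parameters
    missing: set = field(default_factory=set)       # parameters still uncertain
    depends_on: list = field(default_factory=list)  # relations to other tasks

    @property
    def ready(self) -> bool:
        """A task is executable only once nothing is missing."""
        return not self.missing

# "Send email after generating report": the dependency is an explicit relation.
graph = {
    "generate_report": TaskNode("generate_report", params={"period": "Q4"}),
    "send_email": TaskNode(
        "send_email",
        params={"recipients": "team"},
        missing={"body"},                # the body comes from the report
        depends_on=["generate_report"],
    ),
}
print(graph["send_email"].ready)  # False -> the orchestrator must wait
```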
2. Multi-Agent System for Task Execution
Agent-Orchestrator Model: Each task-specific agent performs an atomic task, while an Orchestrator manages dependencies based on the belief graph (a minimal dispatch sketch follows the example agents below).
Example Agents:
- Scheduling Agent: Books meetings.
- Email Summarization Agent: Summarizes email threads.
- Data Retrieval Agent: Fetches knowledge base results.
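A minimal sketch of the dispatch pattern, assuming a hypothetical `Agent` contract and stubbed-out agents in place of real calendar and email APIs:

```python
from typing import Protocol

class Agent(Protocol):
    """Minimal contract every task-specific agent satisfies (assumed interface)."""
    name: str
    def run(self, params: dict) -> dict: ...

class SchedulingAgent:
    name = "scheduling"
    def run(self, params: dict) -> dict:
        # A real agent would call a calendar API here; we return a stub result.
        return {"meeting": f"{params['participant']} at {params['time']}"}

class EmailSummarizationAgent:
    name = "email_summarization"
    def run(self, params: dict) -> dict:
        # Stub: a real agent would fetch and summarize the named thread.
        return {"summary": f"Summary of thread '{params['thread']}'"}

class Orchestrator:
    """Routes atomic tasks to agents; sequencing comes from the belief graph."""
    def __init__(self, agents):
        self.agents = {agent.name: agent for agent in agents}

    def dispatch(self, agent_name: str, params: dict) -> dict:
        result = self.agents[agent_name].run(params)
        # In a full system, the result would be written back to the belief graph.
        return result

orc = Orchestrator([SchedulingAgent(), EmailSummarizationAgent()])
print(orc.dispatch("scheduling", {"participant": "Anna", "time": "15:00"}))
```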
3. Proactive Question-Asking
Ask clarifying questions when the belief graph is uncertain (a template-based sketch follows this list):
- “What time should the meeting be scheduled?”
- “Which email thread should I summarize?”
- Use dynamic queries derived from incomplete belief graph nodes.
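One plausible way to derive those queries, assuming a hypothetical template table with a generic fallback (a production system might prompt an LLM to phrase the question instead):

```python
# Hypothetical question templates keyed by (task, parameter).
QUESTION_TEMPLATES = {
    ("schedule_meeting", "time"): "What time should the meeting be scheduled?",
    ("summarize_email", "thread"): "Which email thread should I summarize?",
}

def next_question(task: str, missing_params: set) -> str | None:
    """Turn the first unresolved belief-graph field into a clarifying question."""
    for param in sorted(missing_params):
        # Prefer a curated template; otherwise fall back to generic phrasing.
        return QUESTION_TEMPLATES.get(
            (task, param), f"Could you specify the {param} for '{task}'?"
        )
    return None  # nothing missing: no question needed

print(next_question("schedule_meeting", {"time"}))
# -> "What time should the meeting be scheduled?"
```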
4. State Transitions and API Orchestration
- State Updates: After every agent action, the belief graph updates.
- API Call Management: Agents call external APIs only when relevant data is complete. If gaps remain, the system asks follow-up questions or suggests fixes.
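A sketch of this gating logic, with a hypothetical `maybe_call_api` helper standing in for a real endpoint call:

```python
def maybe_call_api(task: str, params: dict, required: set) -> dict:
    """Gate the external call on belief-graph completeness (illustrative)."""
    gaps = required - params.keys()
    if gaps:
        # Incomplete state: surface follow-up questions instead of calling out.
        return {"status": "needs_input", "ask_about": sorted(gaps)}
    # Complete state: a real system would invoke the external API here.
    return {"status": "called", "endpoint": f"/api/{task}", "payload": params}

print(maybe_call_api("schedule_meeting",
                     {"participant": "Anna"},
                     {"participant", "time"}))
# -> {'status': 'needs_input', 'ask_about': ['time']}
```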
Technical Example: Personal AI Assistant
User Prompt:
“Schedule a meeting with Anna, summarize the Q4 email thread, and notify the team.”
Belief Graph Initialization
Entities:
- Meeting (Participants: Anna, Time: Unknown)
- Email Summary (Subject: Q4 Report, Date: Last Week)
- Notification (Recipients: Team)
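For illustration, here is the same initial graph serialized as a plain Python dict; the field names are assumptions, and `None` marks exactly the gaps the agents will ask about:

```python
# Hypothetical serialization of the initial belief graph for this request.
belief_graph = {
    "meeting": {
        "participants": ["Anna"],
        "time": None,          # unknown -> the Scheduling Agent will ask
    },
    "email_summary": {
        "subject": "Q4 Report",
        "date": "last week",
        "thread_id": None,     # unknown -> the Summarization Agent will ask
    },
    "notification": {
        "recipients": "team",
        "depends_on": ["meeting", "email_summary"],  # must run last
    },
}
```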
Multi-Agent Workflow:
1. Scheduling Agent: Clarifies the meeting time (“What time should the meeting be scheduled?”) and schedules it once the user responds.
2. Email Summarization Agent: Summarizes the correct thread after asking “Which email thread should I summarize?”
3. Notification Agent: Sends notifications after tasks are complete.
Orchestrator Logic: If an agent’s task depends on another agent’s result (e.g., summarizing emails before sending notifications), the orchestrator ensures proper sequencing based on belief graph updates.
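That sequencing is just a topological ordering over the dependency relations, which the standard library’s graphlib can compute directly; the task names below are the hypothetical ones from this example:

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9

# Map each task to the tasks it depends on, mirroring the belief graph.
dependencies = {
    "schedule_meeting": set(),
    "summarize_email": set(),
    "notify_team": {"schedule_meeting", "summarize_email"},
}

for task in TopologicalSorter(dependencies).static_order():
    print(task)
# schedule_meeting and summarize_email run first (in either order),
# then notify_team
```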
Wrapping Up
Belief-driven AI systems that proactively clarify uncertain requests can reduce user friction, enhance task efficiency, and enable complex task orchestration. The Proactive Agent Model from the DeepMind paper provides a modular blueprint that can extend well beyond text-to-image tasks.