app.ai()

Universal LLM interface with multimodal support and intelligent type detection

Universal interface to Large Language Models with automatic multimodal detection, structured output validation, and hierarchical configuration. Handles text, images, audio, and files with intelligent response wrapping.

Return Type Behavior:

  • With schema parameter: returns a validated Pydantic model instance
  • Without schema parameter: returns a MultimodalResponse object (backward compatible as a string)

Basic Example

from agentfield import Agent

app = Agent(node_id="assistant")

# Simple text-to-text
response = await app.ai("What is the capital of France?")
print(response)  # "The capital of France is Paris."

# System + user pattern
response = await app.ai(
    system="You are a geography expert.",
    user="What is the capital of France?"
)

Parameters

  • Positional arguments: strings, URLs, local file paths, raw bytes, or explicit Text/Image/Audio objects (types are auto-detected)
  • system: system prompt
  • user: user message
  • schema: Pydantic model class; when provided, the response is validated and returned as a typed instance
  • context: dict of additional context (e.g., manually retrieved memory)
  • memory_scope: defined in the SDK but not yet wired to automatic memory injection (see Memory Integration below)
  • stream: when True, returns an async generator of response chunks
  • model, temperature, max_tokens, and any other LiteLLM parameters: accepted as **kwargs and forwarded to LiteLLM

Parameter Handling: The SDK passes all **kwargs directly to LiteLLM without hard-coding parameters. Provider-specific transformations (e.g., OpenAI's max_tokens → max_completion_tokens) are handled in litellm_adapters.py.
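
For example, generation parameters are written once at the call site and forwarded as kwargs (a minimal sketch using parameters shown elsewhere on this page):

# max_tokens is passed as-is; for providers that expect
# max_completion_tokens, the rename happens in litellm_adapters.py
response = await app.ai(
    "Summarize the report in three bullet points.",
    model="openai/gpt-4o-mini",
    max_tokens=300,
    temperature=0.2
)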

Common Patterns

Structured Output with Pydantic

Enforce type-safe, validated responses using Pydantic models.

from pydantic import BaseModel

class SentimentAnalysis(BaseModel):
    sentiment: str  # positive, negative, neutral
    confidence: float  # 0.0 to 1.0
    keywords: list[str]
    reasoning: str

@app.reasoner
async def analyze_sentiment(text: str) -> SentimentAnalysis:
    """Returns validated Pydantic object, not raw text."""
    return await app.ai(
        system="Analyze sentiment systematically.",
        user=text,
        schema=SentimentAnalysis  # Automatic validation
    )

# Usage
result = await analyze_sentiment("I love this product!")
print(result.sentiment)  # "positive"
print(result.confidence)  # 0.95
print(result.keywords)  # ["love", "product"]

The schema parameter automatically augments the system prompt with strict schema adherence instructions. The SDK validates the LLM response and returns a typed Pydantic instance.

Multimodal Input - Automatic Detection

Agentfield automatically detects and processes images, audio, and files from URLs or local paths.

# Image from URL - automatically detected
response = await app.ai(
    "Describe this image in detail.",
    "https://example.com/product-photo.jpg"
)

# Local image file - automatically converted to base64
response = await app.ai(
    "What's in this screenshot?",
    "./screenshots/error-message.png"
)

# Audio file - automatically processed
response = await app.ai(
    "Transcribe this audio.",
    "./recordings/meeting-notes.mp3"
)

# Mix multiple types
response = await app.ai(
    "Compare the audio description with the visual content.",
    "./product-review.wav",
    "https://example.com/product-image.jpg",
    "Additional context: Premium product line."
)

Automatic detection handles: image URLs, local image files (jpg, png, gif, webp), audio files (wav, mp3, flac, ogg), base64 data URLs, and raw bytes. No manual type specification needed.
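
Base64 data URLs and raw bytes go through the same detection path (a brief sketch; the truncated data URL is a placeholder):

# Base64 data URL - detected without any type hints
response = await app.ai(
    "What does this image show?",
    "data:image/png;base64,iVBORw0KGg..."  # placeholder, truncated
)

# Raw bytes are also accepted
with open("./screenshots/error-message.png", "rb") as f:
    image_bytes = f.read()

response = await app.ai(
    "What does this screenshot show?",
    image_bytes
)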

Multimodal Input - Explicit Control

Use input classes for precise control over multimodal content.

from agentfield import Text, Image, Audio
from agentfield import image_from_url, image_from_file, audio_from_file

# Explicit multimodal composition
response = await app.ai(
    Text(text="Describe this chart and the audio commentary"),
    image_from_url("https://example.com/sales-chart.png"),
    audio_from_file("./presenter-notes.wav")
)

# Convenience functions
response = await app.ai(
    "Analyze the presentation",
    image_from_file("./slide1.png"),
    image_from_file("./slide2.png"),
    audio_from_file("./narration.mp3")
)

Multimodal Response Handling

When called without a schema parameter, app.ai() returns a MultimodalResponse object that works as a string but provides rich multimodal access.

This section applies only when no schema parameter is provided. When using schema, the method returns a validated Pydantic model instance instead.

# Backward compatible - works as string
response = await app.ai("Generate a greeting with audio")
print(response)  # Prints text content
print(str(response))  # Explicit string conversion

# Access multimodal content
if response.has_audio:
    response.audio.save("greeting.wav")
    response.audio.play()  # Requires pygame

if response.has_images:
    for i, image in enumerate(response.images):
        image.save(f"generated_{i}.png")
        image.show()  # Requires PIL

# Check content types
print(f"Has audio: {response.has_audio}")
print(f"Has images: {response.has_images}")
print(f"Is multimodal: {response.is_multimodal}")

# Save all content at once
saved_files = response.save_all("./output", prefix="ai_response")
# Returns: {"text": "path/to/text.txt", "audio": "path/to/audio.wav", ...}

Configuration Overrides

Override agent defaults on a per-call basis using hierarchical configuration.

Why Override Configurations?

  • Cost Optimization: use cheaper models (gpt-4o-mini) for simple tasks, expensive models (gpt-4o) only when needed
  • Task-Specific Performance: different models excel at different tasks (reasoning vs. speed vs. multimodal)
  • Quality Control: adjust temperature for deterministic outputs (0.0) vs. creative generation (1.2+)
  • Token Management: set appropriate max_tokens based on expected response length

from agentfield import Agent, AIConfig

# Agent defaults for most operations
app = Agent(
    node_id="writer",
    ai_config=AIConfig(
        model="openai/gpt-4o-mini",  # Cost-effective default
        temperature=0.7,
        max_tokens=1000
    )
)

# Override for creative writing (better quality, higher cost)
creative_story = await app.ai(
    "Write a sci-fi short story.",
    model="openai/gpt-4o",  # Better model for creative tasks
    temperature=1.2,  # More creative
    max_tokens=2000  # Longer output
)

# Override for precise analysis (deterministic, lower cost)
analysis = await app.ai(
    "Analyze this data for errors.",
    temperature=0.0,  # Deterministic
    max_tokens=500  # Concise
)

# Provider-specific parameters
response = await app.ai(
    "Generate diverse ideas.",
    top_p=0.9,  # LiteLLM parameter
    frequency_penalty=0.5,  # OpenAI parameter
    presence_penalty=0.3  # OpenAI parameter
)

Configuration hierarchy: Agent defaults → Method parameters → Runtime kwargs. Later values override earlier ones. All LiteLLM parameters are supported via **kwargs.

Reasoner Pattern with Model Selection

Pass model configuration through reasoners for flexible AI routing.

from pydantic import BaseModel

class Analysis(BaseModel):
    summary: str
    complexity: str
    recommendations: list[str]

@app.reasoner
async def analyze_document(
    document: str,
    model: str = "openai/gpt-4o-mini"  # Accept model as parameter
) -> Analysis:
    """Analyze document with configurable model selection."""

    # Route to appropriate model based on task complexity
    return await app.ai(
        system="You are a document analyzer.",
        user=f"Analyze: {document}",
        schema=Analysis,
        model=model  # Pass through to app.ai()
    )

# Use cheap model for simple documents
simple_analysis = await analyze_document(
    "Short memo about meeting",
    model="openai/gpt-4o-mini"
)

# Use powerful model for complex documents
complex_analysis = await analyze_document(
    "50-page technical specification",
    model="openai/gpt-4o"
)

This pattern enables cost-aware AI routing: use cheaper models by default, upgrade to powerful models only when complexity demands it. Particularly useful for multi-step reasoners where different steps have different requirements.
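
A minimal sketch of a multi-step reasoner built on that idea (the reasoner name and step prompts below are illustrative, not part of the SDK):

@app.reasoner
async def summarize_then_review(document: str) -> str:
    """Two-step reasoner: cheap model for the draft, stronger model for review."""
    # Step 1: fast, low-cost first pass
    draft = await app.ai(
        system="Summarize the document in five bullet points.",
        user=document,
        model="openai/gpt-4o-mini"
    )

    # Step 2: only the review step pays for the larger model
    review = await app.ai(
        system="Improve the summary for accuracy and completeness.",
        user=f"Document:\n{document}\n\nDraft summary:\n{draft}",
        model="openai/gpt-4o",
        temperature=0.0
    )
    return str(review)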

Image Generation

# Works with both DALL-E and OpenRouter
result = await app.ai_with_vision(
    "A beautiful sunset over mountains",
    model="dall-e-3"  # or "openrouter/google/gemini-2.5-flash-image-preview"
)
result.images[0].save("sunset.png")

See full example: examples/python_agent_nodes/image_generation_hello_world/

Audio Generation (Beta)

Generate audio responses using specialized audio models. Supports schema parameter via **kwargs for structured output.

# Basic audio generation (returns MultimodalResponse)
audio_result = await app.ai_with_audio("Say hello warmly")
audio_result.audio.save("greeting.wav")
audio_result.audio.play()

# With structured output (returns Pydantic model)
from pydantic import BaseModel

class Greeting(BaseModel):
    text: str
    tone: str

greeting = await app.ai_with_audio(
    "Say hello warmly",
    schema=Greeting  # Returns Greeting instance, not MultimodalResponse
)
print(greeting.text)  # Access as Pydantic model

# Customize voice and format
audio_result = await app.ai_with_audio(
    "Explain quantum computing in simple terms",
    voice="nova",  # alloy, echo, fable, onyx, nova, shimmer
    format="mp3",  # wav, mp3
    model="openai/gpt-4o-audio-preview"
)

# Access both text and audio
print(audio_result.text)  # Text version
audio_result.audio.save("explanation.mp3")

# OpenAI direct mode with instructions (bypasses LiteLLM)
audio_result = await app.ai_with_audio(
    "Provide a warm, professional greeting",
    voice="alloy",
    format="wav",
    model="openai/gpt-4o-mini-tts",
    mode="openai_direct",
    instructions="Speak slowly and clearly with enthusiasm",
    speed=0.9
)

Image Generation (Beta)

Generate images using DALL-E or other image models. Always returns MultimodalResponse with images.

LiteLLM Dependency: Image generation capabilities are determined by LiteLLM's supported providers. Agentfield passes requests directly to LiteLLM's aimage_generation() API. Available models, sizes, and features depend on what LiteLLM supports for your configured provider.

ai_with_vision() does not support the schema parameter - it's for image generation, not text completion.

# Basic image generation (always returns MultimodalResponse)
image_result = await app.ai_with_vision("A sunset over mountains")
image_result.images[0].save("sunset.png")
image_result.images[0].show()

# Customize image parameters
image_result = await app.ai_with_vision(
    "A futuristic cityscape with flying cars",
    size="1792x1024",  # 256x256, 512x512, 1024x1024, 1792x1024, 1024x1792
    quality="hd",  # standard, hd
    style="vivid",  # vivid, natural (DALL-E 3 only)
    model="openai/dall-e-3"
)

# Access generated image
image = image_result.images[0]
image.save("cityscape.png")
print(image.revised_prompt)  # See how DALL-E interpreted the prompt

Explicit Multimodal Control

Request specific output modalities for complex workflows. Supports schema parameter via **kwargs for structured output.

# Request text + audio output (returns MultimodalResponse)
result = await app.ai_with_multimodal(
    "Describe this image and provide audio narration",
    image_from_url("https://example.com/chart.jpg"),
    modalities=["text", "audio"],
    audio_config={"voice": "nova", "format": "wav"}
)

# Access all outputs
print(result.text)  # Text description
result.audio.save("narration.wav")  # Audio narration

# With structured output (returns Pydantic model)
from pydantic import BaseModel

class ImageAnalysis(BaseModel):
    description: str
    key_elements: list[str]

analysis = await app.ai_with_multimodal(
    "Analyze this chart",
    image_from_url("https://example.com/chart.jpg"),
    schema=ImageAnalysis  # Returns ImageAnalysis instance
)
print(analysis.description)  # Access as Pydantic model

# Complex multimodal workflow
result = await app.ai_with_multimodal(
    Text(text="Create a presentation summary"),
    image_from_file("./slide1.png"),
    image_from_file("./slide2.png"),
    audio_from_file("./presenter-audio.wav"),
    modalities=["text", "audio"],
    audio_config={"voice": "alloy", "format": "mp3"},
    model="openai/gpt-4o-audio-preview"
)

Streaming Responses

Enable streaming for real-time output processing. Returns async generator instead of complete response.

# Enable streaming
stream = await app.ai(
    "Write a long essay about AI.",
    stream=True,
    max_tokens=2000
)

# Process chunks as they arrive
async for chunk in stream:
    if hasattr(chunk.choices[0].delta, 'content'):
        content = chunk.choices[0].delta.content
        if content:
            print(content, end='', flush=True)

When stream=True, app.ai() returns an async generator that yields response chunks as they arrive from the LLM. This enables real-time display and reduces perceived latency for long responses.
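
To keep the full text while streaming it to the console, accumulate the chunks as they arrive (a sketch reusing the chunk format shown above):

# Stream to the console and keep the complete text for later use
stream = await app.ai("Write a long essay about AI.", stream=True)

full_text = []
async for chunk in stream:
    content = getattr(chunk.choices[0].delta, "content", None)
    if content:
        print(content, end="", flush=True)
        full_text.append(content)

essay = "".join(full_text)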

Error Handling and Fallbacks

Agentfield automatically handles rate limits and provides fallback models.

# Configure fallback models in AIConfig
app = Agent(
    node_id="resilient-agent",
    ai_config=AIConfig(
        model="openai/gpt-4o",
        fallback_models=[
            "openai/gpt-4o-mini",
            "anthropic/claude-3-haiku"
        ],
        enable_rate_limit_retry=True,  # Automatic retry with exponential backoff
        rate_limit_max_retries=3,
        rate_limit_base_delay=1.0,
        rate_limit_max_delay=60.0
    )
)

# Automatic fallback on failure
try:
    response = await app.ai("Analyze this complex data")
    # If gpt-4o fails, automatically tries gpt-4o-mini, then claude-3-haiku
except Exception as e:
    print(f"All models failed: {e}")

# Manual error handling
try:
    response = await app.ai(
        "Generate analysis",
        model="openai/gpt-4o"
    )
except Exception as e:
    # Fallback to simpler model
    response = await app.ai(
        "Generate analysis",
        model="openai/gpt-4o-mini",
        temperature=0.0  # More deterministic for reliability
    )

Rate-limit retry is enabled by default. Agentfield automatically retries with exponential backoff (1s → 2s → 4s → ...) up to max_delay. A circuit breaker prevents cascading failures.
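
As a rough illustration of those settings (this formula mirrors the described behavior, not the SDK's internal code), the delay before each retry doubles from the base until it reaches the cap:

# With rate_limit_base_delay=1.0, rate_limit_max_retries=3, rate_limit_max_delay=60.0
base_delay, max_delay = 1.0, 60.0
delays = [min(base_delay * 2 ** attempt, max_delay) for attempt in range(3)]
print(delays)  # [1.0, 2.0, 4.0]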

Response Object

The MultimodalResponse object provides comprehensive access to all response content. This is returned when no schema parameter is provided to app.ai(), ai_with_audio(), or ai_with_multimodal().

When a schema parameter is provided, these methods return a validated Pydantic model instance instead of MultimodalResponse.

Properties

  • text (str): the text content of the response
  • audio (AudioOutput or None): generated audio, when present
  • images (list[ImageOutput]): generated images
  • has_audio (bool): whether audio content is present
  • has_images (bool): whether image content is present
  • is_multimodal (bool): whether the response contains more than text

Methods

  • save_all(directory, prefix=...): save all content to the given directory; returns a dict mapping content type to saved file path
  • str(response): string conversion returns the text content (backward compatible)

AudioOutput Methods

  • save(path): write the audio to a file
  • play(): play the audio locally (requires pygame)

ImageOutput Methods

  • save(path): write the image to a file
  • show(): display the image (requires PIL)
  • revised_prompt (property): the prompt as interpreted by the image model

Specialized Methods

ai_with_audio()

Optimized for audio generation with automatic model selection.

  • voice: alloy, echo, fable, onyx, nova, or shimmer
  • format: wav or mp3
  • model: audio-capable model (e.g., openai/gpt-4o-audio-preview, openai/gpt-4o-mini-tts)
  • mode: "openai_direct" to call OpenAI directly and bypass LiteLLM
  • instructions, speed: additional options in openai_direct mode
  • schema: Pydantic model class for structured output (via **kwargs)

Returns: MultimodalResponse (without schema) or Pydantic model instance (with schema)
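
Example (same pattern as the Audio Generation section above):

result = await app.ai_with_audio(
    "Say hello warmly",
    voice="nova",
    format="mp3"
)
print(result.text)
result.audio.save("hello.mp3")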

ai_with_vision()

Generate images with LiteLLM or OpenRouter. Routes automatically based on model name.

  • size: e.g., 256x256, 512x512, 1024x1024, 1792x1024, 1024x1792
  • quality: standard or hd
  • style: vivid or natural (DALL-E 3 only)
  • model: e.g., openai/dall-e-3 or openrouter/google/gemini-2.5-flash-image-preview
  • image_config: provider-specific options such as aspect_ratio (OpenRouter)
  • response_format: e.g., b64_json for base64 image data

Returns: MultimodalResponse with generated images

Examples:

# DALL-E (LiteLLM)
result = await app.ai_with_vision("A sunset over mountains")
result.images[0].save("output.png")

# OpenRouter (Gemini)
result = await app.ai_with_vision(
    "A futuristic city",
    model="openrouter/google/gemini-2.5-flash-image-preview",
    image_config={"aspect_ratio": "16:9"}
)

# Base64 data
result = await app.ai_with_vision("A landscape", response_format="b64_json")

ai_with_multimodal()

Explicit control over input and output modalities.

  • modalities: requested output modalities, e.g., ["text", "audio"]
  • audio_config: audio options such as {"voice": "nova", "format": "wav"}
  • model: multimodal-capable model (e.g., openai/gpt-4o-audio-preview)
  • schema: Pydantic model class for structured output (via **kwargs)

Returns: MultimodalResponse (without schema) or Pydantic model instance (with schema)
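
Example (same pattern as the Explicit Multimodal Control section above):

result = await app.ai_with_multimodal(
    "Describe this chart and narrate the description",
    image_from_url("https://example.com/chart.jpg"),
    modalities=["text", "audio"],
    audio_config={"voice": "nova", "format": "wav"}
)
print(result.text)
result.audio.save("narration.wav")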

Advanced Features

Automatic Prompt Trimming

Agentfield automatically trims prompts to fit model context windows using token-aware trimming.

# Long conversation - automatically trimmed to fit context window
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # ... hundreds of messages ...
]

response = await app.ai(messages)
# Agentfield uses LiteLLM's trim_messages() to keep within token limits
# Preserves system message and recent context

Trimming uses LiteLLM's token counter for accuracy, preserves system messages, and trims from the middle using a "middle-out" strategy. It is configurable via max_input_tokens in AIConfig.
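
For example, to cap input size explicitly (a sketch; max_input_tokens is the AIConfig field referenced above, and the limit shown is arbitrary):

app = Agent(
    node_id="long-context-assistant",
    ai_config=AIConfig(
        model="openai/gpt-4o-mini",
        max_input_tokens=8000  # prompts beyond this are trimmed middle-out
    )
)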

Memory Integration (Coming Soon)

Development Status: The memory_scope parameter is defined in the SDK but automatic memory injection is not yet implemented. Currently, memory must be manually retrieved and passed via the context parameter.

# Current approach - manual memory retrieval
@app.reasoner
async def context_aware_chat(user_message: str, user_id: str) -> str:
    # Manually retrieve memory
    history = await app.memory.get(f"user_{user_id}_history", default=[])

    # Pass as context
    response = await app.ai(
        system="You are a helpful assistant.",
        user=user_message,
        context={"history": history}  # Manual injection
    )

    return response

Automatic injection via memory_scope parameter is planned for a future release.

Best Practices

Use Pydantic Schemas for Reliability

Structured output is more reliable than parsing free-form text.

# ❌ Unreliable - parsing free text
response = await app.ai("Return JSON with sentiment and confidence")
data = json.loads(response)  # May fail if LLM doesn't format correctly

# ✅ Reliable - enforced schema
class Analysis(BaseModel):
    sentiment: str
    confidence: float

response = await app.ai("Analyze sentiment", schema=Analysis)
# Guaranteed to be valid Analysis object

Schema Complexity vs Model Capability: Complex nested schemas may fail with smaller models. If using gpt-4o-mini or similar, keep schemas simple (2-3 levels deep max). For complex schemas with deep nesting, use more capable models like gpt-4o or claude-3-opus.

from typing import Any

# ❌ Too complex for mini models
class ComplexSchema(BaseModel):
    level1: dict[str, dict[str, list[dict[str, Any]]]]

# ✅ Simple schema works reliably
class SimpleSchema(BaseModel):
    category: str
    items: list[str]

Handle Multimodal Content Safely

Always check for content availability before accessing.

response = await app.ai_with_audio("Generate greeting")

# ✅ Safe access
if response.has_audio:
    response.audio.save("greeting.wav")
else:
    print("No audio generated, using text:", response.text)

# ❌ Unsafe - may raise AttributeError
response.audio.save("greeting.wav")  # Fails if audio is None

Use Appropriate Models for Tasks

Different models excel at different tasks.

# Fast, cheap tasks - use mini models
quick_summary = await app.ai(
    "Summarize in one sentence: ...",
    model="openai/gpt-4o-mini"
)

# Complex reasoning - use full models
deep_analysis = await app.ai(
    "Analyze this complex dataset: ...",
    model="openai/gpt-4o"
)

# Multimodal tasks - use specialized models
audio_response = await app.ai_with_audio(
    "Explain this concept",
    model="openai/gpt-4o-audio-preview"
)

Leverage Configuration Hierarchy

Set sensible defaults, override when needed.

# Agent-level defaults for most operations
app = Agent(
    node_id="assistant",
    ai_config=AIConfig(
        model="openai/gpt-4o-mini",  # Fast, cheap default
        temperature=0.7,
        max_tokens=1000
    )
)

# Override for specific needs
creative_output = await app.ai(
    "Write a poem",
    model="openai/gpt-4o",  # Better model
    temperature=1.2  # More creative
)

Performance Considerations

Token Counting Overhead:

  • Prompt trimming uses LiteLLM's token counter for accuracy
  • Only triggered when prompt exceeds model's context window
  • Minimal overhead for typical prompts (< 5ms)
  • Uses "middle-out" trimming strategy to preserve context

Rate Limiting:

  • Enabled by default with exponential backoff
  • Adds ~0ms overhead when no rate limits are hit
  • Prevents cascading failures in production

Multimodal Processing:

  • Image/audio conversion to base64 adds ~50-200ms
  • Cached after first conversion
  • Use explicit classes to skip auto-detection

Fallback Models:

  • No overhead if primary model succeeds
  • Automatic retry adds ~1-5s on failure
  • Configure fallback_models in AIConfig