Monitoring & Observability

Production observability for MoFA applications — Prometheus metrics, OpenTelemetry distributed tracing, and structured logging.


Architecture Overview

graph TD
    A[LLMAgent.chat_with_session] -->|creates Internal span| B[span: llm.agent.chat\nagent.id  session.id]
    B --> C[ChatSession.send]
    C -->|creates Client span| D[span: gen_ai.chat_completion\ngen_ai.system  gen_ai.request.model]
    D --> E[LLM Provider API]
    E -->|response + token usage| D
    B -->|exported via global tracer| F[OTel Exporter]
    D -->|exported via global tracer| F
    F --> G[Jaeger / OTLP Collector]

    A -->|token counts| H[MetricsCollector]
    H --> I[PrometheusExporter]
    I --> J[GET /metrics]
    J --> K[Prometheus / Grafana]

Part 1 — Prometheus Metrics

Quickstart

Start the monitoring dashboard and scrape /metrics:

# Run the monitoring dashboard
cargo run -p mofa-monitoring

# Scrape metrics
curl -s http://localhost:9090/metrics | grep mofa_llm

Expected output for LLM metrics:

# HELP mofa_llm_requests_total Total LLM requests
# TYPE mofa_llm_requests_total counter
mofa_llm_requests_total{provider="openai",model="gpt-4o"} 142

# HELP mofa_llm_input_tokens_total Cumulative prompt tokens sent to the LLM provider
# TYPE mofa_llm_input_tokens_total counter
mofa_llm_input_tokens_total{provider="openai",model="gpt-4o"} 89340

# HELP mofa_llm_output_tokens_total Cumulative completion tokens received from the LLM provider
# TYPE mofa_llm_output_tokens_total counter
mofa_llm_output_tokens_total{provider="openai",model="gpt-4o"} 21056

# HELP mofa_llm_latency_seconds Average LLM request latency in seconds
# TYPE mofa_llm_latency_seconds gauge
mofa_llm_latency_seconds{provider="openai",model="gpt-4o"} 1.24

# HELP mofa_llm_time_to_first_token_seconds Time to first token for streaming requests in seconds
# TYPE mofa_llm_time_to_first_token_seconds gauge
mofa_llm_time_to_first_token_seconds{provider="openai",model="gpt-4o"} 0.31

Full LLM Metrics Reference

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| mofa_llm_requests_total | counter | provider, model | Total requests sent |
| mofa_llm_errors_total | counter | provider, model | Failed requests |
| mofa_llm_input_tokens_total | counter | provider, model | Cumulative prompt tokens |
| mofa_llm_output_tokens_total | counter | provider, model | Cumulative completion tokens |
| mofa_llm_tokens_per_second | gauge | provider, model | Generation throughput |
| mofa_llm_latency_seconds | gauge | provider, model | Average request latency |
| mofa_llm_time_to_first_token_seconds | gauge | provider, model | Streaming time to first token (TTFT) |
| mofa_llm_request_duration_seconds | histogram | provider, model | Latency distribution |
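
If you expose custom metrics from your own service next to MoFA's, the same families can be reproduced with the prometheus crate. A minimal sketch (this is not MoFA's internal exporter; the metric simply mirrors the table above):

use prometheus::{Encoder, IntCounterVec, Opts, Registry, TextEncoder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let registry = Registry::new();

    // Counter family with the same name and labels as MoFA's exporter
    let input_tokens = IntCounterVec::new(
        Opts::new(
            "mofa_llm_input_tokens_total",
            "Cumulative prompt tokens sent to the LLM provider",
        ),
        &["provider", "model"],
    )?;
    registry.register(Box::new(input_tokens.clone()))?;

    // Record usage after a completed request
    input_tokens
        .with_label_values(&["openai", "gpt-4o"])
        .inc_by(312);

    // Render the text exposition format served at GET /metrics
    let mut buf = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buf)?;
    println!("{}", String::from_utf8(buf)?);
    Ok(())
}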

Prometheus Scrape Config

Add to your prometheus.yml:

scrape_configs:
  - job_name: mofa
    static_configs:
      - targets: ["localhost:9090"]
    scrape_interval: 15s

Practical PromQL Queries

Cost estimation — tokens consumed per model per minute (multiply by your provider's per-token price to estimate spend):

rate(mofa_llm_input_tokens_total[1m]) + rate(mofa_llm_output_tokens_total[1m])

Input/output token ratio (measures response verbosity):

rate(mofa_llm_output_tokens_total[5m])
  /
rate(mofa_llm_input_tokens_total[5m])

Error rate per model:

rate(mofa_llm_errors_total[5m])
  /
rate(mofa_llm_requests_total[5m])

P95 latency across all models (aggregate the histogram buckets before taking the quantile):

histogram_quantile(0.95, sum by (le) (rate(mofa_llm_request_duration_seconds_bucket[5m])))

Streaming time-to-first-token (the gauge is labelled by provider and model):

mofa_llm_time_to_first_token_seconds

Grafana Dashboard

Import the following panel definitions into Grafana to visualise LLM health at a glance.

graph LR
    subgraph Grafana Dashboard
        P1[Token Usage\nrate input+output /1m]
        P2[Error Rate\nerrors/requests /5m]
        P3[Latency P95\nhistogram_quantile 0.95]
        P4[TTFT\ntime_to_first_token_seconds]
        P5[Throughput\ntokens_per_second]
    end
    J[GET /metrics] --> P1 & P2 & P3 & P4 & P5

Suggested alert rules (place them in a Prometheus rules file):

# Alert when error rate exceeds 5% for 2 minutes
- alert: MoFALLMHighErrorRate
  expr: |
    rate(mofa_llm_errors_total[2m])
    / rate(mofa_llm_requests_total[2m]) > 0.05
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "LLM error rate above 5% on {{ $labels.model }}"

# Alert when input token rate spikes (cost runaway)
- alert: MoFATokenBudgetSpike
  expr: rate(mofa_llm_input_tokens_total[5m]) > 10000
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "High token consumption on {{ $labels.provider }}/{{ $labels.model }}"

Part 2 — OpenTelemetry Distributed Tracing

How It Works

sequenceDiagram
    participant App
    participant LLMAgent
    participant ChatSession
    participant Provider as LLM Provider

    App->>LLMAgent: chat_with_session("session-1", "hello")
    activate LLMAgent
    Note over LLMAgent: span: llm.agent.chat<br/>kind=Internal<br/>agent.id, session.id

    LLMAgent->>ChatSession: send("hello")
    activate ChatSession
    Note over ChatSession: span: gen_ai.chat_completion<br/>kind=Client<br/>gen_ai.system, gen_ai.request.model

    ChatSession->>Provider: POST /v1/chat/completions
    Provider-->>ChatSession: {choices, usage}
    Note over ChatSession: span ends<br/>gen_ai.usage.input_tokens<br/>gen_ai.usage.output_tokens

    ChatSession-->>LLMAgent: "response text"
    deactivate ChatSession

    Note over LLMAgent: span ends OK
    LLMAgent-->>App: "response text"
    deactivate LLMAgent

Enable the Feature

# Cargo.toml
[dependencies]
mofa-foundation = { version = "0.1", features = ["otel-tracing"] }

Span Attributes Reference

| Attribute | Span | Description |
|-----------|------|-------------|
| agent.id | llm.agent.chat | Agent identifier |
| session.id | llm.agent.chat, gen_ai.chat_completion | Session identifier |
| gen_ai.system | gen_ai.chat_completion | Provider name ("openai", "anthropic") |
| gen_ai.request.model | gen_ai.chat_completion | Model requested |
| gen_ai.response.model | gen_ai.chat_completion | Model that responded |
| gen_ai.usage.input_tokens | gen_ai.chat_completion | Prompt tokens used |
| gen_ai.usage.output_tokens | gen_ai.chat_completion | Completion tokens used |
| llm.streaming | llm.agent.chat_stream | true for streaming requests |
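
MoFA sets these attributes automatically. If you add custom instrumentation alongside the built-in spans, the same keys can be set by hand through the opentelemetry API; a sketch (the tracer scope name and values are illustrative):

use opentelemetry::{
    global,
    trace::{Span, SpanKind, Tracer},
    KeyValue,
};

fn record_chat_span() {
    // Scope name is arbitrary; it identifies your instrumentation library
    let tracer = global::tracer("my-instrumentation");

    let mut span = tracer
        .span_builder("gen_ai.chat_completion")
        .with_kind(SpanKind::Client)
        .start(&tracer);

    // Same attribute keys as the reference table above
    span.set_attribute(KeyValue::new("gen_ai.system", "openai"));
    span.set_attribute(KeyValue::new("gen_ai.request.model", "gpt-4o"));
    span.set_attribute(KeyValue::new("gen_ai.usage.input_tokens", 312));
    span.set_attribute(KeyValue::new("gen_ai.usage.output_tokens", 78));
    span.end();
}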

Console Exporter (Development)

See spans printed to stdout — no infrastructure required:

use opentelemetry_sdk::trace::TracerProvider;
use opentelemetry::global;

fn init_tracing() {
    let exporter = opentelemetry_stdout::SpanExporter::default();
    let provider = TracerProvider::builder()
        .with_simple_exporter(exporter)
        .build();
    global::set_tracer_provider(provider);
}

Example console output for a single chat_with_session call:

SpanData {
  name: "llm.agent.chat",
  kind: Internal,
  status: Ok,
  attributes: [
    agent.id = "secretary-agent",
    session.id = "01936b2f-1234-7abc-8def-000000000001"
  ]
}
SpanData {
  name: "gen_ai.chat_completion",
  kind: Client,
  status: Ok,
  attributes: [
    gen_ai.system = "openai",
    gen_ai.request.model = "gpt-4o",
    gen_ai.response.model = "gpt-4o",
    gen_ai.usage.input_tokens = 312,
    gen_ai.usage.output_tokens = 78
  ]
}

Jaeger Exporter (Local)

Run Jaeger's all-in-one container and export spans to it over OTLP/gRPC:

# Start Jaeger all-in-one
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one:latest

Then initialise the exporter and run a traced conversation:

use mofa_foundation::llm::LLMAgentBuilder;
use opentelemetry::global;
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::trace::TracerProvider;

fn init_tracing() -> anyhow::Result<()> {
    let exporter = opentelemetry_otlp::SpanExporter::builder()
        .with_tonic()
        .with_endpoint("http://localhost:4317")
        .build()?;

    let provider = TracerProvider::builder()
        .with_batch_exporter(exporter, opentelemetry_sdk::runtime::Tokio)
        .build();

    global::set_tracer_provider(provider);
    Ok(())
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    init_tracing()?;

    let agent = LLMAgentBuilder::from_env()?
        .with_id("demo-agent")
        .build();

    let session = agent.create_session().await;
    let reply = agent.chat_with_session(&session, "What is Rust?").await?;
    println!("{reply}");

    // Flush spans before exit
    global::shutdown_tracer_provider();
    Ok(())
}

Open the Jaeger UI at http://localhost:16686 and select the mofa-foundation service.


OTLP / Grafana Tempo (Production)

# Set via environment (no code change needed)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo:4317
export OTEL_SERVICE_NAME=mofa-production

The initialisation code reads the endpoint at startup:

use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::trace::TracerProvider;
use opentelemetry::global;

fn init_tracing() -> anyhow::Result<()> {
    let endpoint = std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT")
        .unwrap_or_else(|_| "http://localhost:4317".to_string());

    let exporter = opentelemetry_otlp::SpanExporter::builder()
        .with_tonic()
        .with_endpoint(endpoint)
        .build()?;

    let provider = TracerProvider::builder()
        .with_batch_exporter(exporter, opentelemetry_sdk::runtime::Tokio)
        .build();

    global::set_tracer_provider(provider);
    Ok(())
}

Full Working Example

use mofa_foundation::llm::LLMAgentBuilder;
use opentelemetry::{global, KeyValue};
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::trace::TracerProvider;

fn init_tracing() -> anyhow::Result<()> {
    let exporter = opentelemetry_otlp::SpanExporter::builder()
        .with_tonic()
        .with_endpoint("http://localhost:4317")
        .build()?;

    let provider = TracerProvider::builder()
        .with_batch_exporter(exporter, opentelemetry_sdk::runtime::Tokio)
        .with_resource(opentelemetry_sdk::Resource::new(vec![
            KeyValue::new("service.name", "my-mofa-app"),
            KeyValue::new("deployment.environment", "production"),
        ]))
        .build();

    global::set_tracer_provider(provider);
    Ok(())
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // 1. Init tracing — must happen before any agent calls
    init_tracing()?;

    // 2. Build agent (otel-tracing feature auto-instruments all calls)
    let agent = LLMAgentBuilder::from_env()?
        .with_id("support-agent")
        .with_system_prompt("You are a helpful assistant.")
        .build();

    // 3. Multi-turn conversation — every call emits two spans
    let session = agent.create_session().await;

    let r1 = agent.chat_with_session(&session, "What is MoFA?").await?;
    println!("Turn 1: {r1}");

    let r2 = agent.chat_with_session(&session, "Give me a code example.").await?;
    println!("Turn 2: {r2}");

    // 4. Flush before exit
    global::shutdown_tracer_provider();
    Ok(())
}

Each call to chat_with_session emits two spans — an outer llm.agent.chat (Internal) and an inner gen_ai.chat_completion (Client) — visible as a parent-child pair in Jaeger/Tempo.


Part 3 — Logging

Configure via RUST_LOG:

export RUST_LOG=mofa_foundation=debug,mofa_runtime=info
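
The RUST_LOG filter only takes effect if the application installs a subscriber. A typical initialisation with tracing-subscriber (assuming its env-filter feature is enabled):

use tracing_subscriber::EnvFilter;

fn init_logging() {
    tracing_subscriber::fmt()
        // Honour RUST_LOG, falling back to "info" when it is unset
        .with_env_filter(
            EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")),
        )
        .init();
}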

Structured Log Fields

use tracing::{info, debug, instrument};

// Inside an agent's impl block; #[instrument] opens a span for each call
#[instrument(skip(input))]
async fn execute(&mut self, input: AgentInput) -> AgentResult<AgentOutput> {
    debug!(input_len = input.to_text().len(), "Processing input");
    let result = self.process(input).await?;
    info!(output_len = result.as_text().map(|s| s.len()), "Execution complete");
    Ok(result)
}

Part 4 — Health Checks

use mofa_sdk::monitoring::HealthCheck;

let health = HealthCheck::new()
    .with_database_check(|| async { store.health().await })
    .with_llm_check(|| async { llm.health().await });

// GET /health
let status = health.check().await;
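
To serve the result over HTTP, one option is a small axum handler. This wiring is a sketch, not part of mofa_sdk; the response shape is illustrative:

use axum::{routing::get, Json, Router};
use serde_json::json;

// Hypothetical handler: call `health.check().await` here and map the
// result into the response body.
async fn health_handler() -> Json<serde_json::Value> {
    Json(json!({ "status": "ok" }))
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/health", get(health_handler));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080")
        .await
        .expect("failed to bind");
    axum::serve(listener, app).await.expect("server error");
}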

Part 5 — Dashboard

MoFA includes a built-in monitoring dashboard:

cargo run -p mofa-monitoring

Access at http://localhost:3000 — shows live agent metrics, LLM token usage, latency histograms, and system resources.


See Also