context compression architecture
architecture diagrams for mofa’s context compression system: how components interact, how data flows, and how the layers are organized.
table of contents
- system architecture overview
- layer responsibilities
- compression flow
- component interactions
- strategy selection flow
- data flow diagram
- caching architecture
- integration points
system architecture overview
the context compression system follows mofa’s microkernel architecture - trait definitions live in the kernel layer, concrete implementations in the foundation layer.
graph TB
%% =========================
%% User Layer
%% =========================
subgraph USER["User Code / Agent Executor"]
Messages[Chat Messages]
Agent[Agent Executor]
Messages -->|input| Agent
Agent -->|compressed| Messages
end
%% =========================
%% Foundation Layer
%% =========================
subgraph FOUNDATION["mofa-foundation Layer"]
subgraph IMPL["Compression Implementations"]
direction LR
SW[SlidingWindowCompressor]
SUM[SummarizingCompressor]
SEM[SemanticCompressor]
HIER[HierarchicalCompressor]
HYB[HybridCompressor]
end
subgraph SUPPORT["Supporting Components"]
direction LR
TC[TokenCounter]
TTC[TikTokenCounter]
Cache[CompressionCache]
end
subgraph LLMINT["LLM Integration"]
direction LR
LLM[LLM Provider]
Embed[Embedding Provider]
end
end
%% =========================
%% Kernel Layer
%% =========================
subgraph KERNEL["mofa-kernel Layer"]
direction LR
Trait[ContextCompressor Trait]
Strategy[CompressionStrategy Enum]
Result[CompressionResult]
Metrics[CompressionMetrics]
Trait -->|defines| Strategy
Trait -->|returns| Result
Result -->|contains| Metrics
end
%% =========================
%% Connections (Structured)
%% =========================
%% Agent uses implementations (kept vertical)
Agent -->|uses| SW
Agent -->|uses| SUM
Agent -->|uses| SEM
Agent -->|uses| HIER
Agent -->|uses| HYB
%% Implementations implement trait
SW -->|implements| Trait
SUM -->|implements| Trait
SEM -->|implements| Trait
HIER -->|implements| Trait
HYB -->|implements| Trait
%% Dependencies aligned downward to reduce crossing
SW --> TC
SUM --> TC
SUM --> LLM
SUM --> Cache
SEM --> TC
SEM --> Embed
SEM --> Cache
HIER --> TC
HIER --> LLM
HYB --> TC
layer responsibilities
graph LR
subgraph "mofa-kernel"
K1[Trait Definition<br/>ContextCompressor]
K2[Core Types<br/>CompressionStrategy<br/>CompressionMetrics<br/>CompressionResult]
K3[Interface Contracts]
end
subgraph "mofa-foundation"
F1[Concrete Implementations<br/>5 Compression Strategies]
F2[Token Counting<br/>Heuristic & TikToken]
F3[Caching System<br/>LRU Cache]
F4[Integration Logic]
end
K1 -.->|defines interface| F1
K2 -.->|used by| F1
K3 -.->|contract| F1
F1 -->|implements| K1
F1 -->|uses| K2
F2 -->|used by| F1
F3 -->|used by| F1
F4 -->|orchestrates| F1
key principles:
- kernel layer: defines abstractions only, no implementations
- foundation layer: provides all concrete implementations
- dependency direction: foundation → kernel (never reverse)
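the layering can be illustrated with a minimal sketch in which two modules stand in for the mofa-kernel and mofa-foundation crates. the ChatMessage shape, the compress signature, and the keep_recent field are assumptions made for illustration, not mofa's actual API; the token-limit check is left out to keep the focus on which layer owns what.

```rust
// Hypothetical stand-in for the mofa-kernel crate: abstractions only.
mod kernel {
    #[derive(Debug, Clone, PartialEq)]
    pub struct ChatMessage {
        pub role: String, // assumed values: "system", "user", "assistant"
        pub content: String,
    }

    /// Interface contract. The signature is an assumption for illustration.
    pub trait ContextCompressor {
        fn compress(&self, messages: Vec<ChatMessage>, max_tokens: usize) -> Vec<ChatMessage>;
    }
}

// Hypothetical stand-in for the mofa-foundation crate. It imports from
// `kernel`; `kernel` never imports from here.
mod foundation {
    use super::kernel::{ChatMessage, ContextCompressor};

    pub struct SlidingWindowCompressor {
        pub keep_recent: usize, // hypothetical field: recent messages to keep
    }

    impl ContextCompressor for SlidingWindowCompressor {
        fn compress(&self, messages: Vec<ChatMessage>, _max_tokens: usize) -> Vec<ChatMessage> {
            // Separate system messages from the rest, preserving order.
            let (system, rest): (Vec<ChatMessage>, Vec<ChatMessage>) =
                messages.into_iter().partition(|m| m.role == "system");
            // Keep system messages plus the most recent `keep_recent` others.
            let skip = rest.len().saturating_sub(self.keep_recent);
            system.into_iter().chain(rest.into_iter().skip(skip)).collect()
        }
    }
}
```

because the trait lives in the kernel module, code that only needs the abstraction can depend on it without pulling in any concrete strategy.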
compression flow
end-to-end compression process from agent execution to compressed output.
sequenceDiagram
participant Agent as Agent Executor
participant Compressor as ContextCompressor
participant Counter as TokenCounter
participant Strategy as Compression Strategy
participant LLM as LLM / Embedding Provider
participant Cache as CompressionCache
participant Result as CompressionResult
Agent->>Compressor: compress(messages, max_tokens)
Compressor->>Counter: count_tokens(messages)
Counter-->>Compressor: token_count
alt token_count <= max_tokens
Compressor-->>Agent: messages (unchanged)
else token_count > max_tokens
Compressor->>Strategy: apply compression strategy
alt Strategy == SlidingWindow
Strategy->>Strategy: keep system + recent N
Strategy-->>Compressor: compressed messages
else Strategy == Summarize
Strategy->>Cache: check summary cache
alt cache hit
Cache-->>Strategy: cached summary
else cache miss
Strategy->>LLM: generate summary
LLM-->>Strategy: summary
Strategy->>Cache: store summary
end
Strategy-->>Compressor: compressed messages
else Strategy == Semantic
Strategy->>Cache: check embedding cache
alt cache hit
Cache-->>Strategy: cached embeddings
else cache miss
Strategy->>LLM: generate embeddings
LLM-->>Strategy: embeddings
Strategy->>Cache: store embeddings
end
Strategy->>Strategy: cluster by similarity
Strategy->>Strategy: merge redundant messages
Strategy-->>Compressor: compressed messages
else Strategy == Hierarchical
Strategy->>Strategy: score messages
Strategy->>Strategy: keep high-scoring
alt high-scoring messages don't fit
Strategy->>LLM: summarize message
LLM-->>Strategy: summary
end
Strategy-->>Compressor: compressed messages
else Strategy == Hybrid
Strategy->>Strategy: apply multiple strategies
Strategy-->>Compressor: compressed messages
end
Compressor->>Counter: count_tokens(compressed)
Counter-->>Compressor: tokens_after
Compressor->>Result: create CompressionResult
Result-->>Agent: CompressionResult
end
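the outer steps of the sequence (count, return early when within budget, otherwise compress, recount, and build the result) can be sketched as follows. the function name run_compression, the use of plain strings for messages, and passing the counter and strategy as closures are assumptions chosen to isolate the control flow. unlike the diagram, this sketch wraps the unchanged case in a result too, which is a simplification.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct CompressionResult {
    pub messages: Vec<String>,
    pub tokens_before: usize,
    pub tokens_after: usize,
}

/// Hypothetical driver mirroring the sequence above. `count_tokens` and
/// `strategy` are injected so the control flow stays visible.
pub fn run_compression<C, S>(
    messages: Vec<String>,
    max_tokens: usize,
    count_tokens: C,
    strategy: S,
) -> CompressionResult
where
    C: Fn(&[String]) -> usize,
    S: Fn(Vec<String>, usize) -> Vec<String>,
{
    let tokens_before = count_tokens(&messages);

    // Within budget: hand the messages back untouched.
    if tokens_before <= max_tokens {
        return CompressionResult {
            messages,
            tokens_before,
            tokens_after: tokens_before,
        };
    }

    // Over budget: apply the strategy, then recount for the metrics.
    let compressed = strategy(messages, max_tokens);
    let tokens_after = count_tokens(&compressed);
    CompressionResult {
        messages: compressed,
        tokens_before,
        tokens_after,
    }
}
```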
component interactions
how different compression components interact with each other and external systems.
graph TD
subgraph "Compression Strategies"
SW[SlidingWindowCompressor]
SUM[SummarizingCompressor]
SEM[SemanticCompressor]
HIER[HierarchicalCompressor]
HYB[HybridCompressor]
end
subgraph "Token Counting"
TC[TokenCounter<br/>chars/4 heuristic]
TTC[TikTokenCounter<br/>accurate counting]
end
subgraph "Caching Layer"
EC[Embedding Cache<br/>LRU]
SC[Summary Cache<br/>LRU]
end
subgraph "External Services"
LLM[LLM Provider<br/>OpenAI/Anthropic]
EMB[Embedding Provider<br/>OpenAI/Anthropic]
end
subgraph "Output"
CR[CompressionResult]
CM[CompressionMetrics]
end
SW --> TC
SUM --> TC
SUM --> LLM
SUM --> SC
SEM --> TC
SEM --> EMB
SEM --> EC
HIER --> TC
HIER --> LLM
HYB --> SW
HYB --> SUM
HYB --> SEM
HYB --> HIER
SW --> CR
SUM --> CR
SEM --> CR
HIER --> CR
HYB --> CR
CR --> CM
strategy selection flow
decision-making process for choosing and applying compression strategies.
flowchart TD
Start([Agent receives messages]) --> CheckTokens{Token count<br/>exceeds limit?}
CheckTokens -->|No| Return[Return messages unchanged]
CheckTokens -->|Yes| SelectStrategy{Strategy type?}
SelectStrategy -->|SlidingWindow| SW[SlidingWindow Strategy]
SelectStrategy -->|Summarize| SUM[Summarize Strategy]
SelectStrategy -->|Semantic| SEM[Semantic Strategy]
SelectStrategy -->|Hierarchical| HIER[Hierarchical Strategy]
SelectStrategy -->|Hybrid| HYB[Hybrid Strategy]
SW --> SW1[Separate system messages]
SW1 --> SW2[Keep system + recent N]
SW2 --> SW3[Return compressed]
SUM --> SUM1[Separate system messages]
SUM1 --> SUM2[Keep recent messages]
SUM2 --> SUM3{Check cache}
SUM3 -->|Hit| SUM4[Use cached summary]
SUM3 -->|Miss| SUM5[Call LLM to summarize]
SUM5 --> SUM6[Store in cache]
SUM6 --> SUM4
SUM4 --> SUM7[Return compressed]
SEM --> SEM1[Separate system messages]
SEM1 --> SEM2[Keep recent messages]
SEM2 --> SEM3{Check cache}
SEM3 -->|Hit| SEM4[Use cached embeddings]
SEM3 -->|Miss| SEM5[Generate embeddings]
SEM5 --> SEM6[Store in cache]
SEM6 --> SEM4
SEM4 --> SEM7[Cluster by similarity]
SEM7 --> SEM8[Merge redundant messages]
SEM8 --> SEM9[Return compressed]
HIER --> HIER1[Separate system messages]
HIER1 --> HIER2[Score all messages]
HIER2 --> HIER3[Keep high-scoring]
HIER3 --> HIER3A{High-scoring<br/>messages fit?}
HIER3A -->|No| HIER3B[Summarize with LLM]
HIER3A -->|Yes| HIER4[Return compressed]
HIER3B --> HIER4
HYB --> HYB1[Apply first strategy]
HYB1 --> HYB2{Still exceeds limit?}
HYB2 -->|Yes| HYB3[Apply next strategy]
HYB3 --> HYB2
HYB2 -->|No| HYB4[Return compressed]
SW3 --> CalculateMetrics[Calculate metrics]
SUM7 --> CalculateMetrics
SEM9 --> CalculateMetrics
HIER4 --> CalculateMetrics
HYB4 --> CalculateMetrics
CalculateMetrics --> CreateResult[Create CompressionResult]
CreateResult --> Log[Log compression event]
Log --> ReturnResult[Return to Agent]
Return --> ReturnResult
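the hybrid branch of the flowchart (apply a strategy, check the limit, apply the next one) can be sketched as a loop over stages. the Stage alias, the hybrid_compress name, and counting whitespace-separated words as a stand-in for tokens are all assumptions. the flowchart does not say what happens when the strategies run out; this sketch simply stops and returns what it has.

```rust
/// Hypothetical stage: any function that shrinks a list of message contents.
pub type Stage = Box<dyn Fn(Vec<String>) -> Vec<String>>;

/// Stand-in counter: one token per whitespace-separated word.
fn count_tokens(messages: &[String]) -> usize {
    messages.iter().map(|m| m.split_whitespace().count()).sum()
}

/// Apply stages in order until the messages fit or no stages remain.
/// Returns the messages and the number of stages that actually ran.
pub fn hybrid_compress(
    mut messages: Vec<String>,
    max_tokens: usize,
    stages: &[Stage],
) -> (Vec<String>, usize) {
    let mut applied = 0;
    for stage in stages {
        // Stop as soon as the current messages are within the limit.
        if count_tokens(&messages) <= max_tokens {
            break;
        }
        messages = stage(messages);
        applied += 1;
    }
    (messages, applied)
}
```

ordering the stages from cheapest to most expensive means an LLM-backed stage only runs when the cheaper ones were not enough.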
data flow diagram
how data flows through the compression system, from input messages to compressed output with metrics.
flowchart LR
Input[Input Messages<br/>Vec<ChatMessage>] --> TokenCount[Token Counting]
TokenCount --> Decision{Exceeds<br/>max_tokens?}
Decision -->|No| DirectOutput[Output Messages<br/>unchanged]
Decision -->|Yes| Strategy[Compression Strategy]
Strategy --> Process[Process Messages]
Process --> SW_Proc[SlidingWindow:<br/>Truncate]
Process --> SUM_Proc[Summarize:<br/>LLM Summary]
Process --> SEM_Proc[Semantic:<br/>Embed & Cluster]
Process --> HIER_Proc[Hierarchical:<br/>Score & Filter]
Process --> HYB_Proc[Hybrid:<br/>Multi-stage]
SW_Proc --> Compressed[Compressed Messages]
SUM_Proc --> Compressed
SEM_Proc --> Compressed
HIER_Proc --> Compressed
HYB_Proc --> Compressed
Compressed --> FinalTokenCount[Final Token Count]
DirectOutput --> FinalTokenCount
FinalTokenCount --> Metrics[Calculate Metrics<br/>tokens_before<br/>tokens_after<br/>compression_ratio<br/>reduction_percent]
Metrics --> Result[CompressionResult<br/>messages<br/>metrics<br/>strategy_name]
Result --> Output[Output to Agent]
style Input fill:#e1f5ff
style Output fill:#d4edda
style Result fill:#fff3cd
style Metrics fill:#f8d7da
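the four metric fields named in the diagram can be derived from the two token counts. the formulas below are assumptions, since the diagram only lists the field names: compression_ratio is taken as tokens_after / tokens_before, and reduction_percent as (1 - compression_ratio) * 100.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct CompressionMetrics {
    pub tokens_before: usize,
    pub tokens_after: usize,
    pub compression_ratio: f64, // assumed: tokens_after / tokens_before
    pub reduction_percent: f64, // assumed: (1 - compression_ratio) * 100
}

impl CompressionMetrics {
    pub fn new(tokens_before: usize, tokens_after: usize) -> Self {
        // Treat empty input as "nothing compressed" to avoid dividing by zero.
        let compression_ratio = if tokens_before == 0 {
            1.0
        } else {
            tokens_after as f64 / tokens_before as f64
        };
        let reduction_percent = (1.0 - compression_ratio) * 100.0;
        Self {
            tokens_before,
            tokens_after,
            compression_ratio,
            reduction_percent,
        }
    }
}
```

under these definitions, going from 1000 tokens to 250 gives a ratio of 0.25 and a reduction of 75 percent.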
caching architecture
the compression system includes an optional LRU cache for embeddings and summaries, so content that has already been embedded or summarized does not trigger another API call.
graph TB
subgraph "CompressionCache"
Cache[CompressionCache Manager]
EC[Embedding Cache<br/>HashMap<String, Entry>]
SC[Summary Cache<br/>HashMap<String, Entry>]
end
subgraph "Cache Operations"
Get[get_embedding<br/>get_summary]
Store[store_embedding<br/>store_summary]
Evict[LRU Eviction]
Stats[Cache Statistics]
end
subgraph "Cache Entry"
EEntry[EmbeddingCacheEntry<br/>embedding: Vec<f32><br/>accessed_at: Instant]
SEntry[SummaryCacheEntry<br/>summary: String<br/>accessed_at: Instant]
end
subgraph "Key Generation"
KeyGen[Cache Key<br/>SHA256 hash of content]
end
Cache --> EC
Cache --> SC
EC --> EEntry
SC --> SEntry
Get --> Cache
Store --> Cache
Evict --> Cache
Stats --> Cache
KeyGen --> Get
KeyGen --> Store
style Cache fill:#e1f5ff
style EC fill:#d4edda
style SC fill:#d4edda
style Evict fill:#fff3cd
cache features:
- lru eviction: automatically evicts the least recently used entries when capacity is reached
- separate caches: embeddings and summaries cached independently
- sha256 keys: content-based cache keys for deduplication
- thread-safe: uses Arc<RwLock<>> for concurrent access
- statistics: tracks cache size and capacity
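the features above can be sketched for the summary side of the cache using only the standard library. the type and method bodies are assumptions based on the names in the diagram. one deliberate substitution: std's DefaultHasher stands in for SHA-256 here because the standard library has no SHA-256; it is not cryptographic and should not be read as the real key scheme.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, RwLock};
use std::time::Instant;

struct SummaryCacheEntry {
    summary: String,
    accessed_at: Instant,
}

/// Hypothetical summary cache; cloning shares the same underlying map.
#[derive(Clone)]
pub struct SummaryCache {
    inner: Arc<RwLock<HashMap<String, SummaryCacheEntry>>>,
    capacity: usize,
}

/// Content-based key. DefaultHasher is a stand-in for SHA-256 (assumption).
pub fn cache_key(content: &str) -> String {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

impl SummaryCache {
    pub fn new(capacity: usize) -> Self {
        Self { inner: Arc::new(RwLock::new(HashMap::new())), capacity }
    }

    pub fn get_summary(&self, content: &str) -> Option<String> {
        // A hit refreshes accessed_at, so even reads take the write lock.
        let mut map = self.inner.write().unwrap();
        let entry = map.get_mut(&cache_key(content))?;
        entry.accessed_at = Instant::now();
        Some(entry.summary.clone())
    }

    pub fn store_summary(&self, content: &str, summary: String) {
        if self.capacity == 0 {
            return; // a zero-capacity cache stores nothing
        }
        let key = cache_key(content);
        let mut map = self.inner.write().unwrap();
        if !map.contains_key(&key) && map.len() >= self.capacity {
            // Evict the entry with the oldest access time (least recently used).
            let lru = map
                .iter()
                .min_by_key(|(_, e)| e.accessed_at)
                .map(|(k, _)| k.clone());
            if let Some(k) = lru {
                map.remove(&k);
            }
        }
        map.insert(key, SummaryCacheEntry { summary, accessed_at: Instant::now() });
    }

    /// Current number of entries, for cache statistics.
    pub fn len(&self) -> usize {
        self.inner.read().unwrap().len()
    }
}
```

scanning for the oldest entry is linear in the cache size, which is acceptable for a sketch; a production cache would more likely keep an ordered structure alongside the map.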
integration points
how the compression system integrates with other mofa components.
graph TB
subgraph "Agent Executor"
Executor[AgentExecutor]
Config[AgentConfig<br/>max_context_tokens<br/>compressor]
end
subgraph "Context Compression"
Compressor[ContextCompressor]
Result[CompressionResult]
end
subgraph "Session Management"
Session[Session Manager]
History[Message History]
end
subgraph "LLM Integration"
LLM[LLM Provider]
Embed[Embedding Provider]
end
subgraph "Logging & Observability"
Tracing[tracing crate]
Events[Compression Events]
end
Executor --> Config
Config --> Compressor
Executor --> Session
Session --> History
History --> Compressor
Compressor --> LLM
Compressor --> Embed
Compressor --> Result
Result --> Executor
Compressor --> Tracing
Tracing --> Events
style Executor fill:#e1f5ff
style Compressor fill:#d4edda
style Result fill:#fff3cd
style Events fill:#f8d7da
integration details:
- agent executor integration:
  - checks token count before each LLM call
  - automatically compresses if limit exceeded
  - preserves system prompts and recent messages
- session management:
  - works with persistent session storage
  - compresses historical messages while keeping recent context
- llm provider integration:
  - uses LLM for summarization (SummarizingCompressor, HierarchicalCompressor)
  - uses embedding API for semantic compression (SemanticCompressor)
- observability:
  - structured logging with tracing
  - compression events logged with metrics
  - cache statistics available for monitoring
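the executor side of this integration can be sketched as a hook that runs before each LLM call. the AgentConfig fields follow the diagram, but the trait methods shown, the prepare_context name, and the use of plain strings for messages are assumptions; logging is omitted.

```rust
/// Assumed minimal interface the executor needs from a compressor.
pub trait ContextCompressor {
    fn count_tokens(&self, messages: &[String]) -> usize;
    fn compress(&self, messages: Vec<String>, max_tokens: usize) -> Vec<String>;
}

pub struct AgentConfig {
    pub max_context_tokens: usize,
    /// Optional: with no compressor configured, history passes through as-is.
    pub compressor: Option<Box<dyn ContextCompressor>>,
}

pub struct AgentExecutor {
    pub config: AgentConfig,
}

impl AgentExecutor {
    /// Hypothetical hook, called before each LLM request.
    pub fn prepare_context(&self, history: Vec<String>) -> Vec<String> {
        let limit = self.config.max_context_tokens;
        match &self.config.compressor {
            // Compress only when a compressor exists and the limit is exceeded.
            Some(c) if c.count_tokens(&history) > limit => c.compress(history, limit),
            _ => history,
        }
    }
}
```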
architecture principles
1. separation of concerns
- kernel layer: defines what (interfaces and contracts)
- foundation layer: defines how (concrete implementations)
2. extensibility
- new compression strategies can be added by implementing ContextCompressor
- token counting can be customized via trait methods
- caching is optional and feature-gated
3. performance
- fast strategies (SlidingWindow) for low-latency scenarios
- caching reduces API calls and improves throughput
- parallel processing available via feature flags
4. observability
- comprehensive metrics for every compression operation
- structured logging for debugging and monitoring
- cache statistics for performance analysis
5. backward compatibility
- default compress() method for existing code
- optional compress_with_metrics() for new code
- feature flags for optional dependencies
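principles 2 and 5 fit together naturally through provided trait methods. in the sketch below, a new strategy implements only the required methods and inherits token counting and compress_with_metrics(). the exact signatures, the name() method, the chars/4 default, and DropOldestCompressor are assumptions for illustration, not mofa's actual trait.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct CompressionResult {
    pub messages: Vec<String>,
    pub tokens_before: usize,
    pub tokens_after: usize,
    pub strategy_name: String,
}

pub trait ContextCompressor {
    /// Required: a short strategy name (hypothetical method).
    fn name(&self) -> &str;

    /// Required: the call that existing code already relies on.
    fn compress(&self, messages: Vec<String>, max_tokens: usize) -> Vec<String>;

    /// Provided: chars/4 heuristic, rounded up. Implementors may override it.
    fn count_tokens(&self, messages: &[String]) -> usize {
        messages.iter().map(|m| (m.chars().count() + 3) / 4).sum()
    }

    /// Provided: wraps compress() with before/after counts, so existing
    /// implementors gain it without any code changes.
    fn compress_with_metrics(&self, messages: Vec<String>, max_tokens: usize) -> CompressionResult {
        let tokens_before = self.count_tokens(&messages);
        let compressed = self.compress(messages, max_tokens);
        let tokens_after = self.count_tokens(&compressed);
        CompressionResult {
            messages: compressed,
            tokens_before,
            tokens_after,
            strategy_name: self.name().to_string(),
        }
    }
}

/// Hypothetical new strategy: drops the oldest message until the rest fit.
pub struct DropOldestCompressor;

impl ContextCompressor for DropOldestCompressor {
    fn name(&self) -> &str {
        "drop_oldest"
    }

    fn compress(&self, mut messages: Vec<String>, max_tokens: usize) -> Vec<String> {
        // Always keep at least the most recent message.
        while messages.len() > 1 && self.count_tokens(&messages) > max_tokens {
            messages.remove(0);
        }
        messages
    }
}
```

callers that only know compress() keep working unchanged, while newer callers can opt into compress_with_metrics() on the same object.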
future enhancements (phase 3)
the architecture is designed to support future enhancements:
- adaptive strategy selection: automatically choose the best strategy based on conversation characteristics
- compression profiles: pre-configured presets (fast, balanced, quality, cost-optimized)
- quality evaluation: measure semantic preservation, not just token reduction
- configuration management: yaml/toml configs for reusable settings
- cost tracking: monitor and optimize API costs for compression operations
these enhancements will build on the existing architecture without requiring breaking changes.