How We Made Our Agents 70–86% More Frugal and Smarter
The Architecture That Cut Our AI Token Usage by 80% While Boosting Performance
Introduction: The Problem
AI agents are only as capable as the context they can afford to process. At JustCopy.ai, our agents navigate the web, reason across long task sequences, and coordinate tools to get work done for our users. As adoption grew, so did our costs and latency, because each step required large prompt contexts loaded with instructions, history, page state, and tool outputs. In peak sessions, we were burning tens of thousands of tokens per action—and on complex tasks, hundreds of thousands overall.
The tradeoff was clear and unsatisfying: either keep agents “smart” with rich prompts and pay heavily in tokens, or make them “cheap” by pruning context and accept a hit to reliability and capability. We refused that binary. Instead, we set an ambitious goal: make agents super frugal AND super smart—delivering the same or better outcomes while cutting token usage by 70–86%.
This is the story of how we achieved it.
The Insight: Not All Skills Are Needed All The Time
Agents often carry a heavy backpack of skills and instructions at every step: search strategies, form-filling rules, error recovery logic, data extraction patterns, content style guides, compliance policies, and more. But in real task sequences, most of those skills lie dormant at any given moment.
Our core insight was simple:
An agent’s “global brain” is overkill for most local decisions.
Most steps require only a small, task-local subset of capabilities.
Loading everything, every time, wastes tokens and hurts focus.
We reframed the problem from “How do we compress everything?” to “How do we load only what’s needed now?” That led to a shift from static, monolithic prompts to a lean agent core with dynamic skill injection—a just-in-time model for intelligence.
The Solution: Lean Agents + Dynamic Skill Injection
We split the agent into two conceptual layers:
Lean Core Agent
Minimal, universal instructions for safe tool use, step planning, and result verification
Stable conversational scaffolding with tight formatting contracts
Lightweight memory hooks and a compact reasoning rubric
Dynamic Skill Injection (DSI)
Contextual, on-demand capabilities that are injected only when the step truly needs them
Examples: DOM extraction recipes, structured data parsing schemas, domain-specific validators, retry heuristics, persona/style guides, and search/domain playbooks
Together, these produced a tighter, faster loop: the core stays small and consistent, while the skill layer flexes to the task’s needs. The agent remains competent without carrying encyclopedic instructions at every turn.
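To make the split concrete, here is a minimal sketch of how a lean core prompt and just-in-time skills can compose on each step. The names (Skill, SkillRegistry, build_step_prompt) and the trigger mechanism are illustrative assumptions, not our production code:

```python
from dataclasses import dataclass, field

# Lean core: small, stable instructions shared by every step.
LEAN_CORE_PROMPT = (
    "You are a task agent. Plan one step at a time, call tools only via the "
    "declared JSON contracts, and verify each result before proceeding."
)

@dataclass
class Skill:
    name: str
    instructions: str                                 # prompt fragment injected when active
    triggers: set[str] = field(default_factory=set)   # step signals that activate it

@dataclass
class SkillRegistry:
    skills: list[Skill] = field(default_factory=list)

    def select(self, step_signals: set[str]) -> list[Skill]:
        # Inject only skills whose triggers intersect the current step's signals.
        return [s for s in self.skills if s.triggers & step_signals]

def build_step_prompt(registry: SkillRegistry, step_signals: set[str], state_delta: str) -> str:
    active = registry.select(step_signals)
    skill_block = "\n\n".join(s.instructions for s in active)
    # Core + just-in-time skills + compact state delta, nothing else.
    return f"{LEAN_CORE_PROMPT}\n\n{skill_block}\n\nSTATE DELTA:\n{state_delta}"

# Usage: on a form step, only the form-filling skill is injected.
registry = SkillRegistry([
    Skill("form_filling", "Fill forms field-by-field; never guess required values.", {"form_detected"}),
    Skill("data_extraction", "Extract records as JSON matching the given schema.", {"table_detected"}),
])
prompt = build_step_prompt(registry, {"form_detected"}, "+ node#42 <input name='email'> appeared")
```

The point is the shape of the loop: the core stays constant, and the agent only pays for skills whose triggers match the current step's signals.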
The Implementation Approach
We engineered the system around four pillars that work together to minimize tokens while preserving precision and capability.
Context Budgeting and Governance
We introduced a context budget manager that treats tokens as a first-class resource (a code sketch follows this list)
Sets a per-step token envelope and enforces hard caps
Prioritizes high-yield information (actionable elements, deltas, schemas)
Defers low-utility info with lazy loading and backoff
Degrades gracefully: if the budget tightens, we compress explainability first, never the action contract
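Here is a simplified, hypothetical version of that budget manager. The count_tokens placeholder, the priority numbers, and the truncation rule stand in for our real policies:

```python
# Rough placeholder tokenizer; swap in a real tokenizer in practice.
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # ~4 characters per token as a crude estimate

def fit_to_budget(sections: list[tuple[str, str, int]], budget: int) -> str:
    """sections: (name, text, priority), lower priority numbers are kept first.
    Drops or truncates low-priority sections (e.g. explanations) before
    high-priority ones (e.g. the action contract) to enforce the hard cap."""
    kept, used = [], 0
    for name, text, _prio in sorted(sections, key=lambda s: s[2]):
        cost = count_tokens(text)
        if used + cost <= budget:
            kept.append((name, text))
            used += cost
        else:
            remaining = budget - used
            if remaining > 50:  # include a truncated tail only if a useful amount fits
                kept.append((name, text[: remaining * 4]))
            break
    return "\n\n".join(text for _name, text in kept)

step_prompt = fit_to_budget(
    [
        ("action_contract", 'Respond with JSON: {"action": ..., "target_id": ...}', 0),
        ("actionable_elements", "node#12 button 'Submit' ...", 1),
        ("explainability", "Rationale for the previous step ...", 2),
    ],
    budget=2_000,
)
```

The property that matters is the degradation order: explanations shrink first, the action contract never does.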
Dynamic Skill Graph
We modeled skills as nodes in a dependency graph with metadata (sketched in code after this list)
Triggers: page patterns, tool outputs, errors, or user intents
Cost hints: expected token footprint and benefit score
Compatibility flags: which skills compose well; which conflict
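As a rough sketch of the graph and its selection pass (the field names and the benefit/cost scoring are assumptions for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class SkillNode:
    name: str
    token_cost: int                                    # cost hint: expected prompt footprint
    benefit: float                                     # benefit score under the current trigger
    triggers: set[str] = field(default_factory=set)    # page patterns, errors, intents, ...
    requires: set[str] = field(default_factory=set)    # dependencies to co-inject
    conflicts: set[str] = field(default_factory=set)   # skills that must not co-occur

def resolve(graph: dict[str, SkillNode], signals: set[str], token_budget: int) -> list[str]:
    # Rank triggered skills by benefit per token, then add each one (plus its
    # dependencies) while respecting the budget and the conflict flags.
    triggered = [n for n in graph.values() if n.triggers & signals]
    triggered.sort(key=lambda n: n.benefit / n.token_cost, reverse=True)
    chosen, spent = [], 0
    for node in triggered:
        needed = [node.name, *node.requires]
        cost = sum(graph[name].token_cost for name in needed if name not in chosen)
        conflict = any(c in chosen for name in needed for c in graph[name].conflicts)
        if not conflict and spent + cost <= token_budget:
            chosen.extend(name for name in needed if name not in chosen)
            spent += cost
    return chosen
```

A greedy benefit-per-token ordering like this is just one reasonable policy; the learned scheduler described under "What's Next" is meant to replace it.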
Differential State Representation (DOM + History)
We adopted an aggressively compact representation of state (see the diffing sketch after this list)
Page HTML is simplified to an actionable skeleton with stable node IDs
We send diffs across steps instead of full page dumps
Prior interactions are summarized into terse, typed records (intent, action, outcome)
Tool outputs are normalized into compact, schema-aware snippets
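A toy version of the diffing step, assuming the page has already been reduced to an actionable skeleton keyed by stable node IDs (real DOM simplification is more involved):

```python
def diff_skeleton(prev: dict[str, str], curr: dict[str, str]) -> list[str]:
    """Emit only what changed between two actionable skeletons."""
    lines = []
    for node_id, desc in curr.items():
        if node_id not in prev:
            lines.append(f"+ {node_id} {desc}")   # newly actionable element
        elif prev[node_id] != desc:
            lines.append(f"~ {node_id} {desc}")   # element whose state changed
    for node_id in prev:
        if node_id not in curr:
            lines.append(f"- {node_id}")          # element no longer present
    return lines

prev = {"node#12": "button 'Add to cart'", "node#30": "input 'qty' value=1"}
curr = {"node#12": "button 'Add to cart'", "node#30": "input 'qty' value=2", "node#44": "link 'Checkout'"}
print("\n".join(diff_skeleton(prev, curr)))
# Two short delta lines replace a full page dump.
```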
Contract-First Tooling
Every tool call follows strict input/output contracts
Instead of verbose natural language, the agent thinks in structures:
JSON envelopes for requests and responses
Short, validated enums over free text
Deterministic formatting rules that eliminate re-explaining the same thing
This reduces verbal overhead and slashes boilerplate tokens while increasing reliability.
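For illustration, here is a stripped-down request envelope with a closed action enum and strict validation; the action names and fields are hypothetical, not our actual contract:

```python
import json
from enum import Enum

class Action(str, Enum):
    CLICK = "click"
    TYPE = "type"
    EXTRACT = "extract"

def parse_action_request(raw: str) -> dict:
    """Validate the model's JSON envelope instead of accepting free-form prose."""
    data = json.loads(raw)
    allowed = {"action", "target_id", "value"}
    if not set(data) <= allowed:
        raise ValueError(f"unexpected fields: {set(data) - allowed}")
    data["action"] = Action(data["action"])   # raises on out-of-enum values
    if not isinstance(data.get("target_id"), str):
        raise ValueError("target_id must be a stable node id string")
    return data

req = parse_action_request('{"action": "click", "target_id": "node#44"}')
```

Validated envelopes like this are what let us stop re-explaining, in prose, how each tool should be called.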
Results: The Numbers
We evaluated the new architecture across 500+ real-world sessions covering research, frontend implementation, backend implementation, testing flows, and deployment.
Avg. tokens per page: 45,200 (before) → 8,100 (after) = 82% reduction
Avg. tokens per action: 38,500 (before) → 10,800 (after) = 72% reduction
Complex task (10+ steps): 420,000 (before) → 95,000 (after) = 77% reduction
Simple task (2–3 steps): 85,000 (before) → 22,000 (after) = 74% reduction
Peak token usage: 68,000 (before) → 15,200 (after) = 78% reduction
Additional outcomes:
Latency: 30–45% faster end-to-end
Costs: 70–86% lower token consumption, 73% average cost reduction per task
Reliability: 94.2% success rate maintained; context overflows down from 8.3% to 0.4%
Scalability: Longer sessions without degradation; more concurrent jobs per GPU budget
Real-World Impact
The savings changed what we could build and how quickly we could iterate:
Richer reasoning where it matters: Freed budget lets us allocate tokens to high-value moments such as ambiguous instructions, long-form synthesis, and tricky form workflows.
Smoother user experience: Faster response cycles and fewer “lost context” errors make agents feel sharper and more trustworthy.
Cheaper experimentation: We can A/B test prompts, new skills, and search strategies at a fraction of former costs—accelerating product learning loops.
Greener compute: Cutting tokens means less inference work, which reduces energy use and aligns with responsible AI.
Key Learnings
What worked:
Load skills just-in-time: DSI outperforms monolithic prompts in both cost and accuracy by focusing the model on the immediate subtask.
Prefer structure over prose: Contract-first tool IO and compact state kept prompts short and removed ambiguity.
Deltas beat dumps: Differential DOM/state updates eliminated the largest source of repeated tokens.
Small, stable core = fewer regressions: A lean base prompt reduced prompt drift and made changes safer.
What didn’t:
Over-summarizing text near controls: Truncating labels/instructions near interactive elements hurt precision. We now preserve full text around actionable nodes.
One-size-fits-all pruning: Research flows and transactional flows need different filters; our planner learns and switches profiles.
Flattening all structure: Some hierarchy is essential for spatial reasoning; we preserve minimal nesting with consistent IDs.
What’s Next
Learning skill policies: Train a scheduler to predict which skills to inject, based on task type and recent outcomes.
Semantic compression: Use embeddings to compact redundant lists (e.g., product cards) and cache them across steps.
Predictive prefetch: Anticipate likely next pages and prepare diffs in the background to cut perceived latency.
Multi-modal parity: Extend the same efficiency concepts to screenshots and video, coordinating a unified token budget.
Cross-session wisdom: Share anonymized patterns about which elements matter for which sites to bootstrap new tasks.
Conclusion
We set out to make our agents super frugal AND super smart—and proved the two are not in conflict. By separating a lean, invariant core from dynamic, just-in-time skills, representing state as compact deltas, and enforcing contract-first tool calls, we cut token use by 70–86% while sustaining (and often improving) success rates, speed, and user trust.
For teams building agents, the takeaway is practical: treat tokens like a budget, not a backdrop. Align information density with step intent, inject skills only when needed, and prefer typed structure over sprawling prose. The result is not just lower costs—it’s a sharper, more reliable agent your users will feel.
Try JustCopy.ai
Start your free trial at https://justcopy.ai and experience faster, more capable, and more affordable agents.