Adaptive Context Memory Allocator — the first system to jointly optimize prompt compression, KV-cache quantization, and token eviction under a single GPU memory budget.

“Long-context inference isn't a memory problem. It's an allocation problem.”
Every long-context LLM eventually slams into the same wall: the KV cache won't fit. The field responded with three separate escape hatches — compress the prompt (LLMLingua), quantize the cache (KIVI), or evict tokens (H2O) — each invented in isolation, each tuned against its own baseline. In practice you're forced to pick one lever, apply it uniformly, and eat the accuracy cost. Nobody was asking the obvious question: under a fixed memory budget, which lever should you pull for which token?
ACMA reframes the whole thing as a knapsack. Each token gets a value score from its attention entropy, then a single optimizer jointly decides its fate: compress it with LLMLingua-2, assign it a KV precision tier (FP16, INT8, INT4, or INT2), or evict it entirely — all packed against one GPU memory budget. Precision is allocated per token rather than uniformly across the cache, so high-value tokens keep their bits while filler degrades or disappears. Built on Llama-3.1-8B-Instruct with a CUDA-backed mixed-precision KV path.
The three levers aren't competitors — they're a single design space, and treating them separately leaves accuracy on the table at every budget. A token that's cheap to compress shouldn't also be paying for FP16 cache, and a token worth keeping at full precision shouldn't be a candidate for eviction. Once you score tokens by attention entropy and let one allocator trade compression against precision against eviction, the budget stops being a cliff and becomes a dial.
The architecture behind the system.
Casts long-context inference as a single constrained allocation: every token competes for the same GPU memory budget across compression, precision, and eviction — instead of three disconnected heuristics.
Each token earns a value from its attention entropy, identifying which context actually carries signal so the allocator knows what to protect and what to sacrifice.
Mixed-precision KV cache spanning FP16 / INT8 / INT4 / INT2, assigned per token rather than uniformly — high-value tokens keep their bits while filler is pushed to INT2.
Prompt compression is one lever in the joint objective, not a separate preprocessing step, so the optimizer can trade compression against precision token by token.
Low-value tokens are dropped only when keeping them costs more than they're worth under the budget — eviction becomes a decision, not a fixed cache-size cutoff.
A custom CUDA-backed KV-cache path on Llama-3.1-8B-Instruct executes the per-token precision tiers, validated on the Needle-in-a-Haystack retrieval benchmark.