Most prompt compression claims are too easy to make.
Take a long conversation. Summarize it. Count fewer tokens. Declare victory.
That is not good enough for 3ngram.
The product does need to reduce prompt bloat. If every AI session starts by pasting a week of conversation history, the workflow gets slower, more expensive, and harder to inspect. But shorter is not the product. A tiny prompt that forgets the open commitment, loses the decision rationale, or carries forward a stale blocker is not better. It is just cheaper to be wrong.
Long-context research points in the same direction. Lost in the Middle found that language models can struggle to use relevant information in long inputs, especially when that information sits in the middle of the context. The 2025 NoLiMa benchmark pushed this further: when questions and the relevant evidence do not share obvious literal matches, performance drops as context grows even for models with large claimed context windows.
Prompt-compression research is promising too. LongLLMLingua shows that compression can reduce cost and latency, and sometimes improve performance, by making key information easier for the model to use. That is exactly why the measurement cannot stop at token count. We need to know which information survived.
Demis Hassabis made the same point in a 2026 Y Combinator conversation: just putting everything into a huge context window is unsatisfying, because even if memory can be very large, there is still a real cost to finding the thing that matters for the decision in front of you. Later in the same discussion, he argued against putting every capability into one giant model and pointed instead toward general systems using specialized tools.
That is the frame for 3ngram: not more context, but better context selection.
The real compression question is:
How much context can we remove while preserving the work state needed for the next action?
That is the metric worth building around.
Why Compression Matters
AI work creates its own drag.
The first session is easy. You explain the project, ask for help, make a decision, and move on.
The tenth session is different. You have prior decisions, half-finished work, source-system state, preferences, deadlines, and unresolved blockers. You start carrying more of that state into every prompt because you do not trust the next model session to know it.
That creates two bad habits.
First, people over-paste. They add transcripts, summaries, files, issue threads, and reminders because one missing detail can derail the answer.
Second, people under-explain. They get tired of re-pasting everything and start from a blank session, hoping the model can infer enough.
Both are expensive. One spends too many tokens. The other spends too much human attention.
3ngram should offer a third path: distill the work state into typed context that is small enough to use, specific enough to trust, and portable enough that the user can direct the next step instead of ferrying context between sessions.
The Claim We Should Avoid For Now
There is an obvious marketing claim here:
“3ngram reduces your prompt context by X percent.”
That may become true. It may even become a useful number. But it is not the first claim we should ship, because compression by itself does not prove value.
A system can reduce context by deleting everything.
The measurement has to pair compression with retention.
For every compressed briefing, we need to know:
- how many tokens were removed
- which work-state objects were preserved
- which relevant objects were missed
- which stale or irrelevant objects were included
- whether the next output improved, degraded, or stayed the same
- whether the user corrected the context
Only then does a compression ratio mean anything.
The Metric: Useful Compression
The metric I want is useful compression.
Useful compression is not just the raw ratio compressed_tokens / raw_tokens.
It is the token reduction achieved while preserving the work-state objects required for a task.
A first version can be simple:
- Start with a real work episode: raw transcript, source-system notes, captured memories, and the next task.
- Label the context objects that matter for the next task.
- Generate a 3ngram briefing.
- Measure how many raw tokens were replaced by the briefing.
- Score whether the briefing retained the required objects and excluded stale ones.
That gives three numbers:
- compression ratio
- required-context recall
- stale-context inclusion rate
The product should not optimize any one of them alone.
High compression with low recall is bad.
High recall with low compression may still be useful, but it does not solve prompt bloat.
Low stale-context inclusion is non-negotiable. If the system carries the wrong state into a new session, trust breaks quickly.
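To make the three numbers concrete, here is a minimal sketch of how they could be computed for one labeled episode. The names (`BriefingScore`, `score_briefing`, the object ids) are illustrative assumptions, not 3ngram's actual code, and "compression" is expressed here as the fraction of raw tokens removed, which is one reasonable reading of the ratio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BriefingScore:
    token_reduction: float       # fraction of raw tokens removed by the briefing
    required_recall: float       # required objects retained / required objects
    stale_inclusion_rate: float  # stale objects included / all objects included


def score_briefing(raw_tokens, briefing_tokens, required_ids, stale_ids, included_ids):
    """Score one briefing against hand-labeled required and stale object ids."""
    if raw_tokens <= 0:
        raise ValueError("raw_tokens must be positive")
    required, stale, included = set(required_ids), set(stale_ids), set(included_ids)
    token_reduction = 1.0 - briefing_tokens / raw_tokens
    # With nothing required, recall is trivially perfect.
    recall = len(included & required) / len(required) if required else 1.0
    # With nothing included, nothing stale was carried forward.
    stale_rate = len(included & stale) / len(included) if included else 0.0
    return BriefingScore(token_reduction, recall, stale_rate)


# Hypothetical episode: 12,000 raw tokens distilled into a 900-token briefing.
score = score_briefing(
    raw_tokens=12_000,
    briefing_tokens=900,
    required_ids={"commitment-1", "decision-1", "blocker-1"},
    stale_ids={"blocker-0"},
    included_ids={"commitment-1", "decision-1", "blocker-0", "fact-1"},
)
print(score)
```

In this made-up episode the briefing removes 92.5 percent of the tokens but misses one required blocker and carries one stale one. That is exactly the case a token count alone would hide.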
What Counts As Work State
This is where a normal summarizer is not enough.
3ngram does not only need to preserve text. It needs to preserve typed work-state objects.
A commitment has a state. It can be open, waiting, resolved, overdue, or stale.
A blocker has to remain visible until cleared.
A decision needs the outcome and the rationale.
A fact can stay quiet until it is relevant.
A preference should shape future work without becoming a task.
A pattern can preserve a repeatable way of working.
A note can keep useful context that has not earned a stronger type yet.
A source-system object can change outside the chat.
That means the compression target is not “make a shorter summary.” The target is “preserve the state that changes what happens next.”
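One way to picture typed work state is as a small data model where each type carries the fields that decide whether it should travel forward. The sketch below is an assumption for illustration: the class names, fields, and the `is_live` rule are hypothetical, and relevance filtering for facts, notes, and the other quiet types is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    COMMITMENT = "commitment"
    BLOCKER = "blocker"
    DECISION = "decision"
    FACT = "fact"
    PREFERENCE = "preference"
    PATTERN = "pattern"
    NOTE = "note"
    SOURCE_OBJECT = "source_object"


class CommitmentState(Enum):
    OPEN = "open"
    WAITING = "waiting"
    RESOLVED = "resolved"
    OVERDUE = "overdue"
    STALE = "stale"


@dataclass(frozen=True)
class WorkStateObject:
    kind: Kind
    text: str
    commitment_state: Optional[CommitmentState] = None  # commitments only
    cleared: bool = False                                # blockers only
    rationale: Optional[str] = None                      # decisions only


# Commitment states that still demand attention in the next session.
LIVE_COMMITMENT_STATES = {
    CommitmentState.OPEN,
    CommitmentState.WAITING,
    CommitmentState.OVERDUE,
}


def is_live(obj: WorkStateObject) -> bool:
    """Hypothetical rule: is this object still eligible for the next briefing?"""
    if obj.kind is Kind.COMMITMENT:
        return obj.commitment_state in LIVE_COMMITMENT_STATES
    if obj.kind is Kind.BLOCKER:
        return not obj.cleared  # blockers stay visible until cleared
    return True  # other types stay eligible; relevance is decided elsewhere
```

A plain text summary has no equivalent of `commitment_state` or `cleared`, which is why it cannot tell an overdue promise from a resolved one.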
The compressed context should answer questions like:
- What did the user promise?
- What has changed since the last session?
- What is still blocked?
- Which decision should not be reopened?
- Which source should the next AI inspect before acting?
- What should stay out because it is stale?
If the compressed prompt cannot answer those questions, it is not useful compression.
What Responsible Measurement Requires
To measure this responsibly, 3ngram does not need to turn every interaction into a surveillance event. It needs a bounded eval trail around distillation and retrieval: enough to connect context selection to outcomes, without stuffing raw private content into analytics.
At capture time, the useful signals are basic:
- source category
- raw token estimate
- extracted memory type
- whether the user edited, deleted, or corrected the capture
At briefing time, the useful signals are:
- which classes of work-state objects were included
- which stale or irrelevant objects were excluded
- final token estimate
- target surface: ChatGPT, Claude, Codex, Cursor, or source-system view
- task intent, when the user provides one
After use, the useful signals are:
- whether the briefing was used
- whether the user corrected or marked context as stale
- whether a commitment, blocker, decision, or other work-state object changed state
- whether later work referenced the same object without manual re-explanation
This gives the product enough to answer the practical question:
Did the compressed context help the next piece of work happen with less re-explanation, while leaving the user in control of what the system remembers?
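As a rough sketch of what a bounded eval trail could look like at briefing time, the record below keeps only types, counts, and estimates, and drops the object text before anything reaches analytics. The event shape and the `build_briefing_event` helper are assumptions for illustration, not an existing 3ngram schema.

```python
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class BriefingEvent:
    included_by_type: dict          # e.g. {"commitment": 2, "blocker": 1}
    excluded_stale_count: int
    final_token_estimate: int
    target_surface: str             # e.g. "ChatGPT", "Claude", "Codex", "Cursor"
    task_intent: Optional[str] = None  # short label, only if the user gave one


def build_briefing_event(included, excluded_stale, final_token_estimate,
                         target_surface, task_intent=None):
    """Build an analytics record from (object_type, text) pairs.

    Only the object types and counts are kept; the text never enters the event.
    """
    counts = Counter(obj_type for obj_type, _text in included)
    return BriefingEvent(
        included_by_type=dict(counts),
        excluded_stale_count=len(excluded_stale),
        final_token_estimate=final_token_estimate,
        target_surface=target_surface,
        task_intent=task_intent,
    )


# Hypothetical briefing contents; the text stays on the user's side.
event = build_briefing_event(
    included=[
        ("commitment", "Send the draft to Ana by Friday"),
        ("blocker", "Waiting on legal review"),
        ("commitment", "Review the migration PR"),
    ],
    excluded_stale=[("blocker", "Old staging outage")],
    final_token_estimate=850,
    target_surface="Claude",
)
print(asdict(event))
```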
The Offline Eval
The live product data will be noisy at first, so the first version should include an offline eval.
Take a set of real but sanitized work episodes. For each one, build a task that requires continuity. Then compare three context packs:
- Raw pack: the full transcript or source material.
- Summary pack: a generic model-generated summary.
- Work-state pack: 3ngram’s typed commitments, decisions, blockers, facts, preferences, patterns, notes, and source links.
Run the same model and the same task against each pack.
Score the answer on:
- task completion
- required-context recall
- incorrect-context carryover
- next-action clarity
- token count
- latency
- user correction required
The useful result would be a work-state pack that approaches the raw pack on task quality while using far fewer tokens and producing fewer stale assumptions than a generic summary.
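That bar can be written down as a simple check over the three pack results. This is a sketch under stated assumptions: the `PackResult` fields, the 0.05 quality tolerance, and the "at most half the raw tokens" threshold are placeholders I made up to show the shape of the comparison, and the stale check is relaxed to "no more than the summary" so a clean summary does not make the bar unreachable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackResult:
    task_quality: float     # rubric score in [0, 1] for the model's answer
    token_count: int        # tokens in the context pack
    stale_assumptions: int  # stale or wrong assumptions found in the answer


def meets_bar(raw, summary, work_state,
              quality_tolerance=0.05, max_token_fraction=0.5):
    """Check the work-state pack against the raw and summary packs.

    Thresholds are hypothetical defaults, not measured values.
    """
    close_to_raw = work_state.task_quality >= raw.task_quality - quality_tolerance
    far_fewer_tokens = work_state.token_count <= raw.token_count * max_token_fraction
    no_more_stale = work_state.stale_assumptions <= summary.stale_assumptions
    return close_to_raw and far_fewer_tokens and no_more_stale


# Made-up scores for a single episode, for illustration only.
raw = PackResult(task_quality=0.90, token_count=12_000, stale_assumptions=0)
summary = PackResult(task_quality=0.70, token_count=1_500, stale_assumptions=3)
work_state = PackResult(task_quality=0.88, token_count=900, stale_assumptions=1)

print(meets_bar(raw, summary, work_state))
```

Running the same check with the summary pack in the work-state slot fails on quality, which is the distinction the eval is meant to surface.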
That is the bar.
Not “we compressed the prompt.”
“We preserved the work state with less context.”
The Metrics That Build The Thesis
Prompt compression is one metric inside a larger thesis.
The broader question is whether 3ngram becomes the continuity layer for AI-heavy work. That needs more than token counts.
The metric set should include:
- Context carry rate: relevant work-state objects brought into the next session without manual re-explanation.
- Briefing reuse rate: how often activated users open or insert a briefing before real work.
- Correction rate: how often users edit, reject, or correct generated context.
- Stale carryover rate: how often the system includes old context that should have been suppressed.
- Open-loop resolution rate: commitments and blockers surfaced by 3ngram that later get resolved or updated.
- Cross-tool handoff success: work that starts in one AI tool and continues in another without re-explaining project state.
- Time to useful start: time between opening a new AI session and reaching a context-aware answer.
Those are not all launch metrics. Some require labels. Some require product events that do not exist yet. Some are easier to measure in a structured beta than in broad usage.
But they are the right direction because they connect the product to behavior.
If people get shorter prompts but still manually restate every project detail, the thesis is weak.
If people use briefings, resolve surfaced commitments, and move work across tools without losing state, the thesis gets stronger.
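Two of the rates in the list above could be computed from per-briefing records roughly like this. The record keys (`objects_included`, `corrected`, `marked_stale`) are hypothetical, and correction rate is read here as the share of briefings with at least one user correction, which is only one possible definition.

```python
def safe_rate(numerator, denominator):
    """Return numerator / denominator, or None when there is nothing to measure."""
    return numerator / denominator if denominator else None


def briefing_rates(records):
    """Compute correction rate and stale carryover rate from briefing records.

    Each record is a dict with hypothetical keys:
      objects_included: int, work-state objects in the briefing
      corrected: bool, whether the user edited, rejected, or corrected anything
      marked_stale: int, included objects the user later marked as stale
    """
    records = list(records)
    corrected_briefings = sum(1 for r in records if r["corrected"])
    included_objects = sum(r["objects_included"] for r in records)
    stale_objects = sum(r["marked_stale"] for r in records)
    return {
        "correction_rate": safe_rate(corrected_briefings, len(records)),
        "stale_carryover_rate": safe_rate(stale_objects, included_objects),
    }


# Made-up beta records, for illustration only.
rates = briefing_rates([
    {"objects_included": 10, "corrected": True, "marked_stale": 1},
    {"objects_included": 5, "corrected": False, "marked_stale": 0},
    {"objects_included": 5, "corrected": False, "marked_stale": 1},
])
print(rates)
```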
What Good Looks Like
A good early measurement system would let us say something modest and defensible:
In a small labeled eval, 3ngram briefings preserved the required work-state objects while using materially fewer tokens than the raw context. In live beta usage, briefing correction rates fell over time and surfaced open loops were more likely to be resolved or updated.
That is still not a universal benchmark. It is not a lab-grade claim about every model or every workflow.
But it would be real.
It would connect compression to the product’s actual job: keeping work from leaking between sessions.
What Comes Next
The first implementation can be boring:
- Add token estimates to capture, briefing, and retrieval events.
- Record which work-state objects were included in each briefing.
- Let users correct or mark context as stale from the briefing surface.
- Build a labeled eval set from real sanitized work episodes.
- Track compression ratio only beside recall and stale-carryover scores.
After that, the product can start learning which context earns its place in the next prompt.
That is the real value.
The goal is not to throw away a certain percentage of every prompt.
The goal is to stop paying for, waiting on, and re-explaining context that does not change the next action, while preserving the few things that do.