The thing I want to be able to prove is simple:

The same model should do better work when it starts with the right work state.

Not a bigger prompt. Not a longer transcript. Not a folder full of notes the user has to paste by hand.

The right work state.

That means commitments, decisions, blockers, facts, preferences, patterns, notes, source-system context, and the few pieces of project history that change what should happen next.

Long-context research makes the distinction sharper. Lost in the Middle showed that language models can struggle to use relevant information inside long inputs. The 2025 NoLiMa benchmark found that this gets harder when the question and evidence do not share obvious literal matches. The point is not to give the model more text. The point is to select the right state for the work in front of it.

The current version of the 3ngram thesis depends on this being true. If a fresh model with a blank session performs just as well as a model with portable context, the product is mostly a convenience layer. Useful, maybe, but not load-bearing. If the context-loaded model is consistently better at follow-through, then 3ngram is not only storing memory. It is changing the starting conditions for work.

The honest state today is that we do not have enough foundational data to treat that as proven.

We have a product shape. We have a strong hypothesis. We have enough daily usage to know where the pain is. But the public claim should not be “we ran the eval and the gap is X” until the eval exists, is repeatable, and survives a few boring controls.

So the first useful post is not a victory lap. It is the measurement plan.

What We Need To Measure

The baseline comparison is straightforward:

Give the same task to the same model twice.

In one run, the model starts from a blank session with only the direct user prompt.

In the other run, the model starts with a compact briefing generated from 3ngram: relevant commitments, decisions, blockers, facts, preferences, patterns, notes, source links, and any project context that matters.

Then compare the outputs.

That sounds easy, but the details matter. A bad eval will mostly prove that a longer prompt has more information in it. That is not the thesis. The thesis is that a typed work-state layer can preserve the information that changes the next action.

The measurement should focus on follow-through work:

drafting a reply after a previous meeting
preparing the next AI coding session after an interrupted implementation
deciding which open commitments need attention today
explaining why a decision was made
continuing a client or investor thread without re-reading every prior note
resolving whether a blocker is still real or already cleared

These are not trivia questions. They are continuity tasks.

The First Eval Shape

A useful first eval can be small.

Take twenty real work episodes that already happened inside 3ngram. For each episode, create a task that requires prior context to answer well. The task should be answerable by a human who has the work state, and difficult for a fresh model that only sees the last prompt.

Each episode gets three inputs:

the direct user request
the full raw source context, held out as the reference set
the compact 3ngram briefing generated before the task starts

Then run three conditions:

Fresh model: direct user request only.
Full-context model: direct request plus the raw reference set.
Portable-context model: direct request plus the 3ngram briefing.
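As a rough sketch of how the three conditions could be assembled from one episode, assuming a hypothetical Episode record and a simple "context, blank line, request" prompt layout (neither is the actual 3ngram runner):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """Hypothetical container for one work episode (names are illustrative)."""
    request: str      # the direct user request
    raw_context: str  # the full raw source context, held out as the reference set
    briefing: str     # the compact briefing generated before the task starts


def build_prompts(ep: Episode) -> dict[str, str]:
    """Return one prompt per condition, all for the same task."""
    return {
        "fresh": ep.request,
        "full_context": f"{ep.raw_context}\n\n{ep.request}",
        "portable_context": f"{ep.briefing}\n\n{ep.request}",
    }


episode = Episode(
    request="Draft the follow-up email.",
    raw_context="(full meeting transcript and thread history)",
    briefing="(compact briefing)",
)
prompts = build_prompts(episode)
```

Keeping the request text identical across all three prompts is the point: the only thing that varies between runs is the context in front of it.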

The full-context model is not the product experience. It is a rough upper bound. It shows what happens when the model has access to everything, even though that is far more context than the task needs.

The fresh model is the real-world baseline. It represents the current habit: start a new chat and re-explain what you remember.

The portable-context model is what 3ngram is trying to make routine.

The interesting result is not whether the portable-context model beats the fresh model once. It should. It has more relevant information. The interesting result is whether it gets close to the full-context model while using much less context and producing fewer follow-up questions, fewer stale assumptions, and fewer missed commitments.
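One way to put a number on "close to the full-context model while using much less context" is sketched below. The function names, the score scale, and the token counts are assumptions for illustration, not measured results:

```python
def gap_closed(fresh: float, portable: float, full: float) -> float:
    """Fraction of the fresh-to-full score gap that the briefing recovers."""
    gap = full - fresh
    if gap <= 0:
        # If full context does not beat the fresh run, the ratio is meaningless.
        raise ValueError("full-context score must exceed the fresh score")
    return (portable - fresh) / gap


def context_ratio(briefing_tokens: int, full_tokens: int) -> float:
    """Briefing size as a fraction of the full raw context size."""
    if full_tokens <= 0:
        raise ValueError("full_tokens must be positive")
    return briefing_tokens / full_tokens


# Illustrative numbers only.
recovered = gap_closed(fresh=2.0, portable=5.0, full=6.0)
size = context_ratio(briefing_tokens=1000, full_tokens=10000)
```

A high gap_closed together with a low context_ratio is the shape of result the thesis predicts. A high gap_closed with a ratio near 1.0 would only show that the briefing is long.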

The Scoring Rubric

The scoring needs to be concrete enough that we can disagree with it.

For each task, score the output on five dimensions:

Task success: did the answer complete the job?
Context recall: did it use the relevant prior commitment, decision, blocker, fact, preference, pattern, note, or source link?
False carryover: did it import context that was stale, wrong, or unrelated?
Follow-through quality: did it identify the next action, owner, due date, or state change where applicable?
User correction needed: would the user need to clarify, repair, or restart the answer?

The most important failure mode is false carryover.

A memory product can look impressive by adding lots of old context. That is not enough. If it carries the wrong state into the next session, it becomes worse than useless because it makes the model sound confident about stale work.

So the score cannot only reward recall. It has to penalize incorrect continuity.
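A minimal sketch of a score that penalizes incorrect continuity, assuming a hypothetical 0 to 2 scale for each dimension, a count of false carryovers, and an arbitrary penalty weight (all of these are placeholders to argue about, not a settled rubric):

```python
from dataclasses import dataclass


@dataclass
class RubricScore:
    """One scored output. Scales are assumptions: 0 (bad) to 2 (good)."""
    task_success: int        # 0-2
    context_recall: int      # 0-2
    false_carryover: int     # count of stale, wrong, or unrelated items imported
    follow_through: int      # 0-2
    correction_needed: int   # 0 = none, 1 = clarify or repair, 2 = restart


def total(score: RubricScore, carryover_penalty: int = 2) -> int:
    """Reward recall and follow-through, subtract corrections and carryover."""
    earned = score.task_success + score.context_recall + score.follow_through
    lost = score.correction_needed + carryover_penalty * score.false_carryover
    return earned - lost


careful = RubricScore(task_success=2, context_recall=1, false_carryover=0,
                      follow_through=2, correction_needed=0)
confident_but_stale = RubricScore(task_success=2, context_recall=2, false_carryover=2,
                                  follow_through=2, correction_needed=0)
```

With these placeholder weights, the output that recalls more but imports two stale items scores below the output that recalls less and imports nothing.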

What Responsible Measurement Requires

The offline eval is only the start. The product needs live measurement too, but it should not read like a blank check for analytics. The useful measurement is bounded: enough to connect context selection to later work, without treating private content as product telemetry.

The first measurement layer should answer a few practical questions:

Was a briefing generated for the task?
Which classes of work-state objects were included?
Was the briefing opened or inserted into an AI session?
Did the user correct, reject, or mark context as stale?
Did a surfaced commitment, blocker, decision, or other work-state object later change state?

The exact event names are less important than the chain.

We need to know whether a context object was captured, selected, used, corrected, and later acted on. Without that chain, we can count activity but not value.
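The chain can be sketched as a small funnel over events that carry only an object ID and a stage, never the content itself. The stage names and the tuple shape here are assumptions for illustration, not the product's event schema:

```python
from collections import defaultdict

# Hypothetical stage names mirroring the chain described above.
STAGES = ("captured", "selected", "used", "corrected", "acted_on")


def stage_counts(events):
    """Count distinct context objects seen at each stage.

    events: iterable of (object_id, stage) pairs. No private content is needed.
    """
    seen = defaultdict(set)
    for object_id, stage in events:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        seen[stage].add(object_id)
    return {stage: len(seen[stage]) for stage in STAGES}


events = [
    ("commitment-1", "captured"),
    ("commitment-1", "selected"),
    ("commitment-1", "used"),
    ("commitment-1", "acted_on"),
    ("blocker-2", "captured"),
    ("blocker-2", "selected"),
    ("blocker-2", "corrected"),
]
counts = stage_counts(events)
```

Where the counts drop off between stages is where the chain breaks, which is more useful than a raw total of events.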

The first real product metric should be context carry rate:

Of the memories that should matter in a new session, how many did 3ngram carry forward without the user manually re-explaining them?

That metric requires judgement. The system cannot know “should matter” perfectly by itself. For the first version, a human-labeled eval set is fine. Later, user behavior can become the proxy: selected memories, accepted briefings, resolved commitments, and corrections.
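For the human-labeled version, context carry rate reduces to a set overlap. A minimal sketch, assuming each memory has a stable ID and that a labeler has already listed which ones should matter (the IDs below are made up):

```python
def context_carry_rate(should_matter: set[str], carried: set[str]) -> float:
    """Share of the memories that should matter which were actually carried forward."""
    if not should_matter:
        raise ValueError("need at least one labeled memory that should matter")
    return len(should_matter & carried) / len(should_matter)


# Hypothetical labels for one new session.
labeled = {"decision-7", "blocker-2", "commitment-1", "preference-4"}
carried_forward = {"decision-7", "blocker-2", "note-9"}
rate = context_carry_rate(labeled, carried_forward)
```

Note that "note-9" was carried but not labeled as relevant. It does not lower this metric, which is why carry rate has to be read next to a carryover or correction measure rather than alone.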

What Would Be A Good Signal

The early signal I care about is not a single benchmark number.

It is a pattern:

portable-context runs need fewer user corrections than fresh runs
portable-context runs identify more open loops than fresh runs
portable-context runs avoid stale carryover nearly as well as full-context runs
users open briefings before real work, not only during demos
commitments surfaced from briefings get resolved or updated
users stop ferrying the same project state across AI tools

If that pattern appears, the product thesis gets stronger.

It would mean 3ngram is not only making old information searchable. It is lowering the cost of starting the next useful session.

That matters because AI work is becoming fragmented. One task now passes through ChatGPT, Claude, Cursor, Codex, GitHub, Linear, Calendar, and a human conversation. The model is not the durable object. The work state is.

What Would Prove This Wrong

This eval can fail in several useful ways.

If the portable-context model does not beat the fresh model on follow-through tasks, then the captured context is too weak, too generic, or too hard to apply.

If it beats the fresh model but trails the full-context model by too much, then the briefing is over-compressed or is dropping the wrong things.

If it performs well only because the briefing is long, then the product is just moving prompt bloat into a nicer wrapper.

If it creates many false carryovers, then the retrieval layer is dangerous.

If users open briefings but still manually re-explain everything, then the product is not trusted.

Those are not edge cases. They are the core risks.

The point of the eval is not to prove that 3ngram is right. The point is to make the claim falsifiable enough that the product can improve.

The First Practical Spike

The first spike does not need a full analytics warehouse.

It needs four things:

A small labeled dataset of real work episodes.
A repeatable runner that executes fresh, full-context, and portable-context conditions against the same task.
A scoring sheet with the five dimensions above.
Product events that connect briefing generation to later user action.

Once that exists, we can start tracking the metrics that actually test the 3ngram thesis:

context carry rate
correction rate
stale carryover rate
briefing reuse rate
open-loop resolution rate
cross-tool handoff success
prompt tokens saved without recall loss

Those numbers matter more than a generic “memory improved output quality” claim.

The thesis is not that AI needs longer memory.

The thesis is that serious AI work needs portable state: the right context, in the right tool, before the next action starts.

That is what we have to measure.
