Keyrxng

Context Engineering: Sculpting Attention Surfaces for Real LLM Work

8/20/2025 · 12 min

So what?
A practitioner's field guide to moving beyond prompt tinkering—architecting token budgets, retrieval granularity, temporal pruning, and adaptive memory so large language models stay on-mission at scale.

See the related case study: Command-Ask: Context-aware GitHub AI Assistant

Series connective tissue: builds on the GitHub thread bot story (“Bringing an AI coworker onto GitHub”), the temporal decay work (Ghost series), and the semantic matchmaking observations. Those were incidents. This is the discipline that unifies them.

1. Why context engineering (and why now)

Prompt engineering will get you surprisingly far: clever system messages, well-placed delimiters, a few curated exemplars. But once you push past tiny, self-contained tasks, the naive strategy of “just stuff more into the window” plateaus fast. At ~60%+ window occupancy, frontier models lose plot integrity: factual drift goes up, structural compliance goes down, redundancy skyrockets, and latency/cost spike. Context engineering is the shift from wordsmithing a prompt to designing the dynamic attention surface.

Definition (field-oriented): Context engineering is the end-to-end discipline of selecting, compressing, ordering, formatting, gating, and budgeting the information + tool affordances an LLM consumes so that reasoning remains bounded, purposeful, and temporally relevant.

2. Prompt engineering vs context engineering (concise contrast)

| Dimension | Prompt engineering | Context engineering |
| --- | --- | --- |
| Primary unit | A static (or lightly templated) string | A pipeline / graph assembling evolving slices |
| Focus | Wording, examples, tone | Selection, omission, compression, structure, temporal weighting |
| Failure mode | Bad phrasing → weak answer | Uncontrolled entropy → drift, cost blowups, hallucination |
| Key levers | Few-shot curation, role framing, formatting | Retrieval ranking, granularity, summarization, dedupe, decay, tool plans |
| Metrics | BLEU-esque ad hoc scores, perceived quality | Token efficiency, answer stability, latency, drift rate, acceptance |

3. Core principles (opinionated)

  1. Subtractive first — Decide what never enters the window before hunting new data.
  2. Budget early — Reserve explicit token quotas per segment (context, reasoning, output) before retrieval.
  3. Granularity matches task — Paragraph vs sentence embeddings vs code symbol slices: choose smallest units that preserve determinism for the target decision.
  4. Temporal skepticism — Older artifacts enter only if explicitly revalidated or uniquely relevant (Ghost series rationale).
  5. Semantic boundaries — Preserve textual separators; avoid spaghetti concatenation that melts referents.
  6. Reasoning headroom — Hard ceiling (e.g., ≤ 55–60% of window) for staged context; never starve the model’s latent planning space.
  7. Composable flows — Break monolith tasks into phased calls only if cross-phase coupling doesn’t force full re-ingestion.
  8. Observability-first — Log token projections, retrieval scores, truncation events, and hallucination deltas.
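
The “budget early” principle can be made concrete with a small ledger object. The sketch below is a hypothetical `TokenBudget` (the same shape the pseudo-implementation in section 17 assumes): fixed shares of the window are reserved up front, so retrieval can never eat the model's reasoning headroom.

```typescript
// Hypothetical TokenBudget: reserves fixed shares of the window up front
// (principle 2) so staged context can never consume reasoning/output space.
class TokenBudget {
  private used = 0;

  constructor(
    private maxTokens: number,
    private reserves: { reserveAnswer: number; reserveReasoning: number },
  ) {}

  // Tokens still available for staged context after the fixed reserves.
  remainingFor(_segment: "context"): number {
    const reserved =
      this.maxTokens * (this.reserves.reserveAnswer + this.reserves.reserveReasoning);
    return Math.max(0, Math.floor(this.maxTokens - reserved - this.used));
  }

  // Record consumption; returns false when the ceiling would be breached.
  spend(tokens: number): boolean {
    if (tokens > this.remainingFor("context")) return false;
    this.used += tokens;
    return true;
  }
}
```

The point is that the ceiling is enforced *before* retrieval, not discovered after assembly overflows.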

4. The 60% rule (why saturation degrades quality)

Empirical threshold (multiple systems, frontier and mid-tier models): once raw staged context climbs past ~60% of max tokens, the degradation pattern described above sets in: factual drift rises, structural compliance drops, redundancy climbs, and latency/cost spike.

Heuristic: If window_use > 0.6 before the question + required scaffolds, force a condensation pass (extractive first, abstractive second only if entropy remains high).
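
A minimal sketch of that heuristic, assuming you inject your own tokenizer and an extractive condenser (both hypothetical stand-ins here; `countTokens` could wrap any real tokenizer):

```typescript
// Sketch of the 60% rule: if staged blocks would exceed the occupancy
// ceiling, run the extractive condenser over them before assembly.
const WINDOW_LIMIT = 0.6;

function stageContext(
  blocks: string[],
  maxTokens: number,
  countTokens: (s: string) => number,       // stand-in for a real tokenizer
  extractive: (s: string) => string,        // stand-in for span extraction
): string[] {
  const occupancy =
    blocks.reduce((sum, b) => sum + countTokens(b), 0) / maxTokens;
  if (occupancy <= WINDOW_LIMIT) return blocks; // under budget: pass through
  // Condense extractively first; an abstractive pass would follow only
  // if entropy remains high.
  return blocks.map(extractive);
}
```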

5. Granularity strategy & hierarchical condensation (use-case first)

There is no universally “correct” chunk size. Granularity is downstream of task shape, retrieval pattern, and longevity of the knowledge surface. Two contrasting scenarios:

The SWE profile example (duplication pressure)

Suppose 100 engineer performance / activity reports each carry 30+ semi-structured data points (ratings, qualitative notes). That’s hundreds, maybe thousands, of near-duplicate sentences like “Communication: 5/10” or “Demonstrated leadership by unblocking X.” The law of large numbers kicks in:

  1. Semantic search bias: cosine similarity over-weights the most repeated phrasing (bland, averaged statements) and under-represents unique, high-signal outliers.
  2. Model pattern anchoring: LLM latches onto frequently repeated generic sentences, diluting nuanced signal (rare spikes in leadership, novel incident response behavior).
  3. Retrieval tradeoff: returning only the “top 30” sentences loses breadth (>90% of the longitudinal range); returning everything floods the window with redundancy, at which point you’d have been better off skipping the embedding step and one-shotting the task.

Hierarchical condensation pipeline

Prefer multi-level rollups over raw fine granularity:

Issue / PR / Thread → Micro Report (artifact summary + key metrics)
Weekly / Sprint Report → Aggregated metrics + exceptional events
Monthly Report → Trend deltas + anomaly notes
Quarterly / Yearly Profile → Stable traits, trajectory vectors, distilled narrative
Final Persona Object → (core_stats, domain_vectors, temporal_trends, anomaly_notes)

Each level is token bounded and references lower layers by stable IDs, enabling drill-down on demand without automatically re-injecting raw historical chatter.
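
The rollup layers above can be sketched as types whose only links downward are stable IDs. All field names here are illustrative, not a prescribed schema:

```typescript
// Hypothetical shapes for the rollup hierarchy. Each level carries a bounded
// summary plus stable IDs pointing one layer down, so drill-down is explicit
// rather than automatic re-injection of raw history.
interface MicroReport   { id: string; summary: string; keyMetrics: Record<string, number> }
interface WeeklyReport  { id: string; summary: string; microIds: string[] }   // → MicroReport.id
interface MonthlyReport { id: string; trendDeltas: string; weeklyIds: string[] } // → WeeklyReport.id

// Resolve one drill-down step on demand instead of shipping every layer.
function drillDown(
  report: MonthlyReport,
  weeks: Map<string, WeeklyReport>,
): WeeklyReport[] {
  return report.weeklyIds.flatMap((id) => {
    const w = weeks.get(id);
    return w ? [w] : []; // missing IDs are skipped, never fabricated
  });
}
```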

Granularity heuristic checklist

Principle: Pick the smallest number of largest trustworthy summaries that still preserve reversible linkage to underlying detail.

6. Retrieval + pruning loop (practical pattern)

  1. Query expand (optional): canonicalize question (strip noise, normalize refs).
  2. Coarse retrieve: paragraph vectors (k=50).
  3. Score augment: semantic_score * time_decay * doc_authority.
  4. Dedupe / cluster: drop near-duplicates (cosine > 0.94).
  5. Fine retrieve: expansion for top m (m≈10).
  6. Salience compression: extract key spans (regex heuristics + embedding centrality).
  7. Budget trim: enforce segment token ceilings.
  8. Assemble: ordered context blocks + explicit separators + metadata headers.
  9. Guardrail prompt: brevity rules, structure contract, DO / DO NOT lists.
  10. Answer.

Key: Omission logging. Store what you didn’t include and why (rank below threshold, duplicate cluster id). Critical for post-hoc error analysis.
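
Steps 4 (dedupe) and the omission log can be sketched together. The `Hit` shape and `cosine` helper are assumptions for illustration; in practice the embeddings come from your vector store:

```typescript
// Near-duplicate pruning with omission logging: keep the highest-scored
// member of each duplicate cluster and record *why* everything else was
// dropped, so post-hoc error analysis can inspect exclusions.
interface Hit { id: string; text: string; embedding: number[]; score: number }
interface Omission { id: string; reason: string }

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const na = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
  const nb = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
  return dot / (na * nb);
}

function dedupeWithLog(
  hits: Hit[],
  threshold = 0.94,
): { kept: Hit[]; omitted: Omission[] } {
  const sorted = [...hits].sort((a, b) => b.score - a.score);
  const kept: Hit[] = [];
  const omitted: Omission[] = [];
  for (const h of sorted) {
    const dup = kept.find((k) => cosine(k.embedding, h.embedding) > threshold);
    if (dup) omitted.push({ id: h.id, reason: `duplicate of ${dup.id}` });
    else kept.push(h);
  }
  return { kept, omitted };
}
```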

7. Formatting as an interface

Format is a compression codec for model attention.

| Format | Strength | Risk | Use |
| --- | --- | --- | --- |
| Markdown | Readability, structural cues | Overhead tokens in headers | Mixed code + prose |
| JSON (strict) | Parseable output enforcement | Higher token cost | Tool invocation planning |
| XML | Hierarchical clarity | Verbose tags | Nested attribute reasoning |
| Lightweight tagged blocks | Low overhead | Custom parser needed | Internal pipelines |

Pattern: Provide context blocks in Markdown (semantic clarity) → require output schema in JSON with strict field ordering. Separation reduces drift.

8. Boundary preservation

Failure mode: concatenated multi-issue specs fused into one ambiguous blob → model blends constraints. Mitigation:

```
<<< ISSUE_A (id:123 | updated:2025-08-12) >>>
...spec text...
<<< /ISSUE_A >>>

<<< ISSUE_B (id:456 | updated:2025-08-18) >>>
...spec text...
<<< /ISSUE_B >>>
```

Explicit sentinels act as alignment landmarks; they also simplify downstream partial redaction.
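
A `boundaryWrap` along the lines the pseudo-implementation in section 17 assumes might look like this (the `ContextBlock` fields are illustrative):

```typescript
// Emit explicit sentinels around each context block so referents stay
// separable and partial redaction can target a single block.
interface ContextBlock { label: string; id: number; updated: string; body: string }

function boundaryWrap(blocks: ContextBlock[]): string {
  return blocks
    .map(
      (b) =>
        `<<< ${b.label} (id:${b.id} | updated:${b.updated}) >>>\n` +
        `${b.body}\n` +
        `<<< /${b.label} >>>`,
    )
    .join("\n\n");
}
```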

9. Temporal resistance tie-in (Ghost series, scoped responsibly)

Temporal resistance (historical embeddings outliving operational relevance) matters only when recency or availability change correctness or utility.

So apply temporal decay & availability filters when the problem definition demands present-state alignment. Otherwise, decay can become silent accuracy erosion.
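
One way to implement the time-decay factor from step 3 of the retrieval loop is exponential decay with a configurable half-life. The 30-day default is an assumption to tune per corpus, not a recommendation:

```typescript
// Exponential time decay: weight halves every `halfLifeDays`.
function timeDecay(ageDays: number, halfLifeDays = 30): number {
  return Math.pow(0.5, ageDays / halfLifeDays);
}

// Combined score: semantic relevance damped by staleness.
function rescore(semanticScore: number, ageDays: number): number {
  return semanticScore * timeDecay(ageDays);
}
```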

10. Embeddings aren’t silver bullets

Common anti-pattern: “Vector everything → dump top 200 hits into the window,” flooding the context with redundant, stale, and stylistically homogeneous text.

Countermeasures: cluster → centroid; recency gating; authorship diversification; acceptance / click feedback loops to recalibrate weights.
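
The cluster → centroid countermeasure, sketched: replace a cluster of near-duplicate vectors with their mean so one representative competes in retrieval instead of the whole redundant mass. (How clusters are formed is left to your similarity threshold; this only shows the collapse step.)

```typescript
// Collapse a cluster of near-duplicate embeddings into its mean vector.
function centroid(cluster: number[][]): number[] {
  const dims = cluster[0].length;
  const sum = new Array<number>(dims).fill(0);
  for (const v of cluster) {
    for (let i = 0; i < dims; i++) sum[i] += v[i];
  }
  return sum.map((x) => x / cluster.length);
}
```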

11. Phase vs monolith calls

When building sophisticated analytical reports (e.g., a multi-dimensional engineer competency profile), a single monolith call saturates the window, while naive phasing forces full re-ingestion of raw artifacts at every step.

Balanced approach: phase the work over shared, token-bounded intermediate objects (summaries referenced by stable IDs), so each call consumes distilled state rather than raw history.

This preserves cross-context nuance without repeatedly injecting raw artifacts.

12. Building semantic human (or system) profiles

Goal: the smallest powerful composite object capturing 360° evolution (onboarding → departure). Pipeline sketch:

  1. Ingest artifacts (PRs, issues, reviews, comments, diffs).
  2. Normalize → event schema (timestamp, type, surface, domain_tags, authorship_role).
  3. Derive metrics (review latency, domain spread, semantic tag efficiency, ownership shifts).
  4. Periodic distilled report (weekly/monthly) — bounded token target (e.g., 1.2k).
  5. Embed report summaries, not raw conversations.
  6. Final profile object, layered: (core_stats, domain_vectors, temporal_trend_summaries, anomaly_notes).
  7. Access pattern: retrieval first hits profile; only on ambiguity escalate to archived raw slices.

Result: queries answerable in <5% of original cumulative token footprint.

13. Token budgeting ledger (tactical)

| Segment | Target % | Hard max % | Notes |
| --- | --- | --- | --- |
| System rails + schema | 3–5 | 7 | Keep terse; reference a glossary URL instead of re-defining |
| Context (staged) | 40–50 | 55–60 | Enforce via projected tokenizer; trim oldest low-impact spans |
| Reasoning allowance (implicit) | 15–25 | — | Don’t starve; frontier models need latent space |
| Output | 15–25 | 30 | Use JSON/markdown hybrid to avoid verbosity |
| Slack buffer (safety) | 5 | 10 | Prevents overflow on slight length variance |

14. Measurement & observability

Track per request: token utilization, retrieval scores, truncation events, and hallucination deltas (the observability signals from principle 8).

Improvement loop: correlate hallucination spikes with utilization above threshold or with collapses in cluster diversity.

15. Anti-pattern catalog

| Anti-pattern | Symptom | Fix |
| --- | --- | --- |
| Context hoarding | 80–95% window pre-answer | Hard ceilings + summarizers |
| Granularity mismatch | Irrelevant half-paragraphs | Drilldown refinement |
| Temporal amnesia | Outdated recommendations | Time-decay scoring |
| Redundant recursion | Re-fetch same linked issues | Cache + signature hashing |
| Over-summarization | Lost nuanced constraints | Hybrid (extract key spans + keep originals selectively) |
| Output drift | Hallucinated sections | Explicit schema + refusal clauses |
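
The “redundant recursion” fix (cache + signature hashing) can be sketched in a few lines. `fetchIssue` and the signature scheme are hypothetical; a content hash would work equally well as the key:

```typescript
// Wrap a fetcher so repeated requests for the same signature are served
// from cache instead of re-fetched and re-injected into the window.
function makeCachedFetch(fetchIssue: (id: number) => string) {
  const cache = new Map<string, string>();
  return {
    get(id: number): string {
      const sig = `issue:${id}`; // request signature; a content hash also works
      const hit = cache.get(sig);
      if (hit !== undefined) return hit;
      const value = fetchIssue(id);
      cache.set(sig, value);
      return value;
    },
  };
}
```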

16. Practical heuristics (field-tested)

17. Minimal pseudo-implementation

```ts
async function assembleContext(q: string) {
  const budget = new TokenBudget(MAX_TOKENS, { reserveAnswer: 0.25, reserveReasoning: 0.2 });
  const canonical = normalizeQuery(q);
  const coarse = await coarseRetrieve(canonical, 50);             // paragraph-level vectors
  const rescored = applyTemporalDecay(coarse);                    // Ghost tie-in
  const pruned = dedupe(rescored, 0.94);                          // drop near-duplicates
  const top = pruned.slice(0, 12);                                // heuristic cutoff
  const sentences = await fineExpand(top, 10);                    // selective fine retrieval
  const salient = salienceCompress(sentences, budget.remainingFor('context'));
  const blocks = boundaryWrap(salient);                           // explicit sentinels
  return buildPrompt({ system: SYSTEM_RAILS, blocks, query: canonical, budget });
}
```

18. Case references (internal linkage)

19. When not to over-engineer

Smell: If your pipeline has more summarization steps than original artifacts, you may be cargo-culting sophistication.

20. Roadmap (beyond baseline)

| Maturity | Capability |
| --- | --- |
| 0 | Raw concatenation + naive prompt |
| 1 | Basic retrieval + ordering |
| 2 | Dedupe + temporal decay + boundaries |
| 3 | Granular drilldown + structured primitives + budget ledger |
| 4 | Adaptive compression (middle-out, semantic diffing) |
| 5 | Learning ranker (reinforcement from outcome feedback) |
| 6 | Cross-session memory graph + dynamic persona modulation |

21. Controlled agency vs unbounded agents

Fully autonomous “tool-call-anything” agents frequently self-bloat: recursive tool invocations fetch overlapping context, rehydrate identical embeddings, and burn budget on speculative planning. Debugging becomes opaque as the reasoning path fragments across dozens of ephemeral tool traces.

Controlled agency pattern: the pipeline, not the agent, decides what enters the window.

Goal: retain determinism over the attention surface; agents contribute curated deltas instead of flooding their own window.

22. Closing

Context engineering isn’t about feeding more—it’s about feeding purposefully less, better. The constraint mindset (budget, prune, boundary, decay) outperforms raw window escalation even as models scale. Architectural intent is now a competitive advantage: the teams who sculpt the model’s attention surface win on quality, latency, and trust.

North star: smallest sufficient, temporally grounded, diversity-preserving slice + guaranteed reasoning headroom.

See also

Future write-up teaser: forthcoming open-source release on large-scale semantic profile generation (multi-million token condensation → sub-kilobyte persona objects). Stay tuned.