Keyrxng

Context Engineering: Sculpting Attention Surfaces for Real LLM Work

8/20/2025 · 12 min

So what?
A practitioner's field guide to moving beyond prompt tinkering—architecting token budgets, retrieval granularity, temporal pruning, and adaptive memory so large language models stay on-mission at scale.

See the related case study: Command-Ask: Context-aware GitHub AI Assistant

Series connective tissue: builds on the GitHub thread bot story (“Bringing an AI coworker onto GitHub”), the temporal decay work (Ghost series), and the semantic matchmaking observations. Those were incidents. This is the discipline that unifies them.

1. Why context engineering (and why now)

Prompt engineering will get you surprisingly far: clever system messages, well-placed delimiters, a few curated exemplars. But once you push past tiny, self-contained tasks, the naive strategy of “just stuff more into the window” plateaus fast. At ~60%+ window occupancy, frontier models lose plot integrity: factual drift goes up, structural compliance goes down, redundancy skyrockets, and latency/cost spike. Context engineering is the shift from wordsmithing a prompt to designing the dynamic attention surface.

Definition (field-oriented): Context engineering is the end-to-end discipline of selecting, compressing, ordering, formatting, gating, and budgeting the information + tool affordances an LLM consumes so that reasoning remains bounded, purposeful, and temporally relevant.

2. Prompt engineering vs context engineering (concise contrast)

| Dimension | Prompt engineering | Context engineering |
| --- | --- | --- |
| Primary unit | A static (or lightly templated) string | A pipeline / graph assembling evolving slices |
| Focus | Wording, examples, tone | Selection, omission, compression, structure, temporal weighting |
| Failure mode | Bad phrasing → weak answer | Uncontrolled entropy → drift, cost blowups, hallucination |
| Key levers | Few-shot curation, role framing, formatting | Retrieval ranking, granularity, summarization, dedupe, decay, tool plans |
| Metrics | BLEU-esque ad hoc scores, perceived quality | Token efficiency, answer stability, latency, drift rate, acceptance |

3. Core principles (opinionated)

  1. Subtractive first — Decide what never enters the window before hunting new data.
  2. Budget early — Reserve explicit token quotas per segment (context, reasoning, output) before retrieval.
  3. Granularity matches task — Paragraph vs sentence embeddings vs code symbol slices: choose smallest units that preserve determinism for the target decision.
  4. Temporal skepticism — Older artifacts enter only if explicitly revalidated or uniquely relevant (Ghost series rationale).
  5. Semantic boundaries — Preserve textual separators; avoid spaghetti concatenation that melts referents.
  6. Reasoning headroom — Hard ceiling (e.g., ≤ 55–60% of window) for staged context; never starve the model’s latent planning space.
  7. Composable flows — Break monolith tasks into phased calls only if cross-phase coupling doesn’t force full re-ingestion.
  8. Observability-first — Log token projections, retrieval scores, truncation events, and hallucination deltas.
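
The “budget early” principle can be made concrete with a small ledger object. The sketch below is a hypothetical `TokenBudget` (the same shape the pseudo-implementation in section 17 assumes): fixed shares of the window are reserved up front, so retrieval can never eat the model's reasoning headroom.

```typescript
// Hypothetical TokenBudget: reserves fixed shares of the window up front
// (principle 2) so staged context can never consume reasoning/output space.
class TokenBudget {
  private used = 0;

  constructor(
    private maxTokens: number,
    private reserves: { reserveAnswer: number; reserveReasoning: number },
  ) {}

  // Tokens still available for staged context after the fixed reserves.
  remainingFor(_segment: "context"): number {
    const reserved =
      this.maxTokens * (this.reserves.reserveAnswer + this.reserves.reserveReasoning);
    return Math.max(0, Math.floor(this.maxTokens - reserved - this.used));
  }

  // Record consumption; returns false when the ceiling would be breached.
  spend(tokens: number): boolean {
    if (tokens > this.remainingFor("context")) return false;
    this.used += tokens;
    return true;
  }
}
```

The point is that the ceiling is enforced *before* retrieval, not discovered after assembly overflows.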

4. The 60% rule (why saturation degrades quality)

Empirical threshold (multiple systems, frontier and mid-tier models): once raw staged context climbs past ~60% of max tokens, the degradation pattern described above sets in: factual drift rises, structural compliance drops, redundancy climbs, and latency/cost spike.

Heuristic: If window_use > 0.6 before the question + required scaffolds, force a condensation pass (extractive first, abstractive second only if entropy remains high).
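
A minimal sketch of that heuristic, assuming you inject your own tokenizer and an extractive condenser (both hypothetical stand-ins here; `countTokens` could wrap any real tokenizer):

```typescript
// Sketch of the 60% rule: if staged blocks would exceed the occupancy
// ceiling, run the extractive condenser over them before assembly.
const WINDOW_LIMIT = 0.6;

function stageContext(
  blocks: string[],
  maxTokens: number,
  countTokens: (s: string) => number,       // stand-in for a real tokenizer
  extractive: (s: string) => string,        // stand-in for span extraction
): string[] {
  const occupancy =
    blocks.reduce((sum, b) => sum + countTokens(b), 0) / maxTokens;
  if (occupancy <= WINDOW_LIMIT) return blocks; // under budget: pass through
  // Condense extractively first; an abstractive pass would follow only
  // if entropy remains high.
  return blocks.map(extractive);
}
```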

5. Granularity strategy & hierarchical condensation (use-case first)

There is no universally “correct” chunk size. Granularity is downstream of task shape, retrieval pattern, and longevity of the knowledge surface. Two contrasting scenarios:

The SWE profile example (duplication pressure)

Suppose 100 engineer performance / activity reports each carry 30+ semi-structured data points (ratings, qualitative notes). That’s hundreds, maybe thousands, of near-duplicate sentences like “Communication: 5/10” or “Demonstrated leadership by unblocking X.” The law of large numbers kicks in:

  1. Semantic search bias: cosine similarity over-weights the most repeated phrasing (bland, averaged statements) and under-represents unique, high-signal outliers.
  2. Model pattern anchoring: LLM latches onto frequently repeated generic sentences, diluting nuanced signal (rare spikes in leadership, novel incident response behavior).
  3. Retrieval tradeoff: returning only the “top 30” sentences loses breadth (>90% of the longitudinal range); returning everything floods the window with redundancy, at which point you’d have been better off skipping the embedding step and one-shotting the task.

Hierarchical condensation pipeline

Prefer multi-level rollups over raw fine granularity:

Issue / PR / Thread → Micro Report (artifact summary + key metrics)
Weekly / Sprint Report → Aggregated metrics + exceptional events
Monthly Report → Trend deltas + anomaly notes
Quarterly / Yearly Profile → Stable traits, trajectory vectors, distilled narrative
Final Persona Object → (core_stats, domain_vectors, temporal_trends, anomaly_notes)

Each level is token bounded and references lower layers by stable IDs, enabling drill-down on demand without automatically re-injecting raw historical chatter.
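
The rollup layers above can be sketched as types whose only links downward are stable IDs. All field names here are illustrative, not a prescribed schema:

```typescript
// Hypothetical shapes for the rollup hierarchy. Each level carries a bounded
// summary plus stable IDs pointing one layer down, so drill-down is explicit
// rather than automatic re-injection of raw history.
interface MicroReport   { id: string; summary: string; keyMetrics: Record<string, number> }
interface WeeklyReport  { id: string; summary: string; microIds: string[] }   // → MicroReport.id
interface MonthlyReport { id: string; trendDeltas: string; weeklyIds: string[] } // → WeeklyReport.id

// Resolve one drill-down step on demand instead of shipping every layer.
function drillDown(
  report: MonthlyReport,
  weeks: Map<string, WeeklyReport>,
): WeeklyReport[] {
  return report.weeklyIds.flatMap((id) => {
    const w = weeks.get(id);
    return w ? [w] : []; // missing IDs are skipped, never fabricated
  });
}
```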

Granularity heuristic checklist

Principle: Pick the smallest number of largest trustworthy summaries that still preserve reversible linkage to underlying detail.

6. Retrieval + pruning loop (practical pattern)

  1. Query expand (optional): canonicalize question (strip noise, normalize refs).
  2. Coarse retrieve: paragraph vectors (k=50).
  3. Score augment: semantic_score * time_decay * doc_authority.
  4. Dedupe / cluster: drop near-duplicates (cosine > 0.94).
  5. Fine retrieve: expansion for top m (m≈10).
  6. Salience compression: extract key spans (regex heuristics + embedding centrality).
  7. Budget trim: enforce segment token ceilings.
  8. Assemble: ordered context blocks + explicit separators + metadata headers.
  9. Guardrail prompt: brevity rules, structure contract, DO / DO NOT lists.
  10. Answer.

Key: Omission logging. Store what you didn’t include and why (rank below threshold, duplicate cluster id). Critical for post-hoc error analysis.
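
Steps 4 (dedupe) and the omission log can be sketched together. The `Hit` shape and `cosine` helper are assumptions for illustration; in practice the embeddings come from your vector store:

```typescript
// Near-duplicate pruning with omission logging: keep the highest-scored
// member of each duplicate cluster and record *why* everything else was
// dropped, so post-hoc error analysis can inspect exclusions.
interface Hit { id: string; text: string; embedding: number[]; score: number }
interface Omission { id: string; reason: string }

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const na = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
  const nb = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
  return dot / (na * nb);
}

function dedupeWithLog(
  hits: Hit[],
  threshold = 0.94,
): { kept: Hit[]; omitted: Omission[] } {
  const sorted = [...hits].sort((a, b) => b.score - a.score);
  const kept: Hit[] = [];
  const omitted: Omission[] = [];
  for (const h of sorted) {
    const dup = kept.find((k) => cosine(k.embedding, h.embedding) > threshold);
    if (dup) omitted.push({ id: h.id, reason: `duplicate of ${dup.id}` });
    else kept.push(h);
  }
  return { kept, omitted };
}
```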

7. Formatting as an interface

Format is a compression codec for model attention.

| Format | Strength | Risk | Use |
| --- | --- | --- | --- |
| Markdown | Readability, structural cues | Overhead tokens in headers | Mixed code + prose |
| JSON (strict) | Parseable output enforcement | Higher token cost | Tool invocation planning |
| XML | Hierarchical clarity | Verbose tags | Nested attribute reasoning |
| Lightweight tagged blocks | Low overhead | Custom parser needed | Internal pipelines |

Pattern: Provide context blocks in Markdown (semantic clarity) → require output schema in JSON with strict field ordering. Separation reduces drift.

8. Boundary preservation

Failure mode: concatenated multi-issue specs fused into one ambiguous blob → model blends constraints. Mitigation:

```
<<< ISSUE_A (id:123 | updated:2025-08-12) >>>
...spec text...
<<< /ISSUE_A >>>

<<< ISSUE_B (id:456 | updated:2025-08-18) >>>
...spec text...
<<< /ISSUE_B >>>
```

Explicit sentinels act as alignment landmarks; they also simplify downstream partial redaction.
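
A `boundaryWrap` along the lines the pseudo-implementation in section 17 assumes might look like this (the `ContextBlock` fields are illustrative):

```typescript
// Emit explicit sentinels around each context block so referents stay
// separable and partial redaction can target a single block.
interface ContextBlock { label: string; id: number; updated: string; body: string }

function boundaryWrap(blocks: ContextBlock[]): string {
  return blocks
    .map(
      (b) =>
        `<<< ${b.label} (id:${b.id} | updated:${b.updated}) >>>\n` +
        `${b.body}\n` +
        `<<< /${b.label} >>>`,
    )
    .join("\n\n");
}
```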

9. Temporal resistance tie-in (Ghost series, scoped responsibly)

Temporal resistance (historical embeddings outliving operational relevance) matters only when recency or availability change correctness or utility.

So apply temporal decay & availability filters when the problem definition demands present-state alignment. Otherwise, decay can become silent accuracy erosion.
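
One way to implement the time-decay factor from step 3 of the retrieval loop is exponential decay with a configurable half-life. The 30-day default is an assumption to tune per corpus, not a recommendation:

```typescript
// Exponential time decay: weight halves every `halfLifeDays`.
function timeDecay(ageDays: number, halfLifeDays = 30): number {
  return Math.pow(0.5, ageDays / halfLifeDays);
}

// Combined score: semantic relevance damped by staleness.
function rescore(semanticScore: number, ageDays: number): number {
  return semanticScore * timeDecay(ageDays);
}
```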

10. Embeddings aren’t silver bullets

Common anti-pattern: “Vector everything → dump top 200 hits into the window,” flooding the context with redundant, stale, and stylistically homogeneous text.

Countermeasures: cluster → centroid; recency gating; authorship diversification; acceptance / click feedback loops to recalibrate weights.
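
The cluster → centroid countermeasure, sketched: replace a cluster of near-duplicate vectors with their mean so one representative competes in retrieval instead of the whole redundant mass. (How clusters are formed is left to your similarity threshold; this only shows the collapse step.)

```typescript
// Collapse a cluster of near-duplicate embeddings into its mean vector.
function centroid(cluster: number[][]): number[] {
  const dims = cluster[0].length;
  const sum = new Array<number>(dims).fill(0);
  for (const v of cluster) {
    for (let i = 0; i < dims; i++) sum[i] += v[i];
  }
  return sum.map((x) => x / cluster.length);
}
```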

11. Phase vs monolith calls

When building sophisticated analytical reports (e.g., a multi-dimensional engineer competency profile), a single monolith call saturates the window, while naive phasing forces full re-ingestion of raw artifacts at every step.

Balanced approach: phase the work over shared, token-bounded intermediate objects (summaries referenced by stable IDs), so each call consumes distilled state rather than raw history.

This preserves cross-context nuance without repeatedly injecting raw artifacts.

12. Building semantic human (or system) profiles

Goal: the smallest powerful composite object capturing 360° evolution (onboarding → departure). Pipeline sketch:

  1. Ingest artifacts (PRs, issues, reviews, comments, diffs).
  2. Normalize → event schema (timestamp, type, surface, domain_tags, authorship_role).
  3. Derive metrics (review latency, domain spread, semantic tag efficiency, ownership shifts).
  4. Periodic distilled report (weekly/monthly) — bounded token target (e.g., 1.2k).
  5. Embed report summaries, not raw conversations.
  6. Final profile object, layered: (core_stats, domain_vectors, temporal_trend_summaries, anomaly_notes).
  7. Access pattern: retrieval first hits profile; only on ambiguity escalate to archived raw slices.

Result: queries answerable in <5% of original cumulative token footprint.

13. Token budgeting ledger (tactical)

| Segment | Target % | Hard max % | Notes |
| --- | --- | --- | --- |
| System rails + schema | 3–5 | 7 | Keep terse; reference a glossary URL instead of re-defining |
| Context (staged) | 40–50 | 55–60 | Enforce via projected tokenizer; trim oldest low-impact spans |
| Reasoning allowance (implicit) | 15–25 | — | Don’t starve; frontier models need latent space |
| Output | 15–25 | 30 | Use JSON/markdown hybrid to avoid verbosity |
| Slack buffer (safety) | 5 | 10 | Prevents overflow on slight length variance |

14. Measurement & observability

Track per request: token utilization, retrieval scores, truncation events, and hallucination deltas (the observability signals from principle 8).

Improvement loop: correlate hallucination spikes with utilization above threshold or with collapses in cluster diversity.

15. Anti-pattern catalog

| Anti-pattern | Symptom | Fix |
| --- | --- | --- |
| Context hoarding | 80–95% window pre-answer | Hard ceilings + summarizers |
| Granularity mismatch | Irrelevant half-paragraphs | Drilldown refinement |
| Temporal amnesia | Outdated recommendations | Time-decay scoring |
| Redundant recursion | Re-fetch same linked issues | Cache + signature hashing |
| Over-summarization | Lost nuanced constraints | Hybrid (extract key spans + keep originals selectively) |
| Output drift | Hallucinated sections | Explicit schema + refusal clauses |
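
The “redundant recursion” fix (cache + signature hashing) can be sketched in a few lines. `fetchIssue` and the signature scheme are hypothetical; a content hash would work equally well as the key:

```typescript
// Wrap a fetcher so repeated requests for the same signature are served
// from cache instead of re-fetched and re-injected into the window.
function makeCachedFetch(fetchIssue: (id: number) => string) {
  const cache = new Map<string, string>();
  return {
    get(id: number): string {
      const sig = `issue:${id}`; // request signature; a content hash also works
      const hit = cache.get(sig);
      if (hit !== undefined) return hit;
      const value = fetchIssue(id);
      cache.set(sig, value);
      return value;
    },
  };
}
```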

16. Practical heuristics (field-tested)

17. Minimal pseudo-implementation

```ts
async function assembleContext(q: string) {
  const budget = new TokenBudget(MAX_TOKENS, { reserveAnswer: 0.25, reserveReasoning: 0.2 });
  const canonical = normalizeQuery(q);
  const coarse = await coarseRetrieve(canonical, 50);             // paragraph-level vectors
  const rescored = applyTemporalDecay(coarse);                    // Ghost tie-in
  const pruned = dedupe(rescored, 0.94);                          // drop near-duplicates
  const top = pruned.slice(0, 12);                                // heuristic cutoff
  const sentences = await fineExpand(top, 10);                    // selective fine retrieval
  const salient = salienceCompress(sentences, budget.remainingFor('context'));
  const blocks = boundaryWrap(salient);                           // explicit sentinels
  return buildPrompt({ system: SYSTEM_RAILS, blocks, query: canonical, budget });
}
```

18. Case references (internal linkage)

19. When not to over-engineer

Smell: If your pipeline has more summarization steps than original artifacts, you may be cargo-culting sophistication.

20. Roadmap (beyond baseline)

| Maturity | Capability |
| --- | --- |
| 0 | Raw concatenation + naive prompt |
| 1 | Basic retrieval + ordering |
| 2 | Dedupe + temporal decay + boundaries |
| 3 | Granular drilldown + structured primitives + budget ledger |
| 4 | Adaptive compression (middle-out, semantic diffing) |
| 5 | Learning ranker (reinforcement from outcome feedback) |
| 6 | Cross-session memory graph + dynamic persona modulation |

21. Controlled agency vs unbounded agents

Fully autonomous “tool-call-anything” agents frequently self-bloat: recursive tool invocations fetch overlapping context, rehydrate identical embeddings, and burn budget on speculative planning. Debugging becomes opaque as the reasoning path fragments across dozens of ephemeral tool traces.

Controlled agency pattern: the pipeline, not the agent, decides what enters the window.

Goal: retain determinism over the attention surface; agents contribute curated deltas instead of flooding their own window.

22. Closing

Context engineering isn’t about feeding more—it’s about feeding purposefully less, better. The constraint mindset (budget, prune, boundary, decay) outperforms raw window escalation even as models scale. Architectural intent is now a competitive advantage: the teams who sculpt the model’s attention surface win on quality, latency, and trust.

North star: smallest sufficient, temporally grounded, diversity-preserving slice + guaranteed reasoning headroom.

See also

Future write-up teaser: forthcoming open-source release on large-scale semantic profile generation (multi-million token condensation → sub-kilobyte persona objects). Stay tuned.