pproenca Linguistic and Semantic Algorithms Best Practices

Reference of 40 algorithms an agent should reach for when extracting structure, meaning, history, or risk signals from source code and commit data. Categories are ordered by insight-per-effort: how much non-obvious truth the technique exposes relative to how easy it is to apply. The first two categories target the highest-leverage questions, "What business entities live in this code?" and "Where else does this concept already exist?", which grep and intuition cannot answer.

When to Apply

Reach for these algorithms when:

Orienting in an unfamiliar codebase: PageRank the import graph to find the core, run LDA over identifier tokens to discover business themes, mine change coupling to surface hidden architectural couplings.
Hunting a bug from a description: BM25 + history prior + embedding re-rank produces a ranked file shortlist far better than grep.
Scoping a feature: find prior PRs that did similar work via embedding similarity; map the feature's vocabulary against the codebase's domain via TF-IDF and noun-phrase mining.
Reviewing a refactor: AST-level GumTree diff reveals semantic impact that a text diff hides; PDG isomorphism finds the "same logic, different code" twin that should also be updated.
Auditing risk: hotspots (churn × complexity), bus factor, defect-magnet density, and dead-code candidates together direct attention to the parts of the codebase where scrutiny pays off most.
Identifying domain entities and bounded contexts: noun-phrase mining + TF-IDF rare-term extraction + Louvain communities + Jensen-Shannon divergence on per-cluster vocabulary.
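
As one illustration of the risk-audit item above, hotspot ranking (churn × complexity) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a definitive implementation: churn is taken to be the number of commits touching a file, complexity is any precomputed per-file number (cyclomatic complexity, line count, or similar), and the function name rank_hotspots and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def rank_hotspots(
    commits: Iterable[Iterable[str]],
    complexity: Mapping[str, float],
) -> list[tuple[str, float]]:
    """Rank files by churn x complexity, highest score first.

    commits: one iterable of touched file paths per commit (hypothetical
        input shape; it could be parsed from `git log --name-only`).
    complexity: any per-file complexity proxy (assumption: higher means
        more complex).
    """
    churn: Counter[str] = Counter()
    for files in commits:
        churn.update(set(files))  # count each file at most once per commit

    scores = {
        path: churn[path] * complexity[path]
        for path in churn
        if path in complexity  # skip files with no complexity measurement
    }
    # Sort by descending score, then by path so ties are deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample data, for illustration only.
sample_commits = [
    ["billing/invoice.py", "billing/tax.py"],
    ["billing/invoice.py"],
    ["billing/invoice.py", "README.md"],
    ["util/strings.py"],
]
sample_complexity = {
    "billing/invoice.py": 40.0,
    "billing/tax.py": 25.0,
    "util/strings.py": 5.0,
}

hotspots = rank_hotspots(sample_commits, sample_complexity)
for path, score in hotspots:
    print(f"{score:8.1f}  {path}")
```

Multiplying the two signals, rather than adding them, means a file has to be both frequently changed and complex to rank highly; a stable complex file or a trivial but busy file drops down the list. Files without a complexity measurement (README.md in the sample) are skipped rather than guessed at.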

Rule Categories by Priority