codebase-comprehension-algorithms
Community Codebase Comprehension And Domain Mapping Algorithms Best Practices
A practitioner-oriented reference of the algorithms that work for mapping a codebase into understandable feature/business domains. Most of these techniques live in the Software Architecture Recovery and Mining Software Repositories literatures and are invisible to working engineers — yet they're the right tools for the job a coding agent is asked to do every day: "what does this codebase do, and where?"
The 47 rules are organized by execution-lifecycle impact: a wrong decision early in the pipeline (which graph to build, which identifiers to keep) propagates through everything downstream. The three CRITICAL categories (graph-, clust-, valid-) are the ones where a wrong call cannot be recovered from later. Read them first.
Scope: proven algorithms backed by peer-reviewed citations or canonical books: Newman's Networks; Leskovec, Rajaraman, and Ullman's Mining of Massive Datasets; Ganter and Wille's Formal Concept Analysis; plus 40+ ICSE / FSE / TSE / PNAS / JMLR papers. No tutorial sites, no Stack Overflow, no marketing posts. Deliberately deferred to a future version: GNN/CodeBERT/code2vec approaches (not yet "proven over decades") and refactoring recipes (covered by sibling skills such as react-refactor and typescript-refactor).
When to Apply
Use these rules when:
- Onboarding an agent into an unfamiliar codebase: "explain what this codebase does, by domain"
- Producing an architecture map: "what are the main subsystems and how do they connect?"
- Locating a feature: "which files implement payments / authentication / search?"
- Reviewing a refactor: "did this change respect the architectural boundaries?"
- Detecting architectural debt: "what files have surprising coupling?"
- Validating an existing decomposition: "does the README's architecture match the code?"
- Picking algorithms for any of the above — the user wants something that's proven, not vibes
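
As a rough illustration of the "surprising coupling" question in the list above, one common idea from the Mining Software Repositories literature is to count how often pairs of files change in the same commit and then flag pairs that co-change without any declared dependency between them. The sketch below is a minimal version of that idea and is not taken from the rules themselves: the function names `co_change_counts` and `surprising_pairs`, the `min_support` threshold, the toy commit history, and the toy import graph are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count how often each unordered pair of files changes in the same commit."""
    pair_counts = Counter()
    for files in commits:
        # Deduplicate and sort so each unordered pair has one canonical key.
        for a, b in combinations(sorted(set(files)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def surprising_pairs(commits, depends_on, min_support=2):
    """Return co-changing pairs with no declared dependency in either direction."""
    result = []
    for (a, b), n in co_change_counts(commits).items():
        linked = b in depends_on.get(a, set()) or a in depends_on.get(b, set())
        if n >= min_support and not linked:
            result.append((a, b, n))
    # Most frequent pairs first; ties broken alphabetically for stable output.
    return sorted(result, key=lambda t: (-t[2], t[0], t[1]))


# Hypothetical toy history: each inner list is the files touched by one commit.
commits = [
    ["billing/invoice.py", "auth/session.py"],
    ["billing/invoice.py", "auth/session.py", "README.md"],
    ["billing/invoice.py", "billing/tax.py"],
    ["billing/invoice.py", "billing/tax.py"],
]
# Hypothetical static import graph: file -> set of files it imports.
depends_on = {"billing/invoice.py": {"billing/tax.py"}}

flagged = surprising_pairs(commits, depends_on)
print(flagged)
```

In this toy data the invoice and tax files co-change twice but are linked by an import, so they are not flagged; the invoice and session files co-change twice with no declared dependency, so they are. In practice the commit lists would come from version-control history and the dependency map from a static analysis step, and the threshold would need tuning per repository.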