OPERATOR research v1.0.0 · Apache-2.0

Graphify

Build knowledge graphs from codebases, docs, papers, and images using Graphify. Produces interactive HTML visualization, queryable JSON, and audit report with god nodes, communities, and surprising connections. Use when analyzing unfamiliar repos, understanding architecture, or building structured context before a complex task.


Install in your agent

Tell your agent: "install the recipes skill, then add graphify"
Or via curl: curl -sL https://recipes.wisechef.ai/skill -o ~/.claude/skills/recipes/SKILL.md


Graphify — Code/Doc → Knowledge Graph

  • Install: pipx install 'graphifyy[all]'
  • Binary: graphify (CLI for install/hooks/benchmark only)
  • Python: ${HOME}/.local/share/pipx/venvs/graphifyy/bin/python
  • Repo: https://github.com/safishamsi/graphify (5k stars, v0.3.6)

When to Use

  • Analyzing unfamiliar codebases before modifying them
  • Understanding architecture and cross-component relationships
  • Building structured context for a complex multi-file task
  • Finding "god nodes" (most connected abstractions) and surprising connections
  • Pre-loading knowledge before running AutoAgent or similar optimization loops
  • Comparing different tool options: use Graphify for docs+code+images, GitNexus for pure code call graphs

When NOT to Use

  • Repo <5 files and <50K words — just read the files directly (Graphify will warn you)
  • Pure code-only analysis — GitNexus (gitnexus analyze) is faster for call graphs
  • You need real-time file watching — use gitnexus with MCP instead

Modes

Graphify has two modes depending on your content type:

Code Mode (default)

For codebases — extracts functions, classes, imports, call graphs. Best for understanding unfamiliar repos.

Knowledge Mode (--mode knowledge)

For prose-heavy directories — Obsidian vaults, documentation, research papers, meeting notes. Extracts domain concepts instead of code symbols.

When to use knowledge mode:

  • Obsidian vaults (.md/.txt heavy)
  • Documentation directories
  • Research paper collections
  • Meeting notes or decision logs
  • Any corpus where the value is in concepts and their relationships, not code structure

Why it matters: Running code mode on prose produces function-level noise (e.g., main(), frontmatter() as god nodes) instead of conceptual structure (e.g., "WiseChef", "Agent Coordination", "Cognee" as god nodes).

Architecture

Graphify runs in two passes (code mode) or one pass (knowledge mode):

  1. AST pass (deterministic, free) — Tree-sitter extracts classes, functions, imports, call graphs from code files. Skipped in knowledge mode.
  2. Semantic pass (LLM, costs tokens) — Extracts concepts and relationships:
    • Code mode: Extracts technical concepts from docs, papers, images
    • Knowledge mode: Extracts domain concepts, entities, and cross-document relationships from all files

Results merge into a NetworkX graph → Leiden community detection → outputs:

  • graphify-out/graph.html — interactive visualization (click nodes, search, filter)
  • graphify-out/graph.json — persistent queryable graph
  • graphify-out/GRAPH_REPORT.md — god nodes, communities, surprising connections, knowledge gaps
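As a quick sanity check on the outputs, you can inspect graphify-out/graph.json with nothing but the standard library. The snippet below is a minimal sketch that assumes graph.json carries "nodes" and "edges" lists shaped like the node/edge dicts used elsewhere in this skill; the sample data here is illustrative, not from a real run.

```python
from collections import Counter

# Illustrative stand-in for json.loads(Path('graphify-out/graph.json').read_text());
# the exact schema produced by to_json() is an assumption.
graph = {
    "nodes": [{"id": "a", "label": "A"}, {"id": "b", "label": "B"}, {"id": "c", "label": "C"}],
    "edges": [{"source": "a", "target": "b"}, {"source": "a", "target": "c"}],
}

# Count how many edges touch each node (undirected degree)
degree = Counter()
for e in graph["edges"]:
    degree[e["source"]] += 1
    degree[e["target"]] += 1

# The highest-degree nodes are the candidate god nodes
top = degree.most_common(3)
print(top)  # [('a', 2), ('b', 1), ('c', 1)]
```

On a real graph.json, swap the literal dict for `json.loads(...)` and the top entries should roughly match the god nodes section of GRAPH_REPORT.md.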

Quick Start (Inside a Coding Agent)

If running from Claude Code, Codex, or OpenClaw with Agent/subagent support:

/graphify /path/to/repo

This uses the skill.md workflow with parallel subagent dispatch for semantic extraction. Fastest and most complete.

Running from Hermes (Without Subagent Dispatch)

Hermes doesn't have the Agent tool that Graphify expects for parallel semantic extraction. Use the Python API directly:

Step 1: Detect Files

PYTHON=${HOME}/.local/share/pipx/venvs/graphifyy/bin/python
$PYTHON -c "
import json
from graphify.detect import detect
from pathlib import Path
result = detect(Path('/path/to/repo'))
print(json.dumps(result, indent=2))
"

Check total_files, total_words, and needs_graph. If corpus is tiny, just read the files.
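The gating logic from "When NOT to Use" can be mirrored in a few lines. This is a sketch against an example detect() result (the values below are made up), applying the same <5 files / <50K words threshold the skill describes.

```python
# Example detect() output -- illustrative values, not from a real run
result = {"total_files": 4, "total_words": 30000, "needs_graph": False}

# Mirror the skill's rule: tiny corpora don't need a graph
tiny = result["total_files"] < 5 and result["total_words"] < 50000
if tiny or not result["needs_graph"]:
    print("Corpus is small: read the files directly instead of building a graph.")
```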

Step 2: AST Extraction (Code Files)

Write a script file rather than passing inline Python with f-strings on the command line; shell escaping breaks them:

# save as _run_ast.py in the repo
import json
from graphify.extract import collect_files, extract
from pathlib import Path

code_files = [Path(f) for f in [
    '/path/to/repo/file1.py',
    '/path/to/repo/file2.py',
]]
result = extract(code_files)
Path('.graphify_ast.json').write_text(json.dumps(result, indent=2))
print(f"AST: {len(result['nodes'])} nodes, {len(result['edges'])} edges")

Step 3: Semantic Extraction (Manual)

Since we can't dispatch subagents, read the doc files yourself and create semantic nodes/edges manually:

semantic_nodes = [
    {"id": "unique_id", "type": "concept", "label": "Human Label",
     "description": "What this is", "source": "file.md", "provenance": "EXTRACTED"},
]
semantic_edges = [
    {"source": "node_a", "target": "node_b", "relation": "uses",
     "provenance": "EXTRACTED"},  # or "INFERRED" with confidence
]

Node types: concept, artifact, process, config, constraint, directive, dependency, benchmark, boundary

Provenance: EXTRACTED (explicit in source), INFERRED (reasonable inference, add confidence: 0.0-1.0), AMBIGUOUS (uncertain)
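Since hand-built nodes and edges skip Graphify's own extraction path, it is easy to typo a type, forget a confidence on an INFERRED edge, or reference a node that does not exist. The helper below is my own sketch (not part of the graphify API) that checks the conventions listed above before you merge the semantic layer.

```python
# Allowed values, taken from the node types and provenance list above
NODE_TYPES = {"concept", "artifact", "process", "config", "constraint",
              "directive", "dependency", "benchmark", "boundary"}
PROVENANCE = {"EXTRACTED", "INFERRED", "AMBIGUOUS"}

def validate(nodes, edges):
    """Return a list of problems found in hand-built semantic nodes/edges."""
    problems = []
    ids = {n["id"] for n in nodes}
    for n in nodes:
        if n["type"] not in NODE_TYPES:
            problems.append(f"node {n['id']}: unknown type {n['type']}")
        if n["provenance"] not in PROVENANCE:
            problems.append(f"node {n['id']}: bad provenance {n['provenance']}")
    for e in edges:
        if e["source"] not in ids or e["target"] not in ids:
            problems.append(f"edge {e['source']}->{e['target']}: dangling endpoint")
        if e["provenance"] == "INFERRED" and "confidence" not in e:
            problems.append(f"edge {e['source']}->{e['target']}: INFERRED needs confidence")
    return problems

nodes = [{"id": "a", "type": "concept", "provenance": "EXTRACTED"}]
edges = [{"source": "a", "target": "b", "relation": "uses", "provenance": "INFERRED"}]
print(validate(nodes, edges))  # two problems: dangling target, missing confidence
```

Run this on your `semantic_nodes`/`semantic_edges` before Step 4; an empty list means the layer is structurally sound.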

Step 3b: Knowledge Mode Extraction (prose/docs)

When running on prose-heavy content (Obsidian vaults, docs, papers), skip the AST pass entirely and extract domain-level concepts as nodes with these types:

semantic_nodes = [
    {"id": "wisechef", "type": "project", "label": "WiseChef",
     "description": "AI platform for autonomous agents, €248/mo MRR", "source": "business-strategy.md", "provenance": "EXTRACTED"},
    {"id": "agent_coordination", "type": "concept", "label": "Agent Coordination Protocol",
     "description": "Multi-agent coordination via Discord markers and ACK system", "source": "agent-map.md", "provenance": "EXTRACTED"},
    {"id": "cognee", "type": "technology", "label": "Cognee",
     "description": "Knowledge graph engine for AI memory", "source": "tools-skills-inventory.md", "provenance": "EXTRACTED"},
    {"id": "adam", "type": "person", "label": "Adam Krawczyk",
     "description": "Founder, Kraków-based, late-night worker", "source": "contacts.md", "provenance": "EXTRACTED"},
]

semantic_edges = [
    {"source": "wisechef", "target": "agent_coordination", "relation": "uses",
     "provenance": "EXTRACTED"},
    {"source": "wisechef", "target": "cognee", "relation": "depends_on",
     "provenance": "EXTRACTED"},
    {"source": "wisechef", "target": "adam", "relation": "owned_by",
     "provenance": "EXTRACTED"},
    {"source": "agent_coordination", "target": "cognee", "relation": "stores_state_in",
     "provenance": "INFERRED", "confidence": 0.7},
]

Knowledge mode node types:

| Type | Use For | Example |
|---|---|---|
| project | Named projects/products | WiseChef, AgentPact, WiseVision |
| concept | Abstract ideas, protocols | Agent Coordination, Zero-Touch Deals |
| technology | Tools, frameworks, infra | Cognee, Honcho, Hetzner, Docker |
| person | Named individuals | Adam, Olek, Mariusz |
| organization | Companies, teams | WiseChef, MIT |
| decision | Key decisions made | "use GLM for entity extraction" |
| process | Workflows, pipelines | Nightly Ingest, Graphify nightly |

Knowledge mode edge relations: depends_on, uses, owned_by, stores_state_in, relates_to, builds_on, contradicts, is_part_of, managed_by, competes_with

Key principle: Prioritize cross-document connections over within-document structure. The value of a knowledge graph is revealing relationships that aren't visible from reading any single document.
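One way to keep yourself honest about this principle is to measure it. The helper below is a hypothetical check (not part of graphify) that uses each node's `source` field to compute what fraction of your edges cross document boundaries.

```python
def cross_document_share(nodes, edges):
    """Fraction of edges connecting concepts extracted from different source files."""
    source_of = {n["id"]: n.get("source") for n in nodes}
    cross = sum(1 for e in edges
                if source_of.get(e["source"]) != source_of.get(e["target"]))
    return cross / len(edges) if edges else 0.0

nodes = [
    {"id": "wisechef", "source": "business-strategy.md"},
    {"id": "cognee", "source": "tools-skills-inventory.md"},
    {"id": "mrr", "source": "business-strategy.md"},
]
edges = [
    {"source": "wisechef", "target": "cognee"},   # cross-document
    {"source": "wisechef", "target": "mrr"},      # within-document
]
print(cross_document_share(nodes, edges))  # 0.5
```

If the share comes back very low, expect disconnected communities in the clustered graph; go back and look for relationships between documents before building.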

Custom prompt for LLM extraction (if using subagents):

"Extract domain concepts, named entities, project names, technologies, and abstract ideas as nodes. Extract relationships like 'depends on', 'is part of', 'relates to', 'contradicts', 'builds on' as edges. Prioritize cross-document connections. Do NOT extract code-level artifacts (functions, classes, imports) — only domain-level concepts."

Step 4: Build Graph + Cluster + Export

import json
from pathlib import Path
from graphify.build import build_from_json
from graphify.cluster import cluster, score_all
from graphify.analyze import god_nodes, surprising_connections, suggest_questions
from graphify.report import generate
from graphify.export import to_html, to_json

# Merge AST + semantic
ast = json.loads(Path('.graphify_ast.json').read_text())
merged = {"nodes": ast['nodes'] + semantic_nodes,
          "edges": ast['edges'] + semantic_edges,
          "input_tokens": 0, "output_tokens": 0}

# Build → Cluster → Analyze
G = build_from_json(merged)
communities = cluster(G)  # Returns dict[int, list[str]]
cohesion = score_all(G, communities)

# Community labels
community_labels = {}
for cid, members in communities.items():
    labels = [G.nodes[m].get('label', m) for m in members if G.has_node(m)]
    community_labels[cid] = labels[0] if labels else f"Community {cid}"

god_list = god_nodes(G)
surprise_list = surprising_connections(G)
questions = suggest_questions(G, communities, community_labels)

# Export
OUT = Path('graphify-out')
OUT.mkdir(exist_ok=True)
to_json(G, communities, str(OUT / 'graph.json'))
to_html(G, communities, str(OUT / 'graph.html'))

# Report (replace each N with the counts from Step 1's detect() output)
REPO = Path('/path/to/repo')
detection_result = {"files": {"code": N, "document": N}, "total_files": N, "total_words": N}
token_cost = {"input_tokens": 0, "output_tokens": 0}
report = generate(G, communities, cohesion, community_labels, god_list,
                  surprise_list, detection_result, token_cost, str(REPO), questions)
(OUT / 'GRAPH_REPORT.md').write_text(report)

API Reference (Discovered — Not Documented)

| Module | Key Functions | Notes |
|---|---|---|
| graphify.detect | detect(path: Path) -> dict | Returns files, total_words, needs_graph |
| graphify.extract | extract(files: list[Path]) -> dict, collect_files(path) -> list | AST extraction, returns nodes+edges |
| graphify.cache | check_semantic_cache(files) -> (nodes, edges, hyperedges, uncached) | SHA256-based cache |
| graphify.build | build_from_json(data: dict) -> nx.Graph | NOT build_graph |
| graphify.cluster | cluster(G) -> dict[int, list[str]], score_all(G, communities) -> dict[int, float] | Leiden community detection |
| graphify.analyze | god_nodes(G), surprising_connections(G), suggest_questions(G, communities, labels) | Graph analysis |
| graphify.report | generate(G, communities, cohesion, labels, gods, surprises, detect, cost, root, questions) | Markdown report |
| graphify.export | to_html(G, communities, path), to_json(G, communities, path), to_svg, to_graphml, to_cypher | Multiple formats |

Always-On Integration

After building a graph, install the always-on hook so your agent reads the graph before grepping:

cd /path/to/repo
graphify claude install    # Adds CLAUDE.md section + PreToolUse hook
# or: graphify codex install / graphify claw install

Querying an Existing Graph

/graphify query "what connects X to Y?"          # BFS traversal
/graphify query "what connects X to Y?" --dfs    # DFS — trace specific path
/graphify path "NodeA" "NodeB"                    # Shortest path
/graphify explain "SomeNode"                      # Plain-language explanation
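Outside an agent, the BFS-style "what connects X to Y" query can be approximated directly over the edge list in graph.json. This is my own sketch of the traversal, not the skill's implementation; it treats edges as undirected and returns one shortest node path.

```python
from collections import deque

def connecting_path(edges, start, goal):
    """BFS over undirected edges; returns one shortest node path, or None."""
    adj = {}
    for e in edges:
        adj.setdefault(e["source"], set()).add(e["target"])
        adj.setdefault(e["target"], set()).add(e["source"])
    seen = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Illustrative edges; on a real graph, load them from graphify-out/graph.json
edges = [{"source": "A", "target": "B"}, {"source": "B", "target": "C"}]
print(connecting_path(edges, "A", "C"))  # ['A', 'B', 'C']
```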

Pitfalls

  1. PyPI package is graphifyy (double y) — the graphify name is being reclaimed
  2. Don't use f-strings in terminal for the Python scripts — shell escaping breaks. Write script files instead.
  3. build_from_json not build_graph — the function name differs from what you'd expect
  4. cluster() returns dict[int, list[str]] not a modified graph — you pass communities as a separate arg to export functions
  5. to_json/to_html take 3 args (G, communities, path_string), not just (G, path)
  6. suggest_questions needs community_labels — compute labels before calling it
  7. Semantic extraction without subagents requires manual node/edge creation — read the docs yourself and build the semantic layer by hand
  8. AST rationale nodes get warnings about invalid file_type='rationale' — these are cosmetic, the graph still builds correctly
  9. Large repos (>200 files): detection will suggest running on a subfolder. Respect this.
  10. The graphify CLI is just for install/hooks/benchmark — the actual pipeline runs inside a coding agent or via Python API
  11. Running code mode on prose produces noise — function-level nodes (main(), frontmatter()) instead of domain concepts. Always use knowledge mode for .md/.txt/.rst heavy directories like Obsidian vaults.
  12. Knowledge mode graphs need cross-document edges — within-document edges alone produce disconnected communities. Spend 50% of extraction effort on finding connections between documents.

Running on Our Obsidian Vault

# First, detect what we're working with
PYTHON=${HOME}/.local/share/pipx/venvs/graphifyy/bin/python
$PYTHON -c "
from graphify.detect import detect
from pathlib import Path
r = detect(Path('${HOME}/obsidian-vault'))
print(f'Files: {r[\"total_files\"]}, Words: {r[\"total_words\"]}')
"

Then follow the knowledge mode workflow (Step 3b) — read each .md file, extract domain concepts as nodes, create cross-document edges. Skip AST pass entirely (no code to analyze in the vault).

Expected output for our vault (~79 notes, ~52K words):

  • Nodes: ~100-150 domain concepts (projects, technologies, people, decisions, processes)
  • Edges: ~150-250 cross-document relationships
  • God nodes: WiseChef, Cognee, Adam, Agent Coordination, Infrastructure (not function names)
  • Communities: Business strategy, Agent infra, Knowledge systems, Client operations, Development tools

Verification

which graphify                                    # Should show ~/.local/bin/graphify
PYTHON=${HOME}/.local/share/pipx/venvs/graphifyy/bin/python
$PYTHON -c "import graphify; print('OK')"         # Should print OK
$PYTHON -c "from graphify.detect import detect; print('detect OK')"
$PYTHON -c "from graphify.build import build_from_json; print('build OK')"