Semble: Intelligent Code Search That Slashes Token Usage by 98%
The Problem with Traditional Code Search for AI Agents
When AI coding assistants like Claude Code tackle large codebases, they often rely on grep to locate relevant code. But grep is a blunt instrument: it matches literal patterns line by line, so the agent ends up reading long match lists and whole files, consuming many tokens while still missing code that is relevant but worded differently. The result is wasted compute, slower responses, and incomplete context for the agent. Existing alternatives demand GPU-powered indexing, require API keys, or suffer from poor retrieval quality. Developers need a tool that is fast, accurate, and economical with tokens.
Introducing Semble: A Token-Efficient Alternative
Semble is an open-source code search engine built specifically for AI agents. Developed by Stephan and Thomas, it addresses the token waste problem head-on. By combining static Model2Vec embeddings (using their custom model, potion-code-16M) with BM25, fused via Reciprocal Rank Fusion (RRF) and reranked using code-aware signals, Semble achieves state-of-the-art retrieval without any transformers. This means everything runs on CPU, making it accessible and inexpensive.
How It Works
The key is the hybrid approach: static embeddings capture semantic meaning without the overhead of running a transformer model, while BM25 provides traditional keyword matching. RRF blends the two rankings, and a lightweight reranking step refines the results using code-specific heuristics. The entire pipeline is optimized for speed: indexing a repository typically takes ~250 milliseconds, and each query completes in ~1.5 milliseconds on CPU.
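To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in Python. It is not Semble's actual code: the function name, the file names, and the constant k=60 (a value commonly used for RRF) are assumptions for illustration, and the code-aware reranking step is left out.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1), and the contributions are summed.
    k=60 is an assumed default, not a value confirmed for Semble.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from the two retrievers for the same query.
semantic_ranking = ["auth.py", "session.py", "utils.py"]
bm25_ranking = ["session.py", "login.py", "auth.py"]

fused = reciprocal_rank_fusion([semantic_ranking, bm25_ranking])
print(fused)
```

Because RRF only looks at ranks, the embedding similarities and BM25 scores never need to be put on a common scale. A file ranked well by both retrievers (session.py here) rises above one that only a single retriever ranks first.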
Benchmark Performance: Near-Transformer Accuracy
On a benchmark of approximately 1,250 query/document pairs across 63 repositories and 19 programming languages, Semble delivers remarkable results:
- Token reduction: Uses 98% fewer tokens than the traditional grep+read approach
- Accuracy: Achieves an NDCG@10 of 0.854, which is 99% of the performance of a 137M-parameter code-trained transformer model
- Speed: About 200× faster than that transformer setup
These numbers show that Semble nearly matches the retrieval quality of a much heavier transformer model while being dramatically faster and far more token-efficient.
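For readers unfamiliar with the metric, the following sketch shows one common way to compute NDCG@10 for a single query. It is an illustration only: the function names and the linear-gain formulation are assumptions, and the benchmark's exact variant is documented in the Semble repository rather than here.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_rels, judged_rels, k=10):
    """NDCG@k for one query.

    ranked_rels: relevance of each returned result, in returned order.
    judged_rels: relevance of every judged document for the query,
                 used to build the ideal ordering.
    """
    ideal = dcg_at_k(sorted(judged_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0


# One relevant document per query, as in a query/document pair benchmark.
print(ndcg_at_k([1, 0, 0], [1]))  # relevant result returned first
print(ndcg_at_k([0, 0, 1], [1]))  # relevant result returned third
```

A benchmark score is the average of this per-query value over all queries. With a single relevant document per query, the score drops quickly as that document moves down the list, so a high average means the right result usually sits at or near the top.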
Key Features
- Token-efficient: 98% fewer tokens than grep+read
- Fast indexing and querying: ~250 ms to index a typical repo on their benchmark; ~1.5 ms per query on CPU (very large repos may take longer)
- Accurate: 0.854 NDCG@10, 99% of the best transformer setup tested
- MCP server: Plugs into Claude Code, Cursor, Codex, and OpenCode as a drop-in replacement for grep-based search
- Zero configuration: No API keys, no GPU, no external services required
Getting Started
Integrating Semble with Claude Code is a one-liner:
    claude mcp add semble -s user -- uvx --from "semble[mcp]" semble

For other environments (Cursor, Codex, OpenCode), check the README for detailed instructions.
Why This Matters for AI Agents
Agents work in loops: they ask a question, gather context, then act. Every token spent on grep or reading full files adds latency and cost. By slashing token usage by 98%, Semble allows agents to operate faster, handle larger codebases, and stay within budget. Because it runs on CPU with no external dependencies, it works immediately out of the box—perfect for local, offline, or air-gapped environments.
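As a back-of-the-envelope illustration of what a 98% reduction means over an agent loop, consider the arithmetic below. Only the 98% figure comes from the benchmark; the per-step baseline of 40,000 tokens and the step count are made-up numbers chosen to show the calculation.

```python
REDUCTION_PERCENT = 98  # figure reported in the benchmark above


def tokens_after_reduction(baseline_tokens, reduction_percent=REDUCTION_PERCENT):
    """Tokens left after cutting a baseline by the given percentage."""
    return baseline_tokens * (100 - reduction_percent) // 100


# Hypothetical agent run: 10 retrieval steps, 40,000 tokens each with grep+read.
steps = 10
baseline_per_step = 40_000

baseline_total = steps * baseline_per_step
reduced_total = steps * tokens_after_reduction(baseline_per_step)

print(baseline_total, reduced_total)
```

Under these assumed numbers, the retrieval cost of the whole run falls from 400,000 tokens to 8,000, which is the difference between exceeding many context windows and fitting comfortably inside one.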
Conclusion
Semble shows that you don’t need massive transformer models for high-quality code retrieval. Its hybrid approach offers a practical, efficient solution for AI coding tools. Whether you’re building a custom agent or using Claude Code, Semble can dramatically reduce token consumption while keeping retrieval accuracy close to that of a much larger transformer model. Try it today and see the difference.
For more details, including the full benchmark methodology and model weights, visit the Semble repository and the benchmarks page. The static model is available on Hugging Face.