Your code, semantically searchable.

Local-first RAG engine that indexes your codebase AND documentation, searchable through MCP.
No cloud. No GPU. No data leaves your machine.

100% Offline · No GPU Required · MCP Native · Multi-Language
abyss -- semantic search
$ abyss query "how is payment processed?"

1. src/Orders/OrderService.cs:5-18
   ProcessPayment() -- validates order and charges amount
   Score: 0.94 | kind: method | callers: OrderController#Post()

2. src/Payment/PaymentProcessor.py:12-28
   charge() -- executes payment via gateway adapter
   Score: 0.87 | kind: function | callees: GatewayAdapter#submit()

3. docs/architecture.md -- "Payment Flow"
   Section describing the end-to-end payment pipeline
   Score: 0.82 | chunk_type: document | word_count: 245

3 results in 42ms -- all data stays local
Features

Everything your AI workflow needs

From code parsing to semantic search, Abyss handles the full pipeline locally.

Semantic Search

Search by meaning, not keywords. Ask "how is authentication handled?" and find the exact functions -- even if the word "auth" never appears.
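Under the hood, "meaning" is a vector comparison: both the question and every chunk are embedded, and results are ranked by cosine similarity (Abyss uses MiniLM embeddings with 384 dimensions; the toy 3-dimensional vectors below are purely illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" -- real models produce hundreds of dimensions.
query = [0.9, 0.1, 0.0]
chunks = {
    "handleUserLogin()": [0.8, 0.2, 0.1],   # close in meaning to the query
    "renderFooter()":    [0.0, 0.1, 0.9],   # unrelated
}
ranked = sorted(chunks, key=lambda name: cosine_similarity(query, chunks[name]),
                reverse=True)
print(ranked[0])  # handleUserLogin() scores highest
```

This is why "authentication" matches `handleUserLogin()` even when the word "auth" never appears: the ranking is over vectors, not strings.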

AST-Aware Code Parsing

Tree-sitter splits code at syntactic boundaries -- methods, classes, functions -- not arbitrary line counts. Supports C#, Python, Java, TypeScript, and more.
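Abyss does this with Tree-sitter grammars; as an analogous stdlib sketch (not Abyss's actual chunker), here is how splitting a Python file at function boundaries differs from cutting every N lines -- each chunk is a complete definition, never a truncated body:

```python
import ast

source = '''\
def validate(order):
    if not order:
        raise ValueError("empty order")
    return True

def charge(order, gateway):
    gateway.submit(order)
'''

# Syntactic chunking: one chunk per top-level function, never mid-body.
tree = ast.parse(source)
chunks = [
    ast.get_source_segment(source, node)
    for node in tree.body
    if isinstance(node, ast.FunctionDef)
]
for chunk in chunks:
    print(chunk.splitlines()[0])  # the "def ..." line of each chunk
```

A fixed-size splitter would happily cut inside `validate`'s `if` block; a syntax-aware one cannot.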

MCP Native

8 tools + 3 resources accessible from VS Code, Claude Desktop, and any MCP-compatible client. Zero configuration friction.

Universal Document Ingestion

PDF, DOCX, PPTX, images (OCR), Markdown, Jupyter notebooks, CSV, JSON, XML -- all converted and indexed in one unified vector database.

SCIP Call Graph Enrichment

Optional SCIP indexing adds caller/callee relationships, symbol kinds, and documentation to every code chunk. Ask "who calls ProcessPayment()?" and get real answers.
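The matching can be pictured as an interval lookup: a chunk identified by file and line range inherits the metadata of the indexed symbol whose range contains it. A simplified sketch (the record shapes here are hypothetical; real SCIP indexes are protobuf-encoded):

```python
# Hypothetical, simplified symbol records derived from a SCIP index.
symbols = [
    {"file": "src/Orders/OrderService.cs", "start": 5, "end": 18,
     "symbol": "ProcessPayment", "kind": "method",
     "callers": ["OrderController#Post()"]},
]

def enrich(chunk: dict) -> dict:
    """Attach call-graph metadata to a chunk whose line range
    falls inside an indexed symbol's range in the same file."""
    for sym in symbols:
        if (chunk["file"] == sym["file"]
                and sym["start"] <= chunk["start"]
                and chunk["end"] <= sym["end"]):
            return {**chunk, "symbol": sym["symbol"], "kind": sym["kind"],
                    "callers": sym["callers"]}
    return chunk  # no match: chunk stays un-enriched

chunk = {"file": "src/Orders/OrderService.cs", "start": 5, "end": 18}
print(enrich(chunk)["callers"])  # ['OrderController#Post()']
```

Once `callers` is part of the chunk, "who calls ProcessPayment()?" becomes an ordinary metadata lookup rather than a text search.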

Privacy by Design

Everything runs on your machine. The embedding model downloads once (~90MB) and is cached locally. No cloud APIs, no telemetry, no data exfiltration.

Advanced Query Filters

Filter by language, symbol kind, file path, line range, chunk type, or full-text substring. Precision search across your entire codebase.
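Conceptually, each filter is a predicate ANDed over a chunk's metadata before ranking. A toy sketch of how such filters compose (the field names mirror those above, but the implementation is hypothetical):

```python
def matches(meta: dict, filters: dict) -> bool:
    """True if a chunk's metadata satisfies every supplied filter (AND semantics)."""
    if langs := filters.get("languages"):
        if meta.get("language") not in langs:
            return False
    if kinds := filters.get("kinds"):
        if meta.get("kind") not in kinds:
            return False
    if fragment := filters.get("path_contains"):
        if fragment not in meta.get("file", ""):
            return False
    return True

meta = {"language": "csharp", "kind": "method",
        "file": "src/Payment/PaymentProcessor.cs"}
print(matches(meta, {"languages": ["csharp"], "path_contains": "Payment"}))  # True
print(matches(meta, {"kinds": ["class"]}))                                   # False
```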

Persistent ChromaDB Storage

Indexed data survives restarts in a local SQLite-backed ChromaDB database. No external server needed. Index once, query forever.

Debug HTML Reports

Per-file HTML debug reports showing every chunk, its metadata, and the exact text sent to the embedding model. Diagnose chunking quality visually.

Architecture

Five-stage ingestion pipeline

Files flow through discovery, parsing, enrichment, embedding, and storage -- fully automated.

1. File Discovery

Recursive glob traversal with size limits (10MB max), directory exclusions, extension filtering, and automatic file type classification (code, document, structured, unknown).

.cs .py .ts .md .pdf .docx .json .xml +20 more
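A stripped-down version of such a discovery pass, using only the stdlib (the limits, exclusion set, and extension buckets below are illustrative, not Abyss's exact configuration):

```python
from pathlib import Path

MAX_SIZE = 10 * 1024 * 1024                 # 10 MB cap, as above
EXCLUDED_DIRS = {".git", "node_modules", ".venv"}
CODE_EXTS = {".cs", ".py", ".ts"}
DOC_EXTS = {".md", ".pdf", ".docx"}

def classify(path: Path) -> str:
    """Bucket a file as code, document, or unknown by extension."""
    if path.suffix in CODE_EXTS:
        return "code"
    if path.suffix in DOC_EXTS:
        return "document"
    return "unknown"

def discover(root: Path):
    """Yield (path, file_type) for every indexable file under root."""
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        if any(part in EXCLUDED_DIRS for part in path.parts):
            continue
        if path.stat().st_size > MAX_SIZE:
            continue
        yield path, classify(path)
```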
2. Smart Parsing

Four specialized parsers dispatch by file type. CodeParser uses Tree-sitter AST grammars. DocParser leverages MarkItDown + header-based sectioning. JsonParser and XmlParser handle structured data hierarchically.

3. SCIP Enrichment (optional)

Matches each code chunk to a SCIP index by file + line range. Injects symbol, kind, callers[], callees[], and documentation -- enabling call-graph-aware search.

4. Semantic Header Injection

EmbedBuilder prepends a structured header to each chunk's text -- file path, language, symbol name, kind, callers, and callees. This dramatically improves embedding quality and search relevance.

enriched chunk
// File : src/Orders/OrderService.cs
// Language : csharp
// Symbol : ValidateOrder
// Kind : method
// Calls : PriceCalculator#Compute()
// Called by: OrderController#Post()
public void ValidateOrder(Order order)
{
    if (order.Items.Count == 0)
        throw new InvalidOperationException();
}
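A header builder along these lines is easy to sketch (field names follow the example above; the real EmbedBuilder may differ):

```python
def build_embed_text(chunk: dict) -> str:
    """Prepend a structured comment header to a chunk's text before embedding."""
    header = [
        f"// File    : {chunk['file']}",
        f"// Language: {chunk['language']}",
        f"// Symbol  : {chunk['symbol']}",
        f"// Kind    : {chunk['kind']}",
    ]
    if chunk.get("callees"):
        header.append(f"// Calls    : {', '.join(chunk['callees'])}")
    if chunk.get("callers"):
        header.append(f"// Called by: {', '.join(chunk['callers'])}")
    return "\n".join(header) + "\n" + chunk["text"]

chunk = {"file": "src/Orders/OrderService.cs", "language": "csharp",
         "symbol": "ValidateOrder", "kind": "method",
         "callees": ["PriceCalculator#Compute()"],
         "callers": ["OrderController#Post()"],
         "text": "public void ValidateOrder(Order order) { ... }"}
print(build_embed_text(chunk).splitlines()[0])
```

Because the embedding model sees the file path, symbol name, and call graph alongside the code, a query like "order validation" can land on this chunk even if the body itself never says "validation".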
5. Embedding & Storage

Batch-encodes enriched text with all-MiniLM-L6-v2, then upserts into ChromaDB with cosine similarity. Files are tracked in a document registry with hash, size, and timestamp for incremental re-indexing.
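Incremental re-indexing follows directly from that registry: if a file's content hash still matches, it is skipped. A stdlib sketch of the check (the registry shape here is hypothetical):

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """Content hash used to detect modified files."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def needs_reindex(path: Path, registry: dict) -> bool:
    """Re-embed only when the stored hash no longer matches the file on disk."""
    entry = registry.get(str(path))
    return entry is None or entry["hash"] != file_hash(path)
```

Unchanged files cost one hash computation instead of a full parse-and-embed pass, which is what makes re-running indexing on a large repo cheap.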

Why Abyss

Semantically link your docs and codebase

Text search finds strings. Abyss finds meaning.

grep / text search
$ grep -r "auth" src/
utils/string.ts:42: // authorize string
Header.tsx:8: author="John"
oauth.ts:15: authToken = null
config.ts:3: // authentication cfg
user.ts:89: reauthorize()
... 841 more results
847 results. Where is the actual auth logic?
abyss / semantic search
$ abyss query "user authentication flow"
src/auth/login.ts:23-45
handleUserLogin() -- validates credentials
Score: 0.94
src/middleware/session.ts:12-28
verifySession() -- checks JWT token
Score: 0.87
2 results. Exactly what you need.
Built With

Battle-tested technology stack

MCP -- Protocol layer
LlamaIndex -- Orchestration
Tree-sitter -- AST parsing
MarkItDown -- Doc conversion
MiniLM-L6-v2 -- Embeddings
ChromaDB -- Vector storage
SCIP -- Call graph index
Python 3.13 -- Runtime
Languages & Formats

Index everything

Source code with AST-aware parsing, documents with intelligent sectioning, structured data with hierarchical decomposition.

Source Code

Tree-sitter
C# Python Java TypeScript JavaScript HTML TSX / JSX

Documents

MarkItDown
Markdown PDF DOCX PPTX XLSX EPUB Images (OCR) CSV Jupyter

Structured

Dedicated
JSON XML .csproj .props .config
Integrations

Plug into your AI workflow

MCP server with 8 tools and 3 resources. Works with any MCP-compatible client.

VS Code

GitHub Copilot + MCP integration

Claude Desktop

Native MCP client support

Any MCP Client

Standard protocol, zero lock-in

Available MCP Tools

index_directory -- Recursively index a directory with include/exclude filters
query -- Semantic search with filters: language, kind, file path, line range, text
list_documents -- List all indexed files with metadata: name, date, size, chunk count
list_sources -- Unique metadata values: file paths, languages, kinds, chunk types
replace_document -- Re-index a single file after modification
remove_document -- Remove a file and all its chunks from the database
list_filterable_fields -- Describe all filterable metadata fields with types and operators
clear_database -- Erase all chunks and document registry (requires confirmation)
Quick Start

Up and running in minutes

From clone to semantic search in four steps.

1. Clone & install

powershell
$ git clone https://github.com/spashx/abyss.git
$ cd abyss
$ .\install-deps-for-dev.ps1
2. Configure MCP in VS Code

.vscode/mcp.json
{
  "servers": {
    "abyss": {
      "command": "<path>\\.venv\\Scripts\\python.exe",
      "args": ["-m", "abyss"],
      "env": { "PYTHONPATH": "<path>\\src" }
    }
  }
}
3. Index your codebase

Use the index_directory MCP tool from your AI assistant:

MCP tool call
index_directory({
  "path": "D:/repos/MyProject",
  "include_extensions": [".cs", ".md"]
})
4. Search with natural language

MCP query
query({
  "question": "how is payment processed?",
  "top_k": 8,
  "filters": { "languages": ["csharp"] }
})
Real-world output

Example of results with Abyss

A single structured prompt to an AI agent -- backed by Abyss MCP queries -- is enough to produce a detailed analysis and report.

The prompt

ilspy-analysis-prompt.original.md
The ILSpy application is a .NET decompiler. The folder ILSpy contains the root application. I need a detailed analysis of the "Search" features: 1) Search panel UI and model, 2) Search result factory and filtering.

# 1) MANDATORY - KNOWLEDGE BASE USAGE
For any information about ILSpy, ALWAYS USE the Abyss MCP server and its tools:
- list_documents -- list all indexed documents
- query -- semantic search + filtering
- list_sources -- list sources and filter values

# 2) ACTIONS TO PERFORM
As a seasoned expert in C#/.NET programming and professional senior software architect, perform these tasks:
2.1) Search the Abyss knowledge base for all entities (modules, classes, methods) involved in "Search", "Search panel UI and model", and "Search result factory and filtering". Identify the relations between these entities.
2.2) Identify the main workflows (dynamic calls) between the entities that fulfill the search features. For each call, identify the objects that are used (C/R/U/D).
2.3) For each feature, produce a Mermaid class diagram representing the entities, a Mermaid sequence diagram for the dynamic calls, and the C/R/U/D status of the used objects.
2.4) Generate a slick, professional HTML report with the information above:
- an executive summary of the report
- the list of features with associated diagrams and information
- a list of recommendations (quality, cybersecurity) about the implementation of the features
Overall it is HIGHLY IMPORTANT to have a PROFESSIONAL-LOOKING REPORT.

Prompt reformulation request before launching the execution: "To Sonnet 4.6: Create a detailed plan with tasks for the implementation of document analysis-prompt.original.md. Reformulate the requirements document with EARS notation to have a precise, non-ambiguous implementation plan in order to get the BEST POSSIBLE RESULT with an agentic AI like you."
Zero manual exploration. The agent autonomously issued multiple query and list_sources calls against the Abyss index to build the full picture before writing a single line of the report.

The result

ilspy-search-analysis.html
Entity inventory Mermaid class & sequence diagrams CRUD reference Security recommendations

The prompt

investigate-cve.original.md
# 1) CONTEXT AND OBJECTIVES
The cdxgen application is an SBOM generator/analyzer that provides several features. The folder C:\dev\repos\cdxgen contains the whole core source of the application. This source code + documentation is fully indexed into the ABYSS RAG knowledge database. There is currently an open issue: (Investigate CVE-2025-69873 #3484). The repo owner did a pre-analysis. I want a clear and detailed analysis of the situation.

# 2) MANDATORY - KNOWLEDGE BASE USAGE
For any information about cdxgen, ALWAYS USE IN PRIORITY the Abyss MCP server:
- list_documents -- list all indexed documents
- query -- semantic search with multi-criteria filtering
- list_sources -- list indexed sources and filter values
- list_filterable_fields -- describe filterable metadata fields

# 3) ANALYSIS OF THE ISSUE
3.1) Read the current GitHub issue that deals with the CVE (Investigate CVE-2025-69873 #3484).
3.2) As a seasoned JS/TS developer and PROFESSIONAL SENIOR CYBERSECURITY ENGINEER, perform a detailed analysis of the source code impacted by the vulnerability. Explain the context, expose the whole call tree, establish potential links to CWE and/or OWASP standard practices, and provide insights for the remediation, if any. Use Mermaid diagrams to illustrate your analysis.
3.3) Generate a slick, professional HTML report in cdxgen-cve-2025-69873.analysis.html with 3 chapters of information:
- Level 1: executive summary, without too many technical details
- Level 2: detailed analysis of 3.2)
- Level 3: takeaways for a junior cybersecurity analyst
Zero manual exploration. The agent autonomously queried the Abyss index to retrieve CVE context, call trees, and impacted code paths before writing a single line of the report.

The result

cdxgen-cve-2025-69873.analysis.html
Executive summary CWE / OWASP mapping Call tree & Mermaid diagrams Junior analyst takeaways

Ready to search your code by meaning?

Open source. Privacy-first. No cloud required.