Your code, semantically searchable.

Local-first RAG engine that indexes your codebase AND documentation, searchable through MCP.
No cloud. No GPU. No data leaves your machine.

100% Offline · No GPU Required · MCP Native · Multi-Language
abyss -- semantic search
$ abyss query "how is payment processed?"

1. src/Orders/OrderService.cs:5-18
   ProcessPayment() -- validates order and charges amount
   Score: 0.94 | kind: method | callers: OrderController#Post()

2. src/Payment/PaymentProcessor.py:12-28
   charge() -- executes payment via gateway adapter
   Score: 0.87 | kind: function | callees: GatewayAdapter#submit()

3. docs/architecture.md -- "Payment Flow"
   Section describing the end-to-end payment pipeline
   Score: 0.82 | chunk_type: document | word_count: 245

3 results in 42ms -- all data stays local
Features

Everything your AI workflow needs

From code parsing to semantic search, Abyss handles the full pipeline locally.

Semantic Search

Search by meaning, not keywords. Ask "how is authentication handled?" and find the exact functions -- even if the word "auth" never appears.
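Under the hood, "meaning" is a vector comparison: both the question and every chunk are embedded, and results are ranked by cosine similarity (Abyss uses MiniLM embeddings with 384 dimensions; the toy 3-dimensional vectors below are purely illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" -- real models produce hundreds of dimensions.
query = [0.9, 0.1, 0.0]
chunks = {
    "handleUserLogin()": [0.8, 0.2, 0.1],   # close in meaning to the query
    "renderFooter()":    [0.0, 0.1, 0.9],   # unrelated
}
ranked = sorted(chunks, key=lambda name: cosine_similarity(query, chunks[name]),
                reverse=True)
print(ranked[0])  # handleUserLogin() scores highest
```

This is why "authentication" matches `handleUserLogin()` even when the word "auth" never appears: the ranking is over vectors, not strings.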

AST-Aware Code Parsing

Tree-sitter splits code at syntactic boundaries -- methods, classes, functions -- not arbitrary line counts. Supports C#, Python, Java, TypeScript, and more.
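Abyss does this with Tree-sitter grammars; as an analogous stdlib sketch (not Abyss's actual chunker), here is how splitting a Python file at function boundaries differs from cutting every N lines -- each chunk is a complete definition, never a truncated body:

```python
import ast

source = '''\
def validate(order):
    if not order:
        raise ValueError("empty order")
    return True

def charge(order, gateway):
    gateway.submit(order)
'''

# Syntactic chunking: one chunk per top-level function, never mid-body.
tree = ast.parse(source)
chunks = [
    ast.get_source_segment(source, node)
    for node in tree.body
    if isinstance(node, ast.FunctionDef)
]
for chunk in chunks:
    print(chunk.splitlines()[0])  # the "def ..." line of each chunk
```

A fixed-size splitter would happily cut inside `validate`'s `if` block; a syntax-aware one cannot.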

MCP Native

8 tools + 3 resources accessible from VS Code, Claude Desktop, and any MCP-compatible client. Zero configuration friction.

Universal Document Ingestion

PDF, DOCX, PPTX, images (OCR), Markdown, Jupyter notebooks, CSV, JSON, XML -- all converted and indexed in one unified vector database.

SCIP Call Graph Enrichment

Optional SCIP indexing adds caller/callee relationships, symbol kinds, and documentation to every code chunk. Ask "who calls ProcessPayment()?" and get real answers.
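The matching can be pictured as an interval lookup: a chunk identified by file and line range inherits the metadata of the indexed symbol whose range contains it. A simplified sketch (the record shapes here are hypothetical; real SCIP indexes are protobuf-encoded):

```python
# Hypothetical, simplified symbol records derived from a SCIP index.
symbols = [
    {"file": "src/Orders/OrderService.cs", "start": 5, "end": 18,
     "symbol": "ProcessPayment", "kind": "method",
     "callers": ["OrderController#Post()"]},
]

def enrich(chunk: dict) -> dict:
    """Attach call-graph metadata to a chunk whose line range
    falls inside an indexed symbol's range in the same file."""
    for sym in symbols:
        if (chunk["file"] == sym["file"]
                and sym["start"] <= chunk["start"]
                and chunk["end"] <= sym["end"]):
            return {**chunk, "symbol": sym["symbol"], "kind": sym["kind"],
                    "callers": sym["callers"]}
    return chunk  # no match: chunk stays un-enriched

chunk = {"file": "src/Orders/OrderService.cs", "start": 5, "end": 18}
print(enrich(chunk)["callers"])  # ['OrderController#Post()']
```

Once `callers` is part of the chunk, "who calls ProcessPayment()?" becomes an ordinary metadata lookup rather than a text search.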

Privacy by Design

Everything runs on your machine. The embedding model downloads once (~90MB) and is cached locally. No cloud APIs, no telemetry, no data exfiltration.

Advanced Query Filters

Filter by language, symbol kind, file path, line range, chunk type, or full-text substring. Precision search across your entire codebase.
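Conceptually, each filter is a predicate ANDed over a chunk's metadata before ranking. A toy sketch of how such filters compose (the field names mirror those above, but the implementation is hypothetical):

```python
def matches(meta: dict, filters: dict) -> bool:
    """True if a chunk's metadata satisfies every supplied filter (AND semantics)."""
    if langs := filters.get("languages"):
        if meta.get("language") not in langs:
            return False
    if kinds := filters.get("kinds"):
        if meta.get("kind") not in kinds:
            return False
    if fragment := filters.get("path_contains"):
        if fragment not in meta.get("file", ""):
            return False
    return True

meta = {"language": "csharp", "kind": "method",
        "file": "src/Payment/PaymentProcessor.cs"}
print(matches(meta, {"languages": ["csharp"], "path_contains": "Payment"}))  # True
print(matches(meta, {"kinds": ["class"]}))                                   # False
```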

Persistent ChromaDB Storage

Indexed data survives restarts in a local SQLite-backed ChromaDB database. No external server needed. Index once, query forever.

Debug HTML Reports

Per-file HTML debug reports showing every chunk, its metadata, and the exact text sent to the embedding model. Diagnose chunking quality visually.

Architecture

Five-stage ingestion pipeline

Files flow through discovery, parsing, enrichment, embedding, and storage -- fully automated.

1. File Discovery

Recursive glob traversal with size limits (10MB max), directory exclusions, extension filtering, and automatic file type classification (code, document, structured, unknown).

.cs .py .ts .md .pdf .docx .json .xml +20 more
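A stripped-down version of such a discovery pass, using only the stdlib (the limits, exclusion set, and extension buckets below are illustrative, not Abyss's exact configuration):

```python
from pathlib import Path

MAX_SIZE = 10 * 1024 * 1024                 # 10 MB cap, as above
EXCLUDED_DIRS = {".git", "node_modules", ".venv"}
CODE_EXTS = {".cs", ".py", ".ts"}
DOC_EXTS = {".md", ".pdf", ".docx"}

def classify(path: Path) -> str:
    """Bucket a file as code, document, or unknown by extension."""
    if path.suffix in CODE_EXTS:
        return "code"
    if path.suffix in DOC_EXTS:
        return "document"
    return "unknown"

def discover(root: Path):
    """Yield (path, file_type) for every indexable file under root."""
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        if any(part in EXCLUDED_DIRS for part in path.parts):
            continue
        if path.stat().st_size > MAX_SIZE:
            continue
        yield path, classify(path)
```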
2. Smart Parsing

Four specialized parsers dispatch by file type. CodeParser uses Tree-sitter AST grammars. DocParser leverages MarkItDown + header-based sectioning. JsonParser and XmlParser handle structured data hierarchically.

3. SCIP Enrichment (optional)

Matches each code chunk to a SCIP index by file + line range. Injects symbol, kind, callers[], callees[], and documentation -- enabling call-graph-aware search.

4. Semantic Header Injection

EmbedBuilder prepends a structured header to each chunk's text -- file path, language, symbol name, kind, callers, and callees. This dramatically improves embedding quality and search relevance.

enriched chunk
// File : src/Orders/OrderService.cs
// Language : csharp
// Symbol : ValidateOrder
// Kind : method
// Calls : PriceCalculator#Compute()
// Called by: OrderController#Post()
public void ValidateOrder(Order order)
{
    if (order.Items.Count == 0)
        throw new InvalidOperationException();
}
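A header builder along these lines is easy to sketch (field names follow the example above; the real EmbedBuilder may differ):

```python
def build_embed_text(chunk: dict) -> str:
    """Prepend a structured comment header to a chunk's text before embedding."""
    header = [
        f"// File    : {chunk['file']}",
        f"// Language: {chunk['language']}",
        f"// Symbol  : {chunk['symbol']}",
        f"// Kind    : {chunk['kind']}",
    ]
    if chunk.get("callees"):
        header.append(f"// Calls    : {', '.join(chunk['callees'])}")
    if chunk.get("callers"):
        header.append(f"// Called by: {', '.join(chunk['callers'])}")
    return "\n".join(header) + "\n" + chunk["text"]

chunk = {"file": "src/Orders/OrderService.cs", "language": "csharp",
         "symbol": "ValidateOrder", "kind": "method",
         "callees": ["PriceCalculator#Compute()"],
         "callers": ["OrderController#Post()"],
         "text": "public void ValidateOrder(Order order) { ... }"}
print(build_embed_text(chunk).splitlines()[0])
```

Because the embedding model sees the file path, symbol name, and call graph alongside the code, a query like "order validation" can land on this chunk even if the body itself never says "validation".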
5. Embedding & Storage

Batch-encodes enriched text with all-MiniLM-L6-v2, then upserts into ChromaDB with cosine similarity. Files are tracked in a document registry with hash, size, and timestamp for incremental re-indexing.
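Incremental re-indexing follows directly from that registry: if a file's content hash still matches, it is skipped. A stdlib sketch of the check (the registry shape here is hypothetical):

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """Content hash used to detect modified files."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def needs_reindex(path: Path, registry: dict) -> bool:
    """Re-embed only when the stored hash no longer matches the file on disk."""
    entry = registry.get(str(path))
    return entry is None or entry["hash"] != file_hash(path)
```

Unchanged files cost one hash computation instead of a full parse-and-embed pass, which is what makes re-running indexing on a large repo cheap.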

Why Abyss

Semantically link your docs and codebase

Text search finds strings. Abyss finds meaning.

grep / text search
$ grep -r "auth" src/
utils/string.ts:42: // authorize string
Header.tsx:8: author="John"
oauth.ts:15: authToken = null
config.ts:3: // authentication cfg
user.ts:89: reauthorize()
... 841 more results
847 results. Where is the actual auth logic?
abyss / semantic search
$ abyss query "user authentication flow"
src/auth/login.ts:23-45
handleUserLogin() -- validates credentials
Score: 0.94
src/middleware/session.ts:12-28
verifySession() -- checks JWT token
Score: 0.87
2 results. Exactly what you need.
Built With

Battle-tested technology stack

MCP -- Protocol layer
LlamaIndex -- Orchestration
Tree-sitter -- AST parsing
MarkItDown -- Doc conversion
MiniLM-L6-v2 -- Embeddings
ChromaDB -- Vector storage
SCIP -- Call graph index
Python 3.13 -- Runtime
Languages & Formats

Index everything

Source code with AST-aware parsing, documents with intelligent sectioning, structured data with hierarchical decomposition.

Source Code

Tree-sitter
C# Python Java TypeScript JavaScript HTML TSX / JSX

Documents

MarkItDown
Markdown PDF DOCX PPTX XLSX EPUB Images (OCR) CSV Jupyter

Structured

Dedicated
JSON XML .csproj .props .config
Integrations

Plug into your AI workflow

MCP server with 8 tools and 3 resources. Works with any MCP-compatible client.

VS Code

GitHub Copilot + MCP integration

Claude Desktop

Native MCP client support

Any MCP Client

Standard protocol, zero lock-in

Available MCP Tools

index_directory -- Recursively index a directory with include/exclude filters
query -- Semantic search with filters: language, kind, file path, line range, text
list_documents -- List all indexed files with metadata: name, date, size, chunk count
list_sources -- Unique metadata values: file paths, languages, kinds, chunk types
replace_document -- Re-index a single file after modification
remove_document -- Remove a file and all its chunks from the database
list_filterable_fields -- Describe all filterable metadata fields with types and operators
clear_database -- Erase all chunks and document registry (requires confirmation)
Quick Start

Up and running in minutes

From clone to semantic search in four steps.

1. Clone & install

powershell
$ git clone https://github.com/spashx/abyss.git
$ cd abyss
$ .\install-deps-for-dev.ps1
2. Configure MCP in VS Code

.vscode/mcp.json
{
  "servers": {
    "abyss": {
      "command": "<path>\\.venv\\Scripts\\python.exe",
      "args": ["-m", "abyss"],
      "env": { "PYTHONPATH": "<path>\\src" }
    }
  }
}
3. Index your codebase

Use the index_directory MCP tool from your AI assistant:

MCP tool call
index_directory({
  "path": "D:/repos/MyProject",
  "include_extensions": [".cs", ".md"]
})
4. Search with natural language

MCP query
query({
  "question": "how is payment processed?",
  "top_k": 8,
  "filters": { "languages": ["csharp"] }
})
Real-world output

Example of results with Abyss

A single structured prompt to an AI agent -- backed by Abyss MCP queries -- is enough to produce a detailed analysis and report.

The prompt

ilspy-analysis-prompt.original.md
The ILSpy application is a .NET decompiler. The folder ILSpy contains the root application. I need a detailed analysis of the "Search" features: 1) Search panel UI and model, 2) Search result factory and filtering.

# 1) MANDATORY - KNOWLEDGE BASE USAGE
For any information about ILSpy, ALWAYS USE the Abyss MCP server and its tools:
- list_documents -- list all indexed documents
- query -- semantic search + filtering
- list_sources -- list sources and filter values

# 2) ACTIONS TO PERFORM
As a seasoned expert in C#/.NET programming and professional senior software architect, perform these tasks:
2.1) Search the Abyss knowledge base for all entities (modules, classes, methods) involved in "Search", "Search panel UI and model", and "Search result factory and filtering". Identify the relations between these entities.
2.2) Identify the main workflows (dynamic calls) between the entities that fulfill the search features. For each call, identify the objects that are used (C/R/U/D).
2.3) For each feature, produce a Mermaid class diagram representing the entities, a Mermaid sequence diagram for the dynamic calls, and the C/R/U/D status of the used objects.
2.4) Generate a slick, professional HTML report with the information above:
- an executive summary of the report
- the list of features with associated diagrams and information
- a list of recommendations (quality, cybersecurity) about the implementation of the features
Overall it is HIGHLY IMPORTANT to have a PROFESSIONAL-LOOKING REPORT.

Prompt reformulation request before launching the execution: "To Sonnet 4.6: Create a detailed plan with tasks for the implementation of document analysis-prompt.original.md. Reformulate the requirements document with EARS notation to have a precise, non-ambiguous implementation plan in order to get the BEST POSSIBLE RESULT with an agentic AI like you."
Zero manual exploration. The agent autonomously issued multiple query and list_sources calls against the Abyss index to build the full picture before writing a single line of the report.

The result

ilspy-search-analysis.html
Entity inventory Mermaid class & sequence diagrams CRUD reference Security recommendations

The prompt

investigate-cve.original.md
# 1) CONTEXT AND OBJECTIVES
The cdxgen application is an SBOM generator/analyzer that provides several features. The folder C:\dev\repos\cdxgen contains the whole core source of the application. This source code + documentation is fully indexed into the ABYSS RAG knowledge database. There is currently an open issue: (Investigate CVE-2025-69873 #3484). The repo owner did a pre-analysis. I want a clear and detailed analysis of the situation.

# 2) MANDATORY - KNOWLEDGE BASE USAGE
For any information about cdxgen, ALWAYS USE IN PRIORITY the Abyss MCP server:
- list_documents -- list all indexed documents
- query -- semantic search with multi-criteria filtering
- list_sources -- list indexed sources and filter values
- list_filterable_fields -- describe filterable metadata fields

# 3) ANALYSIS OF THE ISSUE
3.1) Read the current GitHub issue that deals with the CVE (Investigate CVE-2025-69873 #3484).
3.2) As a seasoned JS/TS developer and PROFESSIONAL SENIOR CYBERSECURITY ENGINEER, perform a detailed analysis of the source code impacted by the vulnerability. Explain the context, expose the whole call tree, establish potential links to CWE and/or OWASP standard practices, and provide insights for the remediation, if any. Use Mermaid diagrams to illustrate your analysis.
3.3) Generate a slick, professional HTML report in cdxgen-cve-2025-69873.analysis.html with 3 chapters of information:
- Level 1: executive summary, without too many technical details
- Level 2: detailed analysis of 3.2)
- Level 3: takeaways for a junior cybersecurity analyst
Zero manual exploration. The agent autonomously queried the Abyss index to retrieve CVE context, call trees, and impacted code paths before writing a single line of the report.

The result

cdxgen-cve-2025-69873.analysis.html
Executive summary CWE / OWASP mapping Call tree & Mermaid diagrams Junior analyst takeaways

Ready to search your code by meaning?

Open source. Privacy-first. No cloud required.