Multimodal RAG
2025

Albot

A multimodal RAG system with knowledge graphs, adaptive retrieval, and layered memory.

Multimodal ingestion
Hybrid retrieval
Layered memory
Overview

Albot ingests text, images, audio, and video into an ArangoDB-backed knowledge graph, then answers complex queries through a retrieval stack that blends semantic, graph, and lexical signals.

Language DNA

Python

Albot leans on Python for orchestration, multimodal processing, retrieval logic, and rapid iteration across a fairly ambitious RAG surface. The result feels less like a single chatbot and more like a layered knowledge system.

FastAPI, ArangoDB, Next.js, PyTorch, Whisper
Multimodal retrieval stack
1. Albot is structured as a production-ready multimodal RAG system built around text, image, audio, video, and structured-data ingestion.

2. The backend ties FastAPI, ArangoDB, SQLite, retrieval orchestration, LLM routing, and memory management into one modular flow.

3. The system combines vector search, graph traversal, BM25, adaptive weighting, reasoning traces, namespace-scoped memory, and web fallback.
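One way to picture the blend of vector, graph, and BM25 signals is a weighted sum over normalized scores. The sketch below is illustrative only: the names `SignalWeights`, `min_max`, and `fuse`, the min-max normalization, and the default weights are all assumptions, not Albot's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalWeights:
    # Hypothetical defaults; the source does not state the real weights.
    vector: float = 0.5
    graph: float = 0.3
    bm25: float = 0.2


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores to [0, 1] so signals on different scales are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(
    vector: dict[str, float],
    graph: dict[str, float],
    bm25: dict[str, float],
    weights: SignalWeights = SignalWeights(),
) -> list[tuple[str, float]]:
    """Return (doc_id, fused_score) pairs, best first."""
    signals = [
        (min_max(vector), weights.vector),
        (min_max(graph), weights.graph),
        (min_max(bm25), weights.bm25),
    ]
    fused: dict[str, float] = {}
    for scores, weight in signals:
        for doc, score in scores.items():
            # A document missing from a signal simply contributes 0 for it.
            fused[doc] = fused.get(doc, 0.0) + weight * score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


ranked = fuse(
    vector={"doc_a": 0.9, "doc_b": 0.1},
    graph={"doc_b": 2.0, "doc_c": 1.0},
    bm25={"doc_a": 5.0, "doc_c": 1.0},
)
print(ranked)
```

Normalizing before summing matters because cosine similarities, graph scores, and BM25 scores live on different scales; without it, the largest-valued signal would dominate regardless of the weights.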

Intelligence layer

Handles multimodal ingestion through OCR, VLM-based image parsing, Whisper for audio, frame extraction for video, and structured-file support.

Builds a layered memory model with working, session, and semantic memory instead of treating conversation history as the only memory source.

Supports multi-LLM routing across OpenAI, Anthropic, Gemini, Groq, and OpenRouter.
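The source does not describe the routing policy, so the following is only a minimal sketch of one plausible approach: try providers in a preferred order and fall back when one fails. `LLMRouter` and the stub providers are hypothetical; each provider is modeled as a plain callable so that no real SDK call is implied.

```python
from typing import Callable

# A provider is modeled as a callable: prompt in, completion text out.
Provider = Callable[[str], str]


class LLMRouter:
    """Tries providers in order and returns the first successful completion."""

    def __init__(self, providers: dict[str, Provider], order: list[str]):
        self.providers = providers
        self.order = order

    def complete(self, prompt: str) -> tuple[str, str]:
        errors = []
        for name in self.order:
            try:
                return name, self.providers[name](prompt)
            except Exception as exc:  # any provider failure triggers fallback
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky_provider(prompt: str) -> str:
    # Stub standing in for a provider that is currently unavailable.
    raise TimeoutError("stub outage")


def echo_provider(prompt: str) -> str:
    # Stub standing in for a healthy provider.
    return f"echo: {prompt}"


router = LLMRouter(
    providers={"openai": flaky_provider, "anthropic": echo_provider},
    order=["openai", "anthropic"],
)
print(router.complete("hello"))
```

A real router would likely also weigh cost, latency, or task type when choosing a provider; ordered fallback is just the simplest policy to show.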

Research and operations

Includes an agentic deep-research loop that can plan, search, collect, and synthesize multi-step information.

Persists sessions, traces, and memory artifacts for concurrent chats and cross-session retrieval.

Uses a Docker-first workflow so the full architecture can run locally or on commodity hardware.
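The plan, search, collect, and synthesize steps of the deep-research loop mentioned in this section can be sketched as a small driver function. This is an assumption-laden outline, not the project's code: `deep_research`, its callable parameters, and the `max_steps` budget are hypothetical, and the planner, searcher, and synthesizer are stubs.

```python
from typing import Callable, Iterable


def deep_research(
    question: str,
    plan: Callable[[str], Iterable[str]],
    search: Callable[[str], list[str]],
    synthesize: Callable[[str, list[str]], str],
    max_steps: int = 5,
) -> str:
    """Plan sub-queries, search each one, collect findings, then synthesize."""
    findings: list[str] = []
    seen: set[str] = set()
    queue = list(plan(question))
    steps = 0
    while queue and steps < max_steps:
        sub_query = queue.pop(0)
        if sub_query in seen:
            continue  # skip duplicate sub-queries without spending budget
        seen.add(sub_query)
        findings.extend(search(sub_query))
        steps += 1
    return synthesize(question, findings)


# Stub components, standing in for an LLM planner, a retriever, and an LLM writer.
def stub_plan(question: str) -> list[str]:
    return [f"{question} background", f"{question} details"]


def stub_search(sub_query: str) -> list[str]:
    return [f"note on {sub_query}"]


def stub_synthesize(question: str, findings: list[str]) -> str:
    return f"{question}: " + " | ".join(findings)


print(deep_research("graph retrieval", stub_plan, stub_search, stub_synthesize))
```

The step budget is there so that a planner which keeps producing sub-queries cannot run the loop indefinitely.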

System design

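Retrieval behavior is tuned in two ways: Bayesian weight optimization adjusts how much each retrieval signal contributes, and Personalized PageRank scores graph nodes by their proximity to a chosen set of seed nodes.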

SQLite and ArangoDB play different roles in the storage model, which helps separate sessions, traces, and long-lived knowledge.
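To make the Personalized PageRank idea concrete, here is a minimal pure-Python power-iteration sketch. It is a textbook formulation under stated assumptions (an adjacency-list graph, a uniform restart over the seed nodes, dangling nodes returning their mass to the seeds); the node names are invented and nothing here reflects how Albot actually runs it against ArangoDB.

```python
def personalized_pagerank(
    graph: dict[str, list[str]],
    seeds: set[str],
    alpha: float = 0.85,
    iterations: int = 50,
) -> dict[str, float]:
    """graph maps node -> out-neighbors; seeds are the nodes the walk restarts at."""
    if not seeds:
        raise ValueError("at least one seed node is required")
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs} | set(seeds)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        # With probability (1 - alpha) the walk jumps back to a seed.
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            out = graph.get(n, [])
            if out:
                share = alpha * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the seeds.
                for s in seeds:
                    nxt[s] += alpha * rank[n] / len(seeds)
        rank = nxt
    return rank


# Hypothetical toy graph with two disconnected components.
toy_graph = {
    "albot": ["rag", "arangodb"],
    "rag": ["albot"],
    "arangodb": ["albot"],
    "whisper": ["audio"],
    "audio": ["whisper"],
}
scores = personalized_pagerank(toy_graph, seeds={"albot"})
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

Nodes unreachable from the seeds keep a score of zero, which is what makes the ranking "personalized" to the seed set rather than global.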

Product capabilities

Built pipelines for OCR, VLM-based image understanding, Whisper transcription, and video frame extraction.

Combined vector search, graph traversal, BM25, Personalized PageRank, and Bayesian weighting for adaptive retrieval.

Designed working, session, and semantic memory tiers for cross-session continuity.
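The working, session, and semantic memory tiers can be pictured with a small in-memory model. This is a sketch under assumptions: the class `LayeredMemory`, its method names, and the choice to key semantic memory by namespace are hypothetical, and a real system would persist these tiers rather than hold them in dictionaries.

```python
from collections import deque


class LayeredMemory:
    """Three tiers: a short working window, full session history, long-lived facts."""

    def __init__(self, working_size: int = 4):
        self.working_size = working_size
        self._working: dict[str, deque] = {}     # session_id -> last few turns
        self._session: dict[str, list[str]] = {}  # session_id -> every turn
        self._semantic: dict[str, list[str]] = {}  # namespace -> durable facts

    def observe(self, session_id: str, turn: str) -> None:
        """Record a conversation turn in the working and session tiers."""
        window = self._working.setdefault(session_id, deque(maxlen=self.working_size))
        window.append(turn)  # oldest turn is dropped once the window is full
        self._session.setdefault(session_id, []).append(turn)

    def remember(self, namespace: str, fact: str) -> None:
        """Store a fact that should outlive any single session."""
        self._semantic.setdefault(namespace, []).append(fact)

    def context(self, session_id: str, namespace: str) -> dict[str, list[str]]:
        """Assemble the memory view for one request."""
        return {
            "working": list(self._working.get(session_id, [])),
            "session": list(self._session.get(session_id, [])),
            "semantic": list(self._semantic.get(namespace, [])),
        }


memory = LayeredMemory(working_size=2)
for turn in ["turn 1", "turn 2", "turn 3"]:
    memory.observe("session-a", turn)
memory.remember("project", "Albot stores its knowledge graph in ArangoDB")
print(memory.context("session-a", "project"))
print(memory.context("session-b", "project"))
```

The second call shows the cross-session aspect: a new session starts with empty working and session tiers but still sees the semantic facts in its namespace.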

Workflow

Pipeline flow

1. Content is ingested and transformed into graph-aware knowledge atoms and embeddings.

2. Retrieval balances semantic, lexical, and relational signals depending on the query type.

3. The orchestration layer composes the answer with reasoning traces, citations, and memory-aware context.
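Step 2 says the balance between signals depends on the query type. A simple way to imagine that is a lookup from query type to a weight profile. Everything in this sketch is hypothetical: the three query types, the keyword heuristics in `classify`, and the numbers in `WEIGHT_PROFILES` are placeholders, and the source indicates the real weights are tuned with Bayesian optimization rather than fixed by hand.

```python
import re

# Hypothetical weight profiles; each row sums to 1.
WEIGHT_PROFILES = {
    "lookup": {"semantic": 0.3, "lexical": 0.6, "relational": 0.1},
    "relationship": {"semantic": 0.3, "lexical": 0.1, "relational": 0.6},
    "exploratory": {"semantic": 0.6, "lexical": 0.2, "relational": 0.2},
}

# Crude cue words suggesting the user is asking about links between entities.
RELATION_CUES = re.compile(
    r"\b(related to|connect\w*|between|depends on|linked)\b", re.IGNORECASE
)


def classify(query: str) -> str:
    """Guess the query type from surface cues (placeholder heuristic)."""
    if '"' in query:
        return "lookup"  # a quoted phrase suggests an exact-match need
    if RELATION_CUES.search(query):
        return "relationship"
    return "exploratory"


def weights_for(query: str) -> dict[str, float]:
    return WEIGHT_PROFILES[classify(query)]


for q in [
    'What does "namespace-scoped" mean?',
    "How is Whisper connected to the ingestion pipeline?",
    "Give me an overview of multimodal RAG",
]:
    print(classify(q), weights_for(q))
```

In practice a learned classifier or the LLM itself could replace the keyword heuristic; the point is only that the weights feeding the fusion step are chosen per query rather than fixed.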

Execution model

Graph-based retrieval, adaptive reasoning, layered memory, and deep research are central to the way the product behaves.

The system is also oriented toward deployment rather than notebook-only prototyping, with a Docker-first workflow for running the full stack.

Case study

Retrieval approach

Albot is built around the idea that complex questions rarely respond well to a single retrieval signal. The system blends semantic similarity, graph structure, and lexical precision so that factual lookups, relationship-heavy prompts, and broad exploratory questions can all be handled through the same product surface. That makes the assistant feel more dependable across very different query styles.

Why it stands out

What makes the product more interesting than a standard chatbot is the way ingestion, memory, retrieval, and synthesis are treated as one connected system. Multimodal inputs are not bolted on as novelty features; they feed into the same knowledge pipeline and influence how later answers are formed. The result is a research assistant that behaves more like a persistent knowledge layer than a one-turn conversation interface.