Issue #3 · Apr 12, 2026

Persistent AI memory on a Raspberry Pi 5

Local embeddings + ChromaDB + Ollama in ~150 lines. ~$100 of hardware. No tokens.

A full RAG long-term memory system runs in ~600MB of RAM on a Raspberry Pi 5. Total hardware cost: ~$100 (8GB Pi 5 + NVMe SSD). Every component local. No API keys. No subscription. Browse the catalog at /models — none of this depends on cloud inference.

The stack

Three components:

- Embeddings: Ollama + nomic-embed-text. Zero runtime cost, local, 768-dim vectors. Japanese performance is "acceptable"; English is strong.
- Vector store: ChromaDB (Docker). ARM64-compatible (LanceDB had Docker issues on Pi 5). ~30 MB idle, cosine similarity.
- Interface: Discord bot (Docker). Off-the-shelf, no custom UI work.
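For anyone reproducing the stack, the two server-side pieces can be started roughly like this. A hedged sketch: the `chromadb/chroma` image name and default port 8000 are real, but the container's internal data path has varied between releases, so check the image documentation before relying on the volume mount shown here.

```shell
# Pull the embedding model into Ollama (one-time download)
ollama pull nomic-embed-text

# Run ChromaDB in Docker with a host-side persistence directory.
# NOTE: the in-container data path (/data here) is an assumption
# that depends on the image version you pull.
docker run -d --name chroma \
  -p 8000:8000 \
  -v "$PWD/chroma-data:/data" \
  chromadb/chroma
```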

Host: Raspberry Pi 5 · 8GB RAM · 238GB NVMe · Debian Bookworm · Node.js 22.

The memory module

Roughly 150 lines of JavaScript, four functions:

// Generate 768-dim vector from Ollama's REST API
embed(text)

// Initialize ChromaDB collection with cosine similarity
init()

// Store user + assistant pair, indexed by channelId + timestamp
saveConversation(channelId, userMessage, botResponse)

// Vectorize incoming query, return top-5 similar history
searchMemory(channelId, query)
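The article does not print the function bodies, so here is a minimal sketch of `embed()` against Ollama's REST embeddings endpoint, plus the cosine-similarity metric the collection is configured with. Assumptions: Ollama is on its default port 11434, and global `fetch` is available (it is in Node.js 22); error handling is illustrative only.

```javascript
// Sketch of embed(): POST the text to Ollama's /api/embeddings route
// and return the 768-dim vector. Port and model name assume the
// article's setup (localhost Ollama, nomic-embed-text).
const OLLAMA_URL = "http://localhost:11434";

async function embed(text) {
  const res = await fetch(`${OLLAMA_URL}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  if (!res.ok) throw new Error(`Ollama embedding failed: ${res.status}`);
  const { embedding } = await res.json(); // 768-dim float array
  return embedding;
}

// Cosine similarity between two vectors: the distance metric the
// ChromaDB collection uses for retrieval.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```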

Bot integration is two additions to the existing message handler:

// Before generating response — retrieve history
const memories = await searchMemory(channelId, userMessage);
const contextualMessage = memories + "\n\n" + userMessage;

// After responding — persist this turn (non-blocking)
saveConversation(channelId, userMessage, botResponse);

That's the entire memory layer: retrieve before generating, persist after responding.
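The real module talks to ChromaDB over the network. As a dependency-free illustration of the same flow, here is an in-memory stand-in with the same function shapes as the article's `saveConversation`/`searchMemory`. Everything below is hypothetical scaffolding: `createMemory` and the injected `embedFn` are not from the article; they exist so the retrieval logic runs without Ollama or ChromaDB.

```javascript
// In-memory stand-in for the ChromaDB-backed module. Vectors live in
// a plain array so the store/rank/retrieve cycle is visible end to end.
function createMemory(embedFn) {
  const store = []; // { channelId, text, vector, ts }

  function cosine(a, b) {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  // Store one user/assistant turn, indexed by channel and timestamp.
  async function saveConversation(channelId, userMessage, botResponse) {
    const text = `User: ${userMessage}\nAssistant: ${botResponse}`;
    store.push({ channelId, text, vector: await embedFn(text), ts: Date.now() });
  }

  // Vectorize the query and return the top-K most similar past turns,
  // joined into one block ready to prepend to the prompt.
  async function searchMemory(channelId, query, topK = 5) {
    const q = await embedFn(query);
    return store
      .filter((m) => m.channelId === channelId)
      .map((m) => ({ text: m.text, score: cosine(q, m.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK)
      .map((m) => m.text)
      .join("\n");
  }

  return { saveConversation, searchMemory };
}
```

In the real system, `embedFn` would be the Ollama-backed `embed()` and the array would be a ChromaDB collection; only the ranking logic changes hands.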

Resource footprint

- Ollama: ~33 MB idle, +540 MB when actively embedding or generating
- ChromaDB: ~30 MB idle and active
- Total: ~63 MB idle, ~603 MB at peak

On an 8GB Pi, that's under 8% of RAM at peak. Headroom for Discord, the bot runtime, and anything else.

Does it work?

After a deliberate session reset (fresh context, no prior conversation in scope), the bot was asked "What did you eat today?" Response: "Curry!" — correctly pulled from a previous session. Across Pi restarts, conversation history persists and resurfaces when contextually relevant.

Known gaps

The author documents these honestly as next steps; all are standard RAG improvements:

  1. No similarity threshold. All top-5 results are injected regardless of score. A cosine-similarity threshold (e.g., > 0.7) would filter noise.
  2. No memory summarization. Raw turn pairs accumulate. Long-running bots will eventually inject redundant context.
  3. No file ingestion. Local notes, memos, and documents aren't part of the retrieval corpus yet.
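Gap #1 is a few lines to close. A sketch under two assumptions: `filterByThreshold` is a hypothetical helper (not from the article), and the nested `documents`/`distances` arrays mimic the shape of a ChromaDB query response, where a cosine-space collection reports distance as 1 − similarity.

```javascript
// Drop weak matches before injecting them into the prompt.
// With cosine distance = 1 - similarity, a similarity floor of 0.7
// means keeping only results whose distance is below 0.3.
const SIMILARITY_THRESHOLD = 0.7;

function filterByThreshold(results, threshold = SIMILARITY_THRESHOLD) {
  const docs = results.documents[0];   // documents for the first query
  const dists = results.distances[0];  // matching cosine distances
  return docs.filter((_, i) => 1 - dists[i] >= threshold);
}
```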

Why this matters

The running cost is electricity. No tokens. No subscription. No API rate limits. For anyone building a personal assistant that knows your projects, your preferences, your history — this is the reference architecture. Every component is stable and ARM64-native.

If the CPU budget is tighter than a Pi 5, see the calculator — even an 8GB GPU comfortably runs nomic-embed-text alongside a small generator model.


Based on RunLocal Issue #3 · Full newsletter version on Substack →