Persistent AI memory on a Raspberry Pi 5
Local embeddings + ChromaDB + Ollama in ~150 lines. ~$100 of hardware. No tokens.
A full RAG long-term memory system runs in ~600MB of RAM on a Raspberry Pi 5. Total hardware cost: ~$100 (8GB Pi 5 + NVMe SSD). Every component local. No API keys. No subscription. Browse the catalog at /models — none of this depends on cloud inference.
The stack
Three components:
| Layer | Choice | Why |
|---|---|---|
| Embeddings | Ollama + nomic-embed-text | Zero runtime cost, local, 768-dim vectors. Japanese performance is "acceptable," English is strong. |
| Vector store | ChromaDB (Docker) | ARM64-compatible (LanceDB had Docker issues on Pi 5). ~30MB idle, cosine similarity. |
| Interface | Discord bot (Docker) | Off-the-shelf, no custom UI work. |
Host: Raspberry Pi 5 · 8GB RAM · 238GB NVMe · Debian Bookworm · Node.js 22.
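Assuming the default ports (Ollama on 11434, ChromaDB on 8000), setup comes down to roughly two commands — a sketch of the infrastructure, not the author's exact invocation:

```shell
# One-time: pull the local embedding model (~274 MB download)
ollama pull nomic-embed-text

# Run ChromaDB in Docker; the chromadb/chroma image ships ARM64 builds.
# Volume name and mount path are assumptions for persistence across reboots.
docker run -d --name chroma -p 8000:8000 \
  -v chroma-data:/chroma/chroma chromadb/chroma
```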
The memory module
Roughly 150 lines of JavaScript, four functions:
```js
// Generate 768-dim vector from Ollama's REST API
embed(text)

// Initialize ChromaDB collection with cosine similarity
init()

// Store user + assistant pair, indexed by channelId + timestamp
saveConversation(channelId, userMessage, botResponse)

// Vectorize incoming query, return top-5 similar history
searchMemory(channelId, query)
```
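Of these, embed() needs nothing beyond Node 22's global fetch. A minimal sketch against Ollama's /api/embeddings endpoint (the base URL and the makeId helper are assumptions, not the author's code):

```js
const OLLAMA_URL = "http://localhost:11434"; // assumed default Ollama port

// Generate a 768-dim vector from Ollama's REST API
async function embed(text) {
  const res = await fetch(`${OLLAMA_URL}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  if (!res.ok) throw new Error(`Ollama embed failed: ${res.status}`);
  const { embedding } = await res.json();
  return embedding; // array of 768 floats
}

// Record IDs combine channel and timestamp, matching the
// channelId + timestamp indexing scheme described above
const makeId = (channelId, ts) => `${channelId}-${ts}`;
```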
Bot integration is two additions to the existing message handler:
```js
// Before generating response — retrieve history
const memories = await searchMemory(channelId, userMessage);
const contextualMessage = memories + "\n\n" + userMessage;

// After responding — persist this turn (non-blocking)
saveConversation(channelId, userMessage, botResponse);
```
That's the entire memory layer.
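Under the hood, searchMemory() is one ChromaDB query scoped to the channel. A sketch against the chromadb npm client and Ollama — the collection name, metadata keys, and formatting helper are assumptions, not the author's code:

```js
// Embed the query via Ollama (assumed default port 11434)
async function embedQuery(text) {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  return (await res.json()).embedding;
}

async function searchMemory(channelId, query) {
  // Lazy import keeps the sketch loadable without `npm i chromadb`
  const { ChromaClient } = await import("chromadb");
  const client = new ChromaClient({ path: "http://localhost:8000" });
  const collection = await client.getOrCreateCollection({
    name: "conversations",                 // assumed collection name
    metadata: { "hnsw:space": "cosine" },  // cosine similarity, as in init()
  });
  const result = await collection.query({
    queryEmbeddings: [await embedQuery(query)],
    nResults: 5,            // top-5, as described above
    where: { channelId },   // restrict retrieval to this channel's history
  });
  return formatMemories(result.documents[0] ?? []);
}

// Pure helper: join retrieved turns into a context preamble for the prompt
function formatMemories(docs) {
  if (docs.length === 0) return "";
  return "Relevant past conversation:\n" + docs.map((d) => `- ${d}`).join("\n");
}
```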
Resource footprint
| Component | Idle | Active |
|---|---|---|
| Ollama | ~33 MB | +540 MB |
| ChromaDB | ~30 MB | ~30 MB |
| Total | ~63 MB | ~603 MB |
On an 8GB Pi, that's under 8% of RAM at peak. Headroom for Discord, the bot runtime, and anything else.
Does it work
After a deliberate session reset, the bot was asked "What did you eat today?" The response — "Curry!" — was correctly pulled from a previous session. Even across Pi restarts, conversation history persists and resurfaces when contextually relevant.
Known gaps
Documented honestly by the author as next steps — all standard RAG improvements:
- No similarity threshold. All top-5 results are injected regardless of score. A cosine-similarity cutoff (e.g., > 0.7 — ChromaDB reports this as a cosine distance below 0.3) would filter noise.
- No memory summarization. Raw turn pairs accumulate. Long-running bots will eventually inject redundant context.
- No file ingestion. Local notes, memos, and documents aren't part of the retrieval corpus yet.
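The first gap is only a few lines to close. A sketch of a distance cutoff over the shape of ChromaDB's query() result — the cutoff value and helper name are illustrative assumptions:

```js
// ChromaDB's cosine space returns *distance* (1 - similarity):
// 0 = identical, 2 = opposite.  similarity > 0.7  <=>  distance < 0.3
const MAX_DISTANCE = 0.3;

// Keep only documents whose distance beats the cutoff.
// `result` follows ChromaDB's query() shape for one query embedding:
// { documents: [[...]], distances: [[...]] }
function filterByDistance(result, maxDistance = MAX_DISTANCE) {
  const docs = result.documents?.[0] ?? [];
  const dists = result.distances?.[0] ?? [];
  return docs.filter((_, i) => dists[i] < maxDistance);
}
```

With this in place, an off-topic query injects nothing instead of five weakly related turns.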
Why this matters
The running cost is electricity. No tokens. No subscription. No API rate limits. For anyone building a personal assistant that knows your projects, your preferences, your history — this is the reference architecture. Every component is stable and ARM64-native.
If the CPU budget is tighter than a Pi 5, see the calculator — even an 8GB GPU comfortably runs nomic-embed-text alongside a small generator model.
Based on RunLocal Issue #3 · Full newsletter version on Substack →