AI Agent · Architect, sole engineer · 2026 · Live

Jarvis Mac

A fully local voice assistant on Apple Silicon — streaming pipeline, speculative decoding, on-device RAG.

The AI part

End-to-end streaming on a Mac: VAD → STT → LLM (per-session prompt cache + speculative decoding) → TTS, with browser acoustic echo cancellation (AEC), plus a phoneme recognizer that learns to correct mispronunciations from a single voice sample.

Stack

Python · mlx-lm (Gemma 3.1 31B, MXFP4) · Silero VAD v6 · parakeet-mlx (STT) · Kokoro TTS · SQLite + sqlite-vec + FTS · FastAPI + WebSocket · Swift / iOS

Tests

383 passing

Active users

Model

Gemma 3.1 31B (MXFP4)

Latency

Sub-second TTFB

Why I built this

I wanted to know how far you could push a serious voice assistant on a single laptop. Not a prototype. A daily driver.

How it works

Streaming pipeline. Mic → Silero VAD → parakeet-mlx STT → mlx-lm LLM → Kokoro TTS → speakers. Every stage starts the moment it has enough input; nothing waits for “done.”
Per-session prompt cache. The KV cache for the system prefix is reused across turns within a conversation, so each new turn skips recomputing the prefix and only the new tokens are processed.
Speculative decoding. Scoped per profile: it auto-detects whether a smaller draft model is compatible with the target and falls back to standard decoding when it isn’t.
Pronunciation learning. A teach-by-voice UI: I say “say my name like this” once, and a wav2vec2-based acoustic comparator + a regex layer remember it forever.
Phase 4 RAG. A separate document indexer LXC ingests iCloud and TrueNAS docs into sqlite-vec; the assistant answers from them at conversation speed.
iOS twin. A native AVFoundation client that does hardware AEC on the phone, so I can talk to the same brain from anywhere.
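
A minimal sketch of the "nothing waits for done" idea from the streaming pipeline above, not the project's actual code. It assumes each stage can be modeled as a plain function over one chunk; the names stage, run_pipeline, and _DONE are hypothetical, and the toy transforms stand in for VAD, STT, LLM, and TTS.

```python
import asyncio

_DONE = object()  # sentinel marking the end of the stream (hypothetical name)


async def stage(transform, inbox, outbox):
    """Process each item as soon as it arrives and pass it downstream."""
    while True:
        item = await inbox.get()
        if item is _DONE:
            # Propagate shutdown so the next stage can finish too.
            await outbox.put(_DONE)
            return
        await outbox.put(transform(item))


async def run_pipeline(chunks, transforms):
    """Chain the transforms with queues; every stage runs concurrently."""
    queues = [asyncio.Queue() for _ in range(len(transforms) + 1)]
    tasks = [
        asyncio.create_task(stage(t, queues[i], queues[i + 1]))
        for i, t in enumerate(transforms)
    ]
    for chunk in chunks:
        await queues[0].put(chunk)
    await queues[0].put(_DONE)

    results = []
    while True:
        item = await queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    await asyncio.gather(*tasks)
    return results


if __name__ == "__main__":
    # Toy stand-ins: "transcribe" by upper-casing, "speak" by adding a mark.
    print(asyncio.run(run_pipeline(["hi", "there"], [str.upper, lambda s: s + "!"])))
```

Because each stage only blocks on its own inbox, a later stage can start on the first chunk while earlier stages are still working on the second. A real pipeline would likely bound the queues to apply backpressure.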

What it took

SQLite schema v5 with FTS + vec embeddings, hybrid retrieval (last 4 messages + top 5 facts + 2 past convos).
launchd service so it’s just always there.
iCloud → TrueNAS rsync over SSH for offline doc availability.
383 tests, currently green.
Per-user voice profiles (af_heart, af_jessica).
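
To make the retrieval budget above concrete, here is a rough sketch of assembling context from the last 4 messages, the top 5 facts, and 2 past conversations. It is an illustration under assumptions: ranking is done with plain-Python cosine similarity as a stand-in for sqlite-vec and FTS, and cosine, top_k, and build_context are hypothetical names, not the project's schema or API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, items, k):
    """items is a list of (text, embedding); return the k most similar texts."""
    ranked = sorted(items, key=lambda it: cosine(query_vec, it[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_context(query_vec, messages, facts, past_convos):
    """Combine recency and similarity under the fixed 4 / 5 / 2 budget."""
    return {
        "recent": messages[-4:],                 # last 4 messages, in order
        "facts": top_k(query_vec, facts, 5),     # top 5 facts by similarity
        "past": top_k(query_vec, past_convos, 2) # 2 most similar past convos
    }
```

The fixed budget keeps the prompt size predictable from turn to turn, which matters when the prompt prefix is cached.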

What I learned

When the round-trip latency matters more than the benchmark score, every layer of the stack has to cooperate. The win wasn’t a clever prompt — it was prompt caching and speculative decoding playing nicely together with a streaming TTS that doesn’t wait for the model to finish.
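
One way a streaming TTS can avoid waiting for the model to finish is to cut the token stream into sentences and hand each one off as soon as it is complete. The sketch below assumes tokens arrive as plain strings and that sentence-final punctuation followed by whitespace is a good enough boundary; sentences_from_tokens is a hypothetical helper, not the project's code.

```python
import re

# Split on whitespace that follows sentence-final punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens):
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for tok in tokens:
        buffer += tok
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hello", " there", ".", " How", " are", " you", "?", " Fine"]
    for sentence in sentences_from_tokens(stream):
        print(sentence)  # each sentence could be sent to TTS here
```

A sentence is released once the first token of the next one arrives, so speech can begin while the rest of the reply is still being generated. This simple boundary rule will split early on abbreviations such as "Dr."; a production chunker would need to handle those.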