Indexing a year of video locally on a 5-year-old M1 Max with Gemma 4 31B
AI video editors assume your footage is already labeled. The real bottleneck is making unlabeled archives queryable in plain English, which you can solve locally for $22/month instead of $140/month in SaaS subscriptions.
TLDR
• Built a local-first video indexing pipeline that writes .description.md sidecars next to each clip using Gemma 4 31B, WhisperX transcription, and face detection—runs on a 2021 M1 Max with 64GB RAM
• Cut a $140/month SaaS stack to $22/month after realizing that DaVinci Resolve Studio already ships 70% of what AI video editors sell, and that the real problem is upstream: making footage searchable
• Enum constraints beat open-ended instructions for preventing hallucinations—structured schemas force models to pick from valid options (golden_hour | nighttime | dim_interior) rather than confabulate
• Local 31B with structured prompts closes most of the gap to cloud APIs for bulk operations; cloud premium earns its keep on the hard 10-20% edge cases
• One vision call captures everything (rating, lighting, color palette, audio quality, keywords, faces, GPS, transcript) because the vision pass is the expensive operation—schema must be exhaustive on day one
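The enum point in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the author's schema: only the three lighting values come from the summary, while the names `Lighting`, `CLIP_SCHEMA`, and `check_lighting` are hypothetical, and the real schema almost certainly has more fields and more values.

```python
from enum import Enum
from typing import Any


class Lighting(str, Enum):
    # Only these three values are quoted in the summary; the full list is unknown.
    GOLDEN_HOUR = "golden_hour"
    NIGHTTIME = "nighttime"
    DIM_INTERIOR = "dim_interior"


# Plain JSON Schema fragment (hypothetical). The "enum" keyword restricts the
# field to a closed set, so the model has to pick a value rather than write one.
CLIP_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "lighting": {"type": "string", "enum": [m.value for m in Lighting]},
    },
    "required": ["lighting"],
    "additionalProperties": False,
}


def check_lighting(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the field is valid."""
    allowed = CLIP_SCHEMA["properties"]["lighting"]["enum"]
    value = record.get("lighting")
    if value not in allowed:
        return [f"lighting={value!r} is not one of {allowed}"]
    return []
```

A record such as `{"lighting": "moody sunset"}` fails the check and can be routed to review, whereas a wrong but valid pick like `nighttime` passes. That is the trade-off the bullet describes: the enum cannot stop a mis-pick, only an invented value.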
In Detail
The author runs an eco-lodge in Kenya's Maasai Mara and had years of unlabeled footage sitting on SSDs because editing time disappeared into shipping software. The initial plan, a $140/month SaaS stack of AI video editors, fell apart before it was ever run: generative B-roll has no place on a real travel brand, and every AI editor assumes footage is already labeled. The real problem is upstream: how does an agent find "the elephant on the hill at golden hour" in an archive of IMG_*.mov files?
The solution is a local-first indexing pipeline that writes .description.md sidecars next to each clip. The pipeline extracts five frames per clip, runs WhisperX for transcription with speaker diarization, detects faces with insightface and stores 512-dim ArcFace embeddings, reverse-geocodes GPS coordinates, then sends everything to Gemma 4 31B Q4 running in LM Studio. One vision call captures exhaustive metadata (lighting enum, time-of-day enum, color palette, technical quality, keywords, prose description) because the vision pass is the expensive operation. The whole thing runs on a 2021 M1 Max with 64GB RAM, pushing 50GB of swap during bulk runs but staying within tolerance.
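Two small pieces of the pipeline described above can be sketched in Python: choosing five frame times per clip and writing the sidecar next to it. The summary does not say how frames are sampled or how the sidecar is laid out, so the even spacing, the `IMG_0001.description.md` naming, and the front-matter layout below are all assumptions, and every function name is hypothetical.

```python
from pathlib import Path


def frame_timestamps(duration_s: float, n_frames: int = 5) -> list[float]:
    """Evenly spaced sample times (assumed strategy), skipping the clip's exact start and end."""
    if duration_s <= 0 or n_frames <= 0:
        return []
    step = duration_s / (n_frames + 1)
    return [round(step * (i + 1), 3) for i in range(n_frames)]


def sidecar_path(clip: Path) -> Path:
    """Map IMG_0001.mov to IMG_0001.description.md in the same folder (assumed naming)."""
    return clip.with_name(clip.stem + ".description.md")


def write_sidecar(clip: Path, metadata: dict[str, object], description: str) -> Path:
    """Write metadata as a naive key: value header followed by the prose description."""
    lines = ["---"]
    # Naive serialization: values containing colons or newlines would need real YAML quoting.
    lines += [f"{key}: {value}" for key, value in metadata.items()]
    lines += ["---", "", description.strip(), ""]
    out = sidecar_path(clip)
    out.write_text("\n".join(lines), encoding="utf-8")
    return out
```

Keeping the index as plain text files beside the clips means any tool that can search a folder can search the archive, with no database to run or migrate.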
Four bugs each taught a specific lesson:
• WhisperX broke its diarization API between versions. Fix: signature introspection with a fallback.
• Claude CLI returns permission errors as successful responses in non-interactive mode. Fix: string-match the output for "I need permission".
• Gemma returned people_count: "many" instead of an integer. Fix: never give a schema field a union type.
• The initial cull prompt was photographer-portfolio-shaped when it should have been memory-permissive. Fix: declare the mode explicitly.
The deeper insight: enum constraints beat instructions for preventing confabulation, because a model can mis-pick from an enum but cannot invent new values. Local 31B with structured prompts closes most of the gap to cloud APIs for bulk indexing; cloud serves as the re-rate pass on clips the local model flagged for review. AI video editors are pitched one layer too high: the valuable layer is the index that makes archives queryable in plain English, not the editor on top.
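The first two fixes mentioned above, signature introspection with a fallback and string-matching for a permission refusal, might look roughly like this in Python. This is a sketch under stated assumptions: `diarize_v1` and `diarize_v2` are made-up stand-ins for two versions of a diarization call and do not reflect the real WhisperX API. Only the marker string "I need permission" comes from the summary.

```python
import inspect
from typing import Any, Callable

_KEYWORD_KINDS = (
    inspect.Parameter.POSITIONAL_OR_KEYWORD,
    inspect.Parameter.KEYWORD_ONLY,
)


def call_with_supported_kwargs(func: Callable[..., Any], /, *args: Any, **kwargs: Any) -> Any:
    """Call func, silently dropping keyword arguments its signature does not accept."""
    params = inspect.signature(func).parameters.values()
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return func(*args, **kwargs)  # func takes **kwargs, so pass everything through
    accepted = {p.name for p in params if p.kind in _KEYWORD_KINDS}
    return func(*args, **{k: v for k, v in kwargs.items() if k in accepted})


# Hypothetical stand-ins for an old and a new library version (not real WhisperX calls).
def diarize_v1(audio_path: str, num_speakers: Any = None) -> dict:
    return {"api": "v1", "num_speakers": num_speakers}


def diarize_v2(audio_path: str, min_speakers: Any = None, max_speakers: Any = None) -> dict:
    return {"api": "v2", "min_speakers": min_speakers, "max_speakers": max_speakers}


PERMISSION_MARKER = "I need permission"  # marker string quoted in the summary


def is_permission_refusal(stdout: str) -> bool:
    """Detect a refusal that arrived as ordinary output instead of a failing exit code."""
    return PERMISSION_MARKER.lower() in stdout.lower()
```

Calling `call_with_supported_kwargs(diarize_v1, "a.wav", num_speakers=2, min_speakers=1, max_speakers=3)` passes only `num_speakers`, while the same call against `diarize_v2` passes only the min/max pair, so one call site survives either signature. The permission check is deliberately crude: it treats a successful exit as untrusted until the output has been inspected.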