Gemini Embedding 2: Our first natively multimodal embedding model
Google's first embedding model that maps text, images, video, audio, and documents into a single unified space—and can process multiple modalities simultaneously in one request to capture cross-modal relationships.
TLDR
• Processes interleaved multimodal input (e.g., image + text together) in a single request, not just separate modalities—captures nuanced relationships between media types
• Built on Gemini architecture with native multimodal understanding: 8192 token context, 6 images/request, 120-second videos, audio without transcription, 6-page PDFs
• Unified embedding space across 100+ languages simplifies RAG, semantic search, and clustering pipelines—no need for separate embedding models per modality
• Flexible output dimensions and support for complex downstream tasks like sentiment analysis across mixed media
In Detail
Gemini Embedding 2 represents a fundamental shift from text-only or modality-siloed embeddings to a truly unified multimodal approach. Built on the Gemini architecture, it maps text, images, videos, audio, and documents into a single embedding space that understands semantic intent across over 100 languages. The critical innovation is interleaved input processing—you can pass multiple modalities in one request (like an image with accompanying text), allowing the model to capture complex relationships between different media types rather than treating them as isolated inputs.
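To make the interleaved-input idea concrete, here is a minimal sketch of assembling one request that mixes modalities in order. The function name, part types, and field names (`contents`, `output_dimensionality`, and the `gemini-embedding-2` model string) are illustrative assumptions, not the documented API surface:

```python
# Hypothetical sketch: one embed request whose contents interleave modalities.
# Field names and part types here are assumptions for illustration only.

def build_interleaved_request(parts, output_dim=None):
    """Assemble a single request whose contents mix modalities in order."""
    allowed = {"text", "image", "video", "audio", "pdf"}
    contents = []
    for kind, payload in parts:
        if kind not in allowed:
            raise ValueError(f"unsupported modality: {kind}")
        contents.append({"type": kind, "data": payload})
    request = {"model": "gemini-embedding-2", "contents": contents}
    if output_dim is not None:
        # Flexible output dimensions (see specs below) would be set per request.
        request["output_dimensionality"] = output_dim
    return request

# A product photo and its caption travel together in one request,
# so the model can relate the two rather than embed them in isolation.
req = build_interleaved_request(
    [("image", "product_photo.jpg"), ("text", "red trail-running shoe")],
    output_dim=768,
)
```

The point of the sketch is the shape of the payload: both modalities sit in a single ordered `contents` list, so the image and its caption are embedded as one unit.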
The technical specifications are substantial: 8192 input tokens for text, up to 6 images per request (PNG/JPEG), 120 seconds of video (MP4/MOV), native audio ingestion without transcription, and direct PDF embedding up to 6 pages. These aren't just expanded limits—they're designed to handle real-world content where information naturally spans multiple formats. The model supports flexible output dimensions, making it adaptable to different downstream tasks and computational constraints.
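Flexible output dimensions are commonly consumed Matryoshka-style: keep a prefix of the full vector and re-normalize. Whether Gemini Embedding 2 works this way internally is not stated here; the snippet below is only a sketch of how a client typically trades storage cost against fidelity under that assumption:

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize to unit length,
    the usual way Matryoshka-style flexible dimensions are consumed."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Stand-in for a full-size embedding returned by the model.
full = [0.5, 0.5, 0.5, 0.5]
# Half the storage; cosine comparisons still work after re-normalizing.
small = truncate_embedding(full, 2)
```

Smaller vectors cut index size and search latency, which is why dimension flexibility matters for downstream tasks with tight computational constraints.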
For developers, this collapses what used to require multiple specialized embedding models and complex pipeline orchestration into a single API call. RAG systems can now search across text documents, images, and videos simultaneously with consistent semantic understanding. Sentiment analysis can incorporate visual context alongside text. Data clustering can group mixed-media content based on true semantic similarity rather than forcing everything into text representations. The unified embedding space means cross-modal retrieval actually works—searching with text to find relevant images, or vice versa, with the model understanding how these modalities relate conceptually.
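In a unified embedding space, cross-modal retrieval reduces to nearest-neighbor search over one index: embed everything once, then compare a text-query vector directly against image, audio, or PDF vectors. A minimal sketch with made-up three-dimensional vectors standing in for real model output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Made-up embeddings standing in for model output. Because all modalities
# share one space, a single index holds images, PDFs, and audio together.
index = {
    "sunset_photo.jpg": [0.9, 0.1, 0.0],
    "invoice_scan.pdf": [0.0, 0.2, 0.9],
    "podcast_clip.mp3": [0.1, 0.8, 0.2],
}

# Hypothetical embedding of the text query "orange sky over the ocean".
text_query = [0.85, 0.15, 0.05]

# Text-to-image retrieval is just the nearest neighbor in the shared space.
best = max(index, key=lambda k: cosine(text_query, index[k]))
```

With real embeddings the same loop works unchanged in either direction, searching with an image to find text, or with text to find video, because the vectors are directly comparable.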