Gemini Embedding 2: Our first natively multimodal embedding model
Google's first embedding model that maps text, images, video, audio, and documents into a single unified space—and can process multiple modalities simultaneously in one request to capture cross-modal relationships.
TLDR
• Processes interleaved multimodal input (e.g., image + text together) in a single request, not just separate modalities—captures nuanced relationships between media types
• Built on Gemini architecture with native multimodal understanding: 8192 token context, 6 images/request, 120-second videos, audio without transcription, 6-page PDFs
• Unified embedding space across 100+ languages simplifies RAG, semantic search, and clustering pipelines—no need for separate embedding models per modality
• Flexible output dimensions and support for complex downstream tasks like sentiment analysis across mixed media
In Detail
Gemini Embedding 2 represents a fundamental shift from text-only or modality-siloed embeddings to a truly unified multimodal approach. Built on the Gemini architecture, it maps text, images, videos, audio, and documents into a single embedding space that understands semantic intent across over 100 languages. The critical innovation is interleaved input processing—you can pass multiple modalities in one request (like an image with accompanying text), allowing the model to capture complex relationships between different media types rather than treating them as isolated inputs.
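To make the interleaved-input idea concrete, here is a minimal sketch of assembling one request that mixes modalities in order. The function name, part types, and field names (`contents`, `output_dimensionality`, and the `gemini-embedding-2` model string) are illustrative assumptions, not the documented API surface:

```python
# Hypothetical sketch: one embed request whose contents interleave modalities.
# Field names and part types here are assumptions for illustration only.

def build_interleaved_request(parts, output_dim=None):
    """Assemble a single request whose contents mix modalities in order."""
    allowed = {"text", "image", "video", "audio", "pdf"}
    contents = []
    for kind, payload in parts:
        if kind not in allowed:
            raise ValueError(f"unsupported modality: {kind}")
        contents.append({"type": kind, "data": payload})
    request = {"model": "gemini-embedding-2", "contents": contents}
    if output_dim is not None:
        # Flexible output dimensions (see specs below) would be set per request.
        request["output_dimensionality"] = output_dim
    return request

# A product photo and its caption travel together in one request,
# so the model can relate the two rather than embed them in isolation.
req = build_interleaved_request(
    [("image", "product_photo.jpg"), ("text", "red trail-running shoe")],
    output_dim=768,
)
```

The point of the sketch is the shape of the payload: both modalities sit in a single ordered `contents` list, so the image and its caption are embedded as one unit.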
The technical specifications are substantial: 8192 input tokens for text, up to 6 images per request (PNG/JPEG), 120 seconds of video (MP4/MOV), native audio ingestion without transcription, and direct PDF embedding up to 6 pages. These aren't just expanded limits—they're designed to handle real-world content where information naturally spans multiple formats. The model supports flexible output dimensions, making it adaptable to different downstream tasks and computational constraints.
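Flexible output dimensions are commonly consumed Matryoshka-style: keep a prefix of the full vector and re-normalize. Whether Gemini Embedding 2 works this way internally is not stated here; the snippet below is only a sketch of how a client typically trades storage cost against fidelity under that assumption:

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize to unit length,
    the usual way Matryoshka-style flexible dimensions are consumed."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Stand-in for a full-size embedding returned by the model.
full = [0.5, 0.5, 0.5, 0.5]
# Half the storage; cosine comparisons still work after re-normalizing.
small = truncate_embedding(full, 2)
```

Smaller vectors cut index size and search latency, which is why dimension flexibility matters for downstream tasks with tight computational constraints.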
For developers, this collapses what used to require multiple specialized embedding models and complex pipeline orchestration into a single API call. RAG systems can now search across text documents, images, and videos simultaneously with consistent semantic understanding. Sentiment analysis can incorporate visual context alongside text. Data clustering can group mixed-media content based on true semantic similarity rather than forcing everything into text representations. The unified embedding space means cross-modal retrieval actually works—searching with text to find relevant images, or vice versa, with the model understanding how these modalities relate conceptually.
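In a unified embedding space, cross-modal retrieval reduces to nearest-neighbor search over one index: embed everything once, then compare a text-query vector directly against image, audio, or PDF vectors. A minimal sketch with made-up three-dimensional vectors standing in for real model output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Made-up embeddings standing in for model output. Because all modalities
# share one space, a single index holds images, PDFs, and audio together.
index = {
    "sunset_photo.jpg": [0.9, 0.1, 0.0],
    "invoice_scan.pdf": [0.0, 0.2, 0.9],
    "podcast_clip.mp3": [0.1, 0.8, 0.2],
}

# Hypothetical embedding of the text query "orange sky over the ocean".
text_query = [0.85, 0.15, 0.05]

# Text-to-image retrieval is just the nearest neighbor in the shared space.
best = max(index, key=lambda k: cosine(text_query, index[k]))
```

With real embeddings the same loop works unchanged in either direction, searching with an image to find text, or with text to find video, because the vectors are directly comparable.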