Krea 2 Technical Report
An independent AI lab built a top-10 text-to-image model from scratch—including custom data infrastructure processing 208TB of metadata, a PostgreSQL-based queue system, and a training pipeline explicitly optimized for creative exploration over aesthetic convergence.
TLDR
• Rejects conventional aesthetic filtering in pretraining, arguing it introduces biases; keeps "undesirable" images if captions are accurate since models can learn to avoid those distributions later
• Built custom data infrastructure: sharded PostgreSQL instances ("krablets") handling tens of thousands of contended UPSERT transactions/sec, with a job queue built on FOR UPDATE SKIP LOCKED that provides automatic retries, fault tolerance, and dynamic worker scaling
• Multi-stage pipeline: progressive resolution pretraining (256→512→1024px), midtraining with Wikipedia PageRank for entity coverage, custom STPO variant to prevent policy divergence, and multi-reward RL with prompt-specific rubric evaluation
• Architecture ablations: lightweight per-block bias terms instead of per-block MLPs (saves 20-30% of parameters), Qwen 3 VL as text encoder with layerwise feature aggregation, GQA + gated sigmoid attention
• Infrastructure lessons: aggressive 30-second checkpointing via Weka, extensive observability (DCGM + custom PCIe/NVLink/InfiniBand metrics), and discovering that doubling GPU count produced far more instability than expected
In Detail
Krea 2 is a foundation model series built explicitly for creative exploration rather than converging toward narrow aesthetic defaults. The team argues that conventional model-based filtering (aesthetic scores, IQA models) introduces implicit biases—for example, classifying motion blur as low quality when it's often a deliberate artistic choice. Their data curation philosophy keeps any image where the caption accurately describes it, even if undesirable, since precise understanding enables steering away from those distributions later. They use zero AI-generated images in pretraining, finding that even small amounts introduce biases and impose quality ceilings.
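The curation rule described above can be sketched as a filter that looks only at caption accuracy and provenance and deliberately ignores any aesthetic score. This is a minimal illustration, not the team's pipeline: the `Sample` fields, the `caption_accuracy` score, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # All fields are hypothetical; the report does not describe a schema.
    caption_accuracy: float  # assumed score in [0, 1]: how well the caption matches the image
    ai_generated: bool       # assumed provenance flag
    aesthetic_score: float   # carried along but deliberately never consulted


def keep_for_pretraining(sample: Sample, min_caption_accuracy: float = 0.8) -> bool:
    """Keep a sample when its caption is accurate and it is not AI-generated."""
    if sample.ai_generated:
        return False
    # No aesthetic or IQA threshold: a blurry or "ugly" image with an
    # accurate caption is kept.
    return sample.caption_accuracy >= min_caption_accuracy


blurry_but_accurate = Sample(caption_accuracy=0.95, ai_generated=False, aesthetic_score=0.1)
print(keep_for_pretraining(blurry_but_accurate))
```

The point of the sketch is what is absent: there is no branch on `aesthetic_score`, so a motion-blurred photo with an accurate caption passes the filter.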
The technical infrastructure is built from scratch. Their data system uses PostgreSQL "krablets" (sharded Postgres instances) with a novel queue-based DAG processing approach: jobs are expressed as SQL queries with FOR UPDATE SKIP LOCKED, providing automatic retries on failure, fault tolerance (workers can crash without losing progress), dynamic scaling (1 to 1000 workers without resharding), and continuous incremental processing. This system handles 208TB of metadata and tens of thousands of contended UPSERT transactions per second. The training pipeline spans five stages: pretraining (progressive 256→512→1024px resolution), midtraining (hierarchical k-means clustering + Wikipedia PageRank to ensure entity coverage), SFT, preference optimization (a custom STPO variant that adds an auxiliary loss to prevent policy divergence), and RL with multi-reward GRPO, including prompt-specific rubric rewards to avoid reward hacking.
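The FOR UPDATE SKIP LOCKED queue pattern mentioned above can be sketched as follows. This is a generic illustration of the technique, not the krablet implementation: the `jobs` table, its columns, and the function name are hypothetical, and `conn` is assumed to be a DB-API connection to PostgreSQL (for example from psycopg2) with autocommit off.

```python
# Hypothetical schema: jobs(id, payload, status). Not taken from the report.
CLAIM_SQL = """
SELECT id, payload
FROM jobs
WHERE status = 'pending'
ORDER BY id
LIMIT %(batch)s
FOR UPDATE SKIP LOCKED
"""

DONE_SQL = "UPDATE jobs SET status = 'done' WHERE id = %(id)s"


def process_batch(conn, handler, batch=100):
    """Claim, process, and complete up to `batch` jobs in one transaction.

    Rows locked by other workers are skipped rather than waited on, so many
    workers can poll the same table without blocking each other.
    """
    try:
        with conn.cursor() as cur:
            cur.execute(CLAIM_SQL, {"batch": batch})
            rows = cur.fetchall()
            for job_id, payload in rows:
                handler(payload)
                cur.execute(DONE_SQL, {"id": job_id})
        conn.commit()  # releases the row locks and publishes the 'done' status
        return len(rows)
    except Exception:
        # Rolling back releases the locks; the rows are still 'pending',
        # so another worker picks them up on its next poll.
        conn.rollback()
        raise
```

In this variant the row locks are held for the duration of processing, so a worker that crashes or disconnects loses its transaction and its jobs become claimable again with no extra bookkeeping. The trade-off is long-lived transactions; a lease column with a timeout is a common alternative when handlers are slow.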
Architecture decisions came from systematic ablations prioritizing stability, performance, efficiency, and simplicity. Key choices: GQA with gated sigmoid attention for stable training dynamics; replacing per-block MLPs with lightweight per-block bias terms (saving 20-30% of parameters); Qwen 3 VL as text encoder with shallow attention layers aggregating features across layers (last-layer features are suboptimal since they're optimized for next-token prediction, not image generation); and careful autoencoder selection (Qwen Image VAE for early models, FLUX 2 VAE for larger ones). They also built a prompt expander using SFT + RL on open-source LLMs, with explicit diversity rewards to prevent collapse to a single house style, and a style-reference system trained via self-supervised learning to minimize content leakage.
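One way to picture aggregating text-encoder features across layers is attention pooling over the layer axis: for each token, a learned query scores that token's hidden state at every layer, and the output is the softmax-weighted mix. The summary only says "shallow attention layers", so the single-query form, the function name, and the shapes below are assumptions made for illustration.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_feats, query):
    """Pool per-token features across encoder layers with a single query.

    layer_feats[l][t] is the feature vector of token t at layer l.
    query is a (hypothetical) learned vector of the same dimension.
    Returns one pooled vector per token.
    """
    n_layers = len(layer_feats)
    n_tokens = len(layer_feats[0])
    dim = len(query)
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for t in range(n_tokens):
        # Scaled dot-product score of the query against each layer's feature.
        scores = [
            scale * sum(q * h for q, h in zip(query, layer_feats[l][t]))
            for l in range(n_layers)
        ]
        weights = softmax(scores)
        pooled.append([
            sum(weights[l] * layer_feats[l][t][d] for l in range(n_layers))
            for d in range(dim)
        ])
    return pooled


# Two layers, one token, two feature dimensions.
feats = [[[1.0, 0.0]], [[0.0, 1.0]]]
print(aggregate_layers(feats, query=[0.0, 0.0]))
```

With a zero query every layer receives equal weight, so the result is the plain average of the layers; a trained query would instead shift weight toward whichever layers carry the most useful signal for image generation.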
Infrastructure lessons from large-scale pretraining: aggressive checkpointing every 30 seconds via Weka filesystem; treating MTBF/MTTR as key reliability metrics rather than attempting perfect fault tolerance; extensive observability including DCGM metrics plus custom DaemonSets for PCIe replay counters, NVLink errors, and InfiniBand metrics (fabric instability was the single largest contributor to crashes); and discovering that scaling GPU count produced far more instability than anticipated—runs under 128 GPUs were very stable, but at very large scale they never completed a 24-hour run without a crash, often with no obvious cause beyond silent NCCL timeouts.
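The interaction between checkpoint frequency and MTBF/MTTR can be made concrete with a back-of-the-envelope model. It assumes failures land uniformly within a checkpoint interval (so half an interval of work is lost on average) and ignores the cost of writing checkpoints; the function names and all numbers are illustrative, not figures from the report.

```python
def availability(mtbf_s: float, mttr_s: float) -> float:
    """Fraction of wall-clock time the job is running: MTBF / (MTBF + MTTR)."""
    return mtbf_s / (mtbf_s + mttr_s)


def goodput(mtbf_s: float, mttr_s: float, ckpt_interval_s: float) -> float:
    """Fraction of wall-clock time that turns into retained training progress.

    Per failure cycle the job runs for MTBF seconds, loses on average half a
    checkpoint interval of work, and then spends MTTR seconds recovering.
    """
    lost_s = min(ckpt_interval_s / 2.0, mtbf_s)
    return (mtbf_s - lost_s) / (mtbf_s + mttr_s)


# Hypothetical cluster: one failure every 6 hours, 10 minutes to recover.
mtbf, mttr = 6 * 3600, 10 * 60
for interval in (30, 1800):
    print(interval, round(goodput(mtbf, mttr, interval), 4))
```

Under this model, a 30-second interval leaves recovery time (MTTR) as nearly the only loss, which is consistent with treating MTBF and MTTR as the metrics to optimize once checkpointing is cheap and frequent.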