The First Fully General Computer Action Model
FDM-1 is the first AI model that can learn computer use from 11 million hours of unlabeled internet video, achieving 100x better token efficiency than OpenAI's models and fitting nearly 2 hours of 30 FPS video into the token budget other models burn on a single minute.
TLDR
• Solves the fundamental data problem: current computer agents train on <20 hours of expensive contractor-labeled screenshots; FDM-1 trains on 11M hours of internet video using an inverse dynamics model to auto-label actions
• Novel video encoder with masked compression handles variable information density (blank screen vs dense text), achieving 50x better compression than the previous SOTA and 100x better than OpenAI's models
• Non-causal masked diffusion architecture for action labeling—you can't label Cmd+C until you see the paste, so they predict actions bidirectionally with confidence-based unmasking
• Exponential binning for mouse movements reduces state space while preserving precision for small movements
• Built eval infrastructure doing 1M+ rollouts/hour across 80k forking VMs with 11ms latency—critical because the model never saw lag during training
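The exponential binning of mouse deltas can be sketched as follows. This is an illustrative reconstruction, not the actual FDM-1 code: `num_bins` and `max_delta` are hypothetical parameter choices, and the exact quantization scheme may differ.

```python
import math

def delta_to_bin(delta: int, num_bins: int = 21, max_delta: int = 1024) -> int:
    """Map a signed pixel delta to one of `num_bins` discrete bins.

    Bin widths grow exponentially with magnitude: frequent small
    movements get fine-grained bins, rare large movements share
    coarse ones. Parameter values here are illustrative guesses.
    """
    half = num_bins // 2                  # bins per sign, plus a zero bin
    if delta == 0:
        return half                       # center bin encodes exactly zero
    sign = 1 if delta > 0 else -1
    mag = min(abs(delta), max_delta)
    # log-scale position in (0, 1], then quantize to `half` levels
    frac = math.log1p(mag) / math.log1p(max_delta)
    level = min(half, max(1, math.ceil(frac * half)))
    return half + sign * level

def bin_to_delta(bin_idx: int, num_bins: int = 21, max_delta: int = 1024) -> int:
    """Invert a bin index to a representative delta (the bin's upper edge)."""
    half = num_bins // 2
    level = bin_idx - half
    if level == 0:
        return 0
    sign = 1 if level > 0 else -1
    frac = abs(level) / half
    return sign * round(math.expm1(frac * math.log1p(max_delta)))
```

With these settings, deltas of 1 and 3 pixels land in distinct bins, while 900 and 1000 share one: precision is spent where movements are most frequent.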
In Detail
The core breakthrough is moving computer action models from a data-constrained to compute-constrained regime. Previous approaches required expensive contractor annotations, limiting the largest open dataset to under 20 hours of video. FDM-1 trains on 11 million hours of unlabeled internet video (coding livestreams, film editing, gameplay) using a three-stage recipe: train an inverse dynamics model (IDM) on 40k hours of contractor data, use that IDM to automatically label the full 11M hour corpus, then train the forward dynamics model on next-action prediction.
The technical innovations are substantial. Their video encoder uses a masked compression objective to handle variable information density—a cursor moving across a blank screen versus scrolling through dense text—achieving 100x faster convergence than a standard ViT. For action labeling, they use masked diffusion rather than causal prediction because labeling is fundamentally non-causal (you can't infer Cmd+C until you see the resulting paste). The model predicts actions conditioned on all frames simultaneously, unmasking high-confidence predictions first and spending more compute on ambiguous ones. Mouse movements are exponentially binned to reduce state space while preserving precision for frequent small movements.
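The confidence-based unmasking loop can be sketched as below. This is a minimal sketch of the decoding schedule only; the model itself is replaced by a caller-supplied `probs_fn`, and the step count is an illustrative choice.

```python
import numpy as np

def confidence_unmask(probs_fn, seq_len, steps=4):
    """Iterative confidence-based unmasking, as in masked-diffusion
    decoding. `probs_fn(committed)` stands in for the real model: it
    returns per-position action probabilities conditioned on ALL frames
    plus the actions committed so far (None == still masked).
    """
    committed = [None] * seq_len
    per_step = max(1, seq_len // steps)   # how many positions to commit per step
    while any(a is None for a in committed):
        probs = probs_fn(committed)       # shape: (seq_len, num_actions)
        masked = [i for i, a in enumerate(committed) if a is None]
        # rank masked positions by model confidence (max probability)
        masked.sort(key=lambda i: probs[i].max(), reverse=True)
        for i in masked[:per_step]:       # commit the easy ones first
            committed[i] = int(probs[i].argmax())
    return committed

# Usage with a dummy model that returns fixed probabilities:
rng = np.random.default_rng(0)
fixed = rng.dirichlet(np.ones(5), size=8)   # 8 positions, 5 action classes
decoded = confidence_unmask(lambda committed: fixed, 8)
```

Ambiguous positions are left masked until later steps, by which point the committed high-confidence actions provide extra conditioning—exactly the "spend more compute on ambiguous predictions" behavior described above.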
The eval infrastructure is equally novel: 80k forking VMs running 1M+ rollouts per hour with 11ms round-trip latency. Forking lets them capture OS memory snapshots and replicate them across thousands of rollouts without corrupting the base environment. Results show the IDM-labeled data outperforms contractor data on general mouse movement and action tasks, though it's slower on typing due to IDM noise. The model generalizes to real-world tasks like self-driving with less than 1 hour of finetuning, starting at 50% accuracy versus near-zero for baseline models without computer use pretraining.
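The snapshot-and-fork pattern has a process-level analogue that conveys the idea. The sketch below uses POSIX `os.fork()` for copy-on-write replication of in-memory state; the real infrastructure snapshots whole OS images across VMs, so this is only an illustrative miniature with hypothetical names.

```python
import os
import pickle
import struct

def fork_rollouts(base_state, policies):
    """Run each policy against a forked copy of `base_state`.

    os.fork() gives each child a copy-on-write view of the parent's
    memory, so mutations inside one rollout never corrupt the base
    environment—the same guarantee forking VMs provide at OS scale.
    """
    results = []
    for policy in policies:
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:                       # child: owns a private copy
            os.close(read_fd)
            outcome = policy(base_state)   # mutate freely; parent unaffected
            payload = pickle.dumps(outcome)
            os.write(write_fd, struct.pack("I", len(payload)) + payload)
            os._exit(0)
        os.close(write_fd)                 # parent: collect the child's result
        (size,) = struct.unpack("I", os.read(read_fd, 4))
        results.append(pickle.loads(os.read(read_fd, size)))
        os.close(read_fd)
        os.waitpid(pid, 0)
    return results
```

Each rollout sees the pristine snapshot: three policies that each increment a counter in the state all observe a starting value of zero, and the parent's state is untouched afterward.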