
Google releases Gemini 2.5 Computer Use model for low-latency browser & mobile UI control via the Gemini API


09-Oct-2025

Google announced the Gemini 2.5 Computer Use model, a specialized agent that can operate real user interfaces by clicking, typing, scrolling, and filling forms. Built on Gemini 2.5 Pro’s vision and reasoning, it targets tasks that can’t be done with structured APIs alone, such as navigating web apps behind logins or manipulating dropdowns and filters.

Developers access these capabilities through a new `computer_use` tool in the Gemini API (in Google AI Studio and Vertex AI) and run it in a loop: the agent receives a user request, a screenshot, and its action history; proposes an action (e.g., click or type); and then gets a fresh screenshot and URL after execution, repeating until the task completes or is halted. Google reports the model outperforms leading alternatives on multiple web and mobile control benchmarks, including Online-Mind2Web, WebVoyager, and AndroidWorld, while delivering lower latency on Browserbase’s harness.

Safety is addressed through built-in training and opt-in guardrails: a per-step safety service reviews each proposed action, system instructions can require refusal or user confirmation for high-risk steps, and recommended practices prohibit actions like bypassing CAPTCHAs or controlling medical devices.

Early deployments include Google teams using it in production for UI testing (rehabilitating over 60% of previously fragile end-to-end tests), plus external testers building personal assistants and workflow automation. Testimonials highlight faster execution (often ~50% faster) and improved reliability on complex screens. While primarily optimized for browsers, Google says the model shows promising results on mobile UI control and is not yet tuned for desktop OS-level control. Read Google’s full post here: Gemini 2.5 Computer Use model.
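The agent loop described above can be sketched as follows. This is a minimal illustration of the propose/execute/observe cycle only; the function names (`propose_action`, `execute_action`) and data shapes are hypothetical stand-ins, not the actual Gemini API surface, and the model call is stubbed out.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str            # e.g. "click", "type", "scroll", or "done"
    target: str = ""     # element description, or text to type

@dataclass
class AgentState:
    goal: str
    screenshot: bytes    # latest screenshot of the UI
    url: str             # current page URL
    history: list = field(default_factory=list)

def propose_action(state: AgentState) -> Action:
    """Hypothetical stand-in for the model call: given the goal, latest
    screenshot, current URL, and action history, return the next UI action.
    A real implementation would send these to the `computer_use` tool and
    parse the proposed action from the response."""
    if len(state.history) >= 2:          # stub: finish after two steps
        return Action("done")
    return Action("click", f"element-{len(state.history)}")

def execute_action(action: Action, state: AgentState) -> AgentState:
    """Hypothetical stand-in for the client-side executor: perform the
    action in the browser, then capture a fresh screenshot and URL."""
    state.history.append(action)
    state.screenshot = b"<fresh screenshot>"  # placeholder pixels
    return state

def run_agent(goal: str, max_steps: int = 10) -> AgentState:
    """Drive the loop: propose an action, execute it, observe, repeat
    until the model signals completion or the step budget is exhausted."""
    state = AgentState(goal, screenshot=b"<initial screenshot>", url="about:blank")
    for _ in range(max_steps):
        action = propose_action(state)
        if action.kind == "done":        # task complete (or halted)
            break
        state = execute_action(action, state)
    return state
```

A per-step safety review, as the post describes, would slot in between `propose_action` and `execute_action`, rejecting or escalating high-risk actions before they run.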
