Google DeepMind’s Chatbot-Powered Robot Is Part of a Bigger Revolution

In a cluttered open-plan office in Mountain View, California, a tall and slender wheeled robot has been busy playing tour guide and informal office helper—thanks to a large language model upgrade, Google DeepMind revealed today. The robot uses the latest version of Google’s Gemini large language model to both parse commands and find its way around.

When told by a human “Find me somewhere to write,” for instance, the robot dutifully trundles off, leading the person to a pristine whiteboard located somewhere in the building.

Gemini’s ability to handle video and text—in addition to its capacity to ingest large amounts of information in the form of previously recorded video tours of the office—allows the “Google helper” robot to make sense of its environment and navigate correctly when given commands that require some commonsense reasoning. The robot combines Gemini with an algorithm that generates specific actions for the robot to take, such as turning, in response to commands and what it sees in front of it.
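The division of labor described above can be pictured as a two-stage pipeline: a multimodal model chooses a destination from the command, the recorded tour, and the live camera view, and a separate low-level policy turns that destination into motion primitives. The sketch below is a hypothetical illustration of that structure only, with placeholder logic and invented names; it is not DeepMind's code and does not call any real Gemini API.

```python
# Hypothetical two-stage sketch: "plan a goal with a multimodal model,
# then generate low-level actions." All names and logic are stand-ins.

from dataclasses import dataclass


@dataclass
class Goal:
    landmark: str       # e.g. "whiteboard by the east wall"
    heading_deg: float  # which way to face before driving
    distance_m: float   # how far to drive toward the landmark


def plan_goal(command: str, tour_frames: list[str]) -> Goal:
    """Stand-in for the multimodal reasoning step: in the real system, a
    model such as Gemini would match the command against the recorded
    office tour and the live camera view to pick a destination."""
    # Placeholder logic: choose the first tour frame whose label shares
    # a word with the command.
    words = set(command.lower().split())
    for frame_label in tour_frames:
        if words & set(frame_label.lower().split()):
            return Goal(landmark=frame_label, heading_deg=90.0, distance_m=12.0)
    return Goal(landmark="home base", heading_deg=0.0, distance_m=0.0)


def execute(goal: Goal) -> list[str]:
    """Stand-in for the action-generation step: convert the goal into a
    short sequence of motion primitives (turn, drive, stop)."""
    actions = []
    if goal.heading_deg:
        actions.append(f"turn {goal.heading_deg:.0f} degrees")
    if goal.distance_m:
        actions.append(f"drive forward {goal.distance_m:.1f} m")
    actions.append(f"stop at {goal.landmark}")
    return actions


if __name__ == "__main__":
    tour = ["whiteboard by the east wall", "kitchen coffee machine", "lobby couches"]
    goal = plan_goal("Find me somewhere to write on a whiteboard", tour)
    for step in execute(goal):
        print(step)
```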

When Gemini was introduced in December, Demis Hassabis, CEO of Google DeepMind, told WIRED that its multimodal capabilities would likely unlock new robot abilities. He added that the company’s researchers were hard at work testing the robotic potential of the model.

In a new paper outlining the project, the researchers behind the work say that their robot proved to be up to 90 percent reliable at navigating, even when given tricky commands such as “Where did I leave my coaster?” DeepMind’s system “has significantly improved the naturalness of human-robot interaction, and greatly increased the robot usability,” the team writes.

The demo neatly illustrates the potential for large language models to reach into the physical world and do useful work. Gemini and other chatbots mostly operate within the confines of a web browser or app, although they are increasingly able to handle visual and auditory input, as both Google and OpenAI have demonstrated recently. In May, Hassabis showed off an upgraded version of Gemini capable of making sense of an office layout as seen through a smartphone camera.

Academic and industry research labs are racing to see how language models might be used to enhance robots’ abilities. The May program for the International Conference on Robotics and Automation, a popular event for robotics researchers, lists almost two dozen papers that involve the use of vision-language models.

Source: Wired