Demis Hassabis has never been shy about proclaiming big leaps in artificial intelligence. Most notably, he became famous in 2016 after a bot called AlphaGo taught itself to play the complex and subtle board game Go with superhuman skill and ingenuity.
Today, Hassabis says his team at Google has made a bigger step forward—for him, the company, and hopefully the wider field of AI. Gemini, the AI model announced by Google today, he says, opens up an untrodden path in AI that could lead to major new breakthroughs.
“As a neuroscientist as well as a computer scientist, I’ve wanted for years to try and create a kind of new generation of AI models that are inspired by the way we interact and understand the world, through all our senses,” Hassabis told WIRED ahead of the announcement today. Gemini is “a big step towards that kind of model,” he says. Google describes Gemini as “multimodal” because it can process information in the form of text, audio, images, and video.
An initial version of Gemini will be available through Google’s chatbot Bard from today. The company says the most powerful version of the model, Gemini Ultra, will be released next year and outperforms GPT-4, the model behind ChatGPT, on several common benchmarks. Videos released by Google show Gemini solving tasks that involve complex reasoning, as well as examples of the model combining information from text, images, audio, and video.
“Until now, most models have sort of approximated multimodality by training separate modules and then stitching them together,” Hassabis says, in what appeared to be a veiled reference to OpenAI’s technology. “That’s OK for some tasks, but you can’t have this sort of deep complex reasoning in multimodal space.”
OpenAI launched an upgrade to ChatGPT in September that gave the chatbot the ability to take images and audio as input in addition to text. The company has not disclosed how GPT-4 handles these inputs or the technical basis of its multimodal capabilities.
Google has developed and launched Gemini with striking speed compared to previous AI projects at the company, driven by recent concern about the threat that developments from OpenAI and others could pose to Google’s future.
At the end of 2022, Google was seen as the AI leader among large tech companies, with ranks of AI researchers making major contributions to the field. CEO Sundar Pichai had declared his strategy for the company as being “AI first,” and Google had successfully added AI to many of its products, from search to smartphones.
Soon after ChatGPT was launched by OpenAI, a quirky startup with fewer than 800 staff, Google was no longer seen as first in AI. ChatGPT’s ability to answer all manner of questions with cleverness that could seem superhuman raised the prospect of Google’s prized search engine being unseated—especially when Microsoft, an investor in OpenAI, pushed the underlying technology into its own Bing search engine.
Stunned into action, Google hustled to launch Bard, a competitor to ChatGPT, revamped its search engine, and rushed out a new model, PaLM 2, to compete with the one behind ChatGPT. Hassabis was promoted from leading the London-based AI lab created when Google acquired his startup DeepMind to heading a new AI division combining that team with Google’s primary AI research group, Google Brain. In May, at Google’s developer conference, I/O, Pichai announced that the company was training a new, more powerful successor to PaLM called Gemini. He didn’t say so at the time, but the project was named to mark the twinning of Google’s two major AI labs, and in a nod to NASA’s Project Gemini, which paved the way to the Apollo moon landings.
Some seven months later, Gemini is finally here. Hassabis says the new model’s ability to handle different forms of data including and beyond text was a key part of the project’s vision from the outset. Being able to draw on data in different formats is seen by many AI researchers as a key capability of natural intelligence that has largely been lacking from machines.
The large language models behind systems like ChatGPT get their flexibility and power from being built on algorithms that learn from enormous volumes of text data sourced from the web and elsewhere. They can answer questions and spit out poems and striking literary pastiches by replaying and remixing patterns learned from that training data (while also sometimes throwing in “hallucinated” facts).
But although ChatGPT and similar chatbots can use the same trick to discuss or answer questions about the physical world, this apparent understanding can quickly unravel. Many AI experts believe that advancing machine intelligence significantly will require systems with some form of “grounding” in physical reality, perhaps achieved by combining a language model with software that can also see, hear, and perhaps eventually touch.
Hassabis says Google DeepMind is already looking into how Gemini might be combined with robotics to physically interact with the world. “To become truly multimodal, you’d want to include touch and tactile feedback,” he says. “There’s a lot of promise with applying these sort of foundation-type models to robotics, and we’re exploring that heavily.”
Google has already taken baby steps in this direction. In May 2022, the company announced an AI model called Gato capable of learning to do a wide range of tasks, including playing Atari games, captioning images, and using a robotic arm to stack blocks. This July, Google showed off a project called RT-2 that involved using language models to help robots understand and perform actions.
Hassabis says models that are better able to reason about visual information should also be more useful as software agents, or bots that try to get things done using a computer and the internet in a similar way to a person. OpenAI and others are already trying to adapt ChatGPT and similar systems into a new generation of far more capable and useful virtual assistants, but they are currently unreliable.
For AI agents to work dependably, the algorithms powering them need to be a lot smarter. OpenAI is working on a project dubbed Q* that is designed to improve the reasoning abilities of AI models, perhaps using reinforcement learning, the technique at the heart of AlphaGo. Hassabis says his company is doing research along similar lines.
“We have some of the world’s best reinforcement learning experts who invented some of this stuff,” he says. He hopes advances stemming from AlphaGo will help improve planning and reasoning in future models like the one launched today. “We’ve got some interesting innovations we’re working on to bring to future versions of Gemini. You’ll see a lot of rapid advancements next year.”
With Google, OpenAI, and other tech giants racing to speed up the pace of their AI research and deployments, debates about the risks that current and future models could bring have grown louder—including among heads of state. Hassabis was involved with an initiative launched by the UK government earlier this year that led to a declaration warning about the potential dangers of AI and calling for further research and discussion. Tensions around the pace at which OpenAI was commercializing its AI appear to have played a role in a recent boardroom drama that saw CEO Sam Altman briefly deposed.
Hassabis says that well before Google acquired DeepMind in 2014, he and his cofounders Shane Legg and Mustafa Suleyman were already discussing ways to research and mitigate possible risks. “We’ve got some of the best teams in the world looking for bias, toxicity, but other kinds of safety as well,” he says.
Even as Google launches the initial version of Gemini today, work on safety-testing the most powerful version, Ultra, due for launch next year, is still underway. “We’re sort of finalizing those checks and balances, safety and responsibility tests,” Hassabis says. “Then we’ll release early next year.”
Source: Wired