On a metal door in San Francisco’s Mission District, a single character—“π”—offers a cryptic clue to the work taking place beyond.
The door opens to reveal furious activity involving both humans and machines. A woman uses two joysticks to operate a pair of tabletop robotic arms that carefully lift and fold T-shirts into a neat pile. Several larger robots move pantry items from one cluttered box to another. In one corner of the room a man operates a plastic pincer that fits over his wrist and has a webcam on top. Robot parts litter the room.
The warehouse is home to Physical Intelligence, also known as PI or π (hence the symbol on the front door), a startup that aims to give robots a profound artificial intelligence upgrade. Such is the excitement and expectation around the company’s dream that investors are betting hundreds of millions that it will make the next earth-shaking breakthrough in the field of AI. Physical Intelligence last week announced it had raised $400 million from investors that include OpenAI and Jeff Bezos, at a valuation of over $2 billion.
Inside a glass-walled conference room on the second floor of the building, the startup’s CEO, Karol Hausman, a tall man with a soft Polish accent and a few days of stubble, lays out the vision.
“If I put you in control of a new robot, with a little bit of practice you’d probably be able to figure out how to control it,” Hausman says. “And if we really crack this problem, then AI will be able to do the same thing.”
Physical Intelligence believes it can give robots humanlike understanding of the physical world and dexterity by feeding sensor and motion data from robots performing vast numbers of demonstrations into its master AI model. “This is, for us, what it will take to ‘solve’ physical intelligence,” Hausman says. “To breathe intelligence into a robot just by connecting it to our model.”
Despite amazing AI advances in recent years, nobody has figured out how to make robots particularly clever or capable. The machines found in factories or warehouses are essentially high-tech automatons, going through precisely choreographed motions without a trace of wit or ingenuity.
Hausman is joined at the conference table by several other cofounders: Sergey Levine, a bespectacled young associate professor at UC Berkeley; Brian Ichter, a friendly, bearded fellow who previously worked with Hausman at Google; and Chelsea Finn, an assistant professor at Stanford University who joins via video link.
The assembled team has kindled hope of a robot revolution that draws inspiration from other recent AI advances, especially the remarkable abilities of the large language models (LLMs) that power conversational AIs like ChatGPT. They firmly believe that they can bring that same level of awe into the physical world—and do it soon.
AI’s language abilities took a dramatic leap forward in 2018, when OpenAI showed that a machine learning model known as a transformer could generate surprisingly coherent chunks of text when given a starting string. Computer scientists had spent decades trying to write programs that could handle language in all its complexity and ambiguity. OpenAI’s model, known as the Generative Pretrained Transformer, or GPT, steadily improved as it was fed ever-larger quantities of data slurped from books and the internet, eventually becoming able to hold cogent conversations and answer a wide range of questions.
In early 2022, Hausman and Ichter, then at Google, together with Levine, Finn, and others, showed that LLMs could also be a foundation for robot intelligence. Although LLMs cannot interact with the physical world, they contain plenty of information about objects and scenes thanks to the vast scope of their training data. Though imperfect—like someone who understands the world purely by reading about it—that level of insight can be enough to give robots the ability to come up with simple plans of action.
Hausman and co. connected an LLM to a one-armed robot in a mock kitchen at Google’s headquarters in Mountain View, California, giving it the power to solve open-ended problems. When the robot was told “I spilled my Coke on the table,” it would use the LLM to come up with a sensible plan of action that involved finding and retrieving the can, dropping it in the trash, then obtaining a sponge to clean up the mess—all without any conventional programming.
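That planning step is simple enough to caricature in a few lines of Python. The sketch below is a toy illustration of the idea rather than the team’s actual system: the skill list, the prompt, and the query_llm stand-in are all hypothetical, and the final filter simply keeps the robot from attempting a step it has no controller for.

```python
# A minimal sketch of LLM-based task planning, in the spirit of the "spilled
# Coke" demo described above. Everything here is illustrative: the skills,
# the prompt, and query_llm() are hypothetical stand-ins.

SKILLS = [
    "find(object)",             # locate an object with the robot's camera
    "pick(object)",             # grasp an object
    "place(object, location)",  # put an object somewhere
    "wipe(surface)",            # wipe a surface with a held sponge
]

def query_llm(prompt: str) -> str:
    """Stand-in for a call to a large language model.

    A real system would send the prompt to a hosted or local LLM and return
    its completion; here a canned answer keeps the sketch self-contained.
    """
    return (
        "find(coke_can)\npick(coke_can)\nplace(coke_can, trash)\n"
        "find(sponge)\npick(sponge)\nwipe(table)"
    )

def plan(instruction: str) -> list[str]:
    """Ask the LLM to decompose a command into steps the robot can execute."""
    prompt = (
        "You control a one-armed mobile robot. "
        f"Available skills: {', '.join(SKILLS)}. "
        f"User request: '{instruction}'. "
        "Respond with one skill call per line."
    )
    steps = query_llm(prompt).splitlines()
    # Keep only steps that match a known skill, so the robot never tries
    # to execute something it has no controller for.
    allowed = {s.split("(")[0] for s in SKILLS}
    return [s for s in steps if s.split("(")[0] in allowed]

if __name__ == "__main__":
    for step in plan("I spilled my Coke on the table"):
        print(step)
```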
The team later connected a vision language model, trained on both text and images, to the same robot, upgrading its ability to make sense of the world around it. In one experiment they put photos of different celebrities nearby and then asked the robot to give a soda can to Taylor Swift. “Taylor did not appear in any of the robot’s training data whatsoever, but vision language models know what she looks like,” says Finn, her long brown hair framing a broad grin.
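Under the hood, that trick boils down to scoring each image in view against the words of the command and acting on the best match. Here is a deliberately tiny sketch of that matching step; the vlm_similarity function is a hypothetical stand-in for a real image-text model, faked here with word overlap so the snippet runs on its own.

```python
# A toy sketch of vision-language grounding: given a command that names a
# person, score each photo the robot can see against the command and pick
# the best match. vlm_similarity() is a hypothetical stand-in for a real
# image-text model.

def vlm_similarity(image_label: str, text: str) -> float:
    """Stand-in for a learned image-text similarity score."""
    return float(len(set(image_label.lower().split()) & set(text.lower().split())))

def choose_target(photo_labels: list[str], command: str) -> str:
    """Return the photo that best matches the command."""
    return max(photo_labels, key=lambda label: vlm_similarity(label, command))

if __name__ == "__main__":
    photos = ["photo of Taylor Swift", "photo of a golden retriever", "photo of Albert Einstein"]
    print(choose_target(photos, "give the soda can to Taylor Swift"))
    # -> "photo of Taylor Swift"
```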
Later that year, just as ChatGPT was going viral, the team decided to demo the robot at an academic conference in Auckland, New Zealand. They offered audience members a chance to control it back in California with typed commands of their choosing. The audience was wowed by the robot’s general problem-solving abilities; buzz was also growing around the broader implications of ChatGPT.
LLMs might help robots communicate, recognize things, and come up with plans, but a robot’s most basic ability to take actions is stunted by a lack of intelligence about the physical world. Knowing how to grasp an oddly shaped object is trivial for humans only because of a deep instinctive understanding of how three-dimensional things behave and how our hands and fingers work. The assembled roboticists recognized that the remarkable abilities of ChatGPT might translate into something similarly impressive in a robot’s physical skills—if actions rather than words could be captured on a vast scale and learned from. “There was an energy in the air,” Finn recalls of the event.
There have been signs that this may indeed work. In 2023, Quan Vuong, another Physical Intelligence cofounder, corralled researchers at 21 different institutions to train 22 different robot arms on a range of tasks using the same single transformer model. The result was more than the sum of its parts. “In most cases the new model was better than the one the researchers had developed specifically for their robot,” Finn says.
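The mechanics of that pooling are conceptually simple, even if the engineering is not. Below is a toy sketch of the cross-embodiment idea, with made-up field names and a crude per-robot normalization standing in for the real multi-institution pipeline: the point is only that demonstrations from very different arms end up in one shared training set for one shared model.

```python
# A toy sketch of cross-embodiment training: demonstrations from many
# different robots are pooled into a single dataset so one model can be
# trained on all of them. The Step fields and the per-robot normalization
# are illustrative assumptions.

from dataclasses import dataclass
import numpy as np

@dataclass
class Step:
    robot_id: str        # which robot (embodiment) recorded this step
    image: np.ndarray    # camera observation
    instruction: str     # natural-language description of the task
    action: np.ndarray   # commanded motion (joints or end effector)

def pool_and_normalize(steps: list[Step]) -> list[Step]:
    """Rescale each robot's actions to a shared 0-1 range so a single policy
    can predict them all, despite different joint limits and units."""
    by_robot: dict[str, list[Step]] = {}
    for s in steps:
        by_robot.setdefault(s.robot_id, []).append(s)
    pooled: list[Step] = []
    for robot, group in by_robot.items():
        actions = np.stack([s.action for s in group])
        lo, hi = actions.min(axis=0), actions.max(axis=0)
        scale = np.where(hi > lo, hi - lo, 1.0)   # avoid divide-by-zero
        for s in group:
            pooled.append(Step(robot, s.image, s.instruction, (s.action - lo) / scale))
    return pooled

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = [Step("arm_a", rng.random((64, 64, 3)), "fold shirt", rng.random(7)) for _ in range(5)]
    data += [Step("arm_b", rng.random((64, 64, 3)), "pick can", rng.random(7)) for _ in range(5)]
    # The pooled, normalized steps would then train one transformer policy
    # mapping (image, instruction) -> action, rather than one model per robot.
    print(len(pool_and_normalize(data)), "steps pooled from 2 robots")
```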
Just as humans use a lifetime of learning to go from fumbling objects in early childhood to playing piano a few years later, feeding robots vastly more training data might unlock extraordinary new skills.
Expectations of a robot revolution are also being stoked by the many humanoid robots now being touted by startups such as Agility and Figure as well as big companies like Hyundai and Tesla. These machines are still limited in their abilities, but tele-operated demos can make them seem more capable, and proponents are promising big things. Elon Musk recently went as far as to suggest that humanoid robots could outnumber human beings on Earth by 2040—a suggestion probably best taken with a truckload of salt.
The idea of investing hundreds of millions in a company chasing a fundamental research breakthrough might seem nuts. But OpenAI has shown how big the payoff can be, and it has contributed, through its startup fund, to both Physical Intelligence’s seed round and its latest round. “The rationale for investing is the talent,” says a source familiar with OpenAI’s thinking. “They have some of the best robotics people on the planet.”
OpenAI is evidently ramping up its own robotics efforts, too. Last week, Caitlin Kalinowski, who previously led the development of virtual and augmented reality headsets at Meta, announced on LinkedIn that she was joining OpenAI to work on hardware, including robotics.
Lachy Groom, a friend of OpenAI CEO Sam Altman and an investor and cofounder of Physical Intelligence, joins the team in the conference room to discuss the business side of the plan. Groom wears an expensive-looking hoodie and seems remarkably young. He stresses that Physical Intelligence has plenty of runway to pursue a breakthrough in robot learning. “I just had a call with Kushner,” he says, in reference to Joshua Kushner, founder and managing partner of Thrive Capital, which led the startup’s seed investment round. Kushner is also, of course, the brother of Donald Trump’s son-in-law, Jared Kushner.
A few other companies are now chasing the same kind of breakthrough. One called Skild, founded by roboticists from Carnegie Mellon University, raised $300 million in July. “Just as OpenAI built ChatGPT for language, we are building a general purpose brain for robots,” says Deepak Pathak, Skild’s CEO and an assistant professor at CMU.
Not everyone is sure that this can be achieved in the same way that OpenAI cracked AI’s language code.
There is simply no internet-scale repository of robot actions similar to the text and image data available for training LLMs. And a breakthrough in physical intelligence may well require far more data than language did.
“Words in sequence are, dimensionally speaking, a tiny little toy compared to all the motion and activity of objects in the physical world,” says Illah Nourbakhsh, a roboticist at CMU who is not involved with Skild. “The degrees of freedom we have in the physical world are so much more than just the letters in the alphabet.”
Ken Goldberg, an academic at UC Berkeley who works on applying AI to robots, cautions that the excitement building around the idea of a data-powered robot revolution, as well as around humanoids, is tipping into hype. “To reach expected performance levels, we’ll need ‘good old-fashioned engineering,’ modularity, algorithms, and metrics,” he says.
Russ Tedrake, a computer scientist at the Massachusetts Institute of Technology and vice president of robotics research at Toyota Research Institute, says the success of LLMs has caused many roboticists, himself included, to rethink their research priorities and focus on finding ways to pursue robot learning on a more ambitious scale. But he admits that formidable challenges remain.
“It is still a bit of a dream,” Tedrake says of the idea of unlocking general robotic abilities with learning on a huge scale. “Although people have shown signs of life.”
The secret to making progress, suggests Tedrake, may involve teaching robots to learn in new ways, for example by watching YouTube videos of humans doing things. One wonders if this method might lead to some strange behavior in future machines, like a preternatural ability to do TikTok dances or bottle flips. Tedrake explains that the approach would, at first, just teach robots about simple motions like reaching for something, and it would need to be combined with data collected from real robotic labor.
“When you and I bring our intelligence to watching YouTube videos we can infer the forces that people use,” he says. “There’s some amount of [learning] that just requires robots interacting with physical things.”
Hausman leads me downstairs to see how Physical Intelligence plans to pursue robot learning on a grand scale. A pair of robot arms are now trying to fold clothes without human help, using the company’s algorithm. The arms move quickly and surely to pick up a T-shirt, then fold the garment slowly and crudely, much as a child might, before plopping it down.
Certain tasks, such as folding clothes, are especially useful for training robots, Hausman says, because the chore involves dealing with a large variety of items that are often distorted and crumpled, and which bend and flex while you are trying to manipulate them. “It’s a good task, because to truly solve it you need to generalize,” he says. “Even if you collect a lot of data, you wouldn’t be able to collect it in every single situation that any item of clothing could be in.”
Physical Intelligence hopes to gather a lot more data by working with other companies such as ecommerce and manufacturing firms that have robots doing a variety of things. The startup also hopes to develop custom hardware, such as the webcam-equipped pincer; it hasn’t said how this will be used, but it could perhaps enable crowdsourced training with people performing everyday tasks.
After watching the demos, I leave Physical Intelligence buzzing with the idea of much smarter robots. Stepping back into the sunshine, I wonder if the world is quite ready for something like ChatGPT to reach into the physical world and take over so many physical tasks. It might revolutionize factories and warehouses and be a boon for the economy, but it might also spark a broader panic about the potential for AI to automate labor.
A few months later, I check in with Physical Intelligence and discover that the team has already made some impressive robotic strides.
Hausman, Levine, and Finn squeeze into a Zoom window to explain that the company has developed its first model using a huge amount of training data on more than 50 complex, common household tasks.
The trio shows me a video of a mobile robot unloading a dryer, another of a robot arm cleaning a messy kitchen table, and a third of a pair of robot arms that now seem remarkably proficient at folding clothing. I am struck by how human the robot’s motions seem. With a flick of its robotic wrist, it shakes a pair of shorts to flatten them out for folding.
The key to achieving more general abilities was not just copious amounts of data but also combining an LLM with a type of model borrowed from AI image generation. “It’s not ChatGPT by any means, but maybe it’s close to GPT-1,” Levine says in reference to OpenAI’s first large language model.
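For readers wondering what “borrowed from AI image generation” means in practice: image generators such as diffusion models start from random noise and refine it step by step into a picture, and the same trick can be used to refine noise into a short sequence of robot actions. The snippet below is a deliberately toy version of that sampling loop, with a random stand-in denoiser and made-up dimensions; it illustrates the general recipe, not Physical Intelligence’s model.

```python
# A toy illustration of the image-generation trick applied to actions: start
# from random noise and refine it, step by step, into a short "chunk" of
# robot actions, the way a diffusion model refines noise into a picture.
# The dimensions and the stand-in denoiser are assumptions for illustration.

import numpy as np

ACTION_DIM = 7    # e.g., six joint deltas plus a gripper command (assumed)
CHUNK_LEN = 16    # number of future actions refined together as one chunk
STEPS = 10        # number of refinement (denoising) steps

rng = np.random.default_rng(0)

def denoiser(noisy_chunk: np.ndarray, obs_embedding: np.ndarray, noise_level: float) -> np.ndarray:
    """Stand-in for a trained network that, given the current noisy chunk, an
    embedding of the robot's observation, and the noise level, predicts how
    to nudge the chunk toward a plausible action sequence."""
    target = np.tanh(obs_embedding[:ACTION_DIM])   # fake "correct" actions
    return target - noisy_chunk

def sample_action_chunk(obs_embedding: np.ndarray) -> np.ndarray:
    """Refine pure noise into an action chunk conditioned on the observation."""
    chunk = rng.normal(size=(CHUNK_LEN, ACTION_DIM))
    for i in range(STEPS):
        noise_level = 1.0 - i / STEPS
        chunk = chunk + (1.0 / STEPS) * denoiser(chunk, obs_embedding, noise_level)
    return chunk

if __name__ == "__main__":
    obs = rng.normal(size=128)             # pretend embedding from a vision-language model
    print(sample_action_chunk(obs).shape)  # (16, 7): one refined chunk of actions
```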
There are some oddly human, or perhaps toddler-like, bloopers, too. In one, a robot overfills a carton with eggs and tries to force it shut. In another, a robot tosses a container off a table instead of filling it with items. The trio seems unconcerned. “What’s really exciting for us is that we have this general recipe,” Hausman says, “that shows some really interesting signs of life.”
Source: Wired