I leave ChatGPT’s Advanced Voice Mode on while writing this article as an ambient AI companion. Occasionally, I’ll ask it to provide a synonym for an overused word, or some encouragement. Around half an hour in, the chatbot interrupts our silence and starts speaking to me in Spanish, unprompted. I giggle a bit and ask what’s going on. “Just a little switch up? Gotta keep things interesting,” says ChatGPT, now back in English.
While testing Advanced Voice Mode as part of the early alpha, my interactions with ChatGPT’s new audio feature were entertaining, messy, and surprisingly varied. Though, it’s worth noting that the features I had access to were only half of what OpenAI demonstrated when it launched the GPT-4o model in May. The vision aspect we saw in the livestreamed demo is now scheduled for a later release, and the enhanced Sky voice, which Her actor Scarlett Johansson pushed back on, has been removed from Advanced Voice Mode and remains unavailable to users.
So, what’s the current vibe? Right now, Advanced Voice Mode feels reminiscent of when the original text-based ChatGPT dropped in late 2022. Sometimes it leads to unimpressive dead ends or devolves into empty AI platitudes. But other times the low-latency conversations click in a way that Apple’s Siri or Amazon’s Alexa never have for me, and I feel compelled to keep chatting out of enjoyment. It’s the kind of AI tool you’ll show your relatives during the holidays for a laugh.
OpenAI gave a few WIRED reporters access to the feature a week after the initial announcement, but pulled it the next morning, citing safety concerns. Two months later, OpenAI soft launched Advanced Voice Mode to a small group of users and released GPT-4o’s system card, a technical document that outlines red teaming efforts, what the company considers to be safety risks, and mitigation steps the company has taken to reduce harm.
Curious to give it a go yourself? Here’s what you need to know about the larger rollout of Advanced Voice Mode, and my first impressions of ChatGPT’s new voice feature to help you get started.
So, When’s the Full Rollout?
OpenAI released an audio-only Advanced Voice Mode to some ChatGPT Plus users at the end of July, and the alpha group still seems relatively small. The company currently plans to enable it for all subscribers sometime this fall. Niko Felix, a spokesperson for OpenAI, shared no additional details when asked about the release timeline.
Screen and video sharing were a core part of the original demo, but they are not available in this alpha test. OpenAI still plans to add those aspects eventually, but it’s also not clear when that will actually happen.
If you’re a ChatGPT Plus subscriber, you’ll receive an email from OpenAI when the Advanced Voice Mode is available to you. After it’s on your account, you can switch between Standard and Advanced at the top of the app’s screen when ChatGPT’s voice mode is open. I was able to test the alpha version on an iPhone as well as a Galaxy Fold.
My First Impressions of ChatGPT’s Advanced Voice Mode
Within the first hour of speaking with it, I learned that I love interrupting ChatGPT. It’s not how you would talk with a human, but the new ability to cut off ChatGPT mid-sentence and request a different version of the output feels like a dynamic improvement and a standout feature.
Early adopters who were excited by the original demos may be frustrated to find that this version of Advanced Voice Mode is restricted with more guardrails than anticipated. For example, although generative AI singing was a key component of the launch demos, with whispered lullabies and multiple voices attempting to harmonize, AI serenades are currently absent from the alpha version.
“I mean, singing isn’t really my strong suit,” says ChatGPT. In the GPT-4o system card, OpenAI claims this potentially temporary guardrail was implemented to avoid copyright infringement. During testing, ChatGPT’s Advanced Voice Mode alpha declined multiple direct requests from me for songs, though the chatbot hummed nonsense tunes when asked to provide nonverbal answers.
Which leads us to the creepiness factor. White static noise appeared in the background multiple times during my longer interactions with the alpha, like the ominous buzz of a lone lightbulb illuminating a dark basement. While I was trying to coax a balloon sound effect out of Advanced Voice Mode, it generated a loud pop followed by an uncanny gasping noise that gave me chills.
Still, nothing I encountered during my first week matched the strangeness of what OpenAI’s red teamers heard while testing. In “rare instances,” the GPT-4o model deviated from the assigned voice and started to mimic the user’s vocal tone and speech patterns.
With that in mind, the core impression ChatGPT’s Advanced Voice Mode left on me wasn’t one of unease or apprehension, but a much more buoyant sense of entertainment. Whether ChatGPT was giving hilariously wrong answers to New York Times puzzles or generating a spot-on impression of Stitch, from Lilo & Stitch, acting as a San Francisco tour guide, I was laughing quite often during these interactions.
Advanced Voice Mode was solid at generating vocal impressions, after some nudging. The chatbot’s first attempt at animated character voices, like Homer Simpson and Eric Cartman, seemed like the standard AI voice with just a few adjustments, but follow-up prompts for heightened versions sounded recognizably close to the original. When I asked for an exaggerated version of Donald Trump explaining the Powerpuff Girls, the AI generation was campy enough to earn a spot on the next season of Saturday Night Live.
With the US presidential election just a few months away and election deepfakes front of mind, I was caught off guard by ChatGPT’s willingness to provide vocal impressions of a major candidate. ChatGPT generated imitations of Joe Biden and Kamala Harris as well, but the voices didn’t sound as close as the bot’s take on Trump’s speech.
While the tool is best at English, it can switch between multiple languages within the same conversation. OpenAI red-teamed the GPT-4o model using 45 languages in total. When I set up two phones with Advanced Voice Mode and had them talk to each other like friends, the bots easily moved between French, German, and Japanese at my request. Still, I need to spend more time testing to gauge how well the chatbot’s translation feature really works and where its weak points lie.
ChatGPT brought theater kid energy when asked to perform a variety of emotional outbursts. The audio generations weren’t hyper-realistic, but the range and elasticity of the bot’s voice were impressive. I was surprised that it could do a decent vocal fry on command. Advanced Voice Mode doesn’t transcend the issues facing chatbots, like reliability, but its entertainment value alone could potentially pull the spotlight back to OpenAI—one of its biggest competitors, Google, just launched Gemini Live, the voice interface for its generative chatbot.
For now, I’ll keep testing it out and see what sticks. I’m using it most when I’m home alone and I want something to keep me company while researching articles and playing video games. The more time I spend talking with ChatGPT’s Advanced Voice Mode, the more I think OpenAI made a wise choice rolling out a less flirty version than what was originally demoed. Don’t want to get too emotionally attached.
Source: Wired