OpenAI's voice AI models have previously gotten the company into trouble with actor Scarlett Johansson, but that isn't stopping it from continuing to advance its offerings in this category.
Today, the ChatGPT maker unveiled three all-new proprietary voice models: gpt-4o-transcribe, gpt-4o-mini-transcribe and gpt-4o-mini-tts. They are initially available through its application programming interface (API) for third-party software developers to build their own applications, as well as on a custom demo site, OpenAI.fm, where individual users can try them out for limited testing and fun.
Additionally, the gpt-4o-mini-tts model's voice can be customized from several presets via text prompts to change its accent, pitch, tone and other vocal qualities, including conveying whatever emotions the user asks of it. That should go a long way toward addressing concerns that the model mimics any particular person's voice (the company previously denied that was the case with Johansson, but pulled the soundalike voice option anyway). Now it's up to users to decide how they want their AI voice to sound when it speaks back.
In a demo delivered to VentureBeat over video call, OpenAI technical staff member Jeff Harris showed how, using text alone on the demo site, users could get the same voice to sound like a cackling mad scientist or a zen, calm yoga teacher.
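For developers, that same steerability is exposed through the API's speech endpoint. Below is a minimal sketch using the official OpenAI Python SDK; the input text, steering instructions, preset voice name and output file path are illustrative placeholders rather than anything prescribed by OpenAI.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Generate speech with gpt-4o-mini-tts, steering delivery via a free-text instruction.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the preset voices
    input="Your order has shipped and should arrive within two days.",
    instructions="Speak like a calm, reassuring yoga teacher.",  # illustrative steering prompt
) as response:
    response.stream_to_file("reply.mp3")
```

Swapping the instruction for something like "speak like a cackling mad scientist" changes only the delivery, not the words, which is the point of the new steerability.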
Discovering and refining new capabilities on a GPT-4o base
The models are variants of the existing GPT-4o model OpenAI launched in May 2024, which currently powers the ChatGPT text and voice experience for many users. The company took that base model and post-trained it with additional data to make it excel at transcription and speech. It has not specified when the models might come to ChatGPT.
“ChatGPT has slightly different requirements in terms of cost and performance trade-offs, so while I expect they will move to these models in time, for now, this launch is focused on API users,” Harris said.
The transcription models aim to supersede OpenAI's two-year-old Whisper open-source speech-to-text model, offering lower word error rates across industry benchmarks and improved performance in noisy environments, with diverse accents and at varying speech speeds.
The company published a chart on its website showing how much lower the gpt-4o-transcribe models' word error rates are across 33 languages compared to Whisper, including an impressively low 2.46% in English.

“These models include noise cancellation and a semantic voice activity detector, which helps determine when a speaker has finished a thought, improving transcription accuracy,” Harris said.
Harris told VentureBeat that the new gpt-4o-transcribe model family is not designed to offer “diarization,” the capability to label and differentiate between different speakers. Instead, it is designed primarily to receive one voice (or possibly several voices) as a single input channel and respond to all of that input with a single output voice in the interaction, however long it takes.
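For reference, here is a minimal sketch of calling the new transcription model through the OpenAI Python SDK; the audio file name is a placeholder.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe a single-channel audio file with the new gpt-4o-transcribe model.
with open("support_call.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
```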
The company is also hosting a competition for the general public to find the most creative examples of using its demo voice site OpenAI.fm and share them online by tagging the @openai account on X. The winner will receive a custom Teenage Engineering radio with the OpenAI logo, which OpenAI head of product, platform, Olivier Godement said is one of only three in the world.
An audio applications gold mine
These enhancements make the models particularly well-suited for applications such as customer call centers, meeting note transcription and AI-powered assistants.
Impressively, the company's new Agents SDK, launched last week, also allows developers who have already built applications atop its text-based large language models, such as the regular GPT-4o, to add fluid voice interactions with only about “nine lines of code.”
For example, an e-commerce app built atop GPT-4o could now respond to turn-based user questions like “Tell me about my last order” in speech, with just seconds of code tweaking to add these new models.
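The exact nine lines weren't shown to VentureBeat, but the pattern looks roughly like the sketch below, based on the voice extension of the OpenAI Agents SDK (`agents.voice`). Class names and event types follow OpenAI's published quickstart and may differ by SDK version; the agent name, instructions and silent audio buffer are placeholders standing in for a real app and real microphone input.

```python
import asyncio

import numpy as np
from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

# An existing text-based agent, e.g. one answering e-commerce order questions.
agent = Agent(
    name="OrderAssistant",
    instructions="Answer questions about the customer's recent orders.",
)

async def main() -> None:
    # Wrap the text agent in a voice pipeline: speech in, speech out.
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

    # Placeholder: three seconds of silence at 24 kHz stands in for captured audio.
    audio_input = AudioInput(buffer=np.zeros(24000 * 3, dtype=np.int16))

    result = await pipeline.run(audio_input)
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            pass  # send event.data (PCM audio) to a speaker or output stream

asyncio.run(main())
```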
“For the first time, we're introducing streaming speech-to-text, allowing developers to continuously input audio and receive a real-time text stream, making conversations feel more natural,” Harris said.
Still, for developers looking for low-latency, real-time AI voice experiences, OpenAI recommends using its speech-to-speech models in the Realtime API.
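A minimal sketch of the streaming transcription flow with the OpenAI Python SDK is below; it assumes the `stream=True` flag and `transcript.text.delta` event type described in OpenAI's documentation, and the file name is a placeholder.

```python
from openai import OpenAI

client = OpenAI()

# Stream partial transcripts as the audio is processed, instead of waiting for the full result.
with open("meeting.wav", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio_file,
        response_format="text",
        stream=True,
    )

    for event in stream:
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)  # incremental text as it arrives
```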
Pricing and availability
The new models are available immediately through OpenAI's API, with the following pricing:
• gpt-4o-transcribe: $6.00 per 1M audio input tokens (approximately $0.006 per minute)
• gpt-4o-mini-transcribe: $3.00 per 1M audio input tokens (approximately $0.003 per minute)
• gpt-4o-mini-tts: $0.60 per 1M text input tokens and $12.00 per 1M audio output tokens (approximately $0.015 per minute)
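As a rough back-of-the-envelope illustration of what those per-minute approximations imply (actual billing is per token, so real costs will vary with the audio), a one-hour recording transcribed with gpt-4o-transcribe works out to roughly $0.36:

```python
# Back-of-the-envelope estimates from the published per-minute approximations.
PER_MINUTE = {
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
    "gpt-4o-mini-tts": 0.015,  # audio output
}

minutes = 60  # e.g. a one-hour customer call
for model, rate in PER_MINUTE.items():
    print(f"{model}: ~${minutes * rate:.2f} for {minutes} minutes")
```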
However, the models arrive at a time of fiercer-than-ever competition in the AI transcription and speech space. Dedicated speech AI firms such as ElevenLabs offer a new Scribe model that supports diarization and boasts a similarly low (though not as low) 3.3% English error rate, priced at $0.40 per hour of input audio (or $0.006 per minute, roughly equivalent).
Another startup, Hume AI, offers a new model, Octave TTS, with sentence-level and even word-level customization of pronunciation and emotional inflection, based entirely on the user's instructions rather than any preset voices. Octave TTS's pricing isn't directly comparable, but it includes a free tier offering 10 minutes of audio, with costs scaling up from there.
Meanwhile, more advanced audio and speech models are also emerging in the open-source community, including one called Orpheus 3B, which is available under a permissive Apache 2.0 license, meaning developers don't have to pay anything to run it, provided they have the right hardware or cloud servers.
Industry adoption and early results
According to testimonials shared by OpenAI with VentureBeat, several companies have already integrated its new audio models into their platforms, reporting significant improvements in voice AI performance.
EliseAI, a company focused on property management automation, found that OpenAI's text-to-speech model enabled more natural and emotionally rich interactions with tenants.
The enhanced voices made its AI-powered leasing, maintenance and tour scheduling more engaging, leading to higher tenant satisfaction and improved call resolution rates.
Decagon, which builds AI-powered voice experiences, saw a 30% improvement in transcription accuracy using OpenAI's speech recognition model.
That increase in accuracy has allowed Decagon's AI agents to perform more reliably in real-world scenarios, even in noisy environments. The integration process was fast, with Decagon incorporating the new model into its systems within a day.
Not all reactions to OpenAI's latest release have been warm. Ben Hylak (@benhylak), co-founder of AI app analytics software Dawn AI and a former Apple human interfaces designer, posted on X that while the models seem promising, the announcement “feels like a retreat from real-time voice,” suggesting a shift away from OpenAI's previous focus on low-latency conversational AI via ChatGPT.
In addition, the launch was preceded by an early leak on X (formerly Twitter). TestingCatalog News (@testingcatalog) posted details about the new models minutes before the official announcement, listing the names gpt-4o-mini-tts, gpt-4o-transcribe and gpt-4o-mini-transcribe. The leak was credited to @StivenTheDev, and the post quickly gained traction.
Looking ahead, OpenAI plans to continue refining its audio models and is exploring custom voice capabilities while ensuring safe and responsible AI use. Beyond audio, OpenAI is also investing in multimodal AI, including video, to enable more dynamic and interactive agent-based experiences.