As AI chatbots and art generators become more popular by the minute, some of the industry’s most prominent players are attempting to stay competitive with their own tools.
Meta’s AI chatbots
Meta recently introduced Voicebox, a text-guided, artificially intelligent speech generator that the company claims outperforms all existing models.
Voicebox is powerful enough to generate voices as easily as ChatGPT and Bing generate text or Dall-E 2 generates images.
Though the system isn’t yet widely available to the public, Meta has made demos available to anyone who wants to learn more about Voicebox.
The system, which generates natural-sounding audio clips, could be used in audio editing by content creators and editors, for example.
It is also versatile enough to intelligently remove noise from voice clips, such as dogs barking, and regenerate the affected speech without skipping a beat.
One of the features offered by Voicebox is the ability to match the audio style of a sample and generate text-to-speech clips.
For example, visually impaired users could give Voicebox a two-second audio clip of a friend, and it would be able to read that friend’s written messages aloud in the friend’s voice.
The new generative AI tool can solve tasks through in-context learning. This means it can process text it has never seen before and generate speech with appropriate context and inflection, much as a human reader would, by drawing on existing knowledge to tackle new challenges.
The ethical and legal implications of this game-changing tool are difficult to dismiss. Without permission, anyone could use recordings of a person’s voice to create audio clips that make that person appear to say whatever the creator wanted.
Meta claims in the paper that a binary classification model can distinguish between real-world speech and that generated by Voicebox. In any case, because the system is not publicly accessible, Meta’s metaphorical feet have yet to be burned.
For optimal performance, Meta trained Voicebox on 60,000 hours of English audiobooks and 50,000 hours of multilingual audiobooks in six languages.
That training allows it to perform multilingual text-to-speech without task-specific training, as well as speech denoising, style conversion, editing, and generation of varied speech samples.
Meta AI claims in a paper that Voicebox can generate diverse audio samples up to 20 times faster, and more intelligibly, than Microsoft’s VALL-E.
Meta claims Voicebox can convert written text into spoken words in one or more languages without being specifically trained for each language separately.
In cross-lingual style transfer, Voicebox reduced the average word error rate from 10.9% to 5.2% and increased audio similarity from 0.335 to 0.481 compared with the previous state-of-the-art model, YourTTS.
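To put those figures in perspective, the short Python sketch below shows how a word error rate is commonly computed and what the reported numbers mean in relative terms. It is an illustration only, not Meta's evaluation code: the function names `word_error_rate` and `relative_change` and the sample sentences are assumptions made for this example. It uses the usual definition of word error rate, which is the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; not taken from Meta's paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_change(old: float, new: float) -> float:
    """Fractional change from old to new (negative means a decrease)."""
    return (new - old) / old


# One wrong word out of four gives a word error rate of 25%.
print(word_error_rate("the dog barked loudly", "the dog parked loudly"))

# Figures reported in the article, expressed as relative changes in percent.
print(round(relative_change(10.9, 5.2) * 100, 1))     # word error rate
print(round(relative_change(0.335, 0.481) * 100, 1))  # audio similarity
```

Read this way, the reported drop from 10.9% to 5.2% is a cut of roughly half in word errors, and the similarity score rises by a little over 40 percent relative to YourTTS.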
