Most building blocks of Artificial Intelligence are increasingly plug-and-play. This means they are accessible to anyone with a basic knowledge of programming (mainly in Python). This is one of the recent revolutions in the field that I will never stop emphasizing. And if a company makes a product out of one of these building blocks, and you can bet one will sooner or later, you don’t even need these skills.
This also applies to voice cloning, i.e. the process of using a computer to generate the voice of a real individual. What is amazing (and potentially scary, see below) is that the machine learning technology on which voice cloning relies is becoming trivial to use and accessible to everyone. You can install user-friendly libraries to run the technology on your machine or to integrate it into your products. If you do not have basic coding skills, you can look for a start-up that offers the service as a paid product. It’s not quite as fun as owning and controlling the technology yourself, but it’s okay too.
Not only is the technology available to everyone, but it also needs only a small amount of data to work its magic. Why? Today, for almost any NLP task, it’s common to reuse a general model (trained on lots of data) and fine-tune it for a specific task. In this case, you take a general text-to-speech model and refine it to teach it the features of your voice. You don’t want to reinvent the wheel every time!
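To make the idea of fine-tuning concrete, here is a toy sketch in plain Python. It is not the tool or model used for the voice clone described in this post, and every name in it (`pretrained_features`, `GENERAL_HEAD`, `fine_tune`, the target numbers) is hypothetical. It assumes a "pretrained" part that stays frozen and a small set of weights that starts from general values and gets adjusted with only 25 examples, which is roughly the spirit of adapting a general text-to-speech model to one voice.

```python
def pretrained_features(x):
    """Stand-in for a frozen, pretrained model: fixed features, never updated."""
    return [1.0, x, x * x]


def predict(head, x):
    """Combine the frozen features with the small trainable head."""
    return sum(w * f for w, f in zip(head, pretrained_features(x)))


def mse(head, samples):
    """Mean squared error of the head on (input, target) pairs."""
    return sum((predict(head, x) - y) ** 2 for x, y in samples) / len(samples)


def fine_tune(head, samples, lr=0.5, epochs=500):
    """Adjust only the head with plain gradient descent on the MSE."""
    head = list(head)  # do not modify the caller's general weights
    n = len(samples)
    for _ in range(epochs):
        grads = [0.0] * len(head)
        for x, y in samples:
            err = predict(head, x) - y
            for i, f in enumerate(pretrained_features(x)):
                grads[i] += 2.0 * err * f / n
        head = [w - lr * g for w, g in zip(head, grads)]
    return head


# Hypothetical "general" weights, as if learned earlier from lots of data.
GENERAL_HEAD = [0.0, 1.0, 0.5]

# Hypothetical target behaviour standing in for "my voice" (noise-free toy data).
def my_voice(x):
    return 0.3 + 0.8 * x + 0.7 * x * x

# Only 25 examples, evenly spaced in [-1, 1].
samples = [(x, my_voice(x)) for x in (-1.0 + 2.0 * i / 24 for i in range(25))]

tuned_head = fine_tune(GENERAL_HEAD, samples)
error_before = mse(GENERAL_HEAD, samples)
error_after = mse(tuned_head, samples)
```

Comparing `error_before` with `error_after` shows the point: starting from general weights, a handful of examples is enough to move the model toward the new target. A real voice-cloning setup works on audio with a far larger network, but the reuse-then-adjust pattern is the same.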
So I wanted to try this process firsthand. It only took 20-30 sentences read aloud to produce a clone of my voice. I recorded them with an ordinary built-in microphone. The results could probably have been improved by using high-quality recordings and increasing the number of sentences (although I haven’t had the time to experiment and figure out how many sentences would maximize quality). The whole process took me less than 15 minutes.
After fine-tuning the model, I asked it to read aloud an old tweet of mine and recorded the result. My English pronunciation is poor, and the model did a good job of recreating this feature as well. You can listen to the generated voice at the link above.
A company recently experimented with combining voice cloning and automatic text generation. The result is a funny podcast between Joe Rogan and Steve Jobs.
You can do fantastic things with this technology, such as easily creating synthesized voices for a brand, a game, etc., but it can also be used to create deepfakes (see info here about audio deepfakes). The malicious use of deepfakes may become an issue in the near future, and we need to raise awareness of this phenomenon among the general public.