This is the transcript of the interview with Christine Lindeloo translated into English. The interviewers were Jente Kater and Susanne van Nierop representing Dag van de Audio.
Dag van de Audio: We have Christine Lindeloo from Vocoda.ai here with us. The "VO" is already in there, so you might guess it has something to do with voice. Can you explain exactly what you do?
Christine Lindeloo: Sure! At Vocoda.ai, we actually build our own software. Using AI, we can record and digitize voices, making it possible to have various language models where voices are available and can be used with the help of AI.
This allows us to apply a voice to spoken words, or have a digitized voice read out text, using text-to-speech or other speech methodologies.
Dag van de Audio: Many voice AI services you see now are, if I may say so, just a fancy layer over an API. But you are genuinely building your own software.
Christine Lindeloo: Correct.

Dag van de Audio: That seems like an enormous task. Where are you currently at?

Christine Lindeloo: Right now, we’ve reached the point where we’re deploying our own AI models in actual client projects. So, we can create natural voices and use them.

Dag van de Audio: What kind of projects are those? And who are your clients?

Christine Lindeloo: Well, one of our big clients is Booking.com. They produce a lot of ads and performance marketing content for many different markets. They face the challenge of needing to be available in the languages of those markets. To scale quickly and be flexible, they work with us.
Dag van de Audio: So, you have a voice in English, for example, and you can create versions in German and French in no time.
Christine Lindeloo: Exactly. They’ve chosen a brand voice – one they feel is the voice of Booking – and they want that voice to be heard in different languages and markets where they are present. So, they work with us to scale that, making one voice available in multiple languages.

Dag van de Audio: Can you tell us more about the process of training such a model? I visited you once, and you had just trained Ernest Hemingway’s voice, with permission from his estate, so that it could even sound like him in Chinese. But that was based on a small fragment, right?

Christine Lindeloo: Yes, that’s right.
There are different methodologies for using AI voices. One way is to have a model trained on a lot of spoken word data in a specific language, like Dutch. We work directly with Dutch voice actors to make the model speak Dutch as well as possible, and in a style relevant to the project.
The voice actor visits us and gets briefed on where the voice will be used, and the scope of its application. They also receive compensation for this. Then, we fine-tune the model based on that actor’s voice. For a text-to-speech method, we need about four hours of recorded audio from the actor.
For speech-to-speech, like with Ernest Hemingway, we use a "guide voice" to read the script, and then we overlay the characteristics of Hemingway’s voice so that it sounds like him, but in Mandarin.
Dag van de Audio: Are there other ways this process works?

Christine Lindeloo: A voice actor can come into the studio to record, and then you can layer another voice on top. But you could also imagine you and I recording something and suddenly sounding like someone else.

Dag van de Audio: With clients like Booking, the potential is clear because you can personalize quickly and switch between languages. What else is possible? Do you get other types of requests from brands?

Christine Lindeloo: Yes, we get requests for scaling voices across markets. Sometimes we get asked to provide male voices in one market and female voices in another. We’ve also been asked for gender-neutral voices, which AI makes possible.

Dag van de Audio: How do you see the future for voice actors? Many are worried about losing their jobs.

Christine Lindeloo: I think, right now, with text-to-speech, we’re not quite at the point where we can manipulate it to sound exactly like a human interpreting and internalizing a text. Voice actors are still needed, especially for speech-to-speech work or to stay deeply involved in the process, as is our approach.
When we build a model based on a specific actor, we work closely with them. We have sessions where we ask if the generated output sounds recognizable to them and adjust accordingly. We do more recording sessions to refine it further.
But the shift to using more AI voices is already happening.
Dag van de Audio: You’re known for being state-of-the-art. What’s the next step for you in the coming months or years?

Christine Lindeloo: For us, the priority is offering as many languages as possible. We’re working hard on developing various languages with a focus on quality. That’s why we’ve chosen to collaborate closely with voice actors, getting lots of feedback to improve the quality of the output.
We believe that, generally, models will operate faster and faster, making them easier to use, and the applications will continue to grow. You’ll see more AI voices in content where it’s appropriate, like making websites more accessible by reading out content.
Dag van de Audio: Thank you, Christine, for joining us in the studio for the Dag van de Audio.
Watch the interview in Dutch: