Gen-AI in Audio Storytelling — What's the Use?
22 July 2024
The release of ChatGPT in late 2022 blew the lid off generative AI for the general population, but its use for content, particularly audio production, has been more like a slow cooker: simmering away for many years, and perhaps needing more time before it can be widely served.
From the EBU’s Technology and Innovation Department, Senior Project Manager Ben Poor shares his reflections.
While no stranger to the world of generative AI, I can’t claim to be an expert. In the world of R&D, my focus has always been on D rather than R—how best to implement and apply new technologies.
Readers may be familiar with the EBU’s EuroVOX project, which has a similar motivation: putting evolving AI-based language tools at the service of content producers within their daily workflows.
While I have many colleagues in a similar position, my perspective is more on radio and audio. So when I was asked to talk about generative AI at the 50th EBU Audio Storytelling Festival (heir to the International Features Conference), it was an opportunity to embrace the topic more broadly.
The brief was to predict what impact AI might have on audio storytelling and to discuss the opportunities, although in many respects it is already becoming mainstream. An AI wave swept over the industry in 2023, with every sign that it will continue through 2024 and beyond.
ChatGPT cemented Large Language Models (LLMs) in the public consciousness. These are models built by hoovering up vast oceans of publicly available text (e.g. websites, forums, documents) and running it through a process that learns the statistical relationships between words. With sufficient data and the right instructions, such a model can then generate well-structured text, as we see in ChatGPT.
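To make “statistical relationships between words” slightly more concrete, here is a deliberately toy sketch in Python: it simply counts which word tends to follow which in a tiny corpus and then samples from those counts. A real LLM works on tokens with neural networks at vastly greater scale, but the underlying intuition of predicting a likely next word is similar.

```python
import random
from collections import Counter, defaultdict

# A toy corpus standing in for the "oceans of text" a real model is trained on.
corpus = "the dog runs on the beach and the dog sleeps on the sofa".split()

# Count which word follows which: a crude stand-in for the statistical
# relationships an LLM learns at vastly greater scale.
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def generate(start, length=8):
    """Generate text by repeatedly sampling a likely next word."""
    words = [start]
    for _ in range(length):
        options = follows.get(words[-1])
        if not options:
            break
        choices, weights = zip(*options.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the"))  # e.g. "the dog runs on the beach and the dog"
```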
The same techniques can be applied to images and audio.
Very simply, with training on enough content and guidance on that content’s characteristics and parameters, it’s possible to create Large Models for audio, images, and even video.
Indeed, the latest version of the GPT model that underpins ChatGPT is considered ‘multimodal’, because it can understand text and images.
GPT is currently in its fourth iteration since its creation in 2018, but the explosion in interest and usage can be traced to it, and other models, becoming far more easily accessible, sparking a vibrant ecosystem of experimentation and innovation.
That’s why the starting point of my presentation was to look at the latest in the world of ‘consumer generative AI’, and how we can usefully apply it to audio storytelling.
To tell this story I needed a sympathetic subject, so I chose one of my two boxers, Nelly, who is blind. Two years ago, we adopted her from a rescue association, where she had been abandoned because of her disability.
I wanted to set the scene by telling her origin story and how she had found a happy home with our family. Having recently returned from a trip that included her first beach experience (which she loved), I decided to include that detail, and the fact that she has entirely taken over my sofa (as is unavoidable with Boxer dogs).
To generate the story, I turned to the latest version of ChatGPT, using some smart prompting techniques I’d learned from a recent EBU Academy course led by the amazing Mark Egan as part of their School of AI.
My prompt was:
Write me a 3-part story over 9 paragraphs, detailing the adventures of a female boxer dog called Nelly. She is 6 years old and has been adopted by a family who wanted a friend for their other dog, called Popcorn. Nelly is blind and was going to be euthanized before she was adopted by her new family. Nelly likes to sleep and eat and play with her new friend Popcorn. She loves to meet new people, and when she is happy she will wiggle her bum around, doing her special dance.
The first part of the story should talk about her starting a new life with her new adopted family, and some of her first reactions about being in a new home. It should talk about some of the challenges she has being blind, and often bumping into things.
The second part of the story should talk about the day she went to the beach, and for the first time felt free to run around and jump in the air. She was very happy and had a great time digging in the sand.
The third part of the story should talk about her being happy, sleeping on the sofa. She is tired after running around so much on the beach but is happily snoring next to her family. She dreams of being on the beach again.
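For the talk I used ChatGPT’s web interface directly, but the same step could be scripted. The sketch below assumes OpenAI’s Python client and an API key in the environment; the model name is a placeholder and the prompt is abbreviated, the full text being the prompt above.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The full story prompt from above, abbreviated here for readability.
story_prompt = (
    "Write me a 3-part story over 9 paragraphs, detailing the adventures "
    "of a female boxer dog called Nelly. ..."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any capable chat model would do
    messages=[{"role": "user", "content": story_prompt}],
)

print(response.choices[0].message.content)
```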
The results were impressive in terms of structure, though hardly a creative masterpiece, and it incorrectly assumed that my other dog was a Labrador. These outcomes are unsurprising: the statistical basis of the generation means the output feels a bit… average, and such models are also too adept at filling in missing information, hallucinating, and being ‘confidently incorrect’.
Now that I had my story, I needed the audio, so I turned to something more familiar: voice cloning.
One aspect of EuroVOX is that we’re able to take an audio clip with multiple speakers and effectively make them speak different languages. This chains together several processes, including transcription, translation, and voice synthesis, dynamically capturing each speaker’s vocal patterns so that translated phrases are synthesized in their own voice.
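As a rough illustration of that chain (and emphatically not the actual EuroVOX API), the sketch below uses the open-source Whisper library for the transcription step, while translate_text and synthesize_cloned are hypothetical placeholders for whichever translation and voice-cloning services are plugged in.

```python
import whisper  # open-source speech recognition: pip install openai-whisper

def translate_text(text, target_lang):
    """Hypothetical placeholder: call a translation service of your choice."""
    raise NotImplementedError

def synthesize_cloned(text, reference_audio):
    """Hypothetical placeholder: call a voice-cloning TTS service of your choice."""
    raise NotImplementedError

def revoice(audio_path, target_lang):
    """Transcribe a clip, translate each segment, then re-voice it in the speaker's own voice."""
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    clips = []
    for segment in result["segments"]:
        translated = translate_text(segment["text"], target_lang)
        clips.append(synthesize_cloned(translated, reference_audio=audio_path))
    return clips
```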
In this case, I cloned the speech patterns of my colleague Dr David Wood, which produced quite outstanding results that perfectly suited the content. I honestly believe he could have a prosperous future career as a voice artist for children’s books.
The final stage was to add some audio production in the form of introductory and interchapter music. Lalya Gaye, of the EBU’s AI and Data Initiative, directed me towards a host of resources for AI music generation. I hadn’t known the technology was so mature, but using tools such as Udio and Suno I was able to generate realistic music in different languages and styles, complete with lyrics that told the story.
I then built a presentation using a tool that creates PowerPoint slides from prompts, describing the process and including image assets, again using generative AI.
Not a bad afternoon’s work.
In the event, my presentation went smoothly (watch the video here), even if watching the other polished submissions hammered home that the process of storytelling is far from a mathematical calculation.
When I’d finished, one concerned audience member asked, ‘Won’t this all put us out of our jobs?’ I regret that my impromptu reply may have been unconvincing, but the truth is I honestly don’t believe it will.
What I put together was perhaps a good first draft, a sketch pad, which is how such techniques have long been used. Asking a model to generate some first ideas can be a useful jumping-off point for a more refined creative process that a human then takes forward. Synthetic voices can be really convincing, and allowing anyone to put a voice to their text can be invaluable: hearing it before entering the studio may save time and money.
But these tools can’t yet replace the art of storytelling.
As a coda, I’m taking some of these learnings forward, using similar techniques to automate the production of multilingual news bulletins, which have potential uses for smart speakers and connected cars.
The key difference is that the story is not generated from scratch; instead, the EBU’s News Pilot platform aggregates and synthesises summaries based on real news contributed and verified by the EBU membership. The fact that these can then be voiced synthetically in dozens of languages within minutes could help our Members reach new platforms without having to invest significant resources.
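To give a sense of the shape of that workflow (with every function name below being a hypothetical stand-in rather than an actual News Pilot interface), the flow is roughly: fetch verified items, build a per-language bulletin script, then synthesize it.

```python
LANGUAGES = ["en", "fr", "de", "es"]  # in practice, dozens of languages

def fetch_verified_items():
    """Hypothetical placeholder: pull verified news summaries from the aggregation platform."""
    raise NotImplementedError

def summarise(item, lang):
    """Hypothetical placeholder: produce a short bulletin-ready summary in the target language."""
    raise NotImplementedError

def synthesize_speech(script, lang):
    """Hypothetical placeholder: render the bulletin script as audio with a synthetic voice."""
    raise NotImplementedError

def build_bulletins():
    """Assemble and voice a short bulletin for each target language."""
    items = fetch_verified_items()
    bulletins = {}
    for lang in LANGUAGES:
        script = "\n\n".join(summarise(item, lang) for item in items)
        bulletins[lang] = synthesize_speech(script, lang)
    return bulletins
```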