Today, we’ve made the integration of a Large Language Model (LLM) into KUDO AI Simultaneous Speech Translator generally available. To the best of my knowledge, this marks the first instance of Generative AI being incorporated into a real-life speech translation system1.
When I began experimenting with LLMs in speech translation, I was immediately struck by their proficiency in addressing typical challenges associated with spoken language. These challenges include mispronunciations (surfacing as mistranscriptions), inconsistencies, agrammaticalities, disfluencies, and convoluted syntax, to name just a few. While these issues may seem superficial, they greatly influence the real-world usability of a translation system. LLMs’ success in addressing these challenges inside a cascading system can be attributed to their remarkable capability to process language, which – to me – seems to closely mirror some intrinsic features of human translation.
I am referring to the observations laid out in Mona Baker’s seminal 1993 paper on Translation Universals. Among others, Baker points out that human translation displays:
- a marked rise in the level of explicitness
- a tendency towards disambiguation and simplification
- a strong preference for conventional ‘grammaticality’
- a tendency to avoid repetitions
Although academic discussions around these universals have waned, my long-standing interest in Translation Universals revived as I noticed that Large Language Models manifest these very same features, although not explicitly in translation. While they are typically considered limitations of LLMs, I began to view them as potential features: something to use in a speech translation system to elevate its quality.
In practical human interpretation, the features listed above frequently surface as tendencies to finish incomplete sentences, make ungrammatical statements coherent, and exclude false starts or self-corrections, among other refinements. The explicit and controlled introduction of some of these elements into a speech translation system seems to make sense. Framing it as an attempt to artificially recreate the Translation Universals as introduced by Baker seems to me innovative, something that has never been explored before. To be fair, some attempts in this direction have been made in the past, but the framing was rather computational and had no direct link to translation. The work of Papi et al. (2021) serves as a prime example; it addresses transcription issues (which are not solely technical, as suggested in the paper) using a machine translation system trained on (synthetic) data created to simulate such issues and therefore correct them.
We can now achieve this type of ‘normalization’ (in Baker’s words, a “levelling out” of the text) with remarkably, even surprisingly, high quality by leveraging the power of LLMs. The results are important, contributing to a significant enhancement in the translation experience, which we’ve quantified as a 25-30 percentage point improvement, depending on the language. For context, even a 3-5 percentage point improvement is distinctly noticeable to users.
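To make the idea of a normalize-then-translate cascade concrete, here is a minimal sketch. All names are hypothetical, and a toy rule-based cleaner stands in for the LLM stage (a real deployment would send the prompt below, or something like it, to a language model); this illustrates the architecture, not the production system.

```python
import re

# Hypothetical prompt for the LLM-based normalization stage; the wording
# here is an illustration, not the actual production prompt.
NORMALIZE_PROMPT = (
    "Rewrite the following speech transcript so it is fluent and grammatical. "
    "Remove fillers, false starts, and self-corrections, but preserve the "
    "speaker's meaning:\n\n{transcript}"
)

# Single-word fillers the toy cleaner strips out.
FILLERS = {"uh", "um", "er"}

def rule_based_normalize(transcript: str) -> str:
    """Toy stand-in for the LLM stage: drop fillers and collapse immediate
    word repetitions (a crude form of disfluency removal)."""
    words = [w for w in transcript.split() if w.lower().strip(",.") not in FILLERS]
    cleaned = []
    for w in words:
        # Skip a word that merely repeats the previous one ("we we" -> "we").
        if not cleaned or cleaned[-1].lower() != w.lower():
            cleaned.append(w)
    return " ".join(cleaned)

def translate(text: str) -> str:
    """Placeholder for the downstream MT stage of the cascade."""
    return text  # a real system would call a machine translation model here

def cascade(transcript: str) -> str:
    # Normalize first, then translate: disfluencies removed at this stage
    # can no longer propagate into the translation.
    return translate(rule_based_normalize(transcript))
```

For example, `cascade("I uh I think we we should leave")` yields `"I think we should leave"`: the false start and the filler are levelled out before the text ever reaches the translation model.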
It’s essential to note, however, that while Large Language Models are very powerful tools, they can be unpredictable, as unpredictable as language itself. Managing their behavior requires a lot of work to prevent well-documented hallucinations and other more subtle, undesirable outcomes, such as unwanted biases. Yet, the effort is undoubtedly worthwhile. With this first step in foundational progress, I am now aiming even higher. As I discussed in my post about Situational Awareness in Machine Interpreting, my next goal is to anchor translations in a nuanced understanding of the speaker’s intent, capturing tone, underlying meanings, and beyond. Encouragingly, and notwithstanding the obvious limitations of current systems, LLM technology is well-equipped to partially answer this challenge.
Humanities are back: A side note on expertise
For applications that aspire to meet real-world demands, it’s not merely about outperforming benchmarks but about meeting genuine user expectations in actual use cases. The balance of expertise required to bring these systems to the next level is shifting once again. Implementing innovative NLP architectures necessitates a deep understanding of translation, multilingual communication, and user expectations. After an extended period where the focus drifted away from linguistic knowledge to an emphasis on data, statistics, and technical frameworks, we are now witnessing a renaissance of translational expertise. This is a super exciting moment to innovate in the language space. Nevertheless, let’s face it: this might not last long. Technological development is cyclical, and it is safe to anticipate that a time will come again when data and frameworks alone suffice to achieve the pinnacle of language technologies and speech translation.