Simultaneous Speech Translation: From a Sentence-Based to a Context-Based Approach

Speech translation is shifting from the traditional approach of translating single, isolated sentences to the processing of larger segments of speech.
This enhances translation accuracy both semantically (ensuring consistency in word choice across different parts of the speech) and grammatically (maintaining agreement in elements like gender, tense, or number across sentences).
In simultaneous speech translation, achieving this level of accuracy is particularly challenging because the process occurs in real time, with only partial information about the speaker’s full message.

Progress is an incremental process—sometimes with big, dramatic leaps, and other times with painstakingly small, almost invisible steps. In speech translation, the ultimate goal is clear: creating a system capable of accurately translating across languages and cultures, capturing not just words but also their intended meaning, while seamlessly adapting to the communicative context. But let’s face it—machine interpreting has had (and still has) its fair share of growing pains.

One of the biggest challenges has been inherited from its cousin, machine translation: the inability to consider co-text (not to be confused with context—more on that in this blog post). Co-text refers to the surrounding text, the linguistic breadcrumbs that help make sense of meaning. For years, machine translations were processed sentence by sentence, blissfully ignoring what came before and after a given sentence. The result? Translations that lacked continuity and often missed the mark.

Why Co-Text Matters: An Everyday Example

Let’s break it down with a simple example. Translating from English (which does not mark gender on adjectives) into Spanish or Italian (which do) illustrates the problem clearly. Take these two sentences:

She is a good doctor. // Very good.

When these two sentences were translated in isolation, as was the norm until recently, the system had no trouble with the first “good”: the sentence contains “she”, which dictates the gender agreement of the adjective, and “doctor”, which determines the lexical choice for that adjective in Italian or Spanish. But it had no idea how to handle “good” in the second sentence, since nothing from the previous sentence was available while translating it. Without this co-text (the first sentence), the system would most likely have defaulted to the masculine form “molto bravO”, picking the O simply because of statistical patterns or bias. Even worse, it might have picked not only the wrong gender but also a different adjective, “molto buonO”, which implies the doctor is tasty rather than capable. With the co-text available, which in speech translation means keeping a memory of what came before, the second sentence is translated correctly: “molto bravA”.

Translating sentence by sentence was like watching a movie where every other scene is missing. You would most certainly get the gist, but the experience was frustrating, to say the least, and for some it could easily become offensive. While the example above is of course very simple, I think it makes clear what we are talking about.

The Turning Point: Co-Text Awareness in Speech Translation

Here’s the good news: machine translation, and by extension speech translation, has come a long way in recent months. Thanks to advances in Large Language Models and improved Neural Machine Translation models, systems can now take co-text into account to make better choices. This means they’re no longer translating in a vacuum, segment by segment, but are leveraging what’s been said so far to improve accuracy.

For speech translation, this is no small feat. Unlike written translation, which can look both back and ahead at the co-text (what comes before and after the current segment), simultaneous speech translation can only use what came before, since it processes chunks of spoken language as they’re uttered, often before a full sentence is complete. And it has to do so in real time, under all the computational and linguistic constraints that this implies.

But here’s what’s remarkable: modern systems can now handle co-text with surprising precision. This is what we recently introduced in KUDO AI. It took a while to put into production a system that can “look back” at what has been said before to improve the translation of what is currently being said. To the best of my knowledge, this is the only simultaneous speech-to-speech translation system without rewriting mechanisms that performs this kind of look-back.

Here is a simple yet clear example. Let’s translate three variations of the same sentence. Check the video with the translation into Italian to see the real-time translation produced by the system while the input was unfolding. This is the order in which I pronounced them:

He is a very good professor working all day long I tell you very very good.
She is a very good professor working all day long I tell you very very good.
They are very good professors working all day long I tell you very very good.

With co-text awareness, the system consistently produces correct translations. It recognizes the gender or plurality in the earlier segment (“he,” “she,” or “they”) and adjusts the adjective in later parts of the translation accordingly. For example, it knows to translate “very good” as “molto bravA” when referring to “she” and “molto bravO” for “he.” This kind of continuity, basic as it may sound, is a game-changer.

For the full context, remember that the system does not wait for the entire phrase before translating: it translates it bit by bit through segmentation, without rewriting the output (a technique typically used in speech-to-text translation). In this example, it splits each phrase into three segments, which are translated one after the other. The improvement is that it now keeps the preceding segment in memory, and this influences the translation.
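To make the mechanism concrete, here is a minimal Python sketch of segment-by-segment translation with a look-back memory, using the doctor example from earlier. It is a toy under stated assumptions, not the KUDO AI implementation: the names toy_translate, translate_stream, FORMS and the window parameter are all hypothetical, and the "translator" only handles the adjective "very good" so that the effect of co-text is easy to see.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List

# Toy lookup from an English pronoun to the agreeing Italian form of "very good".
# Illustrative only: a real system would use an NMT model or an LLM here.
FORMS = {"he": "molto bravo", "she": "molto brava", "they": "molto bravi"}
DEFAULT_FORM = "molto bravo"  # stands in for the statistically biased default


def toy_translate(segment: str, co_text: str) -> str:
    """Translate only 'very good', agreeing with the latest pronoun in view."""
    words = f"{co_text} {segment}".lower().replace(".", " ").split()
    pronouns = [w for w in words if w in FORMS]
    form = FORMS[pronouns[-1]] if pronouns else DEFAULT_FORM
    return segment.lower().replace("very good", form)


def translate_stream(
    segments: Iterable[str],
    translate: Callable[[str, str], str] = toy_translate,
    window: int = 1,
) -> List[str]:
    """Translate segments in arrival order, never revising earlier output."""
    # Assumption: memory holds the last `window` source segments (look-back only).
    memory: Deque[str] = deque(maxlen=window)
    output: List[str] = []
    for segment in segments:
        co_text = " ".join(memory)
        output.append(translate(segment, co_text))  # committed immediately
        memory.append(segment)
    return output


segments = ["She is a good doctor.", "Very good."]
print(translate_stream(segments, window=0))  # second segment sees no co-text
print(translate_stream(segments, window=1))  # second segment sees the first
```

With window=0 the second segment falls back to the masculine default, which is the sentence-by-sentence behaviour described earlier; with window=1 it picks up “she” from the preceding segment. In this toy, a pronoun that sits two segments back would need a larger window, which hints at why non-adjacent dependencies are harder.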

The Challenges Ahead

Of course, not everything is perfect. Far from it. For each challenge you solve, many others still need to be tackled. And simultaneous speech translation still faces a mountain of challenges, from capturing nuanced cultural references to handling ambiguous language. Many of these hurdles can be addressed with existing technologies, while others, like true mastery of real-world conversational dynamics, will take more time to solve and will require new technological breakthroughs.

This is also true for the very small problem discussed here. Sometimes it is only partly solved, since language in reality is even more complex than a simple example can show. The example above is indeed simple. But in many other cases there are further questions that make choices difficult. Think of grammatical choices that require complex disambiguation (understanding the semantic dependencies between parts of the speech, often in non-adjacent segments), as well as the need to consider pragmatics and cultural, linguistic, or political choices. In the case of gender, for example, there are questions that go beyond grammatical agreement, such as the use of gender-neutral language.

Still, the progress we’ve seen is exciting. The ability to integrate co-text into real-time translations marks a significant leap forward, transforming bit by bit what was once a clunky, error-prone system into something increasingly sophisticated and reliable. It’s a thrilling time to witness the evolution of machine interpreting, even if the road ahead remains bumpy. Just consider that I am here celebrating such a small (and, for humans, banal) improvement!
