From Research to Reality: A Practical Guide to Machine Translation in Customer Support

We live in an age where a pocket-sized device can seemingly translate between Swahili and Swedish in real-time. It feels like Douglas Adams' Babel Fish has finally arrived, a tiny technological marvel wriggling its way into our digital lives. But anyone who's used Google Translate, Microsoft Translator, or any of their AI-powered cousins knows the reality is messier. The fish, it turns out, sometimes has hiccups.

I saw these hiccups up close. A few years ago, I was at Google, leading the global rollout of machine translation for customer support. We were staring at millions in potential cost savings, a revolution in how global businesses interact with their customers. But, before uncorking any champagne, we needed to confront a core truth: AI translation, for all its impressive feats, remains charmingly, frustratingly imperfect.

I wanted to dive deep into the research and emerge with insights that would guide practical implementations of machine translation. Could an AI chatbot truly understand a customer's furious complaint about a delayed shipment in Urdu? Could it discern a polite refund request from a sarcastic jab in Japanese? The short answer, as with most things in life, and especially in tech, is a resounding: "It depends."

To answer the longer, more complex question – the one that kept stakeholders up at night – I wrote a document. This wasn't a glossy marketing piece. It was a deep dive into the guts of machine translation, designed specifically for non-technical readers. At the time, it seemed to bridge the gap between my engineering team and the folks controlling the budget.

This document, a significantly redacted version of which I'm sharing here, unblocked our early launches. It delivered millions in cost savings – but not without a fight. We battled stakeholder resistance, customer service agent skepticism, and the widespread misconceptions about what machine translation could and could not do.

The core of the document, and what made it so effective, was its unflinching honesty. It didn't promise perfection; it promised understanding. It laid bare the technology's quirks – the biases, the occasional blunders, the inherent unpredictability of generative AI in a production environment. And that honesty, paradoxically, built trust.

By exploring why machine translation models still make errors, I realized that launching this product wasn't just about deploying cutting-edge algorithms. It required a holistic system: AI guardrails (which my team built), human-in-the-loop processes, and a commitment to continuous improvement after launch. These are standard practices today, but they were less common back then. This document helped pioneer that shift.

So, what can you use this for? Think of it as a preemptive strike against confusion. It's a tool to supplement your Product Requirements Documents (PRDs), to save precious meeting time by answering repetitive questions before they're asked. Pair it with a concrete risk mitigation plan – outlining proactive monitoring, human intervention protocols, and clear feedback channels – and you'll not only reassure stakeholders but also ensure a smoother product rollout. It's about managing expectations, not just technology. Because the Babel Fish, while increasingly fluent, still needs a little help from its friends.


Understanding Translation Quality

The following is a brief overview and analysis of existing research on the translation quality of the NLP technology behind systems like Microsoft Translator and Google Translate. It is meant to inform stakeholders about what to expect for translation quality when using ML models in production use cases such as live 1:1 chat support.

Contents

  • At a Glance

  • Deep Dive

    • Breaking Down Translation Errors

    • Types of Bias in Machine Translation

      • Gender Bias

      • Informality Bias

    • Drivers of Machine Translation Quality in Production: Channel, Human Error, Languages

      • How Channel Impacts Quality

      • Human-caused translation errors

      • Reasons for Shortcomings in Specific Languages

    • Why Existing Metrics Don’t Work for Enterprise Use Cases

    • What’s Next in Research

      • Low-resource language translation quality with multilingual models

At a Glance

  1. How far are we from perfect translations by machines? 

    1. Machine translation does not aim for “perfect” translations; a “perfect” translation would require always understanding cultural nuances and multi-word expressions like idioms, which are unique to each culture. 

    2. In side-by-side evaluations of human vs. machine translation, human translators do not achieve a “perfect” score of 6/6 either. 

    3. Rather than striving for a “perfect” translation, which even human translators have not achieved, machine translation aims to produce output that is as comprehensible as possible, preserves the original meaning as much as possible, and ideally has correct spelling, punctuation, and grammar. 

  2. What risks does launching live machine translation (e.g. in customer support channels) pose for our company, and how severe are they? 

    1. Errors vary in severity and frequency, generally falling into the following categories: Punctuation, Grammar, Spelling, Cultural Taboo, and Professionalism. 

    2. Granular translation quality metrics will be assessed by bilingual experts who review audited messages from live traffic. 

    3. Evaluations show that messages containing the above errors still generally pass the Comprehensibility threshold for most languages. For all languages, we have outlined a rigorous testing plan & human-in-the-loop (HITL) process <insert links here> for ensuring quality at launch. 

  3. Why is translation better for some languages than others and what are we doing about it? 

    1. In short, the availability of web text varies by language. 

    2. English text on the web, which goes into training machine translation models, exists in the petabytes. Low-resource languages, in contrast, tend to have orders of magnitude less data available. 

    3. Achieving consistent quality across languages in a single underlying model is an active area of research today. In the meantime, we have implemented AI guardrails <insert links here> to ensure quality at launch. 



Deep Dive 

Breaking Down Translation Errors: 

There are five different types of sentence-level translation errors. Note that BLEU, a long-standing yet outdated standard machine translation quality metric, does not identify which of the following errors are made (see the short illustration after the lists below):  

  • Named entity errors

  • Numerical errors

  • Meaning errors

  • Insertion of content

  • Content missing 


Expanding these further, common translation errors include:  

  • Vocabulary / Terminology

  • Literal rendition of common idioms

  • Formal / Informal style

  • Overly long sentences

  • Single word errors, errors of relation, structural/informational errors

  • Incorrect verb forms, tense

  • Translating a word the same way regardless of context

  • Grammar and syntax errors 

  • Punctuation errors

  • Omissions / additions

  • Compound words translated as individual words

  • Machine neologisms
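
To make the BLEU limitation above concrete, here is a minimal Python sketch – assuming the open-source sacrebleu package and using hypothetical example sentences – showing that BLEU returns a single n-gram overlap number and says nothing about which category of error a translation contains:

  # Minimal sketch: BLEU yields one overlap score, not an error breakdown.
  # Assumes the open-source `sacrebleu` package (pip install sacrebleu).
  import sacrebleu

  reference = ["Your refund of 50 dollars will arrive by Friday."]

  # Two hypothetical outputs with very different error types:
  numeric_error = "Your refund of 15 dollars will arrive by Friday."      # numerical error
  meaning_error = "Your refund of 50 dollars will not arrive by Friday."  # meaning error

  for label, hypothesis in [("numerical error", numeric_error),
                            ("meaning error", meaning_error)]:
      bleu = sacrebleu.sentence_bleu(hypothesis, reference)
      # Both hypotheses keep most of the reference's n-grams, so both score
      # reasonably well, yet BLEU gives no hint of which error type occurred.
      print(f"{label}: BLEU = {bleu.score:.1f}")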


Human analysis of the [redacted] translation model used for webpage translation <link redacted> showed the following breakdown of errors found by human evaluators: 

  • Lexical Choice/Multi-word expression error (X%)

  • Lexical Choice/Other wrong word (Y%)

  • Lexical Choice/Named entity error (Z%)

  • Lexical Choice/Archaic word choice (T%)

  • Reordering Error (W%) 


How does one interpret this for product launch readiness? X% of the errors made by the machine were on multi-word expressions – often these errors are caused by idioms that exist only in the source language. The question now is: how risky are these errors in a customer support conversation? How often do agents use idioms, such as “let’s get the ball rolling,” with customers who perhaps want a refund? 


To assess how serious these expected mistakes might be, one could analyze existing chat transcripts to find the frequency of idioms and multi-word expressions. We understand this may differ per product, per language, and per agent. Still, conventional wisdom and a quick glance through case transcript logs for chat suggest the rate of occurrence of idioms and difficult-to-translate words from English (what the human agent will say) is quite low. 
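
As a minimal Python sketch of that analysis – the idiom list, file name, and one-message-per-line transcript format below are hypothetical – one could scan agent-side chat transcripts and count how often known idioms and multi-word expressions appear:

  # Minimal sketch: estimate how often agents use idioms in chat transcripts.
  # The idiom list, file name, and transcript format are hypothetical.
  import re

  IDIOMS = [
      "get the ball rolling", "touch base", "in the loop",
      "bear with me", "on the same page", "circle back",
  ]

  def idiom_rate(path: str) -> float:
      """Return idiom occurrences per agent message in a transcript file."""
      messages, hits = 0, 0
      with open(path, encoding="utf-8") as f:
          for line in f:  # one agent message per line (assumed format)
              text = line.strip().lower()
              if not text:
                  continue
              messages += 1
              hits += sum(len(re.findall(re.escape(idiom), text)) for idiom in IDIOMS)
      return hits / messages if messages else 0.0

  print(f"Idioms per agent message: {idiom_rate('agent_chat_transcripts.txt'):.3f}")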


Types of Bias in Machine Translation


Gender Bias 

The field of machine translation has made significant progress on gender-related translation issues, but this is still an active area of research. While we don’t anticipate this being a severe problem in Chat conversations between users and agents, it’s important to highlight that MT trained on a large text corpus scraped off the web carries biases, as we’ll go over in the next section. 


Informality Bias

Often, in languages besides English, the formal “you” is preferred for business communications. It is likely that, if using <Model Names>, the translation will switch between the formal and informal “you” within the same conversation. This is a drawback for our company when trying to uphold brand guidelines and professionalism in agent communications, which matters more in some regions and markets than in others. 
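
As a minimal Python sketch of a formality-consistency guardrail – simplified to Spanish second-person pronouns; a production check would also need morphological analysis of verb forms – one could flag conversations whose translated agent messages mix the informal and formal “you”:

  # Minimal sketch: flag a switch between formal and informal "you" in Spanish
  # output. A simplistic pronoun heuristic; real guardrails would also analyze
  # verb conjugations.
  import re

  INFORMAL = re.compile(r"\b(tú|te|ti|contigo)\b", re.IGNORECASE)
  FORMAL = re.compile(r"\busted(es)?\b", re.IGNORECASE)

  def mixes_register(translated_messages: list[str]) -> bool:
      """Return True if both informal and formal address appear in one conversation."""
      informal = any(INFORMAL.search(m) for m in translated_messages)
      formal = any(FORMAL.search(m) for m in translated_messages)
      return informal and formal

  conversation = [
      "¿Usted puede confirmar su dirección de correo?",  # formal
      "Gracias, te enviaré el reembolso hoy.",           # informal -- register switch
  ]
  print("Register switch detected:", mixes_register(conversation))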


Not only do we face the challenge of accurately translating a short phrase that may be missing context, we must also be aware of the shortcomings of <Model Name> when used in business communications. At other times, short phrases used in a business context can mean entirely different things than in a consumer setting, as in the example below: 


[Image Redacted] 

Above, “Get a Quote” was translated to “Speak a proverb” rather than the intended commercial meaning


[Section shortened] 




Drivers of Machine Translation Quality in Production: Channel, Human Error, Languages


How Channel Impacts Quality 


Machine translation tends to perform better when given more context clues and longer inputs. Email exchanges in support, which typically comprise longer-form text, give the model more context clues and therefore tend to yield higher-quality translations. 


We therefore expect email translation quality to be higher than chat translation quality when using <Model Name> in the backend. 
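
One mitigation for chat, sketched below in Python under the assumption of a generic translate() call (a stand-in, not a specific vendor API), is to prepend the last few conversation turns as context when translating each new message:

  # Minimal sketch: give the translator more context for short chat messages by
  # prepending the last few turns. `translate` is a stand-in for whatever MT
  # client the backend actually uses, not a specific vendor API.
  from typing import List

  def translate(text: str, source: str, target: str) -> str:
      raise NotImplementedError("plug in the production MT client here")

  def translate_with_context(history: List[str], new_message: str,
                             source: str = "en", target: str = "es",
                             window: int = 3) -> str:
      """Translate new_message, passing the last `window` turns as extra context."""
      context = "\n".join(history[-window:])
      combined = f"{context}\n{new_message}" if context else new_message
      translated = translate(combined, source, target)
      # Keep only the last line, i.e. the translation of the new message.
      # (Assumes the model preserves line breaks; a real system may need
      # sentence alignment instead.)
      return translated.splitlines()[-1]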


Human-caused translation errors 

Finally, another aspect that may further decrease translation quality, or its comprehensibility to the agent or user, is human error: low-quality source messages. These include spelling errors and typos that form valid words in, say, English but are not what the user or agent meant to say. For example, in the following image, the user types “shows and error” when they meant to say “shows an error”. 


[Image Redacted] 

A sample chat transcript from an example support ticket


A Spanish translation of this sentence contains “y” (and) where the intended meaning calls for “un” (an). Technically, the translation is accurate in rendering “and” (EN) as “y” (ES). However, the reader on the other end may find it more jarring, because confusing “y” with “un” is far less likely in Spanish than typing “and” instead of “an” is in English, where the two words are similar and this kind of misspelling is frequent.
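
As a minimal Python sketch of a pre-translation cleanup step – the correction table is hypothetical and deliberately tiny; a production guardrail would use a context-aware spell checker – common real-word typos can be normalized before the message is sent to the translator:

  # Minimal sketch: normalize a few common real-word typos before translation.
  # The correction table is hypothetical; a real guardrail would rely on a
  # context-aware spell checker rather than a hand-written list.
  import re

  COMMON_TYPOS = {
      r"\bshows and error\b": "shows an error",
      r"\bgot and email\b": "got an email",
      r"\bi cant\b": "I can't",
  }

  def clean_source(text: str) -> str:
      """Apply known real-word typo fixes to a source message before MT."""
      for pattern, replacement in COMMON_TYPOS.items():
          text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
      return text

  print(clean_source("The page shows and error when I log in"))
  # -> "The page shows an error when I log in"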


Reasons for Shortcomings in Specific Languages 

The simplest and most obvious reason for shortcomings is the lack of available training data. As shown below, despite X TB of Hindi text being available from web scraping, that is not nearly as much as is available for English (Y PB) or Japanese (Z PB). 


Other sources state that Vietnamese, Swahili, Hindi, Thai, Urdu, Hawaiian, Yoruba, Sindhi, Bengali, and others are spoken by large populations but have far less written text on the web (the primary training data for ML models). On the other hand, languages such as German, English, Chinese, Spanish, French, Japanese, and other European and Western languages are high-resource. 


Some major improvements to multilingual models have happened language by language (or in batches of languages) – for instance, when correcting gendered translation errors (“he is a doctor” vs “she is a doctor”), [section redacted]. 


Even in the Cloud Translation services offered by companies like Microsoft, AWS, and Google, certain features are limited to certain languages. Enterprises and SMBs can use machine translation to detect and translate more than one hundred languages, from Afrikaans to Zulu. However, custom models can only be built for Y language pairs – a significantly smaller subset of what is available to consumers. 
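
For illustration, here is a minimal Python sketch – assuming the google-cloud-translate v2 client with credentials already configured; exact field names may vary by version – that lists the languages the consumer-grade service supports and translates a short phrase:

  # Minimal sketch: list supported languages and translate a short phrase with
  # a cloud translation service. Assumes the google-cloud-translate v2 client
  # and configured credentials; exact response fields may vary.
  from google.cloud import translate_v2 as translate

  client = translate.Client()

  languages = client.get_languages()  # 100+ languages, Afrikaans to Zulu
  print(f"{len(languages)} supported languages, e.g. {languages[0]['name']}")

  result = client.translate("Get a quote", target_language="hi")
  print(result["translatedText"])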


[Image Redacted] 

The above figure shows a much lower volume available for <language> corpus used as training data 



In the illustration below, the amount of data available for low-resource versus high-resource languages varies by orders of magnitude. Accordingly, translation quality suffers for low-resource languages not just in chat but across the translation system used in production. 


[Image Redacted] 

The Machine Translation Landscape: Quality (BLEU) vs Language Pair Resource Size (source redacted) 



While research is still working towards a single multilingual model that covers ZZ languages, domains, dialects, registers, styles, and so on, we can anticipate weaker translation quality for low-resource languages. 


For <Company>’s support channels, one key dimension to consider is volume. This language prioritization document <insert link> shows contact volume per language – note that the highest-volume languages tend to be high-resource languages, for which <Model Name> may suffice. 

 


Why Existing Metrics Don’t Work for Enterprise Use Cases 

Consumer-facing machine translation apps like <Model Name> were trained on large amounts of text scraped off the web because of its availability and low cost. However, this poses problems for business communications, which require more formal language. 


There is a discrepancy between the quality metrics used to evaluate translation models for consumer tools and what matters for business communications. BLEU is an automatic score that serves as the primary accuracy metric in research settings. The score is based on n-gram similarity to a reference (the “correct” human-generated translation), which can fail to capture syntactic and semantic equivalence. 


In business communications, BLEU is not enough. Further quality evaluation must be done, often manually by human agents trained to answer questions about Google products, in order to ensure readiness for launch in support. This is typically called Human SxS (side-by-side) evaluation. Research <insert link> shows that there is a discrepancy between human-evaluated quality and machine-evaluated quality (BLEU). Human SxS, unfortunately, is expensive and takes much longer than automatically calculated metrics like BLEU or BLEURT. 


In addition, recent research reveals that metrics like BLEU, spBLEU, and chrF correlate poorly with human ratings (source). Therefore <Model Name> or <Model Name>, though achieving best-in-class BLEU scores in research papers shared widely at conferences, can still yield less desirable translation quality in enterprise use cases. 


A better alternative, put forth by Google Research in 2022, might be Neural Metrics. However, these metrics have not yet been widely adopted across the industry. 
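
As a minimal Python sketch of that discrepancy – assuming the open-source sacrebleu and BLEURT packages with a locally downloaded BLEURT-20 checkpoint, and using hypothetical example sentences – one can score a meaning-breaking candidate against a faithful paraphrase and watch the two kinds of metric disagree:

  # Minimal sketch: compare surface-overlap BLEU with a learned metric on two
  # candidates – a faithful paraphrase vs. a high-overlap meaning error.
  # Assumes `sacrebleu` plus the open-source `bleurt` package with a downloaded
  # BLEURT-20 checkpoint; paths and sentences are illustrative only.
  import sacrebleu
  from bleurt import score as bleurt_score

  reference = "Your refund has been approved and will arrive within five days."
  paraphrase = "We approved your refund; expect it in five days."          # meaning kept
  meaning_error = ("Your refund has not been approved and will arrive "
                   "within five days.")                                    # meaning broken

  scorer = bleurt_score.BleurtScorer("BLEURT-20")

  for label, candidate in [("paraphrase", paraphrase), ("meaning error", meaning_error)]:
      bleu = sacrebleu.sentence_bleu(candidate, [reference]).score
      learned = scorer.score(references=[reference], candidates=[candidate])[0]
      # BLEU rewards the high-overlap but wrong candidate over the paraphrase;
      # a learned metric is designed to track meaning more closely.
      print(f"{label}: BLEU={bleu:.1f}, BLEURT={learned:.2f}")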



What’s Next in Research 

Low-resource language translation quality with multilingual models


As further research (link redacted) in the past few years has shown in the LLM space, generalist models are more valuable than specialist models for translation. A single XB-parameter multilingual model outperforms ZZ individual models of ~YYYM parameters each. This means that even dedicated model training on low-resource languages may still not yield translation quality as accurate as simply training a much larger multilingual base model. 


In 2021, a single ZZZB-parameter model handled YYY languages and was deployed to XX language pairs in production via <Model Name>. 


[section shortened] 

