Translation powered by LLM
Nowadays, it’s very handy to have an LLM translate text for us in almost any language. ChatGPT is my daily translator now: it’s powerful and easy to use. However, if you have a large number of documents to translate, the problem becomes challenging. In this article, I will show you my journey to create a translation service powered by an LLM, with an infrastructure based on Kubernetes.
Objective
As a father of two kids who are always sick, I often have to check medical information in different languages.
Recently, I decided to translate an archive (a subset) of notices from the BASE DE DONNÉES PUBLIQUE DES MÉDICAMENTS into English and Chinese for daily use. There are about 14k notices in total. Although I have only downloaded a minor subset, it would still be a nightmare to translate them through the ChatGPT interface. You can find an example of a notice here: DOLIPRANE 1000 mg, comprimé - Notice patient.
So I need to create a translation service to handle this task, and to make it easy to use so that I can reuse it in the future when the notices are updated.
The service should:
- Be low cost
- Translate text from any language to any other language (I want my friends to be able to use it too)
- Be able to handle a large amount of text
- Produce output in a readable and programmatically processable format (so that I can show it in a web interface)
The plan
Use commercial APIs to set up a baseline, as their quality is generally good. Then compute the cost. If the cost is too high, I will try to use a local LLM to create the translation service.
I will begin with prompt engineering in the OpenAI chat interface, and test the prompt on several local models at the same time. Then, based on the cost and the quality of the translation, I will choose the final solution. At the beginning, I will work on fixed language pairs: French to English and French to Chinese.
The prompt engineering
To begin with, a good prompt will save us a lot of time.
An example of prompt
I took advantage of a prompt from Dify.ai. You can find the workflow here. To summarize, this is a multi-step workflow for term-based translation. It does the following:
- Identify the technical terms and build a glossary for them.
- Translate the text directly.
- Identify the problems in the translation.
- Refine the translation based on the problems identified.
The prompts are as follows:
Step1
|
|
Step2
|
|
Step3
|
|
Step4
|
|
This prompt gives us a good start, but these steps are expensive when run together. There is also a bug in the workflow: step 1’s output is not used in the following steps. Aside from the cost issue, after several trials I found that this prompt doesn’t work well for my task: the translation is not accurate and the output is not readable.
I also tried to reproduce it with Chain of Thought, asking the model to make a plan for the translation to simulate these steps. The result was not good: for our task, translation quality and formatting are no longer predictable unless I add a lot of overhead to the prompt length and the inference time.
My prompt
After several trials with ChatGPT and with local models in LM Studio, I found that the following prompt is enough for my task.
|
|
There are several things to note:
- The <> markers are used to delimit the task and the constraints. Some may prefer [], but I found <> more effective.
- The constraints are important. It’s better to have them in the prompt, and they should be clear and easy to understand.
- The examples are important: they turn this into a few-shot prompt and help the model understand the task and the constraints. It’s crucial to have corresponding examples for each language pair, and adding them will be a future improvement of the prompt.
- The output should be in JSON format, which is easy to process programmatically. I used pydantic to validate the output; a sketch follows this list.
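To make the last two points more concrete, here is a minimal sketch of how the few-shot messages and the output validation can be wired together. The model name, the example sentences and the JSON keys are illustrative, not my exact prompt.

```python
import json

from openai import OpenAI
from pydantic import BaseModel, ValidationError


class Translation(BaseModel):
    # Illustrative schema: the prompt asks the model to return exactly these keys.
    source: str
    translation: str


# Few-shot messages: one worked example per language pair helps the model
# respect the task, the constraints and the JSON format.
messages = [
    {"role": "system", "content": (
        "<task>Translate the user's sentence from French to English.</task>"
        "<constraints>Return only a JSON object with the keys 'source' and 'translation'.</constraints>"
    )},
    {"role": "user", "content": "Ce médicament est réservé à l'adulte."},
    {"role": "assistant", "content": json.dumps({
        "source": "Ce médicament est réservé à l'adulte.",
        "translation": "This medicine is intended for adults only.",
    }, ensure_ascii=False)},
    {"role": "user", "content": "Ne dépassez pas la dose recommandée."},
]

client = OpenAI(base_url="http://localhost:4000", api_key="sk-anything")  # LiteLLM proxy (placeholder address)
response = client.chat.completions.create(model="qwen2.5:1.5b", messages=messages)

try:
    result = Translation.model_validate_json(response.choices[0].message.content)
except ValidationError:
    result = None  # retry or log the failure instead of silently dropping the sentence
```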
The document processing strategy
The HTML document contains a lot of house-baked tags, which are not easy to process. After some trials, I converted the HTML into a JSON format. This allows us to keep the structure of the document and to process the text easily.
|
|
This allows us to retrieve the information from the document easily, section by section.
Before settling on the prompt above, I tried to translate the whole raw document in a single call. The result was not good: the translation was not accurate and the output was not readable. Then I tried to pass the JSON document to the model. The model failed to keep the JSON keys intact: the keys were also translated, which caused a lot of problems.
The final solution is simply to pass the text sentence by sentence to the model. This lets the model focus on a single sentence while the structure of the JSON document is preserved.
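A rough sketch of that loop, assuming a sectioned JSON document (the keys "title", "sections" and "sentences" are an assumed structure for illustration, and the proxy address and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-anything")  # LiteLLM proxy (placeholder address)


def translate_sentence(sentence: str, target_lang: str) -> str:
    # One chat completion per sentence; the real prompt also carries the
    # constraints and the few-shot examples discussed above.
    resp = client.chat.completions.create(
        model="qwen2.5:1.5b",
        messages=[
            {"role": "system", "content": f"Translate the user's sentence into {target_lang}."},
            {"role": "user", "content": sentence},
        ],
    )
    return resp.choices[0].message.content


def translate_document(doc: dict, target_lang: str) -> dict:
    # Only the values are sent to the model, never the keys,
    # so the structure of the JSON document survives the translation.
    return {
        "title": translate_sentence(doc["title"], target_lang),
        "sections": [
            {
                "title": translate_sentence(section["title"], target_lang),
                "sentences": [translate_sentence(s, target_lang) for s in section["sentences"]],
            }
            for section in doc["sections"]
        ],
    }
```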
I am still working on the prompt. The next step will be to add more context to the prompt, not just the single sentence. So far, the prompt works well in most cases.
The cost and the choice of models
Commercial APIs
By translating the documents sentence by sentence, the cost is very high. With OpenAI GPT-4o, the total cost of translating the docs with this prompt comes to about 500 dollars, which is not acceptable at all. As a tier-1 OpenAI user, my rate limit is 500 RPM, which lets me translate about 3 docs per minute. That is still too slow.
Even with the Batch API, the cost is still too high for this task.
There is also DeepSeek’s API, which is known for its low cost, but it is still $0.14 per 1M input tokens and $0.28 per 1M output tokens.
Local LLM
I then tried to use local LLMs to create the translation service.
With local LLMs of at most 7B parameters (4-bit quantization), the cost is just my electricity bill plus some extra waiting time. Using 2 RTX A2000 6GB (70 W TDP) and 1 RTX 3090 (undervolted to 250 W), I can translate about 4 docs per minute. That is acceptable.
I tried several models:
- Mistral-7B
- LLaMa3.2 8B
- LLaMa3.2 1B
- Qwen2.5 7B
- Qwen2.5 1.5B
- Yi-1.5 6B
…
LLaMa and Mistral are not good at Chinese. Qwen2.5 is shockingly good, and the most surprising part for me is that the 1.5B model is good enough for a lot of things, not just translation.
The final solution
The model and the prompt
With all the elements above, the solution is clear: I will use the Qwen2.5 1.5B model to build the translation service on my own GPUs. The translation will be done at the sentence level, and the output will be in JSON format.
The devices
- 2 RTX A2000 6GB (70 Watt TDP)
In the end, I didn’t use the RTX 3090 because it is too power-hungry and noisy for the room.
At the beginning, I asked a friend for help and used his GPU remotely, but the latency was too high.
The tech stack - with Kubernetes
- Argo Workflows for running the translation jobs
- Ollama for model serving (external service)
- LiteLLM as the model gateway (proxy)
- MongoDB for document storage
- Langfuse for tracing the LLM calls
- Prometheus for monitoring the jobs
The illustration of the architecture is as follows:
The Ollama services are deployed as external services rather than as internal pods using the GPU Operator in Kubernetes, so that I can turn the GPU machines off when I don’t need them: the service definitions stay in place and the Kubernetes cluster is not broken. I only turn on the GPU machines when I need to run the models.
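For illustration, here is roughly what such an external service looks like: a Service without a selector plus a manually managed Endpoints object pointing at the GPU machine. When the machine is off, the Service simply has no reachable endpoint and the rest of the cluster is unaffected. The namespace, names and IP are placeholders, not my actual setup.

```yaml
# Sketch of one Ollama instance exposed to the cluster as an external service.
apiVersion: v1
kind: Service
metadata:
  name: ollama-gpu-1
  namespace: llm
spec:
  ports:
    - port: 11434
      targetPort: 11434
---
apiVersion: v1
kind: Endpoints
metadata:
  name: ollama-gpu-1   # must match the Service name
  namespace: llm
subsets:
  - addresses:
      - ip: 192.168.1.50   # the GPU machine on the local network (placeholder)
    ports:
      - port: 11434
```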
In future posts, I will show you how the infrastructure is built.
LiteLLM configuration
At the beginning, there was no LiteLLM and I was mixing several machines, so I handled the load balancing in my Python code with round-robin calls to the models. It was neither efficient nor scalable.
Then I discovered LiteLLM. I decided to make it my general LLM Gateway.
I set up my configuration based on its tutorial. The configuration is as follows:
|
|
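A minimal sketch of such a configuration, based on the LiteLLM proxy docs: the same public model name is declared once per Ollama machine, and the rpm/tpm values are used as weights by the router. The hosts and the numbers are placeholders, not my actual setup.

```yaml
model_list:
  - model_name: qwen2.5:1.5b          # the name clients call through the proxy
    litellm_params:
      model: ollama/qwen2.5:1.5b
      api_base: http://192.168.1.50:11434
      rpm: 60
      tpm: 40000
  - model_name: qwen2.5:1.5b          # same public name, second Ollama machine
    litellm_params:
      model: ollama/qwen2.5:1.5b
      api_base: http://192.168.1.51:11434
      rpm: 60
      tpm: 30000

router_settings:
  routing_strategy: simple-shuffle    # the default weighted pick, weighted by rpm/tpm

litellm_settings:
  success_callback: ["langfuse"]      # forward traces to Langfuse
```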
Don’t forget that the main reason to use LiteLLM is its routing strategies. There are several strategies available:
- Weighted Pick
- Rate Limit Aware
- Least Busy
- Latency Based
- Cost Based
I used the default strategy, which is the Weighted Pick. It’s enough for my case: it picks a model using the RPM (requests per minute) and TPM (tokens per minute) values as weights.
For more information, you can check the LiteLLM documentation.
With Ollama, the TPM can be retrieved with the following command:
|
|
Example output (from my MacBook):
|
|
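If you prefer to measure it programmatically, the Ollama HTTP API also reports eval_count and eval_duration for each generation, from which tokens per second (and thus an approximate TPM) can be derived. A small sketch, assuming Ollama is running on its default port:

```python
import requests

# Rough throughput estimate against a local Ollama instance (default port assumed).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen2.5:1.5b", "prompt": "Translate 'bonjour' into English.", "stream": False},
).json()

# eval_count is the number of generated tokens, eval_duration is in nanoseconds.
tokens_per_second = resp["eval_count"] / resp["eval_duration"] * 1e9
print(f"{tokens_per_second:.1f} tokens/s, roughly {tokens_per_second * 60:.0f} TPM for a single request stream")
```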
Langfuse tracing
Langfuse is a tool for tracing LLM calls, and it’s very useful for debugging them. I use the self-hosted version, which lacks the prompt engineering feature. The web UI is very nice for tracing, and the SDK is very easy to integrate into my job.
Here is the example from the Langfuse docs; you can find more information here.
|
|
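For reference, a minimal tracing setup with the Langfuse Python SDK looks roughly like this: the OpenAI drop-in wrapper records every call, and the observe decorator groups them into one trace. The proxy address and model name are placeholders.

```python
# Requires LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST in the environment.
from langfuse.decorators import observe
from langfuse.openai import openai  # drop-in wrapper that traces OpenAI-compatible calls

client = openai.OpenAI(base_url="http://localhost:4000", api_key="sk-anything")  # LiteLLM proxy (placeholder)


@observe()  # groups the calls below into a single trace in the Langfuse UI
def translate(sentence: str) -> str:
    resp = client.chat.completions.create(
        model="qwen2.5:1.5b",
        messages=[{"role": "user", "content": f"Translate into English: {sentence}"}],
    )
    return resp.choices[0].message.content


print(translate("Ce médicament contient du paracétamol."))
```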
Final words
In this article, I showed you my journey to create a translation service powered by an LLM. There is more to do:
- The prompt engineering is not finished. I need to add more context to the prompt.
- I need a more flexible way to handle different languages.
- The infrastructure is not finished. I still need to show you how to deploy the service with Kubernetes.
- The translation is not validated. I am currently exploring a validation process using the LLM-as-judge method. I am trying to do it the agentic way with several frameworks: DSPy, AutoGen, Pydantic.ai, Swarm, and more. So far, I find AutoGen’s Society of Mind agent with RoundRobinGroupChat a lot of fun.
I hope you enjoyed the article. If you have any suggestions, please let me know.
In the next article, I will show you my exploration on different agentic frameworks to validate the translation.
References
- Prompt Engineering Guide
- BASE DE DONNÉES PUBLIQUE DES MÉDICAMENTS - The source of the notices and the License
- LiteLLM documentation - The LLM Gateway
- Langfuse documentation - LLM engineering platform
- Ollama documentation - Ollama is a lightweight, extensible framework for building and running language models on the local machine.
- Dify.ai - The next-gen development platform: easily build and operate generative AI applications. Create Assistants API and GPTs based on any LLMs.