
UniCoder: Scaling Code Large Language Model via Universal Code

Intermediate reasoning or acting steps have successfully improved large language models (LLMs) on various downstream natural language processing (NLP) tasks. When applying LLMs to code generation, recent works mainly direct the models to articulate intermediate natural-language reasoning steps, as in chain-of-thought (CoT) prompting, and then output code alongside the natural language or other structured intermediate steps. However, such output is poorly suited to code translation or generation tasks, since standard CoT differs from code in both logical structure and form of expression. In this work, we introduce the universal code (UniCode) as the intermediate representation. It is a description of algorithm steps using a mix of programming-language conventions, such as the assignment operator, conditional operator, and loop. We then collect an instruction dataset, UniCoder-Instruct, to train our model UniCoder with multi-task learning objectives. UniCoder-Instruct comprises natural-language questions, code solutions, and the corresponding universal code. The alignment between the intermediate universal code representation and the final code solution significantly improves the quality of the generated code. The experimental results demonstrate that UniCoder with the universal code outperforms previous prompting methods by a large margin, showcasing the effectiveness of the structural clues in pseudo-code. Code is available at https://github.com/ASC8384/UniCoder.

1 Introduction

The field of code translation and generation has advanced significantly Szafraniec et al. (2023); Yan et al. (2023) with the advent of code-specific large language models (LLMs). Code LLMs, such as StarCoder Li et al. (2023b) and Code-Llama Rozière et al. (2023), are capable of generating executable code by analyzing natural language prompts. Chain-of-thought (CoT) prompting Wei et al. (2022b) has emerged as the leading technique in enhancing LLMs, where the intermediate steps provide a structured pathway from the problem statement to the solution, effectively mirroring the human problem-solving process.

Figure 1: An example of UniCoder. The Code LLM solves the code generation question by “translating” the pseudocode description (Universal Code) into executable code of the target programming language.

Given the low accuracy of CoT in code generation, structured CoT (SCoT) Li et al. (2023a) was proposed to narrow the gap between the intermediate steps and the generated code. More intuitively, using a universal code as the intermediate representation to handle multiple programming languages (PLs) is promising. Here, universal code is a blueprint for implementing an algorithm, which helps to make the design of algorithms logically clear and readily comprehensible. Moreover, it is universal across different programming languages (PL-agnostic) since it typically does not follow a specific syntax and omits execution details. Yet how the universal code can be used for code translation and generation in multilingual scenarios remains underexplored.

In this work, we scale up code LLMs to support multiple programming languages via the universal code (UniCode), which is used as an efficient and language-independent intermediate representation of the key algorithm principles. Specifically, we first define UniCode by specifying grammar rules and providing paradigms, and then prompt GPT-4 OpenAI (2023) to create an instruction dataset, UniCoder-Instruct, comprising natural-language questions, code solutions, and the corresponding universal code, as shown in Figure 1. Then, the UniCoder model is built by performing instruction tuning Wei et al. (2022a) on multi-task learning objectives, including zero-shot question-answer generation (question → code), question-universal-code generation (question → UniCode → code), universal-code-solution translation (UniCode → code), and Universal-code-of-Thought (UoT) objectives. In UoT, the model is required to generate the universal code before the executable code.

UniCoder is evaluated on the Python benchmarks (HumanEval Chen et al. (2021) and MBPP Austin et al. (2021)) and the extended multilingual benchmark MultiPL-E Cassano et al. (2022). The results demonstrate that UniCoder consistently achieves state-of-the-art performance across all languages, notably surpassing the previous baselines. Furthermore, the ablation study verifies the efficacy of the proposed method, and additional discussions provide insights into its effect. The contributions are summarized as follows:

  • We introduce the universal code UniCode, which is agnostic to programming languages, allowing LLMs to grasp the essence of algorithms step by step. In addition, the instruction dataset UniCoder-Instruct is collected and provided for follow-up research.

  • We propose UniCoder, a code generation method that uses multi-task learning objectives to fine-tune the code LLMs with the help of UniCode. The objectives include question-answer generation (QA), question-universal-code generation (QP), universal-code-answer translation (PA), and Universal-code-of-Thought (UoT).

  • As extensive experiments show, our method UniCoder consistently outperforms the previous baselines on different benchmarks, including HumanEval, MBPP, and MultiPL-E. To further verify the effectiveness of the universal code, we propose UniCoder-Bench to test the capabilities of code LLMs.

2 UniCoder-Instruct

Figure 2: Definition of the universal code.
Figure 3: Prompt of generating UniCode.

Definition of Universal Code.

Universal code is designed for expressing algorithms in a form that is easily understood by humans, blending programming language syntax with natural language descriptions and mathematical notation to outline the steps of an algorithm without the complexity of full coding details. It omits machine-specific implementations to focus on the core logic, making it a popular choice for documentation in educational materials and the preliminary design phases of software development. By abstracting away from the intricacies of actual code, pseudocode facilitates clear communication of algorithmic concepts across various programming environments. The definition of the universal code, as shown in Figure 2, is based on the following principles:

  • Comments: Provide explanations and context for code segments, making it easier for others to understand the intent and functionality.

  • Variables: Enhance code readability and maintainability by using meaningful names that convey the purpose of the variables without relying on data type specifications.

  • Input/Output: Simplify the interaction with data entering and leaving the system, ensuring these operations are clear and easy to trace.

  • Conditionals: Clarify decision-making processes within the code by using structured and indented conditional statements that define clear execution paths.

  • Loops: Facilitate the repetition of code blocks in a controlled manner, with clearly defined start and end conditions, making the iterative processes understandable.

  • Functions/Procedures: Increase modularity and reusability by naming functions and procedures descriptively, and by using parameters effectively to encapsulate functionality.

  • Formatting: Improve the overall visual organization of the code by applying consistent indentation, which helps in delineating hierarchical structures and logical groupings within the code.

Construction From Instruction Dataset.

For a programming language $L$, given an existing code instruction pair $(q_\alpha, a_\alpha) \in D_s^{L}$, where $q_\alpha$ and $a_\alpha$ are a question and an answer from $D_s^{L}$, we create the universal code instruction dataset $D_{u_\alpha}^{L}$ by prompting LLMs to generate the universal code $p_\alpha$ and then adding $(q_\alpha, a_\alpha, p_\alpha)$ to $D_{u_\alpha}^{L}$. Figure 2 shows the definition of the universal code, and Figure 3 shows the prompt used for LLMs to generate UniCode. {Definition of Universal Code}, {Question}, and {Answer} denote the slots for the definition of the universal code, the question $q_\alpha$ of the instruction data, and the answer $a_\alpha$ of the instruction data, respectively. Given $K$ different programming languages $L_{all} = \{L_k\}_{k=1}^{K}$, the multilingual programming instruction dataset with the universal code, $D_{u_\alpha} = \{D_{u_\alpha}^{L_k}\}_{k=1}^{K}$, is created for supervised fine-tuning (SFT) Ouyang et al. (2022). In this work, we adopt an open-source instruction dataset.
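The slot-filling step described above can be sketched as follows. This is a minimal illustration only: the template wording, the function names, and the `generate` callable (standing in for any LLM call) are assumptions, not the actual prompt of Figure 3.

```python
# Hypothetical prompt template; the real wording is the one shown in Figure 3.
PROMPT_TEMPLATE = (
    "{definition}\n\n"
    "Question:\n{question}\n\n"
    "Answer:\n{answer}\n\n"
    "Write the universal code for the answer above."
)


def build_unicode_prompt(definition: str, question: str, answer: str) -> str:
    """Fill the three slots of the (assumed) prompt template."""
    return PROMPT_TEMPLATE.format(
        definition=definition, question=question, answer=answer
    )


def add_universal_code(pairs, definition, generate):
    """Turn (question, answer) pairs into (question, answer, universal code) triplets.

    `generate` is any callable mapping a prompt string to generated text;
    it is a placeholder for the LLM and is not part of the original work.
    """
    triplets = []
    for question, answer in pairs:
        prompt = build_unicode_prompt(definition, question, answer)
        triplets.append((question, answer, generate(prompt)))
    return triplets
```

Passing the generator as a parameter keeps the sketch independent of any particular LLM client.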

Construction From Code Snippets.

For the unsupervised data (code snippets) widely available on many websites (e.g., GitHub), we also construct the instruction dataset with the universal code from raw code snippets. Specifically, we ask the LLM to generate a question $q_\beta$ and the corresponding code answer $a_\beta$ based on the original code snippet $c$ using the prompt “Please generate the self-contained question and answer based on the given code snippet”. Then, we generate UniCode $p_\beta$ and construct $(q_\beta, a_\beta, p_\beta)$ triplets in the same way as for the existing instruction dataset. In addition, an LLM scorer is applied to filter out low-quality $(q_\beta, a_\beta, p_\beta)$ triplets. Therefore, given raw code snippets of different programming languages $L_k \in \{L_k\}_{k=1}^{K}$, we can construct the instruction dataset with the universal code $D_{u_\beta} = \{D_{u_\beta}^{L_k}\}_{k=1}^{K}$ directly from such unsupervised data. Finally, we combine these two instruction datasets to obtain $D_u = D_{u_\alpha} \cup D_{u_\beta}$, where $D_u^{L_k} = D_{u_\alpha}^{L_k} \cup D_{u_\beta}^{L_k}$ for each programming language $L_k \in L_{all}$.
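The filtering and per-language union described above might look like the sketch below. The `scorer` callable, the threshold value, and the dictionary layout (language name mapped to a list of triplets) are illustrative assumptions; the text does not specify how the LLM scorer is applied.

```python
from collections import defaultdict


def filter_by_score(triplets, scorer, threshold=0.5):
    """Keep only triplets whose score reaches the threshold.

    `scorer(question, answer, universal_code)` stands in for the LLM scorer;
    both it and the threshold are hypothetical.
    """
    return [t for t in triplets if scorer(*t) >= threshold]


def merge_by_language(d_alpha, d_beta):
    """Union the two instruction datasets language by language."""
    merged = defaultdict(list)
    for source in (d_alpha, d_beta):
        for language, triplets in source.items():
            merged[language].extend(triplets)
    return dict(merged)
```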

Evaluation Task for Universal Code.

To test the capability of LLMs in generating UniCode from questions and translating UniCode into answers, we design a code reconstruction task for evaluation. Given the code snippet $c$, we require the LLM to generate UniCode $p$ and then translate it back into code $c'$. The evaluation metric is not the similarity between $c$ and $c'$ but whether the restored code $c'$ can pass the test cases. We expand the HumanEval and MBPP datasets to create our benchmark UniCoder-Bench, comprising 164 HumanEval samples and 500 MBPP test samples.
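A minimal sketch of this reconstruction check is given below, assuming Python snippets and assert-based tests. The `to_unicode` and `from_unicode` callables stand in for the two LLM calls and are hypothetical. The sketch runs code with `exec` directly, so it is suitable only for trusted inputs; a real harness would sandbox execution.

```python
def passes_tests(code: str, test_code: str) -> bool:
    """Run candidate code, then assert-based tests; True if nothing raises."""
    namespace = {}
    try:
        exec(code, namespace)       # define the candidate functions
        exec(test_code, namespace)  # run the assertions against them
    except Exception:
        return False
    return True


def reconstruction_pass_rate(samples, to_unicode, from_unicode):
    """Fraction of snippets whose code -> UniCode -> code round trip passes.

    `samples` is a list of (code, test_code) pairs. The restored code is
    judged by the tests only, never by similarity to the original snippet.
    """
    passed = 0
    for code, test_code in samples:
        universal_code = to_unicode(code)
        restored = from_unicode(universal_code)
        passed += passes_tests(restored, test_code)
    return passed / len(samples)
```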

Figure 4: Overview of UniCoder. (a) The function of the universal code UniCode; (b) the framework of our method UniCoder. With the universal code as the intermediate representation, our proposed framework can support code generation, code translation, and code summarization. In (a), the LLM encodes code snippets in multiple programming languages or problem descriptions (questions) into UniCode. UniCode is then translated into the target output, i.e., executable code in multiple programming languages with a descriptive code summary. In (b), we first ask the LLM to generate UniCode with few-shot prompts. In the second stage, the instruction dataset, containing questions, answers, and UniCode, is fed into the code LLM for fine-tuning.

3 UniCoder

3.1 Model Overview

In Figure 4, we first define the concept of the universal code with the essential components and then prompt the LLM to generate UniCode p based on the existing instruction data (questions q and answers a) and the raw code snippets c. UniCode is regarded as the intermediate representation for different tasks, including code generation, code translation, and code summarization. Our proposed model UniCoder is trained on the instruction dataset Du with the multilingual objectives to fully unleash the potential of UniCode.

3.2 Code LLM with Universal Code

Given the instruction dataset covering $K$ programming languages, $D_u = \{D_u^{L_k}\}_{k=1}^{K}$, the pre-trained code LLM $\mathcal{M}$ trained on $D_u$ can support Universal-code-of-Thought (UoT). It can be described as:

$$P(p, a \mid q) = P(p \mid q; \mathcal{M}) \, P(a \mid q, p; \mathcal{M}) \tag{1}$$

where q (question) and a (answer) are the instruction pair from Du. Given the question q, the code LLM first generates UniCode p and then outputs the final answer a, where p provides key algorithm ideas with natural language comments.
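Taking logarithms of Equation (1) makes explicit that UoT generation is ordinary left-to-right decoding over the concatenation of UniCode and answer. The token-level expansion below assumes a standard autoregressive model; the symbol $\mathcal{M}$ for the code LLM and the token indices $t$ are notation introduced here for illustration.

```latex
\begin{aligned}
\log P(p, a \mid q)
  &= \log P(p \mid q; \mathcal{M}) + \log P(a \mid q, p; \mathcal{M}) \\
  &= \sum_{t=1}^{|p|} \log P(p_t \mid q, p_{<t}; \mathcal{M})
   + \sum_{t=1}^{|a|} \log P(a_t \mid q, p, a_{<t}; \mathcal{M})
\end{aligned}
```

The first sum scores the universal code given the question, and the second scores the executable answer given both the question and the universal code.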

3.3 Multi-task Supervised Fine-tuning

To fully exploit UniCode, we design multiple objectives to enhance the understanding and generation capabilities of the code LLM.

Multi-task Fine-tuning.

$$\mathcal{L}_{all} = \mathcal{L}_{qa} + \mathcal{L}_{qp} + \mathcal{L}_{pa} + \mathcal{L}_{uot} \tag{2}$$

where $\mathcal{L}_{qa}$ is the question-answer generation objective, $\mathcal{L}_{qp}$ is the question-universal-code generation objective, $\mathcal{L}_{pa}$ is the universal-code-answer translation objective, and $\mathcal{L}_{uot}$ is the Universal-code-of-Thought (UoT) objective.

Here, we introduce all four training objectives. For all the following objectives, the multilingual corpora $D_u = \{D_u^{L_k}\}_{k=1}^{K}$ are given, $\mathcal{M}$ is the code LLM, and $K$ is the number of programming languages.

Question-Answer Objective.

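The training objective $\mathcal{L}_{qa}$ of the standard instruction fine-tuning can be described as:

$$\mathcal{L}_{qa} = -\sum_{k=1}^{K} \mathbb{E}_{q, a \sim D_u^{L_k}} \left[ \log P(a \mid q; \mathcal{M}) \right] \tag{3}$$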

where q and a are the question and answer pair.

Question-Universal-Code Objective.

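The training objective $\mathcal{L}_{qp}$ of the auxiliary universal code generation task can be described as:

$$\mathcal{L}_{qp} = -\sum_{k=1}^{K} \mathbb{E}_{q, p \sim D_u^{L_k}} \left[ \log P(p \mid q; \mathcal{M}) \right] \tag{4}$$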

where q and p are the question and UniCode.

Universal-Code-Answer Objective.

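The training objective $\mathcal{L}_{pa}$ of generating the executable code answer from UniCode can be described as:

$$\mathcal{L}_{pa} = -\sum_{k=1}^{K} \mathbb{E}_{p, a \sim D_u^{L_k}} \left[ \log P(a \mid p; \mathcal{M}) \right] \tag{5}$$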

where p and a are UniCode and the answer.

Universal-Code-of-Thought Objective.

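The training objective $\mathcal{L}_{uot}$ of generating UniCode and then the executable code answer can be described as:

$$\mathcal{L}_{uot} = -\sum_{k=1}^{K} \mathbb{E}_{q, p, a \sim D_u^{L_k}} \left[ \log P(p, a \mid q; \mathcal{M}) \right] \tag{6}$$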

where q, a, and p are the question, answer, and UniCode, respectively.
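One simple way to realize the four objectives in a standard sequence-to-sequence fine-tuning setup is to expand each (question, UniCode, answer) triplet into four source/target pairs, as sketched below. The dictionary layout, task labels, and the newline used to join UniCode and answer in the UoT target are assumptions for illustration; the actual data format is not specified here.

```python
def build_multitask_examples(question, universal_code, answer):
    """Expand one triplet into one training pair per objective.

    Each pair is trained with the usual log-likelihood of `target` given
    `source`, so summing the four losses mirrors the combined objective.
    """
    return [
        # Question-answer objective: q -> a
        {"task": "qa", "source": question, "target": answer},
        # Question-universal-code objective: q -> p
        {"task": "qp", "source": question, "target": universal_code},
        # Universal-code-answer objective: p -> a
        {"task": "pa", "source": universal_code, "target": answer},
        # Universal-code-of-Thought objective: q -> p followed by a
        {"task": "uot", "source": question,
         "target": universal_code + "\n" + answer},
    ]
```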

4 Experimental Setup

4.1 Instruction Dataset

GPT-4 (gpt-4-1106-preview) OpenAI (2023) is used as the foundation model to generate UniCoder-Instruct. We randomly extract code snippets of at most 1024 tokens from the StarCoder dataset Li et al. (2023b) and let GPT-4 summarize the code snippets as the universal code. Based on each code snippet and the corresponding universal code, a self-contained coding problem with a correct solution is created.

4.2 Baselines

Proprietary Models.

Based on a neural architecture known as generative pre-trained Transformers (GPT) Vaswani et al. (2017); Radford et al. (2018), GPT-3.5 and GPT-4 are LLMs trained on massive datasets of text, code, math equations, and more. They are also trained to follow instructions Ouyang et al. (2022), which allows them to generate human-like responses. We use GPT-3.5 Turbo and GPT-4 as the proprietary models because they perform excellently in various code understanding and generation tasks.

Open-Source Models.

To narrow the gap between open-source and closed-source models, a series of open-source models and instruction datasets have been proposed to improve code LLMs and bootstrap their instruction-following ability. StarCoder Li et al. (2023b), Code Llama Rozière et al. (2023), and DeepSeek-Coder Guo et al. (2024a), with different model sizes, serve as the base models. OctoCoder Muennighoff et al. (2023), WizardCoder Luo et al. (2023), Magicoder Wei et al. (2023), and WaveCoder Yu et al. (2023) are further fine-tuned on these base code LLMs.


Before training our UniCoder models, we decontaminate the code snippets from the StarCoder data Li et al. (2023b) by removing exact matches with HumanEval Chen et al. (2021), MBPP Austin et al. (2021), DS-1000 Lai et al. (2023), and GSM8K Cobbe et al. (2021).
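Exact-match decontamination of this kind can be sketched as below. Treating a "match" as a whitespace-normalized substring match is an assumption made for this illustration; the precise matching rule used in the experiments is not stated.

```python
def normalize(text: str) -> str:
    """Collapse all whitespace runs so formatting differences do not matter."""
    return " ".join(text.split())


def decontaminate(snippets, benchmark_solutions):
    """Drop every snippet that contains a benchmark solution verbatim.

    Matching is done on whitespace-normalized text (an assumed rule).
    """
    banned = [normalize(s) for s in benchmark_solutions if s.strip()]
    kept = []
    for snippet in snippets:
        flat = normalize(snippet)
        if not any(b in flat for b in banned):
            kept.append(snippet)
    return kept
```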

4.3 Evaluation Benchmark


The HumanEval test set Chen et al. (2021) is a hand-crafted collection of 164 Python programming problems for testing the abilities of code generation models. Each problem comes with roughly 9.6 test cases on average to check whether the generated code works as intended. HumanEval has become one of the most popular benchmarks for measuring the performance of code LLMs.


The MBPP dataset Austin et al. (2021), comprising approximately 1,000 crowd-sourced Python programming challenges, is tailored for beginners in programming, focusing on core principles and the usage of the standard library. The MBPP test set, comprising 500 problems, is used to evaluate the few-shot inference of the code LLMs.


The MultiPL-E test set Cassano et al. (2022) translates the original HumanEval test set into 18 other programming languages, e.g., JavaScript, Java, TypeScript, C++, and Rust. We use MultiPL-E to evaluate the multilingual capabilities of the code LLMs.

4.4 Evaluation Metrics


We adopt the Pass@k metric Chen et al. (2021) to improve the reliability of our evaluation. For each problem, we generate $n$ code samples and count the number $c$ of samples that pass all the test cases. Pass@k then estimates the probability that at least one of $k$ samples drawn from the $n$ generations is correct, averaged over all problems:

$$\text{Pass@}k = \mathbb{E}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] \tag{7}$$

where $n$ is the total number of generated samples for each problem, and $c$ is the number of correct generated code snippets passing all the test cases ($n \geq k$ and $0 \leq c \leq n$).
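The estimator in Equation (7) can be computed directly with exact binomial coefficients, as in the sketch below. The function names and the `(n, c)` input layout are choices made for this illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples passing all tests, k: sample budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # in which case any draw of k samples contains a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With greedy decoding there is a single sample per problem ($n = k = 1$), so Pass@1 reduces to the fraction of problems whose one generated solution passes all tests.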

4.5 Implementation Details

We expand the open-source Evol-Instruct dataset evol-codealpaca-v1 Xu et al. (2023), with nearly 110K samples, into the instruction dataset with the universal code. For the code snippets collected from starcoderdata (https://huggingface.co/datasets/bigcode/starcoderdata), we choose 5K code snippets for each language (Python, JavaScript, C++, Java, Rust, and Go) to construct the synthetic instruction dataset with universal code. The resulting instruction dataset UniCoder-Instruct contains nearly 140K training samples. Code-Llama and DeepSeek-Coder-Base are used as the foundational code LLMs for supervised fine-tuning (SFT). We fine-tune these foundation LLMs on nearly 150K samples generated from evol-codealpaca-v1 and the StarCoder pre-training data. UniCoder is fine-tuned using the Stanford_Alpaca codebase (https://github.com/tatsu-lab/stanford_alpaca) on 8 NVIDIA A100-80GB GPUs. The learning rate first increases to $8 \times 10^{-5}$ over 50 warmup steps and then follows a cosine decay schedule. We adopt the Adam optimizer Kingma and Ba (2015) with a global batch size of 128 samples, truncating sequences to 1536 tokens.
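A warmup-then-cosine schedule of this shape can be written as below. Several details are assumptions for illustration: the warmup is taken to be linear, the decay is taken to end at zero, the peak is read as 8e-5, and `total_steps` is a hypothetical parameter because the number of training steps is not stated.

```python
import math


def learning_rate(step, total_steps, peak_lr=8e-5, warmup_steps=50, min_lr=0.0):
    """Learning rate at a given optimizer step (0-indexed).

    Linear warmup to `peak_lr` over `warmup_steps`, then cosine decay
    towards `min_lr` at `total_steps`.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```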

5 Results and Discussion

| Models | Base Model | Params | HumanEval | MBPP |
|---|---|---|---|---|
| Proprietary Models | | | | |
| GPT-3.5 | - | - | 72.6 | 81.6 |
| GPT-4 | - | - | 85.4 | 83.0 |
| Open-source Models | | | | |
| StarCoder Li et al. (2023b) | - | 15B | 33.6 | 43.3 |
| WizardCoder Luo et al. (2023) | StarCoder | 15B | 57.3 | 51.8 |
| OctoCoder Muennighoff et al. (2023) | StarCoder | 15B | 46.2 | 43.5 |
| WaveCoder-SC Yu et al. (2023) | StarCoder | 15B | 50.5 | 51.0 |
| Code-Llama Rozière et al. (2023) | - | 7B | 33.5 | 41.4 |
| Code-Llama-Instruct Rozière et al. (2023) | Code Llama | 7B | 34.8 | 44.4 |
| WaveCoder-CL Yu et al. (2023) | Code Llama | 7B | 48.1 | 47.2 |
| Magicoder-CL Wei et al. (2023) | Code Llama | 7B | 60.4 | 64.2 |
| UniCoder (our method) | Code Llama | 7B | 65.4 | 65.2 |
| DeepseekCoder Guo et al. (2024a) | - | 6.7B | 49.4 | 60.6 |
| WaveCoder-DS Yu et al. (2023) | Deepseek-Coder | 6.7B | 64.0 | 62.8 |
| UniCoder (our method) | Deepseek-Coder | 6.7B | 70.6 | 64.3 |

Table 1: Evaluation results of Pass@1 (%) on the HumanEval and MBPP benchmarks. All methods use greedy decoding. We use the self-reported scores of previous work whenever available.

| Model | Params | Java | Javascript | C++ | PHP | Swift | Rust | Avg. |
|---|---|---|---|---|---|---|---|---|
| Proprietary models | | | | | | | | |
| GPT-3.5 | - | 69.2 | 67.1 | 63.4 | 60.9 | - | - | - |
| GPT-4 | - | 81.6 | 78.0 | 76.4 | 77.2 | - | - | - |
| Open-source models | | | | | | | | |
| CodeLlama Rozière et al. (2023) | 34B | 40.2 | 41.7 | 41.4 | 40.4 | 35.3 | 38.7 | 39.6 |
| CodeLlama-Python Rozière et al. (2023) | 34B | 39.5 | 44.7 | 39.1 | 39.8 | 34.3 | 39.7 | 39.5 |
| CodeLlama-Instruct Rozière et al. (2023) | 34B | 41.5 | 45.9 | 41.5 | 37.0 | 37.6 | 39.3 | 40.5 |
| WizardCoder-CL Luo et al. (2023) | 34B | 44.9 | 55.3 | 47.2 | 47.2 | 44.3 | 46.2 | 47.5 |
| StarCoderBase Li et al. (2023b) | 15B | 28.5 | 31.7 | 30.6 | 26.8 | 16.7 | 24.5 | 26.5 |
| StarCoder Li et al. (2023b) | 15B | 30.2 | 30.8 | 31.6 | 26.1 | 22.7 | 21.8 | 27.2 |
| WizardCoder-SC Luo et al. (2023) | 15B | 35.8 | 41.9 | 39.0 | 39.3 | 33.7 | 27.1 | 36.1 |
| CodeLlama Rozière et al. (2023) | 7B | 29.3 | 31.7 | 27.0 | 25.1 | 25.6 | 25.5 | 27.4 |
| CodeLlama-Python Rozière et al. (2023) | 7B | 29.1 | 35.7 | 30.2 | 29.0 | 27.1 | 27.0 | 29.7 |
| UniCoder (our method) | 7B | 46.4 | 50.2 | 39.2 | 40.4 | 41.2 | 32.4 | 41.6 |

Table 2: Evaluation results of Pass@1 (%) on the MultiPL-E benchmark. The baseline results are partly taken from previous work Wei et al. (2023).

5.1 Main Results

Python Code Generation.

Table 1 shows that UniCoder with UoT significantly beats previous strong open-source baselines, narrowing the gap with GPT-3.5 and GPT-4. Magicoder Wei et al. (2023) and WaveCoder Yu et al. (2023) both demonstrate the effectiveness of instruction datasets built from code snippets. Further, with the help of UniCode, UniCoder outperforms WizardCoder, which has 15B parameters and uses Evol-Instruct techniques.

Multilingual Code Generation.

Table 2 shows that UniCoder significantly outperforms the strong baselines CodeLlama and StarCoder. For the different backbones (Code Llama and Deepseek-Coder), our method beats most previous methods, especially in languages other than Python, which demonstrates that UniCoder-Instruct brings the capability of multilingual understanding and generation.

5.2 Discussion

Ablation Study.

To verify the efficacy of each component, we conduct a step-by-step ablation study on HumanEval and MBPP. In Table 3, we observe that removing the multi-task objective (keeping only the UoT objective, Equation 6) leads to a 3.2-point drop on HumanEval (70.6 to 67.4) and a 4.1-point drop on MBPP (64.3 to 60.2). Removing UniCode further degrades the performance. The results support the effectiveness of each component of UniCoder.

| ID | Methods | HumanEval | MBPP |
|---|---|---|---|
| | UniCoder | 70.6 | 64.3 |
| ① | - Multi-tasks Objective | 67.4 | 60.2 |
| ② | - Universal Code | 66.8 | 59.8 |

Table 3: Ablation study of our proposed method on HumanEval and MBPP. UniCoder is fine-tuned on UniCoder-Instruct with the multi-task objectives.

Effect on Universal Code.

To discuss the effect of the different formats of the universal code, we use different definitions of universal code for UniCoder. Specifically, we randomly sample 5K samples to generate the instruction dataset with different formats of UniCode.

  • UniCode 1: It describes the naming conventions, variable declaration, operators, conditional statements, loops, and function structure that pseudocode should have.

  • UniCode 2: It separates the first set of standards and provides code examples for each, instead of applying them all together in the examples.

  • UniCode 3: It describes the code structure, variable rules, control structures, functions, comments, and assignment rules that pseudocode should have.

  • UniCode 4: It is similar to the first standard but specifies type-free names for variables.

  • UniCode 5: It provides an abstract, high-level architectural description, without setting standards for the code itself.

  • UniCode 6: It uses the LaTeX algorithm and algorithmic packages for description.

| ID | Methods | HumanEval | MBPP |
|---|---|---|---|
| ① | UniCode 1 | 53.2 | 51.5 |
| ② | UniCode 2 | 52.8 | 51.2 |
| ③ | UniCode 3 | 53.5 | 50.5 |
| ④ | UniCode 4 | 53.8 | 49.5 |
| ⑤ | UniCode 5 | 49.5 | 50.2 |
| ⑥ | UniCode 6 | 48.2 | 48.4 |
| ⑦ | UniCode 1∼4 | 55.5 | 52.2 |

Table 4: Evaluation results of our method with different formats of the universal code.

In Table 4, we observe that UniCode 1 to UniCode 4 yield better performance. Compared to the universal code formats UniCode 5 and UniCode 6, UniCode 1 to UniCode 4 have a clear definition and a common structure, which provides more support for code generation. Notably, experiment ⑦ performs the best by combining the training data of ① to ④. The experimental results show that a concrete definition of UniCode, and the combination of several such definitions, can effectively improve model performance.

5.3 Code-UniCode-Code

To compare the capabilities of different code LLMs, we create a test set (denoted as UniCoder-Bench) by prompting the code LLM to generate UniCode and translate it into the executable code. We check the correctness of each translated code with the test cases, denoted as Pass@1 of the universal code. Code-Llama-7B is fine-tuned on the Code Alpaca dataset and our dataset UniCoder-Instruct separately. The results of fine-tuned Code-Llama models on UniCoder-Bench are shown in Table 5. Our method UniCoder is more accurate in passing the test cases than the Code-Llama baselines, demonstrating its excellent code understanding and generation abilities.

| Method | Params | Python | Other Languages |
|---|---|---|---|
| Code-Llama-Instruct | 7B | 33.3 | 26.2 |
| Code-Llama-Alpaca | 7B | 44.2 | 29.1 |
| UniCoder | 7B | 45.2 | 31.3 |

Table 5: Pass@1 scores of our method UniCoder and two Code-Llama baselines for Code-UniCode-Code.

6 Related Work

Code Understanding and Generation.

Code understanding and generation are key tasks that substantially facilitate the project development process, including code generation Chen et al. (2021); Austin et al. (2021); Zhang et al. (2023); Chai et al. (2024a); Deng et al. (2024), code translation Szafraniec et al. (2023), automated testing Deng et al. (2023), bug fixing Muennighoff et al. (2023), code refinement Liu et al. (2023c), code question answering Liu and Wan (2021), and code summarization Ahmad et al. (2020). Researchers Chai et al. (2023) have undertaken extensive endeavors to bridge natural language and programming languages. Mishra et al. (2023) show that pseudocode, as a less ambiguous prompt style, improves performance on NLP tasks. Oda et al. (2015) use traditional machine learning to convert code to pseudocode. Jiang et al. (2022) also show that designers and programmers can speed up the prototyping process and ground communication between collaborators via prompt-based prototyping. To verify that the generated code is correct, several code synthesis evaluation frameworks have been proposed, including EvalPlus Liu et al. (2023b), HumanEval Chen et al. (2021), HumanEval-X Zheng et al. (2023), and MBPP Austin et al. (2021).

Large Language Models for Code.

Since CodeBERT Feng et al. (2020) first connected code tasks with pre-trained models, large language models for code have developed rapidly, demonstrating extraordinary performance on almost all code tasks rather than on a single task. Prominent large models include Codex Chen et al. (2021), AlphaCode Li et al. (2022), SantaCoder Allal et al. (2023), StarCoder Li et al. (2023b), WizardCoder Luo et al. (2023), InCoder Fried et al. (2022), CodeT5 Wang et al. (2021), CodeGeeX Zheng et al. (2023), Code Llama Rozière et al. (2023), and Code-QWen Bai et al. (2023). To improve the performance of code generation, researchers have used optimized prompts Liu et al. (2023a); Reynolds and McDonell (2021); Zan et al. (2023); Beurer-Kellner et al. (2023), incorporated test cases Chen et al. (2023), and introduced collaborative roles Dong et al. (2023). There are also related studies on using large language models for other code tasks, such as dynamic programming Dagan et al. (2023), compiler optimization Cummins et al. (2023), multilingual prompts Di et al. (2023), and program of thoughts (PoT) Chen et al. (2022).

Chain-of-Thought Prompting.

To unleash the potential of LLMs Zhang et al. (2024); Liu et al. (2024); Que et al. (2024); Du et al. (2024) in addressing complex reasoning tasks, chain-of-thought (CoT) prompting Wei et al. (2022b); Kojima et al. (2022) extends in-context learning with step-by-step reasoning processes, which helps LLMs handle complex reasoning tasks in code and mathematics. Following this line of research, X-of-Thought (XoT) reasoning, i.e., CoT and its structural variants Chai et al. (2024b); Yao et al. (2023); Li et al. (2023a); Lei et al. (2023); Guo et al. (2023); Ji et al. (2024); Guo et al. (2024b), further expands the capabilities and applications of LLMs in complex reasoning and planning scenarios.

Intermediate Representation.

In the field of natural language processing, many works use intermediate representations Gan et al. (2021); Yang et al. (2022, 2024, 2019, 2020b, 2020a); Liang et al. (2024) for tasks such as text generation and translation. In our work, the universal code is used as the intermediate representation, which typically omits details that are essential for the machine implementation of the algorithm. We follow a coarse-to-fine pattern for code generation and translation, where the universal code first summarizes the algorithm process and the programming language then gives the exact solution. UniCode provides explicit guidance for code generation, much as chain-of-thought does for LLMs.

7 Conclusion

In this work, we put forth a state-of-the-art framework UniCoder for both code translation and code generation. Using the universal code UniCode as the intermediate representation, we effectively bridge different programming languages and facilitate code tasks. In addition, we collect a dataset UniCoder-Instruct with 140K instruction instances from existing instruction datasets and the raw code snippets. After being fine-tuned on UniCoder-Instruct with multi-task learning objectives, our model generates UniCode and translates it into the final answer (executable code). The evaluation results on code translation and generation tasks demonstrate that our method significantly improves the generalization ability, showing the efficacy and superiority of UniCoder.


We acknowledge the following limitations of this study: (1) The evaluation focuses on benchmark datasets (HumanEval, MBPP, and MultiPL-E), and the model’s effectiveness in real-world programming scenarios or industry applications is not fully explored. (2) Our method is developed and evaluated primarily on programming language benchmarks. Its effectiveness in other domains or for non-programming-related tasks is not assessed, which limits the generalizability of our findings.


This work was supported in part by the National Natural Science Foundation of China (Grant Nos. U1636211, U2333205, 61672081, 62302025, 62276017), a fund project: State Grid Co., Ltd. Technology R&D Project (Project Name: Research on Key Technologies of Data Scenario-based Security Governance and Emergency Blocking in Power Monitoring System, Project No.: 5108-202303439A-3-2-ZN), the 2022 CCF-NSFOCUS Kun-Peng Scientific Research Fund and the Opening Project of Shanghai Trusted Industrial Control Platform and the State Key Laboratory of Complex & Critical Software Environment (Grant No. SKLSDE-2021ZX-18).


