API Pack: A Massive Multilingual Dataset for API Call Generation

Zhen Guo    Adriana Meza Soria    Wei Sun    Yikang Shen    Rameswar Panda
Abstract

We introduce API Pack, a multilingual dataset featuring over one million instruction-API call pairs aimed at advancing large language models’ API call generation capabilities. Through experiments, we demonstrate API Pack’s efficacy in enhancing models for this specialized task while maintaining their overall proficiency at general coding. Fine-tuning CodeLlama-13B on just 20,000 Python instances yields over 10% and 5% higher accuracy than GPT-3.5 and GPT-4 respectively in generating unseen API calls. Scaling to 100k examples improves generalization to new APIs not seen during training. In addition, cross-lingual API call generation is achieved without needing extensive data per language. The dataset, fine-tuned models, and overall code base are publicly available at https://github.com/zguo0525/API-Pack.

Machine Learning, ICML

1 Introduction

Large language models (LLMs) have shown promise in assisting with software engineering tasks (Hou et al., 2023; Ebert & Louridas, 2023), with a primary focus on code generation (Wang et al., 2023b; Liang et al., 2023; Zan et al., 2023; Shrivastava et al., 2023; Wei et al., 2023; Muennighoff et al., 2023). Our work builds on these advances but targets the time-consuming task developers often face: finding API call code examples. Currently, developers typically look for examples on documentation sites or API hubs (Sadowski et al., 2015), sifting through lengthy pages to locate relevant information (Meng et al., 2018). This process is recognized as cumbersome and inefficient. Our research aims to transform this workflow by exploring LLMs’ capabilities to identify appropriate API endpoints and generate the corresponding API calls from natural language prompts.

To achieve our research goal, we created API Pack, a dataset designed to advance LLMs’ API call generation capabilities. This multilingual dataset, comprising over one million instances across 10 programming languages, is by far the largest open-source instruction dataset (see Table 1) for API call generation and API call intent detection (identifying the appropriate API endpoint to solve a task based on a natural language prompt). We evaluated the impact of API Pack on API call generation in several ways (see Section 4). One of our fine-tuning experiments shows that CodeLlama-13B fine-tuned with only 20,000 instances of API Pack (all in Python) surpasses GPT-3.5 and GPT-4 at generating calls for unseen APIs.

API Pack distinguishes itself from prior work (Xu et al., 2023b; Patil et al., 2023; Qin et al., 2023) in two ways: scale and multilinguality. With over a million examples spanning more real-world APIs and use cases than its predecessors, API Pack makes it possible to evaluate generalization rigorously by controlling the volume of training data. In addition, API Pack includes API calls in 10 different programming languages (see Table 1). This allows a rare assessment of cross-lingual skill transfer, that is, how gains in one language apply to others. Such transfer is vital but underexplored in previous studies.

Feature | API Pack (this work) | APIBench (Gorilla) | ToolBench | ToolBench (ToolLLM) | API Bank | ToolAlpaca | ToolFormer
API call intent detection? | ✓(*)
API call code generation?
Multi-lingual API calls?
Multi-API call scenario?
Data generation method | custom | self-instruct | self-instruct | custom | custom | custom | custom
# of Sources | 4 | 3 | 8 | 1 | 53 | / | 5
# of APIs / Tools | 11,213 | 1,645 | 8 | 16,464 | 53 | 400 | 5
# of API calls | 1,128,599 | 16,450 | / | 37,204 | 568 | 3,938 | 9,400
# of Instances | 1,128,599 | 16,450 | 2,746 | 12,657 | 264 | 3,938 | 22,453
Table 1: A comparison of API Pack with other instruction tuning datasets for API intent detection and/or API call code generation. The upper section of the table reports the features that each dataset covers, and the bottom section reports the data statistics available (/ means unavailable). The symbol * indicates the existence of a constraint (i.e., API Pack instructions include the API name desired).

We summarize three highlights from our experiments:

  • Fine-tuning CodeLlama-13B on 20,000 Python instances yields over 10% and 5% higher accuracy than GPT-3.5 and GPT-4 respectively for unseen API calls.

  • Scaling the instruction data to 100,000 Python instances improves generalization to new APIs, confirming the benefits of a larger dataset.

  • Cross-lingual API call generation can be achieved by fine-tuning models on a large amount of data in one language plus small amounts of data from other languages. A large amount of data for each target programming language is not required.

We also evaluated the performance of API Pack combined with other instruction datasets (Magicoder-OSS-Instruct-75K and Magicoder-Evol-Instruct-110K (Wei et al., 2023)). Our results show that API Pack improves API call generation without affecting the code generation performance on HumanEval+ (Liu et al., 2023a) and MBPP (Austin et al., 2021) benchmarks.

The paper is organized as follows: Section 2 reviews related work. Section 3 describes the construction of API Pack, our novel multilingual dataset. Section 4 outlines the experimental design for fine-tuning models on API Pack and evaluating performance. The key findings are presented in Section 5. We conclude the paper in Section 6, address limitations of the study in Section 7, and outline future work plans in Section 8.

2 Related Work

2.1 Methods to Generate Instruction Data with LLMs

As manually crafting instruction data is a labor-intensive process (Xu et al., 2023a), a growing body of work has proposed using LLMs as a means to reduce the human effort required to create instruction datasets. On the one hand, LLMs have made it feasible to produce large amounts of instruction data automatically. On the other hand, generating synthetic data with LLMs increases the risk of introducing incorrect data, or instances that lack complexity or are insufficiently diverse. Researchers have worked on these concerns and proposed different methods to create and filter LLM-generated data. Two widely adopted methods are Self-Instruct (Wang et al., 2023a) and Evol-Instruct (Xu et al., 2023a). The latter, in particular, addresses the challenge of creating open-domain instructions.

Two phases, instance generation and filtering, are essential in creating synthetic instruction data with Self-Instruct (Wang et al., 2023a) and Evol-Instruct (Xu et al., 2023a). In Self-Instruct, a small pool of instances is created, and an LLM mutates selected instances with in-context examples. New instances are added back to the pool only if they pass filtering. Self-Instruct filters instances based on ROUGE-L similarity and other heuristics. Evol-Instruct generates new instances by prompting an LLM with prompts that each have a specific target. This method filters instances by classifying instructions based on a set of heuristics. For both methods, the heuristics for instance filtering are determined by manual analysis of the data. Even though Self-Instruct (Wang et al., 2023a) and Evol-Instruct (Xu et al., 2023a) have become a standard way to create instruction data, custom versions of these methods are also employed to create instruction datasets for specific domains.

Beyond data filtering, a quality check is still critical before synthetic data is used to fine-tune models. A simple but time-consuming way to perform this review is to ask a human expert annotator to determine the correctness of each generated instance (Wang et al., 2023a). An alternative is to prompt a strong LLM (e.g., ChatGPT) to either score or classify instructions (Liu et al., 2023b) based on heuristics determined by manually reviewing a small sample of the generated data. This scoring approach has also been used to determine instruction complexity (Chen et al., 2023; Lu et al., 2023).

2.2 LLMs for API call Code Generation and Intent Detection

A growing body of research explores the integration of Large Language Models (LLMs) and Application Programming Interfaces (APIs). Part of this work focuses on API call intent detection: accurately identifying the appropriate API endpoint to solve a task stated in natural language. Other work centers on creating LLMs that generate code to invoke API functionality (e.g., Patil et al., 2023; Xu et al., 2023b). This second approach, which our work also contributes to, targets developers as end users and seeks to improve their experience in finding API call code examples.

API call intent-detection LLMs (e.g., Qin et al., 2023; Li et al., 2023a; Tang et al., 2023; Yang et al., 2023; Schick et al., 2023) typically work as part of hybrid architectures in which LLMs find the appropriate API endpoint(s) to use, and other software components (e.g., retrievers, API code DBs, code generation libraries) generate the code to call them. In this hybrid approach, API calls are executed internally to return a final answer to the user. API call intent-detection studies have explored single and multi API-intent scenarios. For the latter, only a few studies have obtained good results (e.g., Qin et al., 2023).

API call code generation has received much less attention, as only a few studies have explored it. One of them is the Gorilla project (Patil et al., 2023), which created an LLM to generate API calls for loading pre-trained machine learning models from three well-known model hubs (Torch Hub, TensorFlow Hub v2, HuggingFace). Another important work is the ToolBench dataset (Xu et al., 2023b), which serves as a benchmark to enhance the tool manipulation capabilities of open-source LLMs.

2.3 Datasets for Code-related Tasks

Because only closed-source models (e.g., Code-Davinci002, Google’s Bard) performed well on popular code benchmarks (e.g., HumanEval (Chen et al., 2021), HumanEval+ (Liu et al., 2023a), MBPP (Austin et al., 2021)), the open-source community created open-source instruction datasets for code-related tasks in an attempt to bring the performance of open-source Code LLMs level with that of their closed-source counterparts. The WizardCoder training set (Luo et al., 2023) is one of these attempts. This dataset was constructed by adapting the Evol-Instruct (Xu et al., 2023a) method to the realm of code, and it was used to fine-tune the StarCoder model (Li et al., 2023b). The resulting model, WizardCoder, achieved performance very close to that of GPT-4 and GPT-3.5. In a similar vein, CommitPack (Muennighoff et al., 2023), an instruction dataset that comprises four terabytes of Git commits across 350 programming languages, was also used to fine-tune StarCoder. The resulting model achieved good performance on the HumanEval Python benchmark. Another important contribution motivated by the same goal is OSS-INSTRUCT (Wei et al., 2023), an approach to generate more diverse, realistic, and controllable coding instruction data. OSS-INSTRUCT starts from a seed of open-source code snippets (outputs); a teacher model is then tasked with creating coding problems for these snippets (inputs). This instruction dataset was used to fine-tune the Magicoder family of models, which surpassed ChatGPT on the HumanEval+ (Liu et al., 2023a) benchmark.

3 API Pack

API Pack is an instruction dataset with more than one million instances. In API Pack, an instance contains an input-output pair plus additional information about the API and its endpoints. Inputs are instructions to find an API call that solves a coding task. They include a task description written in software engineering language and the name of the API to be used. Outputs are API call examples, specifically HTTP request code snippets curated from OpenAPI specification (OAS) files.

API Pack curates data from four hubs that store OAS files: RapidAPI (https://rapidapi.com/categories), APIGurus (https://apis.guru/), the Swaggerhub (https://app.swaggerhub.com/search), and a company’s public API Hub (https://{anonymous_url}). Table 2 presents the total number of APIs, unique endpoints, and total instances (in different programming languages) that API Pack contains.

Table 2: Final count of data curated per source; each instance has one API call.
Source | APIs | Unique Endpoints | Total Instances
Company API Hub | 73 | 2,884 | 17,206
APIs Gurus | 1,980 | 37,097 | 495,533
Swaggerhub | 5,045 | 26,747 | 345,765
RapidAPI | 4,115 | 21,525 | 270,095
Total | 11,213 | 88,253 | 1,128,599

The construction process of API Pack comprises four major stages: data pre-processing (3.1), API Database (DB) creation (3.2), instructions generation (3.3), and data validation (3.4). Figure 1 shows the overall pipeline.

Figure 1: Dataset curation pipeline.

3.1 Data Pre-processing

First, from all the OAS files collected, we filtered out files with non-English data or zero endpoints. Second, we extracted information about the endpoints that each OAS file contains: name, functionality, description, method, and path. We also extracted information at the API level. Specifically, we collected the API name, the API description, and the API provider. This information is the same for all the endpoints present in an OAS file. While OAS files provide a standard structure to document an API’s design, not all information fields are mandatory. Thus, we applied a second filter to remove instances in which data crucial for generating API calls was empty (e.g., the method, or either the path or the endpoint_name). We also filtered out instances where the functionality, description, and endpoint_name were all empty, as at least one of these values must exist to generate instructions.
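The extraction and filtering steps above can be sketched as follows. This is a minimal illustration, not the pipeline used to build API Pack: the mapping from OAS fields to the names used in this paper (operationId to endpoint_name, summary to functionality, info.title to the API name) is an assumption, the API provider is omitted because the OAS standard has no dedicated field for it, and the reading of the second filter (a method plus at least one of path or endpoint_name) is one possible interpretation.

```python
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options", "trace"}


def extract_endpoints(oas: dict) -> list:
    """Flatten a parsed OAS document into one record per (path, method) pair."""
    info = oas.get("info", {})
    records = []
    for path, item in oas.get("paths", {}).items():
        for method, op in item.items():
            # Skip non-operation keys of a path item, such as "parameters".
            if method.lower() not in HTTP_METHODS or not isinstance(op, dict):
                continue
            records.append({
                "api_name": info.get("title", ""),               # assumed mapping
                "api_description": info.get("description", ""),
                "endpoint_name": op.get("operationId", ""),      # assumed mapping
                "functionality": op.get("summary", ""),          # assumed mapping
                "description": op.get("description", ""),
                "method": method.lower(),
                "path": path,
            })
    return records


def keep(record: dict) -> bool:
    """Second filter: keep records that can yield both an API call and an instruction."""
    has_call_data = bool(record["method"]) and bool(record["path"] or record["endpoint_name"])
    has_text = any(record[k] for k in ("functionality", "description", "endpoint_name"))
    return has_call_data and has_text
```

Applying `keep` to the output of `extract_endpoints` drops endpoints that have a path and method but no name, summary, or description, since no instruction could be written for them.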

3.2 Create API DBs

After extracting relevant data from the OAS files, we built an API DB with the data extracted. API DBs contain an array of independent instances in JSON format. Each instance contains all the information relevant to an endpoint (endpoint_name, functionality, description, path, method), and the API each endpoint belongs to (API name, API description, API provider). This information was extracted directly from the raw OAS files. Each instance also contains an API call example in a given programming language, and a string that identifies the programming language (e.g., cURL, python, java). We used openapi-snippet (https://www.npmjs.com/package/openapi-snippet) to generate API calls (api_call) in 10 different programming languages (cURL, libcurl, java, node, python, go, ruby, php, swift, JavaScript) for the endpoints of three data sources. For one data source we directly extracted API calls from the OAS files. Appendix F shows the structure of an API DB instance and provides details on programming language diversity for each source.
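To make the shape of an API DB instance concrete, the sketch below models one as a Python dataclass and serializes it to JSON. The endpoint-level field names and `api_call` come from the text; the API-level key names, the `lang` key, and all example values (including the `example.com` URL) are hypothetical, and the authoritative structure is the one given in Appendix F.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ApiDbInstance:
    # Endpoint-level fields named in the text.
    endpoint_name: str
    functionality: str
    description: str
    path: str
    method: str
    # API-level fields; these key names are assumptions.
    api_name: str
    api_description: str
    api_provider: str
    # Generated call and the string identifying its language.
    api_call: str
    lang: str  # hypothetical key name


# Hypothetical instance with placeholder values.
instance = ApiDbInstance(
    endpoint_name="listPets",
    functionality="List pets",
    description="Returns all pets in the store.",
    path="/pets",
    method="get",
    api_name="Demo Pet Store",
    api_description="A toy API used for illustration.",
    api_provider="example.com",
    api_call="curl --request GET --url https://example.com/pets",
    lang="cURL",
)

# An API DB is an array of such instances in JSON format.
api_db = json.dumps([asdict(instance)], indent=2)
```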

Figure 2 shows an example of an API call in cURL. Note that this example does not include real argument values. Instead, a string (e.g., REPLACE_BASIC_AUTH, [“string”], “string”) serves as a placeholder to indicate the need for an argument. This is the standard way API documentation libraries integrate API call code examples into documentation (e.g., swagger-codegen (https://github.com/swagger-api/swagger-codegen), Sphinx documentation (https://www.sphinx-doc.org/en/master/), DocFX (https://github.com/dotnet/docfx)). Developers usually copy these examples and replace the placeholder strings to test an API call code snippet.

3.3 Instructions Generation

The process to generate instructions comprises two steps: 1) creating high-quality instruction examples, and 2) generating instruction candidates.

We created instruction examples as follows. First, we randomly selected three endpoints from each API DB file. Then, we created an instruction example for each selected endpoint by bootstrapping its information (e.g., functionality, description, endpoint_name, path) and the respective API name into different templates. Through this method, we obtained three instruction examples for each API DB. Note that an LLM was not prompted to generate the initial version of these examples. We refined the instruction examples in two ways. For API Gurus and the company’s API Hub, three people (all authors of this paper) corrected grammatical errors, removed unnecessary information (e.g., URLs in the instruction), and verified that the API name was part of every instruction example. After manually reviewing the examples for these two sources, we noticed that the errors found were repetitive. Therefore, for the other two sources, the Swaggerhub and RapidAPI, we replaced human labor with a Large Language Model (LLM), Mistral-7B-Instruct-v0.2 (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). The prompt used to refine these instruction examples (the refinement prompt in Appendix H) was distilled from the error patterns that the researchers identified while correcting instruction examples for API Gurus and the company’s API Hub.

We used these high-quality instruction examples to generate instruction candidates for the instances in each API DB. We prompted an LLM (Mistral-7B-Instruct-v0.2) to generate instructions by providing the endpoint’s information as input and the high-quality instructions created for each API DB (see previous paragraph) as in-context examples (see the generation prompt in Appendix H). For each API DB, we generated five instruction candidates per instance. Figure 2 shows one of the instruction candidates generated (Appendix G shows all the candidates generated for the same instance).

{
  "instruction_candidates": [
    {
      "idx": 1,
      "candidate": "I'd like to confirm if 3D Secure is supported for a specific card brand or BIN using the Adyen BinLookup API. For 3D Secure 2, it will also provide device fingerprinting keys.",
      "input_tokens_mean": -0.5497539341557909
    },
    ...
  ],
  "api_call": "curl --request POST \ --url https://paltest.adyen.com/pal/servlet/BinLookup/v40/get3dsAvailability \ --header 'Authorization: Basic REPLACE_BASIC_AUTH' \ --header 'content-type: application/json' \ --data '{"additionalData":{"property1":"string","property2":"string"},
    "brands":["string"],
    "cardNumber":"string",
    "merchantAccount":"string",
    "recurringDetailReference":"string",
    "shopperReference":"string"}'"
  ...
}
Figure 2: Fragment of an API Pack instance (this example shows an API call in cURL and one of the five instruction candidates generated).
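The two steps of Section 3.3 can be sketched as below. This is only an illustration under stated assumptions: the templates and the prompt wording are hypothetical (the paper does not list the templates, and the actual generation prompt is the one in Appendix H), the manual or LLM-based refinement of the seed examples is not modeled, and the call to the LLM itself is omitted.

```python
import random

# Hypothetical templates; the ones actually used are not listed in the paper.
TEMPLATES = [
    "Using the {api_name} API, I want to do the following: {functionality}",
    "Please help me with this task on the {api_name} API: {functionality}",
    "With the {api_name} API, find the call for this task: {functionality}",
]


def make_seed_examples(endpoints, api_name, k=3, seed=0):
    """Step 1: pick k random endpoints and fill a template for each one."""
    rng = random.Random(seed)
    chosen = rng.sample(endpoints, k=min(k, len(endpoints)))
    return [
        rng.choice(TEMPLATES).format(api_name=api_name, functionality=e["functionality"])
        for e in chosen
    ]


def build_generation_prompt(endpoint, api_name, examples):
    """Step 2: combine endpoint information with the seed examples as in-context shots."""
    shots = "\n".join(f"- {ex}" for ex in examples)
    return (
        f"Example instructions for the {api_name} API:\n{shots}\n\n"
        f"Endpoint name: {endpoint['endpoint_name']}\n"
        f"Functionality: {endpoint['functionality']}\n"
        f"Description: {endpoint['description']}\n"
        "Write one instruction for this endpoint and mention the API name."
    )
```

In the pipeline described above, a prompt of this kind would be sent to the LLM five times per instance to obtain five instruction candidates.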

3.4 Data Validation

The data validation process comprises three steps: 1) verifying if API calls are valid HTTP request examples in a given programming language, 2) verifying if instructions are of high-quality, and 3) selecting the instruction with the best quality for training.

In order to verify that API calls were valid HTTP request examples, we first compared the API call’s content with the instance data to check the correctness of the endpoint_name and the method being used (e.g., get, post, put, delete, custom method name). Then, we used regular expressions to validate the url format. As these API calls are code examples, we considered that placeholder strings may exist for the url domain, path parameters, or query parameters. We also checked that the programming language keywords in the API call correspond to the language string id assigned to each instance.
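A minimal sketch of these checks is shown below. The regular expression, the per-language keyword hints, and the `lang` key are assumptions made for illustration; the paper does not specify the patterns it used. The URL pattern deliberately accepts placeholder text in the host, path, and query.

```python
import re

# Loose URL pattern: scheme, a host (possibly a placeholder), and an optional path/query.
URL_RE = re.compile(r"https?://[^\s/]+(?:/\S*)?")

# Hypothetical keyword hints per language identifier.
LANG_HINTS = {
    "curl": ("curl ",),
    "python": ("import ", "requests."),
    "java": ("HttpResponse", "OkHttpClient"),
}


def validate_api_call(instance: dict) -> bool:
    """Check the HTTP method, the URL format, and the language keywords of an API call."""
    call = instance["api_call"]
    # The method must appear as a whole word (case-insensitive).
    method_ok = re.search(
        rf"\b{re.escape(instance['method'])}\b", call, re.IGNORECASE
    ) is not None
    url_ok = URL_RE.search(call) is not None
    # Unknown languages have no hints and therefore fail this check.
    hints = LANG_HINTS.get(instance["lang"].lower(), ())
    lang_ok = any(h in call for h in hints)
    return method_ok and url_ok and lang_ok
```

Matching the method as a whole word avoids false positives such as the substring "get" inside an endpoint name like get3dsAvailability.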

To validate the set of instructions generated for each API call, we first selected a random sample of 121 instances, each containing one API call and five instructions. Each of the 121 instances has a unique functionality, description, and endpoint_name. One of the authors labeled the instructions for these 121 instances (605 instructions in total) as good or bad. By analyzing the instructions labeled as bad, we noticed that they matched at least one of the following three characteristics:

  • Contain multiple instructions instead of a single one.

  • Include unnecessary text before or after the main instruction (e.g., “User query”, “query”, “Instruction:”).

  • Fail to use the correct API name.

We created three prompts to automatically label instructions according to these characteristics. All the prompts included a fixed set of in-context examples. To select a champion prompt, we used an LLM (Mistral-7B-Instruct-v0.2) to label the 605 instructions that one of the authors had manually classified as good or bad. We repeated this process with each prompt and compared the results against our human-crafted oracle. We used the prompt that showed the best results (the scoring prompt in Appendix H) to task Mistral-7B-Instruct-v0.2 with classifying all the instructions in our dataset as good or bad. We removed instances with fewer than two good instructions from our dataset.
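The comparison against the human oracle can be sketched as follows. The paper does not name the metric behind "best results"; plain agreement (accuracy) with the human labels is assumed here, and the prompt names are hypothetical.

```python
def agreement(human, predicted):
    """Fraction of instructions on which the LLM label matches the human label."""
    if len(human) != len(predicted):
        raise ValueError("label lists must have the same length")
    return sum(h == p for h, p in zip(human, predicted)) / len(human)


def pick_champion(human, predictions_by_prompt):
    """Return the name of the prompt whose labels agree most with the human oracle."""
    return max(
        predictions_by_prompt,
        key=lambda name: agreement(human, predictions_by_prompt[name]),
    )


def has_enough_good(labels, min_good=2):
    """Instance-level filter: keep instances with at least `min_good` good instructions."""
    return sum(label == "good" for label in labels) >= min_good
```

With 605 manually labeled instructions, `human` would hold 605 labels and each entry of `predictions_by_prompt` the labels produced by the LLM under one of the three prompts.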

To select the best candidate for training from the (sub)set of good instructions, we calculated the likelihood that an LLM would recreate the input text used to generate each instruction. We prompted an LLM, Mistral-7B-Instruct-v0.2, to accomplish this task (see the back-translation prompt in Appendix H); the text used as input was each instruction candidate itself. The LLM returned the log probability of each token of the re-generated input text. We calculated the mean of these log probabilities (input_tokens_mean) and linked this metric to the respective instruction candidate. Appendix G shows five instruction candidates and their respective input_tokens_mean. For training, we selected the instruction labeled as good with the best input_tokens_mean (the value closest to zero).
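As a worked illustration of this selection rule, the sketch below averages per-token log probabilities and keeps the good candidate whose mean is closest to zero. The dictionary keys `label` and `token_logprobs` are hypothetical names; only `candidate` and `input_tokens_mean` appear in the paper.

```python
from statistics import fmean


def input_tokens_mean(token_logprobs):
    """Mean of the per-token log probabilities returned by the LLM."""
    return fmean(token_logprobs)


def select_instruction(candidates, min_good=2):
    """Return the good candidate with the highest mean log probability.

    Log probabilities are at most zero, so the highest mean is the one
    closest to zero. Returns None when the instance has fewer than
    `min_good` good instructions and would be removed from the dataset.
    """
    good = [c for c in candidates if c["label"] == "good"]
    if len(good) < min_good:
        return None
    best = max(good, key=lambda c: input_tokens_mean(c["token_logprobs"]))
    return best["candidate"]
```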

Our final dataset contains 1,128,599 instances, each with a valid API call example and at least two high-quality instructions. Appendix A shows the data instances filtered out at each stage of the pipeline.

4 Experiments and Evaluation Framework

In this section, we introduce our experimental settings and evaluation framework.

4.1 Preamble

In order to optimize the instruction-following capabilities of the language models fine-tuned with our dataset, we post-processed API Pack into two instruction-tuning templates, which we refer to as zero-shot and three-shot templates. The first template (zero-shot) targets the scenario where the output is expected to be a straightforward inference from the given input, which highlights the model’s direct instruction-following capability. The second template (three-shot) emphasizes the model’s ability to learn and generate output with in-context learning. We provide the mathematical representation for each template below:

Zero-shot: this template models a direct probabilistic relationship between input ($x$) and output ($y$). The formula is expressed as:

$$p(y \mid x) = f(x)_y \qquad (1)$$

In this formula, $f$ is the language model mapping of an input instruction to its corresponding output.

Three-shot: this template extends the model’s capacity by incorporating contextual pairs $(x_i, y_i)$. It is represented by the equation:

$$p\left(y \mid x, \{(x_i, y_i)\}_{i=1}^{3}\right) = f\left(x, \{(x_i, y_i)\}_{i=1}^{3}\right)_y \qquad (2)$$

In this formulation, $f$ considers the primary input $x$ as well as the context from three additional pairs $(x_i, y_i)$. The three-shot template is provided in the appendix.
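The difference between the two templates can be sketched as below: the zero-shot prompt conditions only on $x$, whereas the three-shot prompt prepends three solved pairs $(x_i, y_i)$. The "### Instruction / ### Response" layout is a hypothetical stand-in; the template actually used is the one given in the appendix.

```python
def zero_shot_prompt(instruction: str) -> str:
    """Prompt that conditions only on the input x (Equation 1)."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def three_shot_prompt(instruction: str, shots) -> str:
    """Prompt that conditions on x plus three (x_i, y_i) pairs (Equation 2)."""
    if len(shots) != 3:
        raise ValueError("the three-shot template expects exactly three pairs")
    blocks = [
        f"### Instruction:\n{x_i}\n\n### Response:\n{y_i}\n" for x_i, y_i in shots
    ]
    return "\n".join(blocks) + "\n" + zero_shot_prompt(instruction)
```

In both cases the model is trained to continue the prompt with the target API call $y$.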

4.2 Experimental Settings

A. Selecting the baseline: our first experimental setting serves to select a base model for the rest of our experiments. We fine-tune Mistral-7b (Jiang et al., 2023), CodeLlama-7b and CodeLlama-13b, as well as Llama-2-13b (Touvron et al., 2023) on a subset of API Pack (20,000 instances in Python) and evaluate the performance of each resulting fine-tuned model. The base model whose fine-tuned version performs best is used in the rest of our experiments.

B. Inference with retrieval: our second experiment has the goal of understanding the influence of retrieval augmentation on model generalization. Thus, we evaluate the models under four distinct prompt settings during the test time:

  • 0-shot: no API examples provided for the model.

  • 3-shot random: 3 randomly selected API examples.

  • 3-shot retrieved: 3 retrieved relevant API examples.

  • 3-shot retrieved & re-ranked: 5 retrieved API examples, selecting 3 out of 5 using a re-ranker model.

Note that these prompt settings are for testing/inference and differ from the instruction-tuning templates used in fine-tuning. Here, we used bge-large-en-v1.5 (Zhang et al., 2023) as the embedding model for retrieval, and bge-reranker-large (Xiao et al., 2023) for re-ranking. Appendix C illustrates the inference pipelines (0-shot, 3-shot) we used to evaluate the performance of models.
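The "3-shot retrieved" and "3-shot retrieved & re-ranked" settings can be sketched as follows. To keep the example self-contained, a bag-of-words vector stands in for the bge-large-en-v1.5 embeddings and a Jaccard overlap stands in for the bge-reranker-large scores; both substitutions are assumptions for illustration only, and only the control flow (retrieve k, optionally re-rank down to 3) mirrors the settings above.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a dense embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def jaccard(query, text):
    """Toy pairwise score; a stand-in for a cross-encoder re-ranker."""
    qa, ta = set(query.lower().split()), set(text.lower().split())
    return len(qa & ta) / len(qa | ta) if qa | ta else 0.0


def retrieve(query, pool, k):
    """Return the k pool examples whose instruction is most similar to the query."""
    q = embed(query)
    return sorted(pool, key=lambda ex: cosine(q, embed(ex["instruction"])), reverse=True)[:k]


def rerank(query, candidates, k, score=jaccard):
    """Re-score retrieved candidates and keep the top k."""
    return sorted(candidates, key=lambda ex: score(query, ex["instruction"]), reverse=True)[:k]


def three_shot_retrieved(query, pool):
    return retrieve(query, pool, k=3)


def three_shot_retrieved_reranked(query, pool):
    return rerank(query, retrieve(query, pool, k=5), k=3)
```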

C. Scaling experiment: we conduct a scaling experiment to investigate whether more API data improves a model’s generalization ability for unseen API data. For this experiment we fine-tune models on progressively larger API datasets, all with unique API calls in Python. Specifically, we fine-tuned models with 20k, 40k, 80k, and 100k instances respectively. Our hypothesis is that exposure to a greater diversity of APIs during fine-tuning will improve the model’s ability to generalize to new, unseen APIs.

D. Cross-lingual generalization: to test the model’s ability to generalize to new programming languages, we supplement a cURL dataset of 100,000 instances with 1,000 instances from each of the nine additional languages in API Pack: Go, Java, JavaScript, libcurl, Node.js, PHP, Python, Ruby, and Swift. The goal is to see whether a model generalizes to new languages without requiring a large amount of multi-lingual data.

E. Integration of API Pack with Magicoder: this experiment investigates the effect of combining API Pack with existing instruction datasets. We integrate a subset of 50,000 entries from API Pack into the Magicoder datasets (Magicoder-OSS-Instruct plus Magicoder-Evol-Instruct) and fine-tune the CodeLlama-13b model. The focus is on assessing improvements in API call code generation, especially for Level 3 tasks under 3-shot prompting.
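The data mixture of experiment D can be sketched as below: a large sample in one base language plus a small sample from every other language. Uniform random sampling, the shuffling step, and the dictionary layout are assumptions; the paper specifies only the sizes (100,000 cURL instances and 1,000 per additional language).

```python
import random


def build_mixture(instances_by_lang, base_lang="curl", base_n=100_000, extra_n=1_000, seed=0):
    """Sample base_n instances of the base language and extra_n of each other language."""
    rng = random.Random(seed)
    mixture = []
    for lang, items in instances_by_lang.items():
        n = base_n if lang == base_lang else extra_n
        # Cap at the number of available instances for small sources.
        mixture.extend(rng.sample(items, k=min(n, len(items))))
    rng.shuffle(mixture)
    return mixture
```

With nine additional languages, the default sizes give 100,000 + 9 × 1,000 = 109,000 training instances, provided each language has enough data.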

4.3 Evaluations

To measure the generalization capabilities enabled by API Pack, we establish a comprehensive evaluation framework spanning three levels of complexity for API call generation.

  • Level 1 (seen APIs and endpoints) assesses generalization to new instructions within familiar APIs.

  • Level 2 (seen APIs, new endpoints) tests generalization to new endpoints of known APIs.

  • Level 3 (unseen APIs and endpoints) validates performance on entirely new APIs.
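The three levels can be read as a simple rule over what the model saw during fine-tuning. The sketch below assigns a level to a test instance; identifying an endpoint by the pair (api_name, endpoint_name) is an assumption made for illustration.

```python
def evaluation_level(test_instance, train_apis, train_endpoints):
    """Return 1, 2, or 3 depending on whether the API and endpoint were seen in training.

    train_apis: set of API names seen during fine-tuning.
    train_endpoints: set of (api_name, endpoint_name) pairs seen during fine-tuning.
    """
    api = test_instance["api_name"]
    endpoint = (api, test_instance["endpoint_name"])
    if api not in train_apis:
        return 3  # unseen API, hence unseen endpoint
    if endpoint not in train_endpoints:
        return 2  # known API, new endpoint
    return 1      # known API and endpoint, new instruction
```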

Endpoint and API call accuracy at each level is measured by SequenceMatcher, which identifies the longest matching subsequences while excluding insignificant elements. A threshold of 0.9 is applied to compare the generated output with the ground truth Endpoint and API call.
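A minimal sketch of this metric, assuming it refers to `SequenceMatcher` from Python's standard `difflib` module, is given below. Two details are assumptions: no junk function is passed (the paper does not say which elements are treated as insignificant), and an output counts as correct when its similarity ratio is greater than or equal to the threshold.

```python
from difflib import SequenceMatcher


def similarity(generated: str, reference: str) -> float:
    """Similarity ratio in [0, 1]: 2 * matches / total length of both strings."""
    return SequenceMatcher(None, generated, reference).ratio()


def is_correct(generated: str, reference: str, threshold: float = 0.9) -> bool:
    """Count a generated endpoint or API call as correct at the given threshold."""
    return similarity(generated, reference) >= threshold


def accuracy(pairs, threshold: float = 0.9) -> float:
    """Percentage of (generated, reference) pairs that pass the threshold."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return 100.0 * sum(is_correct(g, r, threshold) for g, r in pairs) / len(pairs)
```

Under this rule a near-identical output, such as a call that differs from the reference only by a trailing slash, can still count as correct, while a call to a different endpoint does not.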

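Table 3: Evaluation of models fine-tuned on the 20k Python API dataset. A comparison with (non-fine-tuned) GPT-3.5 and GPT-4 on Level 3 is also included. All values are evaluation accuracy (%); “retre” denotes retrieved examples.
Model | Fine-tuning template | Testing | Level 1 Intent | Level 1 API Call | Level 2 Intent | Level 2 API Call | Level 3 Intent | Level 3 API Call
Mistral-7b | zero-shot | 0-shot | 17.2 | 10.9 | 14.1 | 11.4 | 14.3 | 11.2
Mistral-7b | zero-shot | 3-shot (retre) | 42.0 | 29.7 | 35.4 | 28.7 | 39.1 | 29.1
Mistral-7b | three-shot | 0-shot | 40.5 | 28.5 | 24.0 | 18.3 | 15.2 | 12.1
Mistral-7b | three-shot | 3-shot (retre) | 64.1 | 55.4 | 49.1 | 42.8 | 50.8 | 42.5
CodeLlama-7b | zero-shot | 0-shot | 8.1 | 6.1 | 10.0 | 7.0 | 11.0 | 7.8
CodeLlama-7b | zero-shot | 3-shot (retre) | 52.6 | 42.6 | 43.6 | 35.9 | 50.2 | 40.1
CodeLlama-7b | three-shot | 0-shot | 12.1 | 9.3 | 13.7 | 10.2 | 16.8 | 13.0
CodeLlama-7b | three-shot | 3-shot (retre) | 60.6 | 52.7 | 54.1 | 47.3 | 55.9 | 49.1
Llama-2-13b | zero-shot | 0-shot | 9.4 | 6.2 | 11.6 | 9.0 | 10.9 | 8.4
Llama-2-13b | zero-shot | 3-shot (retre) | 44.5 | 33.9 | 45.4 | 35.6 | 46.7 | 39.1
Llama-2-13b | three-shot | 0-shot | 15.7 | 10.2 | 14.0 | 11.2 | 11.7 | 9.6
Llama-2-13b | three-shot | 3-shot (retre) | 59.5 | 51.5 | 50.8 | 44.3 | 52.7 | 44.2
CodeLlama-13b | zero-shot | 0-shot | 9.8 | 6.8 | 10.8 | 8.1 | 12.1 | 8.5
CodeLlama-13b | zero-shot | 3-shot (retre) | 55.6 | 44.4 | 50.6 | 43.3 | 52.3 | 44.1
CodeLlama-13b | three-shot | 0-shot | 14.4 | 10.3 | 15.9 | 13.3 | 14.2 | 8.9
CodeLlama-13b | three-shot | 3-shot (retre) | 63.5 | 55.5 | 56.8 | 51.4 | 56.1 | 49.5
gpt-3.5-1106 | none | 3-shot (retre) | - | - | - | - | 47.2 | 39.5
gpt-4-1106 | none | 3-shot (retre) | - | - | - | - | 53.5 | 44.3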

5 Results

In this section, we present and briefly discuss the results of the experiments described in Section 4.

5.1 Fine-tuned CodeLlama Excels in API call Generation

Table 3 shows the evaluation results for the four models fine-tuned with 20,000 API Pack instances in Python. Overall, CodeLlama-13b fine-tuned with the three-shot template excels in API call code generation under the 3-shot retrieved setting, where it achieves the top API call accuracy at all three evaluation levels (55.5% in Level 1, 51.4% in Level 2, and 49.5% in Level 3). The models fine-tuned with the three-shot template also consistently perform better than those fine-tuned with the zero-shot template. This result suggests that bootstrapping data with the three-shot template is important to improve the model’s in-context learning abilities. Another key insight is the substantial improvement observed when prompting with 3-shot (retre) instead of 0-shot at testing time. This trend is consistent across all models and levels, indicating that providing models with relevant examples improves their accuracy in generating API calls. Finally, CodeLlama-13b fine-tuned with API Pack outperforms the non-fine-tuned GPT-3.5 and GPT-4 models on Level 3, with 49.5% API call accuracy against 39.5% and 44.3% respectively.

5.2 Retrieval Augmentation Improves API Call Generation

The results of fine-tuning the Mistral-7b and CodeLlama-13b models on the 20,000-instance Python API dataset with the three-shot template highlight three-shot retrieval as the best approach to improve API call generation accuracy. See Appendix D for further details.

Figure 3: Scaling instruction dataset on CodeLlama-13b with zero-shot and three-shot retre evaluation. The x-axis is the size of the fine-tuning data in Python. The y-axis is Endpoint or API call accuracy.

5.3 Scaling Instruction Dataset Helps Generalization

Fig. 3 clearly shows an upward trend in 0-shot performance that correlates with the dataset size (20k, 40k, 80k, and 100k Python API data). This improvement reflects the advantage of larger datasets in offering diverse examples, which is crucial in 0-shot prompting where the model relies exclusively on its pre-existing knowledge. For 3-shot, the graph also shows improvements as the dataset size increases, albeit less significantly than in the 0-shot case. This indicates that while additional fine-tuning data is beneficial, model pretraining is more instrumental for few-shot learning ability.

5.4 Multi-lingual Generalizations in API Call Performance

Figure 4: Different Models in Multi-Lingual API Call Performance: zero-shot vs. three-shot in cURL, Python, and Java.

Figure 4 compares three fine-tuning approaches for multi-lingual API call performance: (1) a cURL model, fine-tuned exclusively on 100,000 instances of ‘cURL’ data; (2) three expert models, fine-tuned separately on 100,000 samples of ‘cURL’, ‘Python’, and ‘Java’ data; and (3) a mixture model, fine-tuned on 100,000 instances of ‘cURL’ data plus 1,000 instances from each of nine other languages (9,000 additional instances in total; see Section 4.2).

We distilled three key observations from this graph. First, generalizing to new programming languages without prior exposure remains a challenge in zero-shot scenarios. Second, despite the zero-shot limitations, the models demonstrate better-than-expected performance in three-shot contexts, indicating some degree of in-context learning adaptability, even without specific language fine-tuning. Third, mixed-language fine-tuning improves performance in both zero-shot and three-shot scenarios, which suggests that even a small amount of fine-tuning data in various languages can contribute to the model’s overall linguistic versatility and effectiveness in API call tasks.

We also analyzed the models’ adaptability across ten programming languages. We include the results of this analysis in Appendix E.

5.5 Improving Code Models with API Pack

We also mixed a subset of 50,000 entries of API Pack with the Magicoder dataset and fine-tuned the CodeLlama-13b model on the combination. The resulting model shows an increase of 35.3 percentage points in Level 3 accuracy under the 3-shot setting (from 17.0% to 52.3%). This improvement does not come at the expense of general coding ability: relative to fine-tuning on Magicoder alone, HumanEval+ rises slightly (60.8 to 61.3) and MBPP drops by 2.1 points (66.4 to 64.3), so the model still performs well on both benchmarks. See Table 4 for further details.

Table 4: Evaluation for Code Generation with CodeLlama-13b
| Data Mixture | HumanEval+ (pass@10) | MBPP (pass@10) | Level 3 Endpoint (3-shot) |
| --- | --- | --- | --- |
| - | 47.8 | 58.3 | - |
| Magicoder | 60.8 | 66.4 | 17.0 |
| Magicoder + API Pack | 61.3 | 64.3 | 52.3 |
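
The differences between the two Magicoder rows of Table 4 can be computed directly. The short Python sketch below only restates the table's numbers; the dictionary keys are names chosen here for illustration.

```python
# Scores copied from Table 4 (CodeLlama-13b).
table4 = {
    "magicoder": {"humaneval_plus": 60.8, "mbpp": 66.4, "level3_endpoint_3shot": 17.0},
    "magicoder_api_pack": {"humaneval_plus": 61.3, "mbpp": 64.3, "level3_endpoint_3shot": 52.3},
}

def delta(metric):
    """Change in percentage points when API Pack is added to Magicoder."""
    return round(table4["magicoder_api_pack"][metric] - table4["magicoder"][metric], 1)

deltas = {metric: delta(metric) for metric in table4["magicoder"]}
print(deltas)
# {'humaneval_plus': 0.5, 'mbpp': -2.1, 'level3_endpoint_3shot': 35.3}
```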

6 Conclusion

In this paper, we introduce API Pack, a multilingual dataset of more than one million instruction-API call instances that seeks to improve models’ code generation ability. Leveraging this unprecedented scale and diversity, we explore two critical research questions: (1) Does exposure to a greater diversity of APIs during fine-tuning improve the model’s generalization to new API data? (2) Can models generalize to new languages without requiring a large amount of multi-lingual data? Our results demonstrate that increasing data volume does improve generalization capabilities, and that cross-lingual code generation can be achieved by training in only one programming language plus small amounts of data from other languages. Moreover, we also explored the usability of API Pack with other code instruction datasets. Our results show that API Pack improves API call code generation without affecting the general performance of other code generation tasks.

7 Limitations

While our research advances code generation with API call integration, it has the following limitations:

  • API Pack instructions must include the API name to achieve correct intent-detection. This limitation hinders the model’s ability to intuitively interpret and respond to incomplete queries.

  • Challenges with Multi-API Call Scenarios. As API Pack is not a multi-API call dataset, models fine-tuned with it may struggle with scenarios involving multiple, interdependent API calls. This restricts API Pack’s effectiveness in complex, real-world software development tasks.

8 Future Work

Building upon these promising findings, we propose three future research directions:

  • Enriched API Classification. Eliminating the explicit need for API names in dataset instructions, to allow intuitive, context-based translation that better resembles natural coding workflows.

  • Argument Augmentation. Incorporating API calls with arguments for comprehensive, functional examples. This expanded scope enables models to generate detailed, actionable code that developers can test.

  • Multi-Step API Scenarios. Including multi-API call sequences to mirror real-world complexity and dependencies in advanced development settings.

9 Broader Impact Statement

Our research has a broader impact on Software Productivity as it aims to accelerate software development workflows by automating routine coding tasks. Alongside promising productivity gains, the integration of advanced LLMs into software development workflows raises sociotechnical concerns (e.g., job displacement, developers’ continuous oversight to ensure accuracy). Therefore, a responsible innovation lens assessing and responding to potential harms remains vital.

References

  • Austin et al. (2021) Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C. Program Synthesis with Large Language Models, August 2021. URL http://arxiv.org/abs/2108.07732. arXiv:2108.07732 [cs].
  • Chen et al. (2023) Chen, L., Li, S., Yan, J., Wang, H., Gunaratna, K., Yadav, V., Tang, Z., Srinivasan, V., Zhou, T., Huang, H., and Jin, H. AlpaGasus: Training A Better Alpaca with Fewer Data, November 2023. URL http://arxiv.org/abs/2307.08701. arXiv:2307.08701 [cs].
  • Chen et al. (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating Large Language Models Trained on Code, July 2021. URL http://arxiv.org/abs/2107.03374. arXiv:2107.03374 [cs].
  • Ebert & Louridas (2023) Ebert, C. and Louridas, P. Generative AI for Software Practitioners. IEEE Software, 40(4):30–38, July 2023. ISSN 1937-4194. doi: 10.1109/MS.2023.3265877. URL https://ieeexplore.ieee.org/abstract/document/10176168.
  • Hou et al. (2023) Hou, X., Zhao, Y., Liu, Y., Yang, Z., Wang, K., Li, L., Luo, X., Lo, D., Grundy, J., and Wang, H. Large Language Models for Software Engineering: A Systematic Literature Review, September 2023. URL http://arxiv.org/abs/2308.10620. arXiv:2308.10620 [cs].
  • Jiang et al. (2023) Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7B, October 2023. URL http://arxiv.org/abs/2310.06825. arXiv:2310.06825 [cs].
  • Li et al. (2023a) Li, M., Song, F., Yu, B., Yu, H., Li, Z., Huang, F., and Li, Y. API-Bank: A Benchmark for Tool-Augmented LLMs, April 2023a. URL http://arxiv.org/abs/2304.08244. arXiv:2304.08244 [cs].
  • Li et al. (2023b) Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., Liu, Q., Zheltonozhskii, E., Zhuo, T. Y., Wang, T., Dehaene, O., Davaadorj, M., Lamy-Poirier, J., Monteiro, J., Shliazhko, O., Gontier, N., Meade, N., Zebaze, A., Yee, M.-H., Umapathi, L. K., Zhu, J., Lipkin, B., Oblokulov, M., Wang, Z., Murthy, R., Stillerman, J., Patel, S. S., Abulkhanov, D., Zocca, M., Dey, M., Zhang, Z., Fahmy, N., Bhattacharyya, U., Yu, W., Singh, S., Luccioni, S., Villegas, P., Kunakov, M., Zhdanov, F., Romero, M., Lee, T., Timor, N., Ding, J., Schlesinger, C., Schoelkopf, H., Ebert, J., Dao, T., Mishra, M., Gu, A., Robinson, J., Anderson, C. J., Dolan-Gavitt, B., Contractor, D., Reddy, S., Fried, D., Bahdanau, D., Jernite, Y., Ferrandis, C. M., Hughes, S., Wolf, T., Guha, A., von Werra, L., and de Vries, H. StarCoder: may the source be with you!, December 2023b. URL http://arxiv.org/abs/2305.06161. arXiv:2305.06161 [cs].
  • Liang et al. (2023) Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as Policies: Language Model Programs for Embodied Control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 9493–9500, May 2023. doi: 10.1109/ICRA48891.2023.10160591. URL https://ieeexplore.ieee.org/abstract/document/10160591.
  • Liu et al. (2023a) Liu, J., Xia, C. S., Wang, Y., and Zhang, L. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation, October 2023a. URL http://arxiv.org/abs/2305.01210. arXiv:2305.01210 [cs].
  • Liu et al. (2023b) Liu, W., Zeng, W., He, K., Jiang, Y., and He, J. What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning, December 2023b. URL http://arxiv.org/abs/2312.15685. arXiv:2312.15685 [cs].
  • Lu et al. (2023) Lu, K., Yuan, H., Yuan, Z., Lin, R., Lin, J., Tan, C., Zhou, C., and Zhou, J. #InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models, August 2023. URL http://arxiv.org/abs/2308.07074. arXiv:2308.07074 [cs].
  • Luo et al. (2023) Luo, Z., Xu, C., Zhao, P., Sun, Q., Geng, X., Hu, W., Tao, C., Ma, J., Lin, Q., and Jiang, D. WizardCoder: Empowering Code Large Language Models with Evol-Instruct, June 2023. URL http://arxiv.org/abs/2306.08568. arXiv:2306.08568 [cs].
  • Meng et al. (2018) Meng, M., Steinhardt, S., and Schubert, A. Application Programming Interface Documentation: What Do Software Developers Want? Journal of Technical Writing and Communication, 48(3):295–330, July 2018. ISSN 0047-2816. doi: 10.1177/0047281617721853. URL https://doi.org/10.1177/0047281617721853. Publisher: SAGE Publications Inc.
  • Muennighoff et al. (2023) Muennighoff, N., Liu, Q., Zebaze, A., Zheng, Q., Hui, B., Zhuo, T. Y., Singh, S., Tang, X., von Werra, L., and Longpre, S. OctoPack: Instruction Tuning Code Large Language Models, August 2023. URL http://arxiv.org/abs/2308.07124. arXiv:2308.07124 [cs].
  • Patil et al. (2023) Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. Gorilla: Large Language Model Connected with Massive APIs, May 2023. URL http://arxiv.org/abs/2305.15334. arXiv:2305.15334 [cs].
  • Qin et al. (2023) Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., Zhao, S., Tian, R., Xie, R., Zhou, J., Gerstein, M., Li, D., Liu, Z., and Sun, M. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, July 2023. URL http://arxiv.org/abs/2307.16789. arXiv:2307.16789 [cs].
  • Sadowski et al. (2015) Sadowski, C., Stolee, K. T., and Elbaum, S. How developers search for code: a case study. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pp. 191–201, New York, NY, USA, August 2015. Association for Computing Machinery. ISBN 978-1-4503-3675-8. doi: 10.1145/2786805.2786855. URL https://dl.acm.org/doi/10.1145/2786805.2786855.
  • Schick et al. (2023) Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. Toolformer: Language Models Can Teach Themselves to Use Tools, February 2023. URL http://arxiv.org/abs/2302.04761. arXiv:2302.04761 [cs].
  • Shrivastava et al. (2023) Shrivastava, D., Larochelle, H., and Tarlow, D. Repository-Level Prompt Generation for Large Language Models of Code. In Proceedings of the 40th International Conference on Machine Learning, pp. 31693–31715. PMLR, July 2023. URL https://proceedings.mlr.press/v202/shrivastava23a.html. ISSN: 2640-3498.
  • Tang et al. (2023) Tang, Q., Deng, Z., Lin, H., Han, X., Liang, Q., Cao, B., and Sun, L. ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases, September 2023. URL http://arxiv.org/abs/2306.05301. arXiv:2306.05301 [cs].
  • Touvron et al. (2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open Foundation and Fine-Tuned Chat Models, July 2023. URL http://arxiv.org/abs/2307.09288. arXiv:2307.09288 [cs].
  • Wang et al. (2023a) Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-Instruct: Aligning Language Models with Self-Generated Instructions, May 2023a. URL http://arxiv.org/abs/2212.10560. arXiv:2212.10560 [cs].
  • Wang et al. (2023b) Wang, Y., Le, H., Gotmare, A. D., Bui, N. D. Q., Li, J., and Hoi, S. C. H. CodeT5+: Open Code Large Language Models for Code Understanding and Generation, May 2023b. URL http://arxiv.org/abs/2305.07922. arXiv:2305.07922 [cs].
  • Wei et al. (2023) Wei, Y., Wang, Z., Liu, J., Ding, Y., and Zhang, L. Magicoder: Source Code Is All You Need, December 2023. URL http://arxiv.org/abs/2312.02120. arXiv:2312.02120 [cs].
  • Xiao et al. (2023) Xiao, S., Liu, Z., Zhang, P., and Muennighoff, N. C-Pack: Packaged Resources To Advance General Chinese Embedding, December 2023. URL http://arxiv.org/abs/2309.07597. arXiv:2309.07597 [cs].
  • Xu et al. (2023a) Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. WizardLM: Empowering Large Language Models to Follow Complex Instructions, June 2023a. URL http://arxiv.org/abs/2304.12244. arXiv:2304.12244 [cs].
  • Xu et al. (2023b) Xu, Q., Hong, F., Li, B., Hu, C., Chen, Z., and Zhang, J. On the Tool Manipulation Capability of Open-source Large Language Models, May 2023b. URL https://arxiv.org/abs/2305.16504v1.
  • Yang et al. (2023) Yang, R., Song, L., Li, Y., Zhao, S., Ge, Y., Li, X., and Shan, Y. GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction, May 2023. URL http://arxiv.org/abs/2305.18752. arXiv:2305.18752 [cs].
  • Zan et al. (2023) Zan, D., Chen, B., Zhang, F., Lu, D., Wu, B., Guan, B., Yongji, W., and Lou, J.-G. Large Language Models Meet NL2Code: A Survey. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7443–7464, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.411. URL https://aclanthology.org/2023.acl-long.411.
  • Zhang et al. (2023) Zhang, P., Xiao, S., Liu, Z., Dou, Z., and Nie, J.-Y. Retrieve Anything To Augment Large Language Models, October 2023. URL http://arxiv.org/abs/2310.07554. arXiv:2310.07554 [cs].

Appendix

Appendix A Data Filtering at each Stage of the Pipeline

Table 5: Data Filtering Progress
| Source | Instances Before Data Validation | After Removing Invalid API Calls | After Removing Instances without Good Instructions |
| --- | --- | --- | --- |
| API Hub | 27,635 | 17,712 | 17,206 |
| APIs Gurus | 500,160 | 499,250 | 495,533 |
| Swaggerhub | 351,756 | 351,756 | 345,765 |
| RapidAPI | 274,014 | 273,388 | 270,095 |
| Total | 1,153,565 | 1,142,106 | 1,128,599 |
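
To make the effect of each filtering stage easier to read, the following Python sketch recomputes the totals in Table 5 and the share of instances each source retains. It uses only the counts from the table; the variable names are illustrative.

```python
# (before validation, after removing invalid API calls, after instruction filtering)
stages = {
    "API Hub": (27_635, 17_712, 17_206),
    "APIs Gurus": (500_160, 499_250, 495_533),
    "Swaggerhub": (351_756, 351_756, 345_765),
    "RapidAPI": (274_014, 273_388, 270_095),
}

# Column-wise sums reproduce the "Total" row of Table 5.
totals = tuple(sum(column) for column in zip(*stages.values()))
print(totals)  # (1153565, 1142106, 1128599)

# Fraction of the original instances that survive both filters.
retention = {source: final / before for source, (before, _, final) in stages.items()}
for source, share in retention.items():
    print(f"{source:<11} keeps {share:.1%} of its instances")
print(f"Overall     keeps {totals[2] / totals[0]:.1%}")
```

Most of the loss comes from API Hub, where a large share of the extracted API calls fails validation; the other three sources retain nearly all of their instances.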

Appendix B Hyperparameters for Training

We fine-tune the models using the HuggingFace Transformers library on a single node with 8 NVIDIA H100 80GB GPUs, using Fully Sharded Data Parallel (FSDP) training. Mixed precision, gradient checkpointing, and the Adafactor optimizer are used to improve training efficiency. The key hyperparameters are summarized in Table 6.

Table 6: Hyperparameters for Training
| Hyperparameter Name | Value |
| --- | --- |
| Learning rate | 2×10^-5 |
| Batch size | 128 |
| Max seq length | 4096 |
| Number of epochs | 2 |
| Warmup ratio | 0.03 |
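
As a rough illustration of what these settings imply, the Python sketch below stores them in a small configuration object and derives the per-GPU batch size and the number of warm-up steps. Several points are assumptions rather than statements from the paper: the learning rate is read as 2e-5, the batch size of 128 is taken to be the global batch size, gradient accumulation defaults to 1, and step counts are rounded up. `FineTuneConfig` is a hypothetical name, not a Transformers class.

```python
import math
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class FineTuneConfig:
    learning_rate: float = 2e-5    # assumed reading of the Table 6 entry
    global_batch_size: int = 128   # assumed to be the global (not per-device) batch size
    max_seq_length: int = 4096
    num_epochs: int = 2
    warmup_ratio: float = 0.03
    num_gpus: int = 8              # one node with 8 GPUs, as described in Appendix B

    def per_device_batch_size(self, grad_accum_steps: int = 1) -> int:
        """Batch size each GPU processes per forward/backward pass."""
        divisor = self.num_gpus * grad_accum_steps
        if self.global_batch_size % divisor:
            raise ValueError("global batch size must be divisible by GPUs x accumulation steps")
        return self.global_batch_size // divisor

    def schedule(self, num_examples: int) -> Tuple[int, int]:
        """Return (total optimizer steps, warm-up steps) for a dataset of num_examples."""
        steps_per_epoch = math.ceil(num_examples / self.global_batch_size)
        total_steps = steps_per_epoch * self.num_epochs
        return total_steps, math.ceil(total_steps * self.warmup_ratio)

cfg = FineTuneConfig()
print(cfg.per_device_batch_size())  # 16
print(cfg.schedule(100_000))        # (1564, 47)
```

Under these assumptions, a 100,000-instance run takes 1,564 optimizer steps, of which the first 47 are warm-up.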

Appendix C Testing Pipeline

Figure 5 shows an example of our testing pipeline, including zero-shot and few-shot prompting.

Figure 5: Testing pipeline.

Appendix D Retrieval methods comparison

Table 7 shows the impact of different retrieval methods on three-shot API call generation.

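Table 7: Performance comparison for different retrieval methods in three-shot prompting. Values are evaluation accuracy (%).
| Model | Testing | Level 1 Endpoint | Level 1 API Call | Level 2 Endpoint | Level 2 API Call | Level 3 Endpoint | Level 3 API Call |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mistral-7b | 3-shot (rand) | 54.5 | 41.8 | 48.2 | 41.2 | 45.2 | 37.0 |
| Mistral-7b | 3-shot (retre) | 64.1 | 55.4 | 49.1 | 42.8 | 50.8 | 42.5 |
| Mistral-7b | 3-shot (retre & rerank) | 63.0 | 53.6 | 49.0 | 42.2 | 51.5 | 43.9 |
| CodeLlama-13b | 3-shot (rand) | 49.2 | 38.6 | 49.8 | 43.6 | 50.0 | 41.4 |
| CodeLlama-13b | 3-shot (retre) | 63.5 | 55.5 | 56.8 | 51.4 | 56.1 | 49.5 |
| CodeLlama-13b | 3-shot (retre & rerank) | 61.0 | 52.9 | 55.1 | 49.2 | 55.9 | 49.3 |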
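
The difference between random and retrieved demonstrations can be illustrated with a small Python sketch. It is not the retriever used in the experiments, which this appendix does not specify: bag-of-words cosine similarity is a stand-in chosen for self-containment, the re-ranking step is omitted, and the example pool and function names are hypothetical.

```python
import math
import random
import re
from collections import Counter

def bag_of_words(text):
    """Lower-cased word counts; a deliberately simple stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def pick_random(pool, k=3, seed=0):
    """'rand' setting: k demonstrations drawn uniformly at random."""
    return random.Random(seed).sample(pool, k)

def pick_retrieved(query, pool, k=3):
    """'retre' setting: the k demonstrations most similar to the test instruction."""
    q = bag_of_words(query)
    return sorted(pool, key=lambda ex: cosine(q, bag_of_words(ex["instruction"])), reverse=True)[:k]

# Hypothetical demonstration pool; api_call values are placeholders.
pool = [
    {"instruction": "List all bare metal servers using the Virtual Private Cloud API", "api_call": "<call 1>"},
    {"instruction": "Send a message with the IBM Event Streams REST Producer API", "api_call": "<call 2>"},
    {"instruction": "Search for ecards using the eCards Search API", "api_call": "<call 3>"},
    {"instruction": "Get a bare metal server profile with the Virtual Private Cloud API", "api_call": "<call 4>"},
]
query = "How do I list bare metal servers with the Virtual Private Cloud API?"
retrieved = pick_retrieved(query, pool)
print([ex["api_call"] for ex in retrieved])  # ['<call 1>', '<call 4>', '<call 2>']
```

Retrieved demonstrations tend to share the target API or endpoint family with the test instruction, which is one plausible reason the "retre" rows outperform the "rand" rows.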

Appendix E Analysis of Cross-linguistic Flexibility

Figure 6 illustrates the models’ adaptability across ten programming languages. The mixture model is on par with expert models in three-shot testing, highlighting the models’ potential in multi-lingual programming applications.

Figure 6: Three-Shot Performance in Ten Languages: Demonstrating the Model’s Multi-Lingual API Call Adaptability.

Appendix F API DB instance

Figure 7 shows the structure of an API DB instance. To generate API calls, we used openapi-snippet (https://www.npmjs.com/package/openapi-snippet), an open-source package that takes an OpenAPI v2.0 or v3.0.x specification file (OAS file) as input and translates it into an HTTP Archive 1.2 request object. We generated API calls (api_call) in 10 different programming languages (cURL, libcurl, java, node, python, go, ruby, php, swift, JavaScript) for RapidAPI, API Gurus, and Swaggerhub. For the company’s public API Hub, API calls were extracted directly from the OAS files. From this source, we extracted API calls in eight different programming languages (cURL, java, node, python, go, ruby, php, swift).

{
"api_name": "The name of the API the endpoint belongs to",
"api_description": "The description of the API the endpoint belongs to",
"api_provider": "The API's provider name",
"endpoint_name": "The name of the function to call",
"functionality": "A brief description of the endpoint's functionality",
"description": "A long description of the endpoint's functionality",
"path": "The endpoint's name plus specific versioning as it appears in the API call's URL",
"method": "HTTP method used in the API call (e.g., get, post, put, delete)",
"api_call": "HTTP request code example to invoke an endpoint",
"lang": "The programming language in which the API call is written"
}
Figure 7: Structure of an API DB instance
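
For readers who want to work with such instances programmatically, the structure in Figure 7 maps naturally onto a typed record. The Python sketch below is one possible representation, not part of the released code base: the class name `ApiDbInstance`, the set of accepted HTTP methods, and all field values in the example are illustrative assumptions.

```python
from dataclasses import asdict, dataclass

# Accepted methods are an assumption; Figure 7 lists get, post, put, delete as examples.
HTTP_METHODS = {"get", "post", "put", "delete", "patch", "head", "options"}

@dataclass
class ApiDbInstance:
    api_name: str
    api_description: str
    api_provider: str
    endpoint_name: str
    functionality: str
    description: str
    path: str
    method: str
    api_call: str
    lang: str

    def __post_init__(self):
        # Normalise the method to lower case and reject anything unexpected.
        self.method = self.method.lower()
        if self.method not in HTTP_METHODS:
            raise ValueError(f"unsupported HTTP method: {self.method!r}")

# Hypothetical field values, loosely modelled on the endpoint named in Appendix G.
example = ApiDbInstance(
    api_name="Adyen BinLookup API",
    api_description="(hypothetical) Looks up card properties by BIN.",
    api_provider="adyen.com",
    endpoint_name="post-get3dsAvailability",
    functionality="Check 3D Secure availability",
    description="(hypothetical) Reports whether 3D Secure is supported for a card brand or BIN.",
    path="/get3dsAvailability",
    method="POST",
    api_call="curl --request POST --url https://example.com/get3dsAvailability",
    lang="cURL",
)
print(asdict(example)["method"])  # post
```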

Appendix G Instruction Candidates

Figure 8 shows all the candidates generated for an instance, with their respective idx and input_tokens_mean.

"instruction_candidates": [
{
"idx": 1,
"candidate": "I'd like to confirm if 3D Secure is supported for a specific card brand or BIN using the Adyen BinLookup API. For 3D Secure 2, it will also provide device fingerprinting keys.",
"input_tokens_mean": -0.5497539341557909
},
{
"idx": 2,
"candidate": "To confirm if 3D Secure is enabled for a specific card brand or BIN number using the Adyen BinLookup API, you can make use of the post-get3dsAvailability endpoint. This API call will provide you with information about the availability of 3D Secure, along with device fingerprinting keys for 3D Secure 2 transactions. Let me know if you need assistance in forming the request or handling the response.",
"input_tokens_mean": -0.5088001283229344
},
{
"idx": 3,
"candidate": "To confirm if a specific card brand or BIN supports 3D Secure and retrieves device fingerprinting keys for 3D Secure 2, please utilize the Adyen BinLookup API's post-get3dsAvailability endpoint.",
"input_tokens_mean": -0.5554471563123543
},
{
"idx": 4,
"candidate": "To confirm if 3D Secure is supported for a specific card brand or BIN number using the Adyen BinLookup API, please make a POST request to the endpoint 'post-get3dsAvailability'. This API will return whether 3D Secure is enabled and for 3D Secure 2, it will also provide device fingerprinting keys.",
"input_tokens_mean": -0.5466722401065375
},
{
"idx": 5,
"candidate": "To confirm if 3D Secure is supported by Adyen for a given card brand or BIN number, you can utilize the Adyen BinLookup API. Simply send a POST request to the post-get3dsAvailability endpoint with the required card details. If 3D Secure is available, the response will include device fingerprinting keys for 3D Secure 2.",
"input_tokens_mean": -0.5726057469087047
}
]
Figure 8: Examples of five instruction candidates generated for the same instance.
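
Each candidate in Figure 8 carries an input_tokens_mean score, so candidates can be ordered by it. The Python sketch below does this under an explicit assumption that this appendix does not state: that a higher (less negative) score marks a preferable candidate. The function name `rank_candidates` is hypothetical, and the candidate texts are omitted for brevity.

```python
# idx and input_tokens_mean values copied from Figure 8.
candidates = [
    {"idx": 1, "input_tokens_mean": -0.5497539341557909},
    {"idx": 2, "input_tokens_mean": -0.5088001283229344},
    {"idx": 3, "input_tokens_mean": -0.5554471563123543},
    {"idx": 4, "input_tokens_mean": -0.5466722401065375},
    {"idx": 5, "input_tokens_mean": -0.5726057469087047},
]

def rank_candidates(cands):
    """Order candidates from highest to lowest score (assumed: higher is better)."""
    return sorted(cands, key=lambda c: c["input_tokens_mean"], reverse=True)

ranked = rank_candidates(candidates)
print([c["idx"] for c in ranked])  # [2, 4, 1, 3, 5]
```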

Appendix H Prompts

Listing 1: Prompt for instruction refinement
Your task is to refine and enhance a user query that involves a specific API. The original query you'll work with includes key details about the API's functionality, description, endpoint, and name. Focus on these essential aspects when revising the query:
1. **Integration of API Details:** Make sure the revised query includes relevant details about the API's functionality, description, and name, without directly mentioning the endpoint.
2. **Grammar and Syntax Correction:** Analyze the original query for grammatical mistakes such as improper verb forms (e.g., 'can' followed by 's' or 'es') or misplaced punctuation (like commas or colons). Correct these to improve clarity and professionalism.
3. **Relevance and Conciseness:** Eliminate any extraneous information from the original query. Strive for brevity while ensuring all critical details are included.
4. **User-Centric Rewrite:** Rework the query to reflect a user's perspective, focusing on their specific needs and how the API can address those needs.
For each query, you will receive:
- **Input:** An original user query with API details.
- **Your Task:** Revise the query based on the guidelines above.
Example for Practice:
### Input:
Functionality: Search Ecards
Description: Allows searching the inventory system for ecards.
Endpoint: searchECards
API: eCards Search API
User query to refine: "Please tell me how to searches ecards with the eCards Search API."
### Output (refined user query):
"Can you guide me on how to search for ecards using the eCards Search API?"
Another Example:
### Input:
Functionality: KPI of realized sales
Description: Provides KPIs for documents issued, requires ACLs from /instances, and uses the typeKpi parameter.
Endpoint: getKPIs
API: Blackbird Analytics
User query to refine: "I need to certificates shows kpis related to all documents issued through blackbird. to use it you must have pass a list of acls retrieved by /instances and specify which kpi using typekpi parameter."
### Output (refined user query):
"How can I access KPIs for documents issued by Blackbird Analytics, and what are the required typeKpi parameters?"
Remember, the goal is to modify the user query to be clear, effective, and grammatically correct, fully showcasing how the user can leverage the specific API.
Now, here is your actual task:
### Input:
Functionality: {functionality}
Description: {description}
Endpoint: {endpoint}
API: {api_name}
User query to refine: {template generated instruction}
### Output (refined user query):
Listing 2: Prompt for instruction generation
Your task is to create a user query that effectively utilizes a specific API. The API's functionality, description, and name will be provided to you. Your query should be designed in a way that makes the best use of this API's unique capabilities. When crafting your query, focus on:
1. **API Name Integration:** Clearly include the API's name in your query to ensure relevance.
2. **Specificity:** Replace broad or vague terms with precise, concrete details relevant to the API's purpose.
3. **Conciseness:** Keep your query as brief as possible while still fully conveying the needed information. Avoid unnecessary verbosity.
4. **Excluding API Endpoint:** Do not include the API's endpoint in your query; focus only on the user's need and how the API fulfills it.
Create a query that a user might realistically use when interacting with the given API. Think about typical scenarios or problems that the API is designed to solve and formulate your query accordingly.
Examples for practice:
###Input:
Functionality: {functionality}
Description: {description}
Endpoint: {endpoint}
API: {api_name}
###Output:
{output}
###Input:
Functionality: {functionality}
Description: {description}
Endpoint: {endpoint}
API: {api_name}
###Output:
{output}
###Input:
Functionality: {functionality}
Description: {description}
Endpoint: {endpoint}
API: {api_name}
###Output:
{output}
Remember, the goal is to demonstrate how a user would benefit from this specific API in a realistic scenario, using precise and clear language. Here is the actual task for you:
###Input:
Functionality: {functionality}
Description: {description}
Endpoint: {endpoint}
API: {api_name}
###Output:
Listing 3: Prompt for instruction backtranslation
Your task involves a reverse-engineering process where you will analyze a user query to infer specific details about an API endpoint. Based on the given user query, you are expected to:
1. **Identify the Endpoint's Identifier:** Derive the endpoint identifier that aligns with the functionality implied by the user query.
2. **Determine Endpoint Functionality:** Interpret the user query to understand and describe the functionality of the endpoint.
3. **Describe the Endpoint:** Provide a detailed description of the endpoint based on the needs and context presented in the user query.
4. **Specify the API Name:** Identify and state the name of the API to which this endpoint belongs, as suggested by the user query.
Your response should clearly articulate these four elements (identifier, functionality, description, API name) in a manner that reflects an accurate understanding of the user query. Consider the query as a real-world scenario or problem that the endpoint is designed to address.
Examples for practice:
###Input:
{generated instruction}
###Output:
Functionality: {functionality}
Description: {description}
Endpoint: {endpoint}
API: {api_name}
###Input:
{generated instruction}
###Output:
Functionality: {functionality}
Description: {description}
Endpoint: {endpoint}
API: {api_name}
###Input:
{generated instruction}
###Output:
Functionality: {functionality}
Description: {description}
Endpoint: {endpoint}
API: {api_name}
The goal is to showcase your ability to connect a user's needs with the appropriate API endpoint, demonstrating an understanding of how the endpoint's features align with user requirements. Your response should be precise, insightful, and reflective of the query's implications. Here is the actual task for you:
###Input:
{generated instruction}
###Output:
Functionality: {functionality}
Description: {description}
Endpoint: {endpoint}
API: {api_name}
Listing 4: Prompt for instruction scoring
**Your Task**: Evaluate the provided instruction from an AI assistant and classify its quality as **Good** or **Bad** based on specific criteria.
**Criteria for Bad Instruction**:
1. Contains multiple instructions instead of a single, clear directive.
2. Includes unnecessary additional text before or after the main instruction.
3. Fails to accurately use the specified API name and endpoint.
**Input Structure**:
You will receive an **INPUT** consisting of three elements:
1. An **instruction** generated by the AI assistant.
2. The **API name** related to the instruction.
3. The **API endpoint** relevant to the instruction.
**Output**:
Classify the instruction as **Good** or **Bad** with a concise justification.
**Examples**:
1. **Input**:
- **Instruction**: "Create a new message in IBM Event Streams using the REST Producer API." "How do I format and send a message body via the IBM Event Streams REST Producer API?" "Can you help me construct a message using the IBM Event Streams REST Producer?" "Use the IBM Event Streams REST Producer API to send a message with specific data." "What's the proper syntax for creating and sending a message through the IBM Event Streams REST Producer?"
- **API Name**: IBM Event Streams REST Producer
- **API Endpoint**: produceMessage
- **Output**: Bad. This instruction is classified as **Bad** because it combines four separate instructions into one, each asking for different guidance related to the IBM Event Streams REST Producer API.
2. **Input**:
- **Instruction**: "I'd like to send a new message to IBM Event Streams. How do I format the request body to ensure it is correctly processed using the IBM Event Streams REST Producer API?"
- **API Name**: IBM Event Streams REST Producer
- **API Endpoint**: produceMessage
- **Output**: Good. This instruction is classified as **Good** because it provides a single, clear directive without additional text and correctly uses the API name.
3. **Input**:
- **Instruction**: "Here's a possible user query utilizing the given API: 'Help me list all the bare metal servers in my Virtual Private Cloud account using the Virtual Private Cloud API.'"
- **API Name**: Virtual Private Cloud API
- **API Endpoint**: list_bare_metal_servers
- **Output**: Bad. This instruction is classified as **Bad** because it contains unnecessary introductory text ("Here's a possible user query utilizing the given API:") before the actual instruction.
4. **Input**:
- **Instruction**: "How do I retrieve a list of all bare metal servers in my region using the Virtual Private Cloud API?"
- **API Name**: Virtual Private Cloud API
- **API Endpoint**: list_bare_metal_servers
- **Output**: Good. This instruction is classified as **Good** because it is a singular, straightforward instruction without extra text and appropriately uses the API name.
**Your Current Task**:
**Input**:
- **Instruction**: {candidate}
- **API Name**: {api_name}
- **API Endpoint**: {endpoint}
**Output**:
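
The scoring prompt asks for a Good/Bad label followed by a short justification, so the model's reply has to be parsed before it can be used for filtering. The Python sketch below shows one tolerant way to do this. It is an assumption about how such output could be handled, not the authors' implementation, and `parse_verdict` is a hypothetical helper.

```python
import re

def parse_verdict(model_output):
    """Split a scoring reply into (label, justification); label is None if none is found."""
    # Drop surrounding whitespace and any leading bold markers such as '**Good**'.
    text = model_output.strip().lstrip("*").strip()
    match = re.match(r"(good|bad)\b", text, flags=re.IGNORECASE)
    if match is None:
        return None, text
    label = match.group(1).capitalize()
    # Remove the punctuation or bold markers that separate label and justification.
    justification = text[match.end():].lstrip("*.: ").strip()
    return label, justification

label, reason = parse_verdict(
    "Bad. This instruction is classified as **Bad** because it contains unnecessary text."
)
print(label)   # Bad
print(reason)  # This instruction is classified as **Bad** because it contains unnecessary text.
```

Replies that do not start with either label return `None`, which lets a caller discard or re-query ambiguous cases instead of guessing.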
Listing 5: Prompt for few-shot fine-tuning and evaluation
**api_description**:{api_description}
**lang**:{programming language}
Given the following examples:
**instruction**
{instruction}
**output**
{api_call}
**instruction**
{instruction}
**output**
{api_call}
**instruction**
{instruction}
**output**
{api_call}
Your actual task:
**instruction**
{instruction_test}
**output**
### ASSISTANT:
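
The template in Listing 5 can be filled programmatically. The Python sketch below assembles the prompt line by line from a list of demonstrations; the function name `build_few_shot_prompt` and the example values are hypothetical, and the exact whitespace of the authors' prompts may differ.

```python
def build_few_shot_prompt(api_description, lang, examples, instruction_test):
    """Fill the Listing 5 template with demonstrations and the test instruction."""
    lines = [
        f"**api_description**:{api_description}",
        f"**lang**:{lang}",
        "Given the following examples:",
    ]
    for example in examples:
        lines += ["**instruction**", example["instruction"], "**output**", example["api_call"]]
    lines += ["Your actual task:", "**instruction**", instruction_test, "**output**", "### ASSISTANT:"]
    return "\n".join(lines)

# Hypothetical demonstrations; api_call values are placeholders.
demos = [
    {"instruction": f"Example instruction {i}", "api_call": f"<api call {i}>"}
    for i in range(1, 4)
]
prompt = build_few_shot_prompt(
    api_description="(hypothetical) An API for searching ecards.",
    lang="cURL",
    examples=demos,
    instruction_test="How can I search for ecards using the eCards Search API?",
)
print(prompt)
```

With three demonstrations the prompt contains four `**instruction**` blocks: three solved examples and the test instruction, whose output the model is expected to complete after `### ASSISTANT:`.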