License: arXiv.org perpetual non-exclusive license
arXiv:2311.09206v3 [cs.CL] 04 Apr 2024

TableLlama: Towards Generalist Large Language Models for Tables

Antiquus S. Hippocampus, Natalia Cerebro & Amelie P. Amygdale
Department of Computer Science
Cranberry-Lemon University
Pittsburgh, PA 15213, USA
{hippo,brain,jen}@cs.cranberry-lemon.edu
Ji Q. Ren & Yevgeny LeNet
Department of Computational Neuroscience
University of the Witwatersrand
Joburg, South Africa
{robot,net}@wits.ac.za
Abstract

Semi-structured tables are ubiquitous. A variety of tasks aim to automatically interpret, augment, and query tables. Current methods often require pretraining on tables or special model architecture design, are restricted to specific table types, or make simplifying assumptions about tables and tasks. This paper makes the first step towards developing open-source large language models (LLMs) as generalists for a diversity of table-based tasks. Towards that end, we construct TableInstruct, a new dataset with a variety of realistic tables and tasks, for instruction tuning and evaluating LLMs. We further develop the first open-source generalist model for tables, TableLlama, by fine-tuning Llama 2 (7B) with LongLoRA to address the long context challenge. We experiment under both in-domain and out-of-domain settings. On 7 out of 8 in-domain tasks, TableLlama achieves comparable or better performance than the SOTA for each task, even though the latter often has task-specific design. On 6 out-of-domain datasets, it achieves 5-44 absolute point gains compared with the base model, showing that training on TableInstruct enhances the model’s generalizability. We open-source our dataset and trained model to boost future work on developing open generalist models for tables. Code, model and data are available at: https://osu-nlp-group.github.io/TableLlama/.

1 Introduction

Figure 1: An overview of TableInstruct and TableLlama. TableInstruct includes a wide variety of realistic tables and tasks with instructions. We make the first step towards developing open-source generalist models for tables with TableInstruct and TableLlama. The image of TableLlama is adapted based on DALL-E 3’s generation.

Semi-structured tables are prevalent data structures to store and present information in almost every domain, ranging from scientific research, business reports, and healthcare records to financial statements. A variety of table-based tasks have been proposed, such as entity linking (ritze2015matching), column type annotation (hulsebos2019sherlock), schema augmentation (zhang2017entitables), and table-based question answering (cheng-etal-2022-hitab; Nan2021FeTaQAFT; chen2020hybridqa), which have spurred significant research interest (deng2020turl; yin-etal-2020-tabert; wang2021tuta; iida-etal-2021-tabbie) in recent years.

Building models for tables is not new. However, none of the existing models meets all the requirements of a generalist model. Most existing methods for table-based tasks have at least one of the following limitations: (1) they require table pretraining (liu2022tapex; yin-etal-2020-tabert; deng2020turl; iida-etal-2021-tabbie) and/or special model architecture design for tables (deng2020turl; wang2021tuta; iida-etal-2021-tabbie); (2) they only support limited, specific types of tables and tasks (Chen2020TabFact:; Nan2021FeTaQAFT); (3) they make strong simplifying assumptions (see the “in-domain” part of Section 2.1) about tables and tasks (li2023tablegpt).

On the other hand, language models like T5 (raffel2020exploring) have been shown to excel in grounding language to structured knowledge (xie2022unifiedskg). In addition, instruction tuning (chung2022scaling; wang-etal-2022-super; mishra-etal-2022-cross) has emerged as an important technique that can guide LLMs to follow instructions to complete a variety of tasks.

Against this background, we seek to answer the following question: Can we build a generalist model to handle a variety of table-based tasks using LLMs and instruction tuning? Some exemplar tasks are shown in Figure 1. A generalist model for tables could make tables easier to manipulate and annotate, and could thus serve as a user-friendly tool that substantially reduces manual labor. Such a generalist model should meet the following desiderata: First, it should not only work well on diverse table-based tasks, but also generalize to unseen tasks. Since new table data and tasks can be constructed dynamically as new information arrives, it is hard to collect training data that covers all tasks and all tables, so the model must be inherently generalizable to tasks and datasets it has never seen before. Second, it should work on real-world tables and realistic tasks, which can be large, intricate, and incomplete. The model should not make strong assumptions to only handle simplified synthetic tables and tasks, but must embrace practical challenges such as handling complex numerical reasoning on large hierarchical spreadsheets as well as a large number of candidates for classification and ranking tasks.

In pursuing this goal, we find that there is no comprehensive collection of realistic tables and tasks that can support the development and evaluation of generalist models. Therefore, we construct TableInstruct by meticulously selecting representative table-based tasks from widely used datasets, unifying the format for all tasks, and manually annotating instructions. TableInstruct, shown in Table 1, offers the following unique features: (1) Diverse coverage of tables and tasks. TableInstruct comprises 14 datasets of 11 tasks in total, with both in-domain and out-of-domain evaluation settings. Our training data includes 8 tasks, curated from 1.24M tables containing 2.6M instances and spanning table interpretation, table augmentation, table-based QA, and table-based fact verification. We choose 8 datasets for these 8 tasks for in-domain evaluation and leave the other 6 datasets for 4 tasks for out-of-domain evaluation. The in-domain training tasks enable the model to learn more fundamental table understanding abilities such as table interpretation and table augmentation, while we choose tasks that require more high-level reasoning abilities, such as table QA and cell description, to test the model’s generalization ability. This extensive range of tables and diverse tasks not only provides valuable resources for table modeling, but also fosters a more comprehensive evaluation of generalist models. (2) The use of real-world tables and realistic tasks. TableInstruct uses authentic real-world task data instead of the overly simplified synthetic task data used in existing work (li2023tablegpt). We incorporate a large number of Wikipedia tables and spreadsheets from statistical scientific reports, and collect tables with contents of varied length, realistic and complex semantic types from Freebase (freebase) for column type annotation and relation extraction, and a large referent entity corpus with rich metadata from Wikidata (vrandevcic2014wikidata) for entity linking. In addition, we include complicated numerical reasoning tasks with hierarchical table structure and existing manually annotated table QA and fact verification tasks. By doing so, we aim to equip models with the capability to cope with realistic and complex table-based tasks.

TableInstruct requires models to accommodate long inputs (Table 1). We adopt LongLoRA (longlora) based on Llama 2 (7B) (touvron2023llama) as our backbone model, which has been shown to be efficient and effective in handling long contexts. We fine-tune it on TableInstruct and name our model TableLlama. We conduct extensive experiments and analysis under both in-domain and out-of-domain settings. Our experiments show that TableLlama has strong capabilities for various in-domain table understanding and augmentation tasks, and also achieves promising performance in generalizing to unseen tasks and datasets.

Figure 2: Illustration of three exemplary tasks: (a) Column type annotation. This task is to annotate the selected column with the correct semantic types. (b) Row population. This task is to populate rows given table metadata and partial row entities. (c) Hierarchical table QA. For subfigures (a) and (b), we mark candidates with red color in the “task instruction” part. The candidate set size can be hundreds to thousands in TableInstruct.

In summary, our main contributions are:

  • We construct TableInstruct, a large-scale instruction tuning dataset with diverse, realistic tasks based on real-world tables. We unify their format and manually annotate instructions to guarantee quality.

  • We develop TableLlama, an open-source LLM-based generalist model fine-tuned on TableInstruct. Experiments show that, compared with the SOTA on each task, which often has special pre-training or model architecture design for tables, TableLlama achieves similar or even better performance on almost all of the in-domain tasks. For out-of-domain tasks, TableLlama achieves 5-44 absolute point gains over the base model on 6 datasets; compared with GPT-4, TableLlama has a smaller gap or even better zero-shot performance on 4 out of 6 datasets, which demonstrates that TableInstruct can substantially enhance model generalizability.


2 TableInstruct Benchmark

| Task Category | Task Name | Dataset | In-domain | #Train (Table/Sample) | #Test (Table/Sample) | Input Tokens (min) | Input Tokens (max) | Input Tokens (median) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Table Interpretation | Col Type Annot. | TURL (deng2020turl) | Yes | 397K/628K | 1K/2K | 106 | 8192 | 2613 |
| Table Interpretation | Relation Extract. | TURL (deng2020turl) | Yes | 53K/63K | 1K/2K | 2602 | 8192 | 3219 |
| Table Interpretation | Entity Linking | TURL (deng2020turl) | Yes | 193K/1264K | 1K/2K | 299 | 8192 | 4667 |
| Table Augmentation | Schema Aug. | TURL (deng2020turl) | Yes | 288K/288K | 4K/4K | 160 | 1188 | 215 |
| Table Augmentation | Row Pop. | TURL (deng2020turl) | Yes | 286K/286K | 0.3K/0.3K | 264 | 8192 | 1508 |
| Question Answering | Hierarchical Table QA | HiTab (cheng-etal-2022-hitab) | Yes | 3K/7K | 1K/1K | 206 | 5616 | 978 |
| Question Answering | Highlighted Cells QA | FeTaQA (Nan2021FeTaQAFT) | Yes | 7K/7K | 2K/2K | 261 | 5923 | 740 |
| Question Answering | Hybrid Table QA | HybridQA (chen2020hybridqa) | No | - | 3K/3K | 248 | 2497 | 675 |
| Question Answering | Table QA | WikiSQL (wikisql) | No | - | 5K/16K | 198 | 2091 | 575 |
| Question Answering | Table QA | WikiTQ (wikitq) | No | - | 0.4K/4K | 263 | 2688 | 709 |
| Fact Verification | Fact Verification | TabFact (Chen2020TabFact:) | Yes | 16K/92K | 2K/12K | 253 | 4975 | 630 |
| Fact Verification | Fact Verification | FEVEROUS (feverous) | No | - | 4K/7K | 247 | 8192 | 648 |
| Dialogue Generation | Table Grounded Dialogue Generation | KVRET (kvret) | No | - | 0.3K/0.8K | 187 | 1103 | 527 |
| Data-to-Text | Highlighted Cells Description | ToTTo (parikh-etal-2020-totto) | No | - | 7K/8K | 152 | 8192 | 246 |

Table 1: Statistics of train/test tasks and datasets in our TableInstruct. Out-of-domain datasets have no training split (marked “-”). For each task, we explain its definition and show an example in the appendix.

Unlike existing datasets predominantly designed for training task-specific table models, our objective is to bridge the gap between multiple complex task-specific models and one simple generalist model that can handle all table-based tasks without extra model-design effort. To achieve this, our approach to constructing TableInstruct adheres to the following principles. First, instead of collecting multiple datasets from highly homogeneous tasks, we diversify the tasks and table types. We pick representative table-based tasks that require different model abilities, such as table interpretation, table augmentation, table QA and table fact verification, over Wikipedia tables and spreadsheets from statistical scientific reports. Second, we select realistic tasks and construct high-quality instruction data in a unified fashion without simplifying assumptions (see the “in-domain” part of Section 2.1). TableInstruct will support powerful modeling and realistic evaluation approaches, ensuring a valuable and practical dataset for research.

2.1 Data Collection

TableInstruct incorporates samples from 14 table-based datasets of 11 distinctive tasks (Table 1). We separate them and select 8 datasets of 8 tasks for training and in-domain evaluation. We leave the other 6 datasets of 4 tasks as held-out unseen datasets for out-of-domain evaluation.
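The split described above can be written down as a small data structure. The sketch below transcribes the task and dataset names from Table 1; the tuple layout and the helper names `split_counts` and `held_out_datasets` are illustrative assumptions, not part of the released code.

```python
from collections import Counter

# (task name, dataset, in_domain) rows transcribed from Table 1.
# The tuple layout is an illustrative assumption.
TABLEINSTRUCT = [
    ("Column Type Annotation", "TURL", True),
    ("Relation Extraction", "TURL", True),
    ("Entity Linking", "TURL", True),
    ("Schema Augmentation", "TURL", True),
    ("Row Population", "TURL", True),
    ("Hierarchical Table QA", "HiTab", True),
    ("Highlighted Cells QA", "FeTaQA", True),
    ("Fact Verification", "TabFact", True),
    ("Hybrid Table QA", "HybridQA", False),
    ("Table QA", "WikiSQL", False),
    ("Table QA", "WikiTQ", False),
    ("Fact Verification", "FEVEROUS", False),
    ("Table Grounded Dialogue Generation", "KVRET", False),
    ("Highlighted Cells Description", "ToTTo", False),
]


def split_counts(rows):
    """Count how many (task, dataset) rows fall into each evaluation setting."""
    counts = Counter("in_domain" if in_domain else "out_of_domain"
                     for _, _, in_domain in rows)
    return dict(counts)


def held_out_datasets(rows):
    """Return the datasets that are never seen during training."""
    return [dataset for _, dataset, in_domain in rows if not in_domain]
```

Calling `split_counts(TABLEINSTRUCT)` gives 8 in-domain and 6 out-of-domain rows, matching the counts stated above.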

Task category: Tasks in TableInstruct can be categorized into several groups: table interpretation, table augmentation, question answering, fact verification, dialogue generation, and data-to-text. Table interpretation aims to uncover the semantic attributes of the data contained in relational tables and to transform this information into machine-understandable knowledge. Table augmentation expands partial tables with additional data. Question answering aims to obtain the answer with tables and optional highlighted cells or passages as evidence. Fact verification determines whether the tables support or refute the claims. Dialogue generation generates a response grounded on the table and dialogue history. Data-to-text generates a description based on the highlighted cells. We choose table interpretation, table augmentation, some representative question answering tasks and one fact verification dataset as in-domain datasets for training the model. We hold out the rest of the datasets, spanning question answering, fact verification, dialogue generation and data-to-text, as out-of-domain datasets. By training on tasks that require models to learn more fundamental table understanding abilities, such as table interpretation and table augmentation, we hope the model can demonstrate generalization ability on out-of-domain datasets such as high-level table QA and table cell description tasks.

In-domain: The tasks for training the generalist table model include column type annotation (deng2020turl), relation extraction (deng2020turl), entity linking (deng2020turl), row population (deng2020turl), schema augmentation (deng2020turl), hierarchical table QA (cheng-etal-2022-hitab), highlighted cells QA (Nan2021FeTaQAFT), and table fact verification (Chen2020TabFact:). These tasks require the model to understand the semantics of table columns, the relations between table column pairs, and the semantics of table cells, and to acquire the reasoning ability needed to answer table-related questions and verify facts. For each task, we intentionally pick datasets with realistic task complexity and without simplifying assumptions. For example, column type annotation and relation extraction are in essence multi-choice classification tasks. We use real-world column semantic types and relation types from Freebase (freebase), which contains hundreds of complex choices such as “government.politician.partygovernment.political_party_tenure.party”, shown in the relation extraction example in the appendix. For entity linking, the referent entities are from real-world Wikidata (vrandevcic2014wikidata), which contains hundreds of complex metadata entries, such as “<2011-12 Melbourne Victory season [DESCRIPTION] Association football club 2011/12 season for Melbourne Victory [TYPE] SoccerClubSeason>”, shown in the entity linking example in the appendix. For schema augmentation and row population, there are a huge number of candidates that LLMs need to rank. For hierarchical table QA, all the tables have intricate structures with multi-level column names and row names. In addition, the task is intensive in numerical reasoning, which requires LLMs to understand table structure, identify related cells and do calculations. By doing so, we hope to enable LLMs to become truly powerful generalist models that can handle sophisticated table tasks, and we hope TableInstruct can be a realistic benchmark for evaluating LLMs’ abilities against specially designed table models.

Out-of-domain: A powerful generalist table model is expected not only to demonstrate strong performance on in-domain tasks, but also to generalize well to unseen tasks or unseen datasets of the same tasks. The underlying table understanding ability learned by the model should transfer to unseen tasks or datasets. We choose tasks such as table QA and cell description, which require high-level table understanding and reasoning ability, as out-of-domain datasets. We use HybridQA (chen2020hybridqa), KVRET (kvret), FEVEROUS (feverous), ToTTo (parikh-etal-2020-totto), WikiSQL (wikisql) and WikiTQ (wikitq) as 6 out-of-domain datasets to test our model’s generalization ability. HybridQA is a question answering task grounded on tables and passages. KVRET is a response generation task grounded on a table and dialogue history. ToTTo is to generate text descriptions based on highlighted table cells. FEVEROUS is a table fact verification task. WikiSQL and WikiTQ are two table QA tasks.

2.2 Task Formulation and Challenges


The primary objective of TableInstruct is to support one generalist model for all table-based tasks. As Figure 2 (a)-(c) shows, each instance in our dataset maps three components, <instruction, table input, question>, to an output. The instruction is manually designed to identify the task and give a detailed task description. We concatenate table metadata, such as the Wikipedia page title, section title and table caption, with the serialized table to form the table input. In the question, we put all the information the model needs to complete the task and prompt the model to generate an answer. For example, for the column type annotation task, as Figure 2 (a) shows, the column named “Player” needs to be annotated with its semantic types. In the format, the “instruction” gives the description of the task. The “input” contains the table-related information. Then we provide the entire candidate pool in the “question” and ask the model to choose one or multiple correct semantic types for this column.
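To make the <instruction, table input, question> format concrete, here is a minimal sketch of how one instance could be assembled. The field names “instruction”, “input” and “question” follow the text; the serialization markers (`col:`, `row i:`, `|`), the prompt template, the helper names and the example table contents are all illustrative assumptions rather than the exact format used in TableInstruct.

```python
def serialize_table(header, rows):
    """Serialize a table row by row. The 'col:' / 'row i:' markers and the
    '|' separator are illustrative assumptions, not the dataset's exact format."""
    lines = ["col: | " + " | ".join(header) + " |"]
    for i, row in enumerate(rows, start=1):
        cells = " | ".join(str(cell) for cell in row)
        lines.append(f"row {i}: | {cells} |")
    return "\n".join(lines)


def build_instance(instruction, page_title, section_title, caption,
                   header, rows, question, output):
    """Concatenate table metadata with the serialized table as the table input."""
    metadata = (f"page title: {page_title}; section title: {section_title}; "
                f"caption: {caption}")
    return {
        "instruction": instruction,
        "input": metadata + "\n" + serialize_table(header, rows),
        "question": question,
        "output": output,
    }


def build_prompt(instance):
    """Render an instance as one prompt string (hypothetical template)."""
    return (f"Instruction:\n{instance['instruction']}\n\n"
            f"Input:\n{instance['input']}\n\n"
            f"Question:\n{instance['question']}\n\n"
            f"Response:\n")


# Hypothetical column type annotation instance for a column named "Player".
example = build_instance(
    instruction="This is a column type annotation task. Choose the correct "
                "semantic types for the selected column from the candidates.",
    page_title="Example Team roster",
    section_title="Roster",
    caption="Current roster",
    header=["No.", "Player", "Position"],
    rows=[[1, "Alice Example", "Guard"], [2, "Bob Example", "Center"]],
    question="The column 'Player' contains: Alice Example, Bob Example. "
             "Candidates: sports.pro_athlete, people.person, location.location.",
    output="sports.pro_athlete, people.person",
)
```

In this sketch the candidate pool lives in the question, as described above, so the same instruction and table input can be reused across questions about different columns.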

Challenges. Since we select realistic tasks and tables, the table length can vary from several rows to thousands of rows. Besides, for tasks that are essentially multi-choice classification or ranking, the entire candidate pool can be very large, with up to thousands of candidates. Furthermore, as the candidates are from real-world Freebase (freebase) and Wikidata (vrandevcic2014wikidata), each candidate is long; for example, “<2011-12 Melbourne Victory season [DESCRIPTION] Association football club 2011/12 season for Melbourne Victory [TYPE] SoccerClubSeason>” is one candidate for entity linking. These characteristics not only make the tasks harder for the model to learn, but also introduce the challenge of handling long contexts.
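The entity linking candidate quoted above packs a name, a description and a type into one string. The following sketch splits such a candidate into its fields and rebuilds it, assuming the candidate is delimited by angle brackets and uses the `[DESCRIPTION]` and `[TYPE]` markers exactly as in the example; the function names are hypothetical.

```python
import re

# Pattern for "<name [DESCRIPTION] description [TYPE] type>".
# Assumes the markers appear once each, in this order.
CANDIDATE_RE = re.compile(
    r"^<(?P<name>.*?) \[DESCRIPTION\] (?P<description>.*?) \[TYPE\] (?P<type>.*)>$"
)


def parse_candidate(text):
    """Split one candidate string into name, description and type."""
    match = CANDIDATE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a well-formed candidate: {text!r}")
    return match.groupdict()


def format_candidate(name, description, type_):
    """Inverse of parse_candidate."""
    return f"<{name} [DESCRIPTION] {description} [TYPE] {type_}>"


candidate = ("<2011-12 Melbourne Victory season [DESCRIPTION] Association "
             "football club 2011/12 season for Melbourne Victory "
             "[TYPE] SoccerClubSeason>")
fields = parse_candidate(candidate)
```

A single candidate of this kind already spans more than a dozen words, which illustrates why a pool of hundreds or thousands of candidates quickly produces very long inputs.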


3 Experiments


We evaluate the ability of five LLMs to complete table tasks: Alpaca, Llama-2, Llama-2-Chat, Vicuna-1.5, and Tulu. Alpaca is an instruction-tuned model derived from Llama-1. Llama-2 is a more powerful pretrained model, which has demonstrated better performance than Llama-1. Llama-2-Chat is a fine-tuned version of Llama-2 that is optimized for dialogue use cases; it is trained with alignment techniques such as instruction tuning and RLHF. Vicuna-1.5 is the latest Vicuna version, obtained by fine-tuning Llama 2 on user-shared conversations collected from ShareGPT. Tulu is a Llama model fine-tuned on a mixture of instruction datasets such as FLAN V2, CoT, Dolly, Open Assistant 1, GPT4-Alpaca, Code-Alpaca, and ShareGPT. All of these models belong to the LLaMA family and use a decoder-only architecture. Due to limited computing resources, we only evaluate the 7B models of the LLaMA family and leave larger models for future work. We also train with substantially reduced data, using 100 and 1000 training examples per task, to examine how well LLMs perform with limited training data.

Evaluation Metrics. We follow existing work in our choice of metrics. For multi-label classification tasks (column type annotation and relation extraction), we use precision, recall and F1. For entity linking, which has only one ground truth, and for fact verification, we use accuracy. For ranking tasks (row population and schema augmentation), we use MAP (mean average precision). For free-form QA, we use sacreBLEU, ROUGE-1, ROUGE-2, ROUGE-L and METEOR. For hierarchical table QA, we use execution accuracy.
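As a reference for the classification and ranking metrics named above, the sketch below computes precision, recall and F1 over multi-label predictions and mean average precision over ranked lists. Two choices here are assumptions, since the text does not specify them: the classification scores are micro-averaged over all instances, and average precision is normalized by the number of relevant items. The official evaluation scripts of each dataset may differ.

```python
def precision_recall_f1(predictions, references):
    """Micro-averaged precision, recall and F1 for multi-label outputs.
    Each element of predictions/references is an iterable of labels."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, references):
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)   # predicted and correct
        fp += len(pred - gold)   # predicted but wrong
        fn += len(gold - pred)   # correct but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k where a relevant item appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(rankings, relevants):
    """Mean of average_precision over all queries."""
    scores = [average_precision(r, g) for r, g in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0
```

For instance, a ranking `["x", "a", "b"]` with relevant items `{"a", "b"}` has precision 1/2 at rank 2 and 2/3 at rank 3, so its average precision is (1/2 + 2/3) / 2.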

Training and Inference Details. For dynamic table segmentation, the reserved instruction length is 50 for free-form QA, 500 for entity linking, and 100 for the other datasets. The offset for all tasks is 200. The reserved table metadata length for all tasks is 20. We train LLaMA-family models for 1 epoch when using the entire 968k training set, and for 3 epochs when using 100 and 1000 training examples. The candidate size in the instructions for column type annotation, relation extraction and entity linking is 10. The subset size for row population is 20. The max generation length is 512 for row population, 128 for free-form QA and schema augmentation, and 64 for the other tasks.
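The reserved lengths above can be collected into a small configuration. The sketch below is only an illustration: the numbers come from the paragraph, but the task keys, the `context_length` parameter and, in particular, the rule that the table budget is the context length minus the reserved instruction length, metadata length and offset are assumptions, since the segmentation procedure itself is not spelled out here.

```python
# Values taken from the paragraph above; the key names are hypothetical.
RESERVED_INSTRUCTION = {"free_form_qa": 50, "entity_linking": 500}
DEFAULT_RESERVED_INSTRUCTION = 100
OFFSET = 200
RESERVED_METADATA = 20

MAX_GENERATION = {"row_population": 512, "free_form_qa": 128,
                  "schema_augmentation": 128}
DEFAULT_MAX_GENERATION = 64


def table_token_budget(task, context_length):
    """Tokens left for the serialized table.
    ASSUMPTION: the budget is what remains after subtracting the reserved
    instruction length, the metadata length and the offset."""
    reserved = RESERVED_INSTRUCTION.get(task, DEFAULT_RESERVED_INSTRUCTION)
    budget = context_length - reserved - RESERVED_METADATA - OFFSET
    return max(budget, 0)


def max_generation_length(task):
    """Maximum number of generated tokens for a task."""
    return MAX_GENERATION.get(task, DEFAULT_MAX_GENERATION)
```

Under this assumed rule, a hypothetical 2048-token context would leave 1328 tokens for the table in entity linking and 1778 tokens in free-form QA.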

4 Results and Analysis

Table 2: Test set performance on eight tasks for five LLMs. All LLMs have 7B parameters. ‘100’ and ‘1000’ refer to the number of training examples per task; ‘full’ refers to the entire training set in our constructed data. Percentages marked in blue (suffixed with ‘x’) give the ratio of the training size to the training size used to fine-tune the SOTA models. Percentages marked in red give the ratio of the performance to the SOTA performance. All models are trained in a unified fashion, in which we merge the data from all tasks to train the model. The SOTA performance for column type annotation is reproduced by us using the SOTA model on our test set.
Alpaca Llama-2 Vicuna-1.5 Tulu Llama-2-Chat SOTA
100 1000 full full
Column Type Annotation Precision 24.30 67.75 (0.08%x) (71%) 88.74 89.26 88.84 89.48 (5%x)(94%) 95.15
Recall 30.16 64.94 (0.08%x)(69%) 87.40 (5%x) (92%) 86.75 87.13 86.60 94.54
F1 26.91 66.31(0.08%x)(71%) 88.07 (5%x) (94%) 87.99 87.98 88.02 93.94
Relation Extraction Precision 32.28 59.39 (0.2%x)(63%) 91.53 93.07 (40%x)(98%) 92.43 92.45 94.57
Recall 5.54 43.08 (0.2%x)(45%) 73.19 (40%x) (77%) 73.00 72.14 72.28 95.25
F1 9.46 49.94 (0.2%x)(53%) 81.34 81.83 (40%x) (86%) 81.03 81.13 94.91
Entity Linking EM 16.69 62.57 71.91 69.57 71.41 67.91
Relax EM? 23.34 69.73 81.77 82.2 80.41 79.94
Schema Augmentation MAP 42.06 48.84 (0.3%x)(63%) 68.10 (10%x)(88%) 67.74 67.39 67.89 77.55
Row Population MAP 18.55 24.79 25.99 29.09 53.78 45.32
Hierarchical Table QA Exec Acc 30.43 (68%) 36.74 (14%x)(81%) 34.47 39.58 (100%x)(88%) 37.63 34.91 45.1
Free-form QA sacreBLEU 30.90 (92%) 35.77 (14%x)(107%) 36.62 36.79 36.66 36.93 (100%x) (110%) 33.44
ROUGE-1 64.51 (99%) 69.10 (14%x)(106%) 71.11 71.13 71.32 (100%x)(109%) 71.27 65.21
ROUGE-2 41.51 (96%) 45.87 (14%x)(106%) 48.43 48.56 48.73 (100%x)(113%) 48.61 43.09
ROUGE-L 53.22 (96%) 57.16 (14%x)(103%) 59.82 59.71 59.97 60.20 (100%x)(109%) 55.31
METEOR 58.64 (114%) 63.06 (14%x)(123%) 63.55 63.77 63.84 (100%x)(125%) 63.79 51.23
Fact Verification Accuracy 50.00 51.58 (1%x)(61%) 66.93 64.40 64.51 68.35 (20%x)(81%) 84.2

4.1 Overall Performance

Table 2 shows the five instruction-following LLMs’ performance on our benchmark. Comparing the full-training-size performance of the five LLMs, we find that: (1) No single model is best across all tasks. (2) Performance is similar across LLMs for most tasks, except for row population. As these LLMs are either not instruction tuned or instruction tuned on different datasets, we hypothesize that the different datasets help models gain different abilities related to our tasks, so that different models favor different tasks. However, the similar performance indicates that these learned abilities do not matter much for most tasks: Llama-2 is only pretrained, without any instruction tuning, yet the other LLMs do not perform significantly better than Llama-2. (3) With no more than 1.4% of the training data, LLMs can achieve more than 60% of SOTA performance on most tasks; with 5%-40% of the training data, LLMs can achieve more than 80% of SOTA performance, and some even outperform the SOTA.
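The performance ratios quoted in finding (3) and in Table 2 are the model score divided by the SOTA score, expressed as a percentage. A minimal worked example, using scores copied from Table 2 (the function name is hypothetical, and rounding to the nearest integer is an assumption that happens to reproduce the reported figures):

```python
def percent_of_sota(score, sota):
    """Model score as a percentage of the SOTA score, rounded to an integer."""
    return round(100 * score / sota)


# Column type annotation, precision: 67.75 and 89.48 against a SOTA of 95.15.
cta_low = percent_of_sota(67.75, 95.15)
cta_full = percent_of_sota(89.48, 95.15)

# Hierarchical table QA, execution accuracy: 36.74 and 39.58 against 45.1.
hitab_low = percent_of_sota(36.74, 45.1)
hitab_full = percent_of_sota(39.58, 45.1)

# Free-form QA, sacreBLEU: 35.77 against 33.44, i.e. above the SOTA.
fetaqa = percent_of_sota(35.77, 33.44)
```

These evaluate to 71, 94, 81, 88 and 107 respectively, which match the percentages printed next to the corresponding scores in Table 2.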

4.2 Analysis of Different Tasks

Column Type Annotation. We investigate which factors affect LLMs’ performance on classification tasks with a large number of categories. Specifically, we study how the ratio of positive to negative candidates (Pos:Neg), the order of instruction and table input, and the addition of other tasks affect the column type annotation task. We experiment with four settings: (1) Pos:Neg = 1:3, first instruction then table input (2) Pos:Neg = 1:10, first table input then instruction (3) Pos:Neg = 1:3, first instruction then table input (4) Pos:Neg = 1:3, unified model, first instruction then table input.

Table 3: Column Type Annotation: controlled experiments on Alpaca.
| Setting | Precision | Recall | F1 |
| --- | --- | --- | --- |
| (1) | 81.60 | 85.08 | 83.30 |
| (2) | 88.49 | 87.55 | 88.02 |
| (3) | 81.17 | 92.58 | 86.50 |
| (4) | 81.36 | 92.14 | 86.41 |

Comparing (1) with (2), we can see that the ratio of the positive class to the negative class matters: with more negative training examples, precision rises from 81.60 to 88.49 and recall from 85.08 to 87.55. Comparing (1) with (3), we can see that the order of the instruction and the table matters: recall improves substantially while precision is maintained. Since (liu2023lost) finds that LLMs’ performance is often highest when the relevant information occurs at the beginning or the end of the input context, our observation may arise because the candidates are at the end of the instruction, so the model pays more attention to them and selects correct candidates more easily. Comparing (1) with (4), we can see that merging other table tasks matters: it substantially increases recall for the column type annotation task.
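The two factors varied in these settings, the Pos:Neg ratio and the order of instruction and table, can be illustrated with a short sketch. Everything below is an assumption made for illustration: the sampling procedure, the function names, and the choice to append the candidates to the end of the instruction (which follows the description above that the candidates sit at the end of the instruction).

```python
import random


def sample_candidates(positives, label_pool, neg_per_pos, seed=0):
    """Mix the gold labels with sampled negatives at a 1:neg_per_pos ratio.
    ASSUMPTION: negatives are drawn uniformly at random from the label pool."""
    rng = random.Random(seed)
    gold = set(positives)
    negatives_pool = [label for label in label_pool if label not in gold]
    k = min(len(negatives_pool), neg_per_pos * len(positives))
    candidates = list(positives) + rng.sample(negatives_pool, k)
    rng.shuffle(candidates)  # avoid leaking the gold labels by position
    return candidates


def assemble(instruction, table_input, candidates, instruction_first=True):
    """Place the candidate list at the end of the instruction, then order
    the instruction block and the table input."""
    instruction_block = f"{instruction} Candidates: {', '.join(candidates)}"
    parts = ([instruction_block, table_input] if instruction_first
             else [table_input, instruction_block])
    return "\n".join(parts)


# Hypothetical label pool of 20 semantic types.
pool = [f"type_{i}" for i in range(20)]
few_negatives = sample_candidates(["type_0"], pool, neg_per_pos=3)
many_negatives = sample_candidates(["type_0"], pool, neg_per_pos=10)
```

With one positive label, `neg_per_pos=3` yields 4 candidates and `neg_per_pos=10` yields 11, corresponding to the 1:3 and 1:10 ratios compared in Table 3.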

Entity Linking. We observe that most errors come from the model choosing an entity with a very similar name but an incorrect description. Since the description is usually highly correlated with the Wikipedia page title, we hypothesize that the model does not pay enough attention to the title and therefore chooses the wrong entity. This suggests that, in instruction design, adding hints that explicitly highlight the Wikipedia title may help the model make the correct choice.

Table QA and Fact Verification. In our experiments (Table 4), we found that after unifying different tasks into one model, the performance on the hierarchical table QA (HiTab) task drops significantly compared with instruction tuning on that single task alone (execution accuracy falls from 58.71 to 34.47). As (Gudibande et al., 2023) propose that instruction tuning mainly improves tasks that are heavily supported in the training dataset, we hypothesize that the drop is due to the much smaller training set of the hierarchical table QA task relative to the other tasks in the unified model. In addition, we show the performance of task transfer among HiTab, FeTaQA and TabFact. If the model is instruction tuned only on HiTab, its performance on FeTaQA and TabFact lags far behind the unified model (for example, 15.42 versus 66.93 accuracy on TabFact), which also demonstrates the necessity of constructing table-specific data from different tasks and training generalist models for tables.

Table 4: Performance of task transfer on Alpaca. ‘Unified’ means training a unified model with merged data from eight tasks.
Train HiTab Unified
Eval HiTab Exec Acc 58.71 34.47
FeTaQA sacreBLEU 23.47 36.62
FeTaQA ROUGE-1 19.70 71.11
FeTaQA ROUGE-2 9.23 48.43
FeTaQA ROUGE-L 18.14 59.82
FeTaQA METEOR 8.84 63.55
TabFact Accuracy 15.42 66.93

Row Population. Figure 3 shows a case study for the row population task. The left subfigure shows the initial candidate list, which is waiting to be ranked. The orange candidates are the target candidates, which are initially scattered across the candidate list of around 200 entries. The right subfigure shows the ranked list after applying our tree-rank algorithm. The target candidates are ranked at the top of the list, which demonstrates the effectiveness of our algorithm.

Figure 3: Sample figure caption.

5 Related Work

Table Representation Learning. Given the vast amount of knowledge stored in tables, various table-based tasks have been proposed (pujara2021tables), such as column type annotation (hulsebos2019sherlock), row population (zhang2017entitables), and table QA (sun2016table; wikitq; cheng-etal-2022-hitab; Nan2021FeTaQAFT). To handle semi-structured tables, existing work designs special model architectures, such as TURL with structure-aware attention (deng2020turl), TUTA with tree-based attention (wang2021tuta) and TaBERT with a vertical self-attention mechanism (yin-etal-2020-tabert), or special encodings, such as cell text encoding (yin-etal-2020-tabert; eisenschlos2021mate; wang2021tuta), table position encoding (herzig-etal-2020-tapas; wang2021tuta), and numerical encoding (wang2021tuta), to better encode the table structure and infuse more information into the neural architecture. In addition, some work focuses on table pretraining (liu2022tapex; yin-etal-2020-tabert; deng2020turl; iida-etal-2021-tabbie) to encode knowledge in large-scale tables. However, although these works have shown promising progress, they remain data-specific and downstream task-specific, and require special designs tailored to tables and table-based tasks.

Our work proposes TableInstruct to unify different table-based tasks and develops a one-for-all LLM, TableLlama, to reduce those extra modeling efforts, and evaluates its table understanding and generalization ability under both in-domain and out-of-domain settings. This high-level insight is similar to UnifiedSKG (xie2022unifiedskg), which unifies a diverse set of structured knowledge grounding tasks into a text-to-text format and enhances the T5 model’s performance via multi-task fine-tuning. However, UnifiedSKG deals with different knowledge sources such as databases, knowledge graphs and web tables and does not explore instruction tuning, while we focus on a wide range of realistic tasks based on real-world tables via instruction tuning. In addition, a concurrent work (li2023tablegpt) synthesizes diverse table-related tasks and fine-tunes closed-source LLMs such as GPT-3.5 via instruction tuning. Compared to theirs, we collect more realistic and complex task data, such as HiTab, as well as classification and ranking tasks with candidates from Freebase and Wikidata, and we develop open-source LLMs for table-based tasks. We believe that both our high-quality table instruction tuning dataset and the trained model can be valuable resources for facilitating this line of research.

Instruction Tuning. Instruction tuning, which trains LLMs on <instruction, output> pairs in a supervised fashion, is a crucial technique for enhancing the capabilities and controllability of LLMs (chung2022scaling; wang-etal-2022-super; mishra-etal-2022-cross). The instructions constrain the model’s outputs to align with the desired response characteristics or domain knowledge and can help LLMs rapidly adapt to a specific domain without extensive retraining or architecture design (zhang2023instruction). Therefore, different instruction tuning datasets have been proposed to guide LLMs’ behaviors (wang-etal-2022-super; honovich2022unnatural; longpre2023flan; xu2023wizardlm; yue2024mammoth). These datasets are built either by formatting existing natural language processing tasks with templates (longpre2023flan) or by prompting ChatGPT (xu2023wizardlm) and GPT-4 (gpt4llm) to generate instructions. Different instruction-tuned models such as InstructGPT (instructgpt), Vicuna (vicuna) and Claude (https://www.anthropic.com/index/introducing-claude) have emerged and demonstrate boosted performance compared with the pre-trained models. In addition, instruction tuning has been applied to different modalities such as images, videos and audio (li2023blip2) and has shown promising results. This signals that instruction tuning can be a promising technique for enabling large pre-trained models to handle various tasks. However, how to use instruction tuning to guide LLMs to complete table-based tasks is still under-explored. Our work fills this gap by constructing a high-quality table instruction tuning dataset, TableInstruct, which covers large-scale, diverse and realistic tables and tasks to enable both modeling and evaluation. We also release TableLlama, an open-source LLM-based generalist model fine-tuned on TableInstruct, to promote this avenue of research.
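To make the notion of an <instruction, output> pair concrete for a table task, the sketch below turns a small table and a question into one training record. It is an illustration under stated assumptions: the field names (`instruction`, `input`, `question`, `output`), the pipe-delimited serialization, the helper names, and the toy table are all hypothetical and should not be read as the TableInstruct schema.

```python
import json


def serialize_table(header, rows):
    """Flatten a table into pipe-delimited text, one row per line."""
    lines = ["| " + " | ".join(header) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def make_example(instruction, header, rows, question, answer):
    """Build one instruction tuning record (hypothetical field names)."""
    return {
        "instruction": instruction,
        "input": serialize_table(header, rows),
        "question": question,
        "output": answer,
    }


# Toy table, invented purely for illustration.
example = make_example(
    instruction="Answer the question using the table.",
    header=["Item", "Count"],
    rows=[["apples", 3], ["pears", 5]],
    question="Which item has the larger count?",
    answer="pears",
)
print(json.dumps(example, indent=2))
```

Keeping every task in the same record shape is what allows data from different table tasks to be merged into a single training set for a unified model.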

6 Conclusion

This paper makes the first step towards developing open-source large generalist models for a diversity of table-based tasks. Towards that end, we construct TableInstruct, a comprehensive dataset for instruction tuning and evaluating LLMs for tables, and develop the first open-source generalist model for tables, TableLlama, by fine-tuning Llama 2 (7B) with LongLoRA to address the context length challenge. We evaluate under both in-domain and out-of-domain settings, and the experiments show that TableLlama has gained strong table understanding and generalization abilities. On 7 out of 8 in-domain tasks, our generalist model TableLlama achieves comparable or better performance than the existing SOTA method for each task, even though the latter often relies on table-specific model design or pre-training. On 6 out-of-domain datasets, it achieves 5-44 absolute point gains compared with the base model, showing that training on TableInstruct enhances generalizability.