On the Potential of Lexico-logical Alignments
for Semantic Parsing to SQL Queries
Abstract
Large-scale semantic parsing datasets annotated with logical forms have enabled major advances in supervised approaches. But can richer supervision help even more? To explore the utility of fine-grained, lexical-level supervision, we introduce Squall, a dataset that enriches WikiTableQuestions English-language questions with manually created sql equivalents plus alignments between sql and question fragments. Our annotation enables new training possibilities for encoder-decoder models, including approaches from machine translation previously precluded by the absence of alignments. We propose and test two methods: (1) supervised attention; (2) adopting an auxiliary objective of disambiguating references in the input queries to table columns. In 5-fold cross validation, these strategies improve over strong baselines by execution accuracy. Oracle experiments suggest that annotated alignments can support further accuracy gains of up to .
1 Introduction
The availability of large-scale datasets pairing natural utterances with logical forms (dahl1994expanding; wang+15i; zhong2017seq2sql; yu2018spider, inter alia) has enabled significant progress on supervised approaches to semantic parsing (jia2016recombination; xiao+16; dong2016language; dong-lapata-2018-coarse, inter alia). However, the provision of logical forms alone does not indicate important fine-grained relationships between individual words or phrases and logical form tokens. This is unfortunate because researchers have in fact hypothesized that the lack of such alignment information hampers progress in semantic parsing (zhang+19, pg. 80).
We address this lack by introducing Squall (“SQL+QUestion pairs ALigned Lexically”), the first large-scale semantic-parsing dataset with manual lexical-to-logical alignments; and we investigate the potential accuracy boosts achievable from such alignments. The starting point for Squall is WikiTableQuestions (WTQ; pasupat2015compositional), containing data tables, English questions regarding the tables, and table-based answers. We manually enrich the -instance subset of WTQ’s training data that is translatable to sql by providing expert annotations, consisting not only of target logical forms in sql, but also labeled alignments between the input question tokens (e.g., “how many”) and their corresponding sql fragments (e.g., COUNT()). Figure 1 shows two Squall instances.
These new data enable training of encoder-decoder neural models that incorporate manual alignments. Consider the bottom example in Figure 1: A decoder can benefit from knowing that ORDER BY LIMIT 1 comes from “the highest” (where rank 1 is best); and an encoder should match “who” with the “athlete” column even though the two strings have no overlapping tokens. We implement these ideas with two training strategies:
1. Supervised attention that guides models to produce attention weights mimicking human judgments during both encoding and decoding. Supervised attention has improved both alignment and translation quality in machine translation (liu+16; mi+16), but has only been applied in semantic parsing to heuristically generated alignments (rabinovich17) due to the lack of manual annotations.
2. Column prediction that infers which column in the data table a question fragment refers to.
Using BERT features, our models reach execution accuracy on the WTQ test set, surpassing the previous weakly-supervised state-of-the-art (where weak supervision means access to only the answer, not the logical form of the question). More germane to the issue of alignment utility, in 5-fold cross validation, our additional fine-grained supervision improves execution accuracy by over models supervised with only logical forms; ablation studies indicate that mappings between question tokens and columns help the most. Additionally, we construct oracle models that have access to the full alignments during test time to show the unrealized potential for our data, seeing improvements of up to absolute logical form accuracy.
Through annotation-cost and learning-curve analysis, we conclude that lexical alignments are cost-effective for training parsers: annotating lexical alignments takes less than half the time that annotating a logical form does, and we can improve execution accuracy by percentage points by aligning merely of the logical forms in the training set.
Our contributions are threefold: 1) we release a high-quality semantic parsing dataset with manually-annotated logical forms; 2) we label the alignments between the English questions and the corresponding logical forms to provide additional supervision; 3) we propose two training strategies that use our alignments to improve strong base models. Our dataset and code are publicly available at https://www.github.com/tzshi/squall.
2 Task: Table-based Semantic Parsing
Our task is to answer questions about structured tables through semantic parsing to logical forms (LFs). Formally, the input $x$ consists of a question $q$ about a table $T$, and the goal of a semantic parser is to reproduce the target LF $y^*$ for $q$ (and thus have high LF accuracy) or, in a less strict setting, to generate any query LF $y$ that, when executed against $T$, yields the correct output $z$ (and thus have high execution accuracy).
In a weakly supervised setting, training examples consist only of input-answer pairs $(x, z)$. Recent datasets (zhong2017seq2sql; yu2018spider, inter alia) provide enough logical forms, i.e., $(x, y^*)$ training pairs, to learn mappings from $x$ to $y$ in a supervised setting. Unsurprisingly, supervised models are more accurate than weakly supervised ones. However, training supervised models is still challenging: both $x$ and $y$ are structured, so models typically generate $y$ in multiple steps, but the training data cannot reveal which parts of $x$ generate which parts of $y$ and how they are combined.
Just as adding supervised training improves accuracy over weak supervision, we explore whether even finer-grained supervision further helps. Since no large-scale datasets furnishing fine-grained supervision exist (to the best of our knowledge), we introduce Squall.
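The difference between the two evaluation settings described in this section can be made concrete with a small sketch. It assumes an in-memory SQLite table with made-up contents; the helper names `lf_match` and `execution_match` are hypothetical and are not taken from the released code.

```python
import sqlite3

def normalize(sql):
    """Collapse whitespace so that formatting differences do not matter."""
    return " ".join(sql.split())

def lf_match(pred_sql, gold_sql):
    """Strict setting: the predicted LF must equal the target LF."""
    return normalize(pred_sql) == normalize(gold_sql)

def execution_match(conn, pred_sql, gold_answer):
    """Lenient setting: any LF whose execution yields the gold answer counts."""
    try:
        rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # an unexecutable query counts as wrong
    return sorted(rows) == sorted(gold_answer)

# Toy table (illustrative data only): table name w, columns c1, c2_number, plus id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE w (id INTEGER, c1 TEXT, c2_number REAL)")
conn.executemany("INSERT INTO w VALUES (?, ?, ?)",
                 [(1, "a", 3.0), (2, "b", 7.0), (3, "c", 5.0)])

gold_sql = "SELECT MAX(c2_number) FROM w"
gold_answer = [(7.0,)]
pred_sql = "SELECT c2_number FROM w ORDER BY c2_number DESC LIMIT 1"

print(lf_match(pred_sql, gold_sql))                 # strict metric
print(execution_match(conn, pred_sql, gold_answer)) # lenient metric
```

Here the prediction differs from the target LF as a string, so it fails the strict check, yet it returns the same answer and passes the execution check.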
3 Squall: Our New Dataset
Squall is based on WikiTableQuestions (WTQ; pasupat2015compositional). WTQ is a large-scale question-answering dataset that contains diverse and challenging crowd-sourced question-answer pairs over semi-structured Wikipedia tables. Most of the questions are more than simple table-cell look-ups and are highly compositional, a fact that motivated us to study lexical mappings between questions and logical forms. We hand-generate sql equivalents of the WTQ queries and align question tokens with corresponding sql query fragments. (Footnote: sql is a widely adopted formalism. Other formalisms, including LambdaDCS (pasupat2015compositional), have been used on WTQ. sql and LambdaDCS can express roughly the same percentage of queries: (our finding) vs. (analysis of a -question sample by pasupat-liang2016). We leave automatic conversion between sql and other formalisms to future work.) We leave lexical alignments of other text-to-sql datasets and cross-dataset model generalization (alane2020explor) to future work.
3.1 Data Annotation
We annotated WTQ’s training fold in three stages: database construction, sql query annotation, and alignment. Two expert annotators familiar with sql annotated half of the dataset each and then checked each other’s annotations and resolved all conflicts via discussion. See the appendix for the annotation guidelines.
Database Construction
Tables encode semi-structured information. Each table column usually contains data of the same type: e.g., text, numbers, dates, etc., as is typical in relational databases. While pre-processing the WTQ tables, we considered both basic data types (e.g., raw text, numbers) and composite types (e.g., lists, binary tuples), and we suffixed column names with their inferred data types (e.g., _number in Figure 1). For annotation consistency, all tables were assigned the same name w and columns were given the sequential names c1, c2, … in the database schema, but we kept the original table headers for feature extraction. We additionally added a special column id to every table denoting the linear order of its rows. See the appendix for details.
Conversion of Queries to sql
For every question in WTQ’s training fold, we manually created its corresponding sql query, choosing the shortest when there were multiple possibilities. For instance, we wrote “SELECT MAX(c1) FROM w” instead of “SELECT c1 FROM w ORDER BY c1 DESC LIMIT 1”. An exception is that we opted for less table structure-dependent versions even if their complexity was higher. As an example, if the table listed games (c2) pre-sorted by date (c1), and the question was “what is the next game after A?”, we wrote “SELECT c2 FROM w WHERE c1 > (SELECT c1 FROM w WHERE c2 = A) ORDER BY c1 LIMIT 1” instead of “SELECT c2 FROM w WHERE id = (SELECT id FROM w WHERE c2 = A) + 1”. Out of questions spanning tables, Squall provided sql queries for questions, or . The remaining consisted of questions with non-deterministic answers (e.g., “show me an example of …”), questions requiring additional pre-processing (e.g., looking up a date inside a text-based details column), and cases where sql queries would be insufficiently expressive (e.g., “what team has the most consecutive wins?”).
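The preference for queries that depend less on table structure can be illustrated with a minimal sketch. The table contents, the ISO-formatted date strings, and the helper names `make_table` and `answer` are all assumptions made for illustration. The sketch runs both formulations of “what is the next game after A?” on a table that is pre-sorted by date and on one that is not.

```python
import sqlite3

def make_table(rows):
    """Build a toy table w with the row-order column id, a date c1, and a game c2."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE w (id INTEGER, c1 TEXT, c2 TEXT)")
    conn.executemany("INSERT INTO w VALUES (?, ?, ?)", rows)
    return conn

def answer(conn, sql):
    return conn.execute(sql).fetchall()

# Value-based formulation: relies only on the date values in c1.
BY_VALUE = ("SELECT c2 FROM w WHERE c1 > (SELECT c1 FROM w WHERE c2 = 'A') "
            "ORDER BY c1 LIMIT 1")
# Row-order formulation: relies on the table being pre-sorted by date.
BY_ROW_ID = "SELECT c2 FROM w WHERE id = (SELECT id FROM w WHERE c2 = 'A') + 1"

# Illustrative data: ISO date strings compare correctly as text.
sorted_rows = [(1, "2020-01-05", "A"), (2, "2020-01-12", "B"), (3, "2020-01-19", "C")]
unsorted_rows = [(1, "2020-01-12", "B"), (2, "2020-01-05", "A"), (3, "2020-01-19", "C")]

for name, rows in [("sorted", sorted_rows), ("unsorted", unsorted_rows)]:
    conn = make_table(rows)
    print(name, answer(conn, BY_VALUE), answer(conn, BY_ROW_ID))
```

On the pre-sorted table both queries return game B; once the rows are no longer in date order, only the value-based query still does.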
Alignment Annotation
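| sql fragments frequently aligned to “how long” | Question phrases frequently aligned to MAX() |
|---|---|
| col | the last |
| MAX(col)-MIN(col) | the most |
| col-col | the largest |
| COUNT(*) | the highest |
| COUNT(col) | the first |

Table 1: sql fragments frequently aligned to the question phrase “how long” (left) and question phrases frequently aligned to the sql keyword MAX() (right).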
Given a tokenized question/LF pair, the annotators selected and aligned corresponding fragments from the two sides. The selected tokens did not need to be contiguous, but they had to be units that decompose no further. For the example in Figure 1, there were three alignment pairs, where the non-contiguous “ORDER BY LIMIT 1” was treated as an atomic unit and aligned to “the highest” in the input. Additionally, not all tokens on either side needed to be aligned. For instance, sql keywords SELECT, FROM and question tokens “what”, “is”, etc. were mostly unaligned. Table 1 shows that the same question phrase was aligned to a range of sql expressions, and vice versa. Overall, of question tokens were aligned. Comparative and superlative question tokens were the most frequently aligned, while many function words were unaligned; see the appendix for part-of-speech distributions of the aligned and unaligned tokens. Except for the four keywords in the basic structure “SELECT FROM w WHERE ”, of sql keywords were aligned. The rest of the unaligned sql tokens include = (alignment ratio of ), AND () and column names (). The first two cases arose because equality checks and conjunctions of filtering conditions are often implicit in natural language.
Inter-Annotator Agreement and Annotation Cost
The two annotators’ initial sql annotation agreement in a pilot trial was and after discussion, they agreed on of data instances; similarly, alignment agreement rose from to . (Footnote: In the pilot study, the annotators independently labeled questions over the same tables. We report the percentage of cases where one annotator accepted the other annotator’s labels.) With respect to annotation speed, an average sql query took seconds to produce and an additional seconds to enrich with alignments: the cost of annotating instances with alignment enrichment was comparable to that of instances with only logical forms.
3.2 Post-processing
Literal values in the sql queries such as “25,000” in Figure 1 and “star one” in LABEL:fig:case are often directly copied from the input questions. We thus adapted WikiSQL’s (zhong2017seq2sql) task setting, where all literal values correspond to spans in the input questions. We used our alignment to generate gold selection spans, filtering out instances where literal values could not be reconstructed through fuzzy match from the gold spans. After post-processing, Squall contained table-question-answer triplets with logical form and lexical alignment annotations.
4 (State-of-the-Art) Base Model: Seq2seq with Attention and Copying
(Footnote: In the appendix, we show that on Squall, our base model is competitive with a state-of-the-art system (alane2020explor) benchmarked on the Spider dataset (yu2018spider).)
Recent state-of-the-art text-to-sql models extend the sequence-to-sequence (seq2seq) framework with attention and copying mechanisms (zhong2017seq2sql; dong2016language; dong-lapata-2018-coarse; alane2020explor, inter alia). We adopt this strong neural paradigm as our base model. The seq2seq model generates one output token at a time via a probability distribution conditioned on both the input sequence representations and the partially-generated output sequence:

$$P(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{|\mathbf{y}|} P(y_i \mid \mathbf{y}_{<i}, \mathbf{x}),$$

where $\mathbf{x}$ and $\mathbf{y}$ are the feature representations for the input and output sequences, and $<i$ denotes a prefix. The last token of $\mathbf{y}$ must be a special <STOP> token that terminates the output generation. The per-token probability distribution is modeled through Long-Short Term Memory networks (LSTMs, hochreiter-schmidhuber97) and multi-layer perceptrons (MLPs):

$$\mathbf{h}_i = \mathrm{LSTM}(\mathbf{h}_{i-1}, \mathbf{y}_{i-1}) \tag{1}$$

$$P(y_i \mid \mathbf{y}_{<i}, \mathbf{x}) = \mathrm{softmax}(\mathrm{MLP}(\mathbf{h}_i)) \tag{2}$$

The training objective is the negative log likelihood of the gold $\mathbf{y}^*$, defined for each timestep as

$$L^{\text{seq2seq}}_i = -\log P(y^*_i \mid \mathbf{y}^*_{<i}, \mathbf{x}).$$
Question and Table Encoding
An input $x$ contains a length-$n$ question $q = q_1, \ldots, q_n$ and a table with $m$ columns $c = c_1, \ldots, c_m$. The input question is represented through a bi-directional LSTM (bi-LSTM) encoder that summarizes information from both directions within the sequence. Inputs to the bi-LSTM are concatenations of word embeddings, character-level bi-LSTM vectors, part-of-speech embeddings, and named entity type embeddings. We denote the resulting feature vector associated with $q_i$ as $\mathbf{q}_i$. For column names, the representation $\mathbf{c}_j$ concatenates the final hidden states of two LSTMs running in opposite directions that take the concatenated word embeddings, character encodings, and column data type embeddings as inputs. We also experiment with pre-trained BERT feature extractors (devlin+19), where we feed the BERT model with the question and the columns as a single sequence delimited by the special [SEP] token, and we take the final-layer representations of the question words and the last token of each column as their representations.
Attention in Encoding
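To enhance feature interaction between the question and the table schema, for each question word representation $\mathbf{q}_i$, we use an attention mechanism to determine its relevant columns and calculate a linearly-weighted context vector $\tilde{\mathbf{q}}_i$ as follows:

$$a_{ij} = \mathrm{softmax}_j\left(\mathbf{q}_i^\top \mathbf{W}^{\text{att}} \mathbf{c}_j\right) \tag{3}$$

$$\tilde{\mathbf{q}}_i = \sum_j a_{ij}\, \mathbf{c}_j \tag{4}$$

Then we run another bi-LSTM by concatenating the question representation $\mathbf{q}_i$ and context representation $\tilde{\mathbf{q}}_i$ as inputs to derive a column-sensitive representation $\vec{\mathbf{q}}_i$ for each question word $q_i$. We apply a similar procedure to get the column representation $\vec{\mathbf{c}}_j$ for each column.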
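The question-to-column attention step can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the released implementation: it assumes a bilinear scoring function, and the function names, the toy vectors, and the identity weight matrix are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear(u, W, v):
    """Compute u^T W v for list-based vectors and a list-of-rows matrix."""
    return sum(u[r] * sum(W[r][k] * v[k] for k in range(len(v)))
               for r in range(len(u)))

def attend(query, keys, W):
    """Return attention weights over keys and the weighted sum of the keys."""
    weights = softmax([bilinear(query, W, k) for k in keys])
    dim = len(keys[0])
    context = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
    return weights, context

# Toy example (assumed values): one question-word vector attending over two columns.
question_word = [1.0, 0.0]
columns = [[1.0, 0.0], [0.0, 1.0]]
W_att = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix stands in for learned weights

weights, context = attend(question_word, columns, W_att)
print(weights, context)
```

The weights sum to one, and the context vector is the corresponding convex combination of the column vectors; in the model it would be concatenated with the question-word vector before the second bi-LSTM.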
Attention in Decoding
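During decoding, to allow LSTMs to capture long-distance dependencies from the input, we add attention-based features to the recurrent feature definition of Eq. (1):

$$b_{ij} = \mathrm{softmax}_j\left(\mathbf{h}_{i-1}^\top \mathbf{W}^{\text{dec}} \vec{\mathbf{q}}_j\right), \qquad \mathbf{v}_i = \sum_j b_{ij}\, \vec{\mathbf{q}}_j \tag{5}$$

$$\mathbf{h}_i = \mathrm{LSTM}(\mathbf{h}_{i-1}, [\mathbf{y}_{i-1}; \mathbf{v}_i]) \tag{6}$$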
sql Token Prediction with Copying Mechanism
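Since each output token $y_i$ can be an sql keyword, a column name or a literal value, we factor the probability defined in Eq. (2) into two components: one that decides the type $t_i \in \{\text{KEY}, \text{COL}, \text{STR}\}$ of $y_i$:

$$P(t_i \mid \mathbf{y}_{<i}, \mathbf{x}) = \mathrm{softmax}(\mathrm{MLP}^{\text{type}}(\mathbf{h}_i)),$$

and another that predicts the token conditioned on the type $t_i$. For token type KEY, we predict the keyword token with another MLP:

$$P(y_i \mid t_i = \text{KEY}, \mathbf{y}_{<i}, \mathbf{x}) = \mathrm{softmax}(\mathrm{MLP}^{\text{KEY}}(\mathbf{h}_i)).$$

For COL and STR tokens, the model selects directly from the input column names $c$ or question $q$ via a copying mechanism. We define a probability distribution with softmax-normalized bilinear scores:

$$P(y_i = c_j \mid t_i = \text{COL}, \mathbf{y}_{<i}, \mathbf{x}) = \mathrm{softmax}_j\left(\mathbf{h}_i^\top \mathbf{W}^{\text{COL}} \mathbf{c}_j\right).$$

Similarly, we define literal string copying from $q$ with another bilinear scoring matrix $\mathbf{W}^{\text{STR}}$.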
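The two-step factorization (first the token type, then the token given the type) can be checked numerically with a small sketch. The raw scores below stand in for the outputs of the MLPs and bilinear scorers and are made up; `token_distribution` is a hypothetical helper, and literal copying is simplified to choosing a single question position.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def token_distribution(type_scores, key_scores, col_scores, str_scores):
    """Combine P(type) and P(token | type) into one distribution.

    Returns a dict mapping (type, index) to probability, where index points
    into the keyword vocabulary, the column list, or the question positions.
    """
    p_type = dict(zip(("KEY", "COL", "STR"), softmax(type_scores)))
    within = {
        "KEY": softmax(key_scores),  # keyword vocabulary
        "COL": softmax(col_scores),  # copy a column name
        "STR": softmax(str_scores),  # copy from the question (simplified)
    }
    return {(t, i): p_type[t] * p
            for t, probs in within.items()
            for i, p in enumerate(probs)}

# Assumed toy scores: uniform over types, 2 keywords, 3 columns, 4 question positions.
dist = token_distribution([0.0, 0.0, 0.0], [0.0, 0.0], [1.0, 0.0, 0.0], [0.0] * 4)
print(sum(dist.values()))
print(max(dist, key=dist.get))
```

Because each factor is normalized separately, the combined probabilities over all keywords, columns, and question positions still sum to one.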
5 Using Alignments in Model Training
The model design in §4 includes many latent interactions within and across the encoder and the decoder. We now describe how our manual alignments can enable direct supervision on such previously latent interactions. Our alignments can be used as supervision for the necessary attention weights (§5.1). In an oracle experiment where we replace induced attention with manual alignments, the jump in logical form accuracy shows alignments are valuable, if only the models could reproduce them (§5.2). Moreover, alignments enable a column-prediction auxiliary task (§5.3).
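The loss function of our full model is a linear combination of the loss terms of the seq2seq model, supervised attention, and column prediction:

$$L = L^{\text{seq2seq}} + \lambda^{\text{att}} L^{\text{att}} + \lambda^{\text{CP}} L^{\text{CP}},$$

where we define $L^{\text{att}}$ and $L^{\text{CP}}$ below.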
5.1 Supervised Attention
Our annotated lexical alignments resemble our base model’s attention mechanisms. At the encoding stage, question tokens and the relevant columns are aligned (e.g., “who” ↔ column “athlete”), which should induce higher weights in both question-to-column and column-to-question attention (Eq. (3) and Eq. (4)); similarly, for decoding, annotation reflects which question words are most relevant to the current output token. Inspired by improvements from supervised attention in machine translation (liu+16; mi+16), we train the base model’s attention mechanisms to minimize the Euclidean distance between the human-annotated alignment vector $\mathbf{a}^*$ and the model-generated attention vector $\mathbf{a}$:

$$L^{\text{att}} = \frac{1}{2} \lVert \mathbf{a}^* - \mathbf{a} \rVert^2.$$

(Footnote: See the appendix for experiments with other distances.)
The vector $\mathbf{a}^*$ is a one-hot vector when the annotation aligns to a single element, or represents a uniform distribution over the subset in cases where the annotation aligns multiple elements.
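As an illustration of how the alignment targets and the attention loss fit together, here is a minimal sketch. It assumes the loss takes the half squared Euclidean form; the exact scaling and the function names `alignment_target` and `attention_loss` are assumptions, not details confirmed by the released code.

```python
def alignment_target(aligned_indices, size):
    """Build the target vector from annotated alignments.

    One-hot when a single element is aligned; uniform over the aligned
    subset when several elements are aligned.
    """
    aligned = set(aligned_indices)
    if not aligned:
        raise ValueError("no alignment annotated for this position")
    return [1.0 / len(aligned) if i in aligned else 0.0 for i in range(size)]

def attention_loss(target, predicted):
    """Half the squared Euclidean distance between target and predicted weights."""
    return 0.5 * sum((t - p) ** 2 for t, p in zip(target, predicted))

# A question word aligned to column 1 out of 3 columns.
single = alignment_target([1], 3)
# A decoder step aligned to question words 0 and 2 out of 4.
multi = alignment_target([0, 2], 4)

print(single, multi)
print(attention_loss(single, [0.2, 0.7, 0.1]))
```

Positions without any annotated alignment would simply be skipped when summing the loss, which is why the helper refuses an empty alignment set.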
5.2 Oracle Experiments with Manual Alignments
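| Attention type | Logical form accuracy (Dev) |
|---|---|
| Induced attention | |
| Oracle attention: encoder only | |
| Oracle attention: decoder only | |
| Oracle attention: encoder + decoder | |

Table 2: Logical form accuracies with induced versus oracle attention.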
To present the potential of alignment annotations for models with supervised attention, we first assume a model that can flawlessly reproduce our annotations within the base model. During training and inference, we feed the true alignment vectors in place of the attention weights to the encoder and/or decoder. Table 2 shows the resultant logical form accuracies. Access to oracle alignments provides up to absolute higher accuracy over the base model. This wide gap suggests the high potential for training models with our lexical alignments.
5.3 Column Prediction
wang-etal-2019-learning show the importance of inferring token-column correspondence in a weakly-supervised setting; Squall enables full supervision for an auxiliary task that directly predicts the corresponding column $c_j$ for each question token $q_i$. We model this auxiliary prediction as:

$$P(q_i \text{ matches } c_j \mid q_i) = \mathrm{softmax}_j\left(\mathbf{q}_i^\top \mathbf{W}^{\text{CP}} \mathbf{c}_j\right).$$

For the corresponding loss $L^{\text{CP}}$ over tokens that match columns, we use cross-entropy.
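The cross-entropy loss for column prediction can be sketched as follows. The sketch assumes that each question token has a list of raw scores against every column, that unaligned tokens are marked with `None`, and that the loss is averaged over aligned tokens; all three choices, and the function name, are illustrative assumptions.

```python
import math

def column_prediction_loss(column_scores, gold_columns):
    """Mean cross-entropy over question tokens that have an annotated column.

    column_scores[i][j] is the score of question token i against column j;
    gold_columns[i] is the annotated column index, or None if unaligned.
    """
    losses = []
    for scores, gold in zip(column_scores, gold_columns):
        if gold is None:
            continue  # tokens without a column alignment contribute no loss
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[gold])  # equals -log softmax(scores)[gold]
    return sum(losses) / len(losses) if losses else 0.0

# Toy scores: token 0 ("who") is aligned to column 1; token 1 ("is") is unaligned.
scores = [[0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
gold = [1, None]
print(column_prediction_loss(scores, gold))
```

Only the first token contributes here, and its loss shrinks as the score of the annotated column grows relative to the others.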
Exact-match Features: An Unsupervised Alternative
A heuristic-based, albeit lower-coverage, alternative to manual alignment is to use questions’ mentions of column names. Thus, we use automatically-generated exact-match features in our baseline models for comparison in our experiments. For question encoders, we include two embeddings derived from binary exact-match features: indicators of whether the token appears in (1) any of the column headers and (2) any of the table cells. Similarly, for the column encoders, we also include an exact-match feature of whether the column name appears in the question.
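A minimal sketch of these exact-match indicators is shown below. It assumes lowercased, whitespace-tokenized matching, which may differ from the matching used in the released code; the function names and the toy table are hypothetical.

```python
def exact_match_features(question_tokens, headers, cells):
    """For each question token, return (in_any_header, in_any_cell) indicators."""
    header_words = {w for h in headers for w in h.lower().split()}
    cell_words = {w for c in cells for w in str(c).lower().split()}
    return [(int(t.lower() in header_words), int(t.lower() in cell_words))
            for t in question_tokens]

def column_in_question(headers, question_tokens):
    """For each column, indicate whether its full name appears in the question."""
    padded = " " + " ".join(t.lower() for t in question_tokens) + " "
    return [int(" " + h.lower() + " " in padded) for h in headers]

# Toy example with made-up table contents.
tokens = ["which", "athlete", "has", "rank", "1"]
headers = ["Athlete", "Rank", "Nation"]
cells = ["Bolt", "1", "Jamaica"]

print(exact_match_features(tokens, headers, cells))
print(column_in_question(headers, tokens))
```

Such features catch explicit mentions like “athlete” but give no signal for a word like “who”, which is the kind of reference the manual alignments are meant to cover.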
6 Experiments
| Model | Execution accuracy (Test) |
|---|---|
| Prior work (all necessarily weakly supervised) | |
| Single model | – |
| Single model (w/ bert) | |
| Ensemble | – |
| This paper (strongly supervised for the first time) | |
| Single model (align) | |
| Single model (align w/ bert) | |
| Ensemble (align) | |
| Ensemble (align w/ bert) | |

Table 3: WTQ test-set execution accuracy of align compared with prior work.
Setup
We randomly shuffle the tables in Squall and divide them into five splits. For each setting, we report the average logical form accuracy (output LF exactly matches the target LF) and execution accuracy (output LF may not match the target LF, but its execution yields the gold-standard answer) as well as the standard deviation of five models, each trained with four of the splits as its training set and the other split as its dev set. We denote the base model from §4 as seq2seq and our model trained with both proposed training strategies in §5 as align. The main baseline model we compare with, seq2seq+, is the base model enhanced with the automatically-derived exact-match features (§5.3). See the appendix for model implementation details.
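The table-level splitting described above can be sketched as follows. Splitting by table rather than by question keeps every dev-set table unseen during training. The function names, the fixed seed, and the round-robin assignment of tables to folds are assumptions for illustration, not details of the released splits.

```python
import random

def table_folds(table_ids, k=5, seed=0):
    """Shuffle the distinct tables and deal them into k folds."""
    ids = sorted(set(table_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

def cross_validation_splits(examples, k=5, seed=0):
    """Yield (train, dev) lists; examples are (table_id, question) pairs."""
    folds = table_folds([table for table, _ in examples], k, seed)
    for dev_tables in folds:
        dev_set = set(dev_tables)
        train = [ex for ex in examples if ex[0] not in dev_set]
        dev = [ex for ex in examples if ex[0] in dev_set]
        yield train, dev

# Toy data: 30 questions spread over 10 tables.
examples = [(f"t{i % 10}", f"q{i}") for i in range(30)]
splits = list(cross_validation_splits(examples))
for train, dev in splits:
    print(len(train), len(dev))
```

Each of the five models would be trained on one `train` list and evaluated on the matching `dev` list, with the mean and standard deviation computed over the five dev scores.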
WTQ Test Results
Table 3 presents the WTQ test-set execution accuracy of align compared with previous models. Unsurprisingly, Squall’s supervision allows our models to surpass weakly supervised models. Single models trained with BERT feature extractors exceed prior state-of-the-art by . However, our main scientific interest is not these numbers per se, but how beneficial additional lexical supervision is.
Model | Dev | Test | |
---|---|---|---|
seq2seq+ | |||
align | |||
seq2seq+ w/ bert | |||
align w/ bert |
Effect of Alignment Annotations
To examine the utility of lexical alignments as a finer-grained type of supervision, we compare align with seq2seq+ in Table 4. Both have access to logical form supervision, but align additionally uses lexical alignments during training. align improves over seq2seq+ by with BERT and without, showing that lexical alignment annotation is more beneficial than automatically-derived exact-match column reference features. (Footnote: Test set accuracies are lower than on the dev set because the WTQ test set includes questions unanswerable by sql.)
Effect of Individual Strategies
Table 5 compares model variations. We add each individual training strategy into the baseline seq2seq+ model and ablate components from the align model. Each component contributes to increased accuracies compared with seq2seq+. The effects range from with column prediction to with supervised encoder attention. Supervised encoder attention is the single most effective strategy: including it produces the highest gains and ablating it the largest drop. The exact-match column reference features are essential to the baseline model: seq2seq without those features has lower . Nonetheless, supervised encoder attention and column prediction are still effective on top of the exact-match features. Yet, align’s accuracy is still far below that of the oracle models; we hope Squall can inspire future work to take better advantage of its rich supervision.
Component | Dev | |
---|---|---|
seq2seq | ||
seq2seq+ | ||
+ Supervised decoder attn. | ||
+ Supervised encoder attn. | ||
+ Column prediction | ||
align | ||
- Supervised decoder attn. | ||
- Supervised encoder attn. | ||
- Column prediction | ||
- Exact-match features | ||
Oracle attention | – | |