On the Potential of Lexico-logical Alignments
for Semantic Parsing to SQL Queries

Tianze Shi*
Cornell University
tianze@cs.cornell.edu
Chen Zhao*
University of Maryland
chenz@cs.umd.edu
Jordan Boyd-Graber
University of Maryland
jbg@umiacs.umd.edu
Hal Daumé III
Microsoft Research & University of Maryland
me@hal3.name
Lillian Lee
Cornell University
llee@cs.cornell.edu

  * Equal contribution; listed in alphabetical order.
Abstract

Large-scale semantic parsing datasets annotated with logical forms have enabled major advances in supervised approaches. But can richer supervision help even more? To explore the utility of fine-grained, lexical-level supervision, we introduce Squall, a dataset that enriches 11,276 WikiTableQuestions English-language questions with manually created sql equivalents plus alignments between sql and question fragments. Our annotation enables new training possibilities for encoder-decoder models, including approaches from machine translation previously precluded by the absence of alignments. We propose and test two methods: (1) supervised attention; (2) adopting an auxiliary objective of disambiguating references in the input queries to table columns. In 5-fold cross validation, these strategies improve over strong baselines by 4.4% execution accuracy. Oracle experiments suggest that annotated alignments can support further accuracy gains of up to 23.9%.

Figure 1: Two examples from Squall. The table-question-answer triplets come from WikiTableQuestions. We provide the logical forms as sql plus alignments between question and logical form. In the bottom example, for instance, “the highest” is aligned to ORDER BY and LIMIT 1, as indicated by both the matching highlight color (blue) and the circled-number labels (2).
Findings of ACL: EMNLP 2020

1 Introduction

The availability of large-scale datasets pairing natural utterances with logical forms (dahl1994expanding; wang+15i; zhong2017seq2sql; yu2018spider, inter alia) has enabled significant progress on supervised approaches to semantic parsing (jia2016recombination; xiao+16; dong2016language; dong-lapata-2018-coarse, inter alia). However, the provision of logical forms alone does not indicate important fine-grained relationships between individual words or phrases and logical form tokens. This is unfortunate because researchers have in fact hypothesized that the lack of such alignment information hampers progress in semantic parsing (zhang+19, pg. 80).

We address this lack by introducing Squall (“SQL+QUestion pairs ALigned Lexically”), the first large-scale semantic-parsing dataset with manual lexical-to-logical alignments, and we investigate the potential accuracy boosts achievable from such alignments. The starting point for Squall is WikiTableQuestions (WTQ; pasupat2015compositional), containing data tables, English questions regarding the tables, and table-based answers. We manually enrich the 11,276-instance subset of WTQ’s training data that is translatable to sql by providing expert annotations, consisting not only of target logical forms in sql, but also labeled alignments between the input question tokens (e.g., “how many”) and their corresponding sql fragments (e.g., COUNT()). Figure 1 shows two Squall instances.

These new data enable training of encoder-decoder neural models that incorporate manual alignments. Consider the bottom example in Figure 1: a decoder can benefit from knowing that ORDER BY LIMIT 1 comes from “the highest” (where rank 1 is best), and an encoder should match “who” with the “athlete” column even though the two strings have no overlapping tokens. We implement these ideas with two training strategies:

  1. Supervised attention that guides models to produce attention weights mimicking human judgments during both encoding and decoding. Supervised attention has improved both alignment and translation quality in machine translation (liu+16; mi+16), but has only been applied in semantic parsing to heuristically generated alignments (rabinovich17) due to the lack of manual annotations.

  2. Column prediction that infers which column in the data table a question fragment refers to.

Using BERT features, our models reach 54.1% execution accuracy on the WTQ test set, surpassing the previous weakly-supervised state-of-the-art 48.8% (where weak supervision means access to only the answer, not the logical form of the question). More germane to the issue of alignment utility, in 5-fold cross validation, our additional fine-grained supervision improves execution accuracy by 4.4% over models supervised with only logical forms; ablation studies indicate that mappings between question tokens and columns help the most. Additionally, we construct oracle models that have access to the full alignments during test time to show the unrealized potential for our data, seeing improvements of up to 23.9% absolute logical form accuracy.

Through annotation-cost and learning-curve analysis, we conclude that lexical alignments are cost-effective for training parsers: lexical alignments take less than half as long to annotate as a logical form does, and we can improve execution accuracy by 2.5 percentage points by aligning merely 5% of the logical forms in the training set.

Our contributions are threefold: 1) we release a high-quality semantic parsing dataset with manually-annotated logical forms; 2) we label the alignments between the English questions and the corresponding logical forms to provide additional supervision; 3) we propose two training strategies that use our alignments to improve strong base models. Our dataset and code are publicly available at https://www.github.com/tzshi/squall.

2 Task: Table-based Semantic Parsing

Our task is to answer questions about structured tables through semantic parsing to logical forms (LFs). Formally, the input x=(q,T) consists of a question q about a table T, and the goal of a semantic parser is to reproduce the target LF y for q (and thus have high LF accuracy) or, in a less strict setting, to generate any query LF y that, when executed against T, yields the correct output z (and thus have high execution accuracy).
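The difference between the two metrics can be illustrated with a minimal sketch. It assumes a toy table stored in an in-memory SQLite database; the helper names (`execute`, `lf_match`, `execution_match`) and the table contents are hypothetical and are not taken from the released code. The two queries are stand-ins for a target LF and a differently worded prediction.

```python
import sqlite3

def execute(sql, rows):
    """Run `sql` against an in-memory table w(id, c1) and return the result rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE w (id INTEGER, c1 INTEGER)")
    conn.executemany("INSERT INTO w VALUES (?, ?)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result

def lf_match(predicted, gold):
    # LF accuracy: the predicted query must equal the target query
    # (here compared after whitespace normalization only, an assumption).
    return predicted.split() == gold.split()

def execution_match(predicted, gold_answer, rows):
    # Execution accuracy: any query that returns the gold answer counts.
    return execute(predicted, rows) == gold_answer

rows = [(1, 7), (2, 12), (3, 9)]                     # toy table T
gold = "SELECT MAX(c1) FROM w"                       # target LF y
pred = "SELECT c1 FROM w ORDER BY c1 DESC LIMIT 1"   # a different LF
gold_answer = execute(gold, rows)                    # answer z

print(lf_match(pred, gold))                      # False
print(execution_match(pred, gold_answer, rows))  # True
```

The prediction counts as correct under execution accuracy but not under LF accuracy, which is why the latter is the stricter setting.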

In a weakly supervised setting, training examples consist only of input-answer pairs (x,z). Recent datasets (zhong2017seq2sql; yu2018spider, inter alia) provide enough logical forms, i.e., (x,y) training pairs, to learn mappings from x to y in a supervised setting. Unsurprisingly, supervised models are more accurate than weakly supervised ones. However, training supervised models is still challenging: both x and y are structured, so models typically generate y in multiple steps, but the training data cannot reveal which parts of x generate which parts of y and how they are combined.

Just as adding supervised training improves accuracy over weak supervision, we explore whether even finer-grained supervision further helps. Since no large-scale datasets furnishing fine-grained supervision exist (to the best of our knowledge), we introduce Squall.

3 Squall: Our New Dataset

Squall is based on WikiTableQuestions (WTQ; pasupat2015compositional). WTQ is a large-scale question-answering dataset that contains diverse and challenging crowd-sourced question-answer pairs over 2,108 semi-structured Wikipedia tables. Most of the questions are more than simple table-cell look-ups and are highly compositional, a fact that motivated us to study lexical mappings between questions and logical forms. We hand-generate sql equivalents of the WTQ queries and align question tokens with corresponding sql query fragments. (Footnote 2: sql is a widely adopted formalism. Other formalisms, including LambdaDCS (pasupat2015compositional), have been used on WTQ. sql and LambdaDCS can express roughly the same percentage of queries: 81% (our finding) vs. 79% (analysis of a 200-question sample by pasupat-liang2016). We leave automatic conversion between sql and other formalisms to future work.) We leave lexical alignments of other text-to-sql datasets and cross-dataset model generalization (alane2020explor) to future work.

3.1 Data Annotation

We annotated WTQ’s training fold in three stages: database construction, sql query annotation, and alignment. Two expert annotators familiar with sql annotated half of the dataset each and then checked each other’s annotations and resolved all conflicts via discussion. See the appendix for the annotation guidelines.

Database Construction

Tables encode semi-structured information. Each table column usually contains data of the same type (e.g., text, numbers, dates), as is typical in relational databases. While pre-processing the WTQ tables, we considered both basic data types (e.g., raw text, numbers) and composite types (e.g., lists, binary tuples), and we suffixed column names with their inferred data types (e.g., _number in Figure 1). For annotation consistency, all tables were assigned the same name w and columns were given the sequential names c1, c2, … in the database schema, but we kept the original table headers for feature extraction. We additionally added a special column id to every table denoting the linear order of its rows. See the appendix for details.

Conversion of Queries to sql

For every question in WTQ’s training fold, we manually created its corresponding sql query, choosing the shortest when there were multiple possibilities. For instance, we wrote “SELECT MAX(c1) FROM w” instead of “SELECT c1 FROM w ORDER BY c1 DESC LIMIT 1”. An exception is that we opted for versions that depend less on the table structure even if their complexity was higher. As an example, if the table listed games (c2) pre-sorted by date (c1), and the question was “what is the next game after A?”, we wrote “SELECT c2 FROM w WHERE c1 > (SELECT c1 FROM w WHERE c2 = A) ORDER BY c1 LIMIT 1” instead of “SELECT c2 FROM w WHERE id = (SELECT id FROM w WHERE c2 = A) + 1”. Out of 14,149 questions spanning 1,679 tables, Squall provided sql queries for 11,468 questions, or 81.1%. The remaining 18.9% consisted of questions with non-deterministic answers (e.g., “show me an example of …”), questions requiring additional pre-processing (e.g., looking up a date inside a text-based details column), and cases where sql queries would be insufficiently expressive (e.g., “what team has the most consecutive wins?”).
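To see why the value-based query depends less on the table structure than the id-based one, consider the following sketch. It assumes a toy games table in an in-memory SQLite database; the rows, the ISO-formatted dates, and the `run` helper are hypothetical illustrations. The literal A is written as the string 'A'.

```python
import sqlite3

def run(sql, rows):
    """Execute `sql` on a toy table w(id, c1 = date, c2 = game)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE w (id INTEGER, c1 TEXT, c2 TEXT)")
    conn.executemany("INSERT INTO w VALUES (?, ?, ?)", rows)
    out = conn.execute(sql).fetchall()
    conn.close()
    return out

# Query that compares date values (the preferred annotation).
by_value = ("SELECT c2 FROM w WHERE c1 > (SELECT c1 FROM w WHERE c2 = 'A') "
            "ORDER BY c1 LIMIT 1")
# Query that relies on the row order recorded in the id column.
by_row_id = "SELECT c2 FROM w WHERE id = (SELECT id FROM w WHERE c2 = 'A') + 1"

sorted_rows = [(1, "2020-01-05", "A"), (2, "2020-01-12", "B"), (3, "2020-01-19", "C")]
unsorted_rows = [(1, "2020-01-05", "A"), (2, "2020-01-19", "C"), (3, "2020-01-12", "B")]

print(run(by_value, sorted_rows), run(by_row_id, sorted_rows))      # [('B',)] [('B',)]
print(run(by_value, unsorted_rows), run(by_row_id, unsorted_rows))  # [('B',)] [('C',)]
```

Both queries agree when the rows are pre-sorted by date, but only the value-based query still returns the next game once the row order no longer follows the dates.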

Alignment Annotation

| | Question phrase “how long” | sql fragment MAX() |
|---|---|---|
| Frequently aligned to | col | “the last” |
| | MAX(col)-MIN(col) | “the most” |
| | col-col | “the largest” |
| | COUNT(*) | “the highest” |
| | COUNT(col) | “the first” |

Table 1: Examples of frequently-aligned English/LF segment pairs, illustrating the diversity in the aligned counterparts for the same lexical units. col is a placeholder for the actual data table column mention.

Given a tokenized question/LF pair, the annotators selected and aligned corresponding fragments from the two sides. The selected tokens did not need to be contiguous, but they had to be units that decompose no further. For the example in Figure 1, there were three alignment pairs, where the non-contiguous “ORDER BY LIMIT 1” was treated as an atomic unit and aligned to “the highest” in the input. Additionally, not all tokens on either side needed to be aligned. For instance, sql keywords SELECT, FROM and question tokens “what”, “is”, etc. were mostly unaligned. Table 1 shows that the same question phrase was aligned to a range of sql expressions, and vice versa. Overall, 49.8% of question tokens were aligned. Comparative and superlative question tokens were the most frequently aligned, while many function words were unaligned; see the appendix for part-of-speech distributions of the aligned and unaligned tokens. Except for the four keywords in the basic structure “SELECT FROM w WHERE ”, 90.2% of sql keywords were aligned. The remaining sql tokens include = (alignment ratio of 18.0%), AND (25.5%), and column names (86.1%). The first two cases arose because equality checks and conjunctions of filtering conditions are often implicit in natural language.
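One possible in-memory representation of such alignments is sketched below. The question, the query, the index pairs, and the `aligned_fraction` helper are all hypothetical and do not reflect the released file format; the sketch only shows that alignment units may be non-contiguous and that tokens on either side may stay unaligned.

```python
# Hypothetical instance: question, query, and index pairs are made up for illustration.
question = ["who", "scored", "the", "highest", "?"]
sql = ["SELECT", "c1", "FROM", "w", "ORDER", "BY", "c2_number", "DESC", "LIMIT", "1"]

# Each alignment pairs a set of question-token indices with a set of sql-token indices.
alignments = [
    ({0}, {1}),                 # "who"         <-> c1
    ({1}, {6}),                 # "scored"      <-> c2_number
    ({2, 3}, {4, 5, 7, 8, 9}),  # "the highest" <-> ORDER BY ... DESC LIMIT 1 (non-contiguous)
]

def aligned_fraction(num_tokens, index_sets):
    """Fraction of tokens covered by at least one alignment."""
    covered = set().union(*index_sets)
    return len(covered) / num_tokens

question_ratio = aligned_fraction(len(question), [q for q, _ in alignments])
sql_ratio = aligned_fraction(len(sql), [s for _, s in alignments])
unaligned_sql = [tok for i, tok in enumerate(sql)
                 if all(i not in s for _, s in alignments)]

print(question_ratio, sql_ratio)  # 0.8 0.7
print(unaligned_sql)              # ['SELECT', 'FROM', 'w']
```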

Inter-Annotator Agreement and Annotation Cost

The two annotators’ initial sql annotation agreement in a pilot trial was 70.4%, and after discussion they agreed on 94.5% of data instances; similarly, alignment agreement rose from 75.1% to 93.3%. (Footnote 3: In the pilot study, the annotators independently labeled questions over the same 50 tables. We report the percentage of cases where one annotator accepted the other annotator’s labels.) With respect to annotation speed, an average sql query took 33.9 seconds to produce and an additional 15.0 seconds to enrich with alignments: the cost of annotating 100 instances with alignment enrichment was comparable to that of 144 instances with only logical forms.
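The cost comparison follows directly from the two reported timings, as the short calculation below shows; the variable names are illustrative only.

```python
SQL_SECONDS = 33.9        # average time to write one sql query (reported above)
ALIGNMENT_SECONDS = 15.0  # additional time to add alignments (reported above)

# Time needed to annotate 100 instances with both sql and alignments.
seconds_for_100_aligned = 100 * (SQL_SECONDS + ALIGNMENT_SECONDS)
# Number of sql-only instances that fit into the same amount of time.
equivalent_lf_only = seconds_for_100_aligned / SQL_SECONDS
# Alignment time relative to sql annotation time.
alignment_to_sql_ratio = ALIGNMENT_SECONDS / SQL_SECONDS

print(round(equivalent_lf_only))         # 144
print(round(alignment_to_sql_ratio, 2))  # 0.44
```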

3.2 Post-processing

Literal values in the sql queries, such as “25,000” in Figure 1 and “star one” in a separate case-study figure, are often directly copied from the input questions. We thus adapted the task setting of WikiSQL (zhong2017seq2sql), where all literal values correspond to spans in the input questions. We used our alignment to generate gold selection spans, filtering out instances where literal values could not be reconstructed through fuzzy match from the gold spans. After post-processing, Squall contained 11,276 table-question-answer triplets with logical form and lexical alignment annotations.

4 (State-of-the-Art) Base Model: Seq2seq with Attention and Copying

  Footnote 4: In the appendix, we show that on Squall, our base model is competitive with a state-of-the-art system (alane2020explor) benchmarked on the Spider dataset (yu2018spider).

Recent state-of-the-art text-to-sql models extend the sequence-to-sequence (seq2seq) framework with attention and copying mechanisms (zhong2017seq2sql; dong2016language; dong-lapata-2018-coarse; alane2020explor, inter alia). We adopt this strong neural paradigm as our base model. The seq2seq model generates one output token at a time via a probability distribution conditioned on both the input sequence representations and the partially-generated output sequence: $P(y \mid \mathbf{x}) = \prod_{i=1}^{|y|} P(y_i \mid \mathbf{y}_{<i}, \mathbf{x})$, where $\mathbf{x}$ and $\mathbf{y}$ are the feature representations for the input and output sequences, and $<i$ denotes a prefix. The last token of y must be a special <STOP> token that terminates the output generation. The per-token probability distribution is modeled through Long Short-Term Memory networks (LSTMs, hochreiter-schmidhuber97) and multi-layer perceptrons (MLPs):

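$$\mathbf{h}_i = \mathrm{LSTM}(\mathbf{h}_{i-1}, \mathbf{y}_{i-1}) \tag{1}$$
$$P(y_i \mid \mathbf{y}_{<i}, \mathbf{x}) = \mathrm{softmax}(\mathrm{MLP}(\mathbf{h}_i)). \tag{2}$$

The training objective is the negative log likelihood of the gold y, defined for each timestep as

$$L_i^{\text{seq2seq}} = -\log P(y_i \mid \mathbf{y}_{<i}, \mathbf{x}).$$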
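As a minimal numerical sketch of this objective (not the released implementation), the per-timestep loss can be computed from unnormalized scores. Here `step_scores` stands in for the MLP outputs at each decoding step and `gold_ids` for the indices of the gold tokens; both names and the toy values are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_nll(step_scores, gold_ids):
    """Sum over timesteps of -log P(y_i | y_<i, x)."""
    loss = 0.0
    for scores, gold in zip(step_scores, gold_ids):
        loss += -math.log(softmax(scores)[gold])
    return loss

# Two decoding steps over a two-token vocabulary with uninformative scores:
# each step contributes -log(1/2).
loss = sequence_nll([[0.0, 0.0], [0.0, 0.0]], [0, 1])
print(abs(loss - 2 * math.log(2)) < 1e-9)  # True
```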

Question and Table Encoding

An input x contains a length-n question $q = q_1, \ldots, q_n$ and a table with m columns $c = c_1, \ldots, c_m$. The input question is represented through a bi-directional LSTM (bi-LSTM) encoder that summarizes information from both directions within the sequence. Inputs to the bi-LSTM are concatenations of word embeddings, character-level bi-LSTM vectors, part-of-speech embeddings, and named entity type embeddings. We denote the resulting feature vector associated with $q_i$ as $\mathbf{q}_i$. For column names, the representation $\mathbf{c}_j$ concatenates the final hidden states of two LSTMs running in opposite directions that take the concatenated word embeddings, character encodings, and column data type embeddings as inputs. We also experiment with pre-trained BERT feature extractors (devlin+19), where we feed the BERT model with the question and the columns as a single sequence delimited by the special [SEP] token, and we take the final-layer representations of the question words and the last token of each column as their representations.

Attention in Encoding

To enhance feature interaction between the question and the table schema, for each question word representation 𝐪i, we use an attention mechanism to determine its relevant columns and calculate a linearly-weighted context vector 𝐪~i as follows:

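$$\tilde{\mathbf{q}}_i = \mathrm{Attn}(\mathbf{q}_i, \mathbf{c}) \triangleq \sum_j \mathbf{a}_{ij} \mathbf{c}_j, \tag{3}$$
$$\text{where } \mathbf{a}_{ij} = \mathrm{softmax}_j(\mathbf{q}_i^{T} W^{\text{att}} \mathbf{c}_j). \tag{4}$$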

Then we run another bi-LSTM by concatenating the question representation 𝐪 and context representation 𝐪~ as inputs to derive a column-sensitive representation 𝐪i for each question word qi. We apply a similar procedure to get the column representation 𝐜j for each column.
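The question-to-column attention step can be sketched in plain Python as follows. This is an illustrative toy with two-dimensional vectors and an identity matrix in place of the learned bilinear weights; the function names and values are assumptions, and the subsequent bi-LSTM is not shown.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear(u, W, v):
    """Compute u^T W v for list-based vectors and a nested-list matrix."""
    return sum(u[r] * W[r][k] * v[k] for r in range(len(u)) for k in range(len(v)))

def attend(q_i, columns, W):
    """Return attention weights over columns and the weighted context vector."""
    weights = softmax([bilinear(q_i, W, c_j) for c_j in columns])
    dim = len(columns[0])
    context = [sum(w * c[d] for w, c in zip(weights, columns)) for d in range(dim)]
    return weights, context

q_i = [1.0, 0.0]                    # one question-word vector (toy values)
columns = [[1.0, 0.0], [0.0, 1.0]]  # two column vectors (toy values)
W = [[1.0, 0.0], [0.0, 1.0]]        # identity stands in for the learned matrix

weights, context = attend(q_i, columns, W)
print(weights[0] > weights[1])  # True: the first column scores higher
```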

Attention in Decoding

During decoding, to allow LSTMs to capture long-distance dependencies from the input, we add attention-based features to the recurrent feature definition of Eq. (1):

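$$\mathbf{v}_i = \mathrm{Attn}(\mathbf{h}_i, \mathbf{q}) \tag{5}$$
$$\mathbf{h}_i = \mathrm{LSTM}(\mathbf{h}_{i-1}, [\mathbf{v}_{i-1}; \mathbf{y}_{i-1}]). \tag{6}$$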

sql Token Prediction with Copying Mechanism

Since each output token can be an sql keyword, a column name or a literal value, we factor the probability defined in Eq. (2) into two components: one that decides the type $t_i \in \{\text{KEY}, \text{COL}, \text{STR}\}$ of $y_i$:

$$P(t_i \mid \mathbf{y}_{<i}, \mathbf{x}) = \mathrm{softmax}(\mathrm{MLP}^{\text{type}}(\mathbf{h}_i)),$$

and another that predicts the token conditioned on the type $t_i$. For token type KEY, we predict the keyword token with another MLP:

$$P(y_i \mid \mathbf{y}_{<i}, \mathbf{x}, t_i = \text{KEY}) = \mathrm{softmax}(\mathrm{MLP}^{\text{KEY}}(\mathbf{h}_i)).$$

For COL and STR tokens, the model selects directly from the input column names c or question q via a copying mechanism. We define a probability distribution with softmax-normalized bilinear scores:

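$$P(y_i = c_j \mid \mathbf{y}_{<i}, \mathbf{x}, t_i = \text{COL}) = \mathrm{softmax}_j(\mathbf{s}_i),$$
$$\text{where } \mathbf{s}_{ij} = \mathbf{h}_i^{T} W^{\text{COL}} \mathbf{c}_j.$$

Similarly, we define literal string copying from q with another bilinear scoring matrix $W^{\text{STR}}$.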
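The factorization into a type distribution and type-conditioned token distributions can be sketched as below. The score lists stand in for the outputs of the type MLP, the keyword MLP, and the two bilinear copy scorers; all names and values are hypothetical, and literal copying is simplified here to choosing a single question token rather than a span.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def token_distribution(type_scores, key_scores, col_scores, str_scores,
                       keywords, columns, question):
    """Joint probability of (type, position, token): P(type) * P(token | type)."""
    p_key, p_col, p_str = softmax(type_scores)  # order: KEY, COL, STR
    dist = {}
    for p_type, kind, scores, candidates in (
        (p_key, "KEY", key_scores, keywords),
        (p_col, "COL", col_scores, columns),
        (p_str, "STR", str_scores, question),
    ):
        for pos, (p, cand) in enumerate(zip(softmax(scores), candidates)):
            dist[(kind, pos, cand)] = p_type * p
    return dist

dist = token_distribution(
    type_scores=[0.0, 0.0, 0.0],
    key_scores=[0.0, 0.0],
    col_scores=[0.0, 0.0],
    str_scores=[0.0, 0.0, 0.0],
    keywords=["SELECT", "COUNT"],
    columns=["c1", "c2"],
    question=["how", "many", "games"],
)
print(abs(sum(dist.values()) - 1.0) < 1e-9)  # True: a proper distribution
```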

5 Using Alignments in Model Training

The model design in §4 includes many latent interactions within and across the encoder and the decoder. We now describe how our manual alignments can enable direct supervision on such previously latent interactions. Our alignments can be used as supervision for the necessary attention weights (§5.1). In an oracle experiment where we replace induced attention with manual alignments, the jump in logical form accuracy shows alignments are valuable, if only the models could reproduce them (§5.2). Moreover, alignments enable a column-prediction auxiliary task (§5.3).

The loss function L of our full model is a linear combination of the loss terms of the seq2seq model, supervised attention, and column prediction:

$$L = L^{\text{seq2seq}} + \lambda^{\text{att}} L^{\text{att}} + \lambda^{\text{CP}} L^{\text{CP}},$$

where we define $L^{\text{att}}$ and $L^{\text{CP}}$ below.

5.1 Supervised Attention

Our annotated lexical alignments resemble our base model’s attention mechanisms. At the encoding stage, question tokens and the relevant columns are aligned (e.g., “who” ↔ column “athlete”), which should induce higher weights in both question-to-column and column-to-question attention (Eq. (3) and Eq. (4)); similarly, for decoding, annotation reflects which question words are most relevant to the current output token. Inspired by improvements from supervised attention in machine translation (liu+16; mi+16), we train the base model’s attention mechanisms to minimize the Euclidean distance between the human-annotated alignment vector $\mathbf{a}^{*}$ and the model-generated attention vector $\mathbf{a}$ (Footnote 6: See the appendix for experiments with other distances.):

$$L^{\text{att}} = \frac{1}{2} \lVert \mathbf{a}^{*} - \mathbf{a} \rVert^{2}.$$

The vector $\mathbf{a}^{*}$ is a one-hot vector when the annotation aligns to a single element, and it represents a uniform distribution over the aligned subset when the annotation aligns multiple elements.
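A small sketch of the target construction and the resulting loss is given below, together with the weighted combination into the full objective. The function names, the toy attention values, and the weight values passed to `total_loss` are assumptions for illustration, not settings from the experiments.

```python
def target_vector(aligned_indices, size):
    """One-hot for a single aligned element, uniform over the aligned subset otherwise."""
    target = [0.0] * size
    for j in aligned_indices:
        target[j] = 1.0 / len(aligned_indices)
    return target

def attention_loss(predicted, target):
    """Half the squared Euclidean distance between predicted and annotated weights."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(predicted, target))

def total_loss(l_seq2seq, l_att, l_cp, lambda_att, lambda_cp):
    """Linear combination of the three loss terms."""
    return l_seq2seq + lambda_att * l_att + lambda_cp * l_cp

print(target_vector({1}, 3))     # [0.0, 1.0, 0.0]
print(target_vector({0, 2}, 4))  # [0.5, 0.0, 0.5, 0.0]

l_att = attention_loss([0.2, 0.7, 0.1], target_vector({1}, 3))
print(round(l_att, 4))           # 0.07
```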

5.2 Oracle Experiments with Manual Alignments

| Attention type | ACCLF (Dev) | Δ |
|---|---|---|
| Induced attention | 37.8 ± 0.6 | |
| Oracle attention: encoder only | 51.5 ± 1.4 | +13.7 |
| Oracle attention: decoder only | 49.4 ± 0.9 | +11.6 |
| Oracle attention: encoder + decoder | 61.7 ± 0.4 | +23.9 |

Table 2: Oracle experiment LF-accuracy results over five dev sets from random splits, where attention weights are replaced by manual alignments. Induced attention refers to the base model (§4).

To present the potential of alignment annotations for models with supervised attention, we first assume a model that can flawlessly reproduce our annotations within the base model. During training and inference, we feed the true alignment vectors in place of the attention weights to the encoder and/or decoder. Table 2 shows the resultant logical form accuracies. Access to oracle alignments provides up to 23.9% absolute higher accuracy over the base model. This wide gap suggests the high potential for training models with our lexical alignments.

5.3 Column Prediction

wang-etal-2019-learning show the importance of inferring token-column correspondence in a weakly-supervised setting; Squall enables full supervision for an auxiliary task that directly predicts the corresponding column cj for each question token qi. We model this auxiliary prediction as:

$$\mathbf{s}_{ij} = \mathbf{q}_i^{T} W^{\text{CP}} \mathbf{c}_j$$
$$P(q_i \text{ matches } c_j \mid q_i) = \mathrm{softmax}_j(\mathbf{s}_i).$$

For the corresponding loss LCP over tokens that match columns, we use cross-entropy.
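A minimal sketch of this cross-entropy loss follows. It assumes the bilinear scores are already computed; `scores[i][j]` stands in for the score between question token i and column j, and `gold_columns[i]` is the annotated column index or `None` for tokens that match no column. Both names and the toy values are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def column_prediction_loss(scores, gold_columns):
    """Cross-entropy summed over question tokens that are aligned to a column."""
    loss = 0.0
    for row, gold in zip(scores, gold_columns):
        if gold is None:  # token matches no column: excluded from the loss
            continue
        loss += -math.log(softmax(row)[gold])
    return loss

scores = [[0.0, 0.0], [5.0, 1.0]]  # two question tokens, two columns
gold_columns = [1, None]           # only the first token is aligned to a column
loss = column_prediction_loss(scores, gold_columns)
print(abs(loss - math.log(2)) < 1e-9)  # True
```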

Exact-match Features: An Unsupervised Alternative

A heuristic-based, albeit lower-coverage, alternative to manual alignment is to use questions’ mentions of column names. Thus, we use automatically-generated exact-match features in our baseline models for comparison in our experiments. For question encoders, we include two embeddings derived from binary exact-match features: indicators of whether the token appears in (1) any of the column headers and (2) any of the table cells. Similarly, for the column encoders, we also include an exact-match feature of whether the column name appears in the question.
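These binary indicators can be sketched as follows. The tokenization (lowercased, whitespace-split words) and the matching rule (whole-word match for question tokens, whole-phrase match for column names) are assumptions, since the exact matching procedure is not specified here; the table contents are made up.

```python
def question_features(tokens, headers, cells):
    """Per question token: (appears in any column header, appears in any table cell)."""
    header_words = {w for h in headers for w in h.lower().split()}
    cell_words = {w for c in cells for w in c.lower().split()}
    return [(int(t.lower() in header_words), int(t.lower() in cell_words))
            for t in tokens]

def column_features(headers, tokens):
    """Per column: whether the full column name appears in the question."""
    question = " " + " ".join(t.lower() for t in tokens) + " "
    return [int(" " + h.lower() + " " in question) for h in headers]

tokens = ["which", "athlete", "is", "from", "kenya", "?"]
headers = ["athlete", "nation", "time"]
cells = ["kenya", "usa"]

print(question_features(tokens, headers, cells))
# [(0, 0), (1, 0), (0, 0), (0, 0), (0, 1), (0, 0)]
print(column_features(headers, tokens))  # [1, 0, 0]
```

A question that says “who” instead of “athlete” would produce no header match at all, which illustrates the lower coverage of this heuristic compared with manual alignment.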

6 Experiments

| Model | ACCEXE (Test) |
|---|---|
| Prior work (all necessarily weakly supervised) | |
| Single model | 34.2–44.5 |
| Single model (w/ bert) | 48.8 |
| Ensemble | 37.7–46.9 |
| This paper (strongly supervised for the first time) | |
| Single model (align) | 49.7 ± 0.4 |
| Single model (align w/ bert) | 54.1 ± 0.2 |
| Ensemble (align) | 53.1 |
| Ensemble (align w/ bert) | 57.2 |

Table 3: WTQ test set execution accuracies (%). The accuracy ranges for prior work are aggregated over pasupat2015compositional, neelakantan2016learning, krishnamurthy2017neural, zhang+17, haug2018neural, liang+18, dasigi+19, agarwal+19, wang-etal-2019-learning, and herzig+20. Unsurprisingly, our models trained on Squall surpass weakly-supervised previous work.

Setup

We randomly shuffle the tables in Squall and divide them into five splits. For each setting, we report the average logical form accuracy ACCLF (output LF exactly matches the target LF) and execution accuracy ACCEXE (output LF may not match the target LF, but its execution yields the gold-standard answer) as well as the standard deviation of five models, each trained with four of the splits as its training set and the other split as its dev set. We denote the base model from §4 as seq2seq and our model trained with both proposed training strategies in §5 as align. The main baseline model we compare with, seq2seq+, is the base model enhanced with the automatically-derived exact-match features (§5.3). See the appendix for model implementation details.
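The evaluation protocol can be sketched as a table-level five-fold split, so that all questions about one table fall into the same fold. The function names, the fixed seed, the `train_and_eval` callback, and the use of the sample standard deviation are assumptions; the source does not state which estimator was used.

```python
import random
import statistics

def table_splits(table_ids, k=5, seed=0):
    """Shuffle the distinct table ids and deal them into k folds."""
    ids = sorted(set(table_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

def cross_validate(examples, train_and_eval, k=5, seed=0):
    """examples: list of (table_id, example); train_and_eval(train, dev) -> accuracy."""
    folds = table_splits([t for t, _ in examples], k, seed)
    scores = []
    for dev_tables in folds:
        dev_set = set(dev_tables)
        train = [e for t, e in examples if t not in dev_set]
        dev = [e for t, e in examples if t in dev_set]
        scores.append(train_and_eval(train, dev))
    # Sample standard deviation is an assumption.
    return statistics.mean(scores), statistics.stdev(scores)

# Toy usage: 10 tables with 3 questions each and a dummy "model"
# that just reports the dev-set share of the data.
examples = [(t, f"q{t}-{i}") for t in range(10) for i in range(3)]
mean, std = cross_validate(examples, lambda train, dev: len(dev) / (len(train) + len(dev)))
print(round(mean, 2), round(std, 2))  # 0.2 0.0
```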

WTQ Test Results

Table 3 presents the WTQ test-set ACCEXE of align compared with previous models. Unsurprisingly, Squall’s supervision allows our models to surpass weakly supervised models. Single models trained with BERT feature extractors exceed prior state-of-the-art by 5.3%. However, our main scientific interest is not these numbers per se, but how beneficial additional lexical supervision is.

| Model | Dev ACCLF | Dev ACCEXE | Test ACCEXE |
|---|---|---|---|
| seq2seq+ | 37.8 ± 0.6 | 56.9 ± 0.7 | 46.6 ± 0.5 |
| align | 42.2 ± 1.5 | 61.3 ± 0.8 | 49.7 ± 0.4 |
| seq2seq+ w/ bert | 44.7 ± 2.1 | 63.8 ± 1.1 | 51.8 ± 0.4 |
| align w/ bert | 47.2 ± 1.2 | 66.5 ± 1.2 | 54.1 ± 0.2 |

Table 4: Logical form (ACCLF) and execution (ACCEXE) accuracies (%) on dev and test sets, showing the utility of learning from lexical supervision.

Effect of Alignment Annotations

To examine the utility of lexical alignments as a finer-grained type of supervision, we compare align with seq2seq+ in Table 4. Both have access to logical form supervision, but align additionally uses lexical alignments during training. On the test set, align improves over seq2seq+ by 2.3% ACCEXE with BERT and 3.1% without, showing that lexical alignment annotation is more beneficial than automatically-derived exact-match column reference features. (Footnote 7: Test set accuracies are lower than on the dev set because the WTQ test set includes questions unanswerable by sql.)

Effect of Individual Strategies

Table 5 compares model variations. We add each individual training strategy into the baseline seq2seq+ model and ablate components from the align model. Each component contributes to increased accuracies compared with seq2seq+. The effects range from +1.3% ACCEXE with column prediction to +3.8% ACCEXE with supervised encoder attention. Supervised encoder attention is the single most effective strategy: including it produces the highest gains and ablating it the largest drop. The exact-match column reference features are essential to the baseline model: seq2seq without those features has 8.1% lower ACCEXE. Nonetheless, supervised encoder attention and column prediction are still effective on top of the exact-match features. Yet, align’s accuracy is still far below that of the oracle models; we hope Squall can inspire future work to take better advantage of its rich supervision.

| Component | Dev ACCLF | Dev ACCEXE |
|---|---|---|
| seq2seq | 31.0 ± 0.7 | 48.8 ± 0.8 |
| seq2seq+ | 37.8 ± 0.6 | 56.9 ± 0.7 |
| + Supervised decoder attn. | 39.4 ± 1.1 | 58.6 ± 1.3 |
| + Supervised encoder attn. | 41.3 ± 1.7 | 60.7 ± 0.7 |
| + Column prediction | 38.6 ± 0.5 | 58.2 ± 0.8 |
| align | 42.2 ± 1.5 | 61.3 ± 0.8 |
| - Supervised decoder attn. | 41.6 ± 1.8 | 61.1 ± 1.3 |
| - Supervised encoder attn. | 39.6 ± 0.6 | 58.7 ± 0.8 |
| - Column prediction | 41.8 ± 1.6 | 60.9 ± 0.8 |
| - Exact-match features | 39.5 ± 1.1 | 58.8 ± 0.7 |
| Oracle attention | 61.7 ± 0.4 | |

Table 5: Dev logical form (ACCLF) and execution (ACCEXE) accuracies for different model variations (w/o bert). The superimposed bar chart provides a visual presentation of ACCLF. Each align component contributes to increased accuracies compared with seq2seq+, while the oracle attention model demonstrates the unrealized potential of the alignments.
[Figure (plot): execution accuracies on dev (y-axis, ticks from 20 to 60) against the percentage of training examples (x-axis, log scale, ticks at 5%, 10%, 20%, 40%, 80%) for align and seq2seq+.]