Negation, Coordination, and Quantifiers in Contextualized Language Models
2022, Kalouli, Aikaterini-Lida, Sevastjanova, Rita, Schätzle, Christin, Romero, Maribel
With the success of contextualized language models, much research explores what these models really learn and in which cases they still fail. Most of this work focuses on specific NLP tasks and on the learning outcome. Little research has attempted to decouple the models' weaknesses from specific tasks and focus on the embeddings per se and their mode of learning. In this paper, we take up this research opportunity: based on theoretical linguistic insights, we explore whether the semantic constraints of function words are learned and how the surrounding context impacts their embeddings. We create suitable datasets, provide new insights into the inner workings of LMs vis-a-vis function words and implement an assisting visual web interface for qualitative analysis.
Explaining Simple Natural Language Inference
2019, Kalouli, Aikaterini-Lida, Buis, Annebeth, Real, Livy, Palmer, Martha, de Paiva, Valeria
The vast amount of research introducing new corpora and techniques for semi-automatically annotating corpora shows the important role that datasets play in today’s research, especially in the machine learning community. This rapid development raises concerns about the quality of the datasets created and consequently of the models trained, as recently discussed with respect to the Natural Language Inference (NLI) task. In this work we conduct an annotation experiment based on a small subset of the SICK corpus. The experiment reveals several problems in the annotation guidelines, and various challenges of the NLI task itself. Our quantitative evaluation of the experiment allows us to assign our empirical observations to specific linguistic phenomena and leads us to recommendations for future annotation tasks, for NLI and possibly for other tasks.
CoUSBi: A Structured and Visualized Legal Corpus of US State Bills
2018, Kalouli, Aikaterini-Lida, Vrana, Leo, Fabella, Vigile Marie, Bellani, Luna, Hautli-Janisz, Annette
This paper reports on an approach to automatically transform semi-structured, public databases of US state-level legislative bills into a structured legal corpus, namely the Corpus of US Bills (CoUSBi). Our work has resulted in a methodology and a corpus that make this data usable for natural language processing applications. It thus also lays important groundwork for work in the social sciences, particularly in political science and economics, where there is a growing interest in the relationship between legislative policy-making and economic behavior. Against the backdrop of eventually contributing to a Legal Knowledge Graph, the paper shows that the corpus we provide already fulfills the requirements to be connected to other resources: we automatically extract correspondences between individual state bills and model bills from independent organizations, generating interesting insights into the legislative process. We furthermore use NEREx, a Visual Analytics framework that allows us to capture important content of the bills at a glance.
Explaining Contextualization in Language Models using Visual Analytics
2021, Sevastjanova, Rita, Kalouli, Aikaterini-Lida, Schätzle, Christin, Schäfer, Hanna, El-Assady, Mennatallah
Despite the success of contextualized language models on various NLP tasks, it is still unclear what these models really learn. In this paper, we contribute to the current efforts of explaining such models by exploring the continuum between function and content words with respect to contextualization in BERT, based on linguistically-informed insights. In particular, we utilize scoring and visual analytics techniques: we use an existing similarity-based score to measure contextualization and integrate it into a novel visual analytics technique, presenting the model’s layers simultaneously and highlighting intra-layer properties and inter-layer differences. We show that contextualization is driven neither by polysemy nor by pure context variation. We also provide insights on why BERT fails to model words in the middle of the functionality continuum.
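To illustrate what a similarity-based contextualization score can look like, here is a minimal sketch. It assumes the score is the average pairwise cosine similarity among one word's embeddings across different contexts; the abstract does not say which score is used, so the function name `self_similarity` and the toy vectors are purely illustrative. A lower value would suggest that the word's representation varies more with context.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def self_similarity(vectors):
    """Average pairwise cosine similarity of one word's contextual embeddings.

    `vectors` holds one embedding per occurrence of the word (at least two).
    This is an assumed, illustrative score, not necessarily the paper's.
    """
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-d "embeddings" (hypothetical; real BERT vectors have hundreds of dimensions).
stable_word = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # identical in every context
varied_word = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # changes strongly with context

print(self_similarity(stable_word))
print(self_similarity(varied_word))
```

In practice the input vectors would be taken from a single model layer, so that the score can be compared layer by layer.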
GKR: Bridging the gap between symbolic/structural and distributional meaning representations
2019, Kalouli, Aikaterini-Lida, Crouch, Richard, de Paiva, Valeria
Three broad approaches have been attempted to combine distributional and structural/symbolic aspects to construct meaning representations: a) injecting linguistic features into distributional representations, b) injecting distributional features into symbolic representations or c) combining structural and distributional features in the final representation. This work focuses on an example of the third and less studied approach: it extends the Graphical Knowledge Representation (GKR) to include distributional features and proposes a division of semantic labour between the distributional and structural/symbolic features. We propose two extensions of GKR that clearly show this division and empirically test one of the proposals on an NLI dataset with hard compositional pairs.
KonTra at CMCL 2021 Shared Task: Predicting Eye Movements by Combining BERT with Surface, Linguistic and Behavioral Information
2021, Yu, Qi, Kalouli, Aikaterini-Lida, Frassinelli, Diego
This paper describes the submission of the team KonTra to the CMCL 2021 Shared Task on eye-tracking prediction. Our system combines the embeddings extracted from a fine-tuned BERT model with surface, linguistic and behavioral features, resulting in an average mean absolute error of 4.22 across all 5 eye-tracking measures. We show that word length and features representing the expectedness of a word are consistently the strongest predictors across all 5 eye-tracking measures.
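The reported figure is a mean absolute error (MAE) averaged over several measures. The sketch below shows how such an average can be computed: MAE is calculated per measure and the per-measure values are then averaged. The measure names and numbers are hypothetical placeholders, not the shared task's data, and the task's official scoring script may differ in detail.

```python
def mean_absolute_error(gold, pred):
    """Mean of |gold - pred| over aligned items (e.g. one value per word)."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def average_mae(gold_by_measure, pred_by_measure):
    """Average the per-measure MAE values over all measures."""
    maes = [
        mean_absolute_error(gold_by_measure[name], pred_by_measure[name])
        for name in gold_by_measure
    ]
    return sum(maes) / len(maes)


# Hypothetical toy data with two placeholder measures and two words each.
gold = {"measure_a": [1.0, 2.0], "measure_b": [10.0, 20.0]}
pred = {"measure_a": [2.0, 2.0], "measure_b": [10.0, 16.0]}

print(average_mae(gold, pred))  # per-measure MAEs are 0.5 and 2.0
```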
Composing noun phrase vector representations
2019, Kalouli, Aikaterini-Lida, de Paiva, Valeria, Crouch, Richard
Vector representations of words have seen increasing success over the past years in a variety of NLP tasks. While there seems to be a consensus about the usefulness of word embeddings and how to learn them, it is still unclear which representations can capture the meaning of phrases or even whole sentences. Recent work has shown that simple operations outperform more complex deep architectures. In this work, we propose two novel constraints for computing noun phrase vector representations. First, we propose that the semantic and not the syntactic contribution of each component of a noun phrase should be considered, so that the resulting composed vectors express more of the phrase meaning. Second, the composition process of the two phrase vectors should apply a suitable selection of dimensions, so that specific semantic features captured by the phrase’s meaning become more salient. Our proposed methods are compared to 11 other approaches, including popular baselines and a neural net architecture, and are evaluated across 6 tasks and 2 datasets. Our results show that these constraints lead to more expressive phrase representations and can be applied to other state-of-the-art methods to improve their performance.
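For readers unfamiliar with the "simple operations" mentioned above, the sketch below shows two widely used composition operations, element-wise addition and element-wise multiplication. It is an assumption that operations of this kind are meant. The abstract does not list the 11 compared approaches, and this is not the authors' proposed method. The function names and toy vectors are illustrative only.

```python
def compose_add(u, v):
    """Additive composition: element-wise sum of two word vectors."""
    return [a + b for a, b in zip(u, v)]


def compose_mult(u, v):
    """Multiplicative composition: element-wise product of two word vectors."""
    return [a * b for a, b in zip(u, v)]


# Hypothetical 2-d vectors standing in for the two parts of a noun phrase.
modifier = [1.0, 2.0]
head = [3.0, 4.0]

print(compose_add(modifier, head))   # [4.0, 6.0]
print(compose_mult(modifier, head))  # [3.0, 8.0]
```

Multiplication emphasizes the dimensions on which both vectors have large values, which loosely relates to the idea of making selected dimensions more salient, whereas addition treats all dimensions alike.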