
AmsterdamNLP at CLIN!

We’re at CLIN 35 in Leiden today with talks and multiple presentations on some of the most topical themes in current NLP!


Causal Methods for a (Mechanistic) Understanding of Gender Bias in Dutch Language Models

Caspar de Jong, Oskar van der Wal, Willem Zuidema (Institute for Logic, Language and Computation, University of Amsterdam)

The recent successes of Large Language Models (LLMs) in tasks as diverse as content generation and question answering have led to excitement about integrating this technology in everyday applications, and even in important decision-making practices. Unfortunately, these LLMs are known to exhibit undesirable biases that, if left unchecked, could lead to real-world harm. Yet, our ability to measure and mitigate these biases remains limited, a problem that is only aggravated for Dutch LMs, as little research has been done on bias outside the English context (Talat et al., 2022). To address this gap, this paper aims to get a better understanding of how Dutch GPT-style models rely on gender stereotypes in coreference resolution.

Caspar

We build on previous work that used Causal Mediation Analysis (CMA; Vig et al., 2020; Chintam et al., 2023) to identify transformer components responsible for gender bias in English LMs, and adapt the methods and datasets to Dutch. We perform CMA on the 144 attention heads of GPT2-small-dutch, a Dutch LM trained by GroNLP (de Vries & Nissim, 2021). The original English CMA dataset (the Professions dataset from Vig et al., 2020) is translated to Dutch with the Google Translate API and manually verified for correctness. Interestingly, the CMA results for the Dutch LM are very similar to the components identified by Chintam et al. (2023) in the English GPT-2 small model.
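
To make the setup concrete, here is a minimal sketch (ours, not the authors' code) of the bias probe that underlies CMA's total-effect measurement: how the model's preference for "hij" over "zij" as the next token shifts when a profession prompt is rewritten with an explicit gender cue. The Hugging Face model id and the prompts are assumptions, and the per-head mediation step (patching individual attention-head activations) is not shown.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "GroNLP/gpt2-small-dutch"  # assumed Hugging Face id of the GroNLP Dutch GPT-2
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def pronoun_odds(prompt):
    """Return p(' hij') / p(' zij') for the next token after `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        probs = model(ids).logits[0, -1].softmax(-1)
    # assumes ' hij' and ' zij' are single BPE tokens; otherwise we take the first subtoken
    hij = tok(" hij", add_special_tokens=False).input_ids[0]
    zij = tok(" zij", add_special_tokens=False).input_ids[0]
    return (probs[hij] / probs[zij]).item()

base = pronoun_odds("De verpleegkundige zei dat")   # profession-only prompt
anti = pronoun_odds("De man zei dat")               # prompt with an explicit gender cue
print(base, anti, anti / base)                      # the ratio approximates a total effect
```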

Following Chintam et al. (2023), we then test the effectiveness of ‘targeted fine-tuning’, i.e. fine-tuning only the top 10 CMA-identified attention heads on a gender-balanced dataset, in reducing gender bias compared to fine-tuning a random set of attention heads or the full model. To obtain this fine-tuning dataset, we translate an English dataset of 1717 sentences containing one or more gendered pronouns (BUG Gold; Levy et al., 2021) to Dutch and perform counterfactual data augmentation (CDA; Lu et al., 2020) by swapping the gender of these pronouns, thus creating a gender-balanced dataset of 3434 sentences. Our results indicate that targeted fine-tuning of the CMA-identified attention heads can reduce gender bias with limited damage to the general language modeling capabilities, as measured by perplexity on the DutchParliament dataset (van Heusden et al., 2023).
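
As a rough illustration of the augmentation step, the sketch below swaps gendered Dutch pronouns to create counterfactual copies of sentences. It is deliberately naive: real CDA has to deal with ambiguous forms ("zij" as plural "they", "haar" as the noun "hair"), which we only flag in a comment.

```python
import re

# Naive counterfactual data augmentation (CDA) sketch for Dutch pronouns.
# Real CDA needs more care: 'haar' is also a noun ('hair') and 'zij' can be plural 'they'.
PRONOUN_MAP = {"hij": "zij", "zij": "hij", "hem": "haar", "haar": "hem"}

def swap_gender(sentence):
    """Swap gendered pronouns, preserving capitalization of the first letter."""
    def repl(match):
        word = match.group(0)
        swapped = PRONOUN_MAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    pattern = r"\b(" + "|".join(PRONOUN_MAP) + r")\b"
    return re.sub(pattern, repl, sentence, flags=re.IGNORECASE)

original = ["Hij zei dat de dokter hem zou bellen."]
augmented = original + [swap_gender(s) for s in original]  # gender-balanced set
print(augmented)
```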

While we believe that this work shows a promising research direction for understanding undesirable biases in Dutch LMs, we also recognize the limitations imposed by a lack of proper bias evaluation tools. We therefore call for the development of new benchmarks for measuring biases and harms as a crucial next step in the development of safe Dutch LMs.


Do Hate Speech Detection Models Reflect their Dataset’s Definition? Investigating Model Behavior on Hate Speech Aspects

Urja Khurana (Vrije Universiteit Amsterdam), Eric Nalisnick (Johns Hopkins University), Antske Fokkens (Vrije Universiteit Amsterdam)

Hate speech is subjective and hence requires a lot of attention to detail when designing a detection system. While picking a model that can properly capture this phenomenon is essential, the biggest challenge lies in the data: ultimately, the model is going to learn the type of hate speech that is present in its training data. An important question for a hate speech researcher is therefore which dataset to use. What kind of hate speech should be addressed, and which dataset reflects that definition? And do these datasets deliver as promised? Previous research has illustrated certain limitations of hate speech datasets, but investigating whether they encapsulate their own definition has not been at the forefront. We therefore investigate whether models trained on these datasets reflect their dataset’s definition.

As such, we design a setup in which we examine to what extent models trained on six different hate speech datasets follow their dataset’s definition. We measure the compliance of a model with its dataset’s definition by decomposing each definition into the aspects that make something hate speech, guided by Hate Speech Criteria. After this decomposition, we match the aspects with HateCheck, an evaluation suite for hate speech detection systems that uncovers their strengths and weaknesses (on which capabilities does a model perform well or badly?). Different aspects are matched to different test cases and capabilities to systematically check whether a model covers an aspect mentioned in the definition or not.
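
As an illustration of that matching step, here is a small sketch (ours, with hypothetical aspect names and an assumed test-case format, not the authors' actual mapping or the HateCheck field names) that scores a classifier per definition aspect by routing HateCheck functionalities through an aspect mapping.

```python
from collections import defaultdict

# Hypothetical aspect-to-functionality mapping; placeholder names only.
aspect_to_functionalities = {
    "dehumanisation": ["derogation_dehumanisation"],
    "threatening language": ["threats_direct", "threats_normative"],
}

def per_aspect_accuracy(test_cases, predict):
    """test_cases: iterable of dicts with 'text', 'functionality' and 'label' (1 = hateful).
    predict: callable mapping a text to a predicted label. Returns accuracy per aspect."""
    func_to_aspect = {f: a for a, funcs in aspect_to_functionalities.items() for f in funcs}
    correct, total = defaultdict(int), defaultdict(int)
    for case in test_cases:
        aspect = func_to_aspect.get(case["functionality"])
        if aspect is None:
            continue  # this functionality is not linked to any aspect of the definition
        total[aspect] += 1
        correct[aspect] += int(predict(case["text"]) == case["label"])
    return {aspect: correct[aspect] / total[aspect] for aspect in total}
```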


BLiMP-NL: A corpus of Dutch minimal pairs and grammaticality judgements for language model evaluation

Michelle Suijkerbuijk (Centre for Language Studies, Radboud University), Zoë Prins (Institute for Logic, Language and Computation, University of Amsterdam), Marianne de Heer Kloots (Institute for Logic, Language and Computation, University of Amsterdam), Willem Zuidema (Institute for Logic, Language and Computation, University of Amsterdam), Stefan L. Frank (Centre for Language Studies, Radboud University)

In 2020, Warstadt and colleagues introduced the Benchmark of Linguistic Minimal Pairs (BLiMP), a set of minimal pairs of well-known grammatical phenomena in English that is used to evaluate the linguistic abilities of language models. In the current work, we extend this line of work by creating a benchmark of minimal pairs for Dutch: BLiMP-NL. We present a corpus of 8400 Dutch sentence pairs, intended for the grammatical evaluation of language models. Each pair consists of a grammatical sentence and a minimally different ungrammatical sentence.

By going through all the volumes of the Syntax of Dutch, we ended up with 22 grammatical phenomena (e.g., anaphor agreement, wh-movement), each consisting of several sub-phenomena (84 in total). For each of the 84 sub-phenomena, 10 minimal pairs were created by hand and another 90 minimal pairs were created synthetically. An example of a minimal pair for the sub-phenomenon ‘impersonal passive’ can be found below.

a. Er wordt veel gelachen door de vriendinnen. [grammatical]
   there is much laughed by the girlfriends
b. Yara wordt veel gelachen door de vriendinnen. [ungrammatical]
   Yara is much laughed by the girlfriends

In creating these minimal pairs, we improved on the methodology of the English BLiMP by, for example, making sure that the critical word (i.e., the point at which the sentence becomes unacceptable; “gelachen”/“laughed” in the example) is the same for both sentences, which makes evaluation less noisy both when evaluating people and when evaluating language models. Another improvement is in the set-up of the experiment in which we test the performance of native speakers. The 84 sub-phenomena were divided over 7 experiments, and in each experiment there were 30 participants. These participants all performed a self-paced reading task, in which they read every sentence word by word and rated its acceptability. In contrast to the original BLiMP, these ratings were not binary but made on a scale of 1 to 7, to capture the gradience of acceptability judgements.

We used our dataset to evaluate several Transformer language models. We evaluate these models both by determining for which fraction of minimal pairs the grammatical sentence receives a higher probability from the language model than the ungrammatical sentence, and by comparing the probability distributions of the sentences with the distribution of human ratings. We considered both causal and masked language models and found that bigger models can identify grammaticality quite reliably. Interestingly, we see that small masked language models perform better than bigger causal models when comparing probabilities per minimal pair, which is inconsistent with the abilities of these models on other evaluation criteria.
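
For the causal-LM case, the per-pair comparison can be sketched as follows (our illustration; the model id is an assumption, and masked-LM scoring would need pseudo-log-likelihoods instead).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "GroNLP/gpt2-small-dutch"  # assumed Dutch causal LM on the Hugging Face hub
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def sentence_logprob(sentence):
    """Sum of token log-probabilities of `sentence` under the causal LM."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # log-probability of each token given its prefix
    logprobs = logits[:, :-1].log_softmax(-1).gather(-1, ids[:, 1:, None]).squeeze(-1)
    return logprobs.sum().item()

good = "Er wordt veel gelachen door de vriendinnen."
bad = "Yara wordt veel gelachen door de vriendinnen."
# a pair counts as correct if the grammatical sentence is more probable
print(sentence_logprob(good) > sentence_logprob(bad))
```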


Distributional Semantic Modeling of SC-Pair Polish Aspectual Verbs

Matthew Micyk (Amsterdam Center for Language and Communication, University of Amsterdam), Jelke Bloem (Institute for Logic, Language and Computation, University of Amsterdam)

Current theoretical literature on the aspectual opposition of verbs in Polish categorizes relationships between aspectual pairs in several ways: suffixal pairs, multiplicative-semelfactive pairs (MS-pairs), simplex-complex pairs (SC-pairs), and suppletive pairs. SC-pairs comprise a simplex imperfective verb and a compound perfective verb, which consists of the simplex imperfective verb with the addition of a prefix. This prefix has been termed an ‘empty’ prefix, but that characterization is no longer considered strictly true: while these prefixes were previously described as lexically empty, contributing only the grammatical value of perfective aspect, they are now described as contributing a lexico-semantic value of terminativity to the simplex imperfective verb, which is not inherently terminative but can be terminative in specific contexts.

Distributional semantic modeling provides insight into the semantic similarity of terms based on their co-occurrence in similar or identical contexts. The more often the imperfective and perfective verbs in an SC-pair occur in similar or identical contexts, the closer the relationship between the paired verbs. Adding a prefix that contributes the feature Terminativity and creates an aspectual partner without modifying the lexical content of the verb contrasts with lexical derivation, whereby the addition of a prefix does modify the lexical content and creates a new lexical item.

By comparing the similarity of SC-pairs grouped by which prefix appears in the compound perfective partner, I aim to identify which prefixes are used most often to contribute this feature of Terminativity and create a compound perfective partner to a simplex imperfective verb without modifying its lexical content. The findings of my distributional semantic modeling inquiry generally pattern after previous research into which prefixes most often appear as ‘empty’, as traditionally discussed in the literature on Slavic aspectology. This outcome simultaneously lends validity to the use of distributional semantic modeling in theoretical linguistics and strengthens previous findings in Slavic aspectology about which prefixes more often contribute the feature Terminativity without any additional lexical value.
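
A minimal sketch of this comparison, assuming pretrained Polish word vectors are available (the vector file name and the example SC-pairs below are placeholders, not the study's actual data):

```python
from collections import defaultdict
from gensim.models import KeyedVectors

# Hypothetical pretrained Polish embeddings and a hand-compiled list of SC-pairs.
kv = KeyedVectors.load_word2vec_format("polish_vectors.bin", binary=True)
sc_pairs = [("pisać", "napisać", "na"), ("czytać", "przeczytać", "prze")]  # (impf, pf, prefix)

def similarity_by_prefix(pairs, kv):
    """Mean cosine similarity of SC-pairs, grouped by the perfectivizing prefix."""
    sims = defaultdict(list)
    for impf, pf, prefix in pairs:
        if impf in kv and pf in kv:
            sims[prefix].append(kv.similarity(impf, pf))
    return {prefix: sum(vals) / len(vals) for prefix, vals in sims.items()}

print(similarity_by_prefix(sc_pairs, kv))
```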


Evaluation of Greek Word Embeddings

Leonidas Mylonadis (University of Amsterdam), Jelke Bloem (Institute for Logic, Language and Computation, University of Amsterdam; Data Science Centre, University of Amsterdam)

Word embeddings are crucial to widely applied natural language processing tasks such as machine translation and sentiment analysis. In addition, computational models that accurately capture semantic similarity as opposed to association or relatedness have wide-ranging applications and are an effective proxy evaluation for general-purpose representation-learning models (Hill, Reichart & Korhonen, 2015). Existing research on the evaluation of word embeddings has mostly been conducted on English. For example, the SimLex-999 dataset contains 999 English word pairings that were manually rated for their similarity by 50 participants. SimLex-999 also distinguishes itself from other word embedding evaluation frameworks by more accurately capturing similarity relations between word pairings. We created a version of SimLex-999 for Modern Greek so as to contribute to the evaluation of Greek word embeddings. We did so by translating the existing dataset of 999 English word pairs into Greek and then recruiting native Greek speakers to manually rate the semantic similarity of the word pairs. We then used the dataset produced by the manual annotators to evaluate popular Greek language models such as GREEK-BERT, M-BERT and XLM-R. This evaluation identifies which existing language models more accurately capture similarity relations and in doing so contributes to the development of accurate computational models for the Greek language.
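
As a sketch of the evaluation itself (the GREEK-BERT model id is assumed and the subword pooling choice is ours), one can compare human ratings against embedding cosine similarities via a Spearman correlation:

```python
import torch
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

name = "nlpaueb/bert-base-greek-uncased-v1"  # assumed Hugging Face id for GREEK-BERT
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

def word_vec(word):
    """Mean of the subword vectors from the last hidden layer, dropping [CLS]/[SEP]."""
    enc = tok(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    return hidden[1:-1].mean(0)

def evaluate(pairs):
    """pairs: list of (word1, word2, human_rating) tuples from the Greek SimLex data."""
    human, model_sims = [], []
    for w1, w2, rating in pairs:
        sim = torch.cosine_similarity(word_vec(w1), word_vec(w2), dim=0).item()
        model_sims.append(sim)
        human.append(rating)
    rho, _ = spearmanr(human, model_sims)
    return rho
```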


Evaluating the Linguistic Knowledge of Dutch Large Language Models

Julia Pestel, Raquel G. Alhama (Institute for Logic, Language and Computation, University of Amsterdam)

The rapid development of large language models (LLMs) has brought major improvements in their performance across a range of natural language processing (NLP) tasks. Recent work has contributed valuable resources for evaluating LLMs in an effort to determine their linguistic knowledge. Although many of these developments focus on English LLMs, research addressing the grammatical abilities of LLMs in other languages is flourishing, as is the case for Dutch (de Vries et al., 2023; Suijkerbuijk & Prins, 2024).

Here, we contribute to such efforts. We present a challenge set for evaluating the grammatical abilities of LLMs on major grammatical phenomena in Dutch. We design our dataset following the descriptive grammar in the Algemene Nederlandse Spraakkunst (ANS), which aims to provide a comprehensive description of the grammatical phenomena of contemporary Standard Dutch. We focus on four different types of phrases (noun phrase, adjective phrase, adpositional phrase and verb phrase) and 13 syntactic phenomena that span these phrases. For each phenomenon, we provide 50 minimal pairs, i.e. pairs of minimally different sentences that differ in grammatical acceptability with respect to the specific syntactic phenomenon. To construct the dataset, we retrieved the minimal pairs provided on the ANS website (which ranged from 2 to 10 pairs), and we are currently extending each set with additional pairs until reaching 50. We generate the extra pairs using a generative model (in particular, ChatGPT) and revise them manually.

The next step in our (still ongoing) work is evaluating acceptability judgments on these minimal pairs for a range of Dutch LLMs (in particular, RobBERT (Delobelle et al., 2020), BERTje (de Vries et al., 2019), GEITje (Vanroy, 2023), GPT2 (Radford et al., 2019), and Llama (Touvron et al., 2023)). Our analyses will be extended with a comparison against human acceptability judgments, performed on a subset of our dataset. Thus, our work will provide a reliable dataset and an analysis of the linguistic knowledge of LLMs, shedding light on the grammatical abilities of Dutch LLMs.


Quantifying Politicization: Leveraging Contextualized Embeddings of Politicized Keywords

Sidi Wang (Computational Linguistics and Text Mining Lab, Vrije Universiteit Amsterdam), Jelke Bloem (Data Science Centre, University of Amsterdam; Institute for Logic, Language and Computation, University of Amsterdam)

In previous work, a metric measuring politicization was developed for foreign aid project reports using the Doc2Vec model. This metric is computed by measuring the cosine similarity between the document embedding of each report and sets of known politicized keywords, which were derived from the USAID thesaurus and hand-coded with politicization scores by political science domain experts. A Spearman correlation test between the metric and a politicization silver score derived from the report metadata showed a weak but statistically significant correlation. However, the Doc2Vec model only produces static, non-context-sensitive embeddings. It is built on Word2Vec, which is known for its limited handling of polysemy, and for a given document it generates a static vector in which all words contribute equally, irrespective of context.

As a follow-up to this pilot study, the objective of this research is to develop a novel metric for foreign aid project reports using contextualized embeddings produced by BERT. The pipeline uses a BERT model to embed a given politicized keyword in the context in which it appears, and takes the sum of the keyword's last four hidden layers as its embedding for the politicization score calculation. This study compares the results of the contextualized embedding metric with the metric yielded by the Doc2Vec model. Furthermore, I explore the correlation between politicization and project effectiveness, since political interests may affect aid effectiveness.
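
A minimal sketch of the embedding step (the model id is a stand-in and the keyword-matching heuristic is ours; the pipeline's exact details are not reproduced here):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"  # stand-in; the study's exact BERT variant is not specified here
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

def keyword_embedding(sentence, keyword):
    """Contextual embedding of `keyword` in `sentence`: sum of its last four hidden layers,
    averaged over the keyword's subword tokens (naive match on the first occurrence)."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states          # tuple: embedding layer + encoder layers
    last4 = torch.stack(hidden[-4:]).sum(0)[0]       # (seq_len, hidden_size)
    kw_ids = tok(keyword, add_special_tokens=False).input_ids
    tokens = enc.input_ids[0].tolist()
    for i in range(len(tokens) - len(kw_ids) + 1):
        if tokens[i:i + len(kw_ids)] == kw_ids:
            return last4[i:i + len(kw_ids)].mean(0)
    return None                                      # keyword not found in this sentence
```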


Using Natural Language Processing to Quantify Politicization in Foreign Aid Reports

Léa Gontard, Jelke Bloem (Institute for Logic, Language and Computation, University of Amsterdam)

Politicization within foreign aid reports is of great interest to political science experts, who want to measure its correlation with major topics of concern such as aid effectiveness. Reports that evaluate foreign aid projects are frequently made publicly available by the governments that commission them. However, textual reports of foreign aid are usually unstructured and unstandardized. The goal of this study is to operationalize a text-based metric of politicization for documents by investigating the use of contextual word embeddings. We work with a sample of public aid project evaluations of the United States Agency for International Development (USAID) from the Development Experience Clearinghouse (DEC), written by external third parties on health-related projects conducted by USAID. We attempt to capture ideological differences around certain keywords between reports produced under Republican and Democratic administrations through contextual word embedding methods. We compare the state-of-the-art contextual embedding model Bidirectional Encoder Representations from Transformers (BERT) and its enhanced version, the Robustly Optimized BERT Pretraining Approach (RoBERTa). Using word embeddings allows us to derive politicization scores for keywords from the cosine similarity of their averaged vector representations for each party. Politicization scores for documents are generated by averaging the politicization scores of the keywords present in a text. During our presentation, we will present the results from assessing the correlation between the generated politicization scores and experts’ labelled data at the keyword level. We will also present the correlation at the document level, using a ‘silver standard’ score generated from the experts’ labelled data.
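
A sketch of the scoring step described above. Using the cosine distance (one minus similarity) as the politicization score is our assumption; the abstract only specifies that the score is derived from the cosine similarity of party-averaged vectors.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def keyword_politicization(rep_vectors, dem_vectors):
    """Score for one keyword: cosine distance between its averaged contextual vectors
    from Republican-era and Democratic-era reports (higher = more divergent usage)."""
    return 1.0 - cosine(np.mean(rep_vectors, axis=0), np.mean(dem_vectors, axis=0))

def document_politicization(keyword_scores, keywords_in_doc):
    """Document score: mean politicization of the scored keywords present in the document."""
    present = [keyword_scores[k] for k in keywords_in_doc if k in keyword_scores]
    return sum(present) / len(present) if present else 0.0
```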


Exploring Semantic Consistency in Zero-Shot Pre-trained Language Models for Emotion Detection Tasks in the Social Sciences and Humanities

Pepijn Stoop (University of Amsterdam), Jelke Bloem (Institute for Logic, Language and Computation, University of Amsterdam)

In the social sciences and humanities (SSH), the use of pre-trained language models (PLMs) for zero-shot emotion detection tasks is on the rise. Despite achieving promising accuracy rates, the semantic consistency of these models in dimensional emotion detection tasks remains underexplored. Our study investigates how word-level prompt perturbation and the temperature hyper-parameter affect this consistency using EmoBank, an English corpus annotated with dimensional emotion labels. We evaluated both intra- and inter-tool semantic consistency by repeatedly prompting two PLMs commonly used in SSH research for emotion detection across a range of temperature settings. By introducing a novel metric, the Vector-Model-Consistency-Score (VMCS), we assess various dissimilarities between predicted emotion vectors and aggregate them into a single score indicative of the semantic consistency of the model.
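
The VMCS itself is not defined in this abstract. As a generic stand-in (explicitly not the authors' metric), here is one simple way to quantify consistency across repeated predictions of valence-arousal-dominance (VAD) vectors: the mean pairwise cosine similarity over runs.

```python
from itertools import combinations
import numpy as np

def mean_pairwise_cosine(predictions):
    """predictions: (n_runs, 3) array of VAD vectors predicted for the same text.
    Returns the mean cosine similarity over all pairs of runs (1.0 = fully consistent)."""
    predictions = np.asarray(predictions, dtype=float)
    sims = []
    for i, j in combinations(range(len(predictions)), 2):
        a, b = predictions[i], predictions[j]
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean(sims))

# e.g. five repeated runs on the same sentence at a given temperature
runs = [[3.2, 2.9, 3.1], [3.0, 3.0, 3.0], [3.1, 2.8, 3.2], [2.9, 3.1, 3.0], [3.2, 3.0, 2.9]]
print(mean_pairwise_cosine(runs))
```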

Our findings indicate that prompt perturbation yields strong, contrasting effects on inter-tool consistency, associated with increased VMCS scores in one PLM and decreased scores in the other. Furthermore, lower temperature settings contribute positively to semantic consistency, although no significant relationships between temperature and accuracy were observed. Notably, our analysis reveals that the baseline outperformed both PLMs in terms of accuracy, suggesting a deficiency in the generalization of emotion detection tasks within these models.

In conclusion, our study has found that prompt perturbation and temperature settings affect semantic consistency and that PLMs can show a lack of generalization during zero-shot emotion detection tasks. Therefore, we recommend exercising caution when using PLMs for emotion detection without prior fine-tuning on additional datasets and emphasize the importance of extensive testing with a collection of prompts and temperature settings.


Linearly Mapping from Graph to Text Space

Congfeng Cao, Jelke Bloem (Institute for Logic, Language and Computation, University of Amsterdam)

Aligned multi-modality models, such as vision-language models and audio-language models, have recently attracted significant attention. These models address the limitations of single-modality encoders by aligning the encoders of different modalities. For example, the standard procedure for training a vision-language model aligns text and image representations using a contrastive loss function that maximizes the similarity between matched image-text pairs while pushing negative pairs apart.

Some work trains a linear mapping from the output embeddings of vision encoders to the input embeddings of language models to explore the relationship between vision and language encoders in vision-language models; this linear transformation alone yields impressive performance on image captioning and VQA tasks. In the vision-language domain, linear regression and relative representations have also been used to evaluate the relationship between multi-modality encoders using a set of paired multi-modality representations.

We raise the following central question: do graph and language encoders in graph-language models also differ only by a linear transformation? Given that graphs have a more complex topological structure than the grid structure of images, do graph-language models behave like vision-language models under a linear transformation?

Similar to CLIP, an aligned text-image model in the vision-language domain, MoleculeSTM is a multi-modality graph-language model trained on pairs of chemical graph structures and text descriptions. We hypothesize that graph and language embeddings can also be related by a linear model. We select a collection of chemical graph and text description pairs from the PubChemSTM dataset, split it into a training set and a test set, use the training set to learn a linear transformation from the graph embedding space to the text embedding space, and apply this transformation to predict text embeddings from graph embeddings on the test set. For each test item we thus have two text embeddings, one from MoleculeSTM and one from the linear transformation, which can be compared using cosine similarity.
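
A minimal sketch of the mapping and its evaluation, assuming the paired embeddings have already been extracted from the two encoders (variable names and shapes are ours):

```python
import numpy as np

# Assumed inputs: paired embeddings from the graph and text encoders.
# G_train: (n_train, d_graph), T_train: (n_train, d_text); G_test, T_test analogously.

def fit_linear_map(G_train, T_train):
    """Least-squares W such that G_train @ W approximates T_train."""
    W, *_ = np.linalg.lstsq(G_train, T_train, rcond=None)
    return W

def mean_cosine(A, B):
    """Row-wise cosine similarity between two matrices, averaged over rows."""
    num = (A * B).sum(axis=1)
    denom = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return float((num / denom).mean())

# W = fit_linear_map(G_train, T_train)
# print(mean_cosine(G_test @ W, T_test))  # how close mapped graph embeddings land to text embeddings
```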


Compositionality in Emergent Languages: Evaluating the Metrics

Urtė Jakubauskaitė (Institute for Logic, Language and Computation, University of Amsterdam), Raquel G. Alhama (Institute for Logic, Language and Computation, University of Amsterdam), Phong Le (Amazon Alexa)

Compositionality is a key characteristic of natural languages: the meaning of a complex linguistic unit is derived from the meanings of its components. An open question is whether we can simulate the emergence of a language that exhibits compositionality to some degree. To address that question, studies in language emergence have used referential games, in which agents work together on a task, sharing a common reward system based on their collective performance. Although this approach offers flexibility, it also introduces a challenge: humans cannot readily interpret the messages that agents exchange. So, how can we ensure that these emergent languages are genuinely compositional?

Prior work has employed a range of metrics to quantify compositionality; most commonly, topographic similarity (Brighton & Kirby, 2006), positional disentanglement, and bag-of-symbols disentanglement (Chaabouni et al., 2020). However, since the metrics are used over uninterpretable languages, we do not know to what extent they are really sensitive to compositional aspects of the messages, or whether there are any other features that are captured by these metrics. In other words, the metrics themselves have not been empirically evaluated.
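
To illustrate the most common of these metrics, here is a small sketch of topographic similarity: the Spearman correlation between pairwise distances in meaning space and pairwise distances in message space. The choice of Hamming distance for meanings and edit distance for messages follows common practice and is our assumption.

```python
from itertools import combinations
from scipy.stats import spearmanr

def edit_distance(a, b):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def hamming(a, b):
    """Number of differing attribute values between two meanings of equal length."""
    return sum(x != y for x, y in zip(a, b))

def topographic_similarity(meanings, messages):
    """Spearman correlation between pairwise meaning distances and message distances."""
    pairs = list(combinations(range(len(meanings)), 2))
    meaning_d = [hamming(meanings[i], meanings[j]) for i, j in pairs]
    message_d = [edit_distance(messages[i], messages[j]) for i, j in pairs]
    rho, _ = spearmanr(meaning_d, message_d)
    return rho

# a toy, perfectly compositional language: one symbol per attribute value
meanings = [("red", "circle"), ("red", "square"), ("blue", "circle"), ("blue", "square")]
messages = ["ra", "rb", "ba", "bb"]
print(topographic_similarity(meanings, messages))  # close to 1.0
```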

To address this issue, we present a novel dataset consisting of a set of (referential) games, paired with a set of grammars that can generate languages to describe items in the games. Crucially, the grammars range from those that generate messages with minimal compositionality to those that are highly compositional. By applying compositionality metrics to our dataset, we can determine whether the metrics can differentiate grammars with varying levels of compositionality. Additionally, our dataset includes grammars that make use of other linguistic factors, such as case marking and polysemy, which have not received attention before. With this novel approach, our ongoing work will allow us to find out the relative sensitivity and robustness of existing metrics, and will pave the way for the design and evaluation of novel metrics that capture specific aspects of emergent languages.

Brighton, Henry & Simon Kirby. 2006. Understanding Linguistic Evolution by Visualizing the Emergence of Topographic Mappings. Artificial Life 12(2): 229–242.

Chaabouni, Rahma, Eugene Kharitonov, Diane Bouchacourt, Emmanuel Dupoux & Marco Baroni. 2020. Compositionality and Generalization in Emergent Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4427–4442, Online. Association for Computational Linguistics.


A case for peer speech data in modelling child language acquisition and development

Svea Bösch (Amsterdam Center for Language and Communication, University of Amsterdam), Jelke Bloem (Institute for Logic, Language and Computation, University of Amsterdam), Raquel Garrido Alhama (Institute for Logic, Language and Computation, University of Amsterdam)

Typical child language development does not take place in isolation: children receive communicative input from adults as well as from other children. However, prior attempts at modelling language acquisition have largely neglected the influence that peer speech can have on development, opting to focus on the effects of child-directed speech or adult interactions on language development.

Here, we examine how a model with child-directed input compares to a model with peer-speech input, on the basis of which offers a more accurate picture of the developmental patterns observed in children’s language learning. These models will be comparatively evaluated on the frequency and appropriateness of telegraphic speech (a simplified form of speech in which grammatical function words and inflectional endings are omitted) and optional infinitive errors (when the infinitive form of a verb is used even though the target is an inflected form), the age of acquisition of vocabulary, and their ability to appropriately answer wh-questions. These metrics have been selected because they align with the development of typically developing children’s language skills.

By introducing a model with peer speech as the primary input into the existing body of models of language acquisition, we hope to bring the standard of language acquisition modelling closer to the actual development of children. Our model will allow us to analyse the effects of different types of linguistic input on various aspects of lexical and grammatical development.