Publication: Interactive Visual Investigation of Word Embedding Contextualizations in Large Language Models
Abstract
Language models (LMs) have revolutionized natural language processing (NLP) methods with their ability to generate text and perform various language-related tasks. One essential characteristic of LMs is their representation of words through contextual word embeddings, i.e., high-dimensional vectors capturing linguistic properties (e.g., semantics and syntax) of their surrounding contexts. Supporting the interpretability of contextual word embeddings is crucial for assessing model strengths and limitations, identifying encoded biases, and making informed decisions about sufficient embedding layers for text-analysis tasks. While word embeddings have proven to be effective representations for various NLP methods, much remains to be learned about how they encode and represent language properties. Existing approaches often examine embeddings in relation to high-level tasks such as question answering or sentiment classification, overlooking the detailed analysis of linguistic phenomena, e.g., how models represent word categories like function words. These fine-grained linguistic insights are essential for inspecting whether LMs "understand" language, as is sometimes assumed in academic publications. Moreover, given that embeddings have already demonstrated their ability to capture diverse language properties, they can serve as effective features for further analysis tasks when appropriately applied. A deeper examination of their successful utilization is still missing.
In the first part of the thesis, we address the research gap related to linguistically motivated embedding explanation methods. Collaborating with computational linguists, we design three visual analytics techniques that facilitate the exploration of embedding properties. In particular, we investigate reasons for word contextualization, i.e., differences in the embeddings of the same word used in different contexts. We explore whether contextualization captures the word's semantic meaning (and polysemy) or context variations, as assumed in related work. We design visual methods for gaining insights into the linguistic properties encoded in embedding vectors and show how this information is propagated through the models' architectures. Lastly, we investigate how models capture word functionality, i.e., whether models correctly encode the meaning and constraints of different function word classes.
In the second part of the thesis, we examine how embedding vectors can be utilized in diverse application scenarios to assist researchers in selecting the best model for the user and task at hand. In particular, we present three interactive approaches for visually comparing model adaptations and their generated outputs, using embedding vectors as the main features for the analysis. First, we design visual methods that enable effective comparison of text outputs generated by causal LMs, including the biases associated with different prompt inputs. Second, we develop visual methods to compare masked LM adaptations, particularly their influence on semantic language concept representations. Finally, we explore methods that incorporate gamification to learn users' preferences, captured through word-embedding similarity, for optimal LM selection.
Through the presented visual analytics methods, we show both the strengths and limitations of LMs and motivate the development of more linguistically aware LMs, as well as further methods for their effective analysis.
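As an illustration of the notion of contextualization above: the degree to which a model contextualizes a word is commonly quantified by comparing the word's embedding vectors across different sentences, e.g., with cosine similarity. The following sketch uses hypothetical toy vectors (the helper function, values, and sentence labels are illustrative, not taken from the thesis):

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors:
    # dot(u, v) / (|u| * |v|), in [-1, 1].
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical contextual embeddings for the word "bank"
# as produced by some embedding layer in two contexts.
bank_river = [0.9, 0.1, 0.2]  # "the river bank" (toy values)
bank_money = [0.1, 0.8, 0.3]  # "the bank loan"  (toy values)

# Low similarity between occurrences of the same word suggests
# strong contextualization (e.g., distinct word senses).
sim = cosine_similarity(bank_river, bank_money)
```

In practice, such pairwise similarities would be computed over many occurrences and layers to see where in the architecture contextual differences emerge.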
Cite
ISO 690
SEVASTJANOVA, Rita, 2025. Interactive Visual Investigation of Word Embedding Contextualizations in Large Language Models [Dissertation]. Konstanz: Universität Konstanz

BibTeX
@phdthesis{Sevastjanova2025-07-04Inter-73831,
title={Interactive Visual Investigation of Word Embedding Contextualizations in Large Language Models},
year={2025},
author={Sevastjanova, Rita},
address={Konstanz},
school={Universität Konstanz}
}

RDF
<rdf:RDF
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:bibo="http://purl.org/ontology/bibo/"
xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:void="http://rdfs.org/ns/void#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#" >
<rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/73831">
<dcterms:rights rdf:resource="http://creativecommons.org/licenses/by-nc/4.0/"/>
<dc:creator>Sevastjanova, Rita</dc:creator>
<bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/73831"/>
<dcterms:hasPart rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/73831/4/Sevastjanova_2-zojoodl1iu4v2.pdf"/>
<dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
<dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-07-07T05:03:30Z</dc:date>
<dc:contributor>Sevastjanova, Rita</dc:contributor>
<dcterms:issued>2025-07-04</dcterms:issued>
<dspace:hasBitstream rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/73831/4/Sevastjanova_2-zojoodl1iu4v2.pdf"/>
<dcterms:abstract>Language models (LMs) have revolutionized natural language processing (NLP) methods with their ability to generate text and perform various language-related tasks. One essential characteristic of LMs is their representation of words through contextual word embeddings, i.e., high-dimensional vectors capturing linguistic properties (e.g., semantics and syntax) of their surrounding contexts. Supporting the interpretability of contextual word embeddings is crucial for assessing model strengths and limitations, identifying encoded biases, and making informed decisions about sufficient embedding layers for text-analysis tasks. While word embeddings have proven to be effective representations for various NLP methods, much remains to be learned about how they encode and represent language properties. Existing approaches often examine embeddings in relation to high-level tasks such as question answering or sentiment classification, overlooking the detailed analysis of linguistic phenomena, e.g., how models represent word categories like function words. These fine-grained linguistic insights are essential for inspecting whether LMs "understand" language, as is sometimes assumed in academic publications. Moreover, given that embeddings have already demonstrated their ability to capture diverse language properties, they can serve as effective features for further analysis tasks when appropriately applied. A deeper examination of their successful utilization is still missing.
In the first part of the thesis, we address the research gap related to linguistically motivated embedding explanation methods. Collaborating with computational linguists, we design three visual analytics techniques that facilitate the exploration of embedding properties. In particular, we investigate reasons for word contextualization, i.e., differences in the embeddings of the same word used in different contexts. We explore whether contextualization captures the word's semantic meaning (and polysemy) or context variations, as assumed in related work. We design visual methods for gaining insights into the linguistic properties encoded in embedding vectors and show how this information is propagated through the models' architectures. Lastly, we investigate how models capture word functionality, i.e., whether models correctly encode the meaning and constraints of different function word classes.
In the second part of the thesis, we examine how embedding vectors can be utilized in diverse application scenarios to assist researchers in selecting the best model for the user and task at hand. In particular, we present three interactive approaches for visually comparing model adaptations and their generated outputs, using embedding vectors as the main features for the analysis. First, we design visual methods that enable effective comparison of text outputs generated by causal LMs, including the biases associated with different prompt inputs. Second, we develop visual methods to compare masked LM adaptations, particularly their influence on semantic language concept representations. Finally, we explore methods that incorporate gamification to learn users' preferences, captured through word-embedding similarity, for optimal LM selection.
Through the presented visual analytics methods, we show both the strengths and limitations of LMs and motivate the development of more linguistically aware LMs, as well as further methods for their effective analysis.</dcterms:abstract>
<dc:rights>Attribution-NonCommercial 4.0 International</dc:rights>
<dc:language>eng</dc:language>
<dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-07-07T05:03:30Z</dcterms:available>
<dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
<dcterms:title>Interactive Visual Investigation of Word Embedding Contextualizations in Large Language Models</dcterms:title>
</rdf:Description>
</rdf:RDF>