Publication: Interactive Visual Investigation of Word Embedding Contextualizations in Large Language Models
Abstract
Language models (LMs) have revolutionized natural language processing (NLP) methods with their ability to generate text and perform various language-related tasks. One essential characteristic of LMs is their representation of words through contextual word embeddings, i.e., high-dimensional vectors capturing linguistic properties (e.g., semantics and syntax) of their surrounding contexts. Supporting the interpretability of contextual word embeddings is crucial for assessing model strengths and limitations, identifying encoded biases, and making informed decisions about sufficient embedding layers for text-analysis tasks. While word embeddings have proven to be effective representations for various NLP methods, much remains to be learned about how they encode and represent language properties. Existing approaches often examine embeddings in relation to high-level tasks such as question answering or sentiment classification, overlooking the detailed analysis of linguistic phenomena, e.g., how models represent word categories like function words. These fine-grained linguistic insights are essential for inspecting whether LMs "understand" language, as is sometimes assumed in academic publications. Moreover, given that embeddings have already demonstrated their ability to capture diverse language properties, they can serve as effective features for further analysis tasks when appropriately applied. A deeper examination of their successful utilization is still missing.
In the first part of the thesis, we address the research gap related to linguistically motivated embedding explanation methods. Collaborating with computational linguists, we design three visual analytics techniques that facilitate the exploration of embedding properties. In particular, we investigate reasons for word contextualization, i.e., differences in the embeddings of the same word used in different contexts. We explore whether contextualization captures the word's semantic meaning (and polysemy) or context variations, as assumed in related work. We design visual methods for gaining insights into the linguistic properties encoded in embedding vectors and show how this information is propagated through the models' architectures. Lastly, we investigate how models capture word functionality, i.e., whether models correctly encode the meaning and constraints of different function word classes.
In the second part of the thesis, we examine how embedding vectors can be utilized in diverse application scenarios to assist researchers in selecting the best model for the user and task at hand. In particular, we present three interactive approaches for visually comparing model adaptations and their generated outputs, using embedding vectors as the main features for the analysis. First, we design visual methods that enable effective comparison of text outputs generated by causal LMs, including the biases associated with different prompt inputs. Second, we develop visual methods to compare masked LM adaptations, particularly their influence on semantic language concept representations. Finally, we explore methods that incorporate gamification to learn users' preferences, captured through word-embedding similarity, for optimal LM selection.
Through the presented visual analytics methods, we show both the strengths and limitations of LMs and motivate the development of more linguistically aware LMs, as well as further methods for their effective analysis.
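As an illustration of the notion of contextualization above: the degree to which a model contextualizes a word is commonly quantified by comparing the word's embedding vectors across different sentences, e.g., with cosine similarity. The following sketch uses hypothetical toy vectors (the helper function, values, and sentence labels are illustrative, not taken from the thesis):

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors:
    # dot(u, v) / (|u| * |v|), in [-1, 1].
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical contextual embeddings for the word "bank"
# as produced by some embedding layer in two contexts.
bank_river = [0.9, 0.1, 0.2]  # "the river bank" (toy values)
bank_money = [0.1, 0.8, 0.3]  # "the bank loan"  (toy values)

# Low similarity between occurrences of the same word suggests
# strong contextualization (e.g., distinct word senses).
sim = cosine_similarity(bank_river, bank_money)
```

In practice, such pairwise similarities would be computed over many occurrences and layers to see where in the architecture contextual differences emerge.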
Cite
ISO 690
SEVASTJANOVA, Rita, 2025. Interactive Visual Investigation of Word Embedding Contextualizations in Large Language Models [Dissertation]. Konstanz: Universität Konstanz

BibTeX
@phdthesis{Sevastjanova2025-07-04Inter-73831,
title={Interactive Visual Investigation of Word Embedding Contextualizations in Large Language Models},
year={2025},
author={Sevastjanova, Rita},
address={Konstanz},
school={Universität Konstanz}
}

RDF
<rdf:RDF
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:bibo="http://purl.org/ontology/bibo/"
xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:void="http://rdfs.org/ns/void#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#" >
<rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/73831">
<dcterms:rights rdf:resource="http://creativecommons.org/licenses/by-nc/4.0/"/>
<dc:creator>Sevastjanova, Rita</dc:creator>
<bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/73831"/>
<dcterms:hasPart rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/73831/4/Sevastjanova_2-zojoodl1iu4v2.pdf"/>
<dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
<dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-07-07T05:03:30Z</dc:date>
<dc:contributor>Sevastjanova, Rita</dc:contributor>
<dcterms:issued>2025-07-04</dcterms:issued>
<dspace:hasBitstream rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/73831/4/Sevastjanova_2-zojoodl1iu4v2.pdf"/>
<dcterms:abstract>Language models (LMs) have revolutionized natural language processing (NLP) methods with their ability to generate text and perform various language-related tasks. One essential characteristic of LMs is their representation of words through contextual word embeddings, i.e., high-dimensional vectors capturing linguistic properties (e.g., semantics and syntax) of their surrounding contexts. Supporting the interpretability of contextual word embeddings is crucial for assessing model strengths and limitations, identifying encoded biases, and making informed decisions about sufficient embedding layers for text-analysis tasks. While word embeddings have proven to be effective representations for various NLP methods, much remains to be learned about how they encode and represent language properties. Existing approaches often examine embeddings in relation to high-level tasks such as question answering or sentiment classification, overlooking the detailed analysis of linguistic phenomena, e.g., how models represent word categories like function words. These fine-grained linguistic insights are essential for inspecting whether LMs "understand" language, as is sometimes assumed in academic publications. Moreover, given that embeddings have already demonstrated their ability to capture diverse language properties, they can serve as effective features for further analysis tasks when appropriately applied. A deeper examination of their successful utilization is still missing.
In the first part of the thesis, we address the research gap related to linguistically motivated embedding explanation methods. Collaborating with computational linguists, we design three visual analytics techniques that facilitate the exploration of embedding properties. In particular, we investigate reasons for word contextualization, i.e., differences in the embeddings of the same word used in different contexts. We explore whether contextualization captures the word's semantic meaning (and polysemy) or context variations, as assumed in related work. We design visual methods for gaining insights into the linguistic properties encoded in embedding vectors and show how this information is propagated through the models' architectures. Lastly, we investigate how models capture word functionality, i.e., whether models correctly encode the meaning and constraints of different function word classes.
In the second part of the thesis, we examine how embedding vectors can be utilized in diverse application scenarios to assist researchers in selecting the best model for the user and task at hand. In particular, we present three interactive approaches for visually comparing model adaptations and their generated outputs, using embedding vectors as the main features for the analysis. First, we design visual methods that enable effective comparison of text outputs generated by causal LMs, including the biases associated with different prompt inputs. Second, we develop visual methods to compare masked LM adaptations, particularly their influence on semantic language concept representations. Finally, we explore methods that incorporate gamification to learn users' preferences, captured through word-embedding similarity, for optimal LM selection.
Through the presented visual analytics methods, we show both the strengths and limitations of LMs and motivate the development of more linguistically aware LMs, as well as further methods for their effective analysis.</dcterms:abstract>
<dc:rights>Attribution-NonCommercial 4.0 International</dc:rights>
<dc:language>eng</dc:language>
<dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-07-07T05:03:30Z</dcterms:available>
<dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
<dcterms:title>Interactive Visual Investigation of Word Embedding Contextualizations in Large Language Models</dcterms:title>
</rdf:Description>
</rdf:RDF>