Publication: AbsVis – Benchmarking How Humans and Vision-Language Models "See" Abstract Concepts in Images
Abstract
Abstract concepts like mercy and peace often lack clear visual grounding, and thus challenge humans and models to provide suitable image representations. To address this challenge, we introduce AbsVis – a dataset of 675 images annotated with 14,175 concept–explanation attributions from humans and two Vision-Language Models (VLMs: Qwen and LLaVA), where each concept is accompanied by a textual explanation. We compare human and VLM attributions in terms of diversity, abstractness, and alignment, and find that humans attribute more varied concepts. AbsVis also includes 2,680 human preference judgments evaluating the quality of a subset of these annotations, showing that overlapping concepts (attributed by both humans and VLMs) are most preferred. Explanations clarify and strengthen the perceived attributions, both from humans and VLMs. Finally, we show that VLMs can approximate human preferences, and we use these approximated preferences to fine-tune VLMs via Direct Preference Optimization (DPO), yielding improved alignment with preferred concept–explanation pairs.
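The comparison described in the abstract can be illustrated with a short sketch. The following Python snippet is not the authors' code; it assumes a hypothetical record format (an image id mapped to the sets of concepts attributed by humans and by a VLM) and computes, per image, the overlapping concepts together with a Jaccard score as one simple alignment measure.

# Minimal sketch (not from the paper): compare concept attributions per image.
# The record format below is hypothetical; the actual AbsVis file layout may differ.
from typing import Dict, Set, Tuple

def concept_overlap(human: Set[str], vlm: Set[str]) -> Tuple[Set[str], float]:
    """Return the concepts attributed by both sources and their Jaccard score."""
    shared = human & vlm
    union = human | vlm
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard

# Hypothetical attributions for a single image.
human_concepts: Dict[str, Set[str]] = {"img_001": {"mercy", "compassion", "care"}}
vlm_concepts: Dict[str, Set[str]] = {"img_001": {"compassion", "kindness"}}

for image_id, human_set in human_concepts.items():
    shared, score = concept_overlap(human_set, vlm_concepts.get(image_id, set()))
    print(image_id, sorted(shared), round(score, 2))

Given the finding that overlapping concepts are the most preferred, such shared concept–explanation pairs would be natural candidates for the "chosen" side of the preference pairs used in DPO fine-tuning.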
Cite
ISO 690
TATER, Tarun, Diego FRASSINELLI, Sabine SCHULTE IM WALDE, 2025. AbsVis – Benchmarking How Humans and Vision-Language Models "See" Abstract Concepts in Images. 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP). Suzhou, China, Nov. 4, 2025 - Nov. 9, 2025. In: CHRISTODOULOPOULOS, Christos, ed., Tanmoy CHAKRABORTY, ed., Carolyn ROSE, ed., Violet PENG, ed. The 2025 Conference on Empirical Methods in Natural Language Processing - proceedings of the conference, EMNLP 2025. Kerrville, TX: Association for Computational Linguistics, 2025, pp. 8271-8292. ISBN 979-8-89176-332-6. Available at: doi: 10.18653/v1/2025.emnlp-main.417

BibTeX
@inproceedings{Tater2025AbsVi-76377,
title={AbsVis – Benchmarking How Humans and Vision-Language Models "See" Abstract Concepts in Images},
year={2025},
doi={10.18653/v1/2025.emnlp-main.417},
isbn={979-8-89176-332-6},
address={Kerrville, TX},
publisher={Association for Computational Linguistics},
booktitle={The 2025 Conference on Empirical Methods in Natural Language Processing - proceedings of the conference, EMNLP 2025},
pages={8271--8292},
editor={Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet},
author={Tater, Tarun and Frassinelli, Diego and Schulte im Walde, Sabine}
}

RDF
<rdf:RDF
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:bibo="http://purl.org/ontology/bibo/"
xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:void="http://rdfs.org/ns/void#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#" >
<rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/76377">
<dc:creator>Tater, Tarun</dc:creator>
<dcterms:abstract>Abstract concepts like mercy and peace often lack clear visual grounding, and thus challenge humans and models to provide suitable image representations. To address this challenge, we introduce AbsVis – a dataset of 675 images annotated with 14,175 concept–explanation attributions from humans and two Vision-Language Models (VLMs: Qwen and LLaVA), where each concept is accompanied by a textual explanation. We compare human and VLM attributions in terms of diversity, abstractness, and alignment, and find that humans attribute more varied concepts. AbsVis also includes 2,680 human preference judgments evaluating the quality of a subset of these annotations, showing that overlapping concepts (attributed by both humans and VLMs) are most preferred. Explanations clarify and strengthen the perceived attributions, both from humans and VLMs. Finally, we show that VLMs can approximate human preferences, and we use these approximated preferences to fine-tune VLMs via Direct Preference Optimization (DPO), yielding improved alignment with preferred concept–explanation pairs.</dcterms:abstract>
<dc:creator>Frassinelli, Diego</dc:creator>
<dcterms:title>AbsVis – Benchmarking How Humans and Vision-Language Models "See" Abstract Concepts in Images</dcterms:title>
<dc:language>eng</dc:language>
<dc:creator>Schulte im Walde, Sabine</dc:creator>
<dc:contributor>Tater, Tarun</dc:contributor>
<dcterms:issued>2025</dcterms:issued>
<dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/45"/>
<dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/45"/>
<foaf:homepage rdf:resource="http://localhost:8080/"/>
<void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
<dc:contributor>Frassinelli, Diego</dc:contributor>
<bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/76377"/>
<dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2026-02-27T06:50:28Z</dc:date>
<dc:contributor>Schulte im Walde, Sabine</dc:contributor>
<dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2026-02-27T06:50:28Z</dcterms:available>
</rdf:Description>
</rdf:RDF>
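The RDF/XML record above can also be consumed programmatically. The sketch below uses the rdflib library (an assumption; it is not part of the repository export) to load the record from a local file and list its Dublin Core title and creators.

# Sketch: parse the repository's RDF/XML export with rdflib (pip install rdflib).
from rdflib import Graph, Namespace

DC = Namespace("http://purl.org/dc/elements/1.1/")
DCTERMS = Namespace("http://purl.org/dc/terms/")

graph = Graph()
# "record.rdf" is a placeholder filename containing the RDF/XML shown above.
graph.parse("record.rdf", format="xml")

title = next(graph.objects(None, DCTERMS.title), None)
creators = sorted(str(creator) for creator in graph.objects(None, DC.creator))

print("Title:", title)
print("Creators:", ", ".join(creators))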