AbsVis – Benchmarking How Humans and Vision-Language Models "See" Abstract Concepts in Images

Tater, Tarun; Frassinelli, Diego; Schulte im Walde, Sabine

doi:10.18653/v1/2025.emnlp-main.417

dc.contributor.author	Tater, Tarun
dc.contributor.author	Frassinelli, Diego
dc.contributor.author	Schulte im Walde, Sabine
dc.date.accessioned	2026-02-27T06:50:28Z
dc.date.available	2026-02-27T06:50:28Z
dc.date.issued	2025
dc.description.abstract	Abstract concepts like mercy and peace often lack clear visual grounding, and thus challenge humans and models to provide suitable image representations. To address this challenge, we introduce AbsVis – a dataset of 675 images annotated with 14,175 concept–explanation attributions from humans and two Vision-Language Models (VLMs: Qwen and LLaVA), where each concept is accompanied by a textual explanation. We compare human and VLM attributions in terms of diversity, abstractness, and alignment, and find that humans attribute more varied concepts. AbsVis also includes 2,680 human preference judgments evaluating the quality of a subset of these annotations, showing that overlapping concepts (attributed by both humans and VLMs) are most preferred. Explanations clarify and strengthen the perceived attributions, both from humans and VLMs. Finally, we show that VLMs can approximate human preferences and use them to fine-tune VLMs via Direct Preference Optimization (DPO), yielding improved alignments with preferred concept–explanation pairs.
dc.description.version	published	deu
dc.identifier.doi	10.18653/v1/2025.emnlp-main.417
dc.identifier.uri	https://kops.uni-konstanz.de/handle/123456789/76377
dc.language.iso	eng
dc.subject.ddc	400
dc.title	AbsVis – Benchmarking How Humans and Vision-Language Models "See" Abstract Concepts in Images	eng
dc.type	INPROCEEDINGS
dspace.entity.type	Publication
kops.citation.bibtex	@inproceedings{Tater2025AbsVi-76377, title={AbsVis – Benchmarking How Humans and Vision-Language Models "See" Abstract Concepts in Images}, year={2025}, doi={10.18653/v1/2025.emnlp-main.417}, isbn={979-8-89176-332-6}, address={Kerrville, TX}, publisher={Association for Computational Linguistics}, booktitle={The 2025 Conference on Empirical Methods in Natural Language Processing - proceedings of the conference, EMNLP 2025}, pages={8271--8292}, editor={Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet}, author={Tater, Tarun and Frassinelli, Diego and Schulte im Walde, Sabine} }
kops.citation.iso690	TATER, Tarun, Diego FRASSINELLI, Sabine SCHULTE IM WALDE, 2025. AbsVis – Benchmarking How Humans and Vision-Language Models "See" Abstract Concepts in Images. 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP). Suzhou, China, Nov 4, 2025 - Nov 9, 2025. In: CHRISTODOULOPOULOS, Christos, ed., Tanmoy CHAKRABORTY, ed., Carolyn ROSE, ed., Violet PENG, ed. The 2025 Conference on Empirical Methods in Natural Language Processing - proceedings of the conference, EMNLP 2025. Kerrville, TX: Association for Computational Linguistics, 2025, pp. 8271-8292. ISBN 979-8-89176-332-6. Available under: doi: 10.18653/v1/2025.emnlp-main.417	eng
kops.citation.rdf	<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:void="http://rdfs.org/ns/void#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/76377"> <dc:creator>Tater, Tarun</dc:creator> <dcterms:abstract>Abstract concepts like mercy and peace often lack clear visual grounding, and thus challenge humans and models to provide suitable image representations. To address this challenge, we introduce AbsVis – a dataset of 675 images annotated with 14,175 concept–explanation attributions from humans and two Vision-Language Models (VLMs: Qwen and LLaVA), where each concept is accompanied by a textual explanation. We compare human and VLM attributions in terms of diversity, abstractness, and alignment, and find that humans attribute more varied concepts. AbsVis also includes 2,680 human preference judgments evaluating the quality of a subset of these annotations, showing that overlapping concepts (attributed by both humans and VLMs) are most preferred. Explanations clarify and strengthen the perceived attributions, both from humans and VLMs. Finally, we show that VLMs can approximate human preferences and use them to fine-tune VLMs via Direct Preference Optimization (DPO), yielding improved alignments with preferred concept–explanation pairs.</dcterms:abstract> <dc:creator>Frassinelli, Diego</dc:creator> <dcterms:title>AbsVis – Benchmarking How Humans and Vision-Language Models "See" Abstract Concepts in Images</dcterms:title> <dc:language>eng</dc:language> <dc:creator>Schulte im Walde, Sabine</dc:creator> <dc:contributor>Tater, Tarun</dc:contributor> <dcterms:issued>2025</dcterms:issued> <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/45"/> <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/45"/> <foaf:homepage rdf:resource="http://localhost:8080/"/> <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/> <dc:contributor>Frassinelli, Diego</dc:contributor> <bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/76377"/> <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2026-02-27T06:50:28Z</dc:date> <dc:contributor>Schulte im Walde, Sabine</dc:contributor> <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2026-02-27T06:50:28Z</dcterms:available> </rdf:Description> </rdf:RDF>
kops.conferencefield	2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), Nov 4, 2025 - Nov 9, 2025, Suzhou, China	eng
kops.date.conferenceEnd	2025-11-09
kops.date.conferenceStart	2025-11-04
kops.description.funding	{"first":"dfg","second":"SCHU 2580/4-1"}
kops.flag.knbibliography	false
kops.location.conference	Suzhou, China
kops.sourcefield.plain	CHRISTODOULOPOULOS, Christos, ed., Tanmoy CHAKRABORTY, ed., Carolyn ROSE, ed., Violet PENG, ed. The 2025 Conference on Empirical Methods in Natural Language Processing - proceedings of the conference, EMNLP 2025. Kerrville, TX: Association for Computational Linguistics, 2025, pp. 8271-8292. ISBN 979-8-89176-332-6. Available under: doi: 10.18653/v1/2025.emnlp-main.417	eng
kops.title.conference	2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)
relation.isAuthorOfPublication	bf0689a7-23f2-460a-8abb-42ea30bb2d29
relation.isAuthorOfPublication.latestForDiscovery	bf0689a7-23f2-460a-8abb-42ea30bb2d29
source.bibliographicInfo.fromPage	8271
source.bibliographicInfo.toPage	8292
source.contributor.editor	Christodoulopoulos, Christos
source.contributor.editor	Chakraborty, Tanmoy
source.contributor.editor	Rose, Carolyn
source.contributor.editor	Peng, Violet
source.identifier.isbn	979-8-89176-332-6
source.publisher	Association for Computational Linguistics
source.publisher.location	Kerrville, TX
source.title	The 2025 Conference on Empirical Methods in Natural Language Processing - proceedings of the conference, EMNLP 2025

Collections

Linguistics: Publications