Dataset:

Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (Dataset, Models & Code)

Date of first publication

August 1, 2020

Authors

Ostendorff, Malte
Ruas, Terry
Rehm, Georg

Repository of first publication

Zenodo

Funding information

Institutions of the Federal Republic of Germany: 03WKDA1A

Publication status
Published

Abstract

Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish which relationship makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivate the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.

Additional information can be found on GitHub. The following data is supplemental to the experiments described in our research paper. The data consists of datasets (articles, class labels, cross-validation splits), pretrained models (Transformers, GloVe, Doc2vec), and model output (predictions) for the best performing models.

Subject (DDC)
004 Computer science

Keywords

Wikidata, Wikipedia, Document Similarity, Semantic Relations, Recommender System

Cite

ISO 690
OSTENDORFF, Malte, Terry RUAS, Moritz SCHUBOTZ, Georg REHM, Bela GIPP, 2020. Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (Dataset, Models & Code)
RDF
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bibo="http://purl.org/ontology/bibo/"
    xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > 
  <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/74421">
    <dc:creator>Rehm, Georg</dc:creator>
    <dc:contributor>Ostendorff, Malte</dc:contributor>
    <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-08-27T13:02:59Z</dcterms:available>
    <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
    <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2020-03-18T08:16:35Z</dcterms:created>
    <dcterms:abstract>Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish which relationship makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivate the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another. Additional information can be found on GitHub. The following data is supplemental to the experiments described in our research paper. The data consists of datasets (articles, class labels, cross-validation splits), pretrained models (Transformers, GloVe, Doc2vec), and model output (predictions) for the best performing models.</dcterms:abstract>
    <dc:creator>Schubotz, Moritz</dc:creator>
    <dcterms:issued>2020-08-01</dcterms:issued>
    <dc:language>eng</dc:language>
    <dc:creator>Ruas, Terry</dc:creator>
    <dc:contributor>Rehm, Georg</dc:contributor>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-08-27T13:02:59Z</dc:date>
    <dc:creator>Ostendorff, Malte</dc:creator>
    <dc:contributor>Gipp, Bela</dc:contributor>
    <dcterms:relation>https://github.com/malteos/semantic-document-relations/</dcterms:relation>
    <dcterms:hasPart>https://github.com/malteos/semantic-document-relations/</dcterms:hasPart>
    <dc:contributor>Ruas, Terry</dc:contributor>
    <dc:creator>Gipp, Bela</dc:creator>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/71925"/>
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/71925"/>
    <foaf:homepage rdf:resource="http://localhost:8080/"/>
    <dcterms:rights rdf:resource="https://creativecommons.org/licenses/by/4.0/legalcode"/>
    <dcterms:title>Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (Dataset, Models &amp; Code)</dcterms:title>
    <dcterms:isReferencedBy>10.48550/arXiv.2003.09881</dcterms:isReferencedBy>
    <bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/74421"/>
    <dc:rights>Creative Commons Attribution 4.0 International</dc:rights>
    <dc:contributor>Schubotz, Moritz</dc:contributor>
  </rdf:Description>
</rdf:RDF>
University bibliography
No