Publikation:

Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language

Lade...
Vorschaubild

Dateien

Zu diesem Dokument gibt es keine Dateien.

Datum

2020

Herausgeber:innen

Kontakt

ISSN der Zeitschrift

Electronic ISSN

ISBN

Bibliografische Daten

Verlag

Schriftenreihe

Auflagebezeichnung

URI (zitierfähiger Link)

Internationale Patentnummer

Angaben zur Forschungsförderung

Projekt

Open Access-Veröffentlichung
Core Facility der Universität Konstanz

Gesperrt bis

Titel in einer weiteren Sprache

Publikationstyp
Beitrag zu einem Konferenzband
Publikationsstatus
Published

Erschienen in

HUANG, Ruhua, ed. and others. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL '20). New York: ACM, 2020, pp. 137-146. ISBN 978-1-4503-7585-6. Available under: doi: 10.1145/3383583.3398529

Zusammenfassung

In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labeled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings of text and formulae and evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies up to 82.8% and cluster purities up to 69.4% (number of clusters equals number of classes), and 99.9% (unspecified number of clusters) respectively. We observe a relatively low correlation between text and math similarity, which indicates the independence of text and formulae and motivates treating them as separate features of a document. The classification and clustering can be employed, e.g., for document search and recommendation. Furthermore, we show that the computer outperforms a human expert when classifying documents. Finally, we evaluate and discuss multi-label classification and formula semantification.

Zusammenfassung in einer weiteren Sprache

Fachgebiet (DDC)
004 Informatik

Schlagwörter

Konferenz

JCDL '20, 1. Aug. 2020 - 5. Aug. 2020, China (Virtual Event)
Rezension
undefined / . - undefined, undefined

Forschungsvorhaben

Organisationseinheiten

Zeitschriftenheft

Zugehörige Datensätze in KOPS

Zitieren

ISO 690SCHARPF, Philipp, Moritz SCHUBOTZ, Abdou YOUSSEF, Felix HAMBORG, Norman MEUSCHKE, Bela GIPP, 2020. Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language. JCDL '20. China (Virtual Event), 1. Aug. 2020 - 5. Aug. 2020. In: HUANG, Ruhua, ed. and others. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL '20). New York: ACM, 2020, pp. 137-146. ISBN 978-1-4503-7585-6. Available under: doi: 10.1145/3383583.3398529
BibTex
@inproceedings{Scharpf2020-05-22T06:16:32ZClass-51925,
  year={2020},
  doi={10.1145/3383583.3398529},
  title={Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language},
  isbn={978-1-4503-7585-6},
  publisher={ACM},
  address={New York},
  booktitle={Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL '20)},
  pages={137--146},
  editor={Huang, Ruhua},
  author={Scharpf, Philipp and Schubotz, Moritz and Youssef, Abdou and Hamborg, Felix and Meuschke, Norman and Gipp, Bela}
}
RDF
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bibo="http://purl.org/ontology/bibo/"
    xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > 
  <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/51925">
    <dc:contributor>Gipp, Bela</dc:contributor>
    <dcterms:abstract xml:lang="eng">In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labeled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings of text and formulae and evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies up to 82.8% and cluster purities up to 69.4% (number of clusters equals number of classes), and 99.9% (unspecified number of clusters) respectively. We observe a relatively low correlation between text and math similarity, which indicates the independence of text and formulae and motivates treating them as separate features of a document. The classification and clustering can be employed, e.g., for document search and recommendation. Furthermore, we show that the computer outperforms a human expert when classifying documents. Finally, we evaluate and discuss multi-label classification and formula semantification.</dcterms:abstract>
    <dc:creator>Schubotz, Moritz</dc:creator>
    <foaf:homepage rdf:resource="http://localhost:8080/"/>
    <dc:rights>terms-of-use</dc:rights>
    <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
    <dc:contributor>Meuschke, Norman</dc:contributor>
    <dc:language>eng</dc:language>
    <dc:contributor>Youssef, Abdou</dc:contributor>
    <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2020-11-25T14:29:09Z</dcterms:available>
    <dc:creator>Gipp, Bela</dc:creator>
    <dc:creator>Scharpf, Philipp</dc:creator>
    <dc:contributor>Scharpf, Philipp</dc:contributor>
    <dcterms:issued>2020-05-22T06:16:32Z</dcterms:issued>
    <dc:creator>Hamborg, Felix</dc:creator>
    <dc:creator>Youssef, Abdou</dc:creator>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dc:contributor>Hamborg, Felix</dc:contributor>
    <dc:creator>Meuschke, Norman</dc:creator>
    <bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/51925"/>
    <dcterms:title>Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language</dcterms:title>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2020-11-25T14:29:09Z</dc:date>
    <dcterms:rights rdf:resource="https://rightsstatements.org/page/InC/1.0/"/>
    <dc:contributor>Schubotz, Moritz</dc:contributor>
  </rdf:Description>
</rdf:RDF>

Interner Vermerk

xmlui.Submission.submit.DescribeStep.inputForms.label.kops_note_fromSubmitter

Kontakt
URL der Originalveröffentl.

Prüfdatum der URL

Prüfungsdatum der Dissertation

Finanzierungsart

Kommentar zur Publikation

Allianzlizenz
Corresponding Authors der Uni Konstanz vorhanden
Internationale Co-Autor:innen
Universitätsbibliographie
Nein
Begutachtet
Diese Publikation teilen