Publication:

Quotebank : A Corpus of Quotations from a Decade of News

Date

2021

Authors

Vaucher, Timoté
Spitz, Andreas
Catasta, Michele
West, Robert

Publication type
Contribution to conference proceedings
Publication status
Published

Published in

LEWIN-EYTAN, Liane, ed., David CARMEL, ed., Elad YOM-TOV, ed. and others. WSDM '21 : Proceedings of the 14th ACM International Conference on Web Search and Data Mining. New York, NY: ACM, 2021, pp. 328-336. ISBN 978-1-4503-8297-7. Available under: doi: 10.1145/3437963.3441760

Abstract

We present Quotebank, an open corpus of 178 million quotations attributed to the speakers who uttered them, extracted from 162 million English news articles published between 2008 and 2020. In order to produce this Web-scale corpus, while at the same time benefiting from the performance of modern neural models, we introduce Quobert, a minimally supervised framework for extracting and attributing quotations from massive corpora. Quobert avoids the necessity of manually labeled input and instead exploits the redundancy of the corpus by bootstrapping from a single seed pattern to extract training data for fine-tuning a BERT-based model. Quobert is language- and corpus-agnostic and correctly attributes 86.9% of quotations in our experiments. Quotebank and Quobert are publicly available at https://doi.org/10.5281/zenodo.4277311.
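
The bootstrapping idea in the abstract can be illustrated with a toy sketch. This is not the Quobert implementation: the seed regular expression, the function name `bootstrap_training_data`, and the exact-substring matching are all assumptions made for illustration. The sketch only shows how one high-precision pattern plus repeated quotations across articles could yield labeled examples without manual annotation.

```python
import re

# Hypothetical seed pattern (an assumption, not the one used by Quobert):
# "<quotation>," said <Capitalized Full Name>
SEED = re.compile(
    r'"(?P<quote>[^"]+?),?" said (?P<speaker>[A-Z][a-z]+(?: [A-Z][a-z]+)+)'
)


def bootstrap_training_data(sentences):
    """Return labeled (context, quote, speaker) examples found via redundancy."""
    # Step 1: collect quotation -> speaker pairs from sentences that match
    # the seed pattern.
    speaker_of = {}
    for sentence in sentences:
        match = SEED.search(sentence)
        if match:
            speaker_of[match.group("quote")] = match.group("speaker")

    # Step 2: any other sentence that repeats a known quotation in a
    # different context becomes a labeled example.
    examples = []
    for sentence in sentences:
        if SEED.search(sentence):
            continue  # already covered by the seed pattern
        for quote, speaker in speaker_of.items():
            if quote in sentence:
                examples.append(
                    {"context": sentence, "quote": quote, "speaker": speaker}
                )
    return examples


sentences = [
    '"We will rebuild the bridge," said Jane Doe.',
    'Jane Doe told reporters on Monday: "We will rebuild the bridge."',
    "The weather was mild.",
]
examples = bootstrap_training_data(sentences)
```

In this sketch, the second sentence does not match the seed pattern but repeats a quotation the pattern already attributed, so it is labeled with the same speaker. According to the abstract, examples extracted this way serve as training data for fine-tuning a BERT-based model; that step is omitted here.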

Subject area (DDC)
004 Computer science

Keywords

Computing methodologies; Machine learning; Learning paradigms; Supervised learning; Information systems; World Wide Web; Web mining

Conference

WSDM '21 : The Fourteenth ACM International Conference on Web Search and Data Mining (Virtual Event), 8 March 2021 - 12 March 2021

Cite

ISO 690
VAUCHER, Timoté, Andreas SPITZ, Michele CATASTA, Robert WEST, 2021. Quotebank : A Corpus of Quotations from a Decade of News. WSDM '21 : The Fourteenth ACM International Conference on Web Search and Data Mining (Virtual Event), 8 March 2021 - 12 March 2021. In: LEWIN-EYTAN, Liane, ed., David CARMEL, ed., Elad YOM-TOV, ed. and others. WSDM '21 : Proceedings of the 14th ACM International Conference on Web Search and Data Mining. New York, NY: ACM, 2021, pp. 328-336. ISBN 978-1-4503-8297-7. Available under: doi: 10.1145/3437963.3441760
BibTeX
@inproceedings{Vaucher2021Quote-53926,
  year={2021},
  doi={10.1145/3437963.3441760},
  title={Quotebank : A Corpus of Quotations from a Decade of News},
  isbn={978-1-4503-8297-7},
  publisher={ACM},
  address={New York, NY},
  booktitle={WSDM '21 : Proceedings of the 14th ACM International Conference on Web Search and Data Mining},
  pages={328--336},
  editor={Lewin-Eytan, Liane and Carmel, David and Yom-Tov, Elad},
  author={Vaucher, Timoté and Spitz, Andreas and Catasta, Michele and West, Robert}
}
RDF
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bibo="http://purl.org/ontology/bibo/"
    xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > 
  <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/53926">
    <dcterms:title>Quotebank : A Corpus of Quotations from a Decade of News</dcterms:title>
    <dc:creator>Catasta, Michele</dc:creator>
    <dcterms:issued>2021</dcterms:issued>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2021-06-09T11:00:10Z</dc:date>
    <dc:contributor>West, Robert</dc:contributor>
    <dc:creator>West, Robert</dc:creator>
    <dcterms:abstract xml:lang="eng">We present Quotebank, an open corpus of 178 million quotations attributed to the speakers who uttered them, extracted from 162 million English news articles published between 2008 and 2020. In order to produce this Web-scale corpus, while at the same time benefiting from the performance of modern neural models, we introduce Quobert, a minimally supervised framework for extracting and attributing quotations from massive corpora. Quobert avoids the necessity of manually labeled input and instead exploits the redundancy of the corpus by bootstrapping from a single seed pattern to extract training data for fine-tuning a BERT-based model. Quobert is language- and corpus-agnostic and correctly attributes 86.9% of quotations in our experiments. Quotebank and Quobert are publicly available at https://doi.org/10.5281/zenodo.4277311.</dcterms:abstract>
    <dc:creator>Spitz, Andreas</dc:creator>
    <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2021-06-09T11:00:10Z</dcterms:available>
    <dc:rights>terms-of-use</dc:rights>
    <dc:contributor>Vaucher, Timoté</dc:contributor>
    <dc:language>eng</dc:language>
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dc:contributor>Catasta, Michele</dc:contributor>
    <dc:creator>Vaucher, Timoté</dc:creator>
    <dcterms:rights rdf:resource="https://rightsstatements.org/page/InC/1.0/"/>
    <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
    <foaf:homepage rdf:resource="http://localhost:8080/"/>
    <dc:contributor>Spitz, Andreas</dc:contributor>
    <bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/53926"/>
  </rdf:Description>
</rdf:RDF>

University bibliography
Yes