Dataset: Auxiliary Datasets for Speaker Disambiguation in Quotebank
Date of first publication
2023
Authors
Other contributors
Repository of first publication
Zenodo
Dataset version
v1
DOI (link to the data)
Link to the license
Funding information
Project
Core Facility of the University of Konstanz
Title in another language
Publication status
Published
Abstract
This data repository contains the auxiliary datasets necessary for enriching and preprocessing Quotebank, a large dataset of unique, speaker-attributed quotations. The scripts utilizing the data can be found in the quotebank-toolkit GitHub repository. The datasets are stored in data.zip and are described below:
quotebank_disambiguation_mapping_quote.parquet
Provides the quoteID -> speakerQID mapping for each quotation, disambiguating ambiguous speaker names and linking them to their respective Wikidata items. The schema of the dataset is as follows:
|-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
|-- speaker: Wikidata ID corresponding to the speaker of the quotation
The mapping is created using heuristics described in the following paper:
Marko Čuljak, Andreas Spitz, Robert West, and Akhil Arora
"Strong Heuristics for Named Entity Linking"
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop
10.18653/v1/2022.naacl-srw.30
self_quotations_filtered.parquet
Contains the identifiers of the quotations identified as not being self-attributed. The schema of the dataset is as follows:
|-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
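The quoteID format given in the schemas above can be checked and split into its date and counter parts. The following is a minimal sketch, assuming that the counter has at least six zero-padded digits, as the `06d` format specifier suggests. The function name `parse_quote_id` is hypothetical and not part of quotebank-toolkit.

```python
import re
from datetime import date
from typing import Tuple

# Assumed pattern derived from the documented format
# "YYYY-MM-DD-{increasing int:06d}". The "06d" specifier pads to a minimum
# width of six digits, so longer counters are accepted as well.
QUOTE_ID_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})-(\d{6,})")


def parse_quote_id(quote_id: str) -> Tuple[date, int]:
    """Split a quoteID into its calendar date and running counter.

    Raises ValueError if the string does not match the assumed format
    or does not encode a valid calendar date.
    """
    match = QUOTE_ID_RE.fullmatch(quote_id)
    if match is None:
        raise ValueError(f"unexpected quoteID format: {quote_id!r}")
    year, month, day, counter = (int(part) for part in match.groups())
    # date() itself raises ValueError for impossible dates such as month 13.
    return date(year, month, day), counter
```

For a made-up identifier such as "2019-03-07-000123", `parse_quote_id` would return the date 2019-03-07 and the counter 123.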
speaker_attributes.parquet
Contains attributes of all the speakers appearing in Quotebank, extracted from Wikidata. The schema of the dataset is as follows:
|-- id: Wikidata item QID of the speaker, primary key
|-- aliases: list of the speaker's aliases
|-- date_of_birth: list of the speaker's possible dates of birth
|-- nationality: list of the speaker's nationalities
|-- gender: list of the speaker's previous or current genders
|-- lastrevid: ID of the last revision of the speaker's item
|-- ethnic_group: list of ethnic groups the speaker belongs to
|-- US_congress_bio_ID: identifier for the speaker in the Biographical Directory of the United States Congress
|-- occupation: list of the speaker's occupations
|-- party: list of parties the speaker is or was affiliated with
|-- academic_degree: list of academic degrees obtained by the speaker
|-- label: Wikidata label of the speaker
|-- candidacy: list of the speaker's candidacies in political elections
|-- type: type of the Wikidata entry (value is item for all the speakers)
|-- religion: previous or current religious affiliations of the speaker
Using the id field, which corresponds to the Wikidata QID of a speaker, this dataset can be easily joined with the disambiguated Quotebank obtained by running the cleanup_disambiguate.py script available in the aforementioned quotebank-toolkit repository.
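The join described above could look like the following minimal sketch with pandas. The toy rows, the placeholder QIDs and attribute values, and the assumption that the disambiguated Quotebank stores the speaker QID in a column named `speaker` are illustrative only. In practice the frames would be loaded from the Parquet files, for example with `pandas.read_parquet`.

```python
import pandas as pd

# Toy stand-in for the disambiguated Quotebank. Real data would come from the
# output of cleanup_disambiguate.py; the column name "speaker" is an assumption.
quotes = pd.DataFrame(
    {
        "quoteID": ["2019-03-07-000123", "2019-03-08-000045"],
        "speaker": ["Q_PLACEHOLDER_1", "Q_PLACEHOLDER_2"],
    }
)

# Toy stand-in for speaker_attributes.parquet with a subset of its columns.
# All values are made up for illustration.
speaker_attributes = pd.DataFrame(
    {
        "id": ["Q_PLACEHOLDER_1", "Q_PLACEHOLDER_2"],
        "label": ["Speaker One", "Speaker Two"],
        "occupation": [["occupation A"], ["occupation B", "occupation C"]],
    }
)

# A left join keeps every quotation, even if no attributes are found for
# its speaker. The redundant "id" key column is dropped afterwards.
enriched = quotes.merge(
    speaker_attributes, how="left", left_on="speaker", right_on="id"
).drop(columns="id")
```

A left join is used here so that quotations whose speaker has no attribute row are kept with missing values rather than silently dropped; an inner join would be the alternative if only fully attributed quotations are wanted.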
Abstract in another language
Subject (DDC)
004 Computer science
Keywords
Cite
ISO 690
ČULJAK, Marko, Andreas SPITZ, Robert WEST, Akhil ARORA, 2023. Auxiliary Datasets for Speaker Disambiguation in Quotebank
RDF
<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:void="http://rdfs.org/ns/void#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/73186"> <dc:creator>Spitz, Andreas</dc:creator> <dc:contributor>Spitz, Andreas</dc:contributor> <bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/73186"/> <dcterms:rights rdf:resource="https://creativecommons.org/licenses/by/4.0/legalcode"/> <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/71925"/> <dcterms:abstract>This data repository contains the auxiliary datasets necessary for enriching and preprocessing Quotebank, a large dataset of unique, speaker-attributed quotations. The scripts utilizing the data can be found in the quotebank-toolkit GitHub repository. The datasets are stored in <strong>data.zip </strong>and are described below: <strong>quotebank_disambiguation_mapping_quote.parquet</strong><br> Provides the `quoteID`-&gt; `speakerQID` mapping for each quotation, disambiguating ambiguous speaker names and linking them to their respective Wikidata items. 
The schema of the dataset is as follows: <pre><code> |-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}") |-- speaker: Wikidata ID corresponding to the speaker of the quotation</code></pre> The mapping is created using heuristics described in the following paper: Marko Čuljak, Andreas Spitz, Robert West, and Akhil Arora<br> "Strong Heuristics for Named Entity Linking"<br> <em>Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop</em><br> 10.18653/v1/2022.naacl-srw.30 <strong>self_quotations_filtered.parquet</strong><br> Contains the identifiers of the quotations identified as not being self-attributed. The schema of the dataset is as follows: <pre><code> |-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}") </code></pre> <strong>speaker_attributes.parquet</strong><br> Contains attributes of all the speakers appearing in Quotebank extracted from Wikidata. The schema of the dataset is as follows. 
<pre><code> |-- id: Wikidata item QID of the speaker, primary key |-- aliases: list of speaker's aliases |-- date_of_birth: list of possible speaker's dates of birth |-- nationality: list of speaker's nationalities |-- gender: list of speaker's previous or current genders |-- lastrevid: ID of the last revision of the speaker's item |-- ethnic_group: list of ethnic groups the speaker belongs to |-- US_congress_bio_ID: identifier for the speaker in the Biographical Directory of the United States Congress |-- occupation: list of speaker's occupations |-- party: list of parties the speaker is/was affiliated to |-- academic_degree: list of academic degrees obtained by the speaker |-- label: Wikidata label of the speaker |-- candidacy: list of the speaker's candidacies in political elections |-- type: type of the Wikidata entry (value is `item` for all the speakers) |-- religion: previous/current religious affiliations of the speaker </code></pre> Using the `id` field corresponding to the Wikidata QID of a speaker, this dataset can be easily joined with disambiguated Quotebank obtained by running the `cleanup_disambiguate.py` script available in the aforementioned quotebank-toolkit repository.</dcterms:abstract> <dcterms:issued>2023</dcterms:issued> <dc:creator>Čuljak, Marko</dc:creator> <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2023-06-18T16:41:13Z</dcterms:created> <dc:contributor>West, Robert</dc:contributor> <dc:rights>Creative Commons Attribution 4.0 International</dc:rights> <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/71925"/> <foaf:homepage rdf:resource="http://localhost:8080/"/> <dc:creator>West, Robert</dc:creator> <dc:contributor>Čuljak, Marko</dc:contributor> <dc:creator>Akhil Arora</dc:creator> <dc:language>eng</dc:language> <dcterms:title>Auxiliary Datasets for Speaker Disambiguation in Quotebank</dcterms:title> <dc:date 
rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-30T09:31:06Z</dc:date> <dc:contributor>Akhil Arora</dc:contributor> <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/> <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-30T09:31:06Z</dcterms:available> </rdf:Description> </rdf:RDF>
Comment on the publication
University bibliography
Yes