Dataset:

Auxiliary Datasets for Speaker Disambiguation in Quotebank


Date of first publication

2023

Authors

Čuljak, Marko
West, Robert
Akhil Arora

Other contributors

Repository of first publication

Zenodo

Dataset version

v1
License link

Funding information

Project

Core Facility of the University of Konstanz

Embargoed until

Title in another language

Publication status
Published

Abstract

This data repository contains the auxiliary datasets necessary for enriching and preprocessing Quotebank, a large dataset of unique, speaker-attributed quotations. The scripts utilizing the data can be found in the quotebank-toolkit GitHub repository. The datasets are stored in data.zip and are described below:

quotebank_disambiguation_mapping_quote.parquet
Provides the quoteID -> speakerQID mapping for each quotation, disambiguating ambiguous speaker names and linking them to their respective Wikidata items. The schema of the dataset is as follows:

 |-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
 |-- speaker: Wikidata ID corresponding to the speaker of the quotation
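
The quoteID format given in the schema above can be checked with a few lines of Python. This is only a minimal sketch: the helper names (parse_quote_id, format_quote_id, is_valid_quote_id) are hypothetical and are not part of quotebank-toolkit, and the pattern assumes that "06d" means a counter zero-padded to at least six digits.

```python
import re
from datetime import date
from typing import Tuple

# Assumed reading of the documented format "YYYY-MM-DD-{increasing int:06d}":
# an ISO date followed by a counter that is zero-padded to six or more digits.
QUOTE_ID_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(\d{6,})$")


def parse_quote_id(quote_id: str) -> Tuple[date, int]:
    """Split a quoteID into its date part and its integer counter."""
    match = QUOTE_ID_RE.match(quote_id)
    if match is None:
        raise ValueError(f"not a valid quoteID: {quote_id!r}")
    year, month, day, counter = (int(group) for group in match.groups())
    # date() raises ValueError for impossible dates such as month 13.
    return date(year, month, day), counter


def format_quote_id(day: date, counter: int) -> str:
    """Build a quoteID string from a date and a counter."""
    return f"{day.isoformat()}-{counter:06d}"


def is_valid_quote_id(quote_id: str) -> bool:
    """Return True if the string parses as a quoteID, False otherwise."""
    try:
        parse_quote_id(quote_id)
    except ValueError:
        return False
    return True


# Made-up identifier, used only to exercise the helpers.
example_day, example_counter = parse_quote_id("2019-04-17-000123")
```

Parsing the key up front makes it possible to group quotations by date without loading any other column.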
The mapping is created using heuristics described in the following paper:
Marko Čuljak, Andreas Spitz, Robert West, and Akhil Arora
"Strong Heuristics for Named Entity Linking"
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop
10.18653/v1/2022.naacl-srw.30

self_quotations_filtered.parquet
Contains the identifiers of the quotations identified as not being self-attributed. The schema of the dataset is as follows:
 |-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}") 
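
One plausible use of this identifier list is as a keep-list over Quotebank rows. The sketch below uses made-up in-memory rows instead of the real Parquet files; the function name keep_listed_quotations and the "quotation" field are assumptions for illustration, and the intended use as a keep-list is itself an assumption, not something the description states.

```python
from typing import Dict, Iterable, List


def keep_listed_quotations(
    quotations: Iterable[Dict[str, str]],
    listed_ids: Iterable[str],
) -> List[Dict[str, str]]:
    """Keep only the quotations whose quoteID appears in listed_ids."""
    allowed = set(listed_ids)  # set lookup keeps the filter linear in the input
    return [row for row in quotations if row["quoteID"] in allowed]


# Made-up rows standing in for Quotebank quotations (assumed field names).
quotations = [
    {"quoteID": "2019-04-17-000001", "quotation": "first example quote"},
    {"quoteID": "2019-04-17-000002", "quotation": "second example quote"},
    {"quoteID": "2019-04-17-000003", "quotation": "third example quote"},
]

# Made-up stand-in for the quoteID column of self_quotations_filtered.parquet.
listed_ids = ["2019-04-17-000001", "2019-04-17-000003"]

filtered = keep_listed_quotations(quotations, listed_ids)
```

With the real files, the same filter would typically be expressed through a Parquet-capable library rather than plain lists.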

speaker_attributes.parquet
Contains attributes of all the speakers appearing in Quotebank extracted from Wikidata. The schema of the dataset is as follows:

 |-- id: Wikidata item QID of the speaker, primary key
 |-- aliases: list of speaker's aliases
 |-- date_of_birth: list of possible speaker's dates of birth
 |-- nationality: list of speaker's nationalities
 |-- gender: list of speaker's previous or current genders
 |-- lastrevid: ID of the last revision of the speaker's item
 |-- ethnic_group: list of ethnic groups the speaker belongs to
 |-- US_congress_bio_ID: identifier for the speaker in the Biographical Directory of the United States Congress
 |-- occupation: list of speaker's occupations
 |-- party: list of parties the speaker is/was affiliated to
 |-- academic_degree: list of academic degrees obtained by the speaker
 |-- label: Wikidata label of the speaker
 |-- candidacy: list of the speaker's candidacies in political elections
 |-- type: type of the Wikidata entry (value is item for all the speakers)
 |-- religion: previous/current religious affiliations of the speaker

Using the id field corresponding to the Wikidata QID of a speaker, this dataset can easily be joined with the disambiguated Quotebank obtained by running the cleanup_disambiguate.py script available in the aforementioned quotebank-toolkit repository.
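
The join on the speaker QID might look roughly like the following. This is a sketch over made-up in-memory rows, not the output of cleanup_disambiguate.py: it assumes the disambiguated quotations carry the QID in a column named speaker (as in the mapping schema above), and the QIDs, labels, and the function name join_speaker_attributes are invented for illustration.

```python
from typing import Any, Dict, Iterable, List


def join_speaker_attributes(
    quotations: Iterable[Dict[str, Any]],
    speaker_attributes: Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Left-join quotations with speaker attributes on speaker == id."""
    # id is the primary key of speaker_attributes, so a dict index is safe.
    attributes_by_id = {row["id"]: row for row in speaker_attributes}
    joined = []
    for quotation in quotations:
        merged = dict(quotation)
        # None marks a speaker QID with no matching attribute row.
        merged["attributes"] = attributes_by_id.get(quotation["speaker"])
        joined.append(merged)
    return joined


# Made-up rows; the QIDs and labels below are placeholders, not real Wikidata items.
quotations = [
    {"quoteID": "2019-04-17-000001", "speaker": "Q900000001"},
    {"quoteID": "2019-04-17-000002", "speaker": "Q900000002"},
    {"quoteID": "2019-04-17-000003", "speaker": "Q900000003"},
]
speaker_attributes = [
    {"id": "Q900000001", "label": "Example Speaker A", "occupation": ["Q900000010"]},
    {"id": "Q900000002", "label": "Example Speaker B", "occupation": []},
]

joined = join_speaker_attributes(quotations, speaker_attributes)
```

A left join keeps quotations whose speaker has no attribute row, which makes missing coverage visible instead of silently dropping rows.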

Abstract in another language

Subject (DDC)
004 Computer science

Keywords

Related publications in KOPS

Link to related publication
Link to related dataset

Cite

ISO 690
ČULJAK, Marko, Andreas SPITZ, Robert WEST, Akhil ARORA, 2023. Auxiliary Datasets for Speaker Disambiguation in Quotebank.
RDF
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bibo="http://purl.org/ontology/bibo/"
    xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > 
  <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/73186">
    <dc:creator>Spitz, Andreas</dc:creator>
    <dc:contributor>Spitz, Andreas</dc:contributor>
    <bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/73186"/>
    <dcterms:rights rdf:resource="https://creativecommons.org/licenses/by/4.0/legalcode"/>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/71925"/>
    <dcterms:abstract>This data repository contains the auxiliary datasets necessary for enriching and preprocessing Quotebank, a large dataset of unique, speaker-attributed quotations. The scripts utilizing the data can be found in the quotebank-toolkit GitHub repository. The datasets are stored in &lt;strong&gt;data.zip &lt;/strong&gt;and are described below: &lt;strong&gt;quotebank_disambiguation_mapping_quote.parquet&lt;/strong&gt;&lt;br&gt; Provides the `quoteID`-&amp;gt; `speakerQID` mapping for each quotation, disambiguating ambiguous speaker names and linking them to their respective Wikidata items. The schema of the dataset is as follows: &lt;pre&gt;&lt;code&gt; |-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}") |-- speaker: Wikidata ID corresponding to the speaker of the quotation&lt;/code&gt;&lt;/pre&gt; The mapping is created using heuristics described in the following paper: Marko Čuljak, Andreas Spitz, Robert West, and Akhil Arora&lt;br&gt; "Strong Heuristics for Named Entity Linking"&lt;br&gt; &lt;em&gt;Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop&lt;/em&gt;&lt;br&gt; 10.18653/v1/2022.naacl-srw.30 &lt;strong&gt;self_quotations_filtered.parquet&lt;/strong&gt;&lt;br&gt; Contains the identifiers of the quotations identified as not being self-attributed. The schema of the dataset is as follows: &lt;pre&gt;&lt;code&gt; |-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}") &lt;/code&gt;&lt;/pre&gt; &lt;strong&gt;speaker_attributes.parquet&lt;/strong&gt;&lt;br&gt; Contains attributes of all the speakers appearing in Quotebank extracted from Wikidata. The schema of the dataset is as follows. 
&lt;pre&gt;&lt;code&gt; |-- id: Wikidata item QID of the speaker, primary key |-- aliases: list of speaker's aliases |-- date_of_birth: list of possible speaker's dates of birth |-- nationality: list of speaker's nationalities |-- gender: list of speaker's previous or current genders |-- lastrevid: ID of the last revision of the speaker's item |-- ethnic_group: list of ethnic groups the speaker belongs to |-- US_congress_bio_ID: identifier for the speaker in the Biographical Directory of the United States Congress |-- occupation: list of speaker's occupations |-- party: list of parties the speaker is/was affiliated to |-- academic_degree: list of academic degrees obtained by the speaker |-- label: Wikidata label of the speaker |-- candidacy: list of the speaker's candidacies in political elections |-- type: type of the Wikidata entry (value is `item` for all the speakers) |-- religion: previous/current religious affiliations of the speaker &lt;/code&gt;&lt;/pre&gt; Using the `id` field corresponding to the Wikidata QID of a speaker, this dataset can be easily joined with disambiguated Quotebank obtained by running the `cleanup_disambiguate.py` script available in the aforementioned quotebank-toolkit repository.</dcterms:abstract>
    <dcterms:issued>2023</dcterms:issued>
    <dc:creator>Čuljak, Marko</dc:creator>
    <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2023-06-18T16:41:13Z</dcterms:created>
    <dc:contributor>West, Robert</dc:contributor>
    <dc:rights>Creative Commons Attribution 4.0 International</dc:rights>
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/71925"/>
    <foaf:homepage rdf:resource="http://localhost:8080/"/>
    <dc:creator>West, Robert</dc:creator>
    <dc:contributor>Čuljak, Marko</dc:contributor>
    <dc:creator>Akhil Arora</dc:creator>
    <dc:language>eng</dc:language>
    <dcterms:title>Auxiliary Datasets for Speaker Disambiguation in Quotebank</dcterms:title>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-30T09:31:06Z</dc:date>
    <dc:contributor>Akhil Arora</dc:contributor>
    <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
    <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-30T09:31:06Z</dcterms:available>
  </rdf:Description>
</rdf:RDF>
URL (link to the data)

URL check date

Comment on the publication

University bibliography
Yes