Auxiliary Datasets for Speaker Disambiguation in Quotebank

creativework.version: v1
dc.contributor.author: Čuljak, Marko
dc.contributor.author: Spitz, Andreas
dc.contributor.author: West, Robert
dc.contributor.author: Arora, Akhil
dc.date.accessioned: 2025-04-30T09:31:06Z
dc.date.available: 2025-04-30T09:31:06Z
dc.date.created: 2023-06-18T16:41:13Z
dc.date.issued: 2023
dc.description.abstract: This data repository contains the auxiliary datasets necessary for enriching and preprocessing Quotebank, a large dataset of unique, speaker-attributed quotations. The scripts utilizing the data can be found in the quotebank-toolkit GitHub repository. The datasets are stored in data.zip and are described below:
quotebank_disambiguation_mapping_quote.parquet
Provides the `quoteID` -> `speakerQID` mapping for each quotation, disambiguating ambiguous speaker names and linking them to their respective Wikidata items. The schema of the dataset is as follows:
 |-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}")
 |-- speaker: Wikidata ID corresponding to the speaker of the quotation
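The `quoteID` format given in the schema can be illustrated with a small parser. This is a minimal sketch, not code from quotebank-toolkit: the helper names `parse_quote_id` and `make_quote_id` and the sample identifier are hypothetical, and the pattern assumes the counter is zero-padded to at least six digits, as the `:06d` format suggests.

```python
import re
from datetime import date

# Pattern for the documented format "YYYY-MM-DD-{increasing int:06d}".
# \d{6,} tolerates counters wider than six digits, since :06d only pads.
QUOTE_ID_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})-(\d{6,})")


def parse_quote_id(quote_id: str) -> tuple[date, int]:
    """Split a quoteID into its date prefix and its integer counter."""
    match = QUOTE_ID_RE.fullmatch(quote_id)
    if match is None:
        raise ValueError(f"not a valid quoteID: {quote_id!r}")
    year, month, day, counter = (int(part) for part in match.groups())
    return date(year, month, day), counter


def make_quote_id(day: date, counter: int) -> str:
    """Build a quoteID string from a date and a counter."""
    return f"{day.isoformat()}-{counter:06d}"
```

For example, `parse_quote_id("2020-01-15-000042")` yields the date 2020-01-15 and the counter 42, and `make_quote_id` reverses that step.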
The mapping is created using heuristics described in the following paper:
Marko Čuljak, Andreas Spitz, Robert West, and Akhil Arora
"Strong Heuristics for Named Entity Linking"
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop
10.18653/v1/2022.naacl-srw.30
self_quotations_filtered.parquet
Contains the identifiers of the quotations identified as not being self-attributed. The schema of the dataset is as follows:
 |-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}") 
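One plausible use of this identifier list is to keep only the listed quotations and drop the rest. The sketch below assumes that intent and uses pandas with small in-memory stand-ins; the rows, the `quotation` column, and the variable names are made up for illustration, and the actual logic in quotebank-toolkit may differ.

```python
import pandas as pd

# Hypothetical stand-in for a slice of Quotebank; real rows carry more columns.
quotes = pd.DataFrame({
    "quoteID": ["2020-01-15-000001", "2020-01-15-000002", "2020-01-16-000001"],
    "quotation": ["first quote", "second quote", "third quote"],
})

# Hypothetical stand-in for self_quotations_filtered.parquet, which could be
# loaded with pd.read_parquet("self_quotations_filtered.parquet") instead.
not_self_attributed = pd.DataFrame({
    "quoteID": ["2020-01-15-000001", "2020-01-16-000001"],
})

# Semi-join: keep only the quotations whose quoteID appears in the list.
keep = quotes["quoteID"].isin(not_self_attributed["quoteID"])
filtered = quotes[keep].reset_index(drop=True)
```

Using `isin` rather than a merge avoids duplicating rows if an identifier were ever listed more than once.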
speaker_attributes.parquet
Contains attributes of all the speakers appearing in Quotebank, extracted from Wikidata. The schema of the dataset is as follows:
 |-- id: Wikidata item QID of the speaker, primary key
 |-- aliases: list of the speaker's aliases
 |-- date_of_birth: list of the speaker's possible dates of birth
 |-- nationality: list of the speaker's nationalities
 |-- gender: list of the speaker's previous or current genders
 |-- lastrevid: ID of the last revision of the speaker's item
 |-- ethnic_group: list of ethnic groups the speaker belongs to
 |-- US_congress_bio_ID: identifier for the speaker in the Biographical Directory of the United States Congress
 |-- occupation: list of the speaker's occupations
 |-- party: list of parties the speaker is/was affiliated with
 |-- academic_degree: list of academic degrees obtained by the speaker
 |-- label: Wikidata label of the speaker
 |-- candidacy: list of the speaker's candidacies in political elections
 |-- type: type of the Wikidata entry (value is `item` for all the speakers)
 |-- religion: list of the speaker's previous or current religious affiliations
Using the `id` field, which holds the Wikidata QID of a speaker, this dataset can be joined with the disambiguated Quotebank obtained by running the `cleanup_disambiguate.py` script available in the aforementioned quotebank-toolkit repository.
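The join on the speaker QID could look roughly like the following pandas sketch. It rests on assumptions: the disambiguated quotations are taken to store the QID in a column named `speaker` (as in the mapping schema above), which may not match the output of `cleanup_disambiguate.py`, and all rows, QIDs, and attribute values are placeholders rather than real data.

```python
import pandas as pd

# Hypothetical disambiguated quotations; the `speaker` column name is assumed.
quotes = pd.DataFrame({
    "quoteID": ["2020-01-15-000001", "2020-01-15-000002", "2020-01-16-000001"],
    "speaker": ["Q000001", "Q000002", "Q000001"],
})

# Hypothetical stand-in for speaker_attributes.parquet, which could be loaded
# with pd.read_parquet("speaker_attributes.parquet") instead.
attributes = pd.DataFrame({
    "id": ["Q000001", "Q000002"],
    "label": ["Speaker One", "Speaker Two"],
    "occupation": [["placeholder A"], ["placeholder B", "placeholder C"]],
})

# Left join keeps every quotation; validate guards against duplicate QIDs
# in the attribute table, which `id` being a primary key should rule out.
enriched = quotes.merge(
    attributes,
    left_on="speaker",
    right_on="id",
    how="left",
    validate="many_to_one",
)
```

A left join is chosen here so that quotations whose speaker has no attribute row are retained with missing values instead of being silently dropped.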
dc.description.version: published [deu]
dc.identifier.doi: 10.5281/zenodo.8033672
dc.identifier.uri: https://kops.uni-konstanz.de/handle/123456789/73186
dc.language.iso: eng
dc.rights: Creative Commons Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/legalcode
dc.subject.ddc: 004
dc.title: Auxiliary Datasets for Speaker Disambiguation in Quotebank [eng]
dspace.entity.type: Dataset
kops.citation.bibtex
kops.citation.iso690: ČULJAK, Marko, Andreas SPITZ, Robert WEST, Akhil ARORA, 2023. Auxiliary Datasets for Speaker Disambiguation in Quotebank [deu]
kops.citation.iso690: ČULJAK, Marko, Andreas SPITZ, Robert WEST, Akhil ARORA, 2023. Auxiliary Datasets for Speaker Disambiguation in Quotebank [eng]
kops.citation.rdf
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bibo="http://purl.org/ontology/bibo/"
    xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > 
  <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/73186">
    <dc:creator>Spitz, Andreas</dc:creator>
    <dc:contributor>Spitz, Andreas</dc:contributor>
    <bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/73186"/>
    <dcterms:rights rdf:resource="https://creativecommons.org/licenses/by/4.0/legalcode"/>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/71925"/>
    <dcterms:abstract>This data repository contains the auxiliary datasets necessary for enriching and preprocessing Quotebank, a large dataset of unique, speaker-attributed quotations. The scripts utilizing the data can be found in the quotebank-toolkit GitHub repository. The datasets are stored in &lt;strong&gt;data.zip &lt;/strong&gt;and are described below: &lt;strong&gt;quotebank_disambiguation_mapping_quote.parquet&lt;/strong&gt;&lt;br&gt; Provides the `quoteID`-&amp;gt; `speakerQID` mapping for each quotation, disambiguating ambiguous speaker names and linking them to their respective Wikidata items. The schema of the dataset is as follows: &lt;pre&gt;&lt;code&gt; |-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}") |-- speaker: Wikidata ID corresponding to the speaker of the quotation&lt;/code&gt;&lt;/pre&gt; The mapping is created using heuristics described in the following paper: Marko Čuljak, Andreas Spitz, Robert West, and Akhil Arora&lt;br&gt; "Strong Heuristics for Named Entity Linking"&lt;br&gt; &lt;em&gt;Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop&lt;/em&gt;&lt;br&gt; 10.18653/v1/2022.naacl-srw.30 &lt;strong&gt;self_quotations_filtered.parquet&lt;/strong&gt;&lt;br&gt; Contains the identifiers of the quotations identified as not being self-attributed. The schema of the dataset is as follows: &lt;pre&gt;&lt;code&gt; |-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}") &lt;/code&gt;&lt;/pre&gt; &lt;strong&gt;speaker_attributes.parquet&lt;/strong&gt;&lt;br&gt; Contains attributes of all the speakers appearing in Quotebank extracted from Wikidata. The schema of the dataset is as follows. 
&lt;pre&gt;&lt;code&gt; |-- id: Wikidata item QID of the speaker, primary key |-- aliases: list of speaker's aliases |-- date_of_birth: list of possible speaker's dates of birth |-- nationality: list of speaker's nationalities |-- gender: list of speaker's previous or current genders |-- lastrevid: ID of the last revision of the speaker's item |-- ethnic_group: list of ethnic groups the speaker belongs to |-- US_congress_bio_ID: identifier for the speaker in the Biographical Directory of the United States Congress |-- occupation: list of speaker's occupations |-- party: list of parties the speaker is/was affiliated to |-- academic_degree: list of academic degrees obtained by the speaker |-- label: Wikidata label of the speaker |-- candidacy: list of the speaker's candidacies in political elections |-- type: type of the Wikidata entry (value is `item` for all the speakers) |-- religion: previous/current religious affiliations of the speaker &lt;/code&gt;&lt;/pre&gt; Using the `id` field corresponding to the Wikidata QID of a speaker, this dataset can be easily joined with disambiguated Quotebank obtained by running the `cleanup_disambiguate.py` script available in the aforementioned quotebank-toolkit repository.</dcterms:abstract>
    <dcterms:issued>2023</dcterms:issued>
    <dc:creator>Čuljak, Marko</dc:creator>
    <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2023-06-18T16:41:13Z</dcterms:created>
    <dc:contributor>West, Robert</dc:contributor>
    <dc:rights>Creative Commons Attribution 4.0 International</dc:rights>
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/71925"/>
    <foaf:homepage rdf:resource="http://localhost:8080/"/>
    <dc:creator>West, Robert</dc:creator>
    <dc:contributor>Čuljak, Marko</dc:contributor>
    <dc:creator>Akhil Arora</dc:creator>
    <dc:language>eng</dc:language>
    <dcterms:title>Auxiliary Datasets for Speaker Disambiguation in Quotebank</dcterms:title>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-30T09:31:06Z</dc:date>
    <dc:contributor>Akhil Arora</dc:contributor>
    <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
    <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-30T09:31:06Z</dcterms:available>
  </rdf:Description>
</rdf:RDF>
kops.datacite.repository: Zenodo
kops.flag.knbibliography: true
relation.isAuthorOfDataset: 4cf0b980-487c-486b-8e3c-e2890ea465b9
relation.isAuthorOfDataset.latestForDiscovery: 4cf0b980-487c-486b-8e3c-e2890ea465b9
