Auxiliary Datasets for Speaker Disambiguation in Quotebank
| creativework.version | v1 | |
| dc.contributor.author | Čuljak, Marko | |
| dc.contributor.author | Spitz, Andreas | |
| dc.contributor.author | West, Robert | |
| dc.contributor.author | Arora, Akhil | |
| dc.date.accessioned | 2025-04-30T09:31:06Z | |
| dc.date.available | 2025-04-30T09:31:06Z | |
| dc.date.created | 2023-06-18T16:41:13Z | |
| dc.date.issued | 2023 | |
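| dc.description.abstract | This data repository contains the auxiliary datasets necessary for enriching and preprocessing Quotebank, a large dataset of unique, speaker-attributed quotations. The scripts utilizing the data can be found in the quotebank-toolkit GitHub repository. The datasets are stored in data.zip and are described below. quotebank_disambiguation_mapping_quote.parquet: Provides the `quoteID` -> `speakerQID` mapping for each quotation, disambiguating ambiguous speaker names and linking them to their respective Wikidata items. Schema: `quoteID` (primary key of the quotation, format "YYYY-MM-DD-{increasing int:06d}"); `speaker` (Wikidata ID corresponding to the speaker of the quotation). The mapping is created using heuristics described in the following paper: Marko Čuljak, Andreas Spitz, Robert West, and Akhil Arora, "Strong Heuristics for Named Entity Linking", Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop, DOI 10.18653/v1/2022.naacl-srw.30. self_quotations_filtered.parquet: Contains the identifiers of the quotations identified as not being self-attributed. Schema: `quoteID` (primary key of the quotation, format "YYYY-MM-DD-{increasing int:06d}"). speaker_attributes.parquet: Contains attributes, extracted from Wikidata, of all the speakers appearing in Quotebank. Schema: `id` (Wikidata item QID of the speaker, primary key); `aliases` (list of speaker's aliases); `date_of_birth` (list of possible speaker's dates of birth); `nationality` (list of speaker's nationalities); `gender` (list of speaker's previous or current genders); `lastrevid` (ID of the last revision of the speaker's item); `ethnic_group` (list of ethnic groups the speaker belongs to); `US_congress_bio_ID` (identifier for the speaker in the Biographical Directory of the United States Congress); `occupation` (list of speaker's occupations); `party` (list of parties the speaker is/was affiliated to); `academic_degree` (list of academic degrees obtained by the speaker); `label` (Wikidata label of the speaker); `candidacy` (list of the speaker's candidacies in political elections); `type` (type of the Wikidata entry, value is `item` for all the speakers); `religion` (previous/current religious affiliations of the speaker). Using the `id` field corresponding to the Wikidata QID of a speaker, this dataset can be easily joined with the disambiguated Quotebank obtained by running the `cleanup_disambiguate.py` script available in the aforementioned quotebank-toolkit repository. | |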
| dc.description.version | published | deu |
| dc.identifier.doi | 10.5281/zenodo.8033672 | |
| dc.identifier.uri | https://kops.uni-konstanz.de/handle/123456789/73186 | |
| dc.language.iso | eng | |
| dc.rights | Creative Commons Attribution 4.0 International | |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/legalcode | |
| dc.subject.ddc | 004 | |
| dc.title | Auxiliary Datasets for Speaker Disambiguation in Quotebank | eng |
| dspace.entity.type | Dataset | |
| kops.citation.bibtex | ||
| kops.citation.iso690 | ČULJAK, Marko, Andreas SPITZ, Robert WEST, Akhil ARORA, 2023. Auxiliary Datasets for Speaker Disambiguation in Quotebank | deu |
| kops.citation.iso690 | ČULJAK, Marko, Andreas SPITZ, Robert WEST, Akhil ARORA, 2023. Auxiliary Datasets for Speaker Disambiguation in Quotebank | eng |
| kops.citation.rdf | <rdf:RDF
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:bibo="http://purl.org/ontology/bibo/"
xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:void="http://rdfs.org/ns/void#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#" >
<rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/73186">
<dc:creator>Spitz, Andreas</dc:creator>
<dc:contributor>Spitz, Andreas</dc:contributor>
<bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/73186"/>
<dcterms:rights rdf:resource="https://creativecommons.org/licenses/by/4.0/legalcode"/>
<dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/71925"/>
<dcterms:abstract>This data repository contains the auxiliary datasets necessary for enriching and preprocessing Quotebank, a large dataset of unique, speaker-attributed quotations. The scripts utilizing the data can be found in the quotebank-toolkit GitHub repository. The datasets are stored in <strong>data.zip </strong>and are described below: <strong>quotebank_disambiguation_mapping_quote.parquet</strong><br> Provides the `quoteID`-&gt; `speakerQID` mapping for each quotation, disambiguating ambiguous speaker names and linking them to their respective Wikidata items. The schema of the dataset is as follows: <pre><code> |-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}") |-- speaker: Wikidata ID corresponding to the speaker of the quotation</code></pre> The mapping is created using heuristics described in the following paper: Marko Čuljak, Andreas Spitz, Robert West, and Akhil Arora<br> "Strong Heuristics for Named Entity Linking"<br> <em>Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop</em><br> 10.18653/v1/2022.naacl-srw.30 <strong>self_quotations_filtered.parquet</strong><br> Contains the identifiers of the quotations identified as not being self-attributed. The schema of the dataset is as follows: <pre><code> |-- quoteID: primary key of the quotation (format: "YYYY-MM-DD-{increasing int:06d}") </code></pre> <strong>speaker_attributes.parquet</strong><br> Contains attributes of all the speakers appearing in Quotebank extracted from Wikidata. The schema of the dataset is as follows. 
<pre><code> |-- id: Wikidata item QID of the speaker, primary key |-- aliases: list of speaker's aliases |-- date_of_birth: list of possible speaker's dates of birth |-- nationality: list of speaker's nationalities |-- gender: list of speaker's previous or current genders |-- lastrevid: ID of the last revision of the speaker's item |-- ethnic_group: list of ethnic groups the speaker belongs to |-- US_congress_bio_ID: identifier for the speaker in the Biographical Directory of the United States Congress |-- occupation: list of speaker's occupations |-- party: list of parties the speaker is/was affiliated to |-- academic_degree: list of academic degrees obtained by the speaker |-- label: Wikidata label of the speaker |-- candidacy: list of the speaker's candidacies in political elections |-- type: type of the Wikidata entry (value is `item` for all the speakers) |-- religion: previous/current religious affiliations of the speaker </code></pre> Using the `id` field corresponding to the Wikidata QID of a speaker, this dataset can be easily joined with disambiguated Quotebank obtained by running the `cleanup_disambiguate.py` script available in the aforementioned quotebank-toolkit repository.</dcterms:abstract>
<dcterms:issued>2023</dcterms:issued>
<dc:creator>Čuljak, Marko</dc:creator>
<dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2023-06-18T16:41:13Z</dcterms:created>
<dc:contributor>West, Robert</dc:contributor>
<dc:rights>Creative Commons Attribution 4.0 International</dc:rights>
<dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/71925"/>
<foaf:homepage rdf:resource="http://localhost:8080/"/>
<dc:creator>West, Robert</dc:creator>
<dc:contributor>Čuljak, Marko</dc:contributor>
<dc:creator>Arora, Akhil</dc:creator>
<dc:language>eng</dc:language>
<dcterms:title>Auxiliary Datasets for Speaker Disambiguation in Quotebank</dcterms:title>
<dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-30T09:31:06Z</dc:date>
<dc:contributor>Arora, Akhil</dc:contributor>
<void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
<dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-30T09:31:06Z</dcterms:available>
</rdf:Description>
</rdf:RDF> | |
| kops.datacite.repository | Zenodo | |
| kops.flag.knbibliography | true | |
| relation.isAuthorOfDataset | 4cf0b980-487c-486b-8e3c-e2890ea465b9 | |
| relation.isAuthorOfDataset.latestForDiscovery | 4cf0b980-487c-486b-8e3c-e2890ea465b9 |
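
The abstract notes that the quotation-to-speaker mapping and the speaker attributes can be joined on the speaker's Wikidata QID. The following is a minimal sketch of that join using only the Python standard library. The rows are made-up placeholders (not real Quotebank or Wikidata records), the helper names `is_valid_quote_id` and `join_quotes_with_speakers` are hypothetical, and only the field names `quoteID`, `speaker`, `id`, and `label` come from the schemas in the abstract. It also assumes that `{increasing int:06d}` means a zero-padded counter of at least six digits. In practice the rows would be read from the Parquet files in data.zip rather than written inline.

```python
import re

# Made-up placeholder rows that only mirror the documented field names.
# They are NOT real Quotebank or Wikidata records.
mapping_rows = [
    {"quoteID": "2019-01-01-000001", "speaker": "Q900001"},
    {"quoteID": "2019-01-01-000002", "speaker": "Q900002"},
    {"quoteID": "2019-01-02-000001", "speaker": "Q900001"},
]
speaker_rows = [
    {"id": "Q900001", "label": "Speaker A"},
    {"id": "Q900002", "label": "Speaker B"},
]

# Assumed reading of the documented format "YYYY-MM-DD-{increasing int:06d}":
# a date followed by a counter zero-padded to at least six digits.
QUOTE_ID_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}-\d{6,}")


def is_valid_quote_id(quote_id):
    """Return True if quote_id matches the assumed quoteID layout."""
    return QUOTE_ID_PATTERN.fullmatch(quote_id) is not None


def join_quotes_with_speakers(mapping, speakers):
    """Inner-join mapping rows to speaker rows on speaker QID == id."""
    speakers_by_id = {row["id"]: row for row in speakers}
    joined = []
    for row in mapping:
        attributes = speakers_by_id.get(row["speaker"])
        if attributes is None:
            continue  # no attributes known for this QID, so skip the quotation
        joined.append({"quoteID": row["quoteID"], **attributes})
    return joined


joined = join_quotes_with_speakers(mapping_rows, speaker_rows)
for row in joined:
    print(row["quoteID"], row["id"], row["label"])
```

Indexing the speaker rows by `id` first keeps the join linear in the number of quotations, which matters because the mapping has one row per quotation while the attribute table has one row per speaker.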