Publikation:

Scraping Scientific Web Repositories : Challenges and Solutions for Automated Content Extraction

Lade...
Vorschaubild

Dateien

Zu diesem Dokument gibt es keine Dateien.

Datum

2016

Herausgeber:innen

Kontakt

ISSN der Zeitschrift

Electronic ISSN

ISBN

Bibliografische Daten

Verlag

Schriftenreihe

Auflagebezeichnung

URI (zitierfähiger Link)
ArXiv-ID

Internationale Patentnummer

Angaben zur Forschungsförderung

Projekt

Open Access-Veröffentlichung
Core Facility der Universität Konstanz

Gesperrt bis

Titel in einer weiteren Sprache

Publikationstyp
Zeitschriftenartikel
Publikationsstatus
Published

Erschienen in

D-Lib Magazine. 2016, 22(9/10). eISSN 1082-9873. Available under: doi: 10.1045/september2016-meschenmoser

Zusammenfassung

Aside from improving the visibility and accessibility of scientific publications, many scientific Web repositories also assess researchers' quantitative and qualitative publication performance, e.g., by displaying metrics such as the h-index. These metrics have become important for research institutions and other stakeholders to support impactful decision making processes such as hiring or funding decisions. However, scientific Web repositories typically offer only simple performance metrics and limited analysis options. Moreover, the data and algorithms to compute performance metrics are usually not published. Hence, it is not transparent or verifiable which publications the systems include in the computation and how the systems rank the results. Many researchers are interested in accessing the underlying scientometric raw data to increase the transparency of these systems. In this paper, we discuss the challenges and present strategies to programmatically access such data in scientific Web repositories. We demonstrate the strategies as part of an open source tool (MIT license) that allows research performance comparisons based on Google Scholar data. We would like to emphasize that the scraper included in the tool should only be used if consent was given by the operator of a repository. In our experience, consent is often given if the research goals are clearly explained and the project is of a non-commercial nature.

Zusammenfassung in einer weiteren Sprache

Fachgebiet (DDC)
004 Informatik

Schlagwörter

Web Crawling; Web Scraping; Crawler Detection

Konferenz

Rezension
undefined / . - undefined, undefined

Forschungsvorhaben

Organisationseinheiten

Zeitschriftenheft

Verknüpfte Datensätze

Zitieren

ISO 690MESCHENMOSER, Philipp, Norman MEUSCHKE, Manuel HOTZ, Bela GIPP, 2016. Scraping Scientific Web Repositories : Challenges and Solutions for Automated Content Extraction. In: D-Lib Magazine. 2016, 22(9/10). eISSN 1082-9873. Available under: doi: 10.1045/september2016-meschenmoser
BibTex
@article{Meschenmoser2016-09Scrap-44544,
  year={2016},
  doi={10.1045/september2016-meschenmoser},
  title={Scraping Scientific Web Repositories : Challenges and Solutions for Automated Content Extraction},
  number={9/10},
  volume={22},
  journal={D-Lib Magazine},
  author={Meschenmoser, Philipp and Meuschke, Norman and Hotz, Manuel and Gipp, Bela}
}
RDF
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bibo="http://purl.org/ontology/bibo/"
    xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > 
  <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/44544">
    <dc:creator>Meschenmoser, Philipp</dc:creator>
    <dc:creator>Gipp, Bela</dc:creator>
    <bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/44544"/>
    <dcterms:title>Scraping Scientific Web Repositories : Challenges and Solutions for Automated Content Extraction</dcterms:title>
    <dc:contributor>Gipp, Bela</dc:contributor>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2019-01-14T10:14:28Z</dc:date>
    <dc:contributor>Meuschke, Norman</dc:contributor>
    <dc:contributor>Meschenmoser, Philipp</dc:contributor>
    <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
    <dc:language>eng</dc:language>
    <dc:creator>Meuschke, Norman</dc:creator>
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dcterms:issued>2016-09</dcterms:issued>
    <foaf:homepage rdf:resource="http://localhost:8080/"/>
    <dc:creator>Hotz, Manuel</dc:creator>
    <dcterms:abstract xml:lang="eng">Aside from improving the visibility and accessibility of scientific publications, many scientific Web repositories also assess researchers' quantitative and qualitative publication performance, e.g., by displaying metrics such as the h-index. These metrics have become important for research institutions and other stakeholders to support impactful decision making processes such as hiring or funding decisions. However, scientific Web repositories typically offer only simple performance metrics and limited analysis options. Moreover, the data and algorithms to compute performance metrics are usually not published. Hence, it is not transparent or verifiable which publications the systems include in the computation and how the systems rank the results. Many researchers are interested in accessing the underlying scientometric raw data to increase the transparency of these systems. In this paper, we discuss the challenges and present strategies to programmatically access such data in scientific Web repositories. We demonstrate the strategies as part of an open source tool (MIT license) that allows research performance comparisons based on Google Scholar data. We would like to emphasize that the scraper included in the tool should only be used if consent was given by the operator of a repository. In our experience, consent is often given if the research goals are clearly explained and the project is of a non-commercial nature.</dcterms:abstract>
    <dc:contributor>Hotz, Manuel</dc:contributor>
    <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2019-01-14T10:14:28Z</dcterms:available>
  </rdf:Description>
</rdf:RDF>

Interner Vermerk

xmlui.Submission.submit.DescribeStep.inputForms.label.kops_note_fromSubmitter

Kontakt
URL der Originalveröffentl.

Prüfdatum der URL

Prüfungsdatum der Dissertation

Finanzierungsart

Kommentar zur Publikation

Allianzlizenz
Corresponding Authors der Uni Konstanz vorhanden
Internationale Co-Autor:innen
Universitätsbibliographie
Ja
Begutachtet
Unbekannt
Diese Publikation teilen