Publication:

Beyond time delays : how web scraping distorts measures of online news consumption

Files

There are no files associated with this document.

Date

2025

Authors

Ulloa, Roberto
Mangold, Frank
Schmidt, Felix
Gilsbach, Judith
Stier, Sebastian

Open Access publication
Open Access Hybrid

Publication type
Journal article
Publication status
Published

Published in

Communication Methods and Measures. Informa UK Limited. ISSN 1931-2458. eISSN 1931-2466. Available at: doi: 10.1080/19312458.2025.2482538

Abstract

As the exploration of digital behavioral data revolutionizes communication research, understanding the nuances of data collection methodologies becomes increasingly pertinent. This study focuses on one prominent data collection approach, web scraping, and specifically on its application in the growing field of research relying on web browsing data. We investigate discrepancies between content obtained directly during user interaction with a website (in-situ) and content scraped using the URLs of participants’ logged visits (ex-situ) with various time delays (0, 30, 60, and 90 days). We find substantial disparities between the methodologies, uncovering that errors are not uniformly distributed across news categories regardless of the classification method (domain, URL, or content analysis). These biases compromise the precision of measurements used in the existing literature. The ex-situ collection environment is the primary source of the discrepancies (~33.8%), while the time delays in the scraping process play a smaller role (adding ~6.5 percentage points in 90 days). Our research emphasizes the need for data collection methods that capture web content directly in the user’s environment. However, acknowledging its complexities, we further explore strategies to mitigate biases in web-scraped browsing histories, offering recommendations for researchers who rely on this method and laying the groundwork for developing error-correction frameworks.
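
As a rough illustration of the kind of comparison the abstract describes, the sketch below computes, for each scraping delay, the share of visits whose ex-situ label is missing or differs from the in-situ label. The function name `mismatch_rate`, the category labels, the URLs, and the toy data are all hypothetical, and this simple mismatch rate is an assumption made for illustration; it is not necessarily the measure used in the article.

```python
from typing import Dict, Optional


def mismatch_rate(in_situ: Dict[str, str], ex_situ: Dict[str, Optional[str]]) -> float:
    """Share of in-situ visits whose ex-situ label is missing or different."""
    if not in_situ:
        raise ValueError("in_situ must contain at least one visit")
    mismatches = sum(
        1 for url, label in in_situ.items() if ex_situ.get(url) != label
    )
    return mismatches / len(in_situ)


# Hypothetical toy data: labels assigned to content captured in the browser.
in_situ = {
    "https://example.org/a": "politics",
    "https://example.org/b": "sports",
    "https://example.org/c": "politics",
    "https://example.org/d": "other",
}

# Hypothetical labels for the same URLs scraped after 0 and 90 days.
# None marks a page that could not be retrieved as the user saw it.
ex_situ_by_delay = {
    0: {
        "https://example.org/a": "politics",
        "https://example.org/b": "sports",
        "https://example.org/c": None,
        "https://example.org/d": "other",
    },
    90: {
        "https://example.org/a": "politics",
        "https://example.org/b": None,
        "https://example.org/c": None,
        "https://example.org/d": "other",
    },
}

rates = {delay: mismatch_rate(in_situ, ex) for delay, ex in ex_situ_by_delay.items()}
for delay, rate in sorted(rates.items()):
    print(f"delay={delay:>2} days: mismatch rate = {rate:.1%}")
```

Under these assumptions, the rate at delay 0 would reflect the ex-situ environment alone, and the increase at later delays would reflect the added effect of time, which mirrors the abstract's distinction between ~33.8% and the additional ~6.5 percentage points.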

Subject area (DDC)
004 Computer science

Cite

ISO 690
ULLOA, Roberto, Frank MANGOLD, Felix SCHMIDT, Judith GILSBACH, Sebastian STIER, 2025. Beyond time delays : how web scraping distorts measures of online news consumption. In: Communication Methods and Measures. Informa UK Limited. ISSN 1931-2458. eISSN 1931-2466. Available at: doi: 10.1080/19312458.2025.2482538
BibTeX
@article{Ulloa2025-03-26Beyon-72942,
  title={Beyond time delays : how web scraping distorts measures of online news consumption},
  year={2025},
  doi={10.1080/19312458.2025.2482538},
  issn={1931-2458},
  journal={Communication Methods and Measures},
  author={Ulloa, Roberto and Mangold, Frank and Schmidt, Felix and Gilsbach, Judith and Stier, Sebastian}
}
RDF
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bibo="http://purl.org/ontology/bibo/"
    xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > 
  <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/72942">
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dc:creator>Mangold, Frank</dc:creator>
    <dc:creator>Gilsbach, Judith</dc:creator>
    <dcterms:abstract>As the exploration of digital behavioral data revolutionizes communication research, understanding the nuances of data collection methodologies becomes increasingly pertinent. This study focuses on one prominent data collection approach, web scraping, and specifically on its application in the growing field of research relying on web browsing data. We investigate discrepancies between content obtained directly during user interaction with a website (in-situ) and content scraped using the URLs of participants’ logged visits (ex-situ) with various time delays (0, 30, 60, and 90 days). We find substantial disparities between the methodologies, uncovering that errors are not uniformly distributed across news categories regardless of the classification method (domain, URL, or content analysis). These biases compromise the precision of measurements used in the existing literature. The ex-situ collection environment is the primary source of the discrepancies (~33.8%), while the time delays in the scraping process play a smaller role (adding ~6.5 percentage points in 90 days). Our research emphasizes the need for data collection methods that capture web content directly in the user’s environment. However, acknowledging its complexities, we further explore strategies to mitigate biases in web-scraped browsing histories, offering recommendations for researchers who rely on this method and laying the groundwork for developing error-correction frameworks.</dcterms:abstract>
    <dcterms:rights rdf:resource="http://creativecommons.org/licenses/by/4.0/"/>
    <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/43613"/>
    <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-07T09:02:40Z</dcterms:available>
    <dc:contributor>Schmidt, Felix</dc:contributor>
    <dc:creator>Schmidt, Felix</dc:creator>
    <dc:contributor>Stier, Sebastian</dc:contributor>
    <dc:rights>Attribution 4.0 International</dc:rights>
    <dc:creator>Ulloa, Roberto</dc:creator>
    <dc:creator>Stier, Sebastian</dc:creator>
    <dcterms:title>Beyond time delays : how web scraping distorts measures of online news consumption</dcterms:title>
    <dc:contributor>Ulloa, Roberto</dc:contributor>
    <foaf:homepage rdf:resource="http://localhost:8080/"/>
    <dc:contributor>Mangold, Frank</dc:contributor>
    <dc:language>eng</dc:language>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/72942"/>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/43613"/>
    <dcterms:issued>2025-03-26</dcterms:issued>
    <dc:contributor>Gilsbach, Judith</dc:contributor>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-07T09:02:40Z</dc:date>
  </rdf:Description>
</rdf:RDF>

University bibliography
Yes
Peer-reviewed
Yes
Online First: journal articles that are made available online before they are assigned to a specific journal issue. Online First articles are published in the publisher's version on the journal's homepage.