Publication: Beyond time delays : how web scraping distorts measures of online news consumption
Abstract
As the exploration of digital behavioral data revolutionizes communication research, understanding the nuances of data collection methodologies becomes increasingly pertinent. This study focuses on one prominent data collection approach, web scraping, specifically its application in the growing field of research relying on web browsing data. We investigate discrepancies between content obtained directly during user interaction with a website (in situ) and content scraped using the URLs of participants’ logged visits (ex situ) with various time delays (0, 30, 60, and 90 days). We find substantial disparities between the methodologies, uncovering that errors are not uniformly distributed across news categories regardless of the classification method (domain, URL, or content analysis). These biases compromise the precision of measurements used in the existing literature. The ex-situ collection environment is the primary source of the discrepancies (~33.8%), while the time delays in the scraping process play a smaller role (adding ~6.5 percentage points over 90 days). Our research emphasizes the need for data collection methods that capture web content directly in the user’s environment. However, acknowledging the complexities of such methods, we further explore strategies to mitigate biases in web-scraped browsing histories, offering recommendations for researchers who rely on web scraping and laying the groundwork for developing error-correction frameworks.
Cite
ISO 690
ULLOA, Roberto, Frank MANGOLD, Felix SCHMIDT, Judith GILSBACH, Sebastian STIER, 2025. Beyond time delays : how web scraping distorts measures of online news consumption. In: Communication Methods and Measures. Informa UK Limited. ISSN 1931-2458. eISSN 1931-2466. Available at: doi: 10.1080/19312458.2025.2482538
BibTeX
@article{Ulloa2025-03-26Beyon-72942,
  title={Beyond time delays : how web scraping distorts measures of online news consumption},
  year={2025},
  doi={10.1080/19312458.2025.2482538},
  issn={1931-2458},
  journal={Communication Methods and Measures},
  author={Ulloa, Roberto and Mangold, Frank and Schmidt, Felix and Gilsbach, Judith and Stier, Sebastian}
}
RDF
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bibo="http://purl.org/ontology/bibo/"
    xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#" >
  <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/72942">
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dc:creator>Mangold, Frank</dc:creator>
    <dc:creator>Gilsbach, Judith</dc:creator>
    <dcterms:abstract>As the exploration of digital behavioral data revolutionizes communication research, understanding the nuances of data collection methodologies becomes increasingly pertinent. This study focuses on one prominent data collection approach, web scraping, specifically its application in the growing field of research relying on web browsing data. We investigate discrepancies between content obtained directly during user interaction with a website (in situ) and content scraped using the URLs of participants’ logged visits (ex situ) with various time delays (0, 30, 60, and 90 days). We find substantial disparities between the methodologies, uncovering that errors are not uniformly distributed across news categories regardless of the classification method (domain, URL, or content analysis). These biases compromise the precision of measurements used in the existing literature. The ex-situ collection environment is the primary source of the discrepancies (~33.8%), while the time delays in the scraping process play a smaller role (adding ~6.5 percentage points over 90 days). Our research emphasizes the need for data collection methods that capture web content directly in the user’s environment. However, acknowledging the complexities of such methods, we further explore strategies to mitigate biases in web-scraped browsing histories, offering recommendations for researchers who rely on web scraping and laying the groundwork for developing error-correction frameworks.</dcterms:abstract>
    <dcterms:rights rdf:resource="http://creativecommons.org/licenses/by/4.0/"/>
    <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/43613"/>
    <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-07T09:02:40Z</dcterms:available>
    <dc:contributor>Schmidt, Felix</dc:contributor>
    <dc:creator>Schmidt, Felix</dc:creator>
    <dc:contributor>Stier, Sebastian</dc:contributor>
    <dc:rights>Attribution 4.0 International</dc:rights>
    <dc:creator>Ulloa, Roberto</dc:creator>
    <dc:creator>Stier, Sebastian</dc:creator>
    <dcterms:title>Beyond time delays : how web scraping distorts measures of online news consumption</dcterms:title>
    <dc:contributor>Ulloa, Roberto</dc:contributor>
    <foaf:homepage rdf:resource="http://localhost:8080/"/>
    <dc:contributor>Mangold, Frank</dc:contributor>
    <dc:language>eng</dc:language>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/72942"/>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/43613"/>
    <dcterms:issued>2025-03-26</dcterms:issued>
    <dc:contributor>Gilsbach, Judith</dc:contributor>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-07T09:02:40Z</dc:date>
  </rdf:Description>
</rdf:RDF>