Towards Reproducible Research of Event Detection Techniques for Twitter
2019-06, Weiler, Andreas, Schilling, Harry, Kircher, Lukas, Grossniklaus, Michael
A major challenge in many research areas is reproducibility of implementations, experiments, or evaluations. New data sources and research directions complicate the reproducibility even more. For example, Twitter continues to gain popularity as a source of up-to-date news and information. As a result, numerous event detection techniques have been proposed to cope with the steadily increasing rate and volume of social media data streams. Although some of these works provide their implementation or conduct an evaluation of the proposed technique, it is almost impossible to reproduce their experiments. The main drawback is that Twitter prohibits the release of crawled datasets that are used by researchers in their experiments. In this work, we present a survey of the vast landscape of implementations, experiments, and evaluations presented by the different research works. Furthermore, we propose a reproducibility toolkit including Twistor (Twitter Stream Simulator), which can be used to simulate an artificial Twitter data stream (including events) as input for the experiments or evaluations of event detection techniques. We further present the experimental application of the reproducibility toolkit to state-of-the-art event detection techniques.
Revealing the Invisible : Visual Analytics and Explanatory Storytelling for Advanced Team Sport Analysis
2018-10, Stein, Manuel, Breitkreutz, Thorsten, Häußler, Johannes, Seebacher, Daniel, Niederberger, Christoph, Schreck, Tobias, Grossniklaus, Michael, Keim, Daniel A., Janetzko, Halldor
The analysis of invasive team sports often concentrates on cooperative and competitive aspects of collective movement behavior. A main goal is the identification and explanation of strategies, and eventually the development of new strategies. In visual sports analytics, a range of visual-interactive analysis techniques have been proposed, e.g., visualizations based on trajectories, graphs, heatmaps, and animations. Identifying suitable visualizations for a specific situation is key to a successful analysis. Existing systems enable the interactive selection of different visualization facets to support the analysis process. However, an interactive selection of appropriate visualizations is a difficult, complex, and time-consuming task. In this paper, we propose a four-step conceptual analytics workflow for the automatic selection of appropriate views for key situations in soccer games. Our concept covers classification, specification, explanation, and alteration of match situations, effectively enabling analysts to focus on important game situations and the determination of alternative moves. Combining abstract visualizations with real-world video recordings by Immersive Visual Analytics and descriptive storylines, we support domain experts in understanding key situations. We demonstrate the usefulness of our proposed conceptual workflow via two proofs of concept and evaluate our system by comparing our results to manual video annotations by domain experts. Initial expert feedback shows that our proposed concept improves the understanding of competitive sports and leads to a more efficient data analysis.
Stability Evaluation of Event Detection Techniques for Twitter
2016-09-21, Weiler, Andreas, Beel, Joeran, Gipp, Bela, Grossniklaus, Michael
Twitter continues to gain popularity as a source of up-to-date news and information. As a result, numerous event detection techniques have been proposed to cope with the steadily increasing rate and volume of social media data streams. Although most of these works conduct some evaluation of the proposed technique, comparing their effectiveness is a challenging task. In this paper, we examine the challenges to reproducing evaluation results for event detection techniques. We apply several event detection techniques and vary four parameters, namely time window (15 vs. 30 vs. 60 mins), stopwords (include vs. exclude), retweets (include vs. exclude), and the number of terms that define an event (1...5 terms). Our experiments use real-world Twitter streaming data and show that varying these parameters alone significantly influences the outcomes of the event detection techniques, sometimes in unforeseen ways. We conclude that even minor variations in event detection techniques may lead to major difficulties in reproducing experiments.
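The parameter space described in this abstract can be enumerated with a short sketch. It assumes a full cross-product of the four parameters, which the abstract does not state explicitly; the dictionary keys and the function name `parameter_grid` are hypothetical and not taken from the paper.

```python
from itertools import product

# Parameter values as listed in the abstract.
TIME_WINDOWS_MIN = (15, 30, 60)
STOPWORDS = ("include", "exclude")
RETWEETS = ("include", "exclude")
TERMS_PER_EVENT = range(1, 6)  # 1 to 5 terms


def parameter_grid():
    """Return every combination of the four parameters (assumed full cross-product)."""
    return [
        {"window_min": w, "stopwords": s, "retweets": r, "terms": t}
        for w, s, r, t in product(
            TIME_WINDOWS_MIN, STOPWORDS, RETWEETS, TERMS_PER_EVENT
        )
    ]


configs = parameter_grid()
print(len(configs))  # 3 * 2 * 2 * 5 = 60 configurations
```

Even this small grid yields 60 configurations per technique, which gives a sense of how quickly undocumented parameter choices can make a published result hard to reproduce.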
Frames : Data-driven Windows
2016, Grossniklaus, Michael, Maier, David, Miller, James, Moorthy, Sharmadha, Tufte, Kristin
Traditional Data Stream Management Systems (DSMS) segment data streams using windows that are defined either by a time interval or a number of tuples. Such windows are fixed (the definition does not vary over the course of a stream) and are defined based on external properties unrelated to the data content of the stream. However, streams and their content do vary over time: the rate of a data stream may vary or the data distribution of the content may vary. The mismatch between a fixed stream segmentation and a variable stream motivates the need for a more flexible, expressive and physically independent stream segmentation. We introduce a new stream segmentation technique, called frames. Frames segment streams based on data content. We present a theory and implementation of frames and show the utility of frames for a variety of applications.
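To illustrate what segmenting a stream by data content rather than by time or tuple count might look like, here is a minimal sketch. The rule used (close the current segment once a value drifts more than `delta` from the segment's first value) is an assumption chosen for illustration, and the function name `delta_frames` is hypothetical; the abstract does not define the frame types used in the paper.

```python
from typing import Iterable, List, Tuple

Reading = Tuple[int, float]  # (timestamp, value)


def delta_frames(stream: Iterable[Reading], delta: float) -> List[List[Reading]]:
    """Split a stream into segments whose values stay within `delta`
    of the first value in the segment (illustrative rule only)."""
    frames: List[List[Reading]] = []
    current: List[Reading] = []
    for ts, value in stream:
        # A new segment starts when the content changes enough,
        # independent of elapsed time or number of tuples.
        if current and abs(value - current[0][1]) > delta:
            frames.append(current)
            current = []
        current.append((ts, value))
    if current:
        frames.append(current)
    return frames


readings = [(0, 10.0), (1, 10.5), (2, 12.0), (3, 12.2), (4, 9.0)]
frames = delta_frames(readings, delta=1.0)
print(frames)
```

Unlike a fixed window, the segment lengths here depend entirely on how the values evolve: a stable stream produces long segments and a volatile one produces short segments.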
Experiences with Implementing Landmark Embedding in Neo4j
2019, Hotz, Manuel, Chondrogiannis, Theodoros, Wörteler, Leonard, Grossniklaus, Michael
Reachability, distance, and shortest path queries are fundamental operations in the field of graph data management with various applications in research and industry. However, while various preprocessing-based methods have been proposed to optimize the computation of such queries, the integration of existing methods into graph database management systems and processing frameworks has been limited. In this paper, we present an implementation of a static graph index that employs landmark embedding for Neo4j, to enable the index-based computation of reachability, distance, and shortest path queries on the database. We explore different strategies for selecting landmarks and different schemes for storing the precomputed landmark distances. To evaluate the efficiency of each landmark selection strategy and each storage scheme, we conduct an experimental evaluation using four real-world network datasets. We measure the preprocessing cost, the query processing time, and the accuracy of the distance estimation of different configurations of our index structure.
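The general idea of landmark embedding can be sketched in plain Python. This is not the Neo4j implementation from the paper: it assumes an unweighted, undirected graph stored as an adjacency dictionary, and the function names are hypothetical. Distances from each landmark are precomputed with breadth-first search, and the triangle inequality then gives lower and upper bounds on the distance between any two nodes.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def build_index(graph, landmarks):
    """Precompute one distance table per landmark."""
    return {landmark: bfs_distances(graph, landmark) for landmark in landmarks}


def distance_bounds(index, s, t):
    """Bounds on d(s, t) from the triangle inequality:
    |d(l, s) - d(l, t)| <= d(s, t) <= d(l, s) + d(l, t)."""
    lower, upper = 0, float("inf")
    for dist in index.values():
        if s in dist and t in dist:
            lower = max(lower, abs(dist[s] - dist[t]))
            upper = min(upper, dist[s] + dist[t])
    return lower, upper


# Path graph a - b - c - d - e with a single landmark in the middle.
graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
index = build_index(graph, ["c"])
print(distance_bounds(index, "a", "e"))  # (0, 4); the true distance is 4
print(distance_bounds(index, "a", "b"))  # (1, 3); the true distance is 1
```

The example shows why landmark selection affects accuracy: the upper bound is exact when a landmark lies on a shortest path between the two nodes, and the lower bound is exact when one node lies on a shortest path from the landmark to the other.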
Bucket Selection : A Model-Independent Diverse Selection Strategy for Widening
2017-10-04, Fillbrunn, Alexander, Wörteler, Leonard, Grossniklaus, Michael, Berthold, Michael R.
When using a greedy algorithm for finding a model, as is the case in many data mining algorithms, there is a risk of getting caught in local extrema, i.e., suboptimal solutions. Widening is a technique for enhancing greedy algorithms by using parallel resources to broaden the search in the model space. The most important component of widening is the selector, a function that chooses the next models to refine. This selector ideally enforces diversity within the selected set of models in order to ensure that parallel workers explore sufficiently different parts of the model space and do not end up mimicking a simple beam search. Previous publications have shown that this works well for problems with a suitable distance measure for the models, but if no such measure is available, applying widening is challenging. In addition, these approaches require extensive, sequential computations for diverse subset selection, making the entire process much slower than the original greedy algorithm. In this paper, we propose the bucket selector, a model-independent randomized selection strategy. We find that (a) the bucket selector is much faster and not significantly worse when a diversity measure exists and (b) it performs better than existing selection strategies in cases without a diversity measure.
An Algebra and Equivalences to Transform Graph Patterns in Neo4j
2016, Hölsch, Jürgen, Grossniklaus, Michael
Modern query optimizers of relational database systems embody more than three decades of research and practice in the area of data management and processing. Key advances include algebraic query transformation, intelligent search space pruning, and modular optimizer architectures. Surprisingly, many of these contributions seem to have been overlooked in the emerging field of graph databases so far. In particular, we believe that query optimization based on a general graph algebra and its equivalences can greatly improve on the current state of the art. Although some graph algebras have already been proposed, they have often been developed in a context in which a relational database system is used as a backend to process graph data. As a consequence, these algebras are typically tightly coupled to the relational algebra, making them unsuitable for native graph databases. While we support the approach of extending the relational algebra, we argue that graph-specific operations should be defined at a higher level, independent of the database backend. In this paper, we introduce such a general graph algebra and corresponding equivalences. We demonstrate how it can be used to optimize Cypher queries in the setting of the Neo4j native graph database.
From Movement to Events : Improving Soccer Match Annotations
2019, Stein, Manuel, Seebacher, Daniel, Karge, Tassilo, Polk, Tom, Grossniklaus, Michael, Keim, Daniel A.
Match analysis has become an important task in everyday work at professional soccer clubs in order to improve team performance. Video analysts regularly spend up to several days analyzing and summarizing matches based on tracked and annotated match data. Although there already exist extensive capabilities to track the movement of players and the ball from multimedia data sources such as video recordings, there is no capability to sufficiently detect dynamic and complex events within these data. As a consequence, analysts have to rely on manually created annotations, which are very time-consuming and expensive to create. We propose a novel method for the semi-automatic definition and detection of events based entirely on movement data of players and the ball. Incorporating Allen’s interval algebra into a visual analytics system, we enable analysts to visually define as well as search for complex, hierarchical events. We demonstrate the usefulness of our approach by quantitatively comparing our automatically detected events with manually annotated events from a professional data provider as well as through several expert interviews. The results of our evaluation show that the required annotation time for complete matches by using our system can be reduced to a few seconds while achieving a similar level of performance.
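Allen's interval algebra distinguishes thirteen possible relations between two time intervals. The sketch below classifies a pair of intervals into one of them; it only illustrates the algebra itself, not the system described in the paper. The function name `allen_relation` and the example intervals (a pass and a run, in seconds) are hypothetical.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) with start < end


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the Allen relation that holds from interval a to interval b."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if b_end < a_start:
        return "after"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Remaining case: the intervals partially overlap.
    return "overlaps" if a_start < b_start else "overlapped-by"


# Hypothetical movement-derived intervals, in seconds of match time.
pass_interval = (10.0, 12.5)
run_interval = (11.0, 16.0)
print(allen_relation(pass_interval, run_interval))  # overlaps
```

Complex events can then be described as combinations of such relations, for example a run that overlaps a pass and is followed by (meets) a shot.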
On the Performance of Analytical and Pattern Matching Graph Queries in Neo4j and a Relational Database
2017, Hölsch, Jürgen, Schmidt, Tobias, Grossniklaus, Michael
Graph databases with a custom non-relational backend promote themselves to outperform relational databases in answering queries on large graphs. Recent empirical studies show that this claim is not always true. However, these studies focus only on pattern matching queries and neglect analytical queries used in practice such as shortest path, diameter, degree centrality or closeness centrality. In addition, there is no distinction between different types of pattern matching queries. In this paper, we introduce a set of analytical and pattern matching queries, and evaluate them in Neo4j and a market-leading commercial relational database system. We show that the relational database system outperforms Neo4j for our analytical queries and that Neo4j is faster for queries that do not filter on specific edge types.
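Two of the analytical measures named in this abstract, degree centrality and closeness centrality, can be sketched in plain Python. The sketch assumes a connected, unweighted, undirected graph given as an adjacency dictionary and uses the common normalization by n - 1; it is not the Cypher or SQL formulation evaluated in the paper, and the function names are hypothetical.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def closeness_centrality(graph, source):
    """(Number of reachable nodes) / (sum of hop distances to them)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0


# Path graph a - b - c.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(degree_centrality(graph))
print(closeness_centrality(graph, "b"))
```

Degree centrality needs only each node's neighbor count, whereas closeness centrality requires a traversal from every node of interest, which is one reason such analytical queries stress a database differently from pattern matching queries.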
Optimization of Nested Queries Using the NF2 Algebra
2016, Hölsch, Jürgen, Grossniklaus, Michael, Scholl, Marc H.
A key promise of SQL is that the optimizer will find the most efficient execution plan, regardless of how the query is formulated. In general, query optimizers of modern database systems are able to keep this promise, with the notable exception of nested queries. While several optimization techniques for nested queries have been proposed, their adoption in practice has been limited. In this paper, we argue that the NF2 (non-first normal form) algebra, which was originally designed to process nested tables, is a better approach to nested query optimization as it fulfills two key requirements. First, the NF2 algebra can represent all types of nested queries as well as both existing and novel optimization techniques based on its equivalences. Second, performance benefits can be achieved with few changes to existing transformation-based query optimizers as the NF2 algebra is an extension of the relational algebra.