Declarative Access to Filesystem Data : New application domains for XML database management systems
2012, Holupirek, Alexander
XML and state-of-the-art XML database management systems (XML-DBMSs) can play a leading role in far more application domains than is currently the case.
Even in their basic configuration, they entail all components necessary to act as central systems for complex search and retrieval tasks. They provide language-specific indexing of full-text documents and can store structured, semi-structured and binary data.
Besides, they offer a great variety of standardized languages (XQuery, XSLT, XQuery Full Text, etc.) to develop applications inside a pure XML technology stack. Benefits are obvious: Data, logic, and presentation tiers can operate on a single data model, and no conversions have to be applied when switching in between.
This thesis deals with the design and development of XML/XQuery-driven information architectures that process formerly heterogeneous data sources in a standardized and uniform manner. Filesystems and their vast amounts of different file types are a prime example of such a heterogeneous dataspace. A new XML dialect, the Filesystem Markup Language (FSML), is introduced to construct a database view of the filesystem and its contents. FSML provides a uniform view of the filesystem’s contents and allows developers to leverage the complete XML technology stack on filesystem data.
BaseX, a high-performance native XML-DBMS developed at the University of Konstanz, is pushed to new application domains. We interface the database system with the operating system kernel and implement a database/filesystem hybrid (BaseX-FS), which operates on FSML database instances. A joint storage for both the filesystem and the database is established, which allows both developers and users to access data via the conventional and proven filesystem interface and, in addition, through a novel declarative, database-supported interface. As a direct consequence, XML languages such as XQuery can be used by applications and developers to analyze and process filesystem data. Smarter ways of accessing personal information stored in filesystems are achieved by retrieval strategies with no, partial, or full knowledge about the structure, format, and content of the data (“Query the filesystem like a database”).
In combination with BaseX-Web, a database extension that facilitates the development of desktop-like web applications, we present a system architecture that makes it easier for application developers to build content-oriented (data-centric) retrieval and search applications dealing with files and their contents. The proposed architecture is ready to drive (expert) information systems that work with distinct data sources, using an XQuery-driven development approach. As a concluding proof of concept, a complete development cycle for an OPAC (Online Public Access Catalogue) system is presented in detail.
BaseX and DeepFS - Joint Storage for Filesystem and Database
2009, Holupirek, Alexander, Grün, Christian, Scholl, Marc H.
BaseX is an early adopter of the upcoming XQuery Full Text Recommendation. This paper presents some of the enhancements made to the XML database to fully support the language extensions. The system's data and index structures are described, and implementation details are given on the XQuery compiler, which supports sequential scanning, index-based, and hybrid processing of full-text queries. Experimental analysis and an insight into visual result presentation of query results conclude the presentation.
Pushing XPath Accelerator to its Limits
2006, Grün, Christian, Holupirek, Alexander, Kramis, Marc, Scholl, Marc H., Waldvogel, Marcel
Two competing encoding concepts are known to scale well with growing amounts of XML data: XPath Accelerator encoding implemented by MonetDB for in-memory documents and X-Hive's Persistent DOM for on-disk storage. We identified two ways to improve XPath Accelerator and present prototypes for the respective techniques: BaseX boosts in-memory performance with optimized data and value index structures while Idefix introduces native block-oriented persistence with logarithmic update behavior for true scalability, overcoming main-memory constraints.
An easy-to-use Java-based benchmarking framework was developed and used to consistently compare these competing techniques and perform scalability measurements. The established XMark benchmark was applied to all four systems under test. Additional fulltext-sensitive queries against the well-known DBLP database complement the XMark results.
The latest version of X-Hive finally surprised with good scalability and performance numbers. Both BaseX and Idefix also hold their promise to push XPath Accelerator to its limits: BaseX efficiently exploits available main memory to speed up XML queries, while Idefix surpasses main-memory constraints and rivals the on-disk leadership of X-Hive. The competition between XPath Accelerator and Persistent DOM has definitely been relaunched.
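For readers unfamiliar with the encoding, the following is a minimal sketch of the pre/post numbering idea on which XPath Accelerator is based. It is not the BaseX, Idefix, or MonetDB implementation. The function names `pre_post` and `is_descendant` are hypothetical, and the sketch uses Python's standard `xml.etree.ElementTree` on a small in-memory document.

```python
import xml.etree.ElementTree as ET


def pre_post(root):
    """Assign (pre, post) ranks to every element in one depth-first walk."""
    table = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        # The preorder rank is taken when the node is first reached.
        pre = counters["pre"]
        counters["pre"] += 1
        for child in node:
            visit(child)
        # The postorder rank is taken after all children were visited.
        table[node] = (pre, counters["post"])
        counters["post"] += 1

    visit(root)
    return table


def is_descendant(table, v, context):
    """True if v is a descendant of context: larger pre, smaller post."""
    return table[v][0] > table[context][0] and table[v][1] < table[context][1]
```

With the ranks stored in a table, an axis step such as `descendant` becomes a range comparison on two integer columns, which is what makes the encoding suitable for table-based storage.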
A framework for retrieval and annotation in digital humanities using XQuery full text and update in BaseX
2012, Mahlow, Cerstin, Grün, Christian, Holupirek, Alexander, Scholl, Marc H.
A key difference between traditional humanities research and the emerging field of digital humanities is that the latter aims to complement qualitative methods with quantitative data. In linguistics, this means the use of large corpora of text, which are usually annotated automatically using natural language processing tools. However, these tools do not exist for historical texts, so scholars have to work with unannotated data. We have developed a system for systematic iterative exploration and annotation of historical text corpora, which relies on an XML database (BaseX) and in particular on the Full Text and Update facilities of XQuery.
XQuery Full Text Implementation in BaseX
2009, Grün, Christian, Gath, Sebastian, Holupirek, Alexander, Scholl, Marc H.
BaseX is an early adopter of the upcoming XQuery Full Text Recommendation. This paper presents some of the enhancements made to the XML database to fully support the language extensions. The system's data and index structures are described, and implementation details are given on the XQuery compiler, which supports sequential scanning, index-based, and hybrid processing of full-text queries. Experimental analysis and an insight into visual result presentation of query results conclude the presentation.
Melting Pot XML : Bringing File Systems and Databases One Step Closer
2007, Holupirek, Alexander, Grün, Christian, Scholl, Marc H.
Ever-growing data volumes demand storage systems that go beyond the abilities of current file systems, in particular a powerful querying capability. With the rise of XML, the database community has been challenged by semi-structured data processing, which has broadened its field of activity. Since file systems are structured hierarchically, they can be mapped to XML and, as such, stored in and queried by an XML-aware database. We provide an evaluation of a state-of-the-art XML-aware database implementing a file system.
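The mapping from a hierarchical file system to XML can be illustrated with a short sketch. This is not the representation used in the paper. The element names `dir` and `file`, the attributes `name` and `size`, and the function `dir_to_xml` are assumptions made for illustration only.

```python
import os
import xml.etree.ElementTree as ET


def dir_to_xml(path):
    """Return an XML element that mirrors the directory tree rooted at path."""
    # Hypothetical vocabulary: <dir name="..."> and <file name="..." size="..."/>.
    node = ET.Element("dir", name=os.path.basename(os.path.abspath(path)))
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        if os.path.isdir(full) and not os.path.islink(full):
            # Directories become nested elements, preserving the hierarchy.
            node.append(dir_to_xml(full))
        else:
            # Everything else is recorded as a leaf with its size in bytes.
            size = str(os.lstat(full).st_size)
            ET.SubElement(node, "file", name=entry, size=size)
    return node
```

Once the tree exists as XML, a path expression can stand in for a file search. For example, `dir_to_xml(".").findall(".//file[@name='a.txt']")` returns all entries named `a.txt` at any depth.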
INEX Efficiency Track meets XQuery Full Text in BaseX
2009, Gath, Sebastian, Grün, Christian, Holupirek, Alexander, Scholl, Marc H.
BaseX is an early adopter of the upcoming XQuery Full Text Recommendation. This extended abstract describes the challenges of joining the INEX Efficiency Track using BaseX, an XML database system. We will describe some of the problems we encountered during the application of XQuery Full Text to the INEX topic set and discuss the remaining comparability problem.
Implementing Filesystems by Tree-aware DBMSs
2008-08-01, Holupirek, Alexander, Scholl, Marc H.
With the rise of XML, the database community has been challenged by semi-structured data processing. Since the data type behind XML is the tree, state-of-the-art RDBMSs have learned to deal with such data (e.g., [18, 5, 6, 16]). This paper introduces a Ph.D. project focused on the question of to what extent the tree-awareness of recent RDBMSs can be used to, once again, try to implement filesystems using database technology. Our main goal is to provide means to query the data stored in filesystems and to find ways to enhance/combine the data storage and query capabilities of operating systems using semi-structured database technology. Two DBMSs with relational XML storage, built on top of the XPath accelerator numbering scheme, are the foundations for our work. First, with BaseX, an XML database, we establish a link between user, database and filesystem content. BaseX allows visual access to filesystem data stored in the database. An integrated query interface allows users to filter query results in interactive response time. Second, we establish a link between DBMS and OS. We implement a filesystem in userspace backed by the MonetDB/XQuery system, a well-known relational database system, which integrates the Pathfinder XQuery compiler and the MonetDB kernel. As a result, the DBMS is mounted as a conventional filesystem by the operating system kernel. Consequently, access via the established (virtual) filesystem interface as well as database-enhanced access to the same data is provided.
Visually Exploring and Querying XML with BaseX
2007, Grün, Christian, Holupirek, Alexander, Scholl, Marc H.
XML documents are widely used as a generic container for textual contents. As they are growing in size, XML databases are emerging to efficiently store and query their contents. Besides, due to the hierarchic structure of XML documents, hierarchic visualizations are needed to facilitate cognitive access to query results. BaseX is a simple database prototype, mapping XML documents to a table-based tree encoding. An integrated treemap visualization and a query interface allow visual access to the documents and demonstrate the efficiency of the underlying data storage.