Technical Report MSC-2009-01

Title: PIXSAR: Incremental Reclustering of Augmented XML Trees
Authors: Shnaiderman Lila
Supervisors: Oded Shmueli
Abstract: XML is one of the primary encoding schemes for data and knowledge. We investigate incremental physical data clustering in systems that store XML documents using a native format. We formulate the XML clustering problem as an augmented (with sibling edges) tree partitioning problem and propose the PIXSAR (Practical Incremental XML Sibling Augmented Reclustering) algorithm for incrementally clustering XML documents. We show the fundamental importance of workload-driven dynamically rearranging storage. We also touch upon extending PIXSAR for handling insertions and deletions. PIXSAR incrementally executes reclustering operations on selected subgraphs of the global augmented document tree. The subgraphs are implied by significant changes in the workload. As the workload changes, PIXSAR incrementally adjusts the XML data layout so as to better fit the workload. PIXSAR's main parameters are the radius, in pages, of the augmented portion to be reclustered and the way reclustering is triggered. We also adapted PIXSAR to operate on a physical disk. Further we suggest novel architectures that utilize PIXSAR in multi-level storage. We propose a specific algorithm for one of these architectures. We use an experimental data clustering system that includes a disk simulator and a File System simulator for storing native XML data. We mainly consider static files but also explain how dynamically updated files can be handled. We use a novel method for 'exporting' the Saxon query processor into our setting. Experimental results indicate that using PIXSAR significantly reduces the number of page faults (counting ALL page faults incurred while querying the document as well as those during maintenance operations) thereby resulting in improved query performance (often by 20%-40%). We also conducted extensive experiments on a physical disk. Interestingly, the improvements observed on the real disk are much larger (often by more than 50%) than those predicted by simulations. In addition to the XML augmented tree, there are also indices. The kind of index we consider is based on a XPath expression and it consists of index entries pointing to XML target nodes. Using such index entries one jumps directly to target nodes. Often, target XML nodes are accessed in temporal proximity and hence, for paging reasons, it is beneficial to store them on the same disk page. In other cases, such temporal proximity is absent and hence co-storing is not optimal. Designing an algorithm that views the XML data and indices as a sibling augmented tree with multiple roots (the additional roots correspond to indices) is complex. An extension to the PIXSAR algorithm, called iPIXSAR, is proposed. It extends PIXSAR so as to make storing decisions of target XML nodes based on possible membership in more than one tree. Experimental results show that in the presence of indices iPIXSAR is superior to PIXSAR by at times up to 8%.
CopyrightThe above paper is copyright by the Technion, Author(s), or others. Please contact the author(s) for more information

Remark: Any link to this technical report should be to this page (, rather than to the URL of the PDF files directly. The latter URLs may change without notice.

To the list of the MSC technical reports of 2009
To the main CS technical reports page

Computer science department, Technion