Effective reorganization and self-indexing of big semantic data

  1. Hernández Illera, Antonio
Supervised by:
  1. Miguel Á. Martínez Prieto, Director
  2. Javier David Fernández García, Co-director

Defence university: Universidad de Valladolid

Defence date: 09 November 2020

Committee:
  1. Miguel Rodríguez Penabad, Chair
  2. Aníbal Bregón Bregón, Secretary
  3. Raquel Trillo Lado, Committee member

Type: Thesis

Abstract

The classic Web infrastructure used to publish, consume, and exchange content can also host raw data so that machines can access and process it. This so-called Web of Data has grown exponentially in recent years, weaving its own net of online, connected datasets that use RDF as a common language and a bridge between them. The sheer amount of generated RDF data results in huge collections, opening the door to various lines of research, including RDF data compression, which optimizes storage and streamlines data exchange. In contrast to universal compressors, RDF compression techniques detect and exploit RDF-specific forms of redundancy, both syntactic and semantic. To date, however, little attention has been paid to the structural regularities that real-world datasets follow and that constitute another source of redundancy. In this thesis we analyze the structural redundancy inherent in the RDF graph and propose a preprocessing technique called RDF-Tr (RDF Triples Reorganizer), which groups, reorganizes, and re-codes RDF triples, alleviating two sources of structural redundancy underlying the schema-relaxed nature of RDF. We integrate RDF-Tr into two of the main state-of-the-art RDF compressors, HDT and k2-triples, and name the results HDT++ and k2-triples++, respectively. In both cases, the size achieved by the original compressor is significantly reduced, thus outperforming the most prominent state-of-the-art techniques.

RDF is supported by a whole set of semantic technologies that allow, among other things, access to data in large RDF collections through SPARQL, its SQL-like query language. In the field of RDF compression, different configurations of compact data structures are used to build RDF self-indexes, which provide efficient access to the data without (partial or total) decompression. The indexed HDT (called HDT-FoQ) was the pioneer in this scenario and is nowadays used by the semantic community to publish and consume large RDF data collections. Accordingly, we extend HDT++ into an indexed version, iHDT++, that supports full SPARQL Triple Pattern resolution while consuming less memory than its counterpart. We show that iHDT++ needs 20-45% less space than HDT-FoQ while speeding up the resolution of most Triple Pattern queries, reporting space-time tradeoffs that compete with and, in different scenarios, outperform state-of-the-art RDF self-indexes.