PhD Thesis: Abstract

Formulating constraints and validating RDF data against them is a common requirement and a much sought-after feature, particularly because this is taken for granted in the XML world. RDF validation has recently gained momentum as a research field because data practitioners from a variety of domains share the same needs. Several languages for formulating constraints and validating RDF data exist or are currently being developed. Yet there is no clear favorite, and none of these languages meets all the requirements raised by data professionals. Further research on RDF validation and the development of constraint languages is therefore needed.

There are different types of research data and related metadata. Because suitable RDF vocabularies are lacking, however, only a few of them can be expressed in RDF. We have developed three missing vocabularies in order to represent all types of research data and their metadata in RDF and to validate RDF data against constraints that can be extracted from these vocabularies.

Data providers in many domains still represent their data in XML, but expect to increase the quality of their data by using common RDF validation tools. We propose a general approach for validating XML directly against semantically rich OWL axioms. These axioms are used as constraints and are extracted from XML Schemas that adequately represent particular domains, so that no manual effort is needed to define constraints.

We have published a set of constraint types that are required by diverse stakeholders for data applications and that form the basis of this thesis. Each constraint type, from which concrete constraints are instantiated to be checked on the data, corresponds to one of the requirements derived from case studies and use cases provided by various data institutions. We use these constraint types to gain a better understanding of the expressiveness of solutions, to investigate the role that reasoning plays in practical data validation, and to give directions for the further development of constraint languages.
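The relation between a constraint type and the concrete constraints instantiated from it can be illustrated with a minimal sketch. It assumes a single constraint type (minimum cardinality), triples held as plain Python tuples, and made-up identifiers such as `ex:DataSet`; the class and function names are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MinCardinality:
    """Hypothetical constraint type; each instance is one concrete constraint."""
    target_class: str
    prop: str
    minimum: int


def violations(constraint, triples):
    """Return instances of target_class with fewer than `minimum` values for prop."""
    instances = sorted({s for s, p, o in triples
                        if p == "rdf:type" and o == constraint.target_class})
    # Count how many values each subject has for the constrained property.
    counts = Counter(s for s, p, _ in triples if p == constraint.prop)
    return [s for s in instances if counts[s] < constraint.minimum]


# Illustrative data only (assumption, not from the evaluated data sets).
DATA = [
    ("ex:ds1", "rdf:type", "ex:DataSet"),
    ("ex:ds1", "dcterms:title", "Survey 2014"),
    ("ex:ds2", "rdf:type", "ex:DataSet"),
]

# Concrete constraint: every ex:DataSet needs at least one dcterms:title.
title_required = MinCardinality("ex:DataSet", "dcterms:title", 1)
print(violations(title_required, DATA))
```

In this sketch the constraint type fixes what is checked, while each instance supplies the class, property, and threshold to check it on.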

We introduce a validation framework that makes it possible to execute RDF-based constraint languages consistently on RDF data and to formulate constraints of any type in such a way that mappings from high-level constraint languages to an intermediate generic representation can be created straightforwardly. The framework reduces the representation of constraints to the absolute minimum, is based on formal logics, and consists of a very simple conceptual model with a small, lightweight vocabulary. We demonstrate that using another layer on top of SPARQL ensures consistency regarding validation results and enables constraint transformations for each constraint type across RDF-based constraint languages.
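The idea of an intermediate generic representation on top of SPARQL can be sketched as follows. This is an illustration under stated assumptions, not the framework's actual vocabulary: `GenericConstraint`, the two input formats, and all function names are hypothetical, and only one constraint type (minimum cardinality) is covered. Two differently shaped high-level descriptions are mapped to the same generic form, which is then rendered as one SPARQL 1.1 query that selects violating resources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenericConstraint:
    """Hypothetical intermediate form of a minimum-cardinality constraint."""
    target_class: str  # full IRI of the constrained class
    prop: str          # full IRI of the constrained property
    minimum: int


def from_shape_style(shape):
    """Map a hypothetical shape-like description (a dict) to the generic form."""
    return GenericConstraint(shape["targetClass"], shape["path"], shape["minCount"])


def from_axiom_style(axiom):
    """Map a hypothetical axiom-like description (class, property, n) to the generic form."""
    cls, prop, n = axiom
    return GenericConstraint(cls, prop, n)


def to_sparql(c):
    """Render the generic constraint as a SPARQL 1.1 query selecting violators."""
    return (
        "SELECT ?s WHERE {\n"
        f"  ?s a <{c.target_class}> .\n"
        f"  OPTIONAL {{ ?s <{c.prop}> ?o }}\n"
        "}\n"
        "GROUP BY ?s\n"
        f"HAVING (COUNT(?o) < {c.minimum})"
    )


# The same constraint written in two assumed high-level styles.
shape = {
    "targetClass": "http://example.org/DataSet",
    "path": "http://purl.org/dc/terms/title",
    "minCount": 1,
}
axiom = ("http://example.org/DataSet", "http://purl.org/dc/terms/title", 1)

query_a = to_sparql(from_shape_style(shape))
query_b = to_sparql(from_axiom_style(axiom))
print(query_a)
```

Because both inputs reduce to the same generic constraint, they produce the identical query, which is one way to read the claim that an intermediate layer keeps validation results consistent across languages.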

We evaluate the usability of constraint types for assessing RDF data quality by collecting and classifying constraints on common vocabularies and by validating 15,694 data sets (4.26 billion triples) of research data against these constraints. Based on this large-scale evaluation, we formulate several findings to direct the future development of constraint languages.