Gprom

GProM

GProM is a generic provenance database middleware that computes provenance for SQL queries, updates, and transactions on demand.

Provenance tracking for database operations, i.e., automatically collecting and managing information about the origin of data, has received considerable interest from the database community in the last decade. Efficiently generating and querying provenance is essential for debugging data and queries, evaluating trust measures for data, defining new types of access control models, auditing, and as a supporting technology for data integration and probabilistic databases. The de-facto standard for database provenance is to model provenance as annotations on data and compute the provenance for the outputs of an operation by propagating annotations. Many provenance systems use a relational encoding of provenance annotations. These systems apply query rewrite techniques to transform a query q into a query that propagates input annotations to produce the result of q annotated with provenance. This approach has many advantages. It benefits from existing database technology, e.g., provenance computations are optimized by the database optimizer. Queries over provenance can be expressed as SQL queries over the relational encoding. Alternatively, we can compile a special-purpose provenance query language into SQL queries over such an encoding. In this project we advance the current state-of-the-art in several aspects:

We have developed the first provenance model and capture mechanism for transactional updates.
We achieve interoperability with other provenance systems by supporting import and export of provenance represented as PROV-JSON
We have developed a cost-based optimizer for provenance instrumentation pipelines.
We have build GProM, a generic provenance middleware that supports multiple frontend languages and backend databases.

GProM is a database middleware that adds provenance support to multiple database backends. Provenance is information about how data was produced by database operations. That is, for a row in the database or returned by a query we capture from which rows it was derived and by which operations. The system compiles declarative queries with provenance requests into SQL code and executes this SQL code on a backend database system. GProM supports provenance capture for SQL queries and transactions, and produces provenance graphs explaining existing and missing answers for Datalog queries. Provenance is captured on demand by using a compilation technique called instrumentation. Instrumentation rewrites an SQL query (or past transaction) into a query that returns rows paired with their provenance. The output of the instrumentation process is a regular SQL query that can be executed using any standard relational database. The instrumented query generated from a provenance request returns a standard relation that maps rows to their provenance. Provenance for transactions is captured retroactively using a declarative replay technique called reenactment that we have developed at IIT. GProM extends multiple frontend languages (e.g., SQL and Datalog) with provenance requests and can produce code for multiple backends (currently Oracle). The reenactment approach was developed in collaboration with Oracle as part of the the provenance for temporal databases project. Other noteworthy features of GProM include: support for multiple database backends and an optimizer for rewritten queries.

System Highlights

Database independent
Provenance for queries, updates, and transactions
User decides when to compute and store provenance
Supports multiple provenance models
Retroactive on-demand provenance computation using instrumentation and reenactment

Only requires audit log and time travel
No additional runtime and storage overhead

Non-invasive provenance computation using query instrumentation and annotation propagation
Provenance import/export
Heuristic and Cost-based optimizations for instrumented queries

Architecture and Approach

An overview of GProM's architecture is shown above. The user interacts with the system using an extension of one of the supported frontend languages (currently SQL and Datalog). Specifically, we support new language constructs for capturing and managing provenance (similar to Perm). Incoming statements are translated into a relational algebra graph representation which we call (1). Similar to intermediate code representations used by compilers, we use relational algebra as an intermediate representation of computation which is independent of the target language. If the statement does not use any provenance features, then the algebra graph is translated back into the declarative language understood by the backend, e.g., we support several native SQL dialects using vendor specific SQL code generators (7). If the input query uses one of the provenance extensions supported by our system, then several instrumentation modules may get involved to serve the user request.

Provenance Computation

Similar to Perm (and other systems) we represent provenance information using a relational encoding of provenance annotations. This representation is flexible enough to encode typical database provenance models including PI-CS (and, thus, provenance polynomials), Where- and Why-provenance, and many others. The provenance rewriter module (2) uses provenance-type specific rules to rewrite an input query $q$ into a query $q^+$ that propagates annotations to produce an encoding of data annotated with provenance (we call this process instrumentation.

Supporting Past Queries, Updates, and Transactions

One unique feature of GProM is that the system can retroactively compute the provenance of queries, updates, and transactions. This feature requires that a log of database operations is available (we call this an audit log) and that the underlying database system supports time travel, i.e., querying past versions of a relation. These features are available in most database systems or can be added using extensibility mechanisms. An audit log paired with time travel functionality is sufficient for computing the provenance of past queries using simple modifications of standard provenance rewrites. Our main contribution is to demonstrate that this is also sufficient for tracking the provenance of updates and transactions. If the user requests provenance for a transaction $T$ , the transaction reenactor (3) extracts the list of SQL statements executed by $T$ from the audit log and constructs a reenactment query $q(T)$ that simulates the effects of these statements. This query runs over the database version valid at the time when the transaction was executed.

We use the provenance rewriter to rewrite $q(T)$ into a query $q(T)^+$ that computes the provenance of the reenacted transaction. Note that the construction of $q(T)$ is independent of the provenance rewrite and $q(T)$ is a standard relational query. This is because the reenactment query $q(T)$ and transaction $T$ are annotation-equivalent, i.e., they have the same result and provenance. Using this approach, we can compute any type of provenance for updates, transactions, and across transactions as long as rewrite rules for computing the provenance of queries have been implemented for this provenance type.

Optimizing Rewritten Queries

GProM includes an optimizer (7) which applies heuristic and cost-based rules to transform instrumented queries into SQL code that can be successfully optimized by the backend DBMS. This is necessary, because provenance rewrites generate queries with unusual access patterns and operator sequences. Even sophisticated database optimizers are not capable of producing reasonable plans for such queries.

Database Backends

Support for additional database backends can be added to GProM by implementing new parser, catalog lookup, and SQL code generator plugins. Here we benefit from our backend-independent relational algebra graph representation of queries, because all the remaining functionality, e.g., provenance computation, works on the database-independent algebraic representation of queries.

Provenance Language Extensions

The wiki of the github repository for GProM documents the SQL and Datalog frontend language extensions.

Collaborators

Dieter Gawlick - Oracle
Oliver Kennedy - SUNY Buffalo
Vasudha Krishnaswamy - Oracle
Venkatesh Radhakrishnan
Zhen Hua Liu - Oracle

Publications

Heuristic and Cost-based Optimization for Diverse Provenance Tasks
Xing Niu, Raghav Kapoor, Boris Glavic, Dieter Gawlick, Zhen Hua Liu, Vasudha Krishnaswamy and Venkatesh Radhakrishnan
IEEE Transactions on Knowledge and Data Engineering. 31, 7 (2019) , 1267–1280.

@article{NK18,
  author = {Niu, Xing and Kapoor, Raghav and Glavic, Boris and Gawlick, Dieter and Liu, Zhen Hua and Krishnaswamy, Vasudha and Radhakrishnan, Venkatesh},
  doi = {10.1109/TKDE.2018.2827074},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  keywords = {Provenance; Optimization; GProM},
  longversionurl = {https://arxiv.org/pdf/1804.07156.pdf},
  number = {7},
  pages = {1267--1280},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/NX18.pdf},
  projects = {GProM},
  title = {Heuristic and Cost-based Optimization for Diverse Provenance Tasks},
  venueshort = {TKDE},
  volume = {31},
  year = {2019}
}