PQL

download | documentation | source

PQL, or Path Query Language or "pickle", is a database query language for graph-structured data. It is a declarative or algebraic query language, like SQL, as opposed to an algorithmic or operational language. That is, a PQL query describes the data to be retrieved from a graph database, rather than a procedure for retrieving it. It works by matching paths in the graph against path patterns provided in a query.

PQL works on property graphs, and is most suitable for OLTP-type queries. A property graph is a graph where the nodes and edges are augmented with arbitrary key/value data, or properties. A designated edge property is interpreted as the name or label of the edge; PQL's path patterns match these. While we have always wanted to make PQL workable for graph analytics as well as OLTP-type queries, so far it is not expressive enough to represent some common graph analyses; furthermore, transforming an algebraic representation of the results of graph analyses to efficient executable graph algorithms is an open research problem.

Key Features

  • Query on property graphs
  • Query by path matching
  • Path patterns are fully general regular expressions over edge labels
  • (This implies support for transitive closure queries)
  • Path patterns can bind and extract nodes and edges for examination
  • Path patterns can bind anywhere, including inside regexp operators
  • Path patterns can branch, including inside regexp operators
  • Path patterns can extract first-class path values
  • Support for optionally or conditionally-present properties
  • Full support for grouping and aggregation, including ungroup
  • Full support for subqueries
  • Support for relational joins between paths, or of extracted data
  • Support for multiple graphs in a database
  • Support for database-level installable types and functions
  • Approximate type-checking even against a dynamically-typed database
  • Language support for tuples, lists, and (soon) maps
  • Queries can be parameterized rather than assembled by string substitution
  • SQL-like clause syntax

Some Things PQL Isn't

  • A tool for storing and retrieving whole graphs from a database of graphs
  • A tool for computing transformations on graphs

Some Related Work

The Neo4j query language Cypher is rather similar in general intent and style to PQL. PQL has been around longer and remains more powerful, but at this point the differences are rather technical. PQL's path patterns are more general; also, Cypher doesn't support subqueries except in a limited sense and for some reason lacks support for intersection and set difference. However, both support first-class paths, which is the most significant feature for expressive power.

Implementation

PQL is based on a mapping from path patterns to nested relational algebra extended with transitive closure. This allows much of the implementation to use off-the-shelf technology; however, the translation is not entirely trivial (or perhaps we still haven't gotten it framed quite right), and so the implementations have been fairly expensive. PQL has also been under development for a long time, and like many projects has been subject to shifting goals and requirements.

The current version of PQL is intended to be the native high-level query language for the GOAT database project. The current implementation, accordingly, is a compiler that outputs GOAT's low-level procedural query language. This compiler is known as pqqolo and is (perhaps regrettably) written as nearly 40,000 lines of Haskell and likely to grow. It has a join planner designed specifically for dealing with path queries (which generate very large systems of joins) but otherwise so far does only the most basic query optimization; also several fairly large sections are not fully implemented yet. As of December 2014pqqolo is now also a Cypher compiler: the new Cypher frontend does not work yet but is largely complete.

An earlier implementation called PQLBRL ("pickle barrel", the PQL Base Reference Library) was written as a query processing engine suitable for plugging into arbitrary storage backends and arbitrary query submission frontends. This implementation was developed for querying provenance data as part of the PASS (Provenance Aware Storage Systems) project. It is written in about 40,000 lines of C. PQLBRL implements an older version of PQL with different syntax and even a different data model (more like RDF instead of property graphs); the implementation is more or less complete but not efficient and has a lengthy bug list. Updating it for later fairly major revisions to the language did not seem worthwhile.

Project Members

Documentation

If you are interested, please pester dholland to post an updated version of the language specification. The last version available is from 2009 and corresponds to the last release version of PQLBRL, which is completely out of date. However, it can be had here.

(A reasonably current specification exists, but needs to be edited/checked over before it can be posted.)

Publications

For various reasons, for a long time no PQL paper got published; among other things, PQL got started on a bad semantic footing and sorting that out was a slow process. Then, when Cypher appeared, and as Cypher grew more complete and more powerful, the amount of publishable novel material in PQL dwindled. While there are still some points (most notably, PQL's path patterns are more expressive) we eventually decided publishing a paper wasn't worthwhile.

Source Code

The pqqolo implementation is not released yet. If you would like to try it, contact dholland.

The last release of the PQLBRL implementation was 0.4.1. You can download or browse the source. The last development version is substantially newer than 0.4.1; you can check it out via Mercurial. To check out a copy:

   hg clone http://www.eecs.harvard.edu/~dholland/hg/pqlbrl/

You can also browse the source.

Acknowledgements

Source of Support: NSF
CSR-PDOS (IIS-0849392)
Proposal Number: 0849392
Title: SGER: PQL: A Path Query Language