More About PQL

 
   

Data Model

PQL works on property graphs. A property graph is a graph where both nodes and edges are augmented with properties, arbitrary key/value data. Property keys (names) and values areatoms (integers, strings, etc.) or unordered multisets of atoms. Currently, property values may not be tuples or tuple-sets; however, whether this should be allowed or not is a periodic topic of discussion and the status quo is likely to shift in the future.

Edges are understood to have names (or labels); these are the elements in path patterns. The name of an edge is a designated (over a whole graph) property, not a separate value.

PQL is designed to operate on dynamically typed (or "semistructured") graph databases with no formal schema. Developing an adequately expressive but still tractable formulation for schemas on graph databases is a topic of intended future research.

Query Model

Queries in PQL are based on the concept of path: a route through a graph, starting at some node, crossing some sequence of graph edges, and ending at some (probably different) node. Apath expression is a a piece of language syntax that, when evaluated and applied to a database, matches some collection of physical paths in the database. A path expression can containbinders, syntax for extracting elements from the database as found during path matching and binding them to variables in the query. The query can then compare these elements or inspect their properties.

Path expressions in PQL are fully general regular expressions over edge names. For example, the query

   select M[name]
   from nodes(Genealogy) as P,
        P.mother+:M
   where P[name] = "John Harvard";
retrieves the names of all female-line ancestors of John Harvard from a hypothetical genealogy database. The path expression P.mother+ follows one or more mother edges starting at the node P, and the syntax :M binds the variable M to the node found at that point. While many other graph languages allow simple transitive closures as in this query, more complex examples that use more elaborate regular expressions, such as
   select F[name]
   from nodes(Genealogy) as P,
        P(.mother|.stepmother)+(.father|.stepfather):F
   where P[name] = "John Harvard";
(which follows either mothers or stepfathers and then takes a final step of father or stepfather) cannot be expressed in most graph languages. To our knowledge, no other comparable query language (even Cypher) supports fully general regular expressions. This query (which retrieves ancestors found by alternating fathers and mothers) requires a complex repeating pattern:
   select A[name]
   from nodes(Genealogy) as P,
        P(.father?)(.mother.father)*(.mother?):A
   where P[name] = "John Harvard";
and this query (which does the same but retrieves the names of all the mothers) requires binders inside the repeating pattern as well as a subquery:
   select (
      select X[name]
      from (case MZ:
		missing: M,
		MZ0: M union all { MZ0 }
	    ;
      ) as X
   )
   from nodes(Genealogy) as P,
        P(.father?)(.mother:M.father)*(.mother:MZ?)
   where P[name] = "John Harvard";

When binding intermediate values within a path pattern is not enough, paths themselves can be extracted and handled in arbitrary ways. Paths thus discovered are first-class values that can be compared, inspected, or returned as query results. This allows querying the database structure as well as the database contents. It also allows easy expression of various sorts of queries that might otherwise be awkward, difficult, or even impossible without external programmatic support. This query, for example, accesses a hypothetical PQL interface to IMDBto evaluate Bacon numbers, but customizes the computation by excluding a couple well-known movies from the search.

   select A2[name], min(length(P))/2 + 1
   from
      nodes(IMDB) as A1,
      A1(.actor-of:M.actor, M[name] as MN)*:A2 as P
   where
      A1[type] = "actor"
      and A1[name] = "Kevin Bacon"
      and "Ishtar" <> all MN
      and "Waterworld" <> all MN
   group by A2
The syntax :M binds the variable M to each intermediate result at that point in the path; in this case, a movie between two actors. The additional attribute lookup within the repeat block causes MN to be bound to the name of each M. This allows excluding certain movies from the computation. Meanwhile, the syntax as P after the path expression binds the variable P to the whole path matched by the pattern. Then the length of this (suitably adjusted) gives the Bacon number.

Query Syntax

The general syntax (as seen in the examples above) is similar to SQL. The primary difference is that the from-clause contains paths rather than relations. There are some other superficial differences, such as support for proper string constants.

The path syntax was chosen to resemble the structure or record lookup syntax of ordinary programming languages, except extended with regular expression operators. (This choice has its origins in PQL's earlier domain of semistructured object data.) The syntax .actor follows a single edge called "actor"; the syntax .actor* crosses zero or more such edges. As in Unix regular expressions, the + and ? operators match one or more, or zero or one, edge respectively, and the | operator handles alternatives. The syntax .actor|.director crosses either an "actor" or a "director" edge.

For more information, including the formal grammar, see the language reference.

History

PQL originally started out as a reimplementation of Lorel, the query language developed for the Lore semistructured database project back in the 90s. In the course of developing PASS, we decided we needed a query language for provenance. Because provenance is graph-structured, existing relational or XML-based tools proved inadequate, so we went looking for alternatives. We originally intended to reimplement the published description of Lorel. However, we ran into some difficulties and in the course of many significant revisions over many years diverged a great deal. PQL now works with property graphs, and it contains a number of things Lorel did not, most notably full regular expressions including complex repeating sections.