My proposal, titled ParallelGraphs, was to develop a parallelized/distributed graph algorithms library. However, in the first month or so, we decided to work towards a more general framework that supports data analysis on networks (graphs with attributes defined on vertices and edges). Our change in direction was mainly motivated by:
The challenges associated with distributed graph computations. This blog post was an eye opener.
Only very large graphs, of the order of terabytes or petabytes, require distributed execution. Most useful graphs can be analyzed on a single compute node.
Multi-threading is under heavy development, and we decided to wait for the full multi-threaded programming model to be available.
As we looked at public datasets, we felt that the ability to combine graph theoretic analyses with real world data was the missing piece in Julia. LightGraphs.jl already provides fast implementations for most graph algorithms, so we decided to target graph data analysis.
The modified proposal could be summarized as the development of a package that supports:
Vertex and edge metadata : Key value pairs for vertices and edges.
Vertex labelling : Allow vertices to be referenced, externally, through arbitrary Julia types.
SQL like queries for edge data and metadata.
ParallelGraphsturned out to be a misnomer, since we were moving towards a more general purpose data analysis framework. So we chose the name
Graft, a kind of abbreviation for Graph Toolkit. The following sections detail
At first we tried placing the edge data in a SparseMatrixCSC. This turned out to be a bad idea, because sparse matrices are designed for numeric storage. A simpler solution is to store edge metadata in a DataFrame, and have a SparseMatrixCSC map edges onto indices for the DataFrame. This strategy needed a lot less code, and the benchmarks were more promising. Mutations such as the addition or removal of vertices and edges become more complicated however.
If vertex labels were used in the internal implementation, the graph data structure would probably look like this:
Dict( "Alice" => Dict( "age" => 34, "occupation" => "Doctor", "adjacencies" => Dict("Bob" => Dict("relationship" => "follow"))) ), "Bob" => Dict( "age" => 36, "occupation" => "Software Engineer", "adjacencies" => Dict("Charlie" => Dict("relationship" => "friend")) ), "Charlie" => Dict( "age" => 30, "occupation" => "Lawyer", "adjacencies" => Dict("David" => Dict("relationship" => "follow")) ), "David" => Dict( "age" => 29, "occupation" => "Athlete", "adjacencies" => Dict("Alice" => Dict("relationship" => "friend")) ) )
Cleary, using labels internally is a very bad idea. Any sort of data access would set off multiple dictionary look-ups. Instead, if a bidirectional map could be used to translate labels into vertex identifiers and back, the number of dictionary lookups could be reduced to one. The data would also be better structured for query processing.
# Label Map to resolve queries LabelMap( # Forward map : labels to vertex identifiers Dict("Alice" => 1, "David" => 4, "Charlie" => 3, "Bob" => 2), # Reverse map : vertex identifiers to labels String["Alice", "Bob", "Charlie", "David"] ) # Vertex DataFrame 4×2 DataFrames.DataFrame │ Row │ age │ occupation │ ├─────┼─────┼─────────────────────┤ │ 1 │ 34 │ "Doctor" │ │ 2 │ 36 │ "Software Engineer" │ │ 3 │ 30 │ "Lawyer" │ │ 4 │ 29 │ "Athlete" │ # SparseMatrixCSC : maps edges onto indices into Edge DataFrame 4×4 sparse matrix with 4 Int64 nonzero entries: [4, 1] = 1 [1, 2] = 2 [2, 3] = 3 [3, 4] = 4 # Edge DataFrame 4×1 DataFrames.DataFrame │ Row │ relationship │ ├─────┼──────────────┤ │ 1 │ "follow" │ │ 2 │ "friend" │ │ 3 │ "follow" │ │ 4 │ "friend" │
@querymacro is used to simplify the query syntax, and accepts a pipeline of abstractions separated by the pipe operator
|>. The stages are described through abstractions:
# Check if the user has overridden the default labels julia> (g |> eachvertex(v.id == v.label)) |> all # Kirchoff's law :P julia> (g |> eachvertex(v.outdegree - v.indegree)) .== 0
# Kirchoff's law :P julia> (g |> eachvertex(v.outdegree - v.indegree)) .== 0
sis used to denote the source vertex, and
tis used to denote the target vertex in the edge. The symbol
eis used to denote the edge itself. Edge properties can be expressed through the dot notation. Some reserved properties are
# Arithmetic expression on edge, source and target properties julia> g |> eachedge(e.p1 - s.p1 - t.p1) # Check if constituent vertices have the same outdegree julia> g |> eachedge(s.outdegree == t.outdegree) # Count the number of "mutual friends" between the source and target vertices in each edge julia> g |> eachedge(e.mutualcount)
# Count the number of "mutual friends" between the source and target vertices in each edge julia> g |> eachedge(e.mutualcount)
# Remove vertices where property p1 equals property p2 g |> filter(v.p1 != v.p2) # Remove self loops from the graph g |> filter(e.source != e.target)
# Remove self loops from the graph g |> filter(e.source != e.target)
# Preserve vertex properties p1, p2 and nothing else g |> select(v.p1, v.p2) # Preserve vertex property p1 and edge property p2 g |> select(v.p1, e.p2)
# Preserve vertex property p1 and edge property p2 g |> select(v.p1, e.p2)
The typical workflow we hope to support with
Load a graph from memory
Use the query abstractions to construct new vertex/edge properties or obtain subgraphs.
Run complex queries on the subgraphs, or export data to
LightGraphs and run computationally expensive algorithms there.
Bring the data back into
Graft as a new property, or use it to modify the graphs structure.
The following examples should demonstrate this workflow:
Google+: This demo uses a real, somewhat large, dataset with plenty of text data.
Baseball Players: Two separate datasets spliced together, a table on baseball players
and a trust network. The resulting data is quite absurd, but does a good job of showing the quantitative queries Graft can run.
Graph IO : Support more graph file formats.
Improve the query interface: The current pipelined macro based syntax has a learning curve, and the macro itself does some eval at runtime. We would like to move towards a cleaner composable syntax, that will pass off as regular Julia commands.
New abstractions, such as Group-by, sort, and table output.
Database backends : A RDBMS can be used instead of the DataFrames. Or Graft can serve as a wrapper on a GraphDB such as Neo4j.
Integration with ComputeFramework for out of core processing. Support for parallelized IO, traversals and queries.
More information can be found here