Using Apache Spark and Neo4j for Big Data Graph Analytics

Monday, November 3, 2014

As engineers, when we think about how to solve big data problems, evaluating technologies becomes a choice between scalable and not scalable. Ideally we choose the technologies that can scale to a variety of business problems without hitting a ceiling down the road.

Database technologies have evolved to be able to store big data, but are largely inflexible. The data models require tedious transformations and shuffling around of data. This is a complex process that is compounded in its complexity by combining a variety of inflexible solutions and platforms.

Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.

Still, where does all that data come from? Where does it go when the analysis is done?

Graph databases


I've been working with the Neo4j graph database for the last few years and I have yet to become jaded by its powerful ability to combine both data transformation and data analysis. Graph databases like Neo4j are solving analytical problems that relational databases struggle to solve in a flexible way.

Graph processing at scale from a graph database like Neo4j is a tremendously valuable power.

But if you wanted to run PageRank on a dump of Wikipedia articles in less than 2 hours on a laptop, you'd be hard pressed to be successful. More so, what if you wanted the power of a high-performance transactional database that seamlessly handled graph analysis at this scale?

Mazerunner for Neo4j


Mazerunner is a Neo4j unmanaged extension and distributed graph processing platform that extends Neo4j to do big data graph processing jobs while persisting the results back to Neo4j.

Mazerunner uses a message broker to distribute graph processing jobs to Apache Spark's GraphX module. When an agent job is dispatched, a subgraph is exported from Neo4j and written to Apache Hadoop HDFS.

After Neo4j exports a subgraph to HDFS, a separate Mazerunner service for Spark is notified to begin processing that data. The Mazerunner service will then start a distributed graph processing algorithm using Scala and Spark's GraphX module. The GraphX algorithm is serialized and dispatched to Apache Spark for processing.

Once the Apache Spark job completes, the results are written back to HDFS as a Key-Value list of property updates to be applied back to Neo4j.

Neo4j is then notified that a property update list is available from Apache Spark on HDFS. Neo4j batch imports the results and applies the updates back to the original graph.

Example


Neo4j ships with a sample movie dataset and I'll use that for this example. The graph data model looks like this:



If we run a transformation on this model and create relationships between people who acted in the same movie, we can use the resulting graph to determine which actors are most valuable to work with.

The Cypher query to do this looks like this:
MATCH (a1:Person)-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActors)
CREATE (a1)-[:KNOWS]->(coActors);
Now our model looks like this:


By doing this, we create a direct link between actors that can be used for a PageRank analysis of the most valuable actors to work with.

To start the PageRank job on Apache Spark, Mazerunner extends an endpoint to start the processing job.
http://localhost:7474/service/mazerunner/pagerank
Hitting this endpoint will run the PageRank algorithm on actors who know each other and then write the results back into Neo4j as a weight property on the Person nodes.

When the job finishes then we can find the top ten most valuable actors to work with. Running a Cypher query against this dataset produces the following results:

neo4j-sh (?)$ MATCH n WHERE HAS(n.weight) RETURN n ORDER BY n.weight DESC LIMIT 10;
+-----------------------------------------------------------------------+
| n                                                                     |
+-----------------------------------------------------------------------+
| Node[71]{name:"Tom Hanks",born:1956,weight:4.642800717539658}         |
| Node[1]{name:"Keanu Reeves",born:1964,weight:2.605304495549113}       |
| Node[22]{name:"Cuba Gooding Jr.",born:1968,weight:2.5655048212974223} |
| Node[34]{name:"Meg Ryan",born:1961,weight:2.52628473708215}           |
| Node[16]{name:"Tom Cruise",born:1962,weight:2.430592498009265}        |
| Node[19]{name:"Kevin Bacon",born:1958,weight:2.0886893112867035}      |
| Node[17]{name:"Jack Nicholson",born:1937,weight:1.9641313625284538}   |
| Node[120]{name:"Ben Miles",born:1967,weight:1.8680986516285438}       |
| Node[4]{name:"Hugo Weaving",born:1960,weight:1.8515582875810466}      |
| Node[20]{name:"Kiefer Sutherland",born:1966,weight:1.784065038526406} |
+-----------------------------------------------------------------------+
10 rows

PageRank on Wikipedia Dump


A small movie dataset is cute, but if we're talking about scale, we need a dataset that is at a massive scale. I took a dump of Wikipedia and imported it into Neo4j using Graphipedia

The resulting graph of hyperlink references connecting articles to each other resulted in a massive graph in Neo4j.

Nodes: 10,985,338
Relationships: 104,673,778
Database size: 11,002 MB

It took a little less than 3 hours for Mazerunner to run PageRank on 11 million Wikipedia articles on my laptop.

Here are the top 100 results of that analysis:

+-------------------------------------------------------------+
| title                                  | weight             |
+-------------------------------------------------------------+
| "United States"                        | 13481.465434735326 |
| "France"                               | 5866.673240306334  |
| "Animal"                               | 5665.755894233272  |
| "Country"                              | 5422.733465218685  |
| "County"                               | 5097.065153505375  |
| "World War II"                         | 4903.803574507514  |
| "Germany"                              | 4758.829204613294  |
| "United Kingdom"                       | 4538.534060897834  |
| "Russia"                               | 4328.2166400665055 |
| "English"                              | 4220.257198573251  |
| "India"                                | 3956.6896584623237 |
| "England"                              | 3895.7939536781337 |
| "Japan"                                | 3665.9152429486285 |
| "Canada"                               | 3657.4749257843423 |
| "Australia"                            | 3595.941069092141  |
| "Iran"                                 | 3482.6418104876884 |
| "Province"                             | 3400.887300974972  |
| "China"                                | 3239.811469195496  |
| "American"                             | 3192.7699175209023 |
| "French"                               | 3182.4457685761104 |
| "The New York Times"                   | 3181.5839069249864 |
| "Arthropod"                            | 3097.8173315850213 |
| "Italy"                                | 3032.3644182841776 |
| "German"                               | 3012.815858491087  |
| "London"                               | 2965.4743930058657 |
| "Insect"                               | 2869.920827785085  |
| "District"                             | 2784.755324500963  |
| "Spain"                                | 2668.2978896378804 |
| "New York City"                        | 2650.752385794727  |
| "Latin"                                | 2584.6877541535127 |
| "Poland"                               | 2575.2112457023773 |
| "World War I"                          | 2573.8477947401866 |
| "Soviet Union"                         | 2477.694486425502  |
| "Catholic Church"                      | 2474.683318006141  |
| "New York"                             | 2468.8443657519097 |
| "Brazil"                               | 2435.2531424225617 |
| "Romania"                              | 2393.1873377303928 |
| "Europe"                               | 2353.968692909295  |
| "California"                           | 2266.6608579578156 |
| "Lepidoptera"                          | 2158.715841127394  |
| "State"                                | 2132.243578231637  |
| "Italian"                              | 2106.6553548486086 |
| "Greek"                                | 2028.1563330073805 |
| "Switzerland"                          | 2027.8234786567634 |
| "Paris"                                | 1970.357297581375  |
| "Chordata"                             | 1950.4239922246595 |
| "Spanish"                              | 1926.201753857372  |
| "New Zealand"                          | 1828.6683122383472 |
| "Mexico"                               | 1828.0197503654385 |
| "British"                              | 1811.7648105938251 |
| "Washington, D.C."                     | 1810.533732133272  |
| "Netherlands"                          | 1803.7312379217158 |
| "Scotland"                             | 1776.2112626996402 |
| "National Register of Historic Places" | 1765.6396180107192 |
| "Sweden"                               | 1750.4933182628488 |
| "USA"                                  | 1708.9086633673999 |
| "Chordate"                             | 1659.5183460486119 |
| "Ireland"                              | 1630.2162859072075 |
| "United States Census Bureau"          | 1590.6841804100668 |
| "Turkey"                               | 1571.7895450758779 |
| "South Africa"                         | 1552.1113319282647 |
| "Belgium"                              | 1548.2393303043832 |
| "Austria"                              | 1517.0098756018515 |
| "Los Angeles"                          | 1487.3395876804987 |
| "Region"                               | 1455.4554257569105 |
| "Municipality"                         | 1422.385733793246  |
| "Greece"                               | 1415.3614936831354 |
| "CET"                                  | 1393.056892061773  |
| "Norway"                               | 1387.064297388938  |
| "Chicago"                              | 1386.6868499051984 |
| "European Union"                       | 1382.571030912079  |
| "President"                            | 1376.0926002579163 |
| "Philippines"                          | 1371.3808757708794 |
| "BBC"                                  | 1362.697735274599  |
| "White"                                | 1329.6229862450132 |
| "Voivodeship"                          | 1325.3419002294163 |
| "African American"                     | 1316.08289317503   |
| "New York Times"                       | 1305.9732797320457 |
| "Israel"                               | 1289.4941308687928 |
| "United Nations"                       | 1275.3410092372967 |
| "Rome"                                 | 1268.7924011211794 |
| "Gmina"                                | 1264.8386282950032 |
| "CEST"                                 | 1263.3655294822495 |
| "Canadian"                             | 1257.0066905218277 |
| "Argentina"                            | 1256.590368894054  |
| "Dutch"                                | 1253.7080447443823 |
| "Angiosperms"                          | 1247.3964328063778 |
| "Taiwan"                               | 1244.5572462360428 |
| "Native Americans"                     | 1231.9646926767173 |
| "Egypt"                                | 1230.2800089879247 |
| "Chinese"                              | 1226.417515716627  |
| "North America"                        | 1226.211997093438  |
| "Indonesia"                            | 1221.68337046196   |
| "Oxford University Press"              | 1205.4955933795306 |
| "Roman Catholic"                       | 1205.3842960845823 |
| "Islam"                                | 1200.4126147456059 |
| "Pakistan"                             | 1199.2589818906795 |
| "Jews"                                 | 1187.7942956454976 |
| "Texas"                                | 1187.3229003531255 |
| "The Guardian"                         | 1185.9922865148167 |
+-------------------------------------------------------------+

Try it out


Mazerunner is currently in its alpha stages of development. For this initial release PageRank is the only available graph processing algorithm.

Head over to the GitHub repository to keep track of the progress of the project as it moves forward to its 1.0.0 release.

Vote on Hacker News