The chart below was generated from data analyzed with the Neo4j graph database and Apache Spark GraphX. The analysis covered 10.9 million Wikipedia articles and 110 million hyperlinks, and produced a PageRank and a Triangle Count for each node in the graph. Triangle Count is a measure of clustering, while PageRank is a measure of relevancy.
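To make the two metrics concrete, here is a minimal pure-Python sketch that computes both on a tiny link graph. It is not the GraphX implementation used for the chart. The four-page `LINKS` graph, the function names, the damping factor of 0.85, and the iteration count are all assumptions chosen for illustration, and link direction is ignored when counting triangles.

```python
from itertools import combinations

# Hypothetical toy link graph (not Wikipedia data): page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank, normalized so the ranks sum to 1."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outbound links is spread evenly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        # Each page passes an equal share of its rank along every outbound link.
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


def triangle_counts(links):
    """Number of triangles passing through each node, ignoring link direction."""
    neighbors = {}
    for source, targets in links.items():
        neighbors.setdefault(source, set())
        for target in targets:
            if source != target:
                neighbors[source].add(target)
                neighbors.setdefault(target, set()).add(source)
    counts = {}
    for node, adjacent in neighbors.items():
        # A triangle exists when two neighbors of a node are also neighbors.
        counts[node] = sum(
            1 for u, v in combinations(sorted(adjacent), 2) if v in neighbors[u]
        )
    return counts


ranks = pagerank(LINKS)
triangles = triangle_counts(LINKS)
```

In this toy graph, page C collects the most inbound links and ends up with the highest rank, while A, B, and C form the only triangle and D sits outside it.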
Knowledge moves forward in time
Each year from 1850 through 2012 on the X-axis represents a Wikipedia page that describes the historical events and facts of that calendar year. Link analysis was performed on the inbound and outbound hyperlinks of each year page, together with all other pages in the graph that contribute to that page's relevancy.
The chart describes a probability distribution over time. This distribution indicates that if a person were to randomly click hyperlinks starting from any page on Wikipedia, the person would move towards articles with a higher closeness centrality to Category:Year pages occurring later in the timeline.
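The random-clicking idea can be sketched as a small simulation. This is only an illustration of the random-surfer interpretation of PageRank on a made-up graph, not the analysis behind the chart, and it does not compute closeness centrality. The `LINKS` graph, the `random_surfer` function, the 0.85 chance of following a link, and the fixed seed are all assumptions.

```python
import random
from collections import Counter

# Hypothetical toy link graph (not Wikipedia data): page -> pages it links to.
LINKS = {
    "Hub": ["A", "B", "C"],
    "A": ["Hub"],
    "B": ["Hub"],
    "C": ["Hub"],
}


def random_surfer(links, steps=100_000, damping=0.85, seed=42):
    """Return the fraction of steps a random clicker spends on each page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    current = rng.choice(pages)
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        outbound = links.get(current, [])
        if outbound and rng.random() < damping:
            current = rng.choice(outbound)  # follow a random hyperlink
        else:
            current = rng.choice(pages)     # get bored and jump to a random page
    return {page: visits[page] / steps for page in pages}


frequencies = random_surfer(LINKS)
```

Because every other page links to `Hub`, the simulated clicker spends close to half of its time there. On the real graph, the same kind of drift is what pulls a random clicker towards the better-connected pages.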
When it comes to our collective human knowledge, as time moves forward, distant history becomes less relevant relative to more recent events in our timeline.
To see this pattern, click and drag an area of the chart to zoom in. You'll notice the pattern is local as well as global.
Why is the year 2000 so relevant?
Wikipedia, the world's largest encyclopedia of human knowledge, was first launched on January 15th, 2001.
Links
- Chart source code: jsfiddle/highcharts
- In an upcoming blog post I'll walk you through launching an EC2 Spark Cluster with a Wikipedia dataset in Neo4j. Stay tuned!
- If you're interested in learning more about how I computed the graph metrics for this chart, take a look at this blog post about how to extend Neo4j to do big data analysis with Apache Spark GraphX.