What Graph Analysis of Wikipedia Tells Us About the Relevancy of Recent Knowledge

Sunday, December 7, 2014

The chart below was generated using data analyzed with a Neo4j Graph Database and Apache Spark GraphX. 10.9 million Wikipedia articles and 110 million hyperlinks were analyzed to produce a PageRank and Triangle Count for each node in the graph. The Triangle Count metric is a measure of clustering, while the PageRank metric is a measure of relevancy.

Knowledge moves forward in time

Every year through 1850—2012 on the X-axis represents a Wikipedia page that describes historical events and facts about that calendar year. Link analysis was performed on the inbound and outbound hyperlinks for each page and all other pages in the graph that contribute to that page's relevancy.

The chart describes a probability distribution over time. This distribution indicates that if a person were to randomly click hyperlinks starting from any page on Wikipedia, the person would move towards articles with a higher closeness centrality to Category:Year pages occurring later in the timeline.

When it comes to our collective human knowledge, as time moves forward, distant history becomes inversely relevant to more recent events in our timeline.

To see this pattern you can click and drag areas of the chart to zoom in. You'll notice the pattern is local as well as global.

Why is the year 2000 so relevant?

Wikipedia, the world's largest encyclopedia of human knowledge, was first launched on January 15th, 2001.