Deep Learning Sentiment Analysis for Movie Reviews using Neo4j

Monday, September 15, 2014

While the title of this article references Deep Learning, it's important to note that the process described below is more of a deep learning metaphor into a graph-based machine learning algorithm. No neural networks are used.

Sentiment analysis uses natural language processing to extract features of a text that relate to subjective information found in source materials.

Movie Review Sentiment Analysis

A movie review website allows users to submit reviews describing what they either liked or disliked about a particular movie. Being able to mine these reviews and generate valuable meta data that describes its content provides an opportunity to understand the general sentiment around that movie in a democratized way. That’s a pretty cool thing if you think about it. Using machine learning we can democratize subjectivity about anything in the world. We can make an objective analysis of subjective content, giving us the ability to better understand trends around products and services that we can use to make better decisions as consumers.

Sentiment Analysis Data Model

One of the major barriers to unlocking this ability is in the way we structure and transform our data. The current state-of-the-art methods include approaches such as Naive Bayes, Support Vector Machines, and Maximum Entropy. The challenges imposed by these approaches still remains in how features are extracted from a text and structured as data in a way that is least costly in terms of performance. I decided to focus on solving the problem of performance, in the way features are selected and extracted, and the availability of that data as the number of features grow over time.

Using a feature selection algorithm I describe here, I used the Graph Database Neo4j to solve the challenge of data transformation and availability. While the state of the art natural language parsing algorithms are focused on sentence structure, I’ve decided to pursue a statistical approach to natural language grammar induction. My approach focuses on generalizations across a vast corpus of text, generating new features using a method inspired by deep learning, which predicts features with the highest probability of being present to the left or right of a new feature.

Graph-based NLP Example

Let's assume that the phrase “one of the worst” has been extracted as a feature of a set of texts. The reason that this phrase was extracted was that a phrase that it was descended from had determined that this particular phrase was the most statistically relevant, meaning that the phrase had the best chance of being matched after the parent phrase. Using Neo4j we can determine the line of inheritance that produced this phrase as a feature.

Starting at the root node, which is captioned as “{0} {1}”, the path in which the phrase “one of the worst” will be parsed is (the)->(of the)->(one of the)->(one of the worst).

The hierarchy reveals more possibilities as you move deeper from “one of the worst”. Expanding the path seen in the image above to include all possible features that descend from the phrase “one of the worst” reveals the following:

This feature selection algorithm can select on the most statistically relevant features and phrases extracted from a corpus of text in less than a second. The reason an approach like this is extremely relevant to sentiment analysis is that these pattern nodes can be connected to the label of the text they were trained from, as seen below.

The result of this algorithm, and largely thanks to Neo4j’s graph traversals, is that any natural language text can be parsed with sub second performance and generate a subgraph to be used for whichever classification algorithm makes the most sense for your dataset.

Open Source Demo

For the movie review example I took 500 movie reviews for both negative and positive labels and trained a natural language parsing model in Neo4j using Graphify. In the next blog post in this series I will introduce you to a demo that can perform better at classifying movie reviews than a human. The human classification error being 0.3, or 70% success.

If you’re looking to get your feet wet before then, take a look at the finished demo here: Graphify Sentiment Analysis for Movie Reviews

Notes

Neo4j is an open source graph database.

Graphify is an open source extension to Neo4j that extends Neo4j to include classification-based algorithms for natural language processing.

Vote on Hacker News