Editor’s Note: MapR products and solutions sold prior to the acquisition of such assets by Hewlett Packard Enterprise Company in 2019 may have older product names and model numbers that differ from current solutions. For information about current offerings, which are now part of HPE Ezmeral Data Fabric, please visit https://www.hpe.com/us/en/software/data-fabric.html
Original Post Information:
"authorDisplayName": "Carol McDonald",
"publish": "2015-04-09T07:00:00.000Z",
"tags": "machine-learning"
Recommendation engines help narrow your choices to those that best meet your particular needs. In this post, we’re going to take a closer look at how all the different components of a recommendation engine work together. We’re going to use collaborative filtering on movie ratings data to recommend movies. The key components are a collaborative filtering algorithm in Apache Mahout to build and train a machine learning model and search technology from Elasticsearch to simplify deployment of the recommender.
What is Recommendation?
Recommendation is a class of machine learning that uses data to predict a user's preference for or rating of an item. Recommender systems are used in industry to recommend:
- Books and other products (e.g. Amazon)
- Music (e.g. Pandora)
- Movies (e.g. Netflix)
- Restaurants (e.g. Yelp)
- Jobs (e.g. LinkedIn)
The recommender relies on the following observations:
- Behavior of users is the best clue to what they want.
- Co-occurrence is a simple basis that allows Apache Mahout to compute significant indicators of what should be recommended.
- There are similarities between the weighting of indicator scores in the output of such a model and the mathematics that underlie text retrieval engines.
- This mathematical similarity makes it possible to exploit text-based search to deploy a Mahout recommender using a search engine like Elasticsearch.
Architecture of the Recommendation Engine
The architecture of the recommendation engine is shown below:
- Movie information data is reformatted and then stored in Elasticsearch for searching.
- An item-similarity algorithm from Apache Mahout is run with user movie ratings data to create recommendation indicators for movies. These indicators are added to the movie documents in Elasticsearch.
- Searches of a user's preferred movies among the indicators of other movies will return a list of new films sorted by relevance to the user's taste.
Collaborative Filtering with Mahout
A Mahout-based collaborative filtering engine looks at what users have done historically and tries to estimate what they are likely to do in the future, given the chance. It does this by examining the history of which items users have interacted with and, in particular, how items co-occur in those histories. Co-occurrence is a simple basis that allows Apache Mahout to compute significant indicators of what should be recommended. Suppose that Ted likes movies A, B, and C, and Carol likes movies A and B. To recommend a movie to Bob, we can note that he likes movie B and that Ted and Carol, who also like movie B, both like movie A, so movie A is a possible recommendation. Of course, this is a tiny example. In real situations, we would have vastly more data to work with.
In order to get useful indicators for recommendation, Mahout’s ItemSimilarity program builds three matrices from the user history:
1. History matrix: contains the interactions between users and items as a user-by-item binary matrix.
2. Co-occurrence matrix: transforms the history matrix into an item-by-item matrix, recording which items co-occur or appear together in user histories.
In this example, movie A and movie B co-occur twice (in Ted's and Carol's histories), while movie A and movie C co-occur once (in Ted's history only). The co-occurrence matrix cannot be used directly as a source of recommendation indicators, because very common items tend to co-occur with many other items simply because they are common.
3. Indicator matrix: retains only the anomalous (interesting) co-occurrences that will serve as clues for recommendation. Some items (in this case, movies) are so popular that almost everyone likes them, so they co-occur with almost every other item, which makes those co-occurrences uninteresting (not anomalous) for recommendations. Co-occurrences that are too sparse to interpret are also not treated as anomalous and are not retained. In this example, movie A is an indicator for movie B.
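To make the first two matrices concrete, here is a minimal Python sketch that builds them for the Ted, Carol, and Bob example from the text. It is only an illustration on a toy data set, not Mahout's implementation; the variable names are made up for this example.

```python
from collections import Counter
from itertools import combinations

# User histories taken from the example in the text.
histories = {
    "Ted": {"A", "B", "C"},
    "Carol": {"A", "B"},
    "Bob": {"B"},
}

# Fixed column order for the matrices.
items = sorted(set().union(*histories.values()))

# History matrix: one row per user, one column per item, 1 = interaction.
history_matrix = {
    user: [1 if item in liked else 0 for item in items]
    for user, liked in histories.items()
}

# Co-occurrence counts: for each pair of items, the number of users
# whose history contains both of them.
cooccurrence = Counter()
for liked in histories.values():
    for a, b in combinations(sorted(liked), 2):
        cooccurrence[(a, b)] += 1

print("items:", items)
for user, row in history_matrix.items():
    print(user, row)
for pair, count in sorted(cooccurrence.items()):
    print(pair, count)
```

The raw counts produced here are the input to the third step: deciding which co-occurrences are anomalous enough to keep as indicators.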
Mahout runs multiple MapReduce jobs to calculate the co-occurrences of items in parallel (Mahout 1.0 runs on Apache Spark). Mahout’s ItemSimilarityJob uses the log-likelihood ratio (LLR) test to determine which co-occurrences are sufficiently anomalous to be of interest as indicators. The output gives pairs of items with a similarity greater than the threshold you provide.
For each item, the output of the Mahout ItemSimilarity job lists the items whose co-occurrence with it is interesting, that is, the items that indicate a recommendation. For example, the Movie B row lists Movie A, which means that liking Movie A is an indicator that you will like Movie B.
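To give a feel for the LLR test, the sketch below scores a single pair of items from a 2x2 table of counts, using a common entropy-based formulation of the log-likelihood ratio. It is a standalone illustration rather than Mahout's code; the function names and the example counts are assumptions made for this sketch.

```python
import math

def x_log_x(x):
    """Return x * ln(x), treating 0 * ln(0) as 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 contingency table.

    k11: users who interacted with both item A and item B
    k12: users who interacted with A but not B
    k21: users who interacted with B but not A
    k22: users who interacted with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))

# Hypothetical counts: items that always appear together score high,
# items that appear together no more often than chance score near zero.
print(llr(10, 0, 0, 10))
print(llr(10, 10, 10, 10))
```

A pair would be kept as an indicator only if its score exceeds the chosen threshold, which is why very popular items that co-occur with everything tend to drop out.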
Elasticsearch Search Engine
Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search engine library. Full-text search uses precision and recall to evaluate search results:
- Precision = proportion of top-scoring results that are relevant
- Recall = proportion of relevant results that are top-scoring
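As a small worked example of these two measures, the following Python sketch computes them for one hypothetical query. The document ids and the helper name are invented for illustration.

```python
def precision_recall(returned, relevant):
    """Compute precision and recall for one query.

    returned: ids of the top-scoring results the engine returned
    relevant: ids of all results that are actually relevant
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: four results returned, three relevant documents exist,
# and two of the returned results are among the relevant ones.
precision, recall = precision_recall(
    returned=["m1", "m2", "m3", "m4"],
    relevant=["m2", "m4", "m7"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```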
Elasticsearch stores documents, which are made up of different fields. Each field has a name and content. Fields can be indexed and stored to allow documents to be found by searching for content found in fields.
For our recommendation engine, we store movie metadata, such as id, title, and genre, along with the movie's recommendation indicators, in a JSON document:
{ "id": "65006", "title": "Electric Horseman", "year": "2008", "genre": ["Mystery","Thriller"] }
The output row from the indicator matrix that identified significant or interesting co-occurrence is stored in the Elasticsearch movie document indicator field. For example, since Movie A is an indicator for Movie B, we will store Movie A in the indicator field in the document for Movie B. That means that when we search for movies with Movie A as an indicator, we will find Movie B and present it as a recommendation.
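The step of copying indicator rows into movie documents might look like the following Python sketch. The field name `indicators`, the helper `with_indicators`, and the sample data are assumptions for illustration; the post does not specify the actual field name or loading code.

```python
import json

# Hypothetical movie documents keyed by id.
movies = {
    "A": {"id": "A", "title": "Movie A"},
    "B": {"id": "B", "title": "Movie B"},
}

# Hypothetical indicator rows: for each movie id, the ids of the movies
# that indicate it. Here Movie A is an indicator for Movie B.
indicator_rows = {"B": ["A"]}

def with_indicators(movie, indicator_rows, field="indicators"):
    """Return a copy of the movie document with its indicator field set."""
    doc = dict(movie)
    doc[field] = indicator_rows.get(movie["id"], [])
    return doc

docs = [with_indicators(movie, indicator_rows) for movie in movies.values()]
print(json.dumps(docs, indent=2))
```

Each resulting document could then be indexed in the search engine, so that a search for "A" in the indicator field finds the document for Movie B.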
Search engines are optimized to find documents whose fields are similar to a query. We will use the search engine to find the movies whose indicator fields are most similar to a query built from the movies a user likes.
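As a rough sketch of what such a query could look like, the Python below builds a search request body that uses Elasticsearch's `match` query on an indicator field, with the user's liked movie ids as the query text. The field name `indicators` and both function names are assumptions, and sending the request to a cluster is left out. The `toy_rank` helper is only a local stand-in that orders documents by how many liked movies appear in their indicators; a real search engine applies its own relevance scoring.

```python
import json

def build_indicator_query(liked_ids, field="indicators", size=10):
    """Build a search body matching documents whose indicator field
    contains any of the user's liked movie ids."""
    return {
        "size": size,
        "query": {"match": {field: " ".join(liked_ids)}},
    }

def toy_rank(docs, liked_ids, field="indicators"):
    """Local stand-in for the search: rank documents by the number of
    liked movies found in their indicator field, best first."""
    liked = set(liked_ids)
    scored = [(len(liked & set(d.get(field, []))), d["id"]) for d in docs]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in ranked if score > 0]

# Hypothetical indexed documents and user history.
docs = [
    {"id": "B", "indicators": ["A"]},
    {"id": "C", "indicators": ["A", "B"]},
    {"id": "D", "indicators": []},
]
liked = ["A", "B"]

print(json.dumps(build_indicator_query(liked), indent=2))
print(toy_rank(docs, liked))
```

In practice you would usually also filter out movies the user has already seen before presenting the results.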