Carol McDonald

An Inside Look at the Components of a Recommendation Engine

January 22, 2021

Editor’s Note: MapR products and solutions sold prior to the acquisition of such assets by Hewlett Packard Enterprise Company in 2019 may have older product names and model numbers that differ from current solutions. For information about current offerings, which are now part of HPE Ezmeral Data Fabric, please visit https://www.hpe.com/us/en/software/data-fabric.html

Original Post Information:

"authorDisplayName": "Carol McDonald",
"publish": "2015-04-09T07:00:00.000Z",
"tags": "machine-learning"

Recommendation engines help narrow your choices to those that best meet your particular needs. In this post, we're going to take a closer look at how all the different components of a recommendation engine work together. We're going to use collaborative filtering on movie ratings data to recommend movies. The key components are a collaborative filtering algorithm from Apache Mahout, which builds and trains a machine learning model, and search technology from Elasticsearch, which simplifies deployment of the recommender.

What is Recommendation?

Recommendation is a class of machine learning that uses data to predict a user's preference for or rating of an item.  Recommender systems are used in industry to recommend:

  • Books and other products (e.g. Amazon)
  • Music (e.g. Pandora)
  • Movies (e.g. Netflix)
  • Restaurants (e.g. Yelp)
  • Jobs (e.g. LinkedIn)

The recommender relies on the following observations:

  1. Behavior of users is the best clue to what they want.
  2. Co-occurrence is a simple basis that allows Apache Mahout to compute significant indicators of what should be recommended.
  3. There are similarities between the weighting of indicator scores in the output of such a model and the mathematics that underlies text retrieval engines.
  4. This mathematical similarity makes it possible to exploit text-based search to deploy a Mahout recommender using a search engine like Elasticsearch.

Architecture of the Recommendation Engine

The architecture of the recommendation engine is shown below:

  1. Movie information data is reformatted and then stored in Elasticsearch for searching.
  2. An item-similarity algorithm from Apache Mahout is run with user movie ratings data to create recommendation indicators for movies. These indicators are added to the movie documents in Elasticsearch.  
  3. Searching for a user's preferred movies among the indicators of other movies returns a list of new movies, sorted by relevance to the user's taste.

Collaborative Filtering with Mahout

A Mahout-based collaborative filtering engine looks at what users have historically done and tries to estimate what they might likely do in the future, if given a chance. This is accomplished by looking at a history of which items users have interacted with. In particular, Mahout looks at how items co-occur in user histories. Co-occurrence is a simple basis that allows Apache Mahout to compute significant indicators of what should be recommended. Suppose that Ted likes movies A, B, and C, and Carol likes movies A and B. To recommend a movie to Bob, we can note that since he likes movie B, and since Ted and Carol also liked movie B, movie A is a possible recommendation. Of course, this is a tiny example; in real situations, we would have vastly more data to work with.
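
As a quick illustration of that reasoning, here is a toy sketch in Python (not Mahout's implementation) that counts pairwise co-occurrences from the user histories above and scores unseen movies for Bob; movie A comes out on top.

from collections import Counter
from itertools import combinations

# Toy user histories from the example above.
histories = {
    "Ted":   {"A", "B", "C"},
    "Carol": {"A", "B"},
    "Bob":   {"B"},
}

# Count how often each pair of movies appears together in one user's history.
cooccur = Counter()
for movies in histories.values():
    for pair in combinations(sorted(movies), 2):
        cooccur[pair] += 1

# Score movies for Bob: unseen movies that co-occur with movies he likes.
liked = histories["Bob"]
scores = Counter()
for (a, b), count in cooccur.items():
    if a in liked and b not in liked:
        scores[b] += count
    if b in liked and a not in liked:
        scores[a] += count

print(cooccur)               # ('A', 'B') co-occurs twice; ('A', 'C') and ('B', 'C') once
print(scores.most_common())  # movie A scores highest for Bob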

In order to get useful indicators for recommendation, Mahout’s ItemSimilarity program builds three matrices from the user history:

1. History matrix:  contains the interactions between users and items as a user-by-item binary matrix.

2. Co-occurrence matrix: built by transforming the history matrix into an item-by-item matrix, recording which items co-occur (appear together) in user histories.

In the example above, movie A and movie B co-occur twice (both Ted and Carol liked them), while movie A and movie C co-occur only once (Ted). The co-occurrence matrix cannot be used directly as recommendation indicators, because very common items will tend to co-occur with lots of other items simply because they are common.

3. Indicator matrix: retains only the anomalous (interesting) co-occurrences that will serve as clues for recommendation. Some items (in this case, movies) are so popular that almost everyone likes them; they co-occur with almost every other item, which makes them less interesting as indicators. Co-occurrences that are too sparse to draw conclusions from are also not retained. In this example, movie A is an indicator for movie B.

Mahout runs multiple MapReduce jobs to calculate the co-occurrences of items in parallel. (Mahout 1.0 runs on Apache Spark). Mahout’s ItemSimilarityJob uses the log likelihood ratio test (LLR) to determine which co-occurrences are sufficiently anomalous to be of interest as indicators. The output gives pairs of items with a similarity greater than the threshold you provide.
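
Mahout's exact computation lives inside ItemSimilarityJob, but the LLR test itself is a standard measure over a 2x2 contingency table of user counts. The sketch below follows the common entropy-based formulation (popularized by Ted Dunning) and is meant only to show what "sufficiently anomalous" means, not to reproduce Mahout's code.

import math

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy computed directly from raw counts.
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    # Log-likelihood ratio for a 2x2 contingency table of user counts:
    # k11 = users who saw both items, k12 = only item A,
    # k21 = only item B,              k22 = neither item.
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    if row_entropy + col_entropy < mat_entropy:
        return 0.0  # guard against floating-point round-off
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# Pairs that co-occur far more often than chance score high and are kept as
# indicators; pairs that co-occur at about the chance rate score near zero.
print(llr(20, 5, 5, 1000))   # strong association -> large LLR
print(llr(10, 90, 90, 810))  # co-occurrence at the chance rate -> LLR near 0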

The output of the Mahout ItemSimilarity job lists, for each item, the other items whose co-occurrence is interesting enough to indicate a recommendation. For example, the row for Movie B shows Movie A as an indicator, meaning that liking Movie A is a clue that you will also like Movie B.

Elasticsearch Search Engine

Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search engine library. Full-text search uses precision and recall to evaluate search results:

  • Precision = proportion of top-scoring results that are relevant
  • Recall = proportion of relevant results that are top-scoring
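
To make these two definitions concrete, here is a tiny worked example with made-up result sets:

# Hypothetical query results versus the results the user actually wanted.
returned = {"movie B", "movie D", "movie F"}   # top-scoring results
relevant = {"movie B", "movie C", "movie D"}   # truly relevant results

hits = returned & relevant
precision = len(hits) / len(returned)   # 2/3 of what was returned is relevant
recall = len(hits) / len(relevant)      # 2/3 of what is relevant was returned
print(precision, recall)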

Elasticsearch stores documents, which are made up of fields. Each field has a name and content. Fields can be indexed and stored so that documents can be found by searching their field contents.

For our recommendation engine, we store movie metadata such as the id, title, year, and genre, along with the movie's recommendation indicators, in a JSON document:

{
  "id": "65006",
  "title": "Electric Horseman",
  "year": "2008",
  "genre": ["Mystery", "Thriller"]
}

Each movie's row from the indicator matrix, which identifies its significant or interesting co-occurrences, is stored in the indicator field of that movie's Elasticsearch document. For example, since Movie A is an indicator for Movie B, we store Movie A in the indicator field of the document for Movie B. That means that when we search for movies with Movie A as an indicator, we will find Movie B and can present it as a recommendation.
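
As a rough sketch of that storage step over Elasticsearch's REST API: the index name (movies), the document endpoint, and the field name (indicators) below are illustrative assumptions, not the original article's configuration.

import requests  # assumes an Elasticsearch node reachable at localhost:9200

# Movie B's document, with its indicator row from Mahout added as a field.
movie_b = {
    "id": "B",
    "title": "Movie B",
    "genre": ["Mystery", "Thriller"],
    "indicators": ["A"],   # liking Movie A indicates you may also like Movie B
}

# Index (store) the document so the indicators field becomes searchable.
resp = requests.put("http://localhost:9200/movies/_doc/B", json=movie_b)
print(resp.json())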

Search engines are optimized to find documents whose fields are most similar to a query. We will use the search engine to find the movies whose indicator fields best match a query built from the movies a user already likes.
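
Continuing the sketch above, a match query against the assumed indicators field, using the IDs of movies the user likes as the query text, returns candidate movies ranked by Elasticsearch's relevance score:

import requests

# The movies this user already likes become the query; higher relevance
# scores against other movies' indicators mean stronger recommendations.
query = {
    "query": {
        "match": {
            "indicators": "A C"   # IDs of movies the user liked
        }
    },
    "size": 10,
}

resp = requests.post("http://localhost:9200/movies/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])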
