The recommendation engine built by 3Pillar Labs is a complete framework and sample mobile application that combines big data analytics with social media and mobile. The sample mobile application recommends fitness products to users based on data collected through smart sensors. It also uses the recommendation engine to list appropriate products based on a user's preferences.
Recommendation engines are a fascinating entry point into the world of machine learning. They were made famous by programming competitions such as the Netflix Prize (awarded in 2009). These algorithms have an uncanny way of producing recommendations close to what a human would have predicted or recommended. One such class is collaborative filtering: algorithms that analyze the associations between the elements of one set and the elements of another, and predict new associations between them. For Netflix, one set was users and the other was movies. Collaborative filtering does not know (or care) about the nature (or domain) of the elements in either set. How does this play out for businesses that do not have the data volume to take advantage of collaborative filtering? They face the familiar cold start problem. One way to address it is to bring domain knowledge into the predictions.
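To make the idea of collaborative filtering concrete, here is a minimal sketch of the user-based variant in plain Python. It is an illustration only, not the engine described in this article: the toy ratings, the choice of cosine similarity, and the function names are all assumptions. Note that the item names are opaque strings; nothing in the code knows what a "shoe" or a "mat" is.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}
ratings = {
    "alice": {"shoes": 5.0, "watch": 3.0, "bottle": 4.0},
    "bob":   {"shoes": 4.0, "watch": 3.0, "bottle": 5.0, "mat": 4.0},
    "carol": {"shoes": 1.0, "watch": 5.0, "mat": 2.0},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=3):
    """Score unseen items by a similarity-weighted average of other users' ratings."""
    own = ratings[user]
    totals, weights = {}, {}
    for other, prefs in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(own, prefs)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, rating in prefs.items():
            if item in own:
                continue  # only predict items the user has not rated yet
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    scored = [(item, totals[item] / weights[item]) for item in totals]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


recommendations = recommend("alice", ratings)
```

In this toy data, the only item alice has not rated is the mat, and its predicted score lands closer to bob's rating than to carol's because bob's past ratings resemble alice's more closely. With only three users the prediction is fragile, which is the cold start problem in miniature.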
The questions we wanted to address were the following –
- What does a recommendation engine using collaborative filtering look like?
- How can we build this engine such that any purely mathematical component can be swapped out for a domain aware component?
- What is the template for scaling such an engine?
We started by identifying a suitable framework or library that already implemented collaborative filtering algorithms. We looked at a number of options and settled on Apache Mahout as a stable and popular library. Apache Mahout provides one set of libraries that you can use to get started quickly with different machine learning algorithms, and another set that you can use to achieve scale with batch processing systems such as Apache Hadoop.
THE VALUE OF THE RECOMMENDATION ENGINE
We built a recommendation engine that analyzes existing user preferences and makes recommendations in real time. Conceptually the engine looks like this.
This simple model allows us to experiment with many different configurations to determine the model and configuration that works best for the problem we are trying to solve. For example, the Data Model can be implemented over a relational database, a graph database, a document database or even a flat file. The User Similarity model can be purely mathematical or can take into account user demographics. The whole User Similarity model can be replaced with an Item Similarity model, which in turn can be purely mathematical or aware of the item domain. In addition there are various thresholds and injection points which can affect the recommendations.
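The swappable design described above can be sketched in a few lines of Python. This is not the Mahout API and not our implementation; the class, the function names, the age-based rule, and the toy data are hypothetical, chosen only to show a data model and a similarity model being plugged into a recommender independently, with a threshold as one of the tuning points.

```python
class Recommender:
    """Wires a data model to a similarity function; either one can be swapped."""

    def __init__(self, data_model, similarity, min_similarity=0.0):
        self.data_model = data_model          # user -> {item: rating}
        self.similarity = similarity          # callable (user_a, user_b) -> float
        self.min_similarity = min_similarity  # threshold for admitting a neighbour

    def recommend(self, user, top_n=3):
        own = self.data_model[user]
        totals, weights = {}, {}
        for other, prefs in self.data_model.items():
            if other == user:
                continue
            sim = self.similarity(user, other)
            if sim <= self.min_similarity:
                continue
            for item, rating in prefs.items():
                if item not in own:
                    totals[item] = totals.get(item, 0.0) + sim * rating
                    weights[item] = weights.get(item, 0.0) + sim
        scored = [(item, totals[item] / weights[item]) for item in totals]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


def jaccard_similarity(data_model):
    """Purely mathematical: overlap between the sets of items two users rated."""
    def similarity(a, b):
        items_a, items_b = set(data_model[a]), set(data_model[b])
        union = items_a | items_b
        return len(items_a & items_b) / len(union) if union else 0.0
    return similarity


def demographic_similarity(base, age_by_user, max_gap=10):
    """Domain aware: scales a base similarity down as the age gap grows."""
    def similarity(a, b):
        gap = abs(age_by_user[a] - age_by_user[b])
        closeness = max(0.0, 1.0 - gap / max_gap)
        return base(a, b) * closeness
    return similarity


# Hypothetical data model (here an in-memory dict) and demographics
data = {
    "alice": {"shoes": 5.0, "watch": 3.0},
    "bob":   {"shoes": 4.0, "watch": 3.0, "mat": 5.0},
    "carol": {"shoes": 5.0, "watch": 2.0, "band": 4.0},
}
ages = {"alice": 30, "bob": 55, "carol": 32}

math_only = Recommender(data, jaccard_similarity(data))
domain_aware = Recommender(data, demographic_similarity(jaccard_similarity(data), ages))

math_picks = math_only.recommend("alice")
domain_picks = domain_aware.recommend("alice")
```

With this toy data the purely mathematical configuration ranks the mat ahead of the band for alice, while the demographic configuration stops treating bob as a neighbour and returns only the band. The recommender class itself does not change between the two runs, which is the property the conceptual model is after.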
If you want to design a car that goes 70 miles a gallon, that is one problem. If you want to design a car that does 500 miles a gallon, that is a completely different problem!
– Eric Schmidt & Jonathan Rosenberg (from a Khan Academy talk)
We realized that we needed to think completely differently if we wanted to achieve scale. The primary challenge in achieving scale was finding a template that we could reuse across different domains. We needed to build something in which a single component could be replaced to address domain concerns while the architectural integrity was maintained. We experimented with different frameworks and ultimately found the answer with this model.
The Oozie component (the small orange block) is the component that mathematically computes the recommendations. If we wanted to introduce domain knowledge, we would only have to replace that component. The rest of the architecture would remain unchanged. Other things we like about this template:
- The system is completely independent of the application using it. There is no on-premise dependency we need to take care of.
- The system is not constrained to a single recommendation data set. We use HDFS to store the data and different data sets could reside at different HDFS paths. The batch process is triggered by a REST call which can point to the desired data set.
- The system is not constrained to a recommendation system! You could use this template for any kind of event collection and batch processing with Apache Hadoop.
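As a rough illustration of a REST-triggered batch run that points at a chosen data set, the sketch below builds (but does not send) a job submission for Oozie's web services API. It assumes the trigger ends up as a direct submission to Oozie, which may not match the actual deployment. The host names, HDFS paths, user name, and the `inputPath` workflow parameter are hypothetical; `user.name` and `oozie.wf.application.path` are standard Oozie job properties.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_job_config(user, workflow_path, dataset_path):
    """Build the XML configuration body that Oozie expects for a job submission."""
    root = ET.Element("configuration")
    properties = {
        "user.name": user,
        "oozie.wf.application.path": workflow_path,
        "inputPath": dataset_path,  # hypothetical parameter read by the workflow
    }
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="utf-8")  # bytes, with an XML declaration


def build_submit_request(oozie_url, config_xml):
    """Prepare a POST that submits and starts the job; the caller decides when to send it."""
    return urllib.request.Request(
        url=f"{oozie_url}/v1/jobs?action=start",
        data=config_xml,
        headers={"Content-Type": "application/xml;charset=UTF-8"},
        method="POST",
    )


# Two data sets living at different (hypothetical) HDFS paths
fitness_config = build_job_config(
    "recsys",
    "hdfs://namenode:8020/apps/recommender",
    "hdfs://namenode:8020/data/fitness",
)
books_config = build_job_config(
    "recsys",
    "hdfs://namenode:8020/apps/recommender",
    "hdfs://namenode:8020/data/books",
)
request = build_submit_request("http://oozie-host:11000/oozie", fitness_config)
# urllib.request.urlopen(request) would submit the job to a running Oozie server.
```

Only the data set path differs between the two configurations; the workflow application stays the same. A different workflow path could be submitted in the same way to run a batch job that has nothing to do with recommendations.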
We are researching recommendation systems further. The avenues for further research are –
- Improving architecture by replacing HDFS with HBase.
- Recommendation engines using Apache Spark.
- Recommendation engines that work directly on graph databases.
The recommendation engine utilizes Apache Mahout and Hadoop in combination with a MongoDB database, Quartz scheduler, and Oozie workflow scheduler. It employs the Runkeeper API to power iOS and Android applications, as well as the Google Products API to make product recommendations.