REAL TIME ANALYTICS & VISUALIZATION WITH APACHE SPARK

I’m sure you’ve heard that fast data is the new black. If you haven’t, here’s the memo: big data processing is moving from a ‘store and process’ model to a ‘stream and compute’ model for data whose value is time-bound.

For example, consider a logging subsystem that records all kinds of informational, warning, and error events in a live application. You would want security-related events to immediately raise a set of alarms instead of looking at the events in a report the next morning. If you are wondering about an architecture that realizes the ‘stream and compute’ model, read on!

THE TECH STACK THAT ENABLES STREAM & COMPUTE

The stack comprises four layers – the event source, a message broker, a real-time computation framework, and a store for the computed results.

(This is not an exhaustive list)

There are a number of components that can fulfill the need at each layer, and the exact choices depend on your scalability needs and familiarity with the associated ecosystem. One possible combination is to use RabbitMQ as the message broker and Apache Spark as the computation framework. We have created a demonstration of real-time streaming visualization using these components.

HOW DOES IT WORK?

The event source is a log of events generated by video players simultaneously playing different content in different geographical regions of the world. Each entry in the log contains multiple parameters, three of which are:

  1. session ID – a unique identifier that represents a particular streaming session. There can be multiple log entries with this identifier.
  2. buffering event – represents a period of time for which the player was buffering, waiting for data.
  3. IP address – the IP address recorded when the video player posts the event to the log collection server.

The need was to visualize the buffering events as they occurred in different geographies so that root cause analysis could be performed in near real time.

The sequence of operations is illustrated as follows:

The active sessions are computed by counting the distinct session IDs in the Spark streaming processing window.
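
As a rough illustration, the sketch below counts distinct session IDs over a sliding window with the Spark Streaming (DStream) API in Python. The socket source, the pipe-delimited field layout, and the window sizes are assumptions made for the example; the actual demonstration consumes events from RabbitMQ.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="ActiveSessions")
    ssc = StreamingContext(sc, 10)                      # 10-second micro-batches

    # A socket stream stands in for the RabbitMQ feed; each line is assumed to be
    # a delimited log entry of the form: sessionId|eventType|ip|...
    lines = ssc.socketTextStream("localhost", 9999)
    session_ids = lines.map(lambda line: line.split("|")[0])

    # Distinct session IDs over a 60-second window, sliding every 10 seconds.
    active_sessions = (session_ids
                       .window(60, 10)
                       .transform(lambda rdd: rdd.distinct())
                       .count())

    active_sessions.pprint()                            # one count per batch
    ssc.start()
    ssc.awaitTermination()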

(Active sessions over time)

(Distribution of active sessions over country)

A geographic IP database is used to look up the ISO country code for the IP address of the recorded event.

The buffering events are filtered from the streaming RDD and latitude-longitude information is added to each buffering event. The visualization uses a heat map to denote areas with a larger concentration of buffering events so that such regions “light up.” We used the OpenLayers library for this visualization.
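
A minimal sketch of this filter-and-enrich step is shown below. It assumes the MaxMind GeoLite2 City database and the geoip2 Python package; the field names and the parse_entry helper are illustrative, not the demonstration’s actual code.

    import geoip2.database

    def parse_entry(line):
        # Assumed layout: sessionId|eventType|ip|...
        session_id, event_type, ip = line.split("|")[:3]
        return {"session": session_id, "event": event_type, "ip": ip}

    def enrich_partition(entries):
        # One reader per partition avoids shipping the database handle in a closure.
        reader = geoip2.database.Reader("GeoLite2-City.mmdb")
        for entry in entries:
            try:
                geo = reader.city(entry["ip"])
                entry["country"] = geo.country.iso_code
                entry["lat"] = geo.location.latitude
                entry["lon"] = geo.location.longitude
                yield entry
            except Exception:
                continue                  # skip entries whose IP cannot be resolved
        reader.close()

    def buffering_points(rdd):
        # Keep only buffering events and attach country code and coordinates;
        # the collected points feed the OpenLayers heat map.
        return (rdd.map(parse_entry)
                   .filter(lambda e: e["event"] == "buffering")
                   .mapPartitions(enrich_partition))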

REAL-TIME ANALYTICS, VISUALIZED

As noted earlier, the components can be mixed and matched to suit your needs. However, if you look at the architecture of the demonstration, there is no poll-and-wait at any stage. The closing piece of advice is thus: maintain a push model throughout the architecture.

The screenshots are taken from working, demonstrable software. Please contact us at Sayantam.Dey@3PillarGlobal.com or Dan.Greene@3PillarGlobal.com if you need real-time data visualization, and we can walk you through the actual demonstration! This post originally appeared on the 3Pillar Global website.

TOPIC CLUSTERS WITH TF-IDF VECTORIZATION USING APACHE SPARK

In my previous post about building an Information Palace that clusters information automatically into different nodes, I wrote about using Apache Spark for creating the clusters from the collected information.

This post outlines the core clustering approach, which uses the new DataFrame API in Apache Spark, and represents the clusters with a zoomable circle-packing layout (built with the excellent D3.js). In addition, the common terms and the unique terms at each level of the hierarchy are shown when a cluster or a node is zoomed in. The hierarchy consists of the overall data set, the clusters within it, and the data nodes within each cluster.

In addition to the traditional clustering by machine learning, I have added a “Node Size” attribute, which may present a new approach to the organization and presentation of unstructured data such as news or any free text. The “Node Size” attribute is calculated from the unique terms a data node contributes to the corpus, in addition to the common terms that allow it to be clustered with similar data nodes.

The thinking behind this approach is that if several articles share a common set of words, the article with more unique words is likely to cover the subject in greater detail than the others. In addition, there is a non-deterministic limit on these unique words, beyond which the clustering algorithm will eject the article from the cluster. This thinking is further validated by how typical search engines use an inverted document index to find the most relevant documents for a given search term.

For the media industry, using the “Node Size” attribute to suggest one source of information over other similar sources can be a simpler alternative to user trends, or even a correlating factor with them. I think this is particularly suitable for businesses such as law firms, healthcare, and banking, where relevance must be determined from the articles or data sources themselves.

The rest of this post outlines the implementation of the clustering approach.

DATA SET

In order to test the effectiveness of the clustering approach, I curated data from different sources on three historical figures: Napoleon Bonaparte, Sir Winston Churchill, and Mahatma Gandhi. Each data point is represented by a file with the following format:

source||||content

  • source: This is the source URL for the content. For this data set it is the relative path to the file, for example data/churchill-file-1.txt.

  • delimiter: The |||| is the delimiter.

  • content: This is the extracted content from the source. This is just plain text.

In this representation, I have 3 data sources for Napoleon Bonaparte, 2 for Sir Winston Churchill, and 4 for Mahatma Gandhi. A perfect clustering algorithm would create 3 clusters and put the data sources for each historical figure into a cluster of their own.

LOADING THE CORPUS

The data set is loaded as a paired RDD with the source as the key and the content as the value.  In Spark, an RDD (Resilient Distributed Dataset) is the primary abstraction of a collection of items. Two successive map operations are executed on the values to tokenize the content into sentences and tokenize each sentence into a flat list of words.
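
A minimal sketch of this loading step, assuming NLTK (with the punkt data package installed) for sentence and word tokenization, and the file layout described above; the paths and helper names are illustrative:

    from pyspark import SparkContext
    from nltk.tokenize import sent_tokenize, word_tokenize

    sc = SparkContext(appName="TopicClusters")

    def split_record(text):
        source, content = text.split("||||", 1)
        return source.strip(), content

    # (source, content) pairs from every file in the data directory.
    corpus = (sc.wholeTextFiles("data/*.txt")
                .map(lambda name_and_text: split_record(name_and_text[1])))

    # Two successive map operations on the values: content -> sentences -> words.
    tokens = (corpus.mapValues(sent_tokenize)
                    .mapValues(lambda sentences: [word.lower()
                                                  for sentence in sentences
                                                  for word in word_tokenize(sentence)]))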

The first data frame is built from this RDD; it has a text label and a sequence of tokens (words). A data frame is analogous to an SQL table, holding multiple labeled columns. Spark can create data frames from multiple sources such as Parquet, Hive, or a Spark RDD. Just like operations on an RDD, data frame transformations are “lazy” – they are executed on the Spark cluster only when Spark is asked to compute a result. In the machine learning pipeline, this computation is a ‘fit’: fitting transforms a data frame into a model, which is then used to make predictions. Multiple transformers and models can be chained together (in a directed acyclic graph) to make up a machine learning pipeline.

The initial data frame has “label” for the source (the paired RDD key) and “tokens” for the words (the paired RDD value).
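
Continuing the sketch above, building that initial data frame from the paired RDD might look like this (using the SQLContext entry point of the Spark 1.x era):

    from pyspark.sql import SQLContext

    sql_context = SQLContext(sc)
    frame = sql_context.createDataFrame(tokens, ["label", "tokens"])
    frame.printSchema()
    # prints roughly:
    # root
    #  |-- label: string (nullable = true)
    #  |-- tokens: array (nullable = true)
    #  |    |-- element: string (containsNull = true)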

DATA FRAME TRANSFORMATIONS

A series of transformations is added to the pipeline to add columns to the data frame. The transformations can be expressed as a DAG where each node denotes the column added to the frame and each callout is the applied transformation.

All the transformers in the pipeline marked with a green background are known as “User Defined Transformers” (UDTs) – Spark makes it simple to use the provided transformers as well as to roll your own transformers. I will cover some of the important transformers:

  • Hashing TF: This transformer is provided by the Spark machine learning library and it transforms each term (word) to its hash (calculated by a hashing function). The input to this transformer is a sequence of terms and the output is a sparse vector (from the linear algebra module). A sparse vector stores the indices of the active dimensions and the value in each of those dimensions. In this case, each index is the output of the hashing function for the term and the value at the index is the frequency of the term in the data node (document). A minimal usage sketch follows this list.

  • IDF Transform: This Spark-supplied transformer calculates the inverse document frequency of each term. The TF-IDF score for a term that is unique in the corpus is higher than that of a term that is common in the corpus. Therefore, the lower dimensions of a TF-IDF vector for a document represent its share of common terms in the corpus, while the higher dimensions represent the unique terms.

  • Term Indexer, Term Sorter: The Hashing TF and IDF transformers do not maintain a one-to-one mapping with a term. I did not find a built-in way to correlate a TF-IDF value with the term that produced it (backtracking). I needed this backtracking to extract the common terms and the unique terms. The term indexer accepts a sequence of terms, runs each term through the same hashing function as the Hashing TF transformer, and maps out a sequence of (hashed index, term) tuples. The term sorter joins the index from this tuple to the TF-IDF vector to produce a sequence of (TF-IDF score, term) tuples.

  • Common Features Extractor: This transformer operates on the TF-IDF vector and extracts the lower dimensions of the sparse vector to return a dense vector. This transformation is actually the link to the final clustering transformation.
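
As mentioned above, here is a minimal sketch of the TF-IDF stage using only the Spark-provided transformers; the custom term indexer, term sorter, and common features extractor are omitted, and the column names and feature count are illustrative.

    from pyspark.ml.feature import HashingTF, IDF

    hashing_tf = HashingTF(inputCol="tokens", outputCol="tf", numFeatures=4096)
    tf_frame = hashing_tf.transform(frame)

    idf = IDF(inputCol="tf", outputCol="tfidf")
    idf_model = idf.fit(tf_frame)          # the 'fit' computes document frequencies
    tfidf_frame = idf_model.transform(tf_frame)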

A data frame can be queried for its schema. Once the transformations have been added (not necessarily applied), you can inspect the resulting schema.

KMEANS CLUSTERING

The Spark machine learning library has several clustering algorithms. While KMeans is probably the simplest to understand and implement, it is also quite effective and is put to use in lots of applications. KMeans is also implemented as a transformer that takes the features column from the data frame and predicts the clusters in an output column which we call – wait for it – “Cluster”! Once the transformation (actually the pipeline) is applied, the new column reflects the data we were working towards.
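
A sketch of the clustering stage wired into a pipeline, reusing the stages from the sketch above (the custom transformers are again omitted, so KMeans is pointed at the raw TF-IDF column rather than the extracted common-features vector; the seed is arbitrary):

    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans

    kmeans = KMeans(featuresCol="tfidf", predictionCol="Cluster", k=3, seed=42)
    pipeline = Pipeline(stages=[hashing_tf, idf, kmeans])

    model = pipeline.fit(frame)            # fit the whole pipeline on the corpus
    clustered = model.transform(frame)
    clustered.select("label", "Cluster").show(truncate=False)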

The final visualization is available at http://atg.3pillarglobal.com/infop-static/.

The clustering algorithm produces 3 clusters and almost gets it right – one data node for Mahatma Gandhi is clustered with all the Napoleonic data nodes (the irony of it!); the rest of the data nodes are clustered correctly.

The KMeans clustering algorithm needs to be told the number of clusters to create. If you ask it to create too few, dissimilar items will appear in the same cluster; too many, and similar items will end up in different clusters. The current implementation uses the “Rule of Thumb” to calculate the number of clusters (k):

k = sqrt(n / 2)

where n is the number of documents to be clustered. In addition, the current implementation caps ‘k’ at 1000.
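
In code, that choice of k is a small sketch:

    import math

    def choose_k(n_documents):
        # Rule of thumb: k = sqrt(n / 2), capped at 1000.
        return max(1, min(int(round(math.sqrt(n_documents / 2.0))), 1000))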

FUTURE SCOPE

There are quite a few ways in which the clustering approach can be improved or augmented:

  • For this study, I curated the data manually. In order to automate this, we will need a clever way to filter out the noise to produce meaningful tokens. While this can probably be addressed at the first level by using a series of filtering regular expressions, another way would be to work with a corpus specific stop word list.

  • If you look at the list of unique and common terms, you will notice terms like “called” and “acted” – non-root forms of words. We would probably get better results if the term and document frequencies were calculated on root forms obtained by applying a lemmatizer.

  • For some corpora, it might be beneficial to perform part-of-speech tagging to identify interesting words.

  • At the moment, the number of dimensions analyzed for common and unique terms is fixed. It would be useful to treat these as dynamic values computed as a function of document size.

  • The KMeans implementation from Spark uses Euclidean distance as a distance measure between the TF-IDF vectors. I think a cosine similarity model would be a better fit for analyzing similarity between documents. I did not find a way to influence the KMeans transformer to use a cosine distance measure; a future implementation could make use of a custom transformer.

  • The number of clusters to create can be determined by the “Elbow Method” instead of applying the rule of thumb.

Apart from the core algorithm, it would be necessary to evolve this into a system which ingests data and persists the clustered models. If you think of something else, please leave a comment!

This post originally appeared on the main 3Pillar Global website at http://www.3pillarglobal.com/insights/topic-clusters-tf-idf-vectorization-using-apache-spark. 

OUR SECRET SAUCE WITH INTENT ANALYSIS

Interpreting the intent behind spoken or written language is a very complex process that humans learn through experience in different social settings. This is one reason why literature is studied in universities around the world and academics earn doctoral degrees distilling the various nuances of the works of great literary figures.

3Pillar recently teamed up with one of our customers that promises disruption in the way businesses understand competitor strategies and respond to them. The customer has invested years of research and thought leadership in dissecting discourse from a wide variety of industries. The outcome of this research is a model for classifying discourse as a particular strategy that you may be battling or playing. The model further suggests your counter strategies or assistive strategies.

The key challenge in building an automated decision system based on this model is that the computational linguistics involved in intent analysis is not well developed or understood today. The customer asked 3Pillar to build a beta version of the product to understand the possibilities and the challenges. When the 3Pillar team analyzed the model, it realized that many components of the model could be implemented with Natural Language Processing (NLP). NLP in and of itself is not the solution to intent analysis, but starting from an ontology or a model of human interaction, NLP should be the next stop.

We buzzed around armed with our Python REPLs (this is not a word yet, but it should be!) and promptly imported the venerable NLTK library and its corpora. The most useful components of the library were as follows:

  • Tokenization – Tokenizing text found on the web is a notoriously difficult activity; there is an entire ecosystem busy scraping web pages with automated and manual processes. NLTK made it simple to tokenize content in two steps – first to sentences and then to words – and we did not need to bother ourselves with the various sentence and word delimiters (see the sketch after this list).

  • Pronunciation – The CMU Pronouncing Dictionary contains about 150,000 words from North American English and their pronunciations. Phonetic analysis of words in a sentence allowed us to find patterns with alliteration and rhyming words.

  • Part of Speech Tagging – We used the averaged perceptron tagger with the Penn Treebank (UPenn) tagset to understand the structure of sentences. Understanding whether a sentence began with a verb, contained cardinal numbers, or contained superlatives was essential for the decision system.
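
The sketch below (referenced in the tokenization item above) shows these three pieces in a few lines. It assumes the punkt, cmudict, and averaged_perceptron_tagger NLTK data packages have been downloaded, and the sample sentence is made up.

    import nltk
    from nltk.corpus import cmudict
    from nltk.tokenize import sent_tokenize, word_tokenize

    text = "Big brands build boldly. We shipped 3 releases, the fastest ever."

    # 1. Tokenization: first into sentences, then into words.
    sentences = sent_tokenize(text)
    words = [word_tokenize(sentence) for sentence in sentences]

    # 2. Pronunciation: phoneme sequences from the CMU Pronouncing Dictionary,
    #    useful for spotting alliteration and rhyme.
    pronouncing = cmudict.dict()
    print(pronouncing.get("boldly"))      # list of possible phoneme sequences

    # 3. Part-of-speech tagging (averaged perceptron, Penn Treebank tagset);
    #    CD tags mark cardinal numbers, JJS marks superlatives.
    for sentence in words:
        print(nltk.pos_tag(sentence))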

Our choice of Python ultimately enabled us to deliver the decision system prototype complete with an API built with the Flask microframework and an automated installation procedure.

One more successful product and a happy customer. Interested in finding out more? Contact us to hear the rest of the story!

This blog post originally appeared on the main 3Pillar Global website.

BUILDING A MICROSERVICE ARCHITECTURE WITH DOCKER AND SPRING BOOT

As more products are being built around reusable REST-based services, it has quickly been discovered that splitting your business functions into reusable services is very effective, but it also introduces a risk point. Every time you update a service, you have to re-test every other service deployed along with it. Even if you're confident the code change wouldn't impact another service, you can never really guarantee it, because the services inevitably share code, data, and other components.

The Future of Alternative Mobile Data Input

Smart mobile devices have become ubiquitous in our lives, and they are also seen as a solution in non-tech business areas. Any business that needs data input has begun considering mobile devices. We have been observing an emergence of new use cases for mobile devices: data input in environments where personnel have limited computer skills, such as plumbers or construction workers, or in environments where traditional interaction with such devices is uncommon.

Building a Software Information Palace

On the television show The Mentalist, the protagonist, Patrick Jane, often describes a "Memory Palace," which he purportedly uses to store vast amounts of information which he is able to retrieve at will. To create this palace, he advises choosing a large, real, physical location with which you are intimately familiar. Once you have such a palace, he advises you to slot information into the appropriate places in the palace. This, he says, is the best way to not only keep an extensive memory, but also access needed memories with ease. How, as software engineers, can we build this palace in software?

The Future of Digital Payments

Currency exchange has seen incremental innovations in the digital sector, but with contactless digital payment mediums we have the potential to create something revolutionary. Major financial institutions have been issuing Near Field Communication (NFC) cards for contactless payments worldwide, mostly in Europe, the US, Australia, and Canada. Most smartphone companies, with Apple Pay, Android Pay, and Samsung Pay leading the charge, have taken the technology further: they’ve developed their own digital payment mediums, which not only eliminate the need to carry a separate card, but also make payments more secure.

Introduction to Blockchain Technology

The introduction of cryptocurrencies, specifically Bitcoin, has brought the concept of blockchain technology into the mainstream. A blockchain is a continuously growing distributed database that protects against tampering and revision of data. The industry has already seen the power of a distributed system with Git version control; blockchain builds on the same Merkle tree approach, but also adds consensus, which specifies rules on how data can be added and verified. Transactions are added in blocks and must follow the exact order in which they happened (thus the name blockchain).
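
As a toy illustration of the tamper-evidence idea (not the actual Bitcoin protocol, which also uses Merkle trees of transactions and proof-of-work consensus), each block in the sketch below embeds the hash of its predecessor, so altering any earlier block invalidates every hash after it:

    import hashlib
    import json

    def block_hash(block):
        payload = json.dumps(block, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    chain = []
    previous_hash = "0" * 64
    for height, transactions in enumerate([["a pays b 5"], ["b pays c 2"]]):
        block = {"height": height, "prev": previous_hash, "txs": transactions}
        chain.append(block)
        previous_hash = block_hash(block)   # the next block commits to this hash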

Editor's Compass - Find the News First

Are you a publisher who is interested in making relevant news available to a localized audience? How often do you find yourself using Twitter for news rather than trusted sources? With the majority of news consumption turning toward digital media, the amount of information available is exponentially increasing. Editors with limited manpower must pick the right news to pursue, but manually filtering out all of the new information to find the worthwhile stories is next to impossible. 

Adding Real-Time Web Functionality to Applications Using SignalR

Recently, for a research project, we had to design a solution for a check-in process for a company that specializes in organizing big conferences. For the longest time, these conferences were lacking an automated system. This meant that attendees would line up in multiple queues at the venue, and organizers would manually register and check in each person. This process was very slow and time-consuming, so the company was seeking a smart and cost-effective solution that would automate the registration and check-in process at the venue.

Plotting Real-Time Data on an iPad Using Core Plot

In one of our research projects, we had to plot the real-time heart rate (HR) of a user, collected from a wearable device during the course of their workout, against time elapsed. This real-time plot showed how well a user was progressing through a workout. To plot this graph, we utilized data visualization techniques. Data visualization is essential to give meaning to raw data. It helps the viewer quickly understand trends and track data, as well as aiding data analysis. The data can be represented in a multitude of ways, including a flow diagram, pie chart, line graph, scatter plot, and time-series graph.

WebAssembly: Running Byte Code in the Browser

Currently, all of the major browser vendors are collaborating on a new standard called WebAssembly, a compilation model for the web using a compressed binary AST encoding. This will allow engineers to run code written in different languages directly in the browser. One immediate effect of WebAssembly is that it should bring about a faster online browsing experience. It will also provide C/C++ support to begin with and streamline the internal development process to make web-app creation easier, according to a study by ReadWrite.

Sage Pay Server Integration using Microsoft.NET

Sage Pay is one of the leading payment gateways in the UK market. To offer more flexibility, it is available in two integration forms, called "Form Integration" and "Server Integration," both of which are PCI compliant. This blog post will help Microsoft .NET developers integrate the Sage Pay payment gateway (protocol v3.0) and covers some of the integration problems and solutions that will help them integrate quickly. Using the steps below, a developer can integrate the iFrame solution of Server Integration into their website by utilizing the “SagePay.IntegrationKit.DotNet.dll” published by Sage Pay.

Building Rails Apps and Deploying to Heroku Cloud with Jenkins

Who should read this article? Anyone who is interested in continuous integration, loves using the best tools available to get the most out of the development process, or is part of an organization looking to make the build and release process smoother. This article will demonstrate the power of Jenkins for building and deploying Ruby on Rails applications.

Some of the prerequisite knowledge to get the most out of this article:

  • Basic knowledge of RVM and how to install Ruby and Ruby on Rails using RVM
  • Experience with getting a Rails application up and running

"What Is DevOps?" by Dan Greene Published in TechCrunch

Dan Greene, 3Pillar's Director of Advanced Technology in the US, had an article titled "What Is DevOps?" published on TechCrunch today. In the article, Dan gives his definition of the term before talking about some of the benefits companies stand to gain from employing DevOps practices.

The term itself comes from practitioners beginning to "blur the lines between the development tasks such as coding and operational deployment tasks such as server provisioning."

Docker - A Different Breed of Virtualization

From Wikipedia: Virtualization, in computing, refers to the act of creating a virtual (rather than actual) version of something, including but not limited to a virtual computer hardware platform, operating system (OS), storage device, or computer network resources.

Today we’re encountering virtualization in most of our computing environments. It provides valuable benefits for software development, where one can completely isolate the runtime environment, thus keeping the host machine intact. In the web development world, virtualization is a “must-have” that enables companies to optimize server operating costs.

Building Highly Available Systems Using Amazon Web Services

High availability is a fundamental feature of building software solutions in a cloud environment. Traditionally, high availability has been a very costly affair, but with AWS one can now leverage a number of AWS services for a high availability, or potentially “always available,” scenario. In this blog post, I’ll mention some tips to attain high availability for your application deployed on Amazon Web Services.

The How To Of HBase Coprocessors

What is HBase™? HBase is a column-oriented, non-relational big data database. HBase is a distributed, scalable, reliable, and versioned storage system capable of providing random read/write access in real time. It was modeled after Google's Bigtable; just as Bigtable is built on top of the Google File System, HBase is built on HDFS (the Hadoop Distributed File System). This means it is tightly integrated with Hadoop and provides base classes for writing MapReduce jobs against HBase.

Fast Data is the New Black

Back in 2001, Gartner (then META Group) first described what has since become known as the “Three Vs of Big Data”:

  • Volume - the total amount of data
  • Variety - the different forms of data coming in (Tweets, video, sensor data, etc…)
  • Velocity - the speed at which data is coming into a system

A great deal of focus since then has been on the first two Vs. Many technologies (e.g. Hadoop, MapR, Ceph) enable the storage of massive quantities of data, commonly referred to as a ‘data lake,’ and perform analysis on the data (e.g. MapReduce, Mahout).

Big Data & Machine Learning: Case Study of a Fitness Product Recommender Application

In my previous posts on Big Data and Machine Learning, we explored the core concepts of a recommendation engine using Apache Mahout. We covered how collaborative filtering recommendation engines compare the similarity of users or of items and make recommendations based on this notion of similarity and on examining the neighborhood of users. We learned about the interfaces and implementations (of algorithms) available in Apache Mahout. We also demonstrated how a collaborative filtering recommendation engine recommends nearly the same items as a human with contextual information.

In this post, I'll cover a case study of an application that we at 3Pillar Labs built that recommends fitness products to users based on their fitness data collected through smart sensors.