Back in 2001, Gartner (then META Group) initially described what has since become known as the “Three Vs of Big Data:"
- Volume - the total amount of data
- Variety - the different forms of data coming in (Tweets, video, sensor data, etc…)
- Velocity - the speed at which data is coming into a system
A great deal of focus since then has been on the first two Vs. Many technologies (e.g. Hadoop, MapR, Ceph) enable the storage of massive quantities of data, commonly referred to as a ‘data lake,’ and perform analysis on the data (e.g. MapReduce, Mahout). The focus on the variety of data has led to the proliferation of NoSQL solutions that focus on ‘schemaless’ storage, allowing different formats of data to be co-located together. An entire discussion surrounding which NoSQL option (and the misnomer of ‘schemaless’) may be right for your product will be coming in a future blog post.
The third V - velocity - has largely been relegated to custom solutions designed to get the incoming data into said ‘data lake’ as quickly as possible. The irony is that the velocity of data in Big Data was the most marketed aspect. Infographics were shared by the thousands demonstrating how much data was being produced in 60 seconds in the world. Domo and Qmee produced two of the most widely seen Big Data infographics.
While we can discuss the reality that many examples of Big Data (including these infographics) have no impact on most companies' Big Data plans (does your product care about the proliferation of Youtube videos?), I’d like to spend a little time discussing the third V, Velocity, in more detail. In the past few years, new tooling has arisen to address what's called ‘stream event processing.' This is when data is processed as it arrives. By viewing data as it actively occurs, immediate insight can be gained, trends can be visualized, threats/warnings can be immediately acted upon, and long term data storage can be reduced, or even eliminated.
The challenge is being able to process this deluge of data in a scaleable, elastic manner. Amazon Web Services (AWS) offers a stream processing service called Kinesis. Additionally, there are open source tools like Storm and Spark. We have more devices in our lives that are collecting more and more data. Most smartphones collect 30-40 events per second, and with the expansion of wearables, this is only going to expand. Factor in the Internet of Things (IoT) and existing social media streams and you can see that more data is available to product developers than ever before.
The traditional approach to this data would be to store it all, then perform analytics on all the data. This would take a long time to run (contrary to market-ese, processing billions of rows of data still takes a long time - just way less than previously). If the incoming data has a measurable ‘time-of-value’ (e.g. security violation in system log), then waiting for that answer may take too long. Even worse, by processing all the data collectively, while it may give a more precise response, enough data may come in while that process is running that the answer may have changed.
This brings us back to stream processing. Being able to effectively process incoming data in a scalable fashion can allow us to gain insights in real-time. Some use cases for stream processing:
- Monitoring web application logs from large clustered applications for suspicious behavior
- Calculating real-time aggregations of messages sent from a deployed mobile app
- Pre-processing data to determine trends (top terms, financial trends, shopping).
- Grouping and aggregating data and storing only the result for later analysis. This may seem counterintuitive to ‘Big Data’ - but while storage costs are definitely cheap, at enough data volume (and stored in triplicate in Hadoop), it adds up. Estimate what the value of the details will be in 3, 6, or 9 months. You may indeed want to save it all, but it should be a deliberate decision.
The key factors in designing your streaming processing system:
- Ensure it can scale to your incoming data speed. If it takes longer to process messages than they are generated by all sources, you’ll need to expand your processing pool.
- You may want to consider a scalable message queue such as Kafka, RabbitMQ, ZeroMQ, or many others to act as a ‘buffer’ if your event traffic is prone to ebbs and flows. Expanding clusters is not instantaneous (nor free). You may, depending on your stream processing framework/hosting be able to automatically adjust your processing cluster depending on queue size.
- If you decide to save the inbound data in a long-term store, consider configuring your stream processing to allow for ‘sampling’ via configuration. Even if you start with storing 100% of the inbound data, being able to selectively tune that down will give you the freedom to do reasonable data analysis without the significant potential overhead of storage of petabytes of data (4PB of data = $45k-$125k per month).
In closing, when dealing with rapid data, particularly data that has a short half-life value, instead of immediately pursuing a strategy of store and analyze, consider a model that processes the data as it arrives, giving you the immediate results you need, and possibly saving a great deal on storage costs. As an amusement, I’d like to leave with one last infographic, a great one from IBM that shows all the ‘Big Data’ associated with searching for ‘Big Data’ (including the number of infographics about Big Data)...