As promised in our previous blog post, Flavours of Streaming Processing, in this post we report the performance of our weapon of choice for classification on data streams: the Streaming Random Forest (SRF).
Random Forests were introduced by Leo Breiman in 2001. They combine two of the most powerful ideas in classification, decision trees and bagging, representing a decision rule as a kind of majority vote over a large number of different decision trees that are generated probabilistically.
However, Random Forests scale poorly with the size of the dataset. This makes them impractical in streaming contexts: similarly to nearest neighbours, Random Forests need multiple runs over the full data history in order to update their predictions whenever a new datapoint arrives. Eventually, no matter how fast your hardware, Random Forests will become slower than your data arrival rate; and no matter how big your memory, they will outgrow it, at which point you will have to scale out. This is now possible using tools like Apache Spark MLlib, but the fundamental truth remains: the algorithm becomes slower over time, no matter how advanced the infrastructure.
So one reason to switch to online Random Forests is speed and scalability. The other is that decision rules that were valid last week might be obsolete today. An immediate “hack” is to retrain your classifier at regular intervals, using a fixed number of the most recent observations. This is a great quick fix, but it suffers from two drawbacks:
To demonstrate this latter issue, we ran an experiment on a dataset known as ELEC2, which has become a benchmark in the streaming data literature. Each record contains a timestamp, as well as four covariates capturing aspects of electricity demand and supply for the Australian New South Wales (NSW) Electricity Market from May 1996 to December 1998. Labels indicate the price change relative to a recent moving average.
Below we report the error rate (lower is better) achieved by a sliding window implementation of random forests using the randomForest package in R, for 8 different window sizes (left group of bars in the Figure below). When the dataset is ordered by timestamp, the best performing window size is 100, on the lower end of the scale. This is a classic case of “more data does not equal more information”: using 100 times more data (w=10,000 vs. w=100) almost doubles the error rate (to 175% of its value)!
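The mechanics of that evaluation can be sketched as follows. The original experiment used the randomForest package in R; here a placeholder majority-class "learner" stands in for the forest, purely to illustrate the prequential sliding-window loop and why window size matters:

```python
from collections import deque, Counter

def sliding_window_error(stream, w):
    """Prequential evaluation: predict each point using only the previous
    w observations, then slide the window forward. A majority-class
    'learner' stands in for the random forest in this sketch."""
    window = deque(maxlen=w)          # the w most recent (x, y) pairs
    errors = total = 0
    for x, y in stream:
        if window:                    # predict before updating
            majority = Counter(label for _, label in window).most_common(1)[0][0]
            errors += (majority != y)
            total += 1
        window.append((x, y))
    return errors / total if total else 0.0

# Toy stream whose majority label flips halfway through, mimicking drift:
stream = [(i, "UP") for i in range(50)] + [(i, "DOWN") for i in range(50)]
small = sliding_window_error(stream, w=5)    # adapts quickly after the flip
large = sliding_window_error(stream, w=100)  # drags obsolete labels along
```

With drift present, the small window ends up with the lower error rate, mirroring the ELEC2 result.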
To drive the point home, we took the same data but presented it to the classifier in random order, so that it could no longer take advantage of temporal effects. In this case, accuracy did indeed improve with larger window sizes.
The advantage that a well-calibrated streaming method can have over its offline counterpart in a streaming context is quite dramatic: in this case, the best-performing streaming classifier has an error rate of 12%, whereas a random forest trained on the entire dataset (minus 10% withheld for testing) achieves an error rate of 24%. A fraction of the data, half the errors!
So it seems that you can have your cake and eat it. Or can you? Well, you can only achieve this fantastic result if the window size is well-tuned. And that’s a tricky business… Set it too small, and you will have to learn everything from scratch every few observations. Set it too large, and obsolete patterns will start misleading you. And to make things worse, a technique like random forest consists of a composition of thousands of decision rules, some of which may become obsolete faster than others.
Put simply, any model that runs on a live stream must be updated regularly, but there are fundamentally better ways to update a model than to rebuild it from scratch. A bit like a human, successful online learning algorithms recognise that they have a finite memory, and that some patterns “age” faster than others. The brute force of a sliding window cannot take you very far in real applications.
The Mentat SRF
Mentat’s Streaming Random Forest has three unique features:
Indeed, in the example above, the SRF tool achieves the same result as the best-performing window, without any manual tuning whatsoever. This performance is identical to the state-of-the-art in the literature (see a related publication by our Chief Data Scientist here).
Empirical Study: Detecting Malware

The KDD Cup competition of 1999 introduced a benchmark dataset for classification on large datasets. It considered five types of traffic: normal traffic, Probe, denial-of-service (DOS), user-to-root, and remote-to-local attacks.
This is an example of a multi-class classification problem: we don’t simply need to detect malicious traffic, but want to know in real time what type of attack it is, so that we can take automatic action.
It was revealed that Probe and DOS traffic can be distinguished from normal traffic by very simple methods, such as a nearest neighbours classifier – the winning method only managed to improve accuracy in these cases by a few percentage points. User-to-root and remote-to-local attacks were much more challenging, with the winning method achieving 92% and 87% error rates respectively – more than 8 out of 10 examples were misclassified in both cases.
Below we compare SRF to the winning method, in terms of the error rate (lower is better):
When it comes to DOS and Probe there wasn’t much to improve upon – the baseline method already achieved error rates below 5%. However, for remote-to-local and user-to-root, the ’99 Winner improved upon the baseline significantly (red is lower than blue in the last two columns). SRF blows it out of the water, with less than half the errors for remote-to-local, and a 40% decrease for user-to-root. These improvements look even more impressive when one takes into account two things:
Description of the API
The Mentat SRF has a very simple API, with three components:
To make things easier for you, we can also accept multiple datapoints in one request. Our webservice for both predict calls is extremely fast and scalable, and our premium service can support even the most demanding real-time applications, via Amazon Kinesis.
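The API itself is not reproduced here, so purely as an illustration, this is what a minimal Python client for the predict and update calls might look like. Every name, endpoint, and payload field below is hypothetical (not the actual Mentat API), and the transport is stubbed out so the sketch is self-contained:

```python
class SRFClient:
    """Hypothetical client for a streaming-classifier webservice.
    Endpoint paths and payload fields are illustrative only. `transport`
    is any callable (path, payload) -> dict, so a real HTTP layer (or a
    stub, as below) can be plugged in."""
    def __init__(self, transport):
        self.transport = transport

    def predict(self, points):
        # accepts a single datapoint or a batch, as the service does
        return self.transport("/predict", {"points": points})["labels"]

    def update(self, points, labels):
        # feed back ground truth so the decision rule stays current
        return self.transport("/update", {"points": points, "labels": labels})["ok"]

# Stub transport standing in for the webservice:
def stub(path, payload):
    if path == "/predict":
        return {"labels": ["UP" for _ in payload["points"]]}
    return {"ok": True}

client = SRFClient(stub)
labels = client.predict([[1.0, 2.0], [3.0, 4.0]])
```

Injecting the transport keeps the prediction/update logic separate from the networking, which is also what makes the sketch testable.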
To sum up:
For the enthusiasts out there, we can expose more control parameters, as well as optionally return the updated decision rule as an XML file after an update call. We can also expose monitoring metrics such as the recent error rate or AUC (although we prefer to measure classification performance by the H-measure, together with a metric of how fast your data is ageing). If you want to be part of the private beta, please get in touch.
Feature extraction and unstructured data
Often datasets require a feature extraction step before they can be turned into the right format for classification. This step can be another crucial factor in performance. Our experienced data science team can support this on a consulting basis.
If your dataset or tasks involve unstructured data, such as free text, images, or networks, do not despair! We can support feature extraction steps on all such data types, either using our own bespoke software or via integration with Alchemy API, as we are an IBM Watson Ecosystem Partner.
What else is coming?
Classification is a supervised technique: a bit like a child, the classifier needs to be shown labelled examples of the ground truth in order to extrapolate for the future. But children also learn by themselves, without explicit guidance by gradually developing an understanding of “normal behaviour” of their environment, and responding appropriately to any “anomalous events”.
This is also true of machine learning algorithms. For example, in cybersecurity, the most dangerous attacks come in the form of previously unseen types of threats (0-day). A supervised classifier is by construction hopeless against such novel threats. Similarly, new types of fraud appear every day; new types of bugs are introduced in software; and new customer behaviours can develop on a weekly basis.
When everything changes, profiling and anomaly detection is a necessary complement to agile, adaptive classification technology. This is why our core product, datastream.io, is designed to perform unsupervised profiling and anomaly detection at scale, as a service. Please stay tuned for our next blog post on this topic.
We organised together with our friends at Workable the inaugural Data Science Meetup in Athens, Greece on October 1st. The event proved to be a huge success with over 100 participants attending at the Corallia a2-innohub cluster. A great start for a very promising community !
In his excellent recent blog post "Streaming 101", Tyler Akidau made a great contribution to the streaming community by teasing apart notions that have been confounded by standard, albeit increasingly obsolete, practices. In that same spirit of paving the way for the streaming revolution, this blog post emphasises that streaming does not necessarily mean approximate, with a particular focus on machine learning.
Let's start with the simplest possible example, that of a linear sum: it can be easily computed in an incremental (exactly-once) processing manner:
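A minimal Python sketch of that incremental sum (illustrative, not from the original post):

```python
def streaming_sum(stream):
    """Exact incremental sum: one float of state, one pass, no history."""
    total = 0.0
    for x in stream:
        total += x      # the state update uses only the newest value
        yield total     # the running answer, exact at every step

running = list(streaming_sum([1.0, 2.0, 3.0, 4.0]))
```

The final element of `running` agrees exactly with a batch sum over the same data.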
The answer in this case is exact: it agrees precisely with a batch sum. Such examples are referred to as exact incremental or exact online processing.
However, not all queries can be computed exactly in an online fashion. A classical example is top-N, or a median. To understand why, it is best to define our terms: what does incremental or online really mean? Put simply, it means that the memory and time requirements of the process do not grow over time - despite the fact that the number of data entries processed does (since data streams are by definition semi-infinite - in Tyler's terminology: unbounded).
Indeed, to maintain a sum, one only needs to store a single floating-point number in memory (the last value of the sum). To maintain an average, one needs to store two: the sum, and a count of how many data entries have been seen so far. However, to maintain a median, one needs to keep in memory an array whose size grows with the total number of data entries seen. Similarly for top-N. Consequently, the best we can do in a streaming context is offer approximate answers to queries such as a median, top-N, or a percentile calculation.
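To make the contrast concrete, here is a small Python sketch: the mean keeps two numbers of state, while the "online" median below has to keep every datapoint it has ever seen:

```python
import bisect

def streaming_mean(stream):
    """Exact online mean: constant memory (a sum and a count)."""
    total, count = 0.0, 0
    for x in stream:
        total += x
        count += 1
        yield total / count

def streaming_median(stream):
    """'Online' median: the state (a sorted buffer) grows with the
    stream, so this is not truly online in the bounded-memory sense."""
    buf = []
    for x in stream:
        bisect.insort(buf, x)   # memory grows with every datapoint
        n = len(buf)
        yield buf[n // 2] if n % 2 else (buf[n // 2 - 1] + buf[n // 2]) / 2
```

The mean stays exact forever at fixed cost; the exact median's memory footprint is unbounded, which is why streaming systems fall back on approximations for it.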
"Approximate answers" has been a longstanding criticism of streaming processing. MapReduce enthusiasts have always argued that, no matter how big your dataset is becoming over time, unless you desperately need real-time answers, you should always scale out and run batch to ensure correctness.
Needless to say, as Tyler also mentions, approximate queries can be implemented in batch mode, too, in time-critical contexts. In fact, the initial motivation behind online processing was precisely that: to generate fast approximations by sequentially processing massive datasets.
However, as soon as we move beyond simple analytics such as sums and top-Ns, and into the realm of machine learning, the situation becomes a lot more interesting. Very few batch machine learning algorithms have exact online counterparts (one example being linear regression, which can be computed exactly online via the Recursive Least Squares algorithm). Therefore, your favourite batch random forest will only approximately match its online version.
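As a concrete illustration of that exception, here is a from-scratch Recursive Least Squares sketch (not a library implementation) fitting y = w0 + w1·x one observation at a time; with a large prior `delta` it coincides with the batch least-squares fit up to a vanishing regularisation term:

```python
def rls(points, delta=1e6):
    """Recursive Least Squares for y = w0 + w1*x: constant memory,
    one pass, one update per datapoint. With a large prior `delta`
    the result matches batch least squares up to a vanishing
    regularisation term (exact in the limit delta -> infinity)."""
    w = [0.0, 0.0]
    P = [[delta, 0.0], [0.0, delta]]      # prior (inverse information)
    for x, y in points:
        phi = [1.0, x]                    # regressor: intercept + slope
        Pphi = [P[0][0] * phi[0] + P[0][1] * phi[1],
                P[1][0] * phi[0] + P[1][1] * phi[1]]
        denom = 1.0 + phi[0] * Pphi[0] + phi[1] * Pphi[1]
        k = [Pphi[0] / denom, Pphi[1] / denom]        # gain vector
        err = y - (w[0] * phi[0] + w[1] * phi[1])     # prediction error
        w = [w[0] + k[0] * err, w[1] + k[1] * err]
        # P <- P - k * (phi^T P); phi^T P equals Pphi since P is symmetric
        P = [[P[i][j] - k[i] * Pphi[j] for j in range(2)]
             for i in range(2)]
    return w

# Exact linear data y = 2 + 3x: the online answer recovers the batch fit.
w = rls([(t, 2.0 + 3.0 * t) for t in range(10)])
```

Note that memory and per-update cost are fixed no matter how long the stream runs, which is precisely the online property defined above.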
Does that automatically make the online algorithm inferior, or sub-optimal relative to the batch version? The answer is a definite (and perhaps surprising) no!
A major distinction between analytical queries and machine learning algorithms is that the former are fixed functions of the data, such as a sum or a rank, whereas the latter are processes, rather than functions, whose objective is to discover hidden patterns or rules that generalise well against future datapoints. There is therefore no theoretical reason why a batch random forest must necessarily outperform a well-designed online random forest.
In the seminal book Principles of Data Mining, the chairman of our advisory board, Professor David Hand, and his coauthors make this point really nicely. They break down a data mining algorithm into the following constituents: a model or pattern structure, a score function, an optimisation and search method, and a data management strategy.
Possibly the easiest way to understand this is to consider a neural network with a fixed architecture (i.e., number of hidden layers etc.), which has to be trained on a dataset of labelled examples. Because neural networks are infamous for producing optimisation surfaces with a huge number of local optima, no gradient descent method is guaranteed to identify the global optimum, and, indeed, conjugate GD might produce a different answer than simple GD. In this sense, an online algorithm using Stochastic Gradient Descent is just another candidate, which might or might not outperform the batch candidates. In any case, it makes no sense to consider the SGD answer as an approximation to the GD answer. In truth, they are both approximations of the unknown "true" decision boundary, and, if they are well-designed, they should both approach the "true" answer as the size of the dataset increases. Which one gets there faster is largely dependent on the problem and the class of methods. Put differently, in machine learning, both batch and online algorithms are approximations, that use data and computational resources differently.
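A toy illustration of that last point, assuming nothing beyond the standard update rules: batch gradient descent and SGD on the same noiseless regression problem follow different paths, yet both approach the same true slope, and neither is "the" answer of which the other is an approximation:

```python
import random

def batch_gd(data, lr=0.01, epochs=500):
    """Batch gradient descent on mean squared error for the model y = w*x."""
    w, n = 0.0, len(data)
    for _ in range(epochs):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / n
        w -= lr * grad
    return w

def sgd(data, lr=0.01, epochs=500, seed=0):
    """Stochastic gradient descent: one (shuffled) datapoint per update."""
    rng, w = random.Random(seed), 0.0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):
            w -= lr * 2.0 * (w * x - y) * x
    return w

# Noiseless data with true slope 3: both optimisers approximate the same
# underlying answer, each taking a different path to it.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]
wb, ws = batch_gd(data), sgd(data)
```

On this convex toy problem both converge; the point of the neural-network example above is that on non-convex surfaces each may land in a different local optimum, with no a-priori winner.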
A clarification is in order at this point: sliding windows are an extremely inefficient way to perform streaming learning, precisely because they are designed as approximations to batch learning, albeit starved of information since they use only a fraction of the available data. Overarching frameworks for online learning, such as stochastic gradient descent and stochastic approximation, instead ensure that information is extracted and stored as efficiently as possible and updated sequentially.
But, surely, you might argue, the batch paradigm has more computational resources and can therefore support superior algorithms, right? As a general statement, this argument makes sense, but in practice, several statistical phenomena conspire to make online algorithms extremely competitive. We only mention here a few, and promise to analyse further in a future post:
Finally, are streaming algorithms really "more complicated"? We think this question is moot. Some models admit simple algorithms whereas others don't, and the same holds for their streaming versions. Moreover, online learning is no longer as fragmented as it was ten years ago, since overarching frameworks such as Stochastic Approximation and Stochastic Gradient Descent have matured to the extent that it is now possible to reuse a lot of components in a streaming machine learning library.
Let's recap our key observations:
A typical wind turbine is equipped with around 100 sensors, each producing 100 datapoints per second. And yet communication can be patchy, 3G rare. CCTV networks can only afford to send back to the data center a small fraction of the captured videostreams. Robots in a manufacturing plant can sense at millisecond granularity, but the plant's SCADA infrastructure can handle perhaps 1 observation per minute (with 1 observation per 15 minutes being a typical sampling rate).
This is a very '2015' problem. As the IoT gets more commoditized and high-frequency, one Moore's law (cheaper sensors) goes against another (cheaper storage) and the outcome is not predetermined: in a large variety of applications it is now necessary to perform data reduction at the edge. Put simply, one reports less frequently than the capture rate, by performing some form of aggregation at the edge.
At Cisco Live in San Diego where we presented our solution for on-the-fly, on-the-edge Machine Learning, an IoX product manager put it nicely: "right now we have devices flooding the network with a constant stream of 'I am doing OK. I am doing OK. I am doing OK. I am doing OK ...'. I'd much rather just hear from them when they are in trouble". Mentat's anomaly detection technology can enable fruitful interactions between humans and the tsunami of data from the IoT.
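That report-by-exception idea can be sketched in a few lines of Python. A simple z-score detector stands in here for real anomaly detection, and the threshold is illustrative; the point is that the device stays silent while readings look normal and speaks up only when one strays from its running profile:

```python
def edge_filter(readings, threshold=3.0):
    """Report-by-exception at the edge: suppress the 'I am doing OK'
    heartbeat and report only anomalous readings. Welford's algorithm
    maintains an online mean/variance in constant memory."""
    count, mean, m2 = 0, 0.0, 0.0
    alerts = []
    for t, x in enumerate(readings):
        if count >= 2:
            var = m2 / (count - 1)
            if var > 0 and abs(x - mean) > threshold * var ** 0.5:
                alerts.append((t, x))   # 'I am in trouble'
        # Welford update: constant memory, one pass
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return alerts

alerts = edge_filter([10.0, 10.1, 9.9, 10.0, 10.1, 9.9, 25.0, 10.0])
```

Only the outlying reading generates traffic back to the data centre; the steady hum of normal observations never leaves the edge.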
Senior voices joined in during the Executive Symposium panel discussion, where Mala Anand, our Chief Data Scientist, and other prominent figures in Data Science addressed the future of data and machine learning. In her concluding remarks, Mala emphasised the need for data reduction on the edge as data sources such as video and high-frequency sensing from the IoT proliferate.
What is it that makes Streaming Edge analytics different from Traditional analytics, or Big Data analytics? There are three main differences:
1. Scale with Velocity (not just Volume). A Hadoop query that runs a little slow is a discomfort; a streaming query that runs too slow is a disaster, resulting either in data loss or a system crash (or both), as new observations pile in waiting to be processed. Moreover, most edge use cases involve the need for real-time actionable insights with short lifespan, where "delayed" is just as bad as "missed".
2. You can't store everything. By definition of the use case, some data will be lost. But the information contained in this data need not be lost, if your streaming analytics are done properly. The simplest example is a sum: it is trivial to compute it in a streaming fashion, via a cumulative sum that yields a 100% accurate answer without any data storage. Surprisingly many statistical insights can be extracted in this way.
3. You can't scale out easily (neither on CPU nor on RAM). Edge computing generally relies on a few small computers. Think hardware from around the year 2000. Forget about machine learning techniques that need GPU farms to get started.
These are the constraints we listed on our whiteboard in rainy South West London when we set out, a few months back, to design our demo for Cisco Live in sunny San Diego. And this is what we came up with.
The core idea came easily: we needed to deploy on a Raspberry Pi. These tiny, very low-cost devices are fairly close to the capabilities of a typical fog node. The Pi would receive a stream of data on one port at high frequency, process it in real time, and output the resulting analytical insights on another port at a lower frequency.
The challenge for the data replay was to stream the data at the same pace as the original dataset was recorded. To do so, we created a small tool coded in Elixir that would read a dataset (CSV format) and re-stream it over the network in real time, re-timing each observation with the current timestamp. The tool would detect the original time delta between two observations and make sure that we don’t emit a new observation too early or too late.
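The replay logic can be sketched as follows. The original tool was written in Elixir; this is a Python sketch, and the function names and the max_lag threshold are illustrative. The clock and sleep functions are injectable so the pacing and discard behaviour can be tested without real waiting:

```python
import time

def replay(rows, emit, max_lag=0.5, clock=time.monotonic, sleep=time.sleep):
    """Replay timestamped rows at their original pace, restamping each
    with the current time. `rows` are (original_ts_seconds, payload).
    Rows falling behind by more than `max_lag` seconds (e.g. because
    the scheduler put the process to sleep) are discarded."""
    start_wall, start_ts = clock(), rows[0][0]
    for ts, payload in rows:
        due = start_wall + (ts - start_ts)  # when this row should go out
        now = clock()
        if due > now:
            sleep(due - now)                # never emit too early
        elif now - due > max_lag:
            continue                        # too old: drop the record
        emit((clock(), payload))            # restamp with current time

# A fake clock verifies the pacing logic without actually waiting:
class FakeClock:
    def __init__(self, t=100.0): self.t = t
    def now(self): return self.t
    def sleep(self, dt): self.t += dt

fc, out = FakeClock(), []
replay([(0.0, "a"), (1.0, "b"), (2.5, "c")], out.append,
       clock=fc.now, sleep=fc.sleep)

# A stalling consumer causes the lagging second record to be dropped:
fc2, out2 = FakeClock(), []
def slow_emit(rec):
    out2.append(rec)
    fc2.sleep(2.0)  # simulate a long pause after each emit
replay([(0.0, "a"), (0.5, "b")], slow_emit, clock=fc2.now, sleep=fc2.sleep)
```

The second scenario mirrors the scheduler problem described below: when the process falls behind, stale records are discarded rather than emitted late.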
Since Linux’s process scheduler gives no guarantee about when your process will be put to sleep or for how long, we made sure to detect and discard any record that would be considered “too old”.
I/O is the usual bottleneck, and in our case reading the CSV dataset from disk was of course slow. Furthermore, as the process scheduler was putting us to sleep from time to time, we had to make sure we kept a buffer of CSV records.
We therefore had a bunch of different Elixir (Erlang) processes in charge of the entire pipeline: reading the CSV file, transforming each line into a new structured record, retiming the data and coercing some of the values, queuing the records, consuming the queue, and streaming the records.
In order to have a clear, visual and realistic demo, we decided to use a second Raspberry Pi to run the data-replay tool. This way we could very easily plug and unplug the Ethernet cable linking them to stop and resume the consumption and analysis of the data.
Since only part of the team could go to San Diego, we had to make sure that we could still deploy fixes and updates remotely.
We deployed an OpenVPN server and a private Docker registry in AWS, so that the technical team could deploy new containers with the latest version of the code, and we provisioned the Raspberry Pis with Chef to ease the setup. In just a couple of commands the latest versions were deployed remotely, halfway around the world.
So: one Raspberry Pi streaming 90 wind turbine sensor readings at 100 Hz each (9,000 datapoints per second in total) into our datastream.io core engine, running on a single CPU core of the second Pi, which then drove a custom dashboard built for the demo. Anomalies were detected on all sensor streams in real time, and our alert correlation technology gave a holistic view of the wind turbine's health. You can find more in our whitepapers about datastream.io here and about the wind turbine use case here.