With the growing use of computers and portable devices connected to networks, the amount of data available for automatic processing is quickly increasing. One of the essential data-processing procedures is the detection and removal of anomalous samples, a problem called anomaly (outlier) detection. Formally, anomaly detection aims to identify samples that in some way deviate from the majority. Anomaly detection can be used as part of an overall data-processing pipeline, for example to identify measurement errors (faulty sensors), or it can be the only algorithm applied to the data. Examples of the latter are the detection of faulty behavior of an industrial process (an anomaly may indicate a malfunction), fraud detection in credit-card transactions, monitoring of health or environmental processes, etc.
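As a minimal illustration of "deviating from the majority" (a textbook toy rule, not the method developed later in this work), the sketch below flags samples whose z-score exceeds a threshold; the function name and threshold are our own choices:

```python
import statistics

def zscore_outliers(samples, threshold=3.0):
    """Toy anomaly detector: flag samples lying more than
    `threshold` standard deviations from the sample mean."""
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)
    return [x for x in samples if abs(x - mean) / stdev > threshold]
```

Such a rule only works when the majority of samples really do cluster around a single, fixed mean; the rest of this section explains why that assumption often fails in practice.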
In many application domains, data are acquired at very high speed and the time between data acquisition and the appropriate action is very limited (imagine the need to process tens of thousands of samples per minute). Despite the short time available for processing, anomaly detection still needs to be performed, for the reasons stated above. An example of such a domain is network intrusion detection, which aims to detect external attacks against the protected system. Clearly, if an attack is detected a day after it was carried out, the detection is of little value.
It is frequently the case that the probability distribution of normal (non-anomalous) samples is not stationary --- it changes with time. Consequently, the notion of anomaly changes as well: samples that are normal in one time period may be anomalous in another, and vice versa. This property makes the problem of anomaly detection even more difficult, but it occurs in many practical applications. To name a few: in network intrusion detection, the traffic on the network differs between office hours and night; in the monitoring of environmental processes, normality depends on the season; etc.
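One common way to cope with a drifting notion of normality is to estimate "normal" only from recent data. The sketch below (an illustration under that assumption; class and parameter names are hypothetical) keeps a sliding window of the latest samples and scores each new sample against that window rather than against all history:

```python
import statistics
from collections import deque

class SlidingWindowDetector:
    """Toy detector that adapts to a drifting distribution by
    estimating mean/stdev over only the most recent samples."""

    def __init__(self, window=50, threshold=3.0):
        self.window = deque(maxlen=window)  # old samples fall out automatically
        self.threshold = threshold

    def observe(self, x):
        """Return True if x is anomalous w.r.t. the recent window,
        then add x to the window."""
        anomalous = False
        if len(self.window) >= 2:
            mean = statistics.fmean(self.window)
            stdev = statistics.stdev(self.window)
            if stdev > 0 and abs(x - mean) / stdev > self.threshold:
                anomalous = True
        self.window.append(x)
        return anomalous
```

With such a detector, a value that is flagged as anomalous under one regime stops being flagged once the window has filled with samples from the new regime --- exactly the "normal here, anomalous there" behavior described above.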
Our research focuses on real-time anomaly detection in non-stationary environments, since we believe this topic is extremely important for many contemporary applications. Our approach relies on ensemble systems of very simple detectors. We favor this approach because simple detectors can be easily updated, which combats the concept drift: a detector that follows the drift does not raise alarms on data that is normal under the new conditions. This is an important feature of the system, because a low false-positive rate is essential for practical usability. For example, a network operator should not be flooded with false alarms, nor should a factory production line be stopped because the algorithm incorrectly evaluates the current conditions. In short, false alarms are expensive, and an excessively high false-alarm rate makes a system practically unusable. A side benefit of simple detectors is that classification is very fast, making the system near real-time.
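To make the idea of an ensemble of simple, cheaply updatable detectors concrete, the sketch below combines several windowed z-score detectors by majority vote. This is a self-contained toy under our own naming and design choices, not the actual system; the point it demonstrates is that each member is cheap both to evaluate and to update, so the whole ensemble can follow a drifting distribution sample by sample:

```python
import statistics
from collections import deque

class WindowedZScore:
    """One very simple detector: z-score over a sliding window."""

    def __init__(self, window, threshold=3.0):
        self.buf = deque(maxlen=window)
        self.threshold = threshold

    def score(self, x):
        # Vote "anomalous" if x is far from the recent window's mean.
        if len(self.buf) < 2:
            return False
        mean = statistics.fmean(self.buf)
        stdev = statistics.stdev(self.buf)
        return stdev > 0 and abs(x - mean) / stdev > self.threshold

    def update(self, x):
        # Cheap incremental update: just push the sample into the window.
        self.buf.append(x)

class Ensemble:
    """Majority vote over several simple detectors, each updated
    after every sample so the ensemble tracks concept drift."""

    def __init__(self, detectors):
        self.detectors = detectors

    def observe(self, x):
        votes = sum(d.score(x) for d in self.detectors)
        for d in self.detectors:
            d.update(x)
        return votes > len(self.detectors) / 2
```

Using members with different window lengths, e.g. `Ensemble([WindowedZScore(w) for w in (10, 20, 40)])`, lets the ensemble balance fast adaptation (short windows) against stable estimates (long windows), which helps keep the false-alarm rate down.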
Anomaly detection is tightly connected to behavior modelling, which deals with describing the usual behavior of individuals, industrial processes, etc. Identifying anomalies in behavior is important in security, for example to select persons at airports for deeper inspection. Our research here focuses on Twitter, where we model the daily habits of users, the topics of their interest, etc. The goal is to identify duplicate identities, anomalously behaving users, users who influence other users, etc.