Anomaly Detection in Video

We propose a new method for anomaly detection of human actions. Our method works directly on human pose graphs that can be computed from an input video sequence, which makes the analysis independent of nuisance parameters such as viewpoint or illumination. We map these graphs to a latent space and cluster them. Each action is then represented by its soft-assignment to each of the clusters. This gives a kind of "bag of words" representation of the data, in which every action is represented by its similarity to a group of base action-words. We then use a Dirichlet-process-based mixture model, which is well suited to proportional data such as our soft-assignment vectors, to determine whether an action is normal. We evaluate our method on two types of data sets. The first is a fine-grained anomaly detection data set (e.g., ShanghaiTech), where we wish to detect unusual variations of some action. The second is a coarse-grained anomaly detection data set (e.g., a Kinetics-based data set), where a few actions are considered normal and every other action should be considered abnormal. Extensive experiments on these benchmarks show that our method performs considerably better than other state-of-the-art methods.
Given a sequence of video frames, we use a pose estimation method to extract the keypoints of every person in each frame. Every person in a clip is represented as a temporal pose graph. We use a combination of an autoencoder and a clustering branch to map the training samples into a latent space in which the samples are soft-clustered. Each sample is then represented by its soft-assignment to each of the k clusters. This can be understood as learning a bag-of-words representation for actions: each cluster corresponds to an action-word, and each action is represented by its similarity to each of the action-words. Figure 1 gives an overview of our method.
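
The soft-assignment step can be illustrated with a small sketch. The text above does not state which kernel turns distances in the latent space into cluster probabilities, so this example assumes a Student's t kernel, a common choice in deep clustering. The names soft_assign, centroids and alpha are hypothetical, and the latent vectors are toy values rather than real encoder outputs.

```python
def soft_assign(z, centroids, alpha=1.0):
    """Return p_k for one latent vector z: its soft-assignment to each cluster k.

    Assumption: a Student's t kernel on squared Euclidean distance.
    The kernel actually used by the method is not given in the text.
    """
    weights = []
    for c in centroids:
        # Squared Euclidean distance between z and this cluster centroid.
        d2 = sum((zi - ci) ** 2 for zi, ci in zip(z, c))
        # Heavy-tailed similarity: equals 1 at the centroid, decays with distance.
        weights.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
    total = sum(weights)
    # Normalise so the k similarities form a probability vector.
    return [w / total for w in weights]


# Toy example: three action-words in a 2-D latent space.
centroids = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
p = soft_assign([0.5, 0.2], centroids)
print(p)
```

The resulting vector has one entry per cluster and sums to one, which is what makes it proportional data: the sample is described by how strongly it resembles each action-word rather than by a single hard label.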

Figure 1. Model Diagram (Inference Time): To score a video, we first perform pose estimation. The extracted poses are encoded using the encoder part of a spatio-temporal graph autoencoder (ST-GCAE), resulting in a latent vector. The latent vector is soft-assigned to clusters using a deep clustering layer, with p_ik denoting the probability of sample x_i being assigned to cluster k.
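
Once a sample has been reduced to its soft-assignment vector, the final step is to decide whether it is normal by evaluating it under a mixture model suited to proportional data. The sketch below is a simplified stand-in, not the authors' model: it scores a vector by its log-likelihood under a small, fixed mixture of Dirichlet distributions whose weights and parameters are hand-set, hypothetical values. Fitting the mixture, and the Dirichlet process prior that lets the number of components adapt to the data, are omitted.

```python
import math


def dirichlet_logpdf(p, alpha):
    """Log-density of a Dirichlet(alpha) distribution at probability vector p.

    Requires every entry of p to be strictly positive.
    """
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(x) for a, x in zip(alpha, p))


def normality_score(p, components):
    """Log-likelihood of p under a mixture given as (weight, alpha) pairs.

    Higher values mean p resembles the training data more closely;
    low values suggest an anomaly.
    """
    logs = [math.log(w) + dirichlet_logpdf(p, a) for w, a in components]
    m = max(logs)
    # Log-sum-exp for numerical stability.
    return m + math.log(sum(math.exp(v - m) for v in logs))


# Hypothetical mixture: normal actions concentrate on cluster 0 or cluster 1.
components = [(0.6, [8.0, 1.0, 1.0]), (0.4, [1.0, 8.0, 1.0])]

typical = [0.8, 0.1, 0.1]   # mostly action-word 0, as seen in training
unusual = [0.1, 0.1, 0.8]   # dominated by an action-word rare in normal data

print(normality_score(typical, components))
print(normality_score(unusual, components))
```

In this toy setting the typical vector receives a much higher score than the unusual one, so thresholding the score separates them. How the threshold is chosen is not described in the text.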


Extensive experiments show that we achieve state-of-the-art results on ShanghaiTech, one of the leading fine-grained anomaly detection data sets. We also outperform existing unsupervised methods on our new coarse-grained anomaly detection test.
