Technical Report MSC-2021-36

TR#: MSC-2021-36
Class: MSC
Title: On Anomaly Detection in Tabular Data
Authors: Igor Margulis
Supervisors: Ran El-Yaniv, Yuval Filmus
PDF: Currently accessible only within the Technion network
Abstract: We consider the problem of anomaly detection in tabular data and present a modular framework for anomaly detection based on classification of self-labelled data. Given a set of records, all considered as belonging to a “normal” class (e.g., measurements corresponding to some physical phenomenon of interest), we demonstrate how a deep neural model appropriate for the classification of tabular data can be incorporated into the detection scheme for sorting out anomalous records (e.g., measurements corresponding to some background signal).

Tables are among the most common ways of organizing data, which makes anomaly detection in tabular data important in practice. The task is challenging because tabular data is heterogeneous: it spans many facets of real-world phenomena, captured by measurements and arranged in the form of tables.

The standard and intuitive approach to the problem of anomaly detection is to learn a model of normality. Once the system has learned the normal patterns, it can identify non-conforming patterns and declare them to be anomalies.

Classic approaches to anomaly detection usually do not perform well on high-dimensional data, and tabular data is often high-dimensional in practice. For example, the medical record of a patient can include hundreds of measured parameters covering blood analysis, immune system status, genetic background, nutrition, alcohol and tobacco consumption, treatments, and diagnosed diseases. To circumvent this issue, many recent approaches first reduce the dimensionality of the data and then apply anomaly detection techniques in the low-dimensional representation space.
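As a rough illustration of the dimensionality-reduction family of methods mentioned above (not the scheme proposed in this thesis), the sketch below fits PCA on "normal" records and scores new records by their reconstruction error. The function names `fit_pca` and `reconstruction_error` are hypothetical, and PCA is only one assumed choice of reduction mechanism.

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit PCA on normal records; return the mean and the top principal axes."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def reconstruction_error(X, mean, components):
    """Squared error after projecting onto the low-dimensional subspace and back."""
    Z = (X - mean) @ components.T        # low-dimensional representation
    X_hat = Z @ components + mean        # reconstruction in the original space
    return ((X - X_hat) ** 2).sum(axis=1)

# Toy data (an assumption for illustration): normal records lie near a line in 3D.
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X_normal = t * np.array([1.0, 2.0, 3.0]) + 0.01 * rng.normal(size=(200, 3))
mean, components = fit_pca(X_normal, n_components=1)

normal_errors = reconstruction_error(X_normal, mean, components)
outlier_error = reconstruction_error(np.array([[5.0, -5.0, 0.0]]), mean, components)
```

A record with a large reconstruction error does not fit the learned low-dimensional structure and would be flagged as a candidate anomaly; the threshold is application-specific.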

In contrast to these methods, the main idea behind the classification-based scheme presented in this thesis is to train a multiclass classifier to distinguish between several dozen transformations applied to all the given “normal” records.
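The self-labelling step can be sketched as follows. The abstract does not state which transformations are used, so the random affine maps here are an assumption made purely for illustration, and `make_self_labelled` is a hypothetical helper name. Each record is transformed by every map, and the index of the map becomes the class label.

```python
import numpy as np

def make_self_labelled(X, n_transforms=32, out_dim=None, seed=0):
    """Build a self-labelled dataset from normal records X of shape (n, d).

    Assumption: transformations are random affine maps x -> x @ W[k] + b[k].
    Returns transformed records, their transformation labels, and (W, b).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    out_dim = out_dim or d
    W = rng.normal(size=(n_transforms, d, out_dim))
    b = rng.normal(size=(n_transforms, out_dim))
    # Apply every transformation to every record: shape (n_transforms, n, out_dim).
    XT = np.einsum("nd,kdo->kno", X, W) + b[:, None, :]
    # Label each transformed record with the index of its transformation.
    y = np.repeat(np.arange(n_transforms), n)
    return XT.reshape(n_transforms * n, out_dim), y, (W, b)

X = np.arange(12, dtype=float).reshape(4, 3)   # four toy "normal" records
XT, y, (W, b) = make_self_labelled(X, n_transforms=5, seed=1)
```

Any multiclass classifier suited to tabular data can then be trained on `(XT, y)`; no anomalous examples are needed, since the labels come from the transformations themselves.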

The data representation learned by the model turns out to be useful for identifying anomalous records at test time. Detection is based on one of two kinds of statistics: statistics of the softmax activations of the classification model (i.e., the output of the classifier, which represents the probability that an input record belongs to each class) when applied to transformed records, or distance-based statistics computed between the learned representation of an input record and the cluster centers of the learned representations. To validate our solution, we present experiments using the proposed framework.
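The two kinds of test-time statistics can be sketched as below. The exact statistics used in the thesis are not given in the abstract, so both scoring rules are assumptions: each sums the negative log-probability assigned to the correct transformation, once from classifier logits and once from distances to per-transformation cluster centers. The names `softmax_score` and `distance_score` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_score(logits):
    """logits[i] holds the classifier logits for the record under transformation i.

    Higher score = the classifier recognises the transformations less well.
    """
    p_correct = np.diag(softmax(logits))
    return -np.log(p_correct + 1e-12).sum()

def distance_score(reps, centers):
    """reps[i]: representation of the record under transformation i;
    centers[j]: cluster center of normal representations for transformation j."""
    # Squared distance from every representation to every center: shape (k, k).
    d = ((reps[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    # Closer to its own center than to the others => high pseudo-probability.
    p_correct = np.diag(softmax(-d))
    return -np.log(p_correct + 1e-12).sum()

k = 4
confident = softmax_score(10.0 * np.eye(k))     # transformations clearly recognised
uncertain = softmax_score(np.zeros((k, k)))     # uniform softmax output

centers = 5.0 * np.eye(k)
near = distance_score(centers.copy(), centers)          # each rep at its own center
far = distance_score(np.roll(centers, 1, axis=0), centers)  # each rep at a wrong center
```

Under both rules, a record drawn from the "normal" class is expected to receive a low score, while a record whose transformed versions the model cannot tell apart receives a high score.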

Copyright: The above paper is copyrighted by the Technion, the author(s), or others. Please contact the author(s) for more information.

Remark: Any link to this technical report should be to this page (http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-info.cgi/2021/MSC/MSC-2021-36), rather than to the URL of the PDF files directly. The latter URLs may change without notice.


Computer science department, Technion