PyData Global 2022

Detecting anomalous sequences using text processing methods
12-01, 11:30–12:00 (UTC), Talk Track II

Hello wait you talk see to can’t all my in!

Sounds weird, right?! Detecting abnormal sequences is a common problem.
Join my talk to see how this problem can be tackled with BERT, word2vec, and autoencoders, and learn how you can apply it to information security.


Dealing with sequences is always an interesting project, because each item has its own position in the sequence and there are connections between the items and their positions. One of the most common issues when working with sequences is anomalous sequences: sequences that don't fit the regular structure. Those sequences make no sense, add noise to the data, and disrupt the learning process.

The most common sequences are text sentences. One possible scenario for abnormal text sequences is speech-to-text transcription, where irrelevant noise is sometimes transcribed into nonsense sequences. If we want to build a model on that data, we need a way to identify and clean this irrelevant, anomalous data.

Detecting anomalous sequences can also apply to non-text sequences, such as sequences of actions or events. Those scenarios often come up in information security. For example, many organizations keep logs of actions performed on internal systems, and detecting suspicious sequences of actions can be crucial for spotting attacks or misuse of those systems.

In order to detect those sequences, we need to model the items in each sequence and understand their structure and connections. There are several word embedding algorithms for generating vector embeddings from sequences, such as word2vec and BERT. The next step, after creating the sequence embeddings, is detecting the anomalies. The algorithm we used for the anomaly detection phase is an autoencoder, which is trained on normal data only and then flags abnormal events, typically by their high reconstruction error.
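
Conceptually, the pipeline might look something like the following minimal sketch. It assumes gensim for word2vec and TensorFlow/Keras for the autoencoder; the toy event sequences and helper names are made up for illustration and are not from the talk itself.

```python
# Minimal sketch: word2vec embeddings + autoencoder anomaly scores (illustrative only).
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras import layers, models

# Toy "sequences" of tokens (could be words, or event codes from a system log).
normal_sequences = [["login", "read", "write", "logout"],
                    ["login", "read", "logout"]] * 50

# 1. Learn token embeddings with word2vec.
w2v = Word2Vec(normal_sequences, vector_size=32, window=3, min_count=1, epochs=20)

def embed(seq):
    # Mean-pool the token vectors into one fixed-size vector per sequence.
    return np.mean([w2v.wv[t] for t in seq], axis=0)

X = np.stack([embed(s) for s in normal_sequences])

# 2. Train an autoencoder on normal data only.
ae = models.Sequential([
    layers.Input(shape=(32,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(32),
])
ae.compile(optimizer="adam", loss="mse")
ae.fit(X, X, epochs=30, batch_size=16, verbose=0)

# 3. Score sequences by reconstruction error; unusually high error suggests an anomaly.
def anomaly_score(seq):
    v = embed(seq).reshape(1, -1)
    return float(np.mean((ae.predict(v, verbose=0) - v) ** 2))

print(anomaly_score(["login", "read", "logout"]))            # expected: low
print(anomaly_score(["logout", "write", "write", "login"]))  # expected: higher
```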

This pipeline has its challenges. For example, each sequence has a different length, so both the word embedding algorithm and the autoencoder need to be trained to learn the right structure across all possible lengths.
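
One common workaround for the varying lengths, shown purely as an illustration and not necessarily the approach used in the talk, is to pad or truncate every sequence of token vectors to a fixed size before feeding it to the autoencoder:

```python
# Illustrative helper: fix the sequence length by zero-padding or truncating.
import numpy as np

MAX_LEN, DIM = 10, 32  # assumed fixed sequence length and embedding size

def to_fixed_length(token_vectors, max_len=MAX_LEN, dim=DIM):
    mat = np.zeros((max_len, dim), dtype=np.float32)    # pad short sequences with zero vectors
    for i, vec in enumerate(token_vectors[:max_len]):   # truncate long sequences
        mat[i] = vec
    return mat.flatten()  # shape (max_len * dim,), ready for a dense autoencoder
```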

Join my talk and I’ll show you how to process your sequences with word embedding algorithms such as BERT and word2vec, and feed their output to autoencoders in order to create an anomaly detection model that flags suspicious sequences.


Prior Knowledge Expected

No previous knowledge expected

Liron has been a Data Scientist at PayPal for more than 5 years. She works on the cyber security threat oversight team in the information security organization. Her focus is finding machine learning solutions for information security problems in general, and for insider fraud threats in particular.

Liron holds an MSc in software and information systems engineering. In her thesis she researched user verification on mobile devices using sequences of touch gestures. The full paper of this thesis was published at the PAKDD 2018 conference, and an extended abstract of the research was published at the UMAP 2017 conference.