How LinkedIn uses DL to detect abusive accounts
Companies are adopting emerging machine learning technologies at breakneck speed to reduce manual labor and process data efficiently. LinkedIn is the largest platform for professional and employment-oriented services and has for years used AI/ML to optimize various processes on the platform. As of March 2021, LinkedIn reported more than 740 million users in more than 200 countries and territories around the world.
LinkedIn’s anti-abuse AI team works to create, deploy, and maintain AI models to detect and prevent abuse on the platform. Platforms like LinkedIn are prone to abuse such as creating fake accounts, scraping member profiles, automated spamming, and account takeovers.
The team had to overcome three challenges:
- Attackers adapt and evolve rapidly in response to anti-abuse defenses, so LinkedIn's tools for detecting adversarial behavior must be updated constantly.
- The defense spans several heterogeneous parts of the website, each of which must be protected from attackers.
- Signal must be maximized, because standard features do not fully exploit the signal available in member activity patterns.
To overcome these challenges, the team created a deep learning (DL) model that operates directly on raw sequences of member activity. The model exploits the signal hidden in this data to counter adversarial attacks.
The model was used to detect logged-in accounts that scrape member profile data. Scraping is not inherently harmful: search engines, for example, are permitted to scrape in order to collect and index information on the Internet. When done without permission, however, it is a harmful practice.
Unauthorized scrapers automate logged-in LinkedIn accounts; that is, they retrieve information that is visible only when logged in to a member account. The model looks for signals of bot-like activity and classifies user behavior sequences as automated or not. The team also uses outlier detection to identify non-human activity.
Activity sequence modeling technique
The activity sequence is a standardized dataset encapsulating the sequence of requests that members make on LinkedIn; in essence, it captures member activity patterns. As LinkedIn's blog post explains: “When a member visits LinkedIn, the member’s web browser makes numerous requests to LinkedIn’s servers; each request includes a path identifying the part of the site that the member’s browser intends to access.” The sequence can be thought of as a “sentence” describing the member’s LinkedIn activity.
An illustration of LinkedIn’s organization of member requests in a sequence including information about the type of request, the order of requests, and the time between requests.
Request path standardization translates each specific request path into a standardized token indicating the type of request. For example, a request to linkedin.com/in/jamesverbus/ is translated into a token representing a profile view.
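The standardization step described above can be sketched in a few lines of Python. This is only an illustration: the regular-expression rules and the token names (`PROFILE_VIEW`, `SEARCH`, `FEED`, `OTHER`) are hypothetical and are not taken from LinkedIn's system.

```python
import re

# Hypothetical rules: (pattern applied to the path, standardized token).
# LinkedIn's real rule set and token vocabulary are not public.
RULES = [
    (re.compile(r"^/in/[^/]+/?$"), "PROFILE_VIEW"),
    (re.compile(r"^/search/.*$"), "SEARCH"),
    (re.compile(r"^/feed/?$"), "FEED"),
]


def standardize_path(url: str) -> str:
    """Map a specific request URL to a token naming the type of request."""
    # Drop an optional scheme and the host, keeping only the path.
    path = re.sub(r"^(?:https?://)?[^/]*", "", url)
    # Ignore any query string.
    path = path.split("?", 1)[0]
    for pattern, token in RULES:
        if pattern.match(path):
            return token
    return "OTHER"


profile_token = standardize_path("linkedin.com/in/jamesverbus/")
search_token = standardize_path("https://www.linkedin.com/search/results/people/?keywords=ml")
```

Every profile URL collapses to the same token regardless of whose profile it is, so the model sees the type of request rather than the member-specific path.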
An integer mapping step then converts the standardized request paths to integers based on the frequency of each request path, which indicates how common a specific type of request is. In the visualized activity sequences, requests are color-coded by these values, making it easier for the human eye to spot the homogeneous patterns typical of abusive activity.
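A minimal sketch of such a frequency-based integer mapping follows. It assumes that more frequent tokens receive smaller integers and that 0 is reserved for unseen tokens; both conventions, along with the token names, are illustrative assumptions rather than details from the source.

```python
from collections import Counter


def build_vocabulary(tokens):
    """Assign integers by descending frequency; 0 is reserved for unknowns."""
    counts = Counter(tokens)
    # Sort by count (descending), then by name so ties are deterministic.
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(ordered, start=1)}


def encode(tokens, vocabulary):
    """Turn a token sequence into the integer sequence fed to a model."""
    return [vocabulary.get(token, 0) for token in tokens]


# Hypothetical standardized tokens for one member session.
session = ["PROFILE_VIEW", "PROFILE_VIEW", "SEARCH", "PROFILE_VIEW", "FEED", "SEARCH"]
vocabulary = build_vocabulary(session)
encoded = encode(session, vocabulary)
```

In practice the vocabulary would be built from a large corpus of sessions rather than from the single sequence being encoded.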
Comparison of 200 requests made by a non-abusive member and by an abusive member. Colors represent how frequently a specific type of request occurs.
NLP techniques help classify these sequences: member requests and user actions are treated as tokens that form the sequence, which is then classified as abusive or non-abusive. After processing the request path sequence data, the team relies on a supervised long short-term memory (LSTM) model to produce abuse scores.
The scores also draw on the sequence of time differences between consecutive requests. LinkedIn's policy states: “If we receive an unusually high number of page requests or detect patterns that indicate the use of an automated tool, we may suspend or restrict this account.”
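The timing signal mentioned above, the time difference between consecutive requests, can be derived from request timestamps as in the sketch below. The millisecond unit and the choice to give the first request a gap of 0 (so the output aligns one-to-one with the request sequence) are assumptions made for illustration.

```python
def inter_request_deltas(timestamps_ms):
    """Return the gap in milliseconds before each request; the first gets 0."""
    if not timestamps_ms:
        return []
    deltas = [0]
    for previous, current in zip(timestamps_ms, timestamps_ms[1:]):
        deltas.append(current - previous)
    return deltas


# Hypothetical request times for one session, in milliseconds.
timestamps_ms = [1000, 1500, 1520, 9000]
deltas = inter_request_deltas(timestamps_ms)
```

Very short and very regular gaps are the kind of pattern an automated tool tends to produce, which is why this sequence is a useful companion to the request-type sequence.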
The last step before training is to prepare the training labels according to the type of abuse to be detected. Unsupervised outlier detection based on LinkedIn’s Isolation Forest library generates the labels used to train the model.
Isolation Forest Library
The library implements isolation forests, an unsupervised outlier detection algorithm. It builds an ensemble of randomly generated binary trees that nonparametrically capture the distribution of the multidimensional features in the training dataset. Because outliers are “few and different”, they are easier to isolate in leaf nodes and require fewer random splits, which results in a shorter expected path length from root node to leaf node. This property makes isolation forests an effective unsupervised outlier detection algorithm.
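To make the path-length idea concrete, here is a small pure-Python sketch of the isolation forest technique. It is not the API of LinkedIn's library; the function names, the synthetic data, and the simplification of building every tree on the full dataset (rather than on subsamples) are assumptions for illustration.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary search tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, point, depth=0):
    """Depth at which the point lands, adjusted for unsplit leaf size."""
    if "size" in tree:
        return depth + c_factor(tree["size"])
    branch = "left" if point[tree["feature"]] < tree["split"] else "right"
    return path_length(tree[branch], point, depth + 1)


def anomaly_score(forest, point, sample_size):
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    mean_depth = sum(path_length(t, point) for t in forest) / len(forest)
    return 2.0 ** (-mean_depth / c_factor(sample_size))


rng = random.Random(0)
inliers = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(63)]
outlier = (10.0, 10.0)  # synthetic point far from the cluster
data = inliers + [outlier]
max_depth = math.ceil(math.log2(len(data)))
forest = [build_tree(data, 0, max_depth, rng) for _ in range(100)]
outlier_score = anomaly_score(forest, outlier, len(data))
inlier_score = anomaly_score(forest, inliers[0], len(data))
```

The distant point is usually separated within the first one or two random splits, so its score is higher than that of a point inside the cluster. Scores like these could then be thresholded to produce training labels, though the source does not specify how its labels are derived from the scores.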
Example of isolation tree
Activity sequence modeling helps solve anti-abuse problems by detecting abusive behavior, countering adversarial attackers, and providing a generalizable, scalable modeling approach that applies to various attack surfaces.