Spam Classifier

5 min read · Apr 1, 2021

I recently built this project with my teammates during a college hackathon, and it won us the competition. In this post I share the details of how we built it, so let's dive in…

In recent times, unwanted commercial bulk email, called spam, has become a huge problem on the internet. The person sending the spam is referred to as the spammer. Spammers gather email addresses from websites, chatrooms, and viruses. Spam wastes users' time, storage capacity, and network bandwidth, and the huge volume of spam flowing through computer networks strains the memory of email servers, communication bandwidth, and CPU power. The volume of spam grows every year and is responsible for over 77% of global email traffic.

Users find emails they did not request very irritating. Spam also causes untold financial loss to users who fall victim to internet scams and other fraud, in which spammers send emails pretending to be from reputable companies in order to persuade people to disclose sensitive personal information such as passwords, Bank Verification Numbers (BVN), and credit card numbers.

•According to statistics, 40% of the emails we receive daily are spam, excluding social and promotional emails.


Model Overview

Let’s start with our spam detection data. We use the open-source Spambase dataset from the UCI Machine Learning Repository, which contains 5569 emails, of which 745 are spam. The target variable for this dataset is ‘spam’: a spam email is mapped to 1 and anything else is mapped to 0. The target variable is what you are trying to predict; in machine learning problems, its value is modeled and predicted from the other variables.
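As a minimal sketch of the label mapping described above, the snippet below assumes the raw labels arrive as strings such as "spam" and "ham" (an assumption; the article only says that spam maps to 1 and everything else to 0). The function name `encode_labels` is hypothetical. It also computes the share of spam from the counts quoted above, which shows how imbalanced the two classes are.

```python
def encode_labels(raw_labels):
    """Map 'spam' to 1 and any other label to 0."""
    return [1 if label.strip().lower() == "spam" else 0 for label in raw_labels]


# Hypothetical raw labels; the real dataset's label format may differ.
raw = ["ham", "spam", "ham", "Spam "]
targets = encode_labels(raw)
print(targets)

# Share of spam using the counts quoted in the article (745 of 5569).
spam_fraction = 745 / 5569
print(f"{spam_fraction:.1%} of the emails are spam")
```

Because only a small share of the emails is spam, a model that always predicts 0 would still look accurate, which is one reason to look beyond accuracy when evaluating the classifier.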

Data usually comes from a variety of sources and often in different formats, so transforming the raw data is essential. This transformation is not a simple process, as text data often contains redundant and repetitive words. Processing the text data is therefore the first step in our solution. The fundamental steps in text preprocessing are cleaning the raw data and tokenizing the cleaned data.
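The two preprocessing steps can be sketched as follows. This is an illustration, not the project's actual pipeline: the cleaning rules (lowercasing, dropping URLs, keeping only letters) and the function names `clean_text` and `tokenize` are assumptions.

```python
import re


def clean_text(text):
    """Lowercase, drop URLs, and replace non-letter characters with spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"[^a-z\s]", " ", text)       # keep letters and whitespace only
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace


def tokenize(text):
    """Split cleaned text into word tokens on whitespace."""
    return text.split()


sample = "WIN a FREE prize!!! Visit http://example.com now"
tokens = tokenize(clean_text(sample))
print(tokens)  # ['win', 'a', 'free', 'prize', 'visit', 'now']
```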

Feature extraction is a general term for methods that construct combinations of the variables to get around the redundancy in raw data while still describing the data with sufficient accuracy. Many machine learning practitioners believe that properly optimized feature extraction is the key to effective model construction.
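One simple form of feature extraction for text is a bag-of-words count vector. The article does not say which method the project used, so treat the sketch below as an illustrative assumption; `build_vocabulary` and `bag_of_words` are hypothetical names.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign each distinct token a column index, in sorted order."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    # Dicts preserve insertion order, so columns follow the vocabulary indices.
    return [counts.get(tok, 0) for tok in vocabulary]


docs = [["free", "prize", "free"], ["meeting", "at", "noon"]]
vocabulary = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocabulary) for doc in docs]
print(vocabulary)  # {'at': 0, 'free': 1, 'meeting': 2, 'noon': 3, 'prize': 4}
print(vectors)     # [[0, 2, 0, 0, 1], [1, 0, 1, 1, 0]]
```

Each email becomes a fixed-length vector regardless of how long the original text was, which is what most classifiers expect as input.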

In machine learning, scoring is the process of applying a model built from a historical dataset to a new dataset in order to uncover practical insights that help solve a business problem. Text processing is the automated process of analyzing text data to extract structured information.

Text generation is a subfield of natural language processing. It leverages knowledge in computational linguistics and artificial intelligence to automatically generate natural language texts, which can satisfy certain communicative requirements.

Model selection is the process of selecting one final machine learning model from among a collection of candidate models for a training dataset.

Text data is easily interpreted by humans, but for machines, reading and analyzing it is a very complex task. To accomplish this, we need to convert our text into a machine-understandable format. Embedding is the process of converting formatted text data into numerical values or vectors that a machine can interpret.
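Before an embedding can map words to vectors, each token is usually replaced by an integer id and every email is padded to the same length. The sketch below shows that step under a few assumptions that are not from the article: id 0 is reserved for padding, id 1 for unknown words, and padding is added at the end. The function names are hypothetical.

```python
def build_word_index(tokenized_docs):
    """Map each word to an integer id; 0 is padding and 1 is 'unknown'."""
    index = {}
    for doc in tokenized_docs:
        for tok in doc:
            if tok not in index:
                index[tok] = len(index) + 2  # ids start at 2
    return index


def to_padded_sequence(tokens, word_index, max_len):
    """Convert tokens to ids, truncate to max_len, then pad with zeros."""
    ids = [word_index.get(tok, 1) for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


train_docs = [["free", "prize", "now"], ["see", "you", "now"]]
word_index = build_word_index(train_docs)
sequence = to_padded_sequence(["free", "money", "now"], word_index, max_len=5)
print(word_index)  # {'free': 2, 'prize': 3, 'now': 4, 'see': 5, 'you': 6}
print(sequence)    # [2, 1, 4, 0, 0]
```

An embedding layer in a deep learning model would then look up a learned vector for each id in the padded sequence.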

Different performance metrics are used to evaluate different machine learning algorithms. The key metrics for evaluating a classification model are accuracy, recall, precision, and F1-score. Accuracy is defined as the percentage of correct predictions on the test data. It is calculated by dividing the number of correct predictions by the total number of predictions.
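The four metrics can be computed directly from the counts of true and false positives and negatives. The sketch below uses made-up labels purely for illustration; they are not results from the project, and `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham kept
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only, not the project's results.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy example six of eight predictions are correct, so accuracy is 0.75, while precision and recall are both 2/3 because one spam email was missed and one legitimate email was flagged.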

Factors for classifying mails:

Many factors affect how we classify emails. Some of them are:

Mail sender: If a sender's address appears repeatedly on promotional or advertising emails, we can simply classify their emails as spam.

Content of mail: Using text processing, we can analyze all types of keywords; sentiment analysis is also useful here.

Subject of mail: When the subject is promotional or an advertisement, for example "70% off for this festive season", we can classify the email as spam from the subject alone.

Received time of mail: This is also an important factor, because most advertising emails arrive during festive or sales periods.
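The sender, subject, and received-time factors above could be turned into simple features along the following lines. Everything here is a hypothetical sketch: the keyword pattern, the sale window, the email fields, and the function name `extract_factors` are assumptions for illustration, and the content factor is left to the text processing pipeline.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical list of promotional keywords for the subject line.
PROMO_PATTERN = re.compile(r"\d+% off|sale|discount|offer", re.IGNORECASE)


def extract_factors(email, sender_counts, sale_periods):
    """Turn one email (a dict) into sender, subject, and time features."""
    received = email["received"]
    return {
        "sender_repeats": sender_counts[email["sender"]],
        "promo_subject": bool(PROMO_PATTERN.search(email["subject"])),
        "in_sale_period": any(start <= received <= end for start, end in sale_periods),
    }


emails = [
    {"sender": "deals@shop.example", "subject": "70% off for this festive season",
     "received": date(2021, 11, 4)},
    {"sender": "deals@shop.example", "subject": "Last chance: festive offer",
     "received": date(2021, 11, 5)},
    {"sender": "friend@mail.example", "subject": "Lunch tomorrow?",
     "received": date(2021, 3, 2)},
]
sender_counts = Counter(e["sender"] for e in emails)
sale_periods = [(date(2021, 11, 1), date(2021, 11, 7))]  # hypothetical sale window

factors = [extract_factors(e, sender_counts, sale_periods) for e in emails]
for f in factors:
    print(f)
```

Features like these could be fed to a classifier alongside the text vectors rather than used as hard rules, since a legitimate email can also arrive during a sale period.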

Some of our output screens are:

Conclusion:

We created a spam detection model by converting text data into vectors, building a BiLSTM model, and fitting the model to the vectors. We also explored a variety of text processing techniques, text sequencing techniques, and deep learning models, namely RNN, LSTM, and BiLSTM.

Precision and recall are the two most widely used performance metrics for classification problems. Precision is the fraction of relevant instances among all the retrieved instances, while recall is the fraction of all relevant instances that were retrieved.

When applying a model like this to real-world data, we still need to actively monitor its performance over time. We can also continue to improve the model in response to results and feedback, for example by adding features and removing misspelled words. The concepts and techniques covered in this article can be applied to a variety of natural language processing problems, such as building chatbots, text summarization, and language translation models.

To download the code along with the report and presentation, get it here (Code).


Written by Mandladharanireddy