Spam Identification using Machine Learning and Homomorphic Encryption
Project Abstract
Spam emails are unsolicited, unwanted, and frequently malicious messages sent in bulk to many recipients. Traditional techniques for recognizing and filtering spam face several challenges. Spammers continually change their strategies, making it difficult for rule-based filters to keep up with new approaches. Spam emails use tactics such as image-based content to evade standard filters that look for specific keywords or patterns. Spammers also alter email headers, such as sender addresses and domains, so that messages appear to come from legitimate senders and slip past traditional filters. These problems highlight the need for modern approaches such as machine learning algorithms, which can learn from large amounts of data to identify patterns and make predictions. Homomorphic encryption is a cryptographic technique that allows computations to be performed on encrypted data without decrypting it first. This opens new possibilities for data privacy, as sensitive information can remain encrypted even during processing. This project relies on three classifiers: Multinomial Naive Bayes, support vector machine, and logistic regression. Training and testing are carried out on encrypted data. Once the data has been pre-processed, the input features and labels are encrypted using a Python implementation of the Paillier encryption scheme. This ensures that the sensitive training data stays encrypted during the training process. The project seeks to reduce the growing prevalence of spam emails and unwanted adverts while maintaining data privacy and security. Its findings will contribute to the field of email security and privacy for both individuals and organizations.
Organizations can reduce their operational costs by controlling spam emails, which consume network bandwidth, storage space, and computational resources.
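As an illustration of computing on encrypted data, the following minimal Python sketch implements a toy version of the Paillier scheme and evaluates a linear spam score on encrypted word counts. It is not the project's pipeline: the tiny hard-coded primes, the word counts, the integer weights, the bias, and the helper names (`encrypt`, `decrypt`, `add`, `scale`, `encrypted_score`) are all assumptions made for this example. Paillier supports only addition of ciphertexts and multiplication of a ciphertext by a plaintext constant, so the sketch assumes plaintext model weights and encrypted features.

```python
import math
import random

# Toy Paillier key material. The primes are tiny and hard-coded for
# readability (an assumption for this sketch); they offer no security.
P, Q = 1009, 1013
N = P * Q
N_SQ = N * N
G = N + 1                                              # standard generator choice
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)      # lcm(P-1, Q-1)


def _l(x):
    """Paillier's L function: L(x) = (x - 1) / N."""
    return (x - 1) // N


MU = pow(_l(pow(G, LAM, N_SQ)), -1, N)                 # modular inverse (Python 3.8+)


def encrypt(m):
    """Encrypt an integer m; negative values are stored modulo N."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m % N, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def decrypt(c):
    """Decrypt and map the result back to a signed integer."""
    m = (_l(pow(c, LAM, N_SQ)) * MU) % N
    return m - N if m > N // 2 else m


def add(c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % N_SQ


def scale(c, k):
    """Raising a ciphertext to k multiplies the underlying plaintext by k."""
    return pow(c, k % N, N_SQ)


def encrypted_score(enc_features, weights, bias):
    """Compute Enc(bias + sum(w_i * x_i)) without decrypting any x_i."""
    total = encrypt(bias)
    for c, w in zip(enc_features, weights):
        total = add(total, scale(c, w))
    return total


# Hypothetical word counts for one email and hypothetical integer weights.
counts = [3, 0, 2, 1]
weights = [5, -2, 4, -7]
bias = -6

enc_counts = [encrypt(x) for x in counts]              # done by the data owner
score = decrypt(encrypted_score(enc_counts, weights, bias))
print(score)  # 10
```

In this sketch a positive score would flag the email as spam. A practical system would likely rely on a vetted Paillier library with keys of 2048 bits or more, and would need a fixed-point encoding to handle real-valued weights.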
Keywords: Spam Identification, Machine Learning, Homomorphic Encryption, Classification Models
Conference Details
Session: Presentation Stream 14 at Presentation Slot 2
Location: GH043 on Wednesday 8th, 09:00 – 12:30
Markers: Alma Rahat, Deshan Sumanathilaka (GTA)
Course: MSc Data Science, Masters PG
Future Plans: I’m looking for work