Cyberbullying Detection Using K-Nearest Neighbour and Logistic Regression

Cyberbullying, also known as online bullying, is a type of bullying carried out through digital technologies such as social media, messaging platforms, and other online channels. It poses significant challenges to individuals and society: victims may suffer depression, ill health, emotional disorders, low self-esteem, and physical violence, and some may be driven to suicide. Effectively detecting harmful, offensive, or abusive content is therefore critical to mitigating these effects. K-Nearest Neighbour (KNN) is a non-parametric learning algorithm that makes no assumptions about the underlying data distribution, which makes it particularly suited to complex or non-linear decision boundaries. Given a suitable distance metric, KNN works with both numerical and categorical features, which is advantageous when combining lexical, syntactic, and semantic features of text. In this work, KNN was employed as a supervised learning algorithm that uses a labelled database to predict whether a piece of online content is cyberbullying or normal. Similarities and dissimilarities between the words in a piece of content and the words in the database were measured using the Hamming distance metric, while binary logistic regression was used to estimate the probability of a "suspicious post" given the values of the suspicious words in a post. The proposed framework integrates these two machine learning techniques to enhance predictive performance. It was trained on a benchmark dataset of social media texts from Twitter and Reddit, obtained from kaggle.com (a data repository website), and evaluated using standard metrics: Precision, Recall, and F-measure. The hybrid approach improves classification accuracy and reduces the false-positive rate, offering an efficient solution for cyberbullying detection.
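
The two components named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the lexicon, the toy posts, the choice of k, the use of a single word-count feature, and the learning rate are all assumptions made for illustration. Because the abstract does not state how the two models are combined, the sketch shows them side by side, along with the evaluation metrics.

```python
import math
from collections import Counter

# Hypothetical lexicon and toy labelled posts (1 = cyberbullying, 0 = normal).
LEXICON = ["stupid", "ugly", "loser", "hate", "kill"]
TRAIN = [
    ("you are a stupid ugly loser", 1),
    ("i hate you loser", 1),
    ("nobody likes you stupid", 1),
    ("go kill yourself", 1),
    ("have a great day", 0),
    ("see you at the game tonight", 0),
    ("i love this song", 0),
    ("thanks for the help", 0),
]

def encode(text):
    """Binary vector: 1 if the lexicon word occurs in the post, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in LEXICON]

def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(text, k=3):
    """Majority label among the k training posts nearest in Hamming distance."""
    query = encode(text)
    ranked = sorted(TRAIN, key=lambda item: hamming(query, encode(item[0])))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Binary logistic regression on one feature, fitted by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F-measure for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Assumed feature: the number of distinct lexicon words found in each post.
counts = [sum(encode(text)) for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
w, b = fit_logistic(counts, labels)

post = "you stupid loser"
print("KNN label:", knn_predict(post))
print("P(suspicious post):", sigmoid(w * sum(encode(post)) + b))
```

On real data, the binary vectors would come from a much larger vocabulary, and k and the probability threshold would be tuned on held-out posts before Precision, Recall, and F-measure are reported.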

 

Page Number: 117-130
Upload: BJPS(2) Paper 8.pdf (964.61 KB)