CYBER CRIME CASES and CONFUSION MATRIX

7 min read · Jun 4, 2021

Hello Everyone 😃🙋🏻‍♀️

This article looks at cyber crime cases in terms of the confusion matrix and its two types of error.

Let’s begin 😃

👉 What is CYBERCRIME ?

Cybercrime, or computer crime, is a crime that involves a computer and a network. The computer may have been used in the commission of a crime, or it may be the target. Cybercrime may harm someone’s security and financial health.

Common forms of cybercrime include:

phishing: using fake email messages to get personal information from internet users
misusing personal information (identity theft)
hacking: shutting down or misusing websites or computer networks
spreading hate and inciting terrorism
distributing child pornography

👉 What is a Confusion Matrix?

A Confusion Matrix is a summary that compares the predicted results with the actual results in any classification use case. This comparison is essential for judging the performance of a model after it has been trained on some training data.

For a binary classification use case, a Confusion Matrix is a 2×2 matrix that crosses the two actual classes with the two predicted classes.

[Figure: 2×2 confusion matrix for binary classification]

In the figure:

Actual Class 1 has the value 1, which corresponds to the Positive value in a binary outcome.
Actual Class 2 has the value 0, which corresponds to the Negative value in a binary outcome.

A confusion matrix is built from the following components:

Positive(P): The predicted result is Positive (Example: the image is a cat)

Negative(N): The predicted result is Negative (Example: the image is not a cat)

True Positive(TP): Both the predicted and the actual value are 1 (Positive)

True Negative(TN): Both the predicted and the actual value are 0 (Negative)

False Negative(FN): The predicted value is 0 (Negative) but the actual value is 1 (Positive). The two values do not match, hence it is a False Negative.

False Positive(FP): The predicted value is 1 (Positive) but the actual value is 0 (Negative). Again the two values do not match, hence it is a False Positive.
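The four components can be counted directly from two lists of labels. Here is a minimal sketch in Python; the function name `confusion_counts` and the sample labels are made up for illustration and do not come from any dataset in this article.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for binary labels (1 = Positive, 0 = Negative)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1          # predicted Positive, actually Positive
        elif a == 0 and p == 0:
            tn += 1          # predicted Negative, actually Negative
        elif a == 0 and p == 1:
            fp += 1          # predicted Positive, actually Negative
        else:
            fn += 1          # predicted Negative, actually Positive
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical labels: 1 = "image is a cat", 0 = "image is not a cat"
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```

Arranging these four counts in a 2×2 grid (actual class on one axis, predicted class on the other) gives the confusion matrix.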

Classification Accuracy
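
Classification accuracy is the share of all predictions that the model got right. Written with the four components of the confusion matrix, the standard definition is:

```latex
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]
```

For example, with made-up counts TP = 3, TN = 3, FP = 1 and FN = 1, the accuracy is (3 + 3) / 8 = 0.75.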

Type I error (False Positive):

This type of error is not very dangerous: our system is safe in reality, but the model predicted an attack. The security team gets notified and checks for any malicious activity. This doesn’t cause any harm. Such cases can be termed False Alarms.

Type II error (False Negative):

This type of error can prove to be very dangerous. Our system predicted no attack, but in reality an attack takes place. In that case no notification reaches the security team and nothing can be done to prevent the attack. The False Negative cases above fall in this category, and thus one of the aims of the model is to minimize this value.
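
In the standard statistical convention, a Type I error is a False Positive (a false alarm) and a Type II error is a False Negative (a missed attack). A rough sketch of how the two error rates could be computed for an attack detector follows; the function name `error_rates` and all counts are hypothetical.

```python
def error_rates(tp, tn, fp, fn):
    """Return (false positive rate, false negative rate).

    False positive rate: share of harmless events flagged as attacks (Type I).
    False negative rate: share of real attacks that were missed (Type II).
    """
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Hypothetical counts for an attack detector (not real data)
tp, tn, fp, fn = 40, 900, 50, 10

fpr, fnr = error_rates(tp, tn, fp, fn)
print(f"Type I  (false alarm) rate: {fpr:.3f}")
print(f"Type II (missed attack) rate: {fnr:.3f}")
```

In a security setting the second number is usually the one to watch, since every missed attack goes unnoticed by the team.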

🔰 CYBER CRIME CASES and CONFUSION MATRIX and its Two Types of Error

Cyber-attacks have become one of the biggest problems of the world. They cause serious financial damages to countries and people every day. The increase in cyber-attacks also brings along cyber-crime. The key factors in the fight against crime and criminals are identifying the perpetrators of cyber-crime and understanding the methods of attack. Detecting and avoiding cyber-attacks are difficult tasks.

Recently, an article was published in the news about how the registered level of crime in the Netherlands has decreased to that of 1980 [12][5]. Although the number of crimes has decreased in the Netherlands, the ratio between the different types of crime has shifted. Due to the growth of the Internet and other technologies in the past 20 years, crime involving information and communication technologies (ICT) has increased significantly. In 2016, 11% of all Dutch residents were victimized by cybercrime [6]. Only 8% of the victims filed a police report.

In this paper, instead of using the term ‘cybercrime’, ‘a crime involving ICT’ is used, so as to focus not only on the criminal court cases labeled as cybercrime but also on criminal court cases where ICT played a role but which were not labeled as cybercrime. Not all crimes involving ICT are registered as such: they may appear as computer-aided traditional crimes, the involvement of ICT in the crime may be ignored, or the role of ICT in the crime may not be explicitly mentioned [10][17]. Consequently, the number of crimes involving ICT may be much higher than originally thought, and fighting and preventing them might become more relevant. Therefore, it is interesting to investigate ICT involvement in crimes.

RESEARCH QUESTIONS

In this research, to find out how many criminal court cases involve ICT, the aim is to answer the following questions:

RQ 1. How can criminal court cases be classified as child pornography, cyberattack, identity theft, phishing or platform fraud based on ICT involvement?

RQ 2. What features determine the classification of a criminal court case?

RQ 3. Which model can be extracted from the classification of criminal court cases?

RQ 4. What is the accuracy of the ICT involvement detection?

RELATED WORK

Some research has been done on text mining and machine learning for crime detection, but not with criminal court cases as a dataset. A master’s thesis on text classification of Dutch police reports, which investigates whether police reports can be classified through text mining, relates to this research topic but uses police reports instead of criminal court cases [4]. Androutsopoulos et al. compare a Naïve Bayesian filter with a keyword-based filter and conclude that automatically trainable anti-spam filters are viable [2]. In this research a Naïve Bayesian classifier is trained to automatically detect ICT involvement in criminal court cases. A study by Wang et al. on automatic document classification uses Bayes’ theorem as the basis of an algorithm to classify web documents [18]. One of its conclusions, consistent with earlier research, was that the multivariate Bernoulli event model performed worse than the multinomial event model. This is of interest for this research, since it also relies on Bayes’ theorem to classify the court cases. There seem to be no studies that combine Naïve Bayes as an algorithm with criminal court cases as a dataset, so the proposed research can create new insights.
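
To make the idea of a multinomial Naïve Bayes text classifier concrete, here is a small from-scratch sketch in Python. It is not the implementation used in the research: the function names, the Laplace smoothing value and the English toy sentences are all assumptions made for illustration (the real dataset consists of Dutch criminal court cases).

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with Laplace smoothing (alpha)."""
    word_counts = defaultdict(Counter)   # per-class word frequencies
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    class_sizes = Counter(labels)
    return {
        "alpha": alpha,
        "vocab": vocab,
        "word_counts": word_counts,
        # log prior = log(share of training documents in each class)
        "priors": {c: math.log(n / len(labels)) for c, n in class_sizes.items()},
        # total number of word occurrences per class
        "totals": {c: sum(word_counts[c].values()) for c in class_sizes},
    }


def predict(model, text):
    """Return the class with the highest log posterior score."""
    tokens = [t for t in text.lower().split() if t in model["vocab"]]
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, prior in model["priors"].items():
        score = prior
        for t in tokens:
            count = model["word_counts"][label][t]
            score += math.log(
                (count + model["alpha"])
                / (model["totals"][label] + model["alpha"] * vocab_size)
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (hypothetical, English for readability)
docs = [
    "fake email asked for bank password",
    "email link stole login password",
    "seller on marketplace never shipped the phone",
    "buyer paid on marketplace but item never arrived",
]
labels = ["phishing", "phishing", "platform fraud", "platform fraud"]

model = train_multinomial_nb(docs, labels)
print(predict(model, "suspicious email requested my password"))  # phishing
```

The multinomial event model counts how often each word occurs per class, which is the variant reported above to perform better than the multivariate Bernoulli model.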

Pre-processing of the Data

As the files were downloaded in an .xml format, they needed to be stripped of the XML-tags first, which resulted in a text-only string. This string then needed to be loaded into a data frame, so the learning algorithm could process it.
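
The tag-stripping step can be sketched with Python's standard library. The XML snippet, the file name and the helper name `xml_to_text` are invented for this example; the real court-case files will have a different structure. A plain list of records stands in for the data frame here.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_string):
    """Strip all XML tags and return the text content as one clean string."""
    root = ET.fromstring(xml_string)
    # itertext() yields every piece of text inside the element tree;
    # split() + join() collapses runs of whitespace into single spaces.
    return " ".join(" ".join(root.itertext()).split())


# Hypothetical court-case file content
sample_xml = (
    "<case><title>Verdict</title>"
    "<body>The suspect sent <b>phishing</b> emails.</body></case>"
)

text = xml_to_text(sample_xml)
print(text)  # Verdict The suspect sent phishing emails.

# One record per file; with pandas installed, pandas.DataFrame(rows)
# would turn this list into a data frame (assumption: pandas is the
# data frame library, the article does not say which one was used).
rows = [{"file": "case_001.xml", "text": text}]
```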

Calculating the accuracy

Next, the accuracy of the classification needed to be determined to measure how well the algorithm performed on the dataset.

The choice for this number is based on the low number of files per class. K-Fold cross-validation made it possible to calculate the f1_score as a measure of accuracy and to create a confusion matrix, which gives more insight into the accuracy per class. The formula for the f1_score is as follows:
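
The standard definition of the F1 score for a single class, written with true positives (TP), false positives (FP) and false negatives (FN), is the harmonic mean of precision and recall. How the per-class scores were averaged across the five classes is not stated here, so only the per-class form is shown:

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
\]
\[
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
\]
```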

True positives and negatives, false positives and negatives can be put in a confusion matrix to show the performance of the classifier.

Confusion matrix and accuracy

The confusion matrix that was obtained from the classifier is depicted in the figure below. It is in normalized form, since the classes are imbalanced. The darker the blue, the better the classifier is at predicting files for this class. It is clear where the classifier gets ‘confused’. The ‘identity theft’ class does not do well, and there is a good reason for that. Reading the court cases revealed that ‘platform fraud’ is linked to ‘identity theft’, as stolen identities are often used to commit platform fraud. The confusion matrix shows that ‘identity theft’ is often predicted as ‘platform fraud’.
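
A normalized multi-class confusion matrix can be sketched as follows. The example assumes normalization per actual class (each row sums to 1), which is a common choice for imbalanced classes but is an assumption here. The labels and predictions are made up to mimic the ‘identity theft’ versus ‘platform fraud’ confusion and are not the results of the research.

```python
from collections import Counter


def normalized_confusion_matrix(actual, predicted, classes):
    """Rows = actual class, columns = predicted class, each row sums to 1."""
    counts = Counter(zip(actual, predicted))
    matrix = []
    for a in classes:
        row_total = sum(counts[(a, p)] for p in classes)
        matrix.append(
            [counts[(a, p)] / row_total if row_total else 0.0 for p in classes]
        )
    return matrix


classes = ["identity theft", "platform fraud", "phishing"]

# Hypothetical labels (not the data from the research)
actual = ["identity theft"] * 4 + ["platform fraud"] * 5 + ["phishing"] * 2
predicted = (
    ["identity theft", "platform fraud", "platform fraud", "platform fraud"]
    + ["platform fraud"] * 4 + ["identity theft"]
    + ["phishing"] * 2
)

matrix = normalized_confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, matrix):
    print(f"{name:15s}", [round(v, 2) for v in row])
```

In this toy data, 75% of the actual ‘identity theft’ cases land in the ‘platform fraud’ column, which is the kind of pattern the normalized matrix makes easy to spot even when one class has far fewer files than another.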

Thanks for Reading !! 🙌🏻😁📃

🔰 Keep Learning !! Keep Sharing !!