K-Means Clustering and Its Real Use Case in the Security Domain

4 min read · Jul 19, 2021

Hello Everyone 🙋🏻‍♀️

💫 Today I am here with a new article on K-Means clustering and its real use case in the security domain 💫

💥 What is unsupervised learning?

Unsupervised learning is where you train a machine learning algorithm without giving it the answer to the problem. The machine works with unlabeled data and learns on its own, without any supervision: it tries to find patterns in the unlabeled data and responds based on them.

💥 What is clustering?

Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean distance or correlation-based distance.
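To get a feel for the two similarity measures mentioned above, here is a minimal sketch in plain Python. The function names and the sample vectors are made up for illustration, and the code assumes both vectors have the same length and are not constant (otherwise the correlation is undefined).

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 minus the Pearson correlation (close to 0 when the shapes match)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    # Assumption: neither vector is constant, so the norms are non-zero.
    return 1.0 - cov / (norm_a * norm_b)

# Two made-up feature vectors: same shape, different scale.
p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]
print(euclidean_distance(p, q))    # clearly non-zero: the points are far apart
print(correlation_distance(p, q))  # about 0: the vectors rise and fall together
```

The same pair of points can look far apart under one measure and nearly identical under the other, so the choice of measure changes which points end up in the same cluster.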

💥 What is k-means clustering?

K-Means clustering is an unsupervised learning algorithm, so unlike supervised learning it uses no labeled data. K-Means divides objects into clusters so that objects in the same cluster are similar to each other and dissimilar to objects in other clusters. The algorithm is iterative: at each step it tries to reduce the distance between each data point and the average data point (the centroid) of its cluster.
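For readers who like a formula, the quantity K-Means tries to minimize is usually written as the within-cluster sum of squared distances. The symbols here are my own notation for illustration: C_j is the set of points in cluster j, and μ_j is the average data point (centroid) of that cluster.

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

A smaller J means the points sit closer to the centroids of their own clusters.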

The term ‘K’ is a number: you need to tell the algorithm how many clusters to create. For example, K = 2 means two clusters. Heuristics such as the elbow method can help you find a good value of K for a given dataset.
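One common way to look for a good K is the elbow method: run K-Means for several values of K, record the total squared distance of the points to their nearest centroid (often called inertia), and pick the K after which the improvement flattens out. The sketch below is a toy version on one-dimensional, made-up data. The helper `kmeans_1d` and its deterministic starting points are my own simplifications, not a library API.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny 1-D k-means; returns (centroids, inertia). Assumes k <= len(values)."""
    data = sorted(values)
    # Deterministic start: k points spread evenly through the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            clusters[nearest].append(v)
        # Recompute each centroid; an empty cluster keeps its old centroid.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    # Inertia: squared distance of every point to its nearest centroid.
    inertia = sum(min((v - c) ** 2 for c in centroids) for v in data)
    return centroids, inertia

# Made-up data with three visible groups around 1, 5 and 9.
data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2, 8.8, 9.0, 9.2]
inertias = {k: kmeans_1d(data, k)[1] for k in range(1, 6)}
for k, inertia in inertias.items():
    print(k, round(inertia, 3))
```

On this toy data the inertia falls steeply up to K = 3 and then barely changes, which is the "elbow" that suggests three clusters. Real data rarely gives such a clean bend, so treat the result as a hint rather than a rule.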

💥 How Does K-Means Clustering Work?

The flowchart below shows how k-means clustering works:

⚡Use-Cases in the Security Domain⚡

💥 Intrusion Detection System (IDS) using K-Means Clustering 💥

💥 What is intrusion detection?

Intrusion detection is the process of monitoring the events occurring in a computer system or network and analyzing them for signs of intrusions, defined as attempts to compromise the confidentiality, integrity, or availability of a system, or to bypass its security mechanisms.

💥 What is an intrusion detection system (IDS)?

An intrusion detection system (IDS) is a device or software application that monitors a network for malicious activity or policy violations. Any malicious activity or violation is typically reported or collected centrally using a security information and event management (SIEM) system. Anomaly detection is one approach to intrusion detection. Current anomaly detection is often associated with high false alarm rates and only moderate accuracy and detection rates, because it cannot detect all types of attacks correctly.

💥 Solution 💥

To overcome this problem, K-Means clustering is useful. K-means is one of the simplest and most efficient partitional clustering algorithms, and it has been used to detect intrusions in computer systems.

In 2008, Rajesh and Shina [13] proposed a method for analyzing which feature selection method works best for a network intrusion detection model. In their paper they used the K-means algorithm to cluster and analyze the data of the KDD Cup 99 dataset. The simulation results on the KDD Cup 99 dataset showed that K-means is an effective algorithm for partitioning large datasets and can detect unknown intrusions with a detection rate of 96%.
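How can a clustering algorithm flag an intrusion at all? One common idea, which I am assuming here for illustration and which is not necessarily the exact method of the paper above, is to treat a connection as suspicious when it lies far away from every cluster centroid. The centroids, feature values, and threshold below are all hypothetical.

```python
import math

def nearest_centroid_distance(point, centroids):
    """Distance from a point to the closest cluster centroid."""
    return min(math.dist(point, c) for c in centroids)

def flag_anomalies(points, centroids, threshold):
    """True for every point farther than `threshold` from all centroids."""
    return [nearest_centroid_distance(p, centroids) > threshold for p in points]

# Hypothetical centroids, as if learned from normal traffic. The two features
# could be, for example, scaled connection duration and scaled bytes sent.
centroids = [(0.2, 0.3), (0.7, 0.8)]
connections = [(0.21, 0.33), (0.68, 0.79), (0.95, 0.05)]

flags = flag_anomalies(connections, centroids, threshold=0.25)
print(flags)  # [False, False, True]
```

The first two connections sit close to a centroid, so they look normal. The third is far from both, so it gets flagged. Picking the threshold is the hard part in practice, since it trades missed attacks against false alarms.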

The K-means algorithm is one of the most widely recognized clustering tools. K-means groups the data according to their feature values into a user-specified number of K distinct clusters. Data placed in the same cluster have similar feature values. K, the positive integer denoting the number of clusters, needs to be provided in advance.

The steps involved in the K-means algorithm are as follows:

1. K points are placed into the space occupied by the data to be clustered. These points are the initial group centroids.

2. Each data point is assigned to the group whose centroid is closest to it.

3. The positions of all the K centroids are recalculated as soon as all the data are assigned.

4. Repeat steps 2 and 3 until the centroids no longer change.
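The four steps above can be sketched in plain Python as follows. This is a minimal version under a few assumptions of mine: points are tuples of numbers, the initial centroids are K randomly chosen data points (one common choice, the steps do not say how to pick them), and Euclidean distance is used. The sample points are made up.

```python
import math
import random

def kmeans(points, k, max_iterations=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick K of the data points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iterations):
        # Step 2: assign each point to the group with the closest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Made-up 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On this data the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random starting points, so compare labels with each other rather than against fixed numbers.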

💥 Result 💥

The evaluation of K-means with feature selection confirms that a high detection rate can be achieved while maintaining a low false alarm rate (DR = 98.214%, error rate = 1.7857%). Because an unsupervised learning algorithm is used as the detection model in the first layer, the performance of the system is reported not to degrade when the IDS faces unknown attacks, so new types of attacks are less of a concern.
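If you want to compute this kind of score for your own detector, here is a small sketch. It uses the commonly used definitions (detection rate = detected attacks / all attacks, false alarm rate = normal records flagged / all normal records). The results above do not spell out their formulas, so treat these definitions as my assumption. The labels are made up and are not the KDD Cup 99 results.

```python
def detection_metrics(true_labels, predicted_labels):
    """Labels are booleans: True = attack, False = normal.

    Assumes at least one attack and one normal record are present.
    """
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth and predicted:
            tp += 1          # attack correctly flagged
        elif truth and not predicted:
            fn += 1          # attack missed
        elif not truth and predicted:
            fp += 1          # normal record flagged (false alarm)
        else:
            tn += 1          # normal record left alone
    detection_rate = tp / (tp + fn)
    false_alarm_rate = fp / (fp + tn)
    return detection_rate, false_alarm_rate

# Hypothetical ground truth and detector output for eight records.
truth = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False]

dr, far = detection_metrics(truth, predicted)
print(f"DR = {dr:.2%}, FAR = {far:.2%}")  # DR = 75.00%, FAR = 25.00%
```

A good IDS pushes the first number up and the second one down at the same time, which is exactly the trade-off the results above are about.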

THANKS FOR READING !!

KEEP LEARNING !! KEEP SHARING !!