Using Fundamentals of Machine Learning to Detect Anomalies in Computer Security [republished]
originally published on 08/10/2022
Summary
The explosive growth in connectivity and accessibility of the internet has created a tremendous security threat to information systems worldwide. Various security controls, such as encryption and authentication techniques, have been proposed to reduce the risk of intrusion, but no security control is complete enough to eliminate that risk. Machine learning algorithms can therefore be used to detect the gaps that remain. This case study describes a machine learning method based on SVMs (support vector machines) for distinguishing normal from attacked behavior of a system. The method was tested with the KDD (knowledge discovery and data mining) data set derived from the DARPA intrusion detection evaluation, and it successfully classifies the records into normal and attack classes. The case study presents a graphical representation of the results.
Keywords: Intrusion Detection, Machine Learning, Support Vector Machines.
1. Introduction
Intrusion detection systems (IDS) are software or hardware systems that automate the process of monitoring intrusion events occurring in a computer system or network. They are used to determine whether security has been breached and unauthorized access has been gained to property, resources, or data. Intrusion detection systems are classified as network based, protocol based, or host based. Network traffic intrusions can be detected using a network-based intrusion detection system (NIDS). A NIDS gains access to network traffic by connecting to a hub or a network switch, and detects attacks [1] by capturing and analyzing network packets. A protocol-based system consists of a system or agent that typically sits at the front end of a server, monitoring and analyzing the communication protocol between a connected device and the server. A host-based intrusion detection system consists of an agent on a host that identifies intrusions by analyzing system calls, application logs, and file-system modifications. This case study provides a machine learning based approach for detecting intrusions in a NIDS.
Misuse detection and anomaly detection are the two general approaches to computer intrusion detection [4]. Misuse detection, or signature-based detection, generates an alarm when a known attack signature is matched. Anomaly detection identifies activities that deviate from the normal behavior of the monitored system and thus has the potential to detect novel attacks. This case study presents an anomaly detection method for detecting intruders in the network. In fact, anomaly detection can be regarded as a classification problem that labels behavior as normal or attack, and several machine learning approaches have been applied to intrusion detection to differentiate attack data from normal data. The rest of this case study is organized as follows. Section 2 provides background on support vector machines. Section 3 reviews related work. Section 4 describes the types of attacks included in the DARPA data set from the MIT Lincoln Laboratory. Section 5 describes the selection of features. Section 6 describes the SVM tool. The results are shown in Section 7. Finally, we summarize our work in Section 8.
2. Background on SVMs
An SVM is one of the most successful classification algorithms in the neural network literature. It is a type of feedforward network introduced by Vapnik [2][3] that is used for pattern classification and nonlinear regression [13]. An SVM constructs a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized. A key property of the SVM is that it is an approximate implementation of the structural risk minimization (SRM) principle from statistical learning theory. The SVM classifies data by determining a number of support vectors, which are members of the set of training inputs that outline a hyperplane in the feature space.
2.1 Linear classification
An SVM determines to which class a new data point belongs. A data point is viewed as a p-dimensional vector, and if such points are separated by a (p-1)-dimensional hyperplane, the classifier is called a linear classifier. Many hyperplanes can separate the data, but an SVM seeks the maximum separation (margin) between the two classes. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and the corresponding linear classifier is a maximum-margin classifier. In short, an SVM is a maximum-margin classifier.
Formally, the training data is a set of n points of the form

D = {(xi, yi) | xi ∈ R^p, yi ∈ {+1, −1}}, i = 1, ..., n,

where yi indicates the class to which the point xi belongs and each xi is a p-dimensional real vector. Any hyperplane can be written as the set of points x satisfying

w · x − b = 0,

where (·) denotes the vector dot product. The vector w is perpendicular to the hyperplane, and the term b/||w|| determines the distance of the hyperplane from the origin. The problem is to select w and b to maximize the margin.
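For intuition, the resulting classifier simply takes the sign of w · x − b. The snippet below is a minimal illustrative sketch with made-up numbers; the vectors shown are not derived from the intrusion data.

```python
import numpy as np

def linear_predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane w.x - b = 0 the point x lies."""
    return 1 if np.dot(w, x) - b >= 0 else -1

w = np.array([2.0, -1.0])   # normal vector of a separating hyperplane (illustrative values)
b = 0.5                     # offset term
print(linear_predict(w, b, np.array([1.0, 0.2])))    # prints 1  (positive side)
print(linear_predict(w, b, np.array([-1.0, 0.2])))   # prints -1 (negative side)
```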
2.2 Non-linear classification
In 1992, Vapnik suggested a way to create non-linear classifiers [14] by applying the "kernel trick" to maximum-margin hyperplanes. The resulting algorithm is formally similar to the linear case, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be non-linear and the transformed space high-dimensional, so although the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space.
Some common kernel functions include,
- Polynomial (Homogeneous)
- Polynomial (inhomogeneous)
- Radial basis function
- Gaussian Radial basis function
- Hyperbolic tangent
This case study uses Radial Basis Function (RBF) as a kernel.
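As a concrete illustration, the RBF kernel value for a pair of feature vectors can be computed in a few lines. The helper below is a hypothetical sketch; the function name and the random example vectors are assumptions, not part of the original study.

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma):
    """k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), the RBF kernel used in this study."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

# Two example 13-dimensional feature vectors; gamma = 1/k with k = 13 attributes
x1, x2 = np.random.rand(13), np.random.rand(13)
print(rbf_kernel(x1, x2, gamma=1 / 13))
```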
Primal problem for non-linear data points

For a given training sample {(xi, yi)}, i = 1, ..., n, the primal problem for non-linear data points is as follows: find the optimal values of the weight vector w and bias b such that they satisfy the constraints

yi (wT φ(xi) + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, ..., n,

and such that the weight vector w and the slack variables ξi minimize the cost function

Φ(w, ξ) = (1/2) wT w + C Σi ξi.
The slack variable ξi measures the deviation of a data point from the ideal condition of pattern separability. The parameter C, called the regularization parameter, controls the tradeoff between the complexity of the machine and the number of non-separable points; it is selected by the user.
Dual Problem
The dual problem provides an optimal solution to the primal problem using Lagrange multipliers.
Duality theorem states that:
- If the primal problem has an optimal solution, the dual problem has an optimal solution and the corresponding optimal values are equal.
- In order for wo to be an optimal primal solution and αo to be an optimal dual solution, it is necessary and sufficient that wo is feasible for the primal problem and minimizes the Lagrangian evaluated at αo.
The dual problem for a given training sample {(xi, yi)}, i = 1, ..., n, is to find the Lagrange multipliers {αi}, i = 1, ..., n, that minimize the function

Φ(α) = (1/2) αT Q α − eT α

subject to the constraints

yT α = 0 (that is, Σi yi αi = 0) and 0 ≤ αi ≤ C for i = 1, ..., n,

where C > 0, e is the vector of all ones, n is the number of training examples, and Q is an n × n positive semidefinite matrix with Qij = yi yj k(xi, xj).

In this case study we selected the initial value of α as zero.

The training data are mapped to a higher-dimensional space by the function φ, and k(xi, xj) is an RBF kernel function defined as

k(xi, xj) = exp(−γ ||xi − xj||²) for γ > 0.
The above procedure is summarized below:

Step 1: Initialize the parameters as α = 0, γ = 1/k (where k is the number of input attributes), and C = 100.

Step 2: Substitute these values into the expression below and solve for α:

Φ(α) = (1/2) αT Q α − eT α, subject to yT α = 0 and 0 ≤ αi ≤ C,

where Qij = yi yj k(xi, xj) for i, j = 1, ..., n, x is an input vector consisting of intrusion detection data, and k(xi, xj) = exp(−γ ||xi − xj||²).
Step 3: Step 2 is repeated for all training examples. At each iteration, the α values and the 2-dimensional data points are evaluated using the RBF kernel function.

Step 4: The training points with non-zero α values are taken as the support vectors.

Step 5: The data is classified into normal and attack classes using these support vectors.
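To make Step 2 concrete, the sketch below builds the matrix Q from the RBF kernel and evaluates the dual objective Φ(α) for a given α on toy data. It is only an illustration of the quantities involved; it does not run the SMO optimization that LIBSVM actually performs, and the data and names are made up.

```python
import numpy as np

def rbf(xi, xj, gamma):
    d = xi - xj
    return np.exp(-gamma * np.dot(d, d))

def dual_objective(alpha, X, y, gamma):
    """Phi(alpha) = 1/2 * alpha^T Q alpha - e^T alpha, with Qij = yi * yj * k(xi, xj)."""
    n = len(y)
    Q = np.array([[y[i] * y[j] * rbf(X[i], X[j], gamma) for j in range(n)] for i in range(n)])
    e = np.ones(n)
    return 0.5 * alpha @ Q @ alpha - e @ alpha

# Toy example: four 2-dimensional points, gamma = 1/k with k = 2 attributes
X = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
y = np.array([1.0, -1.0, 1.0, -1.0])
alpha = np.zeros(4)                              # Step 1: initialize alpha = 0
print(dual_objective(alpha, X, y, gamma=0.5))    # 0.0 at the starting point
```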
Though SVMs have many advantages, the main reasons for using them for intrusion detection are speed and scalability (classification complexity does not depend on the dimensionality of the input space).
3. Related work
The idea of anomaly detection in computer security was introduced by Anderson [5]. Various methods for anomaly detection have been implemented using statistical models of network behavior [6] [7]. The aim of machine learning techniques for intrusion detection systems is to develop a training algorithm that classifies data as normal or abnormal. More recently, Eskin and others [8] [9] proposed unsupervised anomaly detection algorithms for unlabeled data. In recent years, there have been several learning-based or data mining-based research efforts [20] in intrusion detection. Ghosh and others [10] employed artificial neural network techniques to learn normal sequences of system calls for specific UNIX system programs. Rao V. Vemuri and others [18] extended the work of Liao et al. and compared the intrusion detection performance of robust SVMs (RSVMs), standard SVMs, and kNN classifiers over training data. Support vector machines have also been used as anomaly intrusion detectors [11]. Wang et al. [12] used a "one-class SVM" trained on a set of examples belonging to a single class.
4. Data
This case study uses data from the 1998 DARPA intrusion detection evaluation program, which originated from MIT's Lincoln Laboratory. The data includes a wide variety of intrusions simulated in a military network environment. Lincoln Laboratory set up an environment to acquire nine weeks of raw TCP dump data from a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks. A connection is a sequence of TCP packets starting and ending at well-defined times, between which data flows from a source IP address to a target IP address under a well-defined protocol. Each connection is labeled as either normal or attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.
The types of attacks found in this dataset can be categorized as follows:
- DoS (denial of service): The attacker uses a large number of computers to log in within a short period of time or to transfer a massive number of packets. These actions overload the host, so the host's network services stop. The attack consumes large resources of the host, such as CPU utilization, memory, and network bandwidth. SYN flood, Smurf, Teardrop, and Ping of Death belong to the DoS category.
- R2L (Remote to User): A remote-to-user attack is a class of attacks in which an unauthorized user sends packets to a machine over a network and then exploits a vulnerability to gain local access to that machine. Examples are ftp_write, guess_passwd, imap, multihop, spy, phf, warezclient, and warezmaster.
- U2R (User to Superuser or Root): A user-to-root attack is a class of attacks in which an attacker logs in as a normal user and exploits a vulnerability to gain root access. Examples are buffer_overflow, loadmodule, perl, and rootkit.
- PROBING: Probing is a class of attacks in which an attacker scans a network of computers to gather information or find known vulnerabilities. An attacker armed with a map of the machines and services on a network can exploit the gaps found. Examples are ipsweep, nmap, portsweep, and satan.
The features included in the KDD Cup data set (built from the 1998 DARPA data) are listed in Table 1, where C indicates a continuous feature and D a discrete feature.
Table 1. List of features

| No | Feature name | Description | Example | Type |
|----|--------------|-------------|---------|------|
| 1 | duration | length (number of seconds) of the connection | 0 | C |
| 2 | protocol_type | type of the protocol | tcp | D |
| 3 | service | network service on the destination, e.g., http, telnet | http | D |
| 4 | flag | normal or error status of the connection | SF | D |
| 5 | src_bytes | number of data bytes from source to destination | 181 | C |
| 6 | dst_bytes | number of data bytes from destination to source | 5450 | C |
| 7 | land | 1 if connection is from/to the same host/port; 0 otherwise | 0 | D |
| 8 | wrong_fragment | number of "wrong" fragments | 0 | C |
| 9 | urgent | number of urgent packets | 0 | C |
| 10 | hot | number of "hot" indicators | 0 | C |
| 11 | num_failed_logins | number of failed login attempts | 0 | C |
| 12 | logged_in | 1 if successfully logged in; 0 otherwise | 1 | D |
| 13 | num_compromised | number of "compromised" conditions | 0 | C |
| 14 | root_shell | 1 if root shell is obtained; 0 otherwise | 0 | D |
| 15 | su_attempted | 1 if "su root" command attempted; 0 otherwise | 0 | D |
| 16 | num_root | number of "root" accesses | 0 | C |
| 17 | num_file_creations | number of file creation operations | 0 | C |
| 18 | num_shells | number of shell prompts | 0 | C |
| 19 | num_access_files | number of operations on access control files | 0 | C |
| 20 | num_outbound_cmds | number of outbound commands in an ftp session | 0 | C |
| 21 | is_host_login | 1 if the login belongs to the "host" list; 0 otherwise | 0 | D |
| 22 | is_guest_login | 1 if the login is a "guest" login; 0 otherwise | 0 | D |
| 23 | count | number of connections to the same host as the current connection in the past two seconds | 8 | C |
| 24 | serror_rate | % of connections that have "SYN" errors | 0.00 | C |
| 25 | rerror_rate | % of connections that have "REJ" errors | 0.00 | C |
| 26 | same_srv_rate | % of connections to the same service | 1.00 | C |
| 27 | diff_srv_rate | % of connections to different services | 0.00 | C |
| 28 | srv_count | number of connections to the same service as the current connection in the past two seconds | 8 | C |
| 29 | srv_serror_rate | % of connections that have "SYN" errors | 0.00 | C |
| 30 | srv_rerror_rate | % of connections that have "REJ" errors | 0.00 | C |
| 31 | srv_diff_host_rate | % of connections to different hosts | 0.00 | C |
| 32 | dst_host_count | | 9 | C |
| 33 | dst_host_srv_count | | 9 | C |
| 34 | dst_host_same_srv_rate | | 1.00 | C |
| 35 | dst_host_diff_srv_rate | | 0.00 | C |
| 36 | dst_host_same_src_port_rate | | 0.11 | C |
| 37 | dst_host_srv_diff_host_rate | | 0.00 | C |
| 38 | dst_host_serror_rate | | 0.00 | C |
| 39 | dst_host_srv_serror_rate | | 0.00 | C |
| 40 | dst_host_rerror_rate | | 0.00 | C |
| 41 | dst_host_srv_rerror_rate | | 0.00 | C |
| 42 | attack | type of attack | normal | |
5. Feature selection
Feature selection and ranking [17] is an important issue in intrusion detection. The purpose of ranking is to select the features that are most important for detecting intrusions: eliminating useless features improves detection accuracy while speeding up computation, and thus improves the overall performance of an IDS. Performance is measured by accuracy and by the CPU time needed to train the SVM. This case study uses a performance-based ranking method that deletes one feature at a time to rank the input features and identify the most important ones for intrusion detection with SVMs. One input feature is deleted from the data set at a time, and the resulting data set is used for training and testing. The classifier's performance is then compared to that of the original classifier trained with all features. These comparisons are shown in Table 4. Based on this comparison, we selected a feature set consisting of 13 features.
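A minimal sketch of this delete-one-feature-at-a-time ranking is shown below, using scikit-learn's SVC (a wrapper around LIBSVM) purely for illustration; the array names, the train/test split, and the use of accuracy as the score are assumptions, not the original experimental setup.

```python
import numpy as np
from sklearn.svm import SVC

def rank_features_by_deletion(X_train, y_train, X_test, y_test, C=100):
    """Train with all features, then retrain with each feature removed in turn and
    record the drop in accuracy (a larger drop suggests a more important feature)."""
    def accuracy(X_tr, X_te):
        gamma = 1.0 / X_tr.shape[1]               # gamma = 1/k, as in the study
        clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_train)
        return clf.score(X_te, y_test)

    baseline = accuracy(X_train, X_test)
    drops = {}
    for f in range(X_train.shape[1]):
        keep = [i for i in range(X_train.shape[1]) if i != f]
        drops[f] = baseline - accuracy(X_train[:, keep], X_test[:, keep])
    # Features sorted from largest to smallest accuracy drop
    return baseline, sorted(drops.items(), key=lambda kv: kv[1], reverse=True)
```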
Selected features are listed below.
Table 2. Selected features after feature selection

| S.No | Feature |
|------|---------|
| 1 | duration |
| 2 | src_bytes |
| 3 | dst_bytes |
| 4 | count |
| 5 | srv_count |
| 6 | dst_host_count |
| 7 | dst_host_srv_count |
| 8 | dst_host_same_srv_rate |
| 9 | dst_host_diff_srv_rate |
| 10 | dst_host_same_src_port_rate |
| 11 | dst_host_srv_diff_host_rate |
| 12 | dst_host_serror_rate |
| 13 | dst_host_srv_serror_rate |
The freeware package LIBSVM [16] is used to classify the above data.
6. LIBSVM
LIBSVM is a library for support vector machines (SVMs). Its goal is to let users easily apply SVMs as a tool. It includes C-support vector classification (C-SVC), ν-support vector classification (ν-SVC), distribution estimation (one-class SVM), ε-support vector regression (ε-SVR), and ν-support vector regression (ν-SVR). In this case study we classified the data set using C-SVC.
It is based on an improved SMO algorithm [15] by R.-E. Fan, P.-H. Chen, and C.-J. Lin. Given training vectors xi ∈ R^p, i = 1, ..., n, in two classes, and a label vector y such that yi ∈ {+1, −1}, C-SVC solves the primal and dual problems described in Section 2.
6.1. Experiments with LIBSVM
Before applying the data to the SVM, the data must be scaled. The main purpose of scaling is to prevent attributes with larger numeric ranges from dominating those with smaller ranges. Scaling also avoids numerical difficulties during the calculation, since kernel values depend on the inner products of feature vectors. We therefore scale each attribute linearly to the range [0, 1]; the same scaling is applied to the test data before testing.
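A minimal sketch of this linear scaling step, assuming the features sit in NumPy arrays (LIBSVM's bundled svm-scale tool performs the same kind of transformation):

```python
import numpy as np

def scale_01(X_train, X_test):
    """Linearly scale each attribute of the training data to [0, 1] and apply the
    identical transformation to the test data."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return (X_train - lo) / span, (X_test - lo) / span
```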
In this case study, we use LIBSVM's sparse format, in which zero-valued attributes are omitted; the data is therefore compressed and uses less memory.

Hence an instance with attributes

1 0 2 0

is represented as

1:1 2:0 3:2 4:0

in index:value form, or simply as

1:1 3:2

by ignoring zeros.
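The conversion to this sparse format is mechanical; the helper below is a small illustrative sketch (not part of LIBSVM itself):

```python
def to_libsvm_line(label, values):
    """Encode one instance as '<label> index:value ...' with 1-based indices,
    omitting zero-valued attributes (LIBSVM's sparse format)."""
    pairs = [f"{i + 1}:{v:g}" for i, v in enumerate(values) if v != 0]
    return " ".join([str(label)] + pairs)

print(to_libsvm_line(1, [1, 0, 2, 0]))   # -> "1 1:1 3:2"
```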
Each instance in the training set contains one target value (the class label) and several attributes (features). The objective of the SVM is to produce a model that predicts the target value of the instances in the testing set, for which only the attributes are given.
The data is partitioned into two classes, normal and attack, where attack is the collection of all 22 different attack types belonging to the four categories described above. The objective of our SVM experiments is to separate normal and attack patterns. In our case all attacks are labeled −1 and normal data +1.
7. Results
7.1. Training
For training the SVM, 1000 data points are randomly selected from 65,000 data points. These randomly selected data points contain both attack and normal usage patterns. Training is done using the RBF (radial basis function) kernel; the kernel function defines the feature space in which the training examples are classified. The 13 features selected by the feature selection method described in Section 5, and listed in Table 2, are used for training. The results with 13 features are compared with the results for all 41 features of the data set, shown in Figures 1 and 2 respectively.
Figures 1 and 2 show the boundary between normal and attack data, as labeled. In these figures, the parameter set in the lower-right panel represents the model and kernel used for classification. The option -t 2 selects the radial basis function kernel with gamma value 1/k, where k is the number of attributes in the input. The model used for classifying the data is C-SVC (support vector classification with a cost parameter C); the specification -c 100 sets the cost value to 100, which represents the degree of punishment for misclassified points and has an impact on the experimental results.
Training the SVM with 13 features takes less CPU time while giving almost the same error rate as training with all 41 features. These results are summarized in Table 4.
Figure 1. Classification between normal & attacked data with 13 features.
Figure 2. Classification between normal & attacked data with 41 features
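A hedged sketch of a training run with these settings, using scikit-learn's LIBSVM wrapper rather than the svm-train command line; the placeholder arrays stand in for the sampled, scaled training data and are not the study's actual inputs.

```python
import time
import numpy as np
from sklearn.svm import SVC

# Placeholders for the 1000 sampled training points with the 13 selected features;
# labels are +1 (normal) and -1 (attack)
X_13 = np.random.rand(1000, 13)
y = np.random.choice([1, -1], size=1000)

clf = SVC(kernel="rbf", C=100, gamma=1 / 13)   # roughly "-t 2 -c 100 -g 0.076923" in LIBSVM terms
start = time.perf_counter()
clf.fit(X_13, y)
print("training time (s):", time.perf_counter() - start)
print("support vectors per class:", clf.n_support_)   # compare with Tables 3 and 4
```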
The final SVM model generated after training consists of the parameters in Table 3. This model is used to predict the target value (normal or attack) of the instances in the testing set, given only the attributes.
Table 3. SVM model for 13 features

| S.No | Parameter | Value |
|------|-----------|-------|
| 1 | Gamma (γ) | 0.076923 |
| 2 | Bias (b) | 2.933230 |
| 3 | Kernel type | RBF |
| 4 | Number of classes | 2 |
| 5 | Class labels | +1, -1 |
| 6 | Number of support vectors | 5 (+1), 6 (-1) |
7.2. Testing
After training the SVM, the gamma and bias values are fixed. These values are used to test the data. For testing, we used an unlabeled data set consisting of 500 data points, with 41 features in one case and 13 features in the other.
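Testing then reduces to calling the trained model on the unlabeled records; a minimal, self-contained sketch with placeholder arrays (not the actual 500-point test set):

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in for the model trained in Section 7.1
X_train = np.random.rand(1000, 13)
y_train = np.random.choice([1, -1], size=1000)
clf = SVC(kernel="rbf", C=100, gamma=1 / 13).fit(X_train, y_train)

# Stand-in for the 500 unlabeled test records, scaled the same way as the training data
X_test = np.random.rand(500, 13)
pred = clf.predict(X_test)
print("predicted normal:", int((pred == 1).sum()),
      "| predicted attack:", int((pred == -1).sum()))
```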
Table 4. Comparison metrics for 13 and 41 features

| S.No | Comparison metric | 41 features | 13 features |
|------|-------------------|-------------|-------------|
| 1 | Number of support vectors | 8 | 11 |
| 2 | Training time (ms) | 2.532 | 2.328 |
| 3 | Testing time (ms) | 0.656 | 0.547 |
| 4 | Error rate | 0.0 | 0.007143 |
As the SVM is trained with fewer features, from 13 down to 3, the number of support vectors and the error rate increase. A smaller number of support vectors is more efficient, because support vectors are the data points that are hardest to classify (they lie on the margin), and running time falls when fewer support vectors are generated. So the SVM is best trained with the selected 13 features. These comparisons are listed in Table 5 and shown in Figure 3.
Table 5. Comparison metrics for 3 to 13 features

| No. of features | Positive support vectors | Negative support vectors | Error rate | Training time (ms) | Testing time (ms) |
|-----------------|--------------------------|--------------------------|------------|--------------------|-------------------|
| 3 | 51 | 52 | 0.2291 | 10.891 | 1.454 |
| 4 | 54 | 53 | 0.2291 | 16.36 | 1.703 |
| 5 | 39 | 38 | 0.2156 | 16.781 | 2.047 |
| 6 | 27 | 26 | 0.2156 | 17.234 | 2.156 |
| 7 | 16 | 17 | 0.1644 | 30.203 | 2.406 |
| 8 | 17 | 19 | 0.1156 | 13.172 | 2.422 |
| 9 | 18 | 17 | 0.1779 | 12.828 | 2.219 |
| 10 | 17 | 17 | 0.1779 | 37.625 | 2.203 |
| 11 | 16 | 17 | 0.1779 | 11.469 | 2.234 |
| 12 | 17 | 17 | 0.1752 | 33.375 | 2.25 |
| 13 | 5 | 6 | 0.0071 | 7.656 | 2.328 |
Figure 3. Graphical representation of Table 5
8. Conclusion
In this case study, we applied a machine learning technique, the SVM, to anomaly detection in computer security. The SVM constructs a separating hyperplane with maximal margin between two classes in a higher-dimensional space. Experiments with the 1998 DARPA data set show that SVMs can provide good classification ability. The comparison metrics show that the data is better classified with 13 features than with all 41 features.
9. Future Scope
Other methods can be used to classify the data into 2 classes (normal and attack), 5 classes (4 attack categories plus normal), or 23 classes (22 specific attack types plus normal), and the results may be compared with the technique implemented here.
….
About the Author
Radhika Kanubaddhi is Data and AI Architect at Microsoft. Radhika Kanubaddhi has over 10 years of experience in software engineering, databases, AI/ML, and analytics. Radhika helps customers adopt Microsoft Azure services to deliver business results. She crafts ML and database solutions that overcome complex technical challenges and drive strategic objectives.
Before Microsoft, she developed and deployed machine learning models to improve revenues, profits, increase conversions for clients from various industries such as airlines, banks, pharma, retail, and hospitality. She is an expert in assembling the right set of services to solve client needs.
Radhika has worked with almost all technical innovations and services in the last decade — including Internet of Things, cloud application development, ML models, and Azure.
Some of her accomplishments include:
- Led three ML recommendation engine POCs converting 2 out of 3 clients resulting in $1.17M annual revenues.
- Implemented real-time recommendation engine for an airline client resulting in $214M increase in 30-day revenue.
- Developed ‘Backup as a Service’ to clone and encrypt data from millions of Internet of Things (IoT) devices
Radhika has a Master’s in computer science. She has authored impactful technical content across different media. Outside of work, she enjoys playing with her daughter, reading, and creating art.
….
References
[1] H. Debar, M. Dacier and A. Wespi. “Towards a taxonomy of intrusion detection systems”, Computer Networks, 31(8), pp 805–822, April 1999.
[2] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[3] B. E. Boser, I.M. Guyon, and V. N.Vapnik, “A training algorithm for optimal margin classifier,” in Proc. 5th ACM Workshop Comput. Learning Theory, Pittsburgh, PA, July 1992, pp. 144–152.
[4] Anita K. Jones and Robert S. Sielken, "Computer System Intrusion Detection: A Survey", University of Virginia, Thornton Hall.
[5] J. P. Anderson. “Computer Security Threat Monitoring and Surveillance”, Technical Report, James P. Anderson Co., Fort Washington, Pennsylvania, April 1980.
[6] S. Mukkamala, G. I. Janoski, and A. H. Sung. “Intrusion Detection Using Support Vector Machines”, Proceedings of the High Performance Computing Symposium — HPC 2002, pp 178–183, San Diego, April 2002.
[7] W. Lee, S. J. Stolfo and K. Mok. “Data mining in workflow environments: Experiences in intrusion detection”, Proceedings of the 1999 Conference on Knowledge Discovery and Data Mining (KDD-99), 1999.
[8] E. Eskin, A. Arnold, M, Prerau, L. Portnoy and S. Stolfo. “A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data”, In D. Barbara and S. Jajodia (editors), Applications of Data Mining in Computer Security, Kluwer, 2002.
[9] E. Eskin. “Anomaly Detection over Noisy Data using Learned Probability Distributions”, In Proceedings of the 2000 International Conference on Machine Learning (ICML-2000). Palo Alto, CA: July, 2000.
[10] A. K. Ghosh, A. Schwartbard and A. M. Shatz. "Learning Program Behavior Profiles for Intrusion Detection", Proceedings of the 1st USENIX Workshop on Intrusion Detection and Network Monitoring, pp 51–62, Santa Clara, CA, 1999.
[11] S. Mukkamala, G. Janoski and A. Sung. "Intrusion detection: support vector machines and neural networks", Proceedings of the IEEE International Joint Conference on Neural Networks (ANNIE), pp 1702–1707, St. Louis, MO, 2002.
[12] K. Wang and S. J. Stolfo. "One class training for masquerade detection", Proceedings of the 3rd IEEE Conference, Data Mining Workshop on Data Mining for Computer Security, Florida, 2003.
[13] S. Gunn. "Support vector machines for classification and regression", ISIS Technical Report, Image Speech and Intelligent Systems Group, University of Southampton, Southampton, U.K., May 1998.
[14] S. Haykin. Neural Networks: A Comprehensive Foundation, Chapter 6: Support Vector Machines. Englewood Cliffs, NJ: Prentice-Hall, 1999.
[15] P.-H. Chen, R.-E. Fan and C.-J. Lin. "A study on SMO-type decomposition methods for support vector machines", IEEE Transactions on Neural Networks, 17:893–908, July 2006.
[16] C. Chang and C. J. Lin, LIBSVM — A Library for Support Vector Machines.
[17] Sung A. H. (1998) “Ranking Importance of Input Parameters of Neural Networks,” Expert Systems with Applications, Vol. 15, pp.405–41.
[18] Wenjie Hu, Yihua Liao, V. Rao Vemuri “Robust Support Vector Machines for Anomaly Detection in Computer Security”