Solving Malware Classification Task using Python

Contents

Slide 2

My interests:

data analysis and visualization; machine learning; cybersecurity-related data analytics

The topic is important because:

applying machine learning techniques to malware detection makes it possible to keep pace with malware evolution and to combat security threats more effectively than other methods.

Slide 3

Terms

Malware

software that is specifically designed to disrupt, damage, or gain unauthorized access to a computer system

Benign Ware

ordinary software without any malicious activity

Slide 4

Main Steps

01 Dataset collection

02 Data reduction

03 Building a machine learning model

Slide 5

01. Dataset collection

With data collection, “the sooner the better” is always the best answer.
—Marissa Mayer

Slide 6

Problem

Create a dataset with features that will help the system distinguish between benign and malicious files:

find files representing malicious and benign activity

extract features from these files and tabulate them

Slide 7

Solution

Found:

3077 malicious binary files, collected from the “VX Heavens Virus Collection”

1952 benign binary files, collected on a local PC
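
The two classes are not the same size, which is worth keeping in mind when reading an accuracy score. A minimal sketch of the arithmetic, using only the two counts from the slide; the variable names are illustrative, and the majority-class baseline is an added point of reference that the slides do not report.

```python
# Counts taken from the slide above.
N_MALICIOUS = 3077
N_BENIGN = 1952

total = N_MALICIOUS + N_BENIGN
malicious_share = N_MALICIOUS / total

# Accuracy of a trivial classifier that labels every file as the larger class.
majority_baseline = max(N_MALICIOUS, N_BENIGN) / total

print(f"total files: {total}")                                   # 5029
print(f"malicious share: {malicious_share:.4f}")                  # 0.6119
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")  # 0.6119
```

Any trained model should be judged against that baseline of roughly 0.61 rather than against 0.5.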

Slide 8

Solution

Extracted:

100 features from binary portable executable (PE) files (.exe, .dll, .sys, etc.) using the “pefile” Python module
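
The slides do not list the 100 features, so the following is only a sketch of what header-based extraction with pefile might look like. The seven fields chosen here, the function names, and the label convention (1 = malicious) are assumptions made for illustration. The pefile import is kept inside `extract_from_file` so that the field-reading logic can be tried on a stand-in object without a real binary.

```python
from types import SimpleNamespace


def extract_features(pe):
    """Return a flat dict of numeric features from a parsed PE object.

    `pe` is expected to look like a `pefile.PE` instance. The fields below
    are an illustrative choice, not the 100 features used in the slides.
    """
    entropies = [section.get_entropy() for section in pe.sections]
    return {
        "NumberOfSections": pe.FILE_HEADER.NumberOfSections,
        "Characteristics": pe.FILE_HEADER.Characteristics,
        "SizeOfCode": pe.OPTIONAL_HEADER.SizeOfCode,
        "AddressOfEntryPoint": pe.OPTIONAL_HEADER.AddressOfEntryPoint,
        "SizeOfImage": pe.OPTIONAL_HEADER.SizeOfImage,
        "MeanSectionEntropy": sum(entropies) / len(entropies) if entropies else 0.0,
        "MaxSectionEntropy": max(entropies, default=0.0),
    }


def extract_from_file(path, label):
    """Parse one file with pefile and attach its class label (1 = malicious)."""
    import pefile  # third-party: pip install pefile

    pe = pefile.PE(path)
    try:
        row = extract_features(pe)
    finally:
        pe.close()
    row["label"] = label
    return row


# Stand-in object with made-up values, so the sketch runs without a real file.
fake_pe = SimpleNamespace(
    FILE_HEADER=SimpleNamespace(NumberOfSections=2, Characteristics=0x0102),
    OPTIONAL_HEADER=SimpleNamespace(
        SizeOfCode=4096, AddressOfEntryPoint=0x1000, SizeOfImage=16384
    ),
    sections=[
        SimpleNamespace(get_entropy=lambda: 6.0),
        SimpleNamespace(get_entropy=lambda: 2.0),
    ],
)
demo_row = extract_features(fake_pe)
print(demo_row)
```

Rows produced this way can be collected in a list and tabulated, for example with `pandas.DataFrame(rows)`.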

Slide 9

02. Dataset reduction

Redundancy is expensive but indispensable.
—Jane Jacobs

Slide 10

Problem

Select features that yield the most accurate results:

apply data reduction algorithms

obtain a dataset with reduced dimensionality

Slide 11

Solution

Applied:

Feature importance technique based on the Gini importance metric, for input features with low correlation

Principal component analysis (PCA), for input features with high correlation
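
Gini importance builds on Gini impurity: a split is useful if it makes the child nodes purer than the parent, and a feature's importance in a tree-based model accumulates the weighted impurity decreases of the splits that use it. The sketch below computes that decrease for a single split in plain Python. The toy values, thresholds, and function names are made up for illustration; the slides do not say which estimator produced the scores. In scikit-learn, fitted tree models expose the accumulated, normalized values as `feature_importances_`.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(values, labels, threshold):
    """Weighted Gini decrease from splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(labels) - weighted_children


# Toy data: 1 = malicious, 0 = benign (made-up values for illustration).
labels = [1, 1, 1, 0, 0, 0]
informative = [7.1, 6.8, 7.5, 3.2, 2.9, 3.8]    # separates the classes cleanly
uninformative = [1.0, 5.0, 1.0, 5.0, 1.0, 5.0]  # barely related to the label

gain_informative = impurity_decrease(informative, labels, threshold=5.0)
gain_uninformative = impurity_decrease(uninformative, labels, threshold=3.0)

print(f"informative feature:   {gain_informative:.4f}")    # 0.5000
print(f"uninformative feature: {gain_uninformative:.4f}")  # 0.0556
```

The feature that separates the classes removes all of the parent impurity (0.5), while the other one removes almost none, which is why it would rank lower.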

Slide 12

Solution

Obtained:

10 features with the highest importance scores; the higher the score, the more important the feature
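
Selecting the highest-scoring features amounts to sorting by score and keeping the first k. A minimal sketch; the feature names and scores are hypothetical placeholders, since the slide text does not list the actual ten features.

```python
def top_k_features(scores, k=10):
    """Return the k (name, score) pairs with the highest scores, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical importance scores, for illustration only.
scores = {"feature_a": 0.31, "feature_b": 0.05, "feature_c": 0.22, "feature_d": 0.42}

top_two = top_k_features(scores, k=2)
print(top_two)  # [('feature_d', 0.42), ('feature_a', 0.31)]
```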

Slide 13

Solution

Obtained:

reduced the dimensionality of the data from 8 to 2

Principal component 1: 78.77% of the variance
Principal component 2: 13.03% of the variance
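
The two shares can be added to see how much of the original variance the two-dimensional representation keeps. The sketch below only does that arithmetic on the reported numbers; it does not rerun PCA, and the variable names are illustrative. In scikit-learn, per-component shares of this kind are exposed as `explained_variance_ratio_` on a fitted `PCA` object.

```python
# Explained-variance shares reported on the slide, in percent.
explained = {"PC1": 78.77, "PC2": 13.03}

retained = round(sum(explained.values()), 2)
discarded = round(100.0 - retained, 2)

print(f"variance retained by 2 components: {retained:.2f}%")  # 91.80%
print(f"variance discarded: {discarded:.2f}%")                # 8.20%
```

Going from 8 dimensions to 2 therefore keeps about 91.8% of the variance and drops about 8.2%.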

Slide 14

03. Building a machine learning model

What we want is a machine that can learn from experience.
—Alan Turing

Slide 15

Problem

Determine which file is malicious and which is benign:

apply a machine learning algorithm

split the data into training and validation sets

Slide 16

Solution

The data was split into:

5 equal folds (5-fold cross-validation)
Each fold was used for both training and validation.
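
In k-fold cross-validation each fold serves once as the validation set while the remaining folds are used for training. The sketch below generates such index splits in plain Python. It assumes all 3077 + 1952 files were used, which the slides do not state explicitly; the function name is illustrative. Because 5029 is not divisible by 5, the folds can only be approximately equal.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k folds.

    Folds are contiguous; the first n_samples % k folds get one extra sample.
    """
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


# 3077 malicious + 1952 benign files from the dataset slide.
N_FILES = 3077 + 1952
fold_sizes = [len(val) for _, val in k_fold_indices(N_FILES, k=5)]
print(fold_sizes)  # [1006, 1006, 1006, 1006, 1005]
```

If the rows are ordered by class, they should be shuffled before contiguous folds are cut; scikit-learn's `KFold(shuffle=True)` and `StratifiedKFold` cover that case.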

Slide 17

Solution

Applied:

Decision Tree Classifier algorithm.
Built a decision tree.
Classification rate (accuracy score): 0.9371
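
A sketch of how a decision tree could be evaluated with 5-fold cross-validation in scikit-learn. It runs on synthetic data from `make_classification`, not on the malware dataset, so its scores say nothing about the 0.9371 reported above. The stratified, shuffled folds, the default tree hyperparameters, and averaging over folds are all assumptions; the slides do not describe those details.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature table (10 features, two classes).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold; the mean is a common summary figure.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()

print(f"fold accuracies: {scores.round(4)}")
print(f"mean accuracy: {mean_accuracy:.4f}")
```

Replacing `X` and `y` with the reduced feature table and its labels would give a figure comparable to the one on the slide.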

Slide 18

Libraries & frameworks used

pandas
NumPy
pefile
scikit-learn
Matplotlib
math

Slide 19

Resources

Presentation template

M. Zubair Shafiq et al. (2009) PE-Miner: Mining Structural Information to Detect Malicious Executables in Realtime. In: Engin Kirda, Somesh Jha, Davide Balzarotti, eds. Recent Advances in Intrusion Detection, 12th International Symposium, Saint-Malo: Springer, pp. 121-141.
California State University (2021) Malware, Trojan, and Spyware. [online], available from: https://www.csuchico.edu/isec/stories/malware-trojans-spyware.shtml#:~:text=Malware%3A%20Malware%20is%20short%20for,access%20to%20a%20computer%20system. [accessed 13 June 2021]
File name: Solving-Malware-Classification-Task-using-Python.pptx