TASK
You are to train your selected supervised machine learning algorithms using the master dataset provided, and compare their performance to each other and to TOBORRM’s initial attempt to classify the samples.
Part 1 – General data preparation and cleaning.
Import the MLDATASET_PartiallyCleaned.xlsx into R Studio. This dataset is a partially cleaned version of MLDATASET-200000-1612938401.xlsx.
Write the appropriate code in R Studio to prepare and clean the MLDATASET_PartiallyCleaned dataset as follows:
For How.Many.Times.File.Seen, set all values = 65535 to NA;
Convert Threads.Started to a factor whose categories are given by
1 = 1 thread started
2 = 2 threads started
3 = 3 threads started
4 = 4 threads started
5 = 5 or more threads started
Hint: Replace all values greater than 5 with 5, then use the factor(.) function.
Log-transform Characters.in.URL using the log(.) function, and remove the original Characters.in.URL column from the dataset (unless you have overwritten it with the log-transformed data)
Select only the complete cases using the na.omit(.) function, and name the dataset MLDATASET.cleaned.
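The cleaning steps above can be sketched in base R. The snippet below applies them to a tiny synthetic stand-in for the real dataset (the column names are taken from the brief; the data values are invented for illustration):

```r
# Synthetic stand-in for MLDATASET_PartiallyCleaned (values invented)
toy <- data.frame(
  How.Many.Times.File.Seen = c(12, 65535, 3),
  Threads.Started          = c(1, 7, 4),
  Characters.in.URL        = c(20, 55, 33)
)

# 1. Treat the sentinel value 65535 as missing
toy$How.Many.Times.File.Seen[toy$How.Many.Times.File.Seen == 65535] <- NA

# 2. Cap Threads.Started at 5, then convert to a labelled factor
toy$Threads.Started[toy$Threads.Started > 5] <- 5
toy$Threads.Started <- factor(toy$Threads.Started,
                              levels = 1:5,
                              labels = c("1 thread started",
                                         "2 threads started",
                                         "3 threads started",
                                         "4 threads started",
                                         "5 or more threads started"))

# 3. Log-transform Characters.in.URL, overwriting the original column
toy$Characters.in.URL <- log(toy$Characters.in.URL)

# 4. Keep only the complete cases
MLDATASET.cleaned <- na.omit(toy)
```

Overwriting `Characters.in.URL` in place (rather than adding a new column) satisfies the requirement that the original, untransformed column not remain in the dataset.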
Briefly outline the preparation and cleaning process in your report and why you believe the above steps were necessary.
Write the appropriate code in R Studio to partition the data into training and test sets using a 30/70 split. Be sure to set the randomisation seed using your student ID. Export both the training and test datasets as CSV files; these will need to be submitted along with your code.
Note that the training set is typically larger than the test set in practice. However, given the size of this dataset, you will only use 30% of the data to train your ML models to save time.
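A minimal base-R sketch of the 30/70 partition, using a placeholder data frame and a placeholder student ID (1234567); the file names `train_data.csv` and `test_data.csv` are assumptions, not prescribed by the brief:

```r
# Replace 1234567 with your own student ID, and `dat` with MLDATASET.cleaned
set.seed(1234567)
dat <- data.frame(x = rnorm(100), y = rbinom(100, 1, 0.5))

n          <- nrow(dat)
train.idx  <- sample(seq_len(n), size = floor(0.3 * n))  # 30% for training
train.data <- dat[train.idx, ]
test.data  <- dat[-train.idx, ]

# Export both partitions for submission
write.csv(train.data, "train_data.csv", row.names = FALSE)
write.csv(test.data,  "test_data.csv",  row.names = FALSE)
```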
Part 2 – Compare the performances of different machine learning algorithms
Select three supervised learning modelling algorithms to test against one another by running the following code. Make sure you enter your student ID into the command set.seed(.). Your 3 modelling approaches are given by myModels.
library(tidyverse)
set.seed(Enter your student ID)
models.list1 <- c("Logistic Ridge Regression",
                  "Logistic LASSO Regression",
                  "Logistic Elastic-Net Regression")
models.list2 <- c("Classification Tree",
                  "Bagging Tree",
                  "Random Forest")
myModels <- c("Binary Logistic Regression",
              sample(models.list1, size = 1),
              sample(models.list2, size = 1))
myModels %>% data.frame
For each of your ML modelling approaches, you will need to:
Run the ML algorithm in R on the training set with Actually.Malicious as the outcome variable. EXCLUDE Sample.ID and Initial.Statistical.Analysis from the modelling process.
Perform hyperparameter tuning to optimise the model (except for the Binary Logistic Regression model):
Outline your hyperparameter tuning/searching strategy for each of the ML modelling approaches, even if you are using the same search strategy as in the workshop notes. Report the search range(s) used for hyperparameter tuning, the value of k used for k-fold CV, the number of repeated CVs (if applicable), and the final optimal tuning parameter values and relevant CV statistics (where appropriate).
If your selected tree model is Bagging, you must tune the nbagg, cp and minsplit hyperparameters, with at least 3 values for each.
If your selected tree model is Random Forest, you must tune the num.trees, mtry, min.node.size, and sample.fraction hyperparameters, with at least 3 values for each.
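If the Random Forest option is drawn, a search grid satisfying the "at least 3 values per hyperparameter" requirement can be built with `expand.grid`; the value ranges below are illustrative assumptions only, not prescribed by the brief:

```r
# Hypothetical tuning grid for the Random Forest hyperparameters
# named in the brief; adjust the ranges to suit your data
rf.grid <- expand.grid(
  num.trees       = c(200, 400, 600),
  mtry            = c(2, 4, 6),
  min.node.size   = c(1, 5, 10),
  sample.fraction = c(0.5, 0.7, 0.9)
)
nrow(rf.grid)  # 3^4 = 81 candidate combinations to evaluate by CV
```

Each row of the grid is one candidate combination; the model would be fitted and cross-validated once per row, and the combination with the best CV performance retained.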
Evaluate the performance of each ML model on the test set. Provide the confusion matrices and report the following:
Sensitivity (the detection rate for actual malicious samples)
Specificity (the detection rate for actual non-malicious samples)
Overall Accuracy
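The three metrics above can be computed directly from a 2x2 confusion matrix in base R; the `truth` and `pred` vectors below are invented stand-ins for `Actually.Malicious` in the test set and a model's predictions:

```r
# Toy vectors standing in for the true labels and model predictions
truth <- factor(c("Yes", "Yes", "No", "No", "Yes", "No"))
pred  <- factor(c("Yes", "No",  "No", "Yes", "Yes", "No"))

# Confusion matrix: rows = predicted class, columns = actual class
cm <- table(Predicted = pred, Actual = truth)

sensitivity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # detection rate, malicious
specificity <- cm["No",  "No"]  / sum(cm[, "No"])   # detection rate, non-malicious
accuracy    <- sum(diag(cm))    / sum(cm)           # overall accuracy
```

The same calculation applied to `Initial.Statistical.Analysis` against `Actually.Malicious` gives the baseline figures for TOBORRM's initial attempt in part e).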
Provide a brief statement on your final recommended model and why you chose that model over the others. Parsimony, accuracy, and to a lesser extent, interpretability should be taken into account.
Create a confusion matrix for the variable Initial.Statistical.Analysis in the test set. Recall that the data in this column correspond to TOBORRM’s initial attempt to classify the samples. Compare and comment on the performance of your optimal ML model in part d) to the initial analysis by the TOBORRM team.
What to submit
Gather your findings into a report (maximum of 5 pages), citing sources where necessary.
Present how and why the data was manipulated, how the ML models were tuned, and finally how they performed relative to each other and to the initial analysis by TOBORRM. You may use graphs, tables and images where appropriate to help your reader understand your findings.
Make a final recommendation on which ML modelling approach is the best for this task.
Your final report should look professional, include appropriate headings and subheadings, should cite facts and reference source materials in APA-7th format.
Your submission must include the following:
Your report (5 pages or less, excluding cover/contents page)
A copy of your R code, and two csv files corresponding to your training and test datasets.
The report must be submitted through TURNITIN and checked for originality. The R code and data sets are to be submitted separately via a Blackboard submission link.
Note that no marks will be given if the results you have provided cannot be confirmed by your code. Furthermore, all pages exceeding the 5-page limit will not be read or examined.
Marking Criteria
Criterion Contribution to assignment mark
Accurate implementation of data cleaning and of each supervised machine learning algorithm in R. 20%
Explanation of data cleaning and preparation. 10%
An outline of the selected modelling approaches, the hyperparameter tuning and search strategy, the corresponding performance evaluation in the training set (i.e. CV results), and the optimal tuning hyperparameter values. 20%
Presentation, interpretation and comparison of the performance measures (i.e. confusion matrices) among the selected ML algorithms. Justification of the recommended modelling approach and how it compares against the results of the initial analysis in the test set. 30%
Report structure and presentation (including tables and figures, and where appropriate, proper citations and referencing in APA-7th style). Report should be clear and logical, well structured, mostly free from communication, spelling and grammatical errors. Appropriate and easy for a non-mathematical (semi-technical) audience to follow and understand. 20%
