Practical assignment: Applying methods of machine learning
Task:
For this assignment, you will select a dataset and apply methods of unsupervised and supervised learning to it. The goal of
the assignment is to develop your skills in using machine learning algorithms and analysing the obtained results. The
deliverable is a report that you prepare on completion of the assignment.
The assignment must be completed using the Orange tool.
Before starting to work on your assignment, you must find and choose a dataset on the web. Some of the well-known
repositories are the following:
When selecting the dataset, take into consideration the following aspects:
• select a dataset that is suitable for a classification task;
• it is preferable to select a dataset that is already provided as a .csv file;
• the dataset should be well-documented (there should be information about who created the set, when and what
the data source is);
• the dataset should be of reasonable size (at least 200 data objects);
• the dataset should be well annotated (there should be information about which features are stored and what
they mean);
• the number of features should be between 5 and 15;
• the dataset should be labelled;
• you should avoid datasets in which many features have Boolean values (true/false, 1/0, etc.). It is preferable
to use datasets with continuous and/or discrete (with more than 2 values) feature values;
• you should avoid datasets of unlabelled data (e.g. text corpora and raw images).
Part I – Pre-processing/Exploring the data
To complete this part of the assignment, you will need to take the following steps:
1. Selecting and describing the dataset based on the information given in the repository/database where the dataset
was located.
2. If the dataset you have acquired from the repository is not in a format that is easy to work with (like a comma-
separated-values, or .csv, file), convert it into the needed format. Your dataset file should consist of an n×d table,
where d is the number of dimensions of the data and n is the number of data objects. Your columns should be
arranged in the following way: data object ID, the class label of the data object, and then the collected feature
values.
3. If the values of any feature are textual values (e.g. yes/no, positive/neutral/negative, etc.), they must be
transformed into numerical values.
4. If some data objects are missing values of features, it is necessary to find a way to obtain them by studying
additional sources of information.
5. Representing your training dataset visually and statistically:
a) you must create at least two 2- or 3-dimensional scatter plots illustrating the separability of classes in your
dataset based on different features; avoid using the data object ID as a variable in the
scatter plots;
b) you must create at least 2 histograms showing the separation of classes for the features of interest;
c) you must show 2 distributions for the features of interest;
d) you must calculate statistics on your data (at least the central tendency and the dispersion of the feature values).
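Steps 3 and 5d can be cross-checked outside Orange with a few lines of code. The following is a minimal sketch using only the Python standard library; the rows, the column names (`length`, `rating`) and the numeric codes are hypothetical and only illustrate the idea. The assignment itself must still be carried out in the Orange tool.

```python
import statistics

# Hypothetical data objects: ID, class label, then feature values.
rows = [
    {"id": 1, "label": "A", "length": 5.1, "rating": "negative"},
    {"id": 2, "label": "A", "length": 4.9, "rating": "neutral"},
    {"id": 3, "label": "B", "length": 6.3, "rating": "positive"},
    {"id": 4, "label": "B", "length": 5.8, "rating": "positive"},
]

# Step 3: explicit mapping from textual values to numbers (the codes are a choice).
RATING_CODES = {"negative": 0, "neutral": 1, "positive": 2}


def encode(rows, column, codes):
    """Return copies of the rows with one textual column replaced by numeric codes."""
    return [{**row, column: codes[row[column]]} for row in rows]


def describe(values):
    """Central tendency (mean, median) and dispersion (sample std, min, max)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


encoded = encode(rows, "rating", RATING_CODES)
for feature in ("length", "rating"):
    print(feature, describe([row[feature] for row in encoded]))
```

Writing the mapping out explicitly keeps the meaning of each numeric code documented, which is useful when the feature table is described in the report.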
Include the following information in the report:
• description of the dataset (providing references to the sources of information used):
– title, source, author and/or owner of the dataset;
– description of the problem domain of the dataset;
– licensing regarding the dataset (if any);
– how the dataset was collected;
• description of the content of the dataset (providing references to the sources of information used):
– the number of data objects in the dataset;
– the number of classes in the dataset, the meaning of each class and the way of representing classes
(explanation of the labels assigned to classes); if the dataset provides several possible
classifications, then the report must clearly identify which classification is considered in the assignment;
– the number of data objects belonging to each class;
– the number and meaning of features in the dataset, as well as their value types and ranges (this
information should be presented in a table consisting of the feature representation, its meaning, value
type and range of values available in the dataset);
– a snippet of the structure of your datafile in which the columns of your datafile and class labels are
shown together with some data objects;
• conclusions coming from the analysis of scatter plots, histograms and distributions (from Step 5 in Part I) about the
separability of your classes (remember to include your graphs in the report). Try to answer the following questions:
– Are the classes in your dataset balanced, or do one or several classes prevail? This is determined
by how many data objects belong to each class.
– Does the visual representation of the data allow the structure of the data to be seen? It is a question of
whether data objects belonging to different classes are clearly separable.
– How many data groupings can be identified by studying the visual representation of the data? It is a
question of whether there are any separable groupings of data if the data objects of different classes
merge.
– Are the identified data groupings close to each other or far from each other?
• conclusions coming from the analysis of statistical calculations (central tendency and dispersion).
Part II – Unsupervised learning
For this part of the assignment, you will be running unsupervised clustering on your dataset. Part I gave you an
understanding of what features and classes you have and how well you can separate data objects into classes. This part
of the assignment aims to look at the data in an unsupervised fashion to see if the assumptions about class structure hold.
To complete this part of the assignment, you will need to take the following steps:
1. Apply two methods of unsupervised learning considered in class: (1) Hierarchical clustering and (2) K-Means.
2. Perform at least 3 experiments with Hierarchical clustering, freely changing the values of hyperparameters, and
analysing the operation of the algorithm;
3. Perform experiments with the K-means algorithm using at least five different k values, calculate the Silhouette
Score, and analyse the performance of the algorithm.
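To make the Silhouette Score in step 3 easier to interpret, the sketch below shows one common way it is computed: for each data object, a is the mean distance to the other members of its own cluster, b is the mean distance to the members of the nearest other cluster, and the score is (b − a) / max(a, b), averaged over all data objects. This is an illustrative standard-library implementation with Euclidean distance and hypothetical toy points; the Orange tool may differ in details such as the distance metric.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point itself contributes distance 0, hence the len(own) - 1)
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight pairs of points that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = [0, 0, 1, 1]  # groups the two nearby pairs together
poor = [0, 1, 0, 1]  # splits each nearby pair across clusters

print("good clustering:", silhouette_score(points, good))
print("poor clustering:", silhouette_score(points, poor))
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below indicate overlapping or wrongly assigned data objects, so comparing the score across the different k values helps to justify the choice of k.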
Include the following information in the report:
• Description of the hyperparameters available in the Orange tool and their meaning for each algorithm.
• Description of the experiments performed, clearly indicating the hyperparameter values used, and conclusions
about the operation of each algorithm.
• Conclusions, based on the analysis of the operation of both algorithms, about whether the classes in the
dataset are well or poorly separable.
Part III – Supervised learning
For this part of the assignment, you will be running at least 3 classification algorithms on the data you collected and
analysed in Part I and Part II of this assignment. One of the algorithms must be an artificial neural
network (ANN). You can choose the other two algorithms on your own.
To complete this part of the assignment, you will need to take the following steps:
1. In addition to the ANN, choose at least two supervised learning methods that are suitable for a classification task. You can use the methods
considered in class and any other of the algorithms available in the Orange tool for classification tasks.
2. Divide your dataset into training and test sets.
3. For each algorithm, perform at least 3 experiments using the training dataset, changing the values of the algorithm
hyperparameters and analysing the algorithm performance metrics.
4. For each algorithm, choose the trained model that provides the best algorithm performance.
5. Apply the trained model of each algorithm to the test dataset.
6. Evaluate and compare the performance of the trained models.
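Step 2 is easier to reason about when the split keeps the class proportions of the full dataset (a stratified split). The sketch below illustrates the idea with the Python standard library and also produces the per-class counts and percentages that the report asks for. The labels, the 30% test fraction and the function names are hypothetical choices made for illustration; in the assignment itself the split is performed in the Orange tool.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices), keeping class proportions similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def class_report(labels, indices):
    """Per-class (count, percentage) for the selected data objects."""
    counts = Counter(labels[i] for i in indices)
    return {label: (n, 100 * n / len(indices)) for label, n in sorted(counts.items())}


labels = ["A"] * 60 + ["B"] * 40  # hypothetical class labels
train_idx, test_idx = stratified_split(labels, test_fraction=0.3)

print("train:", len(train_idx), class_report(labels, train_idx))
print("test: ", len(test_idx), class_report(labels, test_idx))
```

Fixing the random seed makes the split reproducible, so the experiments in steps 3 to 6 are all run on the same training and test sets.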
Include the following information in the report:
• Short description (1/3 of A4) of the essence of the supervised learning algorithms you have used and motivation
for choosing two of them (excluding the artificial neural network).
• Description of the hyperparameters available in the Orange tool and their meaning for each algorithm.
• Information on test and training datasets:
– the total number of data objects added to the test and training datasets (by number and %);
– information on how many data objects from each class are included in your training and test sets (by
number and %);
• Using a table, represent the hyperparameter values used in the experiments for each algorithm.
• Conclusions on the performance of the models in the performed experiments, clearly identifying the model that
will be used for testing.
• Test results of trained models and comparison and interpretation of their performance.
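When comparing test results, it helps to look beyond overall accuracy to per-class precision and recall, especially if the classes are not balanced. The sketch below computes these from actual and predicted labels using the Python standard library; the labels, the model names and the predictions are hypothetical and only serve to show how two models can be compared.

```python
from collections import Counter


def evaluate(actual, predicted):
    """Return accuracy plus per-class precision and recall."""
    pairs = Counter(zip(actual, predicted))  # counts of (actual, predicted) pairs
    classes = sorted(set(actual) | set(predicted))
    accuracy = sum(pairs[(c, c)] for c in classes) / len(actual)

    per_class = {}
    for c in classes:
        true_positive = pairs[(c, c)]
        predicted_c = sum(pairs[(other, c)] for other in classes)
        actual_c = sum(pairs[(c, other)] for other in classes)
        per_class[c] = {
            "precision": true_positive / predicted_c if predicted_c else 0.0,
            "recall": true_positive / actual_c if actual_c else 0.0,
        }
    return accuracy, per_class


# Hypothetical test-set labels and the predictions of two trained models.
actual = ["A", "A", "A", "B", "B", "B", "B", "A"]
predictions = {
    "model_1": ["A", "A", "B", "B", "B", "B", "A", "A"],
    "model_2": ["A", "B", "B", "B", "B", "B", "B", "B"],
}

for name, predicted in predictions.items():
    accuracy, per_class = evaluate(actual, predicted)
    print(name, "accuracy:", accuracy, per_class)
```

In this toy example the second model has a lower accuracy and a much lower recall for class A, which is the kind of difference worth interpreting in the report rather than quoting a single number.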
General requirements
The following are general requirements that students must satisfy:
• The report must contain the following sections: title page, a page with an overview of the Orange tool workflow for
the entire assignment, body text of Parts I, II and III of this assignment, and a list of the information sources used.
• On the title page, the student must provide a link to the created project and dataset on a public website (e.g.
Google or GitHub). Using the link provided, the teacher must be able to download the student’s assignment
without additional registration or restrictions.
• In all three parts of the assignment, the student must clearly describe if something is not relevant to his/her
assignment, for example, there is no information about the licensing aspects of the dataset or the algorithm does
not have hyperparameters.
• Evidence of the work done must be attached to the body text of the assignment, i.e. screenshots that show the
settings of the Orange Widgets and the results obtained.
• The figures and tables added to the report must be numbered, explained and referenced in the body text.
• The report must be submitted as a single .docx or .pdf file.
• The report does not need to be supplemented with unnecessary theory and information. The student should try
to give short and concise answers.
Evaluation criteria:
When evaluating the assignment, the teacher will consider:
• the quality of the report (whether the information required above is included and is meaningful);
• the correctness of the Orange tool workflow;
• the quality of the student’s conclusions;
• absence of a violation of academic integrity.
The assignment will automatically receive a failing grade if:
• a breach of academic integrity has been identified;
• the student has selected a dataset that is not suitable for the classification task and does not meet the
requirements described in this document;
• the student has not provided a link to the project and dataset on the public website on the title page of the report.