University of Westminster
School of Computer Science & Engineering
6BUIS001W Business Intelligence – Coursework 1 (2021/22)

Module leader: Dr. V.S. Kontogiannis
Unit: Coursework 1
Note: The current version of CW1 should be considered provisional, as it still has to be moderated by both the internal moderator and the external examiner. It may therefore be subject to slight changes, following the module leader’s agreement to such amendments.
Weighting: 50%
Qualifying mark: 30%
Description: Show evidence of understanding of various Business Intelligence concepts through the implementation of clustering and forecasting algorithms using real datasets. Implementation is performed in the R environment, and students need to critically evaluate their results.
Learning Outcomes Covered in this Assignment: This assignment contributes towards the following Learning Outcomes (LOs):
LO3 review the recent business intelligence tools to carry out critical evaluation on methodologies and technologies available for information retrieval, pattern recognition and knowledge discovery;
LO4 apply contemporary business intelligence technologies in order to enable users to view data patterns by deploying various tools.
Handed Out: 11/10/2021
Due Date: 11/11/2021, submission by 13:00
Expected deliverables: Submit on Blackboard only one PDF file containing the required details. All implemented code should be included in your documentation together with the results/analysis/discussion.
Method of Submission: Electronic submission on BB via a provided link close to the submission time.
Type of Feedback and Due Date: Feedback will be provided on BB on 2nd December 2021 (15 working days).
BCS CRITERIA MEETING IN THIS ASSIGNMENT: Problem solving strategies; ‘Knowledge and understanding of mathematical and/or statistical principles’

Assessment regulations
Refer to section 4 of the “How you study” guide for undergraduate students for a clarification of how you are assessed, penalties and late submissions, what constitutes plagiarism etc.

Penalty for Late Submission
If you submit your coursework late but within 24 hours or one working day of the specified deadline, 10 marks will be deducted from the final mark as a penalty for late submission, except for work which obtains a mark in the range 40 – 49%, in which case the mark will be capped at the pass mark (40%). If you submit your coursework more than 24 hours or more than one working day after the specified deadline, you will be given a mark of zero for the work in question unless a claim of Mitigating Circumstances has been submitted and accepted as valid.
It is recognised that on occasion illness or a personal crisis can mean that you fail to submit a piece of work on time. In such cases you must inform the Campus Office in writing on a mitigating circumstances form, giving the reason for your late or non-submission. You must provide relevant documentary evidence with the form. This information will be reported to the relevant Assessment Board, which will decide whether the mark of zero shall stand.
For more detailed information regarding University Assessment Regulations, please refer to the following website: http://www.westminster.ac.uk/study/current-students/resources/academic-regulations

Instructions for this coursework
During the marking period, all coursework assessments will be compared in order to detect possible cases of plagiarism/collusion. For each question, show all the steps of your work (code/results/discussion). In addition, students should note that, although clarifications of CW questions can be provided during tutorials, the coursework has to be carried out outside tutorial sessions.

Coursework Description
Clustering Part
In this assignment, we consider a set of observations on a number of white wine varieties involving their chemical properties and ranking by tasters. The wine industry has shown a recent growth spurt as social drinking is on the rise.
The price of wine depends to some extent on wine appreciation by wine tasters, a rather abstract and volatile factor. Another key factor in wine certification and quality assessment is physicochemical tests, which are laboratory-based and take into account factors like acidity, pH level, presence of sugar and other chemical properties. For the wine market, it would be of interest if human quality tasting could be related to the chemical properties of wine, so that the certification, quality assessment and assurance process is more controlled.
One dataset (whitewine_v1.xls) is available; it covers white wine and has 4873 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines, one of which is Quality (i.e. the last column), based on sensory data, while the rest are chemical properties of the wines, including density, acidity, alcohol content etc. All chemical properties of wines are continuous variables. Quality is an ordinal variable with possible ranking from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters and the final rank assigned is the median rank given by the tasters.

Description of attributes:
fixed acidity: most acids involved with wine are fixed or non-volatile (do not evaporate readily)
volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
citric acid: found in small quantities, citric acid can add ‘freshness’ and flavour to wines
residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/litre, and wines with greater than 45 grams/litre are considered sweet
chlorides: the amount of salt in the wine
free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
sulphates: a wine additive which can contribute to sulphur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): quality (score between 0 and 10)

For this clustering part you need to use the first 11 attributes for your “clustering”-based calculations. Do not attempt/apply any dimensionality reduction techniques.

1st Objective (partitioning clustering)
You need to conduct a k-means clustering analysis of this white wine dataset. As this is a typical problem that is multi-dimensional in terms of features, you first need to provide a brief discussion of the methodologies used for reducing the dimensionality of such problems and the rationale for using them.
(Suggestion: consult related literature and add some relevant references). In this specific clustering part, however, the analysis will be performed with all initial features, as the main aim is to assess different clustering results under the initial conditions.
Before conducting the k-means analysis, perform the following pre-processing tasks: scaling and outlier removal, and briefly justify your answer. (Suggestion: the order of scaling and outlier removal is important. The outlier removal topic is not covered in tutorials, so you need to explore it yourself). As the provided dataset is not balanced (the number of samples per quality class, i.e. the 12th column, varies), you may also, before the scaling/outlier tasks, consider merging adjacent classes which have few samples, for example quality classes 7 and 8. Initially, the dataset contains 5 classes. If you perform such a “merging” task, please provide all details in your report, but the final number of classes cannot be less than 3.
Define the number of cluster centres (via manual & automated tools) and perform the k-means analysis for each attempt (i.e. each different k). For each of these k-means attempts, check your produced cluster outcome against the information obtained from the 12th column and provide the related results/discussion (evidence of a “confusion” matrix and calculation of the accuracy/recall/precision indices from it). Choose the best “winner” clustering case (justify your response) and briefly explain the meaning of the accuracy/recall/precision indices. Finally, for the “winner” case, provide the coordinates of each centre for each clustering group.
Write code in RStudio to address all the above issues (code/results/discussion need to be included in your report). At the end of your report, also provide, as an Appendix, the full code developed by you. The usage of the kmeans R function is compulsory.
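
The pre-processing, choice of k and evaluation steps described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not a model solution: it runs on a small synthetic data frame rather than whitewine_v1.xls, the column names x1 and x2 and all helper names are hypothetical, the 1.5 × IQR rule is only one possible outlier criterion, and removing outliers before scaling is one ordering that a report would still need to justify. Automated tools for choosing k are outside this sketch.

```r
# Minimal sketch on SYNTHETIC data (not whitewine_v1.xls); all names are hypothetical.
set.seed(42)
n <- 60
wine <- data.frame(
  x1 = c(rnorm(n, mean = 0), rnorm(n, mean = 8)),
  x2 = c(rnorm(n, mean = 0), rnorm(n, mean = 8)),
  quality = rep(c(5, 6), each = n)       # stand-in for the 12th column
)
wine$x1[1] <- 100                        # inject one obvious outlier
features <- c("x1", "x2")

# Assumed outlier rule: flag values beyond 1.5 * IQR from the quartiles
is_outlier <- function(v) {
  q <- unname(quantile(v, c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  v < q[1] - fence | v > q[2] + fence
}
keep <- !Reduce(`|`, lapply(wine[features], is_outlier))
clean <- wine[keep, ]

# z-score scaling of the features only
scaled <- scale(clean[features])

# Manual "elbow" check: total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(scaled, centers = k, nstart = 10)$tot.withinss)

# One k-means attempt (k = 2 matches the two synthetic groups)
fit <- kmeans(scaled, centers = 2, nstart = 10)

# Cluster ids are arbitrary, so map each cluster to its majority class
conf <- table(actual = clean$quality, cluster = fit$cluster)
classes <- rownames(conf)
majority <- classes[apply(conf, 2, which.max)]
predicted <- factor(majority[fit$cluster], levels = classes)
actual <- factor(clean$quality, levels = classes)
conf2 <- table(actual = actual, predicted = predicted)

accuracy  <- sum(diag(conf2)) / sum(conf2)
precision <- diag(conf2) / colSums(conf2)   # per class: TP / (TP + FP)
recall    <- diag(conf2) / rowSums(conf2)   # per class: TP / (TP + FN)

# Cluster centres, back-transformed from scaled to original units
centres_original <- sweep(
  sweep(fit$centers, 2, attr(scaled, "scaled:scale"), `*`),
  2, attr(scaled, "scaled:center"), `+`
)
```

Mapping clusters to classes by majority vote is one design choice among several; with more clusters than classes, two clusters can map to the same class, which is why the factor levels are fixed before the second table is built.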
(Marks 50)

Forecasting Part
Time series analysis can be used in a multitude of business applications for forecasting a quantity into the future and explaining its historical patterns. An exchange rate is the currency rate of one country expressed in terms of the currency of another country. In the modern world, the exchange rates of the most successful countries tend to be floating. Under this system, the rate is set by the foreign exchange market through supply and demand for that particular currency relative to other currencies. Exchange rate prediction is one of the challenging applications of modern time series forecasting and is very important for the success of many businesses and financial institutions. The rates are inherently noisy, non-stationary and deterministically chaotic. One general assumption made in such cases is that the historical data incorporate all such behaviour. As a result, the historical data are the major input to the prediction process.
Forecasting exchange rates poses many challenges. Exchange rates are influenced by many economic factors. Like other economic time series, an exchange rate has trend, cycle and irregularity components. Classical time series analysis does not perform well on finance-related time series. Hence, the idea of applying Neural Networks (NN) to forecast exchange rates has been considered as an alternative solution. NNs try to emulate human learning capabilities, creating models that represent the neurons in the human brain.
In this forecasting part you need to use an MLP-NN to predict the next step-ahead exchange rate of EUR/USD. Daily data (exchangeEUR20152016.xlsx) have been collected from February 2015 until September 2016 (400 data points). The first 300 of them have to be used as training data, and the remaining ones as the testing set. Use only the 3rd column from the .xlsx file, which corresponds to the exchange rates.

2nd Objective (MLP)
You need to construct an MLP neural network for this forecasting problem.
The definition of the input vector for NNs is a very important component of time-series analysis. Therefore, you first need to provide a brief discussion of the various schemes/methods used to define this input vector. (Suggestion: consult related literature and add some relevant references). In this specific forecasting part, however, we are going to utilise only the “autoregressive” (AR) approach, i.e. time-delayed exchange rates as input variables. As the order of this AR approach is not known, you need to experiment with various input vectors, and for each of these cases you need to construct an input/output matrix for the MLP (using “time-delayed” rates). Each of these matrices needs to be normalised, as this is a standard procedure for MLP NNs. You need to explain briefly why the normalisation procedure is necessary for this specific type of NN.
For the training phase, you need to experiment with various MLPs, utilising these input vectors and various internal network structures (such as hidden layers, nodes, learning rate, activation function, etc.). For each case, the testing performance (i.e. evaluation) of the networks will be calculated using the standard statistical indices (RMSE, MAE and MAPE). Create a comparison table of their testing performances (using these specific statistical indices). Briefly explain the meaning of these three statistical indices. From this comparison table, check the “efficiency” of your best one-hidden-layer and two-hidden-layer networks by checking the total number of weight parameters per network. Briefly discuss which approach is preferable to you and why. Finally, for your best MLP network, provide the related results both graphically (your prediction output vs. desired output) and via the statistical indices.
Write code in RStudio to address all these requirements. Show all your working steps (code & results, including comparison results from models with different input vectors and internal structures).
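
The construction of a time-delayed input/output matrix, its normalisation, the three testing indices and the weight count can be sketched in base R as follows. This is only an illustration under stated assumptions: the series is synthetic rather than exchangeEUR20152016.xlsx, all function names are hypothetical, min-max normalisation with training-set bounds is one common choice rather than the required one, the weight count assumes one bias per non-input node, and a naive persistence forecast (previous value) stands in for the MLP output, which in the coursework would come from a trained neuralnet model.

```r
# Minimal sketch on a SYNTHETIC series (not exchangeEUR20152016.xlsx); names are hypothetical.
set.seed(1)
rate <- 1.10 + cumsum(rnorm(400, sd = 0.005))   # stand-in for 400 daily rates

# Input/output matrix for AR order p: columns lag p, ..., lag 1, then the target x_t
make_io <- function(series, p) {
  m <- embed(series, p + 1)             # each row: x_t, x_{t-1}, ..., x_{t-p}
  m <- m[, (p + 1):1, drop = FALSE]     # reorder to x_{t-p}, ..., x_{t-1}, x_t
  colnames(m) <- c(paste0("lag", p:1), "target")
  as.data.frame(m)
}

p <- 3
io <- make_io(rate, p)

# Assumed split: rows whose target is among the first 300 observations are training rows
n_train <- 300 - p
train <- io[1:n_train, ]
test  <- io[(n_train + 1):nrow(io), ]

# Min-max normalisation using training bounds only (an assumption, avoids test leakage)
lo <- min(unlist(train))
hi <- max(unlist(train))
normalise   <- function(x) (x - lo) / (hi - lo)
denormalise <- function(x) x * (hi - lo) + lo
train_n <- as.data.frame(lapply(train, normalise))
test_n  <- as.data.frame(lapply(test, normalise))

# Placeholder prediction (persistence): next value equals the most recent lag
pred   <- denormalise(test_n$lag1)
actual <- test$target

# Testing indices, computed on the original (de-normalised) scale
rmse <- function(a, f) sqrt(mean((a - f)^2))
mae  <- function(a, f) mean(abs(a - f))
mape <- function(a, f) mean(abs((a - f) / a)) * 100
indices <- c(RMSE = rmse(actual, pred), MAE = mae(actual, pred), MAPE = mape(actual, pred))

# Number of weights in a fully connected MLP, counting one bias per non-input node
count_weights <- function(n_inputs, hidden, n_outputs = 1) {
  sizes <- c(n_inputs, hidden, n_outputs)
  sum((sizes[-length(sizes)] + 1) * sizes[-1])
}
count_weights(3, 5)         # one hidden layer of 5 nodes: 26 weights
count_weights(3, c(4, 3))   # two hidden layers of 4 and 3 nodes: 35 weights
```

Repeating make_io for several values of p gives the different input vectors to compare, and the weight count gives a simple basis for weighing a network's testing error against its size.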
As everyone will have different forecasting results, emphasis in the marking scheme will be given to the adopted methodology and to the explanation/justification of the various decisions you have taken in order to provide a solution that is acceptable in terms of performance. Full details of your results/code/discussion are needed in your report. At the end of your report, also provide, as an Appendix, the full code developed by you. The usage of the neuralnet R function for MLP modelling is compulsory.
(Marks 50)

Coursework Marking scheme
The Coursework will be marked based on the following marking criteria (marks shown in square brackets):

1st Objective (partitioning clustering)
Brief discussion of methodologies used for reducing the input dimensionality [5]
Pre-processing tasks (3 marks for scaling and 7 marks for outliers removal) [10]
Define the number of cluster centres by showing all necessary steps/methods (via manual & automated tools) [7]
K-means analysis for each attempt (show all kmeans R-template outputs) [6]
Evaluation of the produced outputs against 12th column [9]
Define the final “winner” cluster case and provide brief explanation of evaluation indices (2 marks for winner and 6 marks for indices) [8]
Illustrate the coordinates of each centre for each clustering group [5]

2nd Objective (MLP)
Brief discussion of the various methods used for defining the input vector in time-series problems (provide relevant references in the text) [5]
Evidence of various adopted input vectors and the related input/output matrices [5]
Evidence of correct normalisation (5 marks) and brief discussion of its necessity (3 marks) [8]
Implement a number of MLPs, using various structures (layers/nodes) / input parameters / network parameters, and show in a table their performances comparison (based on testing data) through the provided statistical indices (4 marks for structures with different input vectors, 8 marks for different internal NN structures/parameters and 4 for the comparison table) [16]
Discussion of the meaning of these statistical indices [6]
Discuss the issue of “efficiency” with your two best NN structures [4]
Provide your best results both graphically (your prediction output vs. desired output) and via performance indices (3 marks for the graphical display and 3 marks for showing the requested statistical indices) [6]