Assignment Task
Instruction
Once you have provided a screenshot for the tasks below, save this file as a PDF file and submit it together with the Power BI files that you have used for the tasks.
Please note that for each of the tasks below you need to show that
You can work with data to produce a dashboard
The dashboard must demonstrate an understanding of the theory of human visual perception, appropriate chart and design choices for the scenario.
Task 1: Data Modelling & pre-processing
This task uses data in tv shows and movies available on Netflix.
Download the file “Netflix_titles.csv” from Blackboard.
Below there is the set of sub-tasks that you need to complete.
Import the data. This is a .csv file, not an Excel file. Do you remember how to load it?
Delete the ”director” column
Remove all the empty rows in the “cast” column
Create a new table “Type” – this table must contain only the “show_id” column and the “Type” column
Remove the “type” column from the Netflix_titles table
Close and apply your changes, go to report view and check that you have two tables with one relationship as shown below. If not, you have to create the relationship manually.
Then create another table called “Date”, containing only the show_id and date_added columns
Remove the date_added column from the Netflix_titles table.
Close&Apply your changes, you should now have the three tables below:
Then, remove all the “TV Show” types from the dataset by deleting all the rows in the “type” table containing “TV Shows”.
In the “duration” column remove all the rows that contain the word “Season”, e.g. “1-Season”, “2-Seasons”, “1 s”, “2 s”, etc…
Remove the word “min” from each row of the “duration” column. You should leave only the numbers in each row.
Change the data type of “duration” into whole number
The next part of this task will focus on your reporting skills. You will have to explore the different Power BI reporting tools to be able to complete these tasks.
Remember that you can select the chart and use the Format and Analytics tab to change any formatting options.
Create a map similar to the one below that shows the duration by country (please note that your map will look different than this example).
Explore the format tab to complete the tasks below
Centre the title
Filter all values of “duration” to show items greater than 60 minutes. Hint: use the Filters panel
On a new page of your report, create the chart on the right that shows the duration by year.
Change the name of the y-axis into “Duration in minutes”.
Please post a screenshot of the dashboard + Data Model view, that you have created in the space below.
Task 2: Python and Pandas
In this task, you will be required to write Python code to solve some data wrangling tasks. For this exercise, I recommend using Google Colab. The code should go into the table at the end.
Download the Excel workbook (Voting Figures) from Blackboard (you will find it under ICA Support Files). This file contains information about the Brexit referendum votes. To upload the file to Google Colab, click the left-most button in the screeenshot (below Files). Your first task is to import the data into pandas .
Create a new column “Average Remain” to work the proportion of Remain votes out of the total electorate. Proportion means division…so your formula should include a DIVIDE operator (“/”).
You can use display(name_of_dataframe) to see your dataframe in a nice format.
We want to reduce the number of decimal places of the Average Remain column to two. Use the appropriate function to do this (don’t be afraid to Google).
Remove the Remain column.
We want to create a column “Remain Above Average”.
If the value in the Average Remain Column is higher than or equal to the mean, then the column should display “Yes”, else display “No”.
We would like to see the Remain proportion across each Region of the UK. Apply the appropriate transformation to the data.
Task 3: Dashboard Design
In this task, you will be required to perform some data analytics steps to analyse a financial dataset.
To complete the sub-tasks below, open Power BI and select “Try a sample dataset”.
Then select “Load sample data”.
Now, load the financials table into Power BI
Create a Power BI Dashboard to analyse the data and complete the tasks below.
You can use any charts/tools to answer the questions. For this task, you are not required to work on the data (using Transform Data) or change the data model.
For each task, add the answer. At the end, please link to your published dashboard (you need to be signed in to be able to publish).
Business Questions
1. Which product had the highest profit?
2. How many different products are in the dataset?
3. What was the gross sale associated with the “Paseo” product?
4. How many units were sold in 2014?
5. How many units were sold in May (over the two years)?
6. Which month had the highest number of units sold?
7. How many different countries are in the dataset?
8. Which two months had the lowest number of units sold?
Arrange your tiles/charts to create a well-designed dashboard (use the knowledge covered in the lessons). Add a title, change the colours/formatting options of the charts. Once you are happy with the dashboard design, paste the link below.
Use the space below to explain how you did apply any of the theoretical aspects of data visualization covered in this module (for example, you could justify the selection of a specific chart or design choices for the scenario). Please provide three different points.
Task 4: Data Misrepresentation
Imagine you are a data analyst at a reputable company. You come across a situation where data misrepresentation could potentially benefit the company financially but would involve unethical practices, which may also have negative social consequences. How would you handle this dilemma, and what steps would you take to uphold professionalism, responsibility, integrity, and consider the social impact in your role as a data analyst? Please limit your answer to 300 words.
Task 5: Statistics
5.1 You are working with a dataset, and you observe that it has significant outliers. What measures of the central tendency and dispersion are most appropriate to use in this situation?
5.2 You are using a poll to make an inference about the voting population in the UK. This poll was taken by calling a registry of landlines, around Monday morning. What issues do you see with this methodology, and how might you improve on it?
5.3 Write a Python function that takes in an array of numbers and calculates its median; the array may be of arbitrary length (odd or even). You must not use any built-in or third party libraries for this exercise. Some starting code is provided for you