HW 4: Exploratory Data Analysis Report
Summary
Download a public dataset from Kaggle.com and use Python to create an Exploratory Data Analysis Report that highlights the key attributes of the dataset and some preliminary insights from it.
100 points.  

The goal of this assignment is to create a professional-looking report that highlights the key attributes of the given dataset. This should emulate a scenario where a client has asked you to run a preliminary analysis on a dataset to determine if further analysis could uncover actionable insights.
I highly recommend scrolling down to the last instruction to read about the final assignment deliverable before beginning.

A video demonstration of this lab can be found here.

1. We’re going to perform an Exploratory Data Analysis (EDA) to summarize the main characteristics of this dataset by using some data science techniques within a Colab Notebook. Some questions we’ll want to answer and include in the EDA are:

a. What is the dataset file size?
b. What is the dataset file format?
c. How many columns are in the dataset?
d. What are the column names in the dataset?
e. How many records (aka rows) are in the dataset?
f. Are there any missing attributes and what are they?
g. If there is a quantitative value (it’s Score in this dataset), what is the distribution of that value?
h. If there is a qualitative value (it’s Text in this dataset), what are some basic insights we can extract?

Note: The deliverables for this assignment are a business report along with your Colab Notebook. As you answer the questions above, keep a list of the answers somewhere for later reference.

2. Visit the Kaggle dataset Amazon Fine Food Reviews at https://www.kaggle.com/snap/amazon-fine-food-reviews
3. Sign-in if prompted.
4. Download the reviews dataset (.csv file) and copy it to your Google Drive MSDA 683 folder.
5. Create a new Google Colab notebook called “MSDA 683 Lab 4 –
6. Import the following libraries.

7. Read the reviews file into a dataframe
8. Use a combination of what Kaggle tells you about the dataset and what you learned in Lab 1 about getting information from dataframes to answer questions a-h from part 1 above.
9. Create a new dataframe with just the Score and Text columns from the original dataframe.
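
Steps 7 to 9 might look something like the sketch below. So that it runs anywhere, it first writes a three-row stand-in file to a temporary folder; the sample rows, the path, and the helper name describe_dataset are assumptions for illustration only. In your notebook you would point csv_path at the reviews .csv file in your Drive folder instead.

```python
import os
import tempfile

import pandas as pd

# Stand-in data (an assumption): three made-up rows that use a few of the
# dataset's column names, so the sketch runs without the Kaggle download.
sample_csv = (
    "Id,ProductId,Score,Summary,Text\n"
    "1,B001,5,Great,Love this tea\n"
    "2,B002,1,,Arrived stale\n"
    "3,B001,5,Yum,Best cookies ever\n"
)
csv_path = os.path.join(tempfile.mkdtemp(), "reviews_sample.csv")
with open(csv_path, "w", encoding="utf-8") as handle:
    handle.write(sample_csv)


def describe_dataset(path):
    """Return the dataframe plus a dict of basic facts for questions a-f."""
    df = pd.read_csv(path)
    facts = {
        "file_size_bytes": os.path.getsize(path),
        "file_format": os.path.splitext(path)[1],
        "n_columns": df.shape[1],
        "column_names": list(df.columns),
        "n_rows": df.shape[0],
        "missing_per_column": df.isnull().sum().to_dict(),
    }
    return df, facts


df, facts = describe_dataset(csv_path)
for name, value in facts.items():
    print(f"{name}: {value}")

# Step 9: keep only Score and Text. The .copy() avoids pandas warnings
# when the Text column is modified in later steps.
rating_df = df[["Score", "Text"]].copy()
print(rating_df.head())
```

Keeping the facts in one dictionary makes it easy to paste the answers into your report later.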

10. View the counts of different scores.
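
One possible way to view the counts is the pandas value_counts method. The small dataframe here is a made-up stand-in so the cell runs on its own; use your own rating_df in the notebook.

```python
import pandas as pd

# Made-up stand-in for rating_df (an assumption); use your own dataframe.
rating_df = pd.DataFrame({
    "Score": [5, 4, 5, 1, 5, 3, 1],
    "Text": ["a", "b", "c", "d", "e", "f", "g"],
})

# Count how many reviews received each score, ordered by score value.
score_counts = rating_df["Score"].value_counts().sort_index()
print(score_counts)
```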

11. Plot the distribution of scores.
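
A minimal sketch of the plot, assuming matplotlib and the same made-up stand-in data as before. The axis labels and title are my own choices, not required wording.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for rating_df (an assumption); use your own dataframe.
rating_df = pd.DataFrame({"Score": [5, 4, 5, 1, 5, 3, 1]})

score_counts = rating_df["Score"].value_counts().sort_index()

# One bar per score; converting the scores to strings gives evenly
# spaced category labels on the x-axis.
fig, ax = plt.subplots()
ax.bar(score_counts.index.astype(str), score_counts.values)
ax.set_xlabel("Score")
ax.set_ylabel("Number of reviews")
ax.set_title("Distribution of review scores")
```

In Colab the figure should render when the cell finishes; outside a notebook you would call plt.show().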

12. Add a new cell and copy/paste the code below into it. Note, I found this code on the web for removing URLs. Not something I’ll expect you to know much about other than this is part of text preprocessing.
## Removal of URLs
import re

def remove_urls(text):
    """Strip http(s):// and www. links from a review."""
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub('', text)

rating_df['Text'] = rating_df['Text'].apply(remove_urls)
rating_df.head()
13. Add a new cell and copy/paste the code below into it. Note, I found this code on the web for removing punctuation. Not something I’ll expect you to know much about other than this is part of text preprocessing.
# Remove punctuation
import string

PUNCT_TO_REMOVE = string.punctuation

def remove_punctuation(text):
    """Custom function to remove the punctuation."""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

rating_df['Text'] = rating_df['Text'].apply(remove_punctuation)
rating_df.head()

14. Create a new dataframe that consists of all the 5-star reviews.
15. Keeping in mind the name of your new dataframe (fiveStardf), and what you learned above,
a. Show the head of the new dataframe to verify your new dataset has only 5 star reviews.
b. Show a bar chart of your new 5 star dataframe. You should see only one bar, right?
c. Show the shape of the new dataframe. Over 360,000 records, right?
16. The five star reviews dataframe is extremely large to work with in our browsers, so create a sample of 2000 records called “sampfiveStardf” from the five star reviews dataframe. Check the shape of your new dataframe to ensure it’s just 2000 records with 2 columns.
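
Steps 14 to 16 could be sketched as follows. The stand-in dataframe is made up, and the sample size is 3 only because the stand-in is tiny; the assignment uses 2000. The random_state argument is optional and simply makes the draw repeatable.

```python
import pandas as pd

# Made-up stand-in for rating_df (an assumption); use your own dataframe.
rating_df = pd.DataFrame({
    "Score": [5, 1, 5, 4, 5, 5, 2, 5],
    "Text": [f"review {i}" for i in range(8)],
})

# Step 14: a boolean mask keeps only the rows whose Score equals 5.
fiveStardf = rating_df[rating_df["Score"] == 5].copy()
print(fiveStardf.head())
print(fiveStardf.shape)

# Step 16: draw a random sample (n=2000 in the assignment).
sampfiveStardf = fiveStardf.sample(n=3, random_state=42)
print(sampfiveStardf.shape)
```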
17. This is a lot of code, but what you need to do is create two new empty lists: one to store product names and one to store adjectives. Then loop through your sampfiveStardf dataframe and, for each text row, test whether there’s a product name; if so, set it to lower case, then capitalize it. This is for consistency. While you’re looping through the reviews, you can also test for adjectives and apply the same lower case/capitalization. Wrap the code in a performance counter and print the processing time and the number of brands found out of your review size. Note that your number of brands will differ from mine because of the random sample.
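
Here is one way the loop might be written, assuming spaCy's named-entity label PRODUCT for product names and its part-of-speech tag ADJ for adjectives. The function names and the model name en_core_web_sm are my assumptions, not necessarily what the original cell used. The pipeline is passed in as an argument so the loop itself does not depend on how it was loaded.

```python
import time


def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline (model name is an assumption)."""
    import spacy  # imported here so the rest of the cell runs without spaCy
    return spacy.load(model_name)


def extract_products_and_adjectives(texts, nlp):
    """Collect product entities and adjectives from an iterable of review texts."""
    products, adjectives = [], []
    for text in texts:
        doc = nlp(text)
        # Entities that the pipeline labels as products.
        for ent in doc.ents:
            if ent.label_ == "PRODUCT":
                products.append(ent.text.lower().capitalize())
        # Tokens tagged as adjectives.
        for token in doc:
            if token.pos_ == "ADJ":
                adjectives.append(token.text.lower().capitalize())
    return products, adjectives


def timed_extraction(texts, nlp):
    """Run the extraction inside a performance counter and report the result."""
    texts = list(texts)
    start = time.perf_counter()
    products, adjectives = extract_products_and_adjectives(texts, nlp)
    elapsed = time.perf_counter() - start
    print(f"Processed {len(texts)} reviews in {elapsed:.1f} seconds; "
          f"found {len(products)} product mentions")
    return products, adjectives


# Usage in the notebook (not run here because it needs spaCy and a model):
# nlp = load_pipeline()
# FiveStarProds, FiveStarAdjs = timed_extraction(sampfiveStardf["Text"], nlp)
```

FiveStarAdjs is a hypothetical name for the adjective list; call yours whatever you like, as long as you use it consistently in the later steps.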

18. With such a small sample size, it’s difficult to get very many product names, so let’s increase the random sample size to 50000, from step 16, and re-run that code cell and the code cell that follows it, from step 17. This should take approximately 25 times longer (~15 mins) to run but you will end up with a more robust list of products.
19. After step 18 completes, create a sorted list of product names from your FiveStarProds list, then set the top 20 most common to a new dataframe called top_Prodlist. Print the dataframe.
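
A hedged sketch of the top-20 step using collections.Counter. The list contents are made up and the helper name top_n_dataframe is hypothetical; your FiveStarProds list comes from step 17.

```python
from collections import Counter

import pandas as pd

# Made-up stand-in for the list built in step 17 (an assumption).
FiveStarProds = ["Gatorade", "Subscribe", "Keurig", "Gatorade", "Keurig", "Gatorade"]


def top_n_dataframe(items, n=20, label="Product"):
    """Return the n most common items and their counts as a dataframe."""
    counts = Counter(items).most_common(n)  # sorted from most to least common
    return pd.DataFrame(counts, columns=[label, "Count"])


top_Prodlist = top_n_dataframe(FiveStarProds)
print(top_Prodlist)
```

The same helper would work for the adjective list later on by passing label="Adjective".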

19. It’s obvious that spaCy mis-labeled some terms as products. Let’s clean that up with the code below. Here are two cells of code to help with cleanup. It’s very important to take your results into account before running this, because your list will be different. The first cell creates a copy of your list, so you’ll always have your original list intact if you mess it up. You only need to run the first cell once (unless you accidentally delete something, then you can re-run it to get back to the original list). The second cell deletes the first row in the list, index 0, which for me is ‘Subscribe’. That’s not a product, so I delete it.

20. For each mis-labeled product, re-run the delete function with the appropriate index number, being sure to print the updated list after each delete. Do this until you have a list that looks like all product names. This is what mine looks like after I’ve cleaned it up.
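
The two cleanup cells might look roughly like this, assuming top_Prodlist is a pandas dataframe. The rows are made up and the name clean_Prodlist is hypothetical. Because drop works on index labels, the remaining labels do not shift after each delete, so you can read the next index straight off the printed list.

```python
import pandas as pd

# Made-up stand-in for top_Prodlist (an assumption); yours will differ.
top_Prodlist = pd.DataFrame({
    "Product": ["Subscribe", "Gatorade", "Keurig", "Amazon"],
    "Count": [9, 7, 5, 4],
})

# Cell 1: work on a copy so the original list stays intact.
clean_Prodlist = top_Prodlist.copy()

# Cell 2: delete a mis-labeled row by its index label, then print the result.
clean_Prodlist = clean_Prodlist.drop(index=0)
print(clean_Prodlist)

# Re-run the delete with the next index that is not a product.
clean_Prodlist = clean_Prodlist.drop(index=3)
print(clean_Prodlist)
```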

21. Plot a horizontal bar chart with counts of the product names. Additionally, save the chart to an image file that you can use later for your report.
22. Create a top 20 list of adjectives used in 5-star reviews
23. Chart the adjectives in a horizontal bar chart.
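
For the horizontal bar charts, a sketch along these lines should work with either the product or the adjective dataframe. The helper name plot_top_counts, the made-up rows, and the output file name are assumptions; it writes to a temporary folder here, whereas you would likely save next to your notebook.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for a cleaned top-20 dataframe (an assumption).
top_df = pd.DataFrame({
    "Product": ["Gatorade", "Keurig", "Tassimo"],
    "Count": [7, 5, 2],
})


def plot_top_counts(df, label_col, title, out_path):
    """Draw a horizontal bar chart of counts and save it as an image file."""
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.barh(df[label_col], df["Count"])
    ax.invert_yaxis()  # put the most common item at the top
    ax.set_xlabel("Count")
    ax.set_title(title)
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    return ax


out_path = os.path.join(tempfile.mkdtemp(), "top_products_5star.png")
ax = plot_top_counts(top_df, "Product", "Top products in 5-star reviews", out_path)
print("Saved chart to", out_path)
```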
24. Great work if you’ve made it this far! There’s one more piece of information that would be good to share in our report with the client, and that’s top adjectives listed for one particular product. I chose Gatorade in my example, but please choose whatever product from your list that you’d like. Just replace the product name “gatorade” with the lower-case name of your product. This will take a while to run as well.
25. Create a top 20 list of adjectives used in your product’s 5-star reviews
26. Chart it.
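
One simple way to narrow the reviews to a single product is a case-insensitive substring match on the review text, sketched below with made-up rows. This is an assumption about the approach, not necessarily how the original cell did it. The filtered texts can then go through the same adjective loop, top-20 list, and chart as before.

```python
import pandas as pd

# Made-up stand-in for the sampled 5-star dataframe (an assumption).
sampfiveStardf = pd.DataFrame({
    "Score": [5, 5, 5],
    "Text": ["Gatorade is refreshing", "Lovely tea", "My kids love GATORADE"],
})

product_name = "gatorade"  # replace with the lower-case name of your product

# Lower-case the text so the match ignores capitalization; na=False treats
# missing text as "no match" instead of raising an error.
mentions = sampfiveStardf["Text"].str.lower().str.contains(
    product_name, regex=False, na=False
)
productdf = sampfiveStardf[mentions].copy()
print(f"{len(productdf)} of {len(sampfiveStardf)} reviews mention {product_name}")
```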

27. Now would be a good time to ensure that all the information and visuals that you’ve captured above are documented in a draft of your report.

28. STUDENT PROBLEM: After you’ve captured the information about the dataset, plus the information and visuals on 5-star reviews for all products and for the one product that you chose, in a draft of your report, you need to repeat what you did above but with 1-star reviews. My recommendation is that you start at the end of your notebook and re-create the code cells, starting with the cell you created in step 14, to set up a dataframe with only 1-star reviews. Give your new dataframe and any remaining variables appropriate names to ensure you’re only referencing your 1-star reviews. Just change anything we named with “five” to “one”. For example:
You’ll also want to ensure you rename your chart titles to reflect what you’re showing in your 1-star reviews.
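
As an illustration of the renaming, the 1-star setup might start like the sketch below. The stand-in rows are made up, and the variable names are simply the 5-star names with “five” changed to “one”.

```python
import pandas as pd

# Made-up stand-in for rating_df (an assumption); use your own dataframe.
rating_df = pd.DataFrame({
    "Score": [1, 5, 1, 2, 1, 4],
    "Text": [f"review {i}" for i in range(6)],
})

# Same pattern as step 14, but filtering on a Score of 1.
oneStardf = rating_df[rating_df["Score"] == 1].copy()

# Same pattern as step 16; min() guards against asking for more rows
# than exist, which only matters for this tiny stand-in.
samponeStardf = oneStardf.sample(n=min(2000, len(oneStardf)), random_state=42)

# Rename chart titles too, so the visuals say what they show.
chart_title = "Top products in 1-star reviews"
print(oneStardf.shape, samponeStardf.shape, chart_title)
```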

29. Put your report together using the facts, insights, and graphics, you created and gathered in this notebook on both 1 and 5 star reviews for all products and the one product that you chose to analyze.
a. Be sure to address all the questions from step 1.
b. Feel free to enhance your report with any research or other visuals you found or created in Python or any other tool. For example, with the top adjectives and counts for 1 and 5 star reviews, you could easily make a wordcloud to include in your report.
c. The insights should be a paragraph, or more, in your own words, that explains what you discovered in the dataset.
d. Make sure to save your report as a PDF document and upload the file to the assignment in Blackboard.
e. Additionally, please share your notebook with kearsing@gmail.com and post a link to the Blackboard assignment.
f. Please be sure that your report looks like a business report and not just a Colab notebook. i.e. I would not expect to see code cells in a business report.
g. For example, here are two excellent reports I’ve seen in the past.
• Traditional
• Presentation style
GOOD LUCK!!
