HW 3: Text Processing and Visualization with Python
Use a Python natural language processing library (spaCy) in a Colab notebook to process text and create visualizations of results.
100 points
Contents
Preparation
Getting Started
Word Cloud tasks
Part of Speech task
Named Entity tasks
Information Extraction Tasks
Add a visualization using Matplotlib
STUDENT PROBLEM: Perform your own text analysis
Submitting your work
SpaCy Named Entity Types
Grading Rubric
Preparation:
1. Browse the article “Top 10 Python Libraries for Natural Language Processing (2018)” https://kleiber.me/blog/2018/02/25/top-10-python-nlp-libraries-2018/
2. Watch the Hands-on Lab #3 walkthrough video, which covers the entire assignment (linked here).
Getting Started:
3. Visit the Wikipedia page for the 2016 World Series and copy the URL. Visit https://www.textise.net/, paste in the URL, and press Textise. This may take a few seconds to run.
4. Copy all the text shown and paste it into a new text document (such as Notepad on Windows). Brackets is a free text editor for Mac if you want something more powerful than the built-in editor.
5. Delete the header and footer information (all the non-value-added text at the top and bottom of the file) that came from Textise and Wikipedia. Save the text document as WS2016.txt on your computer, then upload it to your Google Drive MSDA 683 folder.
6. Create a new Colab notebook named “MSDA 683 Lab 3 – WS2016”.
Word Cloud tasks:
7. Import the libraries as shown below:
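The library screenshot isn’t reproduced here; a minimal sketch of the imports this lab relies on (all of these packages were preinstalled in Colab at the time of writing):

    # Core libraries for the word cloud tasks.
    from wordcloud import WordCloud, STOPWORDS   # word cloud generation
    import matplotlib.pyplot as plt              # displaying the clouds and charts
    import numpy as np                           # image-mask arrays (step 18)
    from PIL import Image                        # reading the mask image (step 18)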
8. Connect to your Google Drive in Colab
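Colab mounts Drive with its own helper; it will prompt you to authorize access:

    # Mount Google Drive so files in it are visible to the notebook.
    from google.colab import drive
    drive.mount('/content/drive')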
9. Read the WS2016.txt file into a variable.
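A sketch, assuming the file sits in a top-level MSDA 683 folder in My Drive (adjust the path to match your Drive layout):

    # Read the article text into a single string.
    with open('/content/drive/My Drive/MSDA 683/WS2016.txt', encoding='utf-8') as f:
        ws_text = f.read()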
10. Create a basic word cloud from the Wikipedia text. Note: be sure to set stopwords=' ' as seen below; this tells the word cloud not to exclude stopwords.
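A sketch of the basic cloud; passing stopwords=' ' gives WordCloud a stop list containing only a space, so no actual words are filtered out:

    # Word cloud with stopword removal effectively disabled.
    wc = WordCloud(stopwords=' ').generate(ws_text)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()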
11. Create another word cloud, but this time don’t include the stopwords parameter.
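With the parameter omitted, WordCloud falls back to its built-in STOPWORDS set:

    # Default behavior: common English stopwords are excluded.
    wc = WordCloud().generate(ws_text)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()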
12. Import, download, and assign the English NLTK stopwords to a variable.
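The standard NLTK steps:

    # Download the stopword corpus and keep the English list.
    import nltk
    nltk.download('stopwords')
    from nltk.corpus import stopwords
    nltk_stops = set(stopwords.words('english'))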
13. Create a new word cloud with a white background and using the NLTK stopwords.
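A sketch using the variable from step 12:

    # White background, NLTK stopwords excluded.
    wc = WordCloud(background_color='white', stopwords=nltk_stops).generate(ws_text)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()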
14. Add additional stop words to the NLTK list to clean up your word cloud (see the sketch after step 15).
15. Apply the additional stop words to a new word cloud.
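A sketch covering steps 14 and 15; the extra words below are hypothetical examples, so add whatever noise dominates your own cloud instead:

    # Extend the NLTK set with article-specific noise words (examples only).
    extra_stops = nltk_stops | {'game', 'games', 'inning', 'one', 'two'}

    # Regenerate the cloud with the expanded stop list.
    wc = WordCloud(background_color='white', stopwords=extra_stops).generate(ws_text)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()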
16. Make a new string from the Wikipedia file string, replacing 'Cub ' with 'Cubs ' in the new string. Note the trailing space in both: it stops the pattern from matching inside 'Cubs' and keeps the words separated after replacement (see the sketch after step 17).
17. Pass the new string to the word cloud
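A sketch covering steps 16 and 17; the trailing spaces matter, as explained in step 16:

    # Merge 'Cub' mentions into 'Cubs' so the cloud counts them together.
    ws_text_fixed = ws_text.replace('Cub ', 'Cubs ')

    wc = WordCloud(background_color='white', stopwords=extra_stops).generate(ws_text_fixed)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()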
18. Create an image array from the homeplate.png file. Apply a new background color, colormap, and title, as shown below.
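A sketch, assuming homeplate.png has been uploaded to the same Drive folder; the background color, colormap, and title below are placeholders for whatever the walkthrough uses:

    # Build a mask array from the home-plate image, then shape and restyle the cloud.
    mask = np.array(Image.open('/content/drive/My Drive/MSDA 683/homeplate.png'))
    wc = WordCloud(background_color='black', colormap='Blues', mask=mask,
                   stopwords=extra_stops).generate(ws_text_fixed)
    plt.imshow(wc, interpolation='bilinear')
    plt.title('2016 World Series')
    plt.axis('off')
    plt.show()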
Part of Speech task:
19. Load spaCy into a variable named nlp. Set a doc variable equal to the text shown below. Loop through the doc object to print each word's part of speech.
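A sketch; the quote from the original screenshot isn’t reproduced here, so the sentence below is a stand-in:

    import spacy

    # Load the small English pipeline (preinstalled in Colab at the time of writing).
    nlp = spacy.load('en_core_web_sm')

    # Stand-in text; use the quote given in the lab.
    doc = nlp('The Chicago Cubs won the 2016 World Series in Cleveland.')

    # Print each word with its coarse part-of-speech tag.
    for token in doc:
        print(token.text, token.pos_)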
Named Entity tasks:
20. Named entities are words or phrases that refer to real-world objects. For instance, New York City is an instance of a city, Peter Parker is an instance of a person, etc. spaCy can identify named entities; reference the documentation to learn more. Type in the commands below and run them to see the results printed.
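The screenshot commands aren’t reproduced here; a typical starting point is to print the entity spans spaCy found:

    # Each entity span carries its text and a label such as PERSON, GPE, or DATE.
    for ent in doc.ents:
        print(ent.text, ent.label_)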
21. Loop through the document, this time printing out words with their named entity label.
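At the token level, the label lives in ent_type_:

    # ent_type_ is an empty string for tokens that are not part of any entity.
    for token in doc:
        print(token.text, token.ent_type_)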
22. spaCy provides nice visualizer capabilities; find the displaCy documentation here. Let's use the entity visualizer to highlight the named entities. Show the quote using displacy rendering.
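The entity visualizer renders inline in Colab:

    from spacy import displacy

    # Highlight the named entities in the quote.
    displacy.render(doc, style='ent', jupyter=True)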
23. Load your Wikipedia file into spaCy (see the sketch after step 24).
24. Display your Wikipedia document using displacy label rendering.
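A sketch covering steps 23 and 24; rendering the full article produces a long output cell:

    # Run the whole Wikipedia text through the pipeline, then render its entities.
    wiki_doc = nlp(ws_text)
    displacy.render(wiki_doc, style='ent', jupyter=True)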
25. Loop through the document, this time only showing the displacy-rendered labels for 'PERSON' entity types. This code can be commented out, and its results hidden, after it has run once.
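A sketch, assuming spaCy v3 (whose displacy.render accepts sentence Spans); the ents option restricts highlighting to PERSON:

    # Render each sentence, highlighting only PERSON entities.
    for sent in wiki_doc.sents:
        displacy.render(sent, style='ent', options={'ents': ['PERSON']}, jupyter=True)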
Information Extraction Tasks:
26. Build a Pandas DataFrame that holds the top 20 names mentioned in the Wiki article.
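A sketch using collections.Counter over the PERSON entities:

    import pandas as pd
    from collections import Counter

    # Count every PERSON mention, then keep the 20 most frequent.
    person_counts = Counter(ent.text for ent in wiki_doc.ents if ent.label_ == 'PERSON')
    names_df = pd.DataFrame(person_counts.most_common(20), columns=['Name', 'Count'])
    names_df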
27. Let's clean up the DataFrame to combine the bare last names with the full names of two of the players (see the sketch after step 28). Note: when you do this with your own Wikipedia article, you'll need to change the index numbers to reflect your data.
28. Sort your cleaned-up DataFrame by count.
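A sketch covering steps 27 and 28; the row indexes below are hypothetical, so inspect your own DataFrame to see which bare last names duplicate which full names:

    # Fold a bare last name's count into the matching full name, then drop the duplicate.
    names_df.loc[1, 'Count'] += names_df.loc[7, 'Count']   # hypothetical indexes
    names_df = names_df.drop(index=7)
    # Repeat for the second player, then re-sort by count.
    names_df = names_df.sort_values('Count', ascending=False).reset_index(drop=True)
    names_df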
Add a visualization using Matplotlib
29. Create a bar chart using the popular Matplotlib library to visualize the frequency of the top 20 names listed in the Wikipedia article.
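A minimal bar-chart sketch; the labels and title are placeholders to adapt:

    # Bar chart of name frequency from the cleaned DataFrame.
    plt.figure(figsize=(12, 6))
    plt.bar(names_df['Name'], names_df['Count'])
    plt.xticks(rotation=75, ha='right')
    plt.xlabel('Name')
    plt.ylabel('Mentions')
    plt.title('Top 20 Names in the 2016 World Series Article')
    plt.tight_layout()
    plt.show()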
STUDENT PROBLEM: Perform your own text analysis.
30. Using what you learned in this lab, create a new notebook and perform your own text analysis on a Wikipedia article about a large company, historical figure, or major historical event. Try the top 100 Wikipedia articles if you need inspiration. Make sure your article is long enough to contain at least 20 unique names. You will add three charts in total: one using the PERSON entity type and two using different named entity types (see the spaCy NER types listed below). Be sure to update your chart titles to reflect your new topic. Also include at least one word cloud, using the techniques demonstrated above. Delete any cells that have errors, and make sure the notebook is fully run so I can see all the results (charts, word cloud, lists) of your text analysis without needing to run it.
Submitting your work:
31. Share both notebooks with kearsing@gmail.com, and include links to your notebooks in the Blackboard assignment.
SpaCy Named Entity Types:
PERSON – People, including fictional
NORP – Nationalities or religious or political groups
FAC – Buildings, airports, highways, bridges, etc.
ORG – Companies, agencies, institutions, etc.
GPE – Countries, cities, states
LOC – Non-GPE locations, mountain ranges, bodies of water
PRODUCT – Vehicles, weapons, foods, etc. (Not services)
EVENT – Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART – Titles of books, songs, etc.
LAW – Named documents made into laws
LANGUAGE – Any named language
DATE – Absolute or relative dates or periods
TIME – Times smaller than a day
PERCENT – Percentage (including “%”)
MONEY – Monetary values, including unit
QUANTITY – Measurements, as of weight or distance
ORDINAL – “first”, “second”
CARDINAL – Numerals that do not fall under another type
Grading Rubric:
1. (80 pts.) Your first notebook, WS2016, contains all the required code and runs error free, as shown above.
2. (3 pts.) Your second notebook is free of any errors.
3. (3 pts.) Your second notebook does not contain any unnecessary or duplicate code.
4. (3 pts.) Your second notebook has been fully run and all results are visible to me without needing to run it.
5. (8 pts.) Your second notebook has 3 new charts and a Word Cloud based on your text analysis.
6. (3 pts.) Your chart labels are updated to reflect your actual results (not the 2016 World Series).