STAT 7008 – Assignment 2
Due date: 31 Oct 2018

Question 1 (hashtag analysis)
tweets1.json contains tweets received before a Presidential debate in 2016, and all the other data files contain tweets received immediately after the same debate. Please download the files from the following link:
https://transfernow.net/912g42y1vs78
1. Write code to read the data files tweets1.json to tweets5.json, and combine tweets2.json to tweets5.json into a single file named tweets2.json. Determine the number of tweets in tweets1.json and tweets2.json.
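A minimal sketch of the reading and combining step, assuming each file stores one JSON-encoded tweet per line (the usual layout for streamed tweets); if the files hold a single JSON array instead, replace the line loop with a single json.load.

```python
import json

def read_tweets(filename):
    """Read a tweet file, assuming one JSON object per line."""
    tweets = []
    with open(filename, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                tweets.append(json.loads(line))
    return tweets

tweets1 = read_tweets('tweets1.json')

# Combine tweets2.json ... tweets5.json and write the result back out
# as a single file named tweets2.json.
tweets2 = []
for i in range(2, 6):
    tweets2.extend(read_tweets('tweets%d.json' % i))
with open('tweets2.json', 'w', encoding='utf-8') as f:
    for tw in tweets2:
        f.write(json.dumps(tw) + '\n')

print(len(tweets1), len(tweets2))  # number of tweets in each file
```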
2. To clean the tweets in each file with a focus on extracting hashtags, note that 'retweeted_status' is itself a tweet nested within a tweet. Select tweets using the following criteria:
   – A non-empty set of hashtags, either in 'entities' or in the 'entities' of its 'retweeted_status'.
   – There is a timestamp.
   – There is a legitimate location.
   – Hashtags written in English are extracted as-is; hashtags written partially in English are converted by ignoring the non-English characters.
   Write a function that returns a dictionary of acceptable tweets, locations and hashtags for tweets1.json and tweets2.json respectively, as sketched below.
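A sketch of such a cleaning function, assuming the standard Twitter payload fields ('entities', 'created_at' and the user's 'location'); treating any non-empty user location as "legitimate" is a simplifying assumption.

```python
import re

def clean_tweets(tweets):
    """Return a dict of acceptable tweets, their locations and their hashtags."""
    result = {'tweets': [], 'locations': [], 'hashtags': []}
    for tw in tweets:
        # Hashtags may sit in the tweet itself or in its retweeted_status.
        tags = tw.get('entities', {}).get('hashtags', [])
        if not tags and 'retweeted_status' in tw:
            tags = tw['retweeted_status'].get('entities', {}).get('hashtags', [])
        location = (tw.get('user') or {}).get('location') or ''
        if not tags or not tw.get('created_at') or not location.strip():
            continue
        # Keep only English (ASCII) letters, dropping other characters.
        texts = [re.sub('[^A-Za-z]', '', t['text']).lower() for t in tags]
        texts = [t for t in texts if t]
        if texts:
            result['tweets'].append(tw)
            result['locations'].append(location.strip().lower())
            result['hashtags'].extend(texts)
    return result

clean1 = clean_tweets(tweets1)   # likewise for tweets2
```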
3. Write a function to extract the top n tweeted hashtags from a given hashtag list. Use the function to find the top n tweeted hashtags in tweets1.json and tweets2.json respectively.
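A possible implementation using collections.Counter; clean1 refers to the dictionary returned by the part 2 function above.

```python
from collections import Counter

def top_hashtags(hashtags, n):
    """Return the n most frequent hashtags as (hashtag, count) pairs."""
    return Counter(hashtags).most_common(n)

print(top_hashtags(clean1['hashtags'], 10))
```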
4. Write a function that returns a data frame containing the top n tweeted hashtags of a given hashtag list. The columns of the returned data frame are 'hashtag' and 'freq'.
5. Use the function to produce a horizontal bar chart of the top n tweeted hashtags in tweets1.json and tweets2.json respectively.
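One way to implement parts 4 and 5 together; the column names follow the specification above, while the sorting direction and figure styling are free choices.

```python
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt

def top_hashtags_df(hashtags, n):
    """Top n tweeted hashtags as a data frame with columns 'hashtag' and 'freq'."""
    pairs = Counter(hashtags).most_common(n)
    return pd.DataFrame(pairs, columns=['hashtag', 'freq'])

df_top1 = top_hashtags_df(clean1['hashtags'], 10)
# Horizontal bar chart, most frequent hashtag on top.
df_top1.sort_values('freq').plot.barh(x='hashtag', y='freq', legend=False)
plt.tight_layout()
plt.show()
```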
6. Find the max time and min time of the tweets in tweets1.json and tweets2.json respectively.
7. For each file, divide the interval (min time, max time) into 10 equally spaced periods.
8. For a given collection of tweets, write a function that returns a data frame with two columns: hashtags and their times of creation. Use the function to produce data frames for tweets1.json and tweets2.json. Using pandas.cut or otherwise, create a third column 'level' in each data frame which cuts the times of creation by the corresponding interval obtained in part 7.
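A sketch covering parts 6 to 8, assuming 'created_at' uses Twitter's usual timestamp format; pd.cut with bins=10 splits the (min time, max time) interval into 10 equal periods, which is exactly the division asked for in part 7.

```python
import pandas as pd

def hashtag_times(tweets):
    """One row per (hashtag, creation time) pair for a collection of tweets."""
    rows = []
    for tw in tweets:
        t = pd.to_datetime(tw['created_at'], format='%a %b %d %H:%M:%S %z %Y')
        tags = tw.get('entities', {}).get('hashtags', [])
        if not tags and 'retweeted_status' in tw:
            tags = tw['retweeted_status'].get('entities', {}).get('hashtags', [])
        for tag in tags:
            rows.append({'hashtag': tag['text'].lower(), 'time': t})
    return pd.DataFrame(rows)

times1 = hashtag_times(clean1['tweets'])
print(times1['time'].min(), times1['time'].max())          # part 6
# Parts 7 and 8: label each row by its period, numbered 1 to 10.
times1['level'] = pd.cut(times1['time'], bins=10, labels=range(1, 11))
```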
9. Using pandas.pivot or otherwise, create a numpy array or a pandas data frame whose rows are the time periods defined in part 7 and whose columns are hashtags. The entry for the ith time period and jth hashtag is the number of occurrences of the jth hashtag in the ith time period. Fill entries without data with zero. Do this for tweets1.json and tweets2.json respectively.
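A sketch using pandas.crosstab, an equivalent route to pandas.pivot; it fills the missing (period, hashtag) combinations with zero automatically.

```python
import pandas as pd

# Rows: the 10 time periods from part 7; columns: hashtags;
# entries: occurrence counts, with empty combinations set to zero.
table1 = pd.crosstab(times1['level'], times1['hashtag'])
```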
10. Following part 9, what is the number of occurrences of the hashtag 'trump' in the sixth period of tweets1.json? What is the number of occurrences of the hashtag 'trump' in the eighth period of tweets2.json?
11. Using the tables obtained in part 9, we can also find the total number of occurrences of each hashtag. Rank these hashtags in decreasing order and produce a time plot of the top 20 hashtags in a single graph. Rescale the graph so that it is neither too small nor too large. Do this for both tweets1.json and tweets2.json.
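A sketch of the ranking and time plot, with the figure size as a free choice:

```python
import matplotlib.pyplot as plt

# Total occurrences per hashtag, ranked in decreasing order.
totals = table1.sum().sort_values(ascending=False)
top20 = totals.index[:20]

# One line per hashtag over the 10 periods, all in a single graph.
table1[top20].plot(figsize=(12, 6))
plt.xlabel('time period')
plt.ylabel('number of occurrences')
plt.legend(ncol=2, fontsize='small')
plt.show()
```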
12. The file zip_codes_states.csv contains the city, state, county, latitude and longitude of US locations. Read the file.
13. Select only the tweets in tweets1.json and tweets2.json whose locations appear in zip_codes_states.csv. Also remove the location 'london'.
14. Find the top 20 tweeted locations in tweets1.json and tweets2.json respectively.
15. Since there are multiple (lon, lat) pairs for each location, write a function that returns the average lon and average lat of a given location. Use the function to generate the average lon and average lat for every location in tweets1.json and tweets2.json.
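A sketch covering parts 12 and 15 together, using the column names listed in part 12:

```python
import pandas as pd

zips = pd.read_csv('zip_codes_states.csv')   # part 12

def average_lonlat(location):
    """Average longitude and latitude over all rows matching a city name."""
    rows = zips[zips['city'].str.lower() == location]
    return rows['longitude'].mean(), rows['latitude'].mean()

lon, lat = average_lonlat('new york')
```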
16. Combine tweets1.json and tweets2.json. Then create data frames containing the locations, counts, longitudes and latitudes for tweets1.json and tweets2.json.
17. Using the shapefile of US states, st99_d00, and the help of the website https://stackoverflow.com/questions/39742305/how-to-use-basemap-python-to-plot-us-with-50-states, produce the following graphs.
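A sketch along the lines of the linked answer; loc_df stands for one of the data frames from part 16, with hypothetical column names 'count', 'lon' and 'lat'.

```python
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
m = Basemap(llcrnrlon=-119, llcrnrlat=22, urcrnrlon=-64, urcrnrlat=49,
            projection='lcc', lat_1=33, lat_2=45, lon_0=-95)
# Draw the state boundaries from the st99_d00 shapefile (path without extension).
m.readshapefile('st99_d00', name='states', drawbounds=True)

# loc_df: data frame from part 16 (hypothetical name and columns).
x, y = m(loc_df['lon'].values, loc_df['lat'].values)
m.scatter(x, y, s=loc_df['count'].values, c='red', alpha=0.5, zorder=5)
plt.show()
```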
18. (Optional) Using polygon patches and the help of the website https://stackoverflow.com/questions/39742305/how-to-use-basemap-python-to-plot-us-with-50-states, produce the following graph.
Question 3 (extract hurricane paths)
The website http://weather.unisys.com provides hurricane path data from 1850 onward. We want to extract the hurricane paths for a given year.
1. Since the link containing the hurricane information varies with the year and the information is spread over multiple pages, we need to know the starting page and the total number of pages for a given year. What is the appropriate starting page for year = '2017'?
2. Since the total number of pages is not known in advance, try inputting a large number as the number of pages for a given year. Using an appropriate number, write a function to extract all links, each of which holds information on a hurricane in '2017'.
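A sketch of the link collector; the query-string layout of the Unisys pages is an assumption here, so adapt the URL pattern to what the site actually serves.

```python
import requests
from bs4 import BeautifulSoup

def hurricane_links(year, max_pages=20):
    """Collect links to individual hurricane pages for a given year.

    max_pages deliberately overshoots the real page count; empty pages
    simply contribute no links. The URL pattern below is an assumption.
    """
    links = []
    for page in range(1, max_pages + 1):
        url = 'http://weather.unisys.com/hurricane/?year=%s&page=%d' % (year, page)
        r = requests.get(url)
        if r.status_code != 200:
            break
        soup = BeautifulSoup(r.text, 'html.parser')
        for a in soup.find_all('a', href=True):
            if year in a['href']:
                links.append(a['href'])
    return sorted(set(links))

links_2017 = hurricane_links('2017')
```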
3. Some of the collected links point to hurricane summaries that do not lead to correct tables. Remove those links.
4. Each valid hurricane link contains four sets of information:
   – Date
   – Hurricane classification
   – Hurricane name
   – A table of hurricane positions over dates
   Since all of this information is contained in a text file provided on the webpage behind the link, write a function to download the text file and read it without saving it to a local directory (at this point, you do not need to convert the data to another format).
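A sketch of the download-and-read step; the name of the text file ('track.dat' here) is an assumption about the page layout.

```python
import requests

def fetch_track_text(link):
    """Download a hurricane's track text file and return its contents
    as a string, without saving anything to a local directory."""
    r = requests.get(link.rstrip('/') + '/track.dat')   # file name assumed
    r.raise_for_status()
    return r.text
```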
5. With the downloaded contents, write a function to convert the contents into a list of dictionaries. Each dictionary in the list contains the following keys: the date, the category of the hurricane, the name of the hurricane and a table of information on the hurricane path. Convert the date in each dictionary to a datetime object. Since the recorded times of the hurricane paths use Z-time, convert them to datetime objects with the help of https://www.hko.gov.hk/wxinfo/json/yestemp.json?_=1539266409025.
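A sketch of the Z-time conversion; the 'MM/DD/HHZ' field format is an assumption about the track files, and Z-time is simply UTC (Hong Kong time, as in the linked HKO feed, is UTC+8).

```python
from datetime import datetime, timezone

def parse_ztime(field, year=2017):
    """Convert a 'MM/DD/HHZ' Z-time field (format assumed)
    to a timezone-aware UTC datetime."""
    month, day, hour = (int(x) for x in field.rstrip('Z').split('/'))
    return datetime(year, month, day, hour, tzinfo=timezone.utc)

print(parse_ztime('09/16/09Z'))   # 2017-09-16 09:00:00+00:00
```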
6. Some tables have missing data in the Wind column. Since the classification of a hurricane at a given moment can be found in the Status column of the same table, and the classification relates to the wind speed at that moment, use the classification to impute the missing wind data. You may want to read https://en.wikipedia.org/wiki/Tropical_cyclone_scales.
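A sketch of the imputation; the representative wind speeds (in knots) are assumptions drawn from the middle of each Saffir-Simpson range, and the status labels must match the ones actually used in the tables.

```python
# Representative sustained winds (knots) per status -- assumed values
# taken from the middle of each Saffir-Simpson range.
STATUS_WIND = {
    'TROPICAL DEPRESSION': 30,
    'TROPICAL STORM': 50,
    'HURRICANE-1': 70,
    'HURRICANE-2': 90,
    'HURRICANE-3': 105,
    'HURRICANE-4': 125,
    'HURRICANE-5': 145,
}

def impute_wind(df):
    """Fill missing entries in the Wind column from the Status column."""
    missing = df['Wind'].isna()
    df.loc[missing, 'Wind'] = df.loc[missing, 'Status'].map(STATUS_WIND)
    return df
```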
7. Plot the hurricane paths of the year '2017', sized by wind speed and coloured by classification status. Bonus marks will be given if you produce your graph in a creative way.
8. (Optional) Turn the above functions into functions of the year, so that when the year changes you can easily generate a plot of the hurricane paths for that year.