---
title: "STAT 4410/8416 Homework 3"
author: "lastName firstName"
date: "Due on Oct 20, 2020"
output:
  pdf_document: default
  html_document: default
  word_document: default
---
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(fig.align='center', message=FALSE, cache=TRUE)
output <- opts_knit$get("rmarkdown.pandoc.to")
if(!is.null(output)) {
  if(output=="html") opts_chunk$set(out.width = '400px') else
    opts_chunk$set(out.width='.6\\linewidth')
}
```
**1.** **Text Data analysis:** Download "lincoln-last-speech.txt" from Canvas; it contains Lincoln's last public address. Answer the following questions and include your code.
a) Read the text and store it in `lAddress`. Show the first 70 characters of the first element of the text.
b) We are interested in the words used in his speech. Extract all the words from `lAddress`, convert them to lower case, and store the result in `vWord`. Display the first few words.
c) Words like `am`, `is`, `my` or `through` are of little interest to us; words of this type are called stop-words. The package `tm` has a function called `stopwords()`. Get all the English stop words and store them in `sWord`. Display a few stop words in your report.
d) Remove all the `sWord` entries from `vWord` and store the result in `cleanWord`. Display the first few clean words.
e) `cleanWord` contains all the cleaned words used in Lincoln's address. We would like to see which words are used most frequently. Find the 15 most frequently used clean words and store the result in `fWord`. Display the first 5 words from `fWord` along with their frequencies.
f) \label{coord} Construct a bar chart showing the count of each of the 15 most frequently used words. Add a layer `+coord_flip()` to your plot.
g) What is the reason for adding the layer `+coord_flip()` to the plot in question (1f)? Explain what would happen if we had not done that.
h) The plot in question (1f) uses a bar plot to display the data. Can you think of another plot that delivers the same information but looks much simpler? Demonstrate your answer by generating such a plot.
i) In question (1c), you removed words that are called stop-words. Now answer the following:
    a) Count the total number of stop words in `lAddress` and store it in `stopWordsCount`.
    b) Count the total number of words (including stop-words) in `lAddress` and store it in `lAddressCount`.
    c) Divide `stopWordsCount` by `lAddressCount` and report the result as a percentage.
    d) Explain in your own words what this percentage indicates in this context.
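
The general workflow in this problem (split text into words, drop stop words, count what remains) can be sketched on a toy sentence. This is only a minimal illustration: the sentence `txt` and the two-word list `stops` are made up for the example, and the hand-written list stands in for the much longer one returned by `stopwords()` in the `tm` package.

```r
# Hypothetical sentence and stop-word list, used only for illustration
txt <- "The cat and the dog saw the bird and the cat"
stops <- c("the", "and")

# Split on anything that is not a letter, then convert to lower case
words <- tolower(unlist(strsplit(txt, "[^A-Za-z]+")))
words <- words[words != ""]   # drop empty strings left by the split

# Keep only the words that are not in the stop list
kept <- words[!(words %in% stops)]

# Count each remaining word and sort from most to least frequent
freq <- sort(table(kept), decreasing = TRUE)
print(freq)

# Share of all words that are stop words, as a percentage
stopShare <- sum(words %in% stops) / length(words)
print(round(100 * stopShare, 1))
```

The pattern used for splitting is one possible choice; a real speech contains apostrophes, hyphens and digits, so the pattern may need adjusting for `lAddress`.
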
**2.** **Regular Expressions:** Write a regular expression to match the patterns in the following strings. Demonstrate that your regular expression indeed matches each pattern by including your code and results. Carefully review how the first problem is solved for you.
a) We have a vector `vText` as follows. Write a regular expression that matches `g`, `og`, `go` or `ogo` in `vText` and replace the matches with '.'.
```{r}
vText <- c('google','logo','dig', 'blog', 'boogie' )
```
**Answer:**
```{r}
pattern <- 'o?go?'
gsub(pattern, '.', vText)
```
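
To see why the pattern in this answer works, it can help to list what it matches before replacing anything. The sketch below uses base R's `gregexpr()` and `regmatches()`; the vector `demo` is a made-up example and not part of the assignment data.

```r
# Hypothetical demo vector (not part of the assignment data)
demo <- c('ago', 'gig', 'yoga')

# 'o?' means "an optional o", so 'o?go?' can match g, og, go or ogo;
# the quantifier is greedy and takes the o whenever one is available
pattern <- 'o?go?'

# List every match found in each string
matches <- regmatches(demo, gregexpr(pattern, demo))
print(matches)

# Replace every match with a dot
replaced <- gsub(pattern, '.', demo)
print(replaced)
```

Inspecting the matches first is a useful habit for the remaining parts: if the list shows more or less than you intended, the pattern needs adjusting before you call `gsub()`.
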
b) Replace only the 5- or 6-digit numbers with the word "found" in the following vector. Make sure that the 3-, 4-, or 7-digit numbers do not get changed.
```{r}
vPhone <- c('874','6783','345345', '32120', '468349', '8149674' )
```
c) Replace all the characters that are not among the 26 English letters or a space with an empty string.
```{r}
myText <- "#y%o$u @g!o*t t9h(e) so#lu!tio$n c%or_r+e%ct"
```
d) In the following text, replace all the words that are exactly 3 or 4 characters long with triple dots `...`.
```{r}
myText <- "Each of the three and four character words will be gone now"
```
e) Extract all three numbers embedded in the following text.
```{r}
bigText <- 'There are four 20@14 numbers hid989den in the 500 texts'
```
f) Extract all the words between the parentheses in the following string and count the number of words.
```{r}
myText <- 'The salaries are reported (in millions) for every company.'
```
g) Extract the text between `_` and the dot (`.`) in each element of the following vector. Your output should be 'bill', 'pay', 'fine-book'.
```{r}
myText <- c("H_bill.xls", "Big_H_pay.xls", "Use_case_fine-book.pdf")
```
h) Extract the numbers (return only integers) that are followed by the units 'ml' or 'lb' in the following text.
```{r}
myText <- 'Received 10 apples with 200ml water at 8pm with 15 lb meat and 2lb salt'
```
i) Extract only the words between pairs of dollar signs `$`. Count the number of words you found between pairs of dollar signs.
```{r}
myText <- 'Math symbols are $written$ in $between$ dollar $signs$'
```
j) Extract all the valid equations in the following text.
```{r}
myText <- 'equation1: 2+3=5, equation2 is: 2*3=6, do not extract 2w3=6'
```
k) Extract all the letters of the following sentence and check if it contains all 26 letters in the alphabet. If not, produce code that will return the total number of unique letters that are included and list the letters that are not present as unique elements in a single vector.
```{r}
myText <- 'there are five wizard boxing matches to be judged'
```
**3.** **Extracting data from the web:** Our plan is to extract data from web sources. This includes email addresses, phone numbers or other useful data. The function `readLines()` is very useful for this purpose.
a) Why are text data ubiquitous, and why are they important?
b) Explain in your own words, giving 2 examples, why it is challenging to work with text data.
c) List any 3 metacharacters of a regular expression. For each metacharacter you listed, explain what it does.
d) Read all the text in http://mamajumder.github.io/index.html and store it in `myText`. Show the first few rows of `myText` and examine the structure of the data.
e) Write a regular expression that extracts all the http web link addresses from `myText`. Include your code and display results that show only the http web link addresses and nothing else.
f) Now write a regular expression that extracts all the email addresses from `myText`. Include your code and display results that show only the email addresses and nothing else.
g) Now we want to extract all the phone/fax numbers in `myText`. Write a regular expression that does this. Include your code and show the results.
h) The ggplot2 documentation is at http://docs.ggplot2.org/current/ and we would like to get the list of ggplot2 geoms from there. Write a regular expression that extracts all the geom names (`geom_bar` is one of them) from this link and display the unique geoms. How many unique geoms does it have?
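
The read-then-extract pattern behind this problem can be tried offline first. The sketch below is only an illustration under stated assumptions: the page content is invented, and `textConnection()` is used so that `readLines()` reads it the same way it would read a URL. It extracts image file names, which is deliberately not one of the targets asked for above.

```r
# Hypothetical page content; textConnection() lets readLines() treat it like a file or URL
con <- textConnection(c(
  '<html>',
  '<img src="logo.png"> <img src="photo.jpg">',
  '<p>No images on this line</p>',
  '</html>'
))
page <- readLines(con)
close(con)

# readLines() returns a character vector with one element per line
str(page)

# A file name: letters or digits, a dot, then png or jpg
pattern <- '[A-Za-z0-9]+\\.(png|jpg)'

# gregexpr() finds every match per line; regmatches() extracts the matched text
found <- unlist(regmatches(page, gregexpr(pattern, page)))
print(found)
```

Using `gregexpr()` rather than `regexpr()` matters here because a single line can contain several matches.
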
**4.** **Big data problem:** Download the sample of big data from Canvas. Note that the data is in csv format and compressed for easy handling. Now answer the following questions.
a) \label{select-few} Read the data and select only the columns whose names contain the word 'human'. Store the data in an object `dat`. Report the first few rows of your data.
b) The data frame `dat` should have 5 columns. Rename the columns, keeping only the last character of each column name, so that each column name has only one character. Report the first few rows of your data now.
c) Compute the mean of each column grouped by column `b` and report the results in a nice table.
d) Change the data into long form using id='b' and store the data in `mdat`. Report the first few rows of the data.
e) The data frame `mdat` is now ready for plotting. Generate density plots of value, with color and fill by variable, faceted by `b`.
f) The data set `bigDataSample.csv` is a sample of a much bigger data set. Here we read the whole data set and then selected the desired columns. Do you think it would be wise to do the same thing with the actual larger data set? Explain how you would select a few columns (as we did in question 4a) without reading the whole data set first. Demonstrate this by showing your code.
**5.** **Halloween Candy** Read the data from the raw GitHub link below:
```
https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv
```
For this data set, follow these conventions for data attributes and values:
* factor value `1` represents Yes
* factor value `0` represents No
* `sugarpercent` is the percentile of sugar it falls under within the data set
* `pricepercent` is the unit price percentile
* `winpercent` is the overall win percentage
* `competitorname` is the candy name
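
A percentile places each value relative to the others in the same column. The data set does not document exactly how its percentiles were computed, so the sketch below shows only one common definition (the share of values less than or equal to a given value), using made-up numbers and base R's `ecdf()`.

```r
# Hypothetical sugar amounts for five candies (not from the data set)
sugar <- c(a = 5, b = 20, c = 12, d = 30, e = 8)

# ecdf() builds the empirical distribution; calling it on the same values
# gives, for each candy, the share of candies with that much sugar or less
pct <- ecdf(sugar)(sugar)
names(pct) <- names(sugar)
print(pct)

# Candy names ordered from highest to lowest percentile
ranked <- names(sort(pct, decreasing = TRUE))
print(ranked)
```

Read this way, a `sugarpercent` of 0.8 would mean the candy has at least as much sugar as 80% of the candies in the data.
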
Answer the following questions and support each answer by showing your code and a plot (if applicable):
a. Find the name of the candy with the 3rd highest sugar percentile.
b. Is the candy with the highest sugar percentile (1st) also the most expensive according to `pricepercent`?
c. Which category (chocolate, fruity, caramel, peanutyalmondy, nougat, crispedricewafer, hard, bar, pluribus) of candy in the data set is associated with the highest `sugarpercent`? In other words, among the candies that belong to each category above (meaning the value is `1`), which candy has the highest `sugarpercent`?
d. Which candy does not belong to any of the 3 categories `chocolate`, `crispedricewafer` or `caramel` but has the highest `sugarpercent`? Report ONLY the candy name and the corresponding value of `sugarpercent`.
e. Overall, which factor indicates a higher `winpercent`? In other words, what properties or values should a candy (`competitorname`) have to get the highest `winpercent` based on the data set?
f. Generate a side-by-side box plot of `pricepercent` by the `chocolate` category. Explain whether candies in the `chocolate` category are more expensive or less expensive than the others.
g. Do you think that a higher `sugarpercent` indicates a higher `pricepercent`? Generate a plot to support your finding.
**6.** **Optional bonus question (5 points extra)** Download the Excel file "clean-dat-before.xls" from Canvas. It contains time series data for many variables. Of the two columns in the data, the first represents time and the second represents the measurement. The challenge is that the variable names are also included in the time column. Our goal is to clean and reshape the data. The first few rows and columns of the desired output are shown below. Notice that each time point is converted into an integer time index to give a uniform elapsed time for all the variables.
| elapse_time| Area| Bulk.Rotation.| ECG| Endo.MA.Circ..Strain| Endo.MA.Radial.Strain|
|-----------:|------:|--------------:|-------:|--------------------:|---------------------:|
| 1| 10.924| 0.000| 0.32157| 0.000| 0.000|
| 2| 10.648| 0.070| 0.58824| -1.495| 0.762|
| 3| 10.574| -0.128| 0.81176| -1.423| 2.619|
| 4| 10.487| 0.097| 0.88627| -0.620| 3.591|
| 5| 10.342| 0.181| 0.87451| -1.142| 3.472|
| 6| 9.995| 0.235| 0.85882| -3.269| 5.812|