PLAN FOR TODAY
• Introduce Data Pre-Processing and tidyverse.
• Introduce the dplyr package.
• Show examples of how to use the dplyr verbs.
• Introduce and demonstrate the use of the piping operator.
• Describe and classify missing data.
• Introduce the procedures for dealing with missing data.
• Discuss deletion methods.
• Discuss simple and multiple imputation.
• Demonstrate how to import and handle missing data in R.
Seminar: Get started with dplyr and dealing with missing data in R

DATA PRE-PROCESSING
• Real-world datasets often contain noisy, missing and inconsistent data.
• This is usually caused by the data being formed from multiple sources or by poor data collection techniques.
• It is commonly estimated that data pre-processing takes up about 80% of an analyst's time, with only 20% spent on the actual analysis.

DATA PRE-PROCESSING
• Data pre-processing includes:
• Data transformation and reshaping
• Calculating variables that are functions of existing variables
• Aggregation
• Dealing with missing data
• Data pre-processing can be done in base R but many people prefer to use functions from the dplyr package.
• The dplyr package is one of the “tidyverse” packages, a series of packages that are designed to make data wrangling and exploration simple and intuitive.

TIDYVERSE PACKAGES
This book discusses the following tidyverse packages:
• dplyr: general data manipulation and exploration
• tidyr: reshaping data
• stringr: working with strings
• lubridate: working with dates and times
• ggplot2: data visualisation
• broom: tidying output from statistical models
• tidytext: working with unstructured text data

DPLYR PACKAGE
The dplyr package is mostly based upon the following five verbs:
• filter: keep rows matching criteria
• select: select and drop columns
• arrange: reorder rows
• mutate: add new variables to an existing data frame without changing its shape
• summarise: reduce variables to values

DPLYR: FILTER
• Written like: filter( data , how to filter )
• & means and
• | means or
• For example:
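A minimal sketch of filter(), assuming the dplyr package is installed and using the built-in mtcars dataset (the particular columns and cut-offs are illustrative choices only):

```r
library(dplyr)

# Rows where BOTH conditions hold: 4 cylinders AND more than 30 mpg
small_and_efficient <- filter(mtcars, cyl == 4 & mpg > 30)

# Rows where EITHER condition holds: 4 cylinders OR 5 forward gears
four_cyl_or_five_gear <- filter(mtcars, cyl == 4 | gear == 5)

# Conditions separated by commas are combined with "and"
same_as_first <- filter(mtcars, cyl == 4, mpg > 30)

small_and_efficient
nrow(four_cyl_or_five_gear)
```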

DPLYR: FILTER
• Or..

DPLYR: SELECT
• Written like: select( data , what columns you want )
• For example:
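A short sketch of select() on mtcars, assuming dplyr is installed. The column choices are arbitrary and only illustrate keeping columns by name, dropping a column, and using a selection helper:

```r
library(dplyr)

# Keep three named columns
kept <- select(mtcars, mpg, cyl, hp)

# Drop one column by prefixing it with a minus sign
without_carb <- select(mtcars, -carb)

# Keep every column whose name starts with "d" (disp and drat in mtcars)
d_columns <- select(mtcars, starts_with("d"))

names(kept)
names(d_columns)
```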

DPLYR: ARRANGE
• Written like: arrange( data , columns to sort by )
• For example:
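A small sketch of arrange() on mtcars, assuming dplyr is installed. Sorting by cyl and then by descending mpg is just one illustrative ordering:

```r
library(dplyr)

# Sort by number of cylinders (ascending), then by mpg (descending) within each group
sorted <- arrange(mtcars, cyl, desc(mpg))

head(sorted[, c("cyl", "mpg")])
```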

DPLYR: MUTATE
• Written like: mutate( data , col.name = calculation )
• For example:
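A minimal sketch of mutate() on mtcars, assuming dplyr is installed. The new column name hp_per_wt is made up for illustration:

```r
library(dplyr)

# Add a new column computed from existing ones; the number of rows is unchanged
with_ratio <- mutate(mtcars, hp_per_wt = hp / wt)

head(with_ratio[, c("hp", "wt", "hp_per_wt")])
```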

DPLYR: SUMMARISE
• Written like: summarise( data , arguments )
• For example:
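A brief sketch of summarise() on mtcars, assuming dplyr is installed. The output column names (mean_mpg, max_hp, n_cars) are illustrative:

```r
library(dplyr)

# Collapse the whole data frame to a single row of summary values
overall <- summarise(mtcars,
                     mean_mpg = mean(mpg),
                     max_hp   = max(hp),
                     n_cars   = n())

overall
```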

PIPING OPERATOR
• In R, you may come across some code like this: %>%
• This is called a piping operator and can make your code more readable.
• You simply read %>% as ‘then’.
• It pipes left side arguments into right side function calls. For example:
mtcars %>% group_by(gear) %>% summarise(mean(mpg))
# take dataframe, then
# group it by gear, then
# summarise the mean mpg for each level of gear
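To make the readability point concrete, here is a sketch comparing the same computation written as nested calls and as a pipeline. It assumes dplyr is installed (dplyr makes %>% available when loaded); the column name mean_mpg is illustrative:

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- summarise(group_by(mtcars, gear), mean_mpg = mean(mpg))

# Piped: read left to right, saying "then" at each %>%
piped <- mtcars %>%
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg))

# Both forms produce the same per-gear means
all.equal(nested$mean_mpg, piped$mean_mpg)
```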

MISSING DATA
• Missing data is a common problem and challenge for analysts.
• There are many reasons why data could be missing, including:
• Respondents forgot to answer questions.
• Respondents refused to answer certain questions.
• Respondents failed to complete the survey.
• A sensor failed.
• Someone purposefully turned off recording equipment.
• There was a power cut.
• The method of data capture was changed.
• An internet connection was lost.
• A network went down.
• A hard drive became corrupt.
• A data transfer was cut short.

MISSING DATA
Missing data can usually be classified into:
• Missing Completely at Random (MCAR):
• If missingness doesn't depend on the values of the data set.
• e.g. a random sample of the patients who had their blood pressure measured also had their weight measured.
• Missing at Random (MAR):
• If missingness does not depend on the unobserved values of the data set but does depend on the observed values.
• e.g. only patients with high blood pressure had their weight measured.
• Not Missing at Random (NMAR):
• If missingness depends on the unobserved values of the data set.
• e.g. only overweight patients had their weight measured.
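The three mechanisms can be illustrated with a small simulation in base R. This is only a sketch on made-up data: the variable names, the cut-offs (130 for high blood pressure, 85 for overweight) and the assumed link between blood pressure and weight are all hypothetical choices.

```r
set.seed(42)
n <- 1000
bp     <- rnorm(n, mean = 120, sd = 15)                 # blood pressure, always observed
weight <- 80 + 0.4 * (bp - 120) + rnorm(n, sd = 10)     # weight, loosely related to bp

# MCAR: weight is recorded for a random half of the patients
w_mcar <- ifelse(runif(n) < 0.5, weight, NA)

# MAR: weight is recorded only when the OBSERVED blood pressure is high
w_mar <- ifelse(bp > 130, weight, NA)

# NMAR: weight is recorded only for overweight patients,
# so missingness depends on the (unobserved) weight itself
w_nmar <- ifelse(weight > 85, weight, NA)

# Compare the mean of the recorded weights with the true mean
observed_means <- c(true = mean(weight),
                    mcar = mean(w_mcar, na.rm = TRUE),
                    mar  = mean(w_mar,  na.rm = TRUE),
                    nmar = mean(w_nmar, na.rm = TRUE))
round(observed_means, 1)
```

By construction the NMAR mean must be above 85, because only weights above that cut-off were recorded.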

MISSING DATA
Another example: survey data on drug use.
• Missing Completely at Random (MCAR):
• You removed 10% of the respondents' data randomly.
• Missing at Random (MAR): (most common type)
• People who come from poorer families might be less inclined to answer questions about drug use, so whether the drug-use answer is missing is related to family income.
• Not Missing at Random (NMAR):
• Students skipped the question on drug use because they feared that they would be expelled from school.

MISSING DATA
• Generally the procedure for dealing with missing data is:
1. Identify the missing data.
2. Identify the cause of the missing data.
3. Either:
A. Remove the rows containing the missing data
• Also called the naïve approach.
• Make sure the missing data isn't biased!
B. Replace missing values with alternative values.
• Impute the missing values.
• There are a number of approaches.
Deciding between A and B depends on which outcome you think will produce the most reliable and accurate results.

REMOVING MISSING DATA ROWS
• The two most common methods for removing missing data are:

Listwise deletion (complete case analysis)
Description: Analyse the data rows where there is complete data for every column.
Advantages:
• Simple.
• Easy to compare across analyses.
Limitations:
• Could be biased (if the data is not MCAR).
• Lower n, reduces statistical power.

Pairwise deletion
Description: Analyse the data rows where the variables of interest have data present.
Advantages:
• Uses all possible information.
Limitations:
• Separate analyses cannot be compared as the data / sample will be different.
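The difference between the two deletion methods can be sketched in base R. The example below uses R's built-in airquality dataset (an assumption made for convenience; it has missing values in Ozone and Solar.R) and a correlation matrix as the analysis:

```r
# Listwise deletion: keep only rows that are complete for every column
complete_rows <- na.omit(airquality)
c(all_rows = nrow(airquality), complete_rows = nrow(complete_rows))

# Correlations after listwise deletion: every pair uses the same rows
cor_listwise <- cor(airquality, use = "complete.obs")

# Correlations after pairwise deletion: each pair uses all rows
# where both of its two variables are present
cor_pairwise <- cor(airquality, use = "pairwise.complete.obs")

# Number of rows available to each pair under pairwise deletion
observed   <- 1 - is.na(airquality)      # 1 = present, 0 = missing
n_pairwise <- crossprod(observed)
n_pairwise
```

The n_pairwise matrix shows why pairwise results are hard to compare: different cells of the correlation matrix are based on different numbers of rows.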

REMOVE MISSING DATA ROWS
• Last week we looked at the representation of missing values in R:
NA: Not Available (placeholder for a missing value).
NULL: Empty value.
Inf: Infinity.
• It is possible to use the is.na(), is.null() and is.infinite() functions in R to identify missing, empty and infinite values in datasets.
• The function complete.cases() can be used to identify the data rows in a matrix or data frame that are / aren’t complete.
• Only NA and NULL are regarded as missing; Inf is treated as valid.
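A small base R sketch of these functions on a made-up vector and data frame (the object names x and df are illustrative):

```r
x <- c(1, NA, Inf, 4)

is.na(x)        # FALSE  TRUE FALSE FALSE  (Inf is not flagged as missing)
is.infinite(x)  # FALSE FALSE  TRUE FALSE
is.null(NULL)   # TRUE

df <- data.frame(a = c(1, NA, 3, Inf),
                 b = c("w", "x", NA, "z"))

complete.cases(df)          # TRUE FALSE FALSE  TRUE
df[complete.cases(df), ]    # rows with no missing values
df[!complete.cases(df), ]   # rows with at least one missing value
colSums(is.na(df))          # number of NA values per column
```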

EXAMPLE MISSING DATASET
• I will be using the following sleep dataset as an example.
• It contains the following data on 62 species of mammals:
Column: Description
Dream: Length of dreaming sleep
NonD: Non-dreaming sleep
Sleep: Sum of Dream and NonD
BodyWgt: Body weight (kg)
BrainWgt: Brain weight (g)
Span: Life span (yrs)
Gest: Gestation time in days
Pred: Degree to which species were preyed upon (1-low to 5-high scale)
Exp: Degree of their exposure while sleeping (1-low to 5-high scale)
Danger: Overall danger (1-low to 5-high scale)
• Various data is missing in the dataset (NA values).

REPLACING MISSING DATA
• The two most common methods for replacing missing data are:
Simple Imputation
Description: Missing values are replaced with the mean, median or mode value.
Stochastic: No
Advantages:
• Simple.
Limitations:
• Could be biased (if the data is not MCAR).
• Underestimates standard errors.
• Could distort correlations among variables.

Multiple Imputation
Description: Estimates missing data through repeated simulations.
Stochastic: Yes
Advantages:
• Variability more accurate.
Limitations:
• Algorithms are more complex.
• Normally would require complex coding (R library available).
SIMPLE IMPUTATION

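Simply replace the missing values with the mean, median or mode of the observed values (na.rm = TRUE is needed, otherwise mean() and median() return NA when x contains missing values):

• mean(x, na.rm = TRUE)
• median(x, na.rm = TRUE)
• names(sort(-table(x)))[1]
For example, using mean(sleep$x, na.rm=TRUE):
• 1.972 (Mean of Dream)
• 8.672917 (Mean of NonD)
• 19.87759 (Mean of Span)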
Replace NA values with sleep$x[is.na(sleep$x)] <- value

MULTIPLE IMPUTATION
• The idea of Multiple Imputation is to replace each missing value with multiple acceptable values that represent a distribution of possibilities.
• This results in a number of complete datasets (usually 3-10):
[Figure: dataset with missing values -> multiple imputation applied -> imputed datasets -> analyse -> analysis results of datasets]

MULTIPLE IMPUTATION
The general procedure for the chained equation approach to multiple imputation (used in mice()) is:
1. A simple imputation is performed for every missing value.
2. The imputed values of one of the variables with missing data are set back to missing.
3. Regression is performed (linear, logistic, polynomial etc.), with that variable as the forecast variable and all other variables in the dataset as the predictor variables.
4. Missing values are replaced with predictions (imputations) from the regression.
5. Repeat steps 2-4 for each variable that has missing data (one cycle).
6. Repeat for a number of cycles then retain results as one imputed dataset.

MULTIPLE IMPUTATION
• We will focus on using the mice package.
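To make the six-step chained equation procedure concrete, here is a simplified base R sketch. It is not the mice implementation: it assumes every variable is numeric, uses only linear regression, and adds residual noise to the predictions as a rough way of keeping the imputations stochastic. The function name chained_impute and the use of the built-in airquality dataset are illustrative assumptions.

```r
# Hypothetical helper: one imputed dataset via a simplified chained-equation loop
chained_impute <- function(data, n_cycles = 5) {
  miss <- lapply(data, is.na)                 # remember where the NAs were
  vars <- names(data)[sapply(miss, any)]      # variables that need imputing

  # Step 1: simple (mean) imputation for every missing value
  for (v in vars) data[[v]][miss[[v]]] <- mean(data[[v]], na.rm = TRUE)

  for (cycle in seq_len(n_cycles)) {          # Step 6: repeat for several cycles
    for (v in vars) {                         # Step 5: each variable in turn
      is_miss <- miss[[v]]                    # Step 2: treat v as missing again
      # Step 3: regress v on all other variables, using rows where v was observed
      fit  <- lm(as.formula(paste(v, "~ .")), data = data[!is_miss, ])
      pred <- predict(fit, newdata = data[is_miss, ])
      # Step 4: replace with predictions (plus residual noise)
      data[[v]][is_miss] <- pred + rnorm(sum(is_miss), sd = sigma(fit))
    }
  }
  data
}

set.seed(123)
# Five imputed datasets from the same incomplete data
imputed <- lapply(1:5, function(i) chained_impute(airquality))

# Analyse each imputed dataset, then average the coefficient estimates
coefs  <- sapply(imputed, function(d) coef(lm(Ozone ~ Wind + Temp, data = d)))
pooled <- rowMeans(coefs)
pooled
```

Averaging the coefficients only pools the point estimates; pooled standard errors also need the between-imputation variability, which the mice package handles for you.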
• The mice package has many built-in imputation techniques including:
• Example: mice(data, method = c('sample', 'pmm', 'logreg', 'norm'))

LIVE DEMO IN R

OTHER LEARNING RESOURCES
• Introduction to dplyr tutorial: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
• Video introduction to the dplyr package using R Studio: https://www.youtube.com/watch?v=jWjqLW-u3hc
• Paper on Tidy Data in R: www.jstatsoft.org/article/view/v059i10/v59i10.pdf
• R in Action book - Chapter 18 - Advanced methods for missing data
• Book on Introduction to data cleaning with R: http://goo.gl/bOiivm
• Book Chapter on Data Preprocessing: http://goo.gl/XcuFww
• Missing Data article on Quick-R: http://goo.gl/h8TzBu
Next week: Big Data Systems: Hadoop, MapReduce and Spark

TODAY'S SEMINAR
Today's seminar is titled: “Get Started with Data Pre-Processing using the dplyr package then apply missing data methods to the Titanic passengers dataset in R”
• You will learn to use the 5 dplyr verbs on the mtcars dataset.
• You can then test your new skills with a number of questions.
You will learn:
• How to check data for complete and incomplete cases.
• How to explore a dataset in R.
• How to apply simple and multiple imputation in R.