DSME5110F: Statistical Analysis
Lecture 1 Introduction and R Instructions

About the Lecturer

• Name: Junfei Huang
• Office: Rm 914, CYT Building
• Phone: 3943-1654
• Office hour: 5:00-6:00pm, Friday, or by appointment

About the Course
• Lecture notes: Will be available.
• Course website:
https://blackboard.cuhk.edu.hk
• Textbook:
– Statistics with R: A Beginner’s Guide by , 2018, 1st Edition
• Reference book:
– Data Mining for Business Analytics: Concepts, Techniques and Applications in R by , . Bruce, , . Patel, . Lichtendahl, Jr., 2018, 1st Edition
– An Introduction to R by

Course Assessment
• Class participation: 5%
• Homework (3, team-based): 15%
– Each team can have up to 4 members
– Please send me the names of your teammates by Dec 28
• Project (team-based): 20%
• Final Exam (Individual): 60%

About the Project
• Collect a data set
• Apply the concepts and R commands from this course to analyze the data set
• Write a report (no more than 4 pages)
– Send to by Feb 28, 2022
• Report
• Data set
• Code

About the Final Exam
• Multiple Choice, Problem Solving
• Individual effort
• No make-up exam
• Time: 120 minutes
• Maximum marks: 60

University Policy on Scholastic Dishonesty
• The Chinese University of Hong Kong places very high importance on honesty in academic work submitted by students, and adopts a policy of zero tolerance on cheating and plagiarism.
• Attention is drawn to University policy and regulations on honesty in academic work, and to the disciplinary guidelines and procedures applicable to breaches of such policy and regulations.
• Details may be found at http://www.cuhk.edu.hk/policy/academichonesty/.

• An Introduction to Statistics
• Basic Terminology
• An Introduction to R
– Vector
– Data Frame
– Import and Export Data

Puzzling Statistics: Example 1
• The following table shows the batting averages of two switch hitters in 1991, Eddie Murray (LA Dodgers) and Orlando Merced (Pitts. Pirates). Who was the more valuable player, with respect to batting statistics, in 1991? How could Murray beat Merced both as a left-handed and as a right-handed batter and still have a lower overall batting average?
– Batting average = number of hits divided by the number of at bats.
– Eddie is 35/100 when hitting left-handed and 15/75 when hitting right-handed. Hence, his overall batting average is 50/175.
– Orlando is 34/100 when hitting left-handed and 7/40 when hitting right-handed. Thus, his overall batting average is 41/140.

Batting Average   Left-hand        Right-hand      Overall
Murray            35/100 = .350    15/75 = .200    50/175 = .286
Merced            34/100 = .340    7/40 = .175     41/140 = .293
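The puzzle in this example can be checked with a few lines of R. This is a minimal sketch that uses only the hit and at-bat counts quoted in the example; the object names (hits, at_bats, by_side, overall) are illustrative choices, not part of the original slides.

```r
# Hits and at bats from the example (rows: players, columns: side of the plate)
hits <- matrix(c(35, 15,
                 34,  7),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("Murray", "Merced"), c("left", "right")))
at_bats <- matrix(c(100, 75,
                    100, 40),
                  nrow = 2, byrow = TRUE,
                  dimnames = dimnames(hits))

by_side <- hits / at_bats                    # batting average on each side
overall <- rowSums(hits) / rowSums(at_bats)  # pooled batting average

print(round(by_side, 3))
print(round(overall, 3))

# Share of each player's at bats taken on the weaker (right) side
print(round(at_bats[, "right"] / rowSums(at_bats), 3))
```

Murray is ahead on both sides, yet a larger share of his at bats came on the weaker right side (75 of 175 versus 40 of 140), which pulls his pooled average below Merced's. The overall average is a weighted average, and the weights differ between the two players.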

Puzzling Statistics: Example 2
• One study published in a prominent medical journal (name of the article not cited) showed a strong positive correlation between per capita consumption of tobacco and the incidence of lung cancer over a number of countries. The author then concluded that
– “Smoking causes cancer”.
• Another researcher used the same data on per capita consumption of tobacco for the same countries but substituted the incidence rate of cholera. He obtained a negative correlation that was stronger than the positive correlation revealed in the first paper. This author then concluded that
– “Smoking prevents cholera”.
• He sent his paper to the journal that had published the first paper, and the paper was rejected.
Reference: How to Lie with Statistics, W. W. Norton & Company Inc., 1954.

What is Statistics?
• Statistics can refer to numerical facts (e.g., averages, medians) that help us understand a variety of business and economic situations.
– Examples: number of students, average age
• Statistics is the science that deals with collecting, analyzing, and interpreting data.

Why Learn “Statistics”?
• …the sexy job in the next 10 years will be statisticians. And I am not kidding.
– , the chief economist at Google Inc.
• Data Scientist: The Sexiest Job of the 21st Century
– . Davenport and D. J. Patil, Harvard Business Review, October 2012.
• To have any hope of extracting anything useful from big data, . . . effective inferential skills are vital. That is, at the heart of extracting value from big data lies statistics.
– . Hand, 2014
• Most of my life I went to parties and heard a little groan when people heard what I did. Now they are all excited to meet me.
– , a statistics professor at Stanford University, Times, January 26, 2012.
• For Today’s Graduate, Just One Word: Statistics.

Why Should Everyone Learn Statistics?
• First, statistics involves using a relatively small sample to make inferences about a large population.
– As such, proper use of statistics allows us to make reasonably good estimates about the population without spending too much time and money.
• Second, even people who do not use statistics should know something about it, because we are bombarded with all kinds of statistical figures and reports every day, many of which could be misleading or confusing.
– Having basic knowledge of statistics allows us to interpret those figures and reports more correctly.

Applications of Statistics
• Statistics has been used in almost every research field.
• Listed below are just some examples:
– Management Science: Manufacturing firms often use statistical methods for quality control. For example, a tire manufacturer may take a sample of tires produced to determine the average lifetime of tires.
– Marketing Research: Degree of acceptance to a new product, TV/Radio channels exposure rating, and demand for a certain product given prices.
– Economics: Use of time series analysis to forecast the future state of the economy.
– Political Science: Predict election outcome.
– Education: Study whether a particular teaching method is more effective than the other one.
– Sociology: Study gender and racial difference in behavior.
– Medical Study: Study the effectiveness of a new medicine, the relationship between smoking and lung cancer, etc.
– Psychology: Study how animals respond to different given conditions in an experiment.

Example 1: Target’s prediction
• About a year after Pole created his pregnancy-prediction model, a man walked into a Target outside Minneapolis and demanded to see the manager. He was clutching coupons that had been sent to his daughter, and he was angry, according to an employee who participated in the conversation.
• “My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”
• The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again.
• On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”
— From How Companies Learn Your Secrets https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=1&_r=1&hp

Example 2: Moneyball
• A 2011 American sports film
• An account of the Oakland Athletics baseball team’s 2002 season and their general manager Beane’s attempts to assemble a competitive team.
• Beane and assistant GM Peter Brand, faced with the franchise’s limited budget for players, build a team of undervalued talent by taking a sophisticated sabermetric approach to scouting and analyzing players.
• Nominated for six Academy Awards
• Former Green Bay Packers vice president stated that the film “persuasively exposed front office tension between competing scouting applications: the old school “eye-balling” of players and newer models of data-driven statistical analysis … Moneyball—both the book and the movie—will become a time capsule for the business of sports”.

Introduction and R Instructions
Descriptive Statistics: I
Descriptive Statistics: II
Probability (and Association Rule)
Bayes’ Theorem and Naive Bayes Classifier
Point Estimation and Sampling Distributions
Confidence Interval Estimation
Hypothesis Testing
Simple Linear Regression
Multiple Regression and Review
Final Exam

Basic Terminology
• Data are the facts or measurements that are collected, analyzed, presented, and interpreted
• All the data collected in a particular study are referred to as the data set for the study

Basic Terminology
• An element is a unit of data which is represented as a set of attributes or measurements
• A variable is an attribute of an element that may assume different values
• An observation is the set of values on the variables for a single element

Basic Terminology
• A population consists of all the elements of interest in a study that have some quality (or qualities) in common
– All students enrolled in a class
– All potential voters in a presidential election
– Population parameters are the characteristics of interest in a study. They are constants (but usually unknown).
• A sample is a subset of the population, often randomly chosen and preferably representative of the population as a whole
– Randomly select 3 students
– Opinion polls conducted by various institutions such as Gallup and Harris polls.
– Sample statistics are the characteristics of interest derived from a sample rather than a population. They are random variables (their values vary from sample to sample).
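The difference between a parameter (a fixed constant) and a statistic (a random variable) can be seen in a small simulation. This is only a sketch under stated assumptions: the population of 500 exam scores is simulated, not real data, and the seed, sizes, and object names are arbitrary illustrative choices.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: exam scores of 500 students (simulated)
population <- round(rnorm(500, mean = 70, sd = 10))

# Population parameter: one fixed number, usually unknown in practice
mu <- mean(population)

# Three random samples of 30 students; each sample yields its own statistic
sample_means <- replicate(3, mean(sample(population, size = 30)))

print(mu)
print(sample_means)
```

Running the last two lines shows one value for the population mean and three sample means that scatter around it, which is what "the values vary from sample to sample" means in practice.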

Probability and Statistics
• When all the members of a population are known, we can calculate the probability of getting a particular sample.
– If a single card is drawn from a deck of 52 cards, what is the probability it will be a two?
– If you roll a six-sided die, what is the probability you will roll an even number?
– Interestingly, even when we don’t know much about the population, we can sometimes still have a very good idea of the probability of getting a particular sample statistic (by the central limit theorem, covered later in the course).
• On the other hand, if we are using sample statistics to draw conclusions about unknown population’s parameters, we are solving a statistical problem.
[Figure: probability reasons from a known population to a sample; statistics reasons from a sample back to the unknown population.]
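The two probability questions above have a fully known population, so the answers can be computed by simple counting. The sketch below does this in R and then checks the die result by simulation; the object names and the seed are illustrative assumptions.

```r
# A deck described by rank only: 13 ranks, 4 suits each
deck_ranks <- rep(c("A", 2:10, "J", "Q", "K"), times = 4)
p_two <- mean(deck_ranks == "2")   # proportion of twos in the deck: 4/52

# A fair six-sided die
die <- 1:6
p_even <- mean(die %% 2 == 0)      # proportion of even faces: 3/6

print(p_two)
print(p_even)

# Simulation check: the relative frequency should be close to p_even
set.seed(123)  # arbitrary seed
rolls <- sample(die, size = 10000, replace = TRUE)
print(mean(rolls %% 2 == 0))
```

The first part is a probability calculation (known population, unknown sample). Using the 10,000 simulated rolls to estimate the chance of an even number would be the reverse, statistical direction.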

Example: Hudson Auto Repair
• The manager of Hudson Auto Repair would like to have a better understanding of the cost of parts used in the engine tune-ups performed in her shop. She examines 50 customer invoices for tune-ups. The costs of parts ($) are listed in the following table:

A Statistical Problem

Descriptive Statistics
• Descriptive statistics consists of a wide variety of methods of organizing, summarizing, and reporting data characteristics
– Tabular methods organize and report the data patterns in tables (e.g. cross-tabulations and frequency distributions)
– Graphical methods include pictorial displays and data visualization techniques (e.g. histograms, bar graphs, and scatterplots)
– Numerical methods measure and summarize various characteristics of the data, such as location, dispersion, and association, often with a single number such as the mean or standard deviation
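The three families of descriptive methods can be tried on a small vector. This is a sketch with made-up values: the ten costs below are hypothetical and are not the Hudson Auto Repair invoices, and the $10 class width is an arbitrary choice.

```r
# Hypothetical parts costs in dollars (illustrative values only)
cost <- c(55, 62, 68, 71, 74, 79, 83, 88, 92, 104)
breaks <- seq(50, 110, by = 10)

# Tabular: frequency distribution over $10-wide classes [50,60), [60,70), ...
classes <- cut(cost, breaks = breaks, right = FALSE)
freq <- table(classes)
print(freq)

# Numerical: measures of location and dispersion
print(mean(cost))
print(median(cost))
print(sd(cost))

# Graphical: histogram using the same classes
hist(cost, breaks = breaks, right = FALSE,
     main = "Parts cost", xlab = "Cost ($)")
```

The same three steps (table, summary numbers, plot) apply to any numeric vector, including real invoice data once it has been entered or imported.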

Getting Started with R and RStudio
• R is a free and powerful programming language for statistical computing and data visualization.
• Why learn R?
– R is open source, so it’s free.
– R is cross-platform compatible, so it can be installed on Windows, macOS and Linux.
– R provides a wide variety of statistical techniques and graphical capabilities.
– R makes reproducible research possible by embedding scripts and results in a single file.
– R has a vast community both in academia and in business.
– R is highly extensible, and it has thousands of well-documented extensions (named R packages) for a very broad range of applications in the financial sector, health care,…
– It’s easy to create R packages for solving particular problems.

Install R and RStudio
• Download and Install R: https://mirror-hk.koddos.net/CRAN/
• Download and Install RStudio: https://rstudio.com/products/rstudio/download/
– Before installing RStudio, you must first have R installed
– The RStudio IDE is split into four panes, and you can customize the content and layout to suit your preferences.
– You write R code in the built-in editor and execute it in the console. You can also view plots in the plot pane and many other things in the other panes.

Installing R on macOS
• If there is a warning message:
Setting LC_CTYPE failed, using "C"
• Try the following
1. Open Terminal
2. Write or paste in:
defaults write org.R-project.R force.LANG en_US.UTF-8
3. Close Terminal
4. Start R

Get Started
• The standard R prompt is a “>” sign.
• R commands:
– plot(rnorm(1000))
– "<-" and "=" are both assignment operators. It is better to use "<-".
– Assign a value to a variable
– Simple math calculations
• Display the names of the objects in the environment (memory): ls()
• Remove one variable: rm("a")
– Remove all variables: rm(list=ls())

Rules for Names in R
• Any combination of letters, numbers, underscore, and "."
• May not start with a number or an underscore.
– R is case-sensitive: x and X are not the same thing
• Variable names should be short, but descriptive
– Camel caps: MyMathScore <- 95
– Underscore: my_math_score <- 95
– Dot separated: my.math.score <- 95

Shortcut Keys
• Run selected/current line and go to the next line:
– “Ctrl” + “Enter” (Windows)
– “Command” + “Enter” (macOS)
• Insert the assignment operator "<-":
– “Alt” + “-” (Windows)
– “Option” + “-” (macOS)

R Help Functions
• If you know the name of the function or object on which you want help:
– help(read.csv)
– help("read.csv")
– ?read.csv
– ?"read.csv"
• If you do not know the name of the function or object on which you want help:
– help.search("read.csv")
– ??read.csv
– RSiteSearch("read.csv")
• Do NOT forget our friend: Google
– Be careful: the REBOL programming language uses the same file suffix
– Enter this (assuming you want to search for permutation): filetype:R permutation -rebol

Installing Packages
• Get your computer connected to the internet
• Download the package: install.packages("wordcloud")
• Load the library: library(wordcloud)
• Apply a function in the package: wordcloud(x, max.words = 20, scale = c(2, 0.5))
• I recommend that everyone install the package ‘tidyverse’, which is actually a bundle of packages that contain a lot of useful functions and data sets: install.packages("tidyverse")

Data Structures
• The basic data structures in R:
– Vector: One can think of a vector as a single column in an Excel spreadsheet that contains elements of the same type (numbers, characters (text), logicals, etc.).
– Data frame: Data frames are rectangular data sets, similar to Excel data tables. A data frame contains rows and columns; each column has a name, and the values within each column are of a single type (e.g., a column of numbers, characters, logicals, etc.).
– List: Lists are arbitrary collections of items, with essentially no rules. Lists are used frequently in the outputs of R’s statistical functions.
– Matrix
• We will primarily work with vectors and data frames. Lists will also be covered in some cases.

Getting Data
• Before we can do any analysis with R, we need data. We can either create the data ourselves or import it from somewhere.
– We will first show how to create data
– We will then show how to import the most commonly used data files:
• Excel file
• Comma Separated Values (CSV) file

Creating a Vector
• To enter data into a vector, use c(), the “concatenate” function, which combines all the elements in the parentheses into a vector.
> name <- c('John', 'Eric', 'Michael', 'Vincent', 'William') # entering names into the vector “name”
> height <- c(168, 178, 180, 172, 166) # entering heights into the vector “height”
> weight <- c(70, 73, 71, 68, 73) # entering weights into the vector “weight”
> c(1, 3, 5) # join listed elements into a vector
• More methods
> 1:10 # enter all integers from 1 to 10
> seq(0, 20, by = 2) # enter a sequence of integers from 0 to 20 with an increment of 2.
> rep(1:2, times = 3) # repeat the vector 3 times
> rep(1:2, each = 3) # repeat each element of the vector 3 times.
> rep(1:3, c(2, 3, 5)) # repeat each element of the first vector by the corresponding number of times specified in the second vector.
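The vector-building commands above can be run as a script to see what each one produces. This is a minimal sketch; the names x1 to x5 and mixed are illustrative, and the last lines show what happens when c() is given elements of different types, since a vector holds a single type.

```r
x1 <- 1:10                    # 1 2 3 4 5 6 7 8 9 10
x2 <- seq(0, 20, by = 2)      # 0 2 4 ... 20 (11 values)
x3 <- rep(1:2, times = 3)     # 1 2 1 2 1 2
x4 <- rep(1:2, each = 3)      # 1 1 1 2 2 2
x5 <- rep(1:3, c(2, 3, 5))    # 1 1 2 2 2 3 3 3 3 3

print(length(x2))             # 11

# Mixing types: everything is converted to the most general type (character)
mixed <- c(1, "a", TRUE)
print(mixed)                  # "1" "a" "TRUE"
print(class(mixed))           # "character"
```

The silent conversion in the last step is a common source of surprises, so it is worth checking class() after building a vector from mixed input.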

Creating a Vector
• Logical vectors can be generated by conditions:
– E.g., x <- 5 > 4
– Logical operators are: <, <=, >, >=, ==, !=
– Logical expressions: & (and), | (or), ! (not)
• To generate data values that are distributed according to probability distributions:
– Use the function rnorm() to draw 10000 data values from a normal distribution with mean 100 and standard deviation 15
– Use the function runif() to draw 20000 data values from a uniform distribution running from 75 to 125

Indexing Vector Elements
• In programming, an index is used to refer to a specific element or set of elements in a vector (or other data structure).
• The format is vector1[vector2]
– weight[1] works
– weight[1,2]: error (a vector has only one dimension)
– weight[c(1,2)] and weight[1:2] work
• Negative subscripts mean that we want to exclude the given elements
– weight[-1]
• A logical vector indicates whether each item should be included.
– weight[weight>70]
– which() gives the positions at which the condition is TRUE.

Indexing Vector Elements
> height[3] # returns the 3rd element of height (a vector)
> height[-3] # returns all but the 3rd element of height
> height[c(2, 3, 4)] # returns the 2nd through 4th elements of height
> height[2:4] # returns the 2nd through 4th elements of height
> height[-(2:4)] # returns all elements except the 2nd through 4th elements of height
> name[name == 'Michael'] # returns all elements which are equal to 'Michael'
> height[height < 170] # returns all elements which are less than 170
> name[name %in% c('Eric', 'Adam', 'William')] # returns all elements whose values are in the set; %in% is the match operator
> names(height) <- name # names each element of the height vector with the person's name
> height['William'] # returns the height of William; this method applies only to vectors with named elements
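The indexing rules become more useful when a condition on one vector selects elements of another. The sketch below rebuilds the three vectors from earlier in the lecture so that it runs on its own; the result names (pos, heavy_names, tallest, both) are illustrative.

```r
name   <- c("John", "Eric", "Michael", "Vincent", "William")
height <- c(168, 178, 180, 172, 166)
weight <- c(70, 73, 71, 68, 73)

pos <- which(weight > 70)             # positions where the condition holds: 2 3 5
heavy_names <- name[weight > 70]      # "Eric" "Michael" "William"
tallest <- name[which.max(height)]    # "Michael"

# Named elements can be selected by name, several at a time
names(height) <- name
print(height[c("Eric", "William")])

# Combining conditions across vectors: taller than 170 and lighter than 72
both <- height[height > 170 & weight < 72]
print(both)                           # Michael 180, Vincent 172
```

This works because the three vectors are parallel: position i in each one refers to the same person, so a logical vector computed from weight lines up with name and height.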

Spreadsheet Format

Creating a Data Frame
• If the three vectors mentioned previously already exist, you can combine them into a data frame:
> bio <- data.frame(name, height, weight, stringsAsFactors = FALSE)
• If not, enter the following to create a data frame:
> bio2 <- data.frame(name = c('John', 'Eric', 'Michael', 'Vincent', 'William'),
                     height = c(168, 178, 180, 172, 166),
                     weight = c(70, 73, 71, 68, 73),
                     stringsAsFactors = FALSE) # to create a data frame
> bio2 # to view
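Once a data frame exists, a few base R functions help confirm that it has the expected shape and column types. The sketch below recreates the same five-row data frame so that it is self-contained; the checks shown (str, nrow, ncol, the $ operator, row filtering) are a suggested selection and the name tall is illustrative.

```r
bio <- data.frame(name = c("John", "Eric", "Michael", "Vincent", "William"),
                  height = c(168, 178, 180, 172, 166),
                  weight = c(70, 73, 71, 68, 73),
                  stringsAsFactors = FALSE)

str(bio)                 # column names, types, and the first few values
print(nrow(bio))         # number of rows (observations): 5
print(ncol(bio))         # number of columns (variables): 3

print(bio$height)        # one column, returned as a vector
print(mean(bio$height))  # 172.8

# Rows can be filtered with a logical vector, just like vector elements
tall <- bio[bio$height > 170, ]
print(tall)
```

In the terminology from earlier in the lecture, each row is an observation on one element (a person) and each column is a variable.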
