
Lead Research Scientist, Financial Risk Quantitative Research, SS&C Algorithmics Adjunct Professor, University of Toronto
MIE1624H – Introduction to Data Science and Analytics Lecture 3 – Basic Statistics
University of Toronto January 25, 2022


Lecture outline
Basic statistics
▪ Before you analyze your data
▪ Sources of uncertainty
▪ Summarizing and interpreting your data
❑ Quantitative data
❑ Categorical data
▪ Distributions
▪ Law of Large Numbers and Central Limit Theorem

 Before You Analyze Your Data

Where does your data come from?
▪ Do you have access to complete data, or only a sample?
❑ Sample of sales transactions vs. the entire database of sales transactions
• How was the subset selected? Systematically, randomly?
❑ Data for a subset of employees vs. HR data about all employees
• Randomly selected? Voluntary response?
❑ Complete demographic data of NYC users of a web service
• Conclusions about all NYC users of the service? Conclusions about all NYC inhabitants?
▪ How the data was collected will drive what kind of conclusions we may be able to draw, and how confident we can be in those conclusions.

Election polling
▪ In many cases, the margins of error reported by pollsters substantially overstate the precision of poll-based forecasts
❑ The usually reported margin of error is about 3% (for a random and representative sample of around 1,000 people) – see the quick check after this list
❑ Trump vs. Clinton election: why were the polls wrong?
▪ Current polling practice
❑ Low response rates (less than 10%)
❑ Inadequate coverage
❑ Hidden dependence (who tends to answer phone?)
❑ Question design and the order in which questions are asked:
− who would you vote for?
− would you go and vote?
❑ A pollster’s methodology often produces results that lean to one side of politics or the other
❑ Opinion polls tell us a historical fact: opinion on the date people were polled
▪ The sampling approach does not randomly select people from the entire population
▪ Segments of the population are excluded
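The 3% figure quoted above can be checked directly from the formula for the margin of error of a proportion. This is a minimal sketch in plain Python; the sample size of 1,000 and the worst-case proportion p = 0.5 are assumptions taken from the bullet above, not poll data.

import math

n = 1000   # assumed sample size from the slide
p = 0.5    # worst-case proportion: maximizes the margin of error
z = 1.96   # z-value for 95% confidence

standard_error = math.sqrt(p * (1 - p) / n)  # standard error of a sample proportion
margin_of_error = z * standard_error         # about 0.031, i.e., roughly 3 percentage points
print(f"95% margin of error: {margin_of_error:.3f}")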

US presidential elections 2016

US presidential elections 2016
Source: BBC poll of polls

US presidential elections 2016

US presidential elections 2020
Source: BBC poll of polls https://www.bbc.com/news/election-us-2020-53657174

US presidential elections 2020
Source: https://www.bbc.com/news/election-us-2020-54094119
RV – registered voters
LV – likely voters

US presidential elections 2020
Source: https://www.bbc.com/news/election-us-2020-54094119

What kind of data are we dealing with?
▪ Types of data
• Quantitative
• Categorical (ordered, unordered)
▪ Data collection
• Independent observations (one observation per subject)
• Dependent observations (repeated observation of the same subject, relationships
within groups, relationships over time or space)
▪ Type of data drives the direction of your analysis
• How to plot
• How to summarize
• How to draw inferences and conclusions
• How to issue predictions

Uncertainty stemming from the data collection process
A spectrum from no uncertainty to greater uncertainty:
▪ Complete data – no uncertainty from sampling
e.g., census (in theory), database of all business transactions in the past, Big Data (in some cases)
▪ Sample or sparse data – greater uncertainty
e.g., survey data, sensor data, experiments
Uncertainty due to having data from only a sample, in addition to uncertainty in the measurement tool

Sources of uncertainty
Uncertainty in descriptive statistics, predictions and forecasts
▪ Average vs. Individual (Standard Deviation)
▪ Data vs. Reality (Confidence Interval, Margin of Error)
▪ Prediction/Forecast (Prediction Intervals)
Uncertainty from data collection
Uncertainty in model

Quantitative Data

Quantitative data
▪ Examples: temperature, age, income
▪ Quick check: “Does it make sense to calculate an average?”
▪ Appropriate summary statistics:
– Mean and Median
– Standard Deviation
– Percentiles
▪ More advanced predictive methods: Regression, Time Series Analysis, …
▪ Plot your data!

Summarizing quantitative data
▪ One-number summaries
– Mean: the average, obtained by summing all observations and dividing by the number of observations.
– Median: the center value, below and above which you will find 50% of the observations.
▪ Summarizing your data with one number may not tell the whole story:
[Figure: three sample histograms; two have median 19.8 and one has median 10.5]
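To see how a single summary number can hide the shape of the data, here is a minimal sketch with NumPy on made-up, right-skewed values (the numbers are hypothetical, not the data behind the figure above).

import numpy as np

# hypothetical right-skewed sample: one large value pulls the mean up
data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 95])

print("mean:  ", np.mean(data))    # 24.2, inflated by the outlier
print("median:", np.median(data))  # 16.5, robust to the outlier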

Flaw of averages
[Cartoon: “Average depth 3 ft”]
“Plans based on average assumptions are wrong on average”

Standard deviation
If the data is normally distributed:
“Most observations fall within ±2 standard deviations of the mean.”
Example from the figure: Standard Deviation = 4.2, so ~95% of observations fall between 11.4 and 28.2 (i.e., within 19.8 ± 2 × 4.2).
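A minimal sketch of the ±2 standard deviation rule of thumb using the numbers quoted on this slide (the mean of 19.8 is inferred from the interval 11.4 to 28.2; the simulated data and seed are my own assumptions).

import numpy as np

mean, std = 19.8, 4.2                        # values implied by the slide
lower, upper = mean - 2 * std, mean + 2 * std
print(lower, upper)                          # 11.4 28.2

# empirical check on simulated, normally distributed data
rng = np.random.default_rng(0)
sample = rng.normal(mean, std, size=100_000)
share = np.mean((sample >= lower) & (sample <= upper))
print(f"share within +/-2 sd: {share:.3f}")  # close to 0.95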

Distributions: Normal distribution

Distributions: Normal distribution

Distributions: Non-Normal distribution

Descriptive statistics – example
▪ Random sample of 5000 customers of a credit card company
                        Amount spent on primary card last month    Debt to income ratio (x100)
N (Valid)               5000                                        5000
Missing                 0                                           0
Mean                    1683.7340                                   9.9578
Median                  1690.0670                                   8.8000
Std. Deviation          210.26680                                   6.42317
Minimum                 .00                                         .00
Maximum                 2482.72                                     43.10
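The same kind of summary table can be produced in a few lines with pandas. This is a generic sketch on simulated, hypothetical data; the column names and the distributions used to generate the numbers are assumptions, not the course dataset.

import numpy as np
import pandas as pd

# hypothetical stand-in for the 5000-customer sample
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount_spent_last_month": rng.normal(1684, 210, size=5_000).clip(min=0),
    "debt_to_income_x100": rng.gamma(shape=2.4, scale=4.15, size=5_000),
})

# count, mean, median, standard deviation, minimum, maximum for each column
summary = df.agg(["count", "mean", "median", "std", "min", "max"])
print(summary.round(2))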

Percentiles
▪ Generalizations of the median (50th percentile).
▪ The pth percentile is the data point below which p percent of the observations fall.
▪ Often used to compare a single observation to a general population.
▪ Examples:
– Standardized test scores
If you scored in the 93rd percentile, your score was higher than that of 93% of test takers.
– Child growth percentiles
– Stock market/Options trading
“The call/put volume ratio of 2.15 stands in the 82nd annual percentile, pointing to a heightened demand for long calls during the last two weeks.”
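One way to place a single observation within a reference sample, as in the test-score example above, is scipy's percentileofscore. A minimal sketch; the population of scores and the score of 87 are made up.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
scores = rng.normal(70, 10, size=5_000)          # hypothetical population of test scores

my_score = 87
pct = stats.percentileofscore(scores, my_score)  # share of scores at or below my_score
print(f"A score of {my_score} sits at roughly the {pct:.0f}th percentile")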

Percentiles – example
▪ Percentiles can be another way of describing how spread out data values are.
Example: 5-Number Summary
Minimum – 25th percentile – Median (50th percentile) – 75th percentile – Maximum
                                           Minimum    25th        Median (50th)   75th        Maximum
Amount spent on primary card last month    .00        1567.4658   1690.0670       1814.5430   2482.72
Debt to income ratio (x100)                .00        5.1250      8.8000          13.5000     43.10
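A minimal sketch of computing a 5-number summary with NumPy percentiles (the array is a hypothetical stand-in for the spending column, not the actual sample):

import numpy as np

rng = np.random.default_rng(2)
spend = rng.normal(1684, 210, size=5_000).clip(min=0)  # hypothetical spending data

five_num = np.percentile(spend, [0, 25, 50, 75, 100])  # min, Q1, median, Q3, max
print(dict(zip(["min", "q25", "median", "q75", "max"], np.round(five_num, 2))))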

Quantifying uncertainty – confidence intervals
▪ Unless we have complete data, we cannot be sure that the mean in the sample is equal to the true underlying mean (of the theoretically underlying complete data).
One-Sample Test – 95% Confidence Interval of the Difference
Debt to income ratio (x100):              (9.7797, 10.1359)
Amount spent on primary card last month:  (1677.9044, 1689.5636)
“We are 95% confident that the average Debt-to-Income ratio (x100) is between 9.78 and 10.14.”
“The average Debt-to-Income ratio (x100) is 9.96 with a margin of error of .18”
▪ Confidence Intervals (CI) and Margins of Error (MoE) tell us how close we think the mean is to the true value, with a certain level of confidence.
▪ Generally, CIs and MoEs are calculated for 95% confidence. Other levels of confidence are labeled explicitly.
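The interval above can be reproduced from the reported summary statistics alone. This is a minimal sketch using the normal-approximation formula mean ± 1.96 · s/√n; with n = 5000 it is essentially identical to the t-based interval shown above.

import math

n, mean, sd = 5000, 9.9578, 6.42317   # Debt-to-income ratio (x100), from the sample
se = sd / math.sqrt(n)                # standard error of the mean
moe = 1.96 * se                       # margin of error, about 0.178
print(f"95% CI: ({mean - moe:.4f}, {mean + moe:.4f})")  # roughly (9.78, 10.14)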

Quantifying uncertainty – confidence intervals

Comparing means of two groups
▪ If two groups have different means in our data, can we conclude that the means would be different if we had complete information?
▪ In statistical terms, we want to test if the observed difference is statistically significant.
▪ Once again, we consider the fact that there is uncertainty in our data.
▪ Example:
In our sample of customers, women have a higher Debt-to-Income ratio but spend less on their primary credit card.
Are these differences statistically significant?
Group Statistics (mean and Std. Deviation by gender)
Debt to income ratio (x100):              Male and Female means nearly equal, around 9.96 (Std. Deviation ≈ 6.47)
Amount spent on primary card last month:  Male 356.6068 (Std. Deviation 263.40686), Female 323.3435 (Std. Deviation 231.93672)

Comparing means of two groups
▪ Example: Independent samples t-test
Independent Samples Test – t-test for Equality of Means (equal variances not assumed)
                                           t        df          Sig. (2-tailed)   Mean Difference
Debt to income ratio (x100)                -.308    4994.814    .758              -.05599
Amount spent on primary card last month    4.732    4862.365    < .05             33.26335
▪ A statistical test tells us whether an observed difference is statistically significant:
– P-value < .05: The difference observed in the data is most likely not due to chance. We conclude the difference is also present in the unobserved population. The difference is statistically significant.
– P-value > .05: The difference observed could easily be due simply to chance. It is not safe to conclude that the difference is present in the underlying (unobserved) population.
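A minimal sketch of running this kind of test in Python. Only group means and standard deviations are reported on the slides, so the summary-statistics form of the test is used here, and the group sizes of 2,500 each are an assumption; the result will therefore only roughly match the SPSS output.

from scipy import stats

# reported group summaries for "Amount spent on primary card last month"
mean_m, sd_m, n_m = 356.6068, 263.40686, 2500  # n per group is an assumed 50/50 split
mean_f, sd_f, n_f = 323.3435, 231.93672, 2500

t, p = stats.ttest_ind_from_stats(mean_m, sd_m, n_m,
                                  mean_f, sd_f, n_f,
                                  equal_var=False)  # Welch's t-test: equal variances not assumed
print(f"t = {t:.3f}, p = {p:.2g}")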

Comparing means of two groups
▪ Example: Independent samples t-test
▪ In the case of Debt-to-Income ratio, we conclude that there is no significant difference between men and women (P-value = .758 >.05, not significant).
▪ In the case of Amount spent on primary card, we conclude that men tend to charge more on their primary card (P-value < .05, statistically significant).
▪ Note: The larger the sample, the more likely a difference of a given size will be significant.
▪ Caveat: Make sure all your observations are truly independent (repeated observations are cheating!)
▪ For any data scenario, there are different tests that make their own mathematical assumptions. When in doubt, consult your favorite statistician.

Comparing means of two groups
Hypothesis testing
– Null hypothesis: the means of the two groups are not different
– Alternative hypothesis: the means of the two groups are different

Categorical Data

Categorical data
▪ Examples: gender, age groups, product category
▪ Summarize using frequencies and percentages in crosstabs
▪ More advanced predictive methods: Logistic Regression, Classification, ...
▪ Example: iOS vs. Android users
[Crosstab of age group × operating system; counts partially extracted, e.g., Android / iOS / Total: 200 / 154 / 354 and 219 / 149 / 368]
% within Operating System (age-group rows; Android / iOS / Total):
14.0% / 16.1% / 14.8%
30.1% / 33.0% / 31.3%
32.9% / 31.9% / 32.5%
14.0% / 10.1% / 12.4%
8.0% / 6.0% / 7.2%
1.1% / 3.0% / 1.9%
Total: 100% / 100% / 100%
Data source: www.forrester.com

Margin of error for categorical data
▪ Confidence intervals and Margins of Error can be calculated for categorical data as well
▪ For this survey, the margin of error was 1.32% for 95% confidence
▪ However, this data was based on an online survey, so the results might be biased!

Comparative statistics for categorical data
▪ Is the distribution of one categorical variable independent of another categorical variable?
▪ Example: Is the distribution of age groups the same for iOS and Android users? It looks like iOS users tend to be younger than Android users. Is this difference statistically significant?
Chi-Square Test (on the crosstab above): chi-square value = 12.123, N of Valid Cases = 1132
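A minimal sketch of a chi-square test of independence in Python. The full table of counts is not reproduced above, so the contingency table below is made up for illustration (six hypothetical age-group rows, Android and iOS columns); scipy.stats.chi2_contingency is the function being illustrated.

import numpy as np
from scipy import stats

# hypothetical age-group (rows) x operating-system (columns: Android, iOS) counts
observed = np.array([
    [ 90, 105],
    [190, 215],
    [205, 208],
    [ 88,  66],
    [ 50,  39],
    [  7,  20],
])

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, dof = {dof}, p = {p:.4f}")
# a p-value below .05 would suggest the age distribution differs between the two platforms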
Distributions

[Figure-only slides: examples of continuous distributions. One example: estimate of the probability distribution of global mean temperature resulting from a doubling of CO2 relative to its pre-industrial value, made from 100,000 simulations]

Central Limit Theorem

Central Limit Theorem
▪ Arithmetic means from a sufficiently large number of random samples from the entire population will be Normally distributed around the population mean (regardless of the distribution in the population)
▪ If E[xᵢ] = μ and Var(xᵢ) = σ² for all i (and the xᵢ are independent), then the sample mean (x₁ + x₂ + ... + xₙ)/n is approximately Normal with mean μ and variance σ²/n for large n

Central Limit Theorem – example
▪ The figure shows the resulting frequency distributions, each based on 500 means. For n = 4, 4 scores were sampled from a uniform distribution 500 times and the mean computed each time. The same method was followed with means of 7 scores for n = 7 and 10 scores for n = 10.
▪ When n increases:
1. The distributions become more and more Normal
2. The spread of the distributions decreases
Source: http://davidmlane.com/hyperstat/A14461.html

Central Limit Theorem – example (bootstrapping)

Central Limit Theorem
▪ The sampling distribution of the mean roughly follows a Normal distribution
▪ 95% of the time, an individual sample mean should lie within 2 (actually 1.96) standard deviations of the mean
P(Z >= 2.0) = 0.0228, P(Z >= 1.96) = 0.025
P(-2 <= Z <= +2) = 1 – 2 * 0.0228 = 0.9544
P(-1.96 <= Z <= +1.96) = 1 – 2 * 0.025 = 0.95
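A minimal simulation sketch of the Central Limit Theorem in the spirit of the uniform-distribution example above (500 means of n draws each); the Uniform(0, 1) range and the random seed are my assumptions, not part of the lecture.

import numpy as np

rng = np.random.default_rng(42)

for n in (4, 7, 10):
    # 500 sample means, each computed from n draws of a Uniform(0, 1) distribution
    means = rng.uniform(0, 1, size=(500, n)).mean(axis=1)
    print(f"n = {n:2d}: mean of means = {means.mean():.3f}, "
          f"std of means = {means.std(ddof=1):.3f} "
          f"(theory: sigma/sqrt(n) = {np.sqrt(1 / (12 * n)):.3f})")
# as n grows, the spread of the sample means shrinks like sigma/sqrt(n)
# and their histogram looks more and more Normal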
Central Limit Theorem
▪ The standard deviation of the sampling distribution of the mean of x (the standard error) is s / √n, where s is the standard deviation of x and n is the sample size
▪ The 95% margin of error for the mean is therefore 1.96 · s / √n; rearranging gives the sample size needed for a target margin of error, n = (1.96 · s / MoE)²

Central Limit Theorem – election poll example
• Suppose we conduct a poll to try to predict the outcome of an upcoming election with two candidates. We poll 1000 people, and 550 of them respond that they will vote for candidate A
• How confident can we be that a given person will cast their vote for candidate A?
• In this case we are working with a binomial distribution (i.e., a voter can choose candidate A or B)
• We have a probability estimate from our sample: the probability of an individual in our sample voting for candidate A was found to be 550/1000 = 0.55
• For the binomial distribution, the standard error of the sample proportion is √(p(1 − p)/n) = √(0.55 × 0.45 / 1000) ≈ 0.0157
• Margin of error = 1.96 * 0.0157 = 0.031 ≈ 3%

Election Statistics Revisited

US presidential elections 2016
Source: BBC poll of polls

Election polling – quantitative issues relevant to the 2016 elections
▪ Errors in sampling and polling
❑ Low response rates (less than 10%)
❑ Inadequate coverage
▪ Uncertainty
▪ Correlated errors
Source: http://senseaboutscienceusa.org/biggest-stats-lesson-2016/

Random variables – mean and variance
◼ Random variable
❑ x is a random variable that takes a finite number of values xᵢ, for i = 1, 2, ..., n
❑ A probability pᵢ (associated with each event) represents the relative chance of an occurrence of xᵢ, such that pᵢ >= 0 and p₁ + p₂ + ... + pₙ = 1 (probabilities as relative frequencies)
◼ Expected value (mean value or mean)
❑ The average value obtained by regarding probabilities as frequencies: E[x] = p₁x₁ + p₂x₂ + ... + pₙxₙ
◼ Variance – a measure of possible deviation from the mean (how much x tends to vary from its mean)
❑ Var(x) = E[(x − E[x])²] = E[x²] − (E[x])², where E[x²] is the expected value of the squared variable

Random variables – covariance and correlation
◼ Covariance of two random variables x and y: Cov(x, y) = E[(x − E[x]) (y − E[y])]
◼ Correlation between two random variables x and y: ρ(x, y) = Cov(x, y) / (σₓ σᵧ)
❑ The sign defines the direction of the relationship:
ρ > 0 – positively correlated; ρ = 0 – uncorrelated (maybe independent); ρ < 0 – negatively correlated; ρ = ±1 – perfect positive (negative) correlation

Voter turnout at elections – Canada, USA, Ukraine, Poland, Bulgaria
Source: http://euromaidanpress.com/2016/09/21/statistical-method-measures-voting-fraud-of-russias-pro-putin-party/

Voter turnout at elections – Russia election fraud
Source: http://euromaidanpress.com/2016/09/21/statistical-method-measures-voting-fraud-of-russias-pro-putin-party/

Number of votes for political parties – Russia election fraud
[Figure: % of votes for each political party (including the Communist Party, Yabloko, Sprav. Rossiya) vs. number of polling districts]
Source: https://keleg.livejournal.com/353195.html

Summary of Lecture 3

Summary – good practices for data analysis
▪ Be aware of where your data comes from and how it was collected
▪ Plot your data
▪ Choose the appropriate summary statistics for your type of data
▪ Statistics generally have uncertainty associated with them
– Keep standard deviations and confidence intervals in mind when interpreting results
– Perform statistical tests to see if differences in the data indicate a statistically significant difference
▪ Get familiar with distributions