MATH1302

SIMPLE STATISTICAL CONCEPTS
Assumed knowledge for MATH1302

DESCRIPTIVE STATISTICS - GRAPHS: FREQUENCY HISTOGRAM


o A frequency histogram is a bar graph that represents the frequency distribution of a data set.
1. The horizontal scale is quantitative and measures the data values.
2. The vertical scale measures the frequencies of the classes.
3. Consecutive bars must touch.
Class boundaries are the numbers that separate the classes without forming gaps between them.
The horizontal scale of a histogram can be marked with either the class boundaries or the midpoints.

CLASS BOUNDARIES
o Example:
o Find the class boundaries for the “Ages of Students” frequency distribution.
The distance from the upper limit of the first class to the lower limit of the second class is 1.
Half this distance is 0.5.
Ages of Students
Class    Frequency, f    Class Boundaries
18–25    13              17.5–25.5
26–33    8               25.5–33.5
34–41    4               33.5–41.5
42–49    3               41.5–49.5
50–57    2               49.5–57.5
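The half-gap rule in this example can be sketched in a few lines of Python. This is only an illustration: the function name `class_boundaries` is made up here, and the sketch assumes the classes are consecutive with a constant gap between one upper limit and the next lower limit.

```python
# Class limits for the "Ages of Students" frequency distribution.
class_limits = [(18, 25), (26, 33), (34, 41), (42, 49), (50, 57)]


def class_boundaries(limits):
    """Return (lower boundary, upper boundary) for each class.

    Assumption: the gap between an upper limit and the next lower limit
    is the same for every pair of neighbouring classes.
    """
    gap = limits[1][0] - limits[0][1]  # 26 - 25 = 1
    half = gap / 2                     # 0.5
    return [(lower - half, upper + half) for lower, upper in limits]


boundaries = class_boundaries(class_limits)
for (lower, upper), (low_b, up_b) in zip(class_limits, boundaries):
    print(f"{lower}-{upper}: {low_b}-{up_b}")
```

Each upper boundary equals the next class's lower boundary, which is what lets consecutive histogram bars touch.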

FREQUENCY HISTOGRAM
o Draw a frequency histogram for the “Ages of Students” frequency distribution. Use the class boundaries.
[Figure: frequency histogram, “Ages of Students”. Horizontal axis: Age (in years), marked with the class boundaries and drawn with a broken axis.]

FREQUENCY POLYGON
o A frequency polygon is a line graph that emphasizes the continuous change in frequencies.
[Figure: frequency polygon, “Ages of Students”. Horizontal axis: Age (in years), marked with the class midpoints and drawn with a broken axis. The line is extended to the x-axis.]

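RELATIVE FREQUENCY HISTOGRAM
o A relative frequency histogram has the same shape and the same horizontal scale as the corresponding frequency histogram. The vertical scale measures the relative frequencies rather than the frequencies.
[Figure: relative frequency histogram, “Ages of Students”. Horizontal axis: Age (in years). Vertical axis: Relative frequency (portion of students).]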

CUMULATIVE FREQUENCY GRAPH
o A cumulative frequency graph, or ogive, is a line graph that displays the cumulative frequency of each class at its upper class boundary.
The graph ends at the upper boundary of the last class.
[Figure: ogive, “Ages of Students”. Horizontal axis: Age (in years), marked with the class boundaries. Vertical axis: Cumulative frequency (portion of students).]
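The vertical scales of the relative frequency histogram and the ogive come from the same frequency column. A minimal Python sketch, using the “Ages of Students” frequencies and variable names chosen only for illustration:

```python
classes = ["18-25", "26-33", "34-41", "42-49", "50-57"]
frequencies = [13, 8, 4, 3, 2]

n = sum(frequencies)  # total number of students

# Relative frequency: each class frequency as a portion of n.
relative = [f / n for f in frequencies]

# Cumulative frequency: running total, plotted at each upper class boundary.
cumulative = []
running_total = 0
for f in frequencies:
    running_total += f
    cumulative.append(running_total)

for label, f, rel, cum in zip(classes, frequencies, relative, cumulative):
    print(f"{label}: f={f}, relative={rel:.3f}, cumulative={cum}")
```

The last cumulative value equals n, which is why the ogive ends at the upper boundary of the last class.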

STEM-AND-LEAF PLOT
In a stem-and-leaf plot, each number is separated into a stem (usually the entry’s leftmost digits) and a leaf (usually the rightmost digit). This is an example of exploratory data analysis.
The following data represent the ages of 30 students in a statistics class. Display the data in a stem-and-leaf plot.
Ages of Students
18 20 21 27 29 20 19 30 32 19 34 19 24 29 18
37 38 22 30 39 32 44 33 46 54 49 18 51 21 21

STEM-AND-LEAF PLOT
Ages of Students
Key: 1|8 = 18
1 | 8 8 8 9 9 9
2 | 0 0 1 1 1 2 4 7 9 9
3 | 0 0 2 2 3 4 7 8 9
4 | 4 6 9
5 | 1 4
Most of the values lie between 20 and 39.
This graph allows us to see the shape of the data as well as the actual values.
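One way to build this plot programmatically is sketched below. It assumes two-digit entries, so the stem is the tens digit and the leaf is the units digit; the helper name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

ages = [18, 20, 21, 27, 29, 20, 19, 30, 32, 19, 34, 19, 24, 29, 18,
        37, 38, 22, 30, 39, 32, 44, 33, 46, 54, 49, 18, 51, 21, 21]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its leaves (units digits) in sorted order."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(ages)
for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
```

Sorting first keeps the leaves in increasing order on each row, which is what makes the shape of the data readable.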

STEM-AND-LEAF PLOT
Construct a stem-and-leaf plot that has two lines for each stem.
Ages of Students
Key: 1|8 = 18
1 |
1 | 8 8 8 9 9 9
2 | 0 0 1 1 1 2 4
2 | 7 9 9
3 | 0 0 2 2 3 4
3 | 7 8 9
4 | 4
4 | 6 9
5 | 1 4
5 |
From this graph, we can conclude that more than 50% of the data lie between 20 and 34.

DOT PLOT
In a dot plot, each data entry is plotted, using a point, above a horizontal axis.
Use a dot plot to display the ages of the 30 students in the
statistics class.
Ages of Students
18 20 21 27 29 20 19 30 32 19 34 19 24 29 18 37 38 22 30 39 32 44 33 46 54 49 18 51 21 21

[Figure: dot plot, “Ages of Students”. Horizontal axis from 15 to 57 in steps of 3.]
From this graph, we can conclude that most of the values lie between 18 and 32.

PIE CHART
A pie chart is a circle that is divided into sectors that represent categories. The area of each sector is proportional to the frequency of each category.
Accidental Deaths in the USA in 2002
Type                        Frequency
Motor Vehicle               43,500
Falls                       12,200
Poison                      6,400
Drowning                    4,600
Fire                        4,200
Ingestion of Food/Object    2,900
Firearms                    1,400
(Source: US Dept. of Transportation)

To create a pie chart for the data, find the relative frequency (percent) of each category.
Type                        Frequency    Relative Frequency
Motor Vehicle               43,500       0.578
Falls                       12,200       0.162
Poison                      6,400        0.085
Drowning                    4,600        0.061
Fire                        4,200        0.056
Ingestion of Food/Object    2,900        0.039
Firearms                    1,400        0.019
                            n = 75,200

Next, find the central angle. To find the central angle, multiply the relative frequency by 360°.
Type                        Frequency    Relative Frequency    Angle
Motor Vehicle               43,500       0.578                 208.2°
Falls                       12,200       0.162                 58.4°
Poison                      6,400        0.085                 30.6°
Drowning                    4,600        0.061                 22.0°
Fire                        4,200        0.056                 20.1°
Ingestion of Food/Object    2,900        0.039                 13.9°
Firearms                    1,400        0.019                 6.7°
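The two steps for the pie chart (relative frequency, then central angle) can be reproduced with a short Python sketch. The dictionary and variable names are illustrative; the frequencies are the ones given for the accidental deaths data.

```python
deaths = {
    "Motor Vehicle": 43_500,
    "Falls": 12_200,
    "Poison": 6_400,
    "Drowning": 4_600,
    "Fire": 4_200,
    "Ingestion of Food/Object": 2_900,
    "Firearms": 1_400,
}

n = sum(deaths.values())  # total frequency

# Relative frequency of each category, then its central angle in degrees.
relative = {category: frequency / n for category, frequency in deaths.items()}
angles = {category: rel * 360 for category, rel in relative.items()}

for category in deaths:
    print(f"{category}: {relative[category]:.3f}, {angles[category]:.1f} degrees")
```

The angles here are computed from the unrounded relative frequencies. Multiplying the rounded value 0.578 by 360° gives about 208.1° rather than 208.2°, so small differences in the last digit are a rounding effect.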

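[Figure: pie chart, “Accidental Deaths in the USA in 2002”. Sectors: Motor vehicles 57.8%, Falls 16.2%, Poison 8.5%, Drowning 6.1%, Fire 5.6%, Ingestion of Food/Object 3.9%, Firearms 1.9%.]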

SCATTER PLOT
When each entry in one data set corresponds to an entry in another data set, the sets are called paired data sets.
In a scatter plot, the ordered pairs are graphed as points in a coordinate plane. The scatter plot is used to show the relationship between two quantitative variables.
The following scatter plot represents the relationship between the number of absences from a class during the semester and the final grade.

SCATTER PLOT
Absences, x    Final grade, y
8              78
2              92
5              90
12             58
15             43
9              74
6              81
[Figure: scatter plot. Horizontal axis: Absences (x), from 0 to 16. Vertical axis: Final grade (y), from 40 to 100.]
From the scatter plot, you can see that as the number of absences increases, the final grade tends to decrease.

NUMERICAL METHODS FOR DESCRIBING DATA
The chief advantage of using a graphical method to represent the data is its visual impact. Often, however, we are restricted to reporting on the data verbally, and a graphical method cannot be used.
The greatest disadvantage to a graphical method of describing data is its unsuitability for making inferences, our main goal.

A measure of central tendency is a value that represents a typical, or central, entry of a data set. The three most commonly used measures of central tendency are the mean, the median, and the mode.
o The mean of a data set is the sum of the data entries divided by the number of entries.
Population mean: $\mu = \frac{\sum x}{N}$ (“mu”)
Sample mean: $\bar{x} = \frac{\sum x}{n}$ (“x-bar”)

o The median of a data set is the value that lies in the middle of the data when the data set is ordered. If the data set has an odd number of entries, the median is the middle data entry. If the data set has an even number of entries, the median is the mean of the two middle data entries.
Calculate the median age of the seven employees.
53 32 61 57 39 44 57
To find the median, sort the data.
32 39 44 53 57 57 61
The median age of the employees is 53 years.

o The mode of a data set is the data entry or category that occurs with the greatest frequency. If no entry is repeated, the data set has no mode. If two entries occur with the same greatest frequency, each entry is a mode and the data set is called bimodal.
o Example:
o Find the mode of the ages of the seven employees.
53 32 61 57 39 44 57
The mode is 57 because it occurs the most times.
o An outlier is a data entry that is far removed from the other entries in the data set.

COMPARING THE MEAN, MEDIAN AND MODE
A 29-year-old employee joins the company and the ages of the employees are now:
53 32 61 57 39 44 57 29
Recalculate the mean, the median, and the mode. Which measure of central tendency was affected when this new age was added?
Mean = 46.5
Median = 48.5
Mode = 57
The mean takes every value into account, but is affected by the outlier.
The median and mode are resistant to extreme values: they depend on the order and frequency of the entries, not on how far an extreme entry lies from the rest.
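The before-and-after comparison can be checked with Python's standard `statistics` module. This is a minimal sketch; `multimode` (Python 3.8+) is used instead of `mode` so that a bimodal data set would return both modes.

```python
from statistics import mean, median, multimode

ages_before = [53, 32, 61, 57, 39, 44, 57]   # the seven employees
ages_after = ages_before + [29]              # after the 29-year-old joins

for label, ages in (("before", ages_before), ("after", ages_after)):
    print(f"{label}: mean={mean(ages)}, median={median(ages)}, mode={multimode(ages)}")
```

Running it shows the mean dropping from 49 to 46.5 and the median from 53 to 48.5, while the mode stays at 57.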

WEIGHTED MEAN
A weighted mean is the mean of a data set whose entries have varying weights. A weighted mean is given by
$\bar{x} = \frac{\sum (x \cdot w)}{\sum w}$
where w is the weight of each entry x.
Grades in a statistics class are weighted as follows:
Tests are worth 50% of the grade, homework is worth 30% of the grade and the final is worth 20% of the grade. A student receives a total of 80 points on tests, 100 points on homework, and 85 points on his final. What is his current grade?

WEIGHTED MEAN
Begin by organizing the data in a table.
            Score, x    Weight, w    xw
Tests       80          0.50         40
Homework    100         0.30         30
Final       85          0.20         17
                        ∑w = 1.00    ∑(x · w) = 87
$\bar{x} = \frac{\sum (x \cdot w)}{\sum w} = \frac{87}{1.00} = 87$
The student’s current grade is 87%.
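The weighted-mean formula translates directly into code. A minimal Python sketch of this example follows; the function name `weighted_mean` and the dictionary layout are illustrative choices.

```python
scores = {"Tests": 80, "Homework": 100, "Final": 85}
weights = {"Tests": 0.50, "Homework": 0.30, "Final": 0.20}


def weighted_mean(values, weights):
    """Sum of (x * w) divided by the sum of the weights."""
    total = sum(values[key] * weights[key] for key in values)
    return total / sum(weights.values())


grade = weighted_mean(scores, weights)
print(f"Current grade: {grade:.1f}%")
```

Dividing by the sum of the weights means the same function also works when the weights are given as 50, 30 and 20 rather than 0.50, 0.30 and 0.20.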

MEAN OF A FREQUENCY DISTRIBUTION
The mean of a frequency distribution for a sample is approximated by
$\bar{x} = \frac{\sum (x \cdot f)}{n}$        Note that $n = \sum f$
where x and f are the midpoints and frequencies of the classes.
The following frequency distribution represents the ages of 30 students in a statistics class. Find the mean of the frequency distribution.

MEAN OF A FREQUENCY DISTRIBUTION
Class    Class midpoint, x    Frequency, f    (x · f)
18–25    21.5                 13              279.5
26–33    29.5                 8               236.0
34–41    37.5                 4               150.0
42–49    45.5                 3               136.5
50–57    53.5                 2               107.0
                              n = 30          Σ = 909.0
$\bar{x} = \frac{\sum (x \cdot f)}{n} = \frac{909}{30} = 30.3$
The mean age of the students is 30.3 years.
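The table calculation can be sketched in Python as follows, with variable names chosen for illustration. Each class is represented by its midpoint, which is why the result is only an approximation of the mean.

```python
classes = [(18, 25), (26, 33), (34, 41), (42, 49), (50, 57)]
frequencies = [13, 8, 4, 3, 2]

# Midpoint of each class: (lower limit + upper limit) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

n = sum(frequencies)                                          # n = sum of f
total = sum(x * f for x, f in zip(midpoints, frequencies))    # sum of x * f
grouped_mean = total / n

print(f"n = {n}, sum(x*f) = {total}, mean = {grouped_mean:.1f}")
```

For comparison, the 30 raw ages listed earlier sum to 894, giving a mean of 29.8, so the grouped value of 30.3 is close but not exact.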

SHAPES OF DISTRIBUTIONS
A frequency distribution is symmetric when a vertical line can be drawn through the middle of a graph of the distribution and the resulting halves are approximately mirror images.
A frequency distribution is uniform (or rectangular) when all entries, or classes, in the distribution have equal frequencies. A uniform distribution is also symmetric.
A frequency distribution is skewed if the “tail” of the graph elongates more to one side than to the other. A distribution is skewed left (negatively skewed) if its tail extends to the left. A distribution is skewed right (positively skewed) if its tail extends to the right.

SUMMARY OF SHAPES OF DISTRIBUTIONS
Symmetric (including uniform): Mean = Median
Skewed right: Mean > Median > Mode
Skewed left: Mean < Median < Mode

The range of a data set is the difference between the maximum and minimum data entries in the set.
Range = (Maximum data entry) – (Minimum data entry)
The following data are the closing prices for a certain stock on ten successive Fridays. Find the range.
Stock: 56 56 57 58 61 63 63 67 67 67
The range is 67 – 56 = 11.

VARIANCE AND STANDARD DEVIATION
The population variance of a population data set of N entries is
Population variance = $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$ (“sigma squared”)
The population standard deviation of a population data set of N entries is the square root of the population variance.
Population standard deviation = $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$ (“sigma”)

FINDING THE POPULATION STANDARD DEVIATION
Guidelines (in words and in symbols)
1. Find the mean of the population data set: $\mu = \frac{\sum x}{N}$
2. Find the deviation of each entry: $x - \mu$
3. Square each deviation: $(x - \mu)^2$
4. Add to get the sum of squares: $SS_x = \sum (x - \mu)^2$
5. Divide by N to get the population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
6. Find the square root of the variance to get the population standard deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$

FINDING THE SAMPLE STANDARD DEVIATION
Guidelines (in words and in symbols)
1. Find the mean of the sample data set: $\bar{x} = \frac{\sum x}{n}$
2. Find the deviation of each entry: $x - \bar{x}$
3. Square each deviation: $(x - \bar{x})^2$
4. Add to get the sum of squares: $SS_x = \sum (x - \bar{x})^2$
5. Divide by n – 1 to get the sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
6. Find the square root of the variance to get the sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

The three quartiles, Q1, Q2, and Q3, approximately divide an ordered data set into four equal parts. Q2 is the median of the data set. Q1 is the median of the data below Q2. Q3 is the median of the data above Q2.

FINDING QUARTILES
The quiz scores for 15 students are listed below. Find the first, second and third quartiles of the scores.
28 44 48 51 43 30 55 44 50 33 45 37 37 42 38
Order the data.
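The standard deviation guidelines and the median-of-halves definition of the quartiles can be sketched in Python as below. The function names are illustrative. The stock prices are reused purely as sample input for the standard deviation (the slides only compute their range), and the quartile function follows the definition given here, which can differ slightly from the interpolation rules used by library routines.

```python
from math import sqrt
from statistics import median


def population_std(data):
    """Guidelines 1-6 for a population: divide the sum of squares by N."""
    mu = sum(data) / len(data)
    sum_of_squares = sum((x - mu) ** 2 for x in data)
    return sqrt(sum_of_squares / len(data))


def sample_std(data):
    """Guidelines 1-6 for a sample: divide the sum of squares by n - 1."""
    x_bar = sum(data) / len(data)
    sum_of_squares = sum((x - x_bar) ** 2 for x in data)
    return sqrt(sum_of_squares / (len(data) - 1))


def quartiles(data):
    """Return (Q1, Q2, Q3): Q2 is the median, Q1 and Q3 are the medians
    of the entries below and above the position of Q2."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    # With an odd number of entries, the middle entry belongs to neither half.
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


stock = [56, 56, 57, 58, 61, 63, 63, 67, 67, 67]
scores = [28, 44, 48, 51, 43, 30, 55, 44, 50, 33, 45, 37, 37, 42, 38]

print("range:", max(stock) - min(stock))
print("population std:", round(population_std(stock), 3))
print("sample std:", round(sample_std(stock), 3))
print("quartiles:", quartiles(scores))
```

The sample standard deviation is always a little larger than the population value computed from the same numbers, because the sum of squares is divided by n – 1 instead of N.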
Lower half: 28 30 33 37 37 38 42
Upper half: 44 44 45 48 50 51 55
Q1 = 37, Q2 = 43, Q3 = 48
About one fourth of the students scored 37 or less; about one half scored 43 or less; and about three fourths scored 48 or less.

INTERQUARTILE RANGE
The interquartile range (IQR) of a data set is the difference between the third and first quartiles.
Interquartile range (IQR) = Q3 – Q1.
Example: The quartiles for 15 quiz scores are listed below. Find the interquartile range.
Q1 = 37    Q2 = 43    Q3 = 48
(IQR) = Q3 – Q1 = 48 – 37 = 11
The quiz scores in the middle portion of the data set vary by at most 11 points.

BOX AND WHISKER PLOT
A box-and-whisker plot is an exploratory data analysis tool that highlights the important features of a data set. The five-number summary is used to draw the graph.
• The minimum entry
• Q1
• Q2 (median)
• Q3
• The maximum entry
Use the data from the 15 quiz scores to draw a box-and-whisker plot.
28 30 33 37 37 38 42 43 44 44 45 48 50 51 55

BOX AND WHISKER PLOT
Five-number summary
• The minimum entry: 28
• Q1: 37
• Q2 (median): 43
• Q3: 48
• The maximum entry: 55
[Figure: box-and-whisker plot, “Quiz Scores”. Horizontal axis from 28 to 56 in steps of 4, with 28, 37, 43, 48 and 55 marked.]

NONSTATISTICAL HYPOTHESIS TESTING...
A criminal trial is an example of hypothesis testing without the statistics. In a trial a jury must decide between two hypotheses. The null hypothesis is
H0: The defendant is innocent
The alternative hypothesis or research hypothesis is
H1: The defendant is guilty
The jury does not know which hypothesis is true. They must make a decision on the basis of the evidence presented.

NONSTATISTICAL HYPOTHESIS TESTING...
In the language of statistics, convicting the defendant is called rejecting the null hypothesis in favor of the alternative hypothesis. That is, the jury is saying that there is enough evidence to conclude that the defendant is guilty (i.e., there is enough evidence to support the alternative hypothesis).
If the jury acquits, it is stating that there is not enough evidence to support the alternative hypothesis.
Notice that the jury is not saying that the defendant is innocent, only that there is not enough evidence to support the alternative hypothesis. That is why we never say that we accept the null hypothesis, although most people in industry will say “We accept the null hypothesis”.

NONSTATISTICAL HYPOTHESIS TESTING...
There are two possible errors.
A Type I error occurs when we reject a true null hypothesis. That is, a Type I error occurs when the jury convicts an innocent person. We would want the probability of this type of error to be very small [maybe 0.001, beyond a reasonable doubt] for a criminal trial where a conviction results in the death penalty, whereas for a civil trial, where conviction might result in someone having to “pay for damages to a wrecked auto”, we would be willing for the probability to be larger [0.49, preponderance of the evidence].
P(Type I error) = α [usually 0.05 or 0.01]

NONSTATISTICAL HYPOTHESIS TESTING...
A Type II error occurs when we don’t reject a false null hypothesis [accept the null hypothesis]. That occurs when a guilty defendant is acquitted.
In practice, this type of error is by far the most serious mistake we normally make. For example, suppose we test the hypothesis that the amount of medication in a heart pill is equal to a value which will cure your heart problem, and we “accept the null hypothesis that the amount is ok”. Later on we find out that the average amount is WAY too large and people die from “too much medication” [I wish we had rejected the hypothesis and thrown the pills in the trash can]. By then it’s too late, because we shipped the pills to the public.

NONSTATISTICAL HYPOTHESIS TESTING...
The probability of a Type I error is denoted as α (Greek letter alpha). The probability of a Type II error is β (Greek letter beta).
The two probabilities are inversely related. Decreasing one increases the other, for a fixed sample size.
In other words, you can’t have α and β both really small for any old sample size.
You may have to take a much larger sample size, or in the court example, you need much more evidence.

TYPES OF ERRORS...
A Type I error occurs when we reject a true null hypothesis (i.e. Reject H0 when it is TRUE).
A Type II error occurs when we don’t reject a false null hypothesis (i.e. Do NOT reject H0 when it is FALSE).

NONSTATISTICAL HYPOTHESIS TESTING...
The critical concepts are these:
1. There are two hypotheses, the null and the alternative hypotheses.
2. The procedure begins with the assumption that the null hypothesis is true.
3. The goal is to determine whether there is enough evidence to infer that the alternative hypothesis is true, or the null is not likely to be true.
4. There are two possible decisions:
Conclude that there is enough evidence to support the alternative hypothesis. Reject the null.
Conclude that there is not enough evidence to support the alternative hypothesis. Fail to reject the null.

CONCEPTS OF HYPOTHESIS TESTING (1)...
One of the two hypotheses is called the null hypothesis and the other the alternative or research hypothesis. The usual notation is:
H0: the ‘null’ hypothesis
H1: the ‘alternative’ or ‘research’ hypothesis
The null hypothesis (H0) will always state that the parameter equals the value specified in the alternative hypothesis (H1).

CONCEPTS OF HYPOTHESIS TESTING...
Consider mean demand for computers during assembly lead time. Rather than estimate the mean demand, our operations manager wants to know whether the mean is different from 350 units. In other words, someone is claiming that the mean demand is 350 units and we want to check this claim out to see if it appears reasonable. We can rephrase this request into a test of the hypothesis:
H0: μ = 350
Thus, our research hypothesis becomes:
H1: μ ≠ 350
Recall that the standard deviation [σ] was assumed to be 75, the sample size [n] was 25, and the sample mean [$\bar{x}$] was calculated to be 370.16.

CONCEPTS OF HYPOTHESIS TESTING...
For example, if we’re trying to decide whether the mean is not equal to 350, a large value of $\bar{x}$ (say, 600) would provide enough evidence. If $\bar{x}$ is close to 350 (say, 355) we could not say that this provides a great deal of evidence to infer that the population mean is different from 350.

CONCEPTS OF HYPOTHESIS TESTING ...
The two possible decisions that can be made:
Conclude that there is enough evidence to support the alternative hypothesis (also stated as: reject the null hypothesis in favor of the alternative).
Conclude that there is not enough evidence to support the alternative hypothesis (also stated as: failing to reject the null hypothesis in favor of the alternative).
NOTE: we do not say that we accept the null hypothesis if a statistician is around...

CONCEPTS OF HYPOTHESIS TESTING ...
The testing procedure begins with the assumption that the null hypothesis is true. Thus, until we have further statistical evidence, we will assume:
H0: μ = 350 (assumed to be TRUE)
The next step will be to determine the sampling distribution of the sample mean $\bar{x}$ assuming the true mean is 350.
$\bar{x}$ is normal with mean 350 and standard deviation 75/SQRT(25) = 15.

IS THE SAMPLE MEAN IN THE GUTS OF THE SAMPLING DISTRIBUTION??
THREE WAYS TO DETERMINE THIS:
Unstandardized test statistic: Is $\bar{x}$ in the guts of the sampling distribution? Depends on what you define as the “guts” of the sampling distribution. If we define the guts as the center 95% of the distribution [this means α = 0.05], then the critical values that define the guts will be 1.96 standard deviations of X-Bar on either side of the mean of the sampling distribution [350], or
UCV = 350 + 1.96*15 = 350 + 29.4 = 379.4
LCV = 350 - 1.96*15 = 350 - 29.4 = 320.6

UNSTANDARDIZED TEST STATISTIC APPROACH
STANDARDIZED TEST STATISTIC APPROACH
P-VALUE APPROACH

STATISTICAL CONCLUSIONS:
Unstandardized Test Statistic: Since LCV (320.6) < $\bar{x}$ (370.16) < UCV (379.4), we fail to reject the null hypothesis.
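The three approaches named above can be compared side by side in a short Python sketch. The numbers (350, 75, 25, 370.16, α = 0.05) come from the example; the standardized and p-value formulas are the usual two-sided z-test ones and are an assumption here, since the slides only work through the unstandardized version. `NormalDist` requires Python 3.8 or later.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 350        # mean claimed by the null hypothesis
sigma = 75        # assumed population standard deviation
n = 25            # sample size
x_bar = 370.16    # observed sample mean
alpha = 0.05      # significance level (the "guts" are the central 95%)

std_error = sigma / sqrt(n)                      # 75 / 5 = 15
z_crit = NormalDist().inv_cdf(1 - alpha / 2)     # about 1.96

# 1. Unstandardized test statistic: compare x_bar with the critical values.
lcv = mu_0 - z_crit * std_error
ucv = mu_0 + z_crit * std_error
reject_unstandardized = not (lcv < x_bar < ucv)

# 2. Standardized test statistic: compare z with the critical z value.
z = (x_bar - mu_0) / std_error
reject_standardized = abs(z) > z_crit

# 3. p-value: two-sided tail probability of a result at least this extreme.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_p_value = p_value < alpha

print(f"LCV = {lcv:.1f}, UCV = {ucv:.1f}, reject: {reject_unstandardized}")
print(f"z = {z:.3f}, z critical = {z_crit:.2f}, reject: {reject_standardized}")
print(f"p-value = {p_value:.3f}, reject: {reject_p_value}")
```

All three comparisons are equivalent restatements of the same rule, so they necessarily agree. Here none of them rejects the null hypothesis.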