Descriptive statistics for continuous features
Fundamentals of Machine Learning for Predictive Data Analytics,
Appendix A
Central tendency
Central tendency refers to the value that is typical of the sample
Arithmetic mean (or sample mean, or mean): the sum of the values divided by the number of values
Median and mode
Median: the middle value when you order the values from lowest to highest
Mode: most commonly occurring value
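The three measures of central tendency can be computed with Python's standard `statistics` module. This is a minimal sketch: the `heights` list is a hypothetical sample chosen for illustration, not the data shown in the figures.

```python
import statistics

# Hypothetical sample of heights in cm (NOT the data from Fig 1 or Fig 3).
heights = [142, 148, 150, 150, 153, 157, 164]

# Arithmetic mean: sum of the values divided by the number of values.
mean = statistics.mean(heights)

# Median: middle value once the values are ordered from lowest to highest.
median = statistics.median(heights)

# Mode: the most commonly occurring value.
mode = statistics.mode(heights)

print(mean, median, mode)
```

For a sample with an even number of values, `statistics.median` returns the average of the two middle values.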
Variation
Range = max – min
• Very sensitive to outliers
Range(sample_Fig1) = 163 - 140 = 23
Range(sample_Fig3) = 192 - 102 = 90
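The sensitivity of the range to outliers can be seen in a short sketch. The sample and the extreme value of 210 are hypothetical, and `value_range` is an illustrative helper name.

```python
def value_range(values):
    """Range = max - min."""
    return max(values) - min(values)

# Hypothetical sample of heights in cm (not the data from the figures).
heights = [142, 148, 150, 150, 153, 157, 164]

# The same sample with one hypothetical outlier appended.
heights_with_outlier = heights + [210]

range_without = value_range(heights)
range_with = value_range(heights_with_outlier)

print(range_without, range_with)
```

A single extreme value roughly triples the range here, even though the other seven values are unchanged.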
Variance
Variance: the average squared difference between each value and the sample mean
(computed separately for the sample in Fig 1 and for the sample in Fig 3)
Standard deviation
Standard deviation: the square root of the variance, in the same units as the data
sd(sample_Fig1) = 8.08
sd(sample_Fig3) = 31.94
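Variance and standard deviation can be worked through step by step. This sketch assumes the sample form of the variance, with n - 1 in the denominator; the formula itself did not survive in the text above, so that choice is an assumption. The data are hypothetical and do not reproduce the values quoted for Fig 1 or Fig 3.

```python
import math
import statistics

# Hypothetical sample of heights in cm (not the data from the figures).
heights = [142, 148, 150, 150, 153, 157, 164]

n = len(heights)
mean = sum(heights) / n

# Squared difference between each value and the sample mean.
squared_diffs = [(x - mean) ** 2 for x in heights]

# Sample variance: ASSUMES the n - 1 denominator.
variance = sum(squared_diffs) / (n - 1)

# Standard deviation: square root of the variance.
sd = math.sqrt(variance)

# Cross-check against the standard library, which also uses n - 1.
assert math.isclose(variance, statistics.variance(heights))
assert math.isclose(sd, statistics.stdev(heights))

print(variance, sd)
```

Because the variance is in squared units, the standard deviation is usually the easier number to interpret alongside the mean.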
Percentiles
ith percentile: a proportion i/100 of the values in a sample are equal to or lower than the ith percentile
1st and 3rd quartile
• Lower quartile (or 1st quartile)
the median of the lower half of the data
• Upper quartile (3rd quartile)
the median of the upper half of the data
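The percentile and quartile definitions above can be sketched as follows. Two points are assumptions: for a sample with an odd number of values the middle value is left out of both halves (other conventions exist), and the helper names `fraction_at_or_below` and `quartiles` are illustrative. The data are hypothetical.

```python
import statistics

# Hypothetical sample of heights in cm (not the data from the figures).
heights = [142, 148, 150, 150, 153, 157, 164]

def fraction_at_or_below(values, threshold):
    """Proportion of the values that are equal to or lower than threshold."""
    return sum(1 for v in values if v <= threshold) / len(values)

def quartiles(values):
    """Lower and upper quartile as medians of the lower and upper halves.

    ASSUMPTION: with an odd number of values, the middle value is
    excluded from both halves. Requires at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return statistics.median(lower_half), statistics.median(upper_half)

lower_q, upper_q = quartiles(heights)
share = fraction_at_or_below(heights, lower_q)

print(lower_q, upper_q, share)
```

Library routines such as `statistics.quantiles` interpolate between values, so they can return slightly different quartiles from the median-of-halves rule on small samples.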
Descriptive statistics for categorical features
Frequency count and proportion
Mode
Mode – most frequent level
Second mode – 2nd most frequent level
• Example
Mode → “guard”
2nd mode → “forward”
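Frequency counts, proportions, mode, and second mode for a categorical feature can be computed with `collections.Counter`. The `positions` list is hypothetical; it is arranged only so that the result agrees with the example above, and the actual counts behind that example are not known.

```python
from collections import Counter

# Hypothetical values of a categorical feature (player position).
positions = ["guard", "forward", "guard", "center", "forward", "guard"]

# Frequency count of each level.
counts = Counter(positions)

# Proportion of each level: count divided by the sample size.
proportions = {level: count / len(positions) for level, count in counts.items()}

# most_common(2) returns the two most frequent levels with their counts.
(mode, mode_count), (second_mode, second_count) = counts.most_common(2)

print(counts)
print(proportions)
print(mode, second_mode)
```

If two levels are tied in frequency, `most_common` returns them in the order they were first seen, so the mode and second mode are not uniquely defined in that case.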