Introduction to Statistics and Data Science
Definition
Common goal of the following fields: improving decision making through the analysis of data.
Statistics
Data Science
Machine Learning
Data Mining
17th~18th centuries : the foundations of probability theory are laid
19th century : probability distributions come into use (Laplace, Gauss…)
1940’s : the first neural network is introduced as a mathematical model
1950’s : classification and pattern-recognition problems are solved
1956~1960 : ‘machine learning’ and ‘artificial intelligence’ are developed
1960~ : deep learning, vision, natural-language processing
1980~ : ‘data mining’ is used with big data
2000~ : visualization
Definition
Statistics is the branch of science that deals with the collection and analysis of data
“Statistics = Data Science?” (C.F. ’s , 1997)
-. Availability of large/complex data sets in massive databases
-. Growing use of computational algorithms and models
-. Statistics can be renamed “data science”
What you have to learn…
Communication
Data ethics & regulation
Computer science & High performance computing
Machine learning
Statistics & Probability
Data visualization
Domain Expertise
Population and Sample
Population: all the subjects of interest
-. Infinite population
-. Finite population
Sample: a part of the population
Data types
Discrete data
: Countable data, categorical data
Example: gender (male or female), number of defects
Continuous data
: Uncountable data, typically expressed as real numbers
Example: height, weight
Representative measurements
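Mean: \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
Median: m = x_{((n+1)/2)} if n is an odd number; m = \frac{1}{2} \left( x_{(n/2)} + x_{(n/2+1)} \right) if n is an even number, where x_{(k)} is the k-th smallest value
Mode: the most frequent value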
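The three representative measurements can be checked by hand on a small data set. The sketch below is only an illustration: the list `x` is made-up data, not taken from the slides, and the median branch follows the odd/even rule above.

```python
import statistics

x = [2, 3, 3, 5, 7, 10]  # illustrative data (assumption, not from the slides)

n = len(x)
mean = sum(x) / n  # (1/n) * sum of all x_i

xs = sorted(x)  # the median is defined on the ordered values
if n % 2 == 1:
    median = xs[n // 2]  # odd n: the single middle value
else:
    median = (xs[n // 2 - 1] + xs[n // 2]) / 2  # even n: average of the two middle values

mode = statistics.mode(x)  # the most frequent value

print(mean, median, mode)
```

The standard library also provides `statistics.mean` and `statistics.median`, which should agree with the manual computation.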
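Dispersion measurements
Variance: S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
Standard deviation: s = \sqrt{S^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
Range: R = \text{maximum} - \text{minimum}
Interquartile range (IQR): IQR = Q_3 - Q_1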
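As a rough check of the dispersion formulas, the following sketch computes each measure directly with NumPy. The array `x` is illustrative data (an assumption), and the quartiles use `np.percentile` with its default linear interpolation, which is only one of several quartile conventions.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative data (assumption)

n = x.size
xbar = x.mean()

var = ((x - xbar) ** 2).sum() / (n - 1)  # sample variance, divisor n-1
std = np.sqrt(var)                       # standard deviation = sqrt(variance)
rng = x.max() - x.min()                  # range = maximum - minimum

q1, q3 = np.percentile(x, [25, 75])      # first and third quartiles
iqr = q3 - q1                            # interquartile range

print(var, std, rng, iqr)
```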
Example summary output: count 3.0, mean 2.0, std 1.0, min 1.0, 25% 1.5, 50% 2.0, 75% 2.5, max 3.0
np.sum(x, axis)
-. axis = 0: column-wise sum
-. axis = 1: row-wise sum
-. axis = None : total sum of all elements
np.mean(x, axis)
-. axis = 0: column-wise mean
-. axis = 1: row-wise mean
-. axis = None : mean of all elements
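The effect of the `axis` argument is easiest to see on a small two-dimensional array. This is a minimal sketch; the array `x` is made up for illustration.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])  # 2 rows, 3 columns (illustrative data)

col_sums = np.sum(x, axis=0)     # one sum per column: [5, 7, 9]
row_sums = np.sum(x, axis=1)     # one sum per row: [6, 15]
total = np.sum(x, axis=None)     # sum of all elements: 21

col_means = np.mean(x, axis=0)   # one mean per column: [2.5, 3.5, 4.5]
row_means = np.mean(x, axis=1)   # one mean per row: [2.0, 5.0]
grand_mean = np.mean(x)          # axis defaults to None: 3.5

print(col_sums, row_sums, total)
print(col_means, row_means, grand_mean)
```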
np.var(x, axis, ddof)
-. axis = 0: column-wise variance
-. axis = 1: row-wise variance
-. axis = None : total variance of all elements
-. ddof : delta degrees of freedom; the divisor used in the calculation is n - ddof
   ddof = 0: n (the default)
   ddof = 1: n-1
np.std(x, axis, ddof) : standard deviation, with the same axis and ddof arguments
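The difference between the two divisors shows up clearly on a very small sample. In this sketch the values 1, 2, 3 are chosen only for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # illustrative data (assumption)

pop_var = np.var(x, ddof=0)      # divisor n:   2 / 3
sample_var = np.var(x, ddof=1)   # divisor n-1: 2 / 2 = 1.0
sample_std = np.std(x, ddof=1)   # square root of the sample variance: 1.0

print(pop_var, sample_var, sample_std)
```

Because NumPy defaults to `ddof=0`, `ddof=1` has to be passed explicitly whenever the n-1 sample formulas are intended.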
np.percentile(x, q, axis)
-. axis = 0: column-wise percentiles
-. axis = 1: row-wise percentiles
-. axis = None : percentiles of all elements
-. q : a sequence of percentiles between 0 and 100
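A short sketch of `np.percentile` with a sequence `q` and different `axis` values follows. The array is illustrative, and the results rely on NumPy's default linear interpolation between data points.

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4],
              [5, 6]])  # 3 rows, 2 columns (illustrative data)

# Quartiles of all six values (axis=None is the default)
quartiles_all = np.percentile(x, [25, 50, 75])

# Quartiles per column: the result has one row per requested percentile
quartiles_cols = np.percentile(x, [25, 50, 75], axis=0)

# Median (50th percentile) of each row
median_rows = np.percentile(x, 50, axis=1)

print(quartiles_all)
print(quartiles_cols)
print(median_rows)
```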
def fname(x): …
-. def : keyword that defines a function
-. fname : the function name
-. x : inputs (parameters)
-. y : outputs (the value handed back with return)
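As one possible example of a user-defined function, the sketch below wraps a five-number summary in `def`. The name `five_number_summary` is hypothetical and chosen only for illustration; it is not a function from the slides or from NumPy.

```python
import numpy as np

def five_number_summary(x):
    """Return (min, Q1, median, Q3, max) of the input values."""
    x = np.asarray(x, dtype=float)                # accept lists as well as arrays
    q1, q2, q3 = np.percentile(x, [25, 50, 75])   # quartiles, default interpolation
    return x.min(), q1, q2, q3, x.max()           # the output y is a tuple of five values

y = five_number_summary([1, 2, 3])  # illustrative input
print(y)
```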
Practice
Practice : computation
Practice : computation using axis
Practice
Practice : 5 measures
Frequency Table
For discrete data
The frequency table is a numeric table that summarizes the data by the frequency of each class
Class : a distinct value or factor level
Frequency : how many times the given value or factor level appears in the data
Relative Frequency (RF) : RF = \frac{\text{Frequency}}{n}, where n is the total number of observations
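A frequency table for discrete data can be built with nothing more than the standard library. This is a minimal sketch, and the class labels in `data` are invented for illustration.

```python
from collections import Counter

data = ["A", "B", "A", "C", "A", "B"]  # illustrative observations (assumption)

n = len(data)
freq = Counter(data)                                        # class -> frequency
rel_freq = {cls: count / n for cls, count in freq.items()}  # RF = frequency / n

print("class  freq  RF")
for cls in sorted(freq):
    print(f"{cls:5}  {freq[cls]:4}  {rel_freq[cls]:.3f}")
```

The relative frequencies always sum to 1, which is a quick sanity check on any frequency table.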
pd.crosstab(index, columns, colnames, margins, margins_name)
-. index : values to group by in the rows
-. columns: values to group by in the columns
-. colnames: names to use for the column grouping
-. margins : if True, adds the row / column totals (margins)
-. margins_name: name of the row or column that will contain the totals
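The arguments listed above can be tried on a small table. The sketch below assumes a made-up `DataFrame` with two categorical columns; the column names and values are illustrative only.

```python
import pandas as pd

# Illustrative data (assumption): two categorical variables
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F"],
    "defect": ["yes", "no", "no", "no", "yes"],
})

table = pd.crosstab(
    index=df["gender"],      # classes shown in the rows
    columns=df["defect"],    # classes shown in the columns
    colnames=["defect"],     # label for the column grouping
    margins=True,            # add row and column totals
    margins_name="Total",    # name used for the totals row / column
)

print(table)
```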
Practice
Practice : Relative frequency
Visualization
Bar graph
A bar graph shows the frequency of each class in the frequency table as the height of a bar
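One way to draw a bar graph from class frequencies is sketched below with matplotlib. The data are illustrative, and the `Agg` backend is selected only so that the script runs without a display; in a notebook that line can be dropped and `plt.show()` called instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
from collections import Counter

data = ["A", "B", "A", "C", "A", "B"]  # illustrative categories (assumption)

freq = Counter(data)
classes = sorted(freq)                 # one bar per class
counts = [freq[c] for c in classes]    # bar heights = frequencies

fig, ax = plt.subplots()
bars = ax.bar(classes, counts)
ax.set_xlabel("Class")
ax.set_ylabel("Frequency")
ax.set_title("Bar graph of class frequencies")
```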
Practice
Visualization
Pie graph
Each class is represented as a slice of the circle.
A bigger slice indicates a bigger relative frequency of the class.
Generally, the relative frequency is shown as a percentage (%).
Practice
Visualization
Pareto graph
Classes are sorted by frequency in descending order
Cumulative relative frequencies are also shown as percentages
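The numbers behind a Pareto graph can be prepared with pandas before any plotting. In this sketch the defect labels are invented for illustration; the bars would use the `frequency` column and the cumulative line the `cum_pct` column.

```python
import pandas as pd

# Illustrative defect records (assumption)
defects = pd.Series(
    ["scratch", "dent", "scratch", "crack", "scratch", "dent", "scratch", "stain"]
)

freq = defects.value_counts()                  # sorted in descending order by default
cum_pct = freq.cumsum() / freq.sum() * 100     # cumulative relative frequency in %

pareto = pd.DataFrame({"frequency": freq, "cum_pct": cum_pct})
print(pareto)
```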
Practice
Practice : Pareto chart
Frequency table
For continuous data
Class interval : the interval covered by a class; its lower limit and upper limit should be shown.
Class representative value : the midpoint of a class interval
Frequency : the number of observations in the class interval
Relative Frequency (RF) : RF = \frac{\text{Frequency}}{n}, where n is the total number of observations
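For continuous data, the same table can be sketched with `np.histogram`. The heights and the class limits in `edges` are illustrative assumptions. Note that NumPy treats every interval as closed on the left and open on the right, except the last one, which also includes its upper limit.

```python
import numpy as np

# Illustrative heights in cm (assumption)
heights = np.array([151, 158, 162, 164, 167, 171, 173, 176, 182, 188], dtype=float)
edges = np.array([150, 160, 170, 180, 190], dtype=float)  # class limits (assumption)

freq, _ = np.histogram(heights, bins=edges)   # observations per class interval
midpoints = (edges[:-1] + edges[1:]) / 2      # class representative values
rel_freq = freq / heights.size                # RF = frequency / n

print("interval     mid  freq  RF")
for lo, hi, mid, f, rf in zip(edges[:-1], edges[1:], midpoints, freq, rel_freq):
    print(f"[{lo:.0f}, {hi:.0f})  {mid:5.0f}  {f:4d}  {rf:.2f}")
```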
Frequency table
Practice
Visualization
Histogram
The X axis shows the class intervals (bins), marked by their representative values
The Y axis shows frequencies or relative frequencies
Practice : Histogram
Visualization
Box-whisker plot
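The box is drawn from Q1 to Q3, with a line at Q2 (the median)
Each whisker extends from the box by at most (IQR * 1.5), ending at the most extreme data point within that distance
Often, the mean value is also shown.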
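The quantities a box-whisker plot is drawn from can be computed directly. The sketch below assumes the common 1.5 * IQR convention for whiskers and uses made-up data with one deliberately extreme value; other conventions for whiskers and quartiles exist.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)  # illustrative data

q1, q2, q3 = np.percentile(x, [25, 50, 75])  # edges of the box and the median line
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr   # furthest a whisker may reach below the box
upper_fence = q3 + 1.5 * iqr   # furthest a whisker may reach above the box

# Whiskers stop at the most extreme data points inside the fences
whisker_low = x[x >= lower_fence].min()
whisker_high = x[x <= upper_fence].max()

# Points beyond the fences are usually drawn individually
outliers = x[(x < lower_fence) | (x > upper_fence)]

print(q1, q2, q3, iqr)
print(whisker_low, whisker_high, outliers)
```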
Practice : Histogram
Visualization
Stem-leaf plot
The stem is the bigger unit of each number (for example, the tens digit).
The leaf is the remaining part of the number (for example, the ones digit).
Leaves do not have to be shown in order.
The choice of stem unit can differ between users.
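A stem-leaf plot for two-digit integers can be produced in a few lines of plain Python. This sketch assumes the tens digit as the stem and the ones digit as the leaf, and it sorts the leaves for readability even though that is optional; the data are illustrative.

```python
from collections import defaultdict

data = [12, 15, 21, 23, 23, 28, 31, 34, 47]  # illustrative values (assumption)

stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)  # stem = tens digit, leaf = ones digit
    stems[stem].append(leaf)

lines = []
for stem, leaves in sorted(stems.items()):
    leaf_text = " ".join(str(leaf) for leaf in leaves)
    lines.append(f"{stem} | {leaf_text}")

print("\n".join(lines))
```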
Practice