Introduction to Statistics and Data Science

• Definition
• Statistics, data science, machine learning, and data mining share a common goal: to improve decision making through the analysis of data.

• 17th to 18th centuries: the foundations of probability theory are laid
• 19th century: probability distributions come into use (Laplace, Gauss…)
• 1940s: the first neural network is introduced as a mathematical model
• 1950s: classification and pattern recognition problems are addressed
• 1956~1960: ‘machine learning’ and ‘artificial intelligence’ are developed
• 1960~: deep learning, vision, natural-language processing
• 1980~: ‘data mining’ is used with big data
• 2000~: visualization

• Definition
• Statistics is the branch of science that deals with the collection and analysis of data
• “Statistics = Data Science?” (C.F. ’s , 1997)
-. Availability of large/complex data sets in massive databases
-. Growing use of computational algorithms and models
-. Statistics can be renamed “data science”
What you have to learn…
Communication
Data ethics & regulation
Computer science & High performance computing
Machine learning
Statistics & Probability
Data visualization
Domain Expertise

• Population and Sample
• Population: all the subjects of interest
-. Infinite population
-. Finite population
• Sample: a part of the population
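
As a rough illustration of the population/sample distinction, the sketch below treats an array as a finite population and draws a sample from it without replacement. The heights, the sizes, and the seed are all made up for illustration.

```python
import numpy as np

# Hypothetical finite population: 1,000 made-up heights in cm.
rng = np.random.default_rng(seed=0)
population = rng.normal(loc=170.0, scale=10.0, size=1000)

# A sample is a part of the population: pick 50 subjects without replacement.
sample = rng.choice(population, size=50, replace=False)

print("population size:", population.size, "mean:", population.mean())
print("sample size:", sample.size, "mean:", sample.mean())
```

The sample mean will usually be close to, but not equal to, the population mean.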

• Data types
• Discrete data: countable data, categorical data
Example: gender (male or female), number of defects
• Continuous data: uncountable data, typically expressed as real numbers
Example: height, weight

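• Representative measurements
• Mean:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
• Median, where $x_{(k)}$ denotes the $k$-th smallest value:
$$m = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is an odd number} \\ \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right) & \text{if } n \text{ is an even number} \end{cases}$$
• Mode: the most frequent number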
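
The three representative measurements (mean, median, mode) can be sketched in plain Python as follows. The function names and the data are illustrative, and ties for the mode are resolved by taking the value encountered first, which is only one possible convention.

```python
from collections import Counter


def mean(xs):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(xs) / len(xs)


def median(xs):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


def mode(xs):
    """Most frequent value; ties go to the value seen first."""
    return Counter(xs).most_common(1)[0][0]


data = [2, 3, 3, 5, 7, 10]  # made-up data
print(mean(data), median(data), mode(data))  # 5.0 4.0 3
```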

• Dispersion measurements
• Variance:
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
• Standard deviation:
$$s = \sqrt{S^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
• Range:
$$R = \text{maximum} - \text{minimum}$$
• Interquartile range (IQR):
$$IQR = Q_3 - Q_1$$
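
A minimal sketch of the sample variance (divisor n-1), standard deviation, and range, written directly from their definitions. The function names and the data are illustrative. The IQR is left out here because the value of a quartile depends on the interpolation convention chosen.

```python
import math


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_std(xs):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


def value_range(xs):
    """Maximum minus minimum."""
    return max(xs) - min(xs)


data = [1, 2, 3]  # made-up data
print(sample_variance(data), sample_std(data), value_range(data))  # 1.0 1.0 2
```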

count 3.0, mean 2.0, std 1.0, min 1.0, 25% 1.5, 50% 2.0, 75% 2.5, max 3.0
np.sum(x, axis)
-. axis = 0: column-wise sum
-. axis = 1: row-wise sum
-. axis = None : total sum of all elements
np.mean(x, axis)
-. axis = 0: column-wise mean
-. axis = 1: row-wise mean
-. axis = None : mean of all elements
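
A short sketch of how the `axis` argument changes the result of `np.sum` and `np.mean` on a two-dimensional array. The array is made up for illustration.

```python
import numpy as np

# 2 rows x 3 columns of made-up values.
x = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = np.sum(x, axis=0)       # one sum per column: [5 7 9]
row_sums = np.sum(x, axis=1)       # one sum per row:    [ 6 15]
total = np.sum(x, axis=None)       # sum of all elements: 21

col_means = np.mean(x, axis=0)     # [2.5 3.5 4.5]
row_means = np.mean(x, axis=1)     # [2. 5.]
overall_mean = np.mean(x, axis=None)  # 3.5

print(col_sums, row_sums, total)
print(col_means, row_means, overall_mean)
```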
np.var(x, axis, ddof)
-. axis = 0: column-wise variance
-. axis = 1: row-wise variance
-. axis = None : total variance of all elements
-. ddof : delta degrees of freedom; the divisor used in the calculation is n - ddof
ddof = 0: n
ddof = 1: n-1
np.std(x, axis, ddof) : standard deviation
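
The effect of `ddof` can be checked on a tiny made-up array. With `ddof=0` (NumPy's default) the divisor is n, and with `ddof=1` it is n-1, which gives the sample variance.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # made-up data, n = 3

var_n = np.var(x, ddof=0)          # divisor n:     2 / 3
var_n_minus_1 = np.var(x, ddof=1)  # divisor n - 1: 2 / 2 = 1.0
std_n_minus_1 = np.std(x, ddof=1)  # square root of the line above: 1.0

print(var_n, var_n_minus_1, std_n_minus_1)
```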

np.percentile(x, q, axis)
-. axis = 0: column-wise percentiles
-. axis = 1: row-wise percentiles
-. axis = None : percentiles of all elements
-. q : a percentile or a sequence of percentiles, each between 0 and 100
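
A small sketch of `np.percentile`, used here to get the quartiles and the IQR. The data are made up, and the results rely on NumPy's default linear interpolation between data points; other quartile conventions can give slightly different values.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # made-up data

# q may be a sequence, so the three quartiles come back in one call.
q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 1.5 2.0 2.5 1.0

# With a 2-D array, axis selects column-wise or row-wise percentiles.
m = np.array([[1, 2, 3],
              [4, 5, 6]])
col_medians = np.percentile(m, 50, axis=0)  # [2.5 3.5 4.5]
row_medians = np.percentile(m, 50, axis=1)  # [2. 5.]
```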
def fname(x): …
-. def : keyword that defines a function
-. x : inputs
-. y : outputs
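
A minimal sketch of the `def` pattern: the function takes an input `x`, computes an output `y`, and returns it. The function `midrange` is a made-up example, not something defined in the slides.

```python
def midrange(x):
    """Return the midpoint between the smallest and largest value of x."""
    y = (min(x) + max(x)) / 2  # y is the output computed from the input x
    return y


print(midrange([1, 2, 3]))  # 2.0
```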

• Practice
• Practice: computation
• Practice: computation using axis
• Practice: 5 measures

• Frequency table
• For discrete data
• A frequency table is a numeric table that summarizes the data by the frequency of each class
• Class: a distinct value or factor
• Frequency: the number of times a given value or factor appears in the data
• Relative frequency (RF):
$$RF = \frac{\text{Frequency}}{n}$$

pd.crosstab(index, columns, colnames, margins, margins_name)
-. index : values to group by in the rows
-. columns: values to group by in the columns
-. colnames: names for the column levels
-. margins : whether to add row / column margins (subtotals)
-. margins_name: name of the row or column that will contain the total
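
A sketch of `pd.crosstab` with the parameters listed above, followed by relative frequencies obtained by dividing by n. The data frame, its column names, and the margin label "Total" are made up for illustration.

```python
import pandas as pd

# Made-up discrete data: 5 inspected items.
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "female"],
    "defect": ["yes", "no", "no", "no", "yes"],
})

table = pd.crosstab(
    index=df["gender"],      # classes shown in the rows
    columns=df["defect"],    # classes shown in the columns
    colnames=["defect"],     # name of the column level
    margins=True,            # add row and column totals
    margins_name="Total",    # label used for the totals
)
print(table)

# Relative frequency = frequency / n
n = len(df)
relative = table / n
print(relative)
```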

• Practice
• Practice: relative frequency

• Visualization
• Bar graph
• A bar graph draws the frequency table, the numeric summary of frequencies per class, with one bar per class

• Practice

• Visualization
• Pie graph
• Each class is represented as a slice of the circle.
• A bigger slice indicates a bigger relative frequency of the class.
• Generally, the relative frequency is shown as a percentage (%).

• Practice

• Visualization
• Pareto graph
• Classes are sorted by frequency in descending order
• Cumulative relative frequencies are also shown as percentages
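
The numbers behind a Pareto graph can be sketched with pandas: frequencies sorted in descending order plus cumulative relative frequencies in percent. The defect labels and the column names are made up, and the plotting itself is left out.

```python
import pandas as pd

# Made-up defect types observed on 8 items.
defects = pd.Series(
    ["scratch", "dent", "scratch", "crack", "scratch", "dent", "stain", "scratch"]
)

freq = defects.value_counts()          # sorted by frequency, descending
relative_pct = freq / freq.sum() * 100  # relative frequency in percent
cumulative_pct = relative_pct.cumsum()  # running total of the percentages

pareto = pd.DataFrame({
    "frequency": freq,
    "relative_pct": relative_pct,
    "cumulative_pct": cumulative_pct,
})
print(pareto)
```

In a Pareto graph the `frequency` column would typically be drawn as bars and the `cumulative_pct` column as a line over them.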

• Practice
• Practice: Pareto chart

• Frequency table
• For continuous data
• Class interval: the interval covered by a class; its lower and upper limits should be shown
• Class representative value: the midpoint of the class interval
• Frequency: the number of observations in the class interval
• Relative frequency (RF):
$$RF = \frac{\text{Frequency}}{n}$$
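
A sketch of a frequency table for continuous data built with `np.histogram`. The heights and the class limits are made up, and the midpoint of each interval is used as the class representative value.

```python
import numpy as np

# Made-up heights in cm and hand-picked class limits.
heights = np.array([151.0, 158.5, 162.0, 163.5, 167.0,
                    168.5, 171.0, 174.5, 178.0, 185.0])
edges = np.array([150, 160, 170, 180, 190])

# Frequency per class interval. Intervals are [lower, upper),
# except the last one, which also includes its upper limit.
freq, _ = np.histogram(heights, bins=edges)

midpoints = (edges[:-1] + edges[1:]) / 2  # class representative values
rel_freq = freq / heights.size            # RF = frequency / n

for lo, hi, mid, f, rf in zip(edges[:-1], edges[1:], midpoints, freq, rel_freq):
    print(f"{lo}-{hi}  representative={mid}  frequency={f}  RF={rf:.2f}")
```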

• Frequency table
• Practice

• Visualization
• Histogram
• The x-axis shows the representative values of the bins
• The y-axis shows frequencies or relative frequencies

• Practice: Histogram

• Visualization
• Box-whisker plot
• The box is drawn from the Q1, Q2, and Q3 values
• Each whisker extends up to 1.5 × IQR beyond the box
• Often, the mean value is also shown.
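
The quantities behind a box-whisker plot can be computed with NumPy as sketched below. This assumes one common convention: each whisker ends at the most extreme data point within 1.5 × IQR of the box, and points beyond that are treated as outliers. The data are made up.

```python
import numpy as np

x = np.array([2.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])  # made-up data

# Box: the three quartiles (NumPy's default linear interpolation).
q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1

# Whisker limits: at most 1.5 * IQR beyond the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
inside = x[(x >= lower_fence) & (x <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Observations outside the fences are drawn as individual points.
outliers = x[(x < lower_fence) | (x > upper_fence)]

print(q1, q2, q3, iqr)
print(whisker_low, whisker_high, outliers, x.mean())
```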

• Practice: Histogram

• Visualization
• Stem-leaf plot
• The stem is the bigger unit of each number.
• The leaf is the remaining part of each number.
• Leaves do not have to be shown in order.
• The choice of stem can differ between users.
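
A minimal text-based stem-leaf plot, assuming non-negative integers with the tens as the stem and the ones digit as the leaf. The function name and the data are made up, and the leaves are sorted here only as a matter of taste.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (value // 10) to the list of its leaves (value % 10)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)


data = [12, 15, 21, 23, 23, 30, 38, 41]  # made-up data
plot = stem_and_leaf(data)

# Print one row per stem, including stems that have no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Choosing a different stem unit, for example hundreds instead of tens, would give a coarser plot of the same data.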

• Practice
