Foundations of Data Analytics and Machine Learning
Summer 2022
• Algorithms and Big O Notation
• Data Retrieval
• Data Preparation
• Plotting and Visualization
• Making Predictions
10 min quick review!
Important definitions
Ø Task: Flower classification
Ø Target (label, Class)
Ø Features
Iris setosa
Iris versicolor
Iris virginica
Important definitions
Ø Task: Flower classification
Ø Target (label, Class): Setosa, Versicolor, Virginica
Ø Features: Petal length, Petal width, Sepal length, Sepal width
Ø Model
Ø Prediction
Ø Data point (sample)
Prediction:
Versicolor
Supervised/Unsupervised Learning
ØMachine Learning systems can be classified according to the amount and type of supervision they get during training.
Ø Supervised
Ø k-Nearest Neighbours, Linear Regression, Logistic Regression, Decision Trees, Neural Networks, and many more
Ø Unsupervised
Ø K-Means, Principal Component Analysis
Ø Reinforcement Learning
Ø Reward for a series of actions
Ø Solve the maze to get the banana!
Classification vs. Regression
ØClassification: Discrete target
ØSeparate the dataset
ØApples or oranges?
ØDog or cat?
ØHandwritten digit recognition
ØRegression: Continuous target
ØFit the dataset
ØPrice of a house
ØRevenue of a company
ØAge of a tree
[Figure: two scatter plots over Feature #1 and Feature #2; classification separates the data points into classes, while regression fits a curve through them]
Overfitting vs Underfitting
Underfit: high training error and high test error
Good fit: acceptable training error and test error
Overfit: perfect (zero) training error and high test error
Validation Set
ØWe still want to track the loss/accuracy on a data set not used for training
ØIdea: set aside a separate data set, called the validation set
ØTrack validation accuracy in the training curve
ØMake decisions about model architecture using the validation set
Cross-Validation
ØSplitting training and validation data into several folds during training
ØThis is known as k-fold Cross-Validation
ØModel parameters are selected based on the average validation performance achieved over the k folds
Source: scikit-learn
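A minimal sketch of how this looks with scikit-learn (the Iris data and the 3-neighbour model are illustrative choices, not part of the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)  # illustrative model choice

# cv=5 splits the data into 5 folds; each fold serves once as the
# validation set while the remaining 4 folds are used for training.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())  # average accuracy over the 5 folds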
ØToday’s focus is on Data Exploration
—Python Checklist
—Algorithms and Big O Notation
—Data Retrieval
—NumPy
—Data Preparation
—Plotting and Visualizing Data
—Matplotlib
—Pandas
Some Review
Basic Python Checklist
Checklist: Python Basics
qData Types
  qSingle: int, float, bool
  qMultiple: str, list, set, tuple, dict
  qindex [], slice [::], mutability
qConditionals
  qif, elif, else
qFunctions
  qdef, return, recursion, default vals
qLoops
  qfor, while, range
  qlist comprehension
qOperations
  qarithmetic: +, *, -, /, //, %, **
  qboolean: not, and, or
  qrelational: ==, !=, >, <, >=, <=
qInput/Output
  qprint, end, sep
qFiles
  qopen, close, with
  qread, write
  qCSV
qObject-Oriented Programming (OOP)
  qclass, methods, attributes
  q__init__, __str__, polymorphism
Python Review
ØCoursera - University of Toronto MOOCs
ØLearn to Program: The Fundamentals (https://www.coursera.org/learn/learn-to-program)
ØLearn to Program: Crafting Quality Code (https://www.coursera.org/learn/program-code)
ØGoogle is your (BEST) friend?
ØAPS1070 Board
Algorithms and Big O Notation
What is an Algorithm?
ØAlgorithm: a finite sequence of well-defined, computer-implementable instructions, typically used to solve a class of problems or perform a computation.
ØExample: Add 1 to 1000
ØSum = 0
ØCount = 1
ØAs long as Count ≤ 1000:
Ø Sum = Sum + Count
Ø Count = Count + 1
Ø Computational cost: 2000 adds
More than one way to solve a problem...
ØSmarter way?
1 2 3 4 ... 996 997 998 999 1000
Pair the ends (1 + 1000, 2 + 999, ...): each of the 500 pairs sums to 1001, giving
1000/2 × (1000 + 1) = 500500
ØComputational cost: 1 div + 1 mult + 1 add
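Both approaches, sketched in Python (n = 1000 as in the slide):

n = 1000

# Loop version: roughly 2000 additions (two per iteration).
total = 0
count = 1
while count <= n:
    total = total + count
    count = count + 1

# Closed-form version: 1 division + 1 multiplication + 1 addition.
formula = n // 2 * (n + 1)

print(total, formula)  # both print 500500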
ØOne of the main selection criteria for algorithms in machine learning is computational efficiency.
How efficient is your code?
There are many contributors to computational complexity:
ØHardware: processor(s), memory, cache, etc.
ØOS, version of Python, libraries, drivers
ØPrograms running in the background
ØImplementation dependent
ØChoice of input
ØWhich inputs to test
Is there a better way to compare the computational complexity of two abstract algorithms?
Algorithm Analysis
As the “size” of an algorithm’s input grows:
ØTime: How much longer does it run?
ØSpace: How much memory does it use?
ØHow do we answer these questions?
ØFor now, we will focus on time only.
Basic Lesson
Evaluating an implementation?
Use empirical timing
Evaluating an algorithm?
Use asymptotic analysis
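For empirical timing, Python's timeit module is the standard tool; a small sketch (the function being timed is just an example):

import timeit

def sum_loop(n):
    total = 0
    for i in range(n):
        total += i
    return total

# Average wall-clock time per call over 100 runs, for n = 10000.
t = timeit.timeit(lambda: sum_loop(10_000), number=100) / 100
print(f"{t:.6f} seconds per call")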
Assumptions in Asymptotic Alg. Analysis
Basic operations take constant time:
ØArithmetic
ØAssignment
ØAccess one array index
ØComparing two simple values (is x < 3)
Other operations are summations or products:
ØConsecutive statements are summed
ØLoops are (cost of loop body) × (number of loops)
Worst-Case Analysis
ØIn general, we are interested in three types of performance:
ØBest-case / Fastest
ØAverage-case
ØWorst-case / Slowest
ØWhen determining worst-case, we tend to be pessimistic
ØIf there is a conditional, count the branch that will run the slowest
ØThis will give a loose bound on how slow the algorithm may run
Analyzing Code
ØWhat are the run-times for the following code?
1. for i in range(n):
       x = x + 1

2. for i in range(n):
       for j in range(n):
           x = x + 1

3. for i in range(n):
       for j in range(i+1):
           x = x + 1

Answers (counting roughly 3 basic operations per inner-loop iteration):
1. ≈ 3n
2. ≈ 3n²
3. ≈ 3(1 + 2 + ... + n) = 3n(n+1)/2 ≈ 1.5n² + 1.5n
No Need to be so Exact
ØConstants do not matter
ØConsider 6N² and 20N²
ØWhen N >> 20, the N² term is what is driving the function’s increase
ØLower-order terms are also less important
ØN(N+1)/2 vs. just N²/2
ØThe linear term is inconsequential
We need a better notation for performance that focuses on dominant terms only
Big-Oh Notation (Formally)
ØGiven two functions f(n) & g(n) for input n, we say f(n) is in O(g(n)) if there exist positive constants c and n0 such that
f(n) ≤ c·g(n) for all n ≥ n0
ØEventually g(n) is always an upper bound on f(n) ignoring constants
The Gist of Big-Oh
ØConsider only the most significant (dominant) term in the operation count and remove constant multipliers:
§ 5n + 3 → O(n)
§ 7n + 0.5n² + 2000 → O(n²)
§ 300n + 12 + n log n → O(n log n)
§ 7n + 0.5n² + 2000 + 2ⁿ → O(2ⁿ)
True or false?
1. 4 + 3n is O(n): True
2. n + 2 log n is O(log n): False (the linear term dominates)
3. log n + 2 is O(1): False (log n is unbounded)
4. n⁵⁰ is O(1.1ⁿ): True (exponential always beats polynomial eventually, but this would be a loose upper bound)
Big Oh: Common Categories
From fastest to slowest:
§ O(1): constant (or O(k) for constant k)
§ O(log n): logarithmic
§ O(n): linear
§ O(n log n): “n log n”
§ O(n²): quadratic
§ O(nᵏ): polynomial (where k is constant)
§ O(kⁿ): exponential (where constant k > 1)
Complexity Analysis of Code
for i in range(N):
    sum += i
→ O(N) – linear time

for i in range(N):
    for j in range(N):
        sum += i + j
→ O(N²) – quadratic time

for i in range(N):
    for j in range(N):
        for k in range(N):
            sum += i + j + k
→ O(N³) – cubic time
Data Exploration
Scientific Computing Tools for Python
Ø Scientific computing in Python builds upon a small core of packages:
Ø NumPy, the fundamental package for numerical computation. It defines the numerical array and matrix types and basic operations on them.
Ø The SciPy library, a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics and much more.
Ø Matplotlib, a mature and popular plotting package that provides publication-quality 2D plotting as well as rudimentary 3D plotting.
Ø Data and computation:
Ø pandas, providing high-performance, easy to use data structures.
Ø scikit-learn is a collection of algorithms and tools for machine learning.
Source: https://www.scipy.org/about.html
ØLet’s start with NumPy. Among other things, NumPy contains:
Ø A powerful N-dimensional array object.
Ø Sophisticated (broadcasting/universal) functions.
Ø Tools for integrating C/C++ and Fortran code.
Ø Useful linear algebra, Fourier transform, and random number capabilities.
ØBesides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data.
ØMany other python libraries are built on NumPy
ØProvides vectorization of mathematical operations on arrays and matrices, which significantly improves performance
ØThe key to NumPy is the ndarray object, an n-dimensional array of homogeneous data types, with many operations being performed in compiled code for performance.
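A quick sketch of what vectorization buys (the array size is arbitrary; actual speedups depend on hardware):

import numpy as np

a = np.arange(1_000_000)

squares_loop = [x * x for x in a]  # element by element, in interpreted Python
squares_vec = a * a                # one vectorized call, runs in compiled code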
ØThere are several important differences between NumPy arrays and the standard Python sequences:
ØNumPy arrays have a fixed size. Modifying the size means creating a new array.
ØThe elements of a NumPy array must all be of the same data type (although that type can be a Python object).
ØMore efficient mathematical operations than built-in sequence types
ØTo begin, NumPy supports a wider variety of data types than are built-in to the Python language by default. They are defined by the numpy.dtype class and include:
Ø intc (same as a C integer) and intp (used for indexing)
Ø int8, int16, int32, int64
Ø uint8, uint16, uint32, uint64
Ø float16, float32, float64
Ø complex64, complex128
Ø bool_, int_, float_, complex_ are shorthand for defaults.
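Selecting a dtype at creation time, as a quick illustration:

import numpy as np

x = np.array([1, 2, 3], dtype=np.int32)      # 32-bit integers
y = np.array([255], dtype=np.uint8)          # unsigned 8-bit integers
z = np.array([1.5, 2.5], dtype=np.float64)   # double-precision floats
print(x.dtype, y.dtype, z.dtype)             # int32 uint8 float64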
ØThere are a couple of mechanisms for creating arrays in NumPy:
ØConversion from other Python structures (e.g., lists, tuples).
ØBuilt-in NumPy array creation (e.g., arange, ones, zeros, etc.).
ØReading arrays from disk, either from standard or custom formats (e.g. reading in from a CSV file).
Øand others …
ØIn general, any numerical data that is stored in an array-like container can be converted to an ndarray through use of the array() function. The most obvious examples are sequence types like lists and tuples.
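The creation mechanisms above, sketched in code (the CSV line assumes a hypothetical numeric file named data.csv):

import numpy as np

from_list = np.array([[1, 2], [3, 4]])  # conversion from a Python list
ranged = np.arange(0, 10, 2)            # built-in creation: 0, 2, 4, 6, 8
zeros = np.zeros((3, 3))                # built-in creation: a 3x3 array of zeros

# Reading an array from disk (data.csv is a hypothetical file):
# data = np.loadtxt('data.csv', delimiter=',')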
SciPy
ØCollection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and much more
ØPart of the SciPy ecosystem
ØBuilt on NumPy
ØWith SciPy an interactive Python session becomes a data- processing and system-prototyping environment rivaling systems such as MATLAB, IDL, Octave, R-Lab, and SciLab.
ØSciPy’s functionality is implemented in a number of specific sub-modules. These include:
Ø Special mathematical functions (scipy.special) — airy, elliptic, bessel, etc.
Ø Integration (scipy.integrate)
Ø Optimization (scipy.optimize)
Ø Interpolation (scipy.interpolate)
Ø Fourier Transforms (scipy.fftpack)
Ø Signal Processing (scipy.signal)
Ø Linear Algebra (scipy.linalg)
Ø Statistics (scipy.stats)
Ø Multidimensional image processing (scipy.ndimage)
Ø Data IO (scipy.io)
Ø and more!
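Two of these sub-modules in action, as a small illustrative sketch:

from scipy import optimize, stats

# scipy.optimize: find the minimum of f(x) = (x - 2)^2.
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2)
print(result.x)  # approximately 2.0

# scipy.stats: summary statistics of a small sample.
print(stats.describe([1.0, 2.0, 3.0, 4.0]))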
Pandas
ØAdds data structures and tools designed to work with table-like data (similar to Series and Data Frames in R)
ØProvides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation etc.
ØAggregation – computing a summary statistic for groups
Ømin, max, count, sum, prod, mean, median, mode, mad, std, var
ØAllows for handling missing data
Source: http://pandas.pydata.org/
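A small sketch of aggregation and missing-data handling (the DataFrame is made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "price": [100.0, 120.0, np.nan, 90.0],
})

# Aggregation: a summary statistic per group (NaN is skipped).
print(df.groupby("city")["price"].mean())

# Missing data: fill the NaN with the column mean.
print(df["price"].fillna(df["price"].mean()))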
Matplotlib
ØMatplotlib is an incredibly powerful (and beautiful!) 2-D plotting library. It’s easy to use and provides a huge number of examples for tackling unique problems.
ØSimilar to MATLAB
ØAt the center of most matplotlib scripts is pyplot.
ØThe pyplot module is stateful and tracks changes to a figure. All pyplot functions revolve around creating or manipulating the state of a figure.
ØThe plot function can actually take any number of arguments.
ØThe format string argument associated with a pair of sequence objects indicates the color and line type of the plot (e.g. ‘bs’ indicates blue squares and ‘ro’ indicates red circles).
ØGenerally speaking, the x_values and y_values will be numpy arrays and if not, they will be converted to numpy arrays internally.
ØLine properties can be set via keyword arguments to the plot function. Examples include label, linewidth, animated, color, etc…
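A minimal pyplot sketch using the format strings and keyword arguments described above (the data is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0.0, 5.0, 0.5)
plt.plot(x, x ** 2, 'bs', label='quadratic')   # 'bs' = blue squares
plt.plot(x, x ** 3, 'ro', label='cubic')       # 'ro' = red circles
plt.plot(x, x, color='green', linewidth=2, label='linear')
plt.legend()
plt.show()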
Jupyter Notebook
ØAll of these libraries come preinstalled on Google Colab
ØGoogle Colab uses a Jupyter notebook environment that runs in the cloud and requires no setup to use
ØRun in Python 3
ØIncludes all the commonly used machine learning (data science) libraries
Øe.g., NumPy, SciPy, Matplotlib, Pandas, PyTorch, TensorFlow, etc.
ØAlternatively, can use Jupyter notebook on your computer
Let’s take a look at the Intro notebook!
Nearest neighbour classifier
• Output is a class (here, blue or yellow)
• Instance-based learning, or lazy learning: computation is deferred until the classifier is queried
• Flexible approach – no assumptions on data distribution
What class is this?
K-nearest neighbour classifier
How to compute distance?
Euclidean distance: d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
Standardization is the key
x: (Area, # bedrooms, # washrooms)
x:(2500, 5, 5)
x:(2400, 2, 3)
x:(2350, 5, 5)
Euclidean distance
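A sketch of why standardization is the key, using the three houses above:

import numpy as np

X = np.array([[2500, 5, 5],
              [2400, 2, 3],
              [2350, 5, 5]], dtype=float)

# Raw units: the Area feature dominates the distance, so house 2
# (very different layout) looks closer to house 1 than house 3 does.
print(np.linalg.norm(X[0] - X[1]))  # ~100.1
print(np.linalg.norm(X[0] - X[2]))  # 150.0

# Standardize each feature to zero mean and unit variance.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.linalg.norm(Xs[0] - Xs[1]))  # ~3.4
print(np.linalg.norm(Xs[0] - Xs[2]))  # ~2.4: the similar house is now closer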
K-nearest neighbour classifier
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
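A sketch of the linked KNeighborsClassifier API on the Iris task from earlier (the train/test split and k = 5 are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 neighbours
knn.fit(X_train, y_train)                  # lazy learner: mostly stores the data
print(knn.score(X_test, y_test))           # accuracy on held-out data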
ØThursday: Tutorial 1
Ø Python for Data Science
ØProject 1 Deadline: June 4th, 11 PM
ØWeek 3 Lecture – Fundamentals of Learning
ØMachine Learning Taxonomy
ØSupervised Learning
ØKNN
ØDecision Trees
ØUnsupervised Learning
ØK-Means Clustering