
Lecture 4: Probability Theory and Probabilistic Modeling
Introduction to Machine Learning Semester 1, 2022
Copyright © University of Melbourne 2022. All rights reserved. No part of this publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.


Last time… Concepts and KNN classification
• data, features, classes
• K Nearest Neighbors algorithm
• Application to classification
Today… Probability
• basics / refresher
• distributions and parameterizations
• why probability in ML?
Estimating confidence in different possible outcomes

Probability Theory
“The calculus of probability theory provides us with a formal framework for considering multiple possible outcomes and their likelihood. It defines a set of mutually exclusive and exhaustive possibilities, and associates each of them with a probability — a number between 0 and 1, so that the total probability of all possibilities is 1. This framework allows us to consider options that are unlikely, yet not impossible, without reducing our conclusions to content-free lists of every possibility.”
From Probabilistic Graphical Models: Principles and Techniques (2009; Koller and Friedman) http://pgm.stanford.edu/intro.pdf

(Very) Basics of Probability Theory
P(A=a): the probability that random variable A takes value a
• 0 ≤ P(A=a) ≤ 1
• the fraction of times the event A=a is true in independent trials
• P(True) = 1, P(False) = 0

Example: a deck of 52 cards
• 13 ranks (ace, 2–10, jack, queen, king)
• of each of four suits (clubs, spades = black; hearts, diamonds = red)
• A is a random variable denoting the value of a randomly selected card.
• P(A = queen) = ?   P(A = red) = ?   P(A = heart) = ?
https://commons.wikimedia.org/wiki/File:English_pattern_playing_cards_deck.svg

Joint Probability
P(A, B): the joint probability of two events A and B, i.e., the probability of both A and B occurring; P(A, B) = P(A ∩ B)
• P(A = ace, B = heart) = ?   P(A = heart, B = red) = ?
• P(A, B) = P(A) × P(B) iff A and B are independent

Conditional Probability
P(A|B): the conditional probability of A given B, i.e., the probability of A=a given the observation B=b; P(A|B) = P(A ∩ B) / P(B)
• P(A = ace | B = heart) = ?   P(A = heart | B = red) = ?

Notation
1. P(A = x): the probability that random variable A takes on value x
2. P(A): the probability distribution over random variable A
3. P(x): often used as a short-hand for 1. if clear from the context

What type of probability? https://imgs.xkcd.com/comics/conditional_risk.png

Rules of Probability I
• Independence: A and B are independent iff P(A ∩ B) = P(A)P(B)
• Disjoint events: the probability of two disjoint events, such that A ∩ B = ∅, is P(A or B) = P(A) + P(B)
  (e.g., draw an ace or a king: A = draw an ace; B = draw a king)
• Product rule: P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)
• Chain rule: P(A1 ∩ ... ∩ An) = P(A1) P(A2|A1) P(A3|A1 ∩ A2) ... P(An | A1 ∩ ... ∩ An−1)
  We can choose the order of factorization, e.g.:
  P(July, 5°C, sick) = P(July) × P(5°C | July) × P(sick | 5°C, July)
                     = P(5°C) × P(sick | 5°C) × P(July | 5°C, sick)
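
As a quick sanity check of these rules, here is a minimal sketch (plain Python 3, standard library only; not part of the original slides) that computes the card-deck probabilities above by counting equally likely outcomes:

```python
from fractions import Fraction
from itertools import product

# The 52-card deck as (rank, suit) pairs -- 52 equally likely outcomes.
ranks = ["ace", "2", "3", "4", "5", "6", "7", "8", "9", "10", "jack", "queen", "king"]
suits = ["clubs", "spades", "hearts", "diamonds"]
deck = list(product(ranks, suits))

def p(event, outcomes=None):
    """P(event) as the fraction of equally likely outcomes for which `event` holds."""
    outcomes = deck if outcomes is None else outcomes
    return Fraction(sum(event(c) for c in outcomes), len(outcomes))

def is_ace(card):   return card[0] == "ace"
def is_queen(card): return card[0] == "queen"
def is_heart(card): return card[1] == "hearts"
def is_red(card):   return card[1] in ("hearts", "diamonds")

print(p(is_queen), p(is_red), p(is_heart))                 # 1/13 1/2 1/4

# Joint probability P(A, B): both events hold on the same draw.
p_ace_and_heart = p(lambda c: is_ace(c) and is_heart(c))   # 1/52

# Conditional probability, computed directly by restricting to the hearts ...
hearts = [c for c in deck if is_heart(c)]
p_ace_given_heart = p(is_ace, hearts)                      # 1/13
# ... agrees with P(A | B) = P(A, B) / P(B) and with the product rule.
assert p_ace_given_heart == p_ace_and_heart / p(is_heart)
assert p_ace_and_heart == p_ace_given_heart * p(is_heart)

# Independence: rank and colour are independent on a single draw,
# e.g. P(ace, red) = P(ace) * P(red) = 1/13 * 1/2 = 1/26.
assert p(lambda c: is_ace(c) and is_red(c)) == p(is_ace) * p(is_red)
```

Using exact fractions rather than floats keeps the product-rule and independence checks exact.
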
Rules of Probability II: Bayes Rule

P(A|B) = P(A) P(B|A) / P(B)          (cf. the basic rule of probability, P(A|B) = P(A ∩ B) / P(B))

• Bayes' Rule allows us to compute P(A|B) given knowledge of the 'inverse' probability P(B|A).
More philosophically,
• Bayes' Rule allows us to update a prior belief with empirical evidence.

Posterior probability P(A|B)
• the degree of belief in A having accounted for B
Prior probability P(A)
• the initial degree of belief in A, i.e., the probability of A occurring given no additional knowledge about A
Likelihood P(B|A)
• the support B provides for A
Normalizing constant ('evidence')
• P(B) = Σ_A P(B|A) P(A)

Example: estimate the probability of a student being smart given that they achieved an H1 score, P(smart|H1), from the following information:
• P(smart) = 0.3       (prior rate of smart students)
• P(H1|smart) = 0.6    (empirically measured)
• P(H1) = 0.2          (empirically measured)
(What if P(H1) = 0.4?)

Gaussian Distribution
Models the probability of an event that follows a typical pattern, but with some random variation.
• A continuous probability distribution over real-valued values.
• It has two parameters:
  - the mean μ: the most typical value
  - the standard deviation σ: the spread around the mean (σ² is the variance)
• P(X = x | μ, σ) = 1 / (σ√(2π)) · exp(−(1/2) ((x − μ)/σ)²)
https://en.wikipedia.org/wiki/Normal_distribution#/media/File:Normal_Distribution_PDF.svg

Binomial Distributions
• A binomial distribution results from a series of independent trials with only two outcomes (aka Bernoulli trials), e.g., multiple coin tosses (H, T, H, H, ..., T).
• The probability P of an event with probability p occurring exactly m out of n times is given by
  P(m, n, p) = C(n, m) p^m (1 − p)^(n−m) = [n! / (m!(n − m)!)] p^m (1 − p)^(n−m)
  where p^m covers the m successes, (1 − p)^(n−m) the n − m failures, and C(n, m) counts the possible arrangements of m successes over n trials.

Binomial Example: Coin Toss
What is the probability of getting 2 heads out of 3 tosses of a fair coin?
[SOLUTION]
1. m = 2 successes (heads) when flipping a coin n = 3 times; we want P(X = 2).
2. Number of possible outcomes e from 3 coin flips: 2 × 2 × 2 = 2³ = 8, each with P(e) = 1/8.
3. Choose 2 out of 3: C(3, 2) = 3! / (2! 1!) = 3.
4. 3 such outcomes, 1/8 for each: P(X = 2) = 3/8.
Using the formula: P(2, 3, 1/2) = [3! / (2!(3 − 2)!)] (1/2)² (1/2)^(3−2) = 3 × (1/4) × (1/2) = 3/8.

Multinomial Distributions
• A multinomial distribution models the probability of counts of different events from a series of independent trials with more than two possible outcomes, e.g.,
  - a fair 6-sided die is rolled 5 times
  - what is the probability of observing exactly 3 'ones' and 2 'fives'?
  - what is the probability of observing 5 'threes'?
• The probability of events X1, X2, ..., Xn with probabilities p = p1, p2, ..., pn occurring exactly x1, x2, ..., xn times, respectively, is given by
  P(X1 = x1, X2 = x2, ..., Xn = xn; p) = [(Σ_i x_i)! / (x1! ... xn!)] · p1^x1 × p2^x2 × ··· × pn^xn
                                        = [(Σ_i x_i)! / (x1! ... xn!)] · Π_i p_i^x_i

Categorical Distributions
• The categorical distribution models the probability of events resulting from a single trial with more than two possible outcomes, e.g.,
  - we roll a fair 6-sided die once
  - what is the probability of observing a 'five'?
• Events X1, ..., Xn have associated probabilities p = p1, ..., pn.
• The probability of a single event is given by P(Xi = 1 | p) = pi.
• Equivalently, represent the one observation as a one-hot vector X1 = x1, X2 = x2, ..., Xn = xn, where exactly one of the xi is 1 and the others are 0. We can then write the probability of the observation as
  P(X1 = x1, X2 = x2, ..., Xn = xn; p) = p1^x1 × p2^x2 × ··· × pn^xn
Now the relation to the multinomial distribution is more explicit.
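
To make these distributions concrete, here is a small sketch (not part of the slides) that evaluates the examples above; it assumes SciPy is installed, and the same numbers also follow directly from the formulas:

```python
import math
from scipy.stats import norm, binom, multinomial

# Gaussian: density of a N(mu=0, sigma=1) variable at x = 0.5,
# via scipy and directly from the formula above.
mu, sigma, x = 0.0, 1.0, 0.5
pdf_formula = (1 / (sigma * math.sqrt(2 * math.pi))) * math.exp(-0.5 * ((x - mu) / sigma) ** 2)
print(norm.pdf(x, loc=mu, scale=sigma), pdf_formula)            # both ~0.3521

# Binomial: P(2 heads in 3 tosses of a fair coin) = C(3,2) * 0.5^2 * 0.5^1 = 3/8.
print(binom.pmf(2, n=3, p=0.5))                                 # 0.375
print(math.comb(3, 2) * 0.5**2 * 0.5**1)                        # 0.375, by the formula

# Multinomial: roll a fair die 5 times; probability of exactly
# three 'ones' and two 'fives' (counts for faces 1..6).
print(multinomial.pmf([3, 0, 0, 0, 2, 0], n=5, p=[1/6] * 6))    # ~0.0013

# Categorical: a single roll; P('five') is just p_5 = 1/6,
# i.e. a multinomial with n = 1 and a one-hot count vector.
print(multinomial.pmf([0, 0, 0, 0, 1, 0], n=1, p=[1/6] * 6))    # ~0.1667
```

The last line makes the categorical-as-multinomial relation explicit: with n = 1, the multinomial pmf of a one-hot count vector is just the corresponding p_i.
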
Marginalization
We want to know the probability of an event A irrespective of the outcome of another event B. We can obtain it by summing over all possible outcomes b of B.
• The set of all possible individual outcomes of B partitions the outcome space.
  E.g., {head, tail} for a coin flip; {clubs, spades, hearts, diamonds} for card suits.
• We can marginalize over the set of outcomes B of B as follows:
  P(A) = Σ_{b∈B} P(A, B = b)
  or equivalently (remember the product rule?)
  P(A) = Σ_{b∈B} P(A | B = b) P(B = b)
  and even for conditional probabilities
  P(A | C) = Σ_{b∈B} P(A | C, B = b) P(B = b | C)

Marginalization Example
We want to know the probability of success of movies of a specific genre (A = {comedy, thriller, romance}). But we only have data on movie success probabilities in a specific market, namely B = {EU, NA, AUS}.
  P(A) = Σ_{b∈B} P(A, B = b)

  A         B     P(A,B)        A         P(A)
  romance   EU    0.05          romance   0.45
  romance   NA    0.1           thriller  0.4
  romance   AUS   0.3           comedy    0.15
  thriller  EU    0.1
  thriller  NA    0.2
  thriller  AUS   0.1
  comedy    EU    0.1
  comedy    NA    0.025
  comedy    AUS   0.025
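
A minimal sketch in plain Python (not from the slides) that reproduces the marginal column by summing the joint table over markets:

```python
from collections import defaultdict

# Joint probabilities P(genre, market) from the table above.
joint = {
    ("romance", "EU"): 0.05,  ("romance", "NA"): 0.1,    ("romance", "AUS"): 0.3,
    ("thriller", "EU"): 0.1,  ("thriller", "NA"): 0.2,   ("thriller", "AUS"): 0.1,
    ("comedy", "EU"): 0.1,    ("comedy", "NA"): 0.025,   ("comedy", "AUS"): 0.025,
}

# Marginalize out the market: P(genre) = sum_b P(genre, market = b).
p_genre = defaultdict(float)
for (genre, market), prob in joint.items():
    p_genre[genre] += prob

print({g: round(p, 3) for g, p in p_genre.items()})
# {'romance': 0.45, 'thriller': 0.4, 'comedy': 0.15}
```
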
Please go to https://pollev.com/iml2022 for a quick quiz on probabilities!

Probability and Machine Learning
We probably all agree that probabilities are useful for thinking about card games or coin flips ... but why should we care in machine learning?
Consider typical classification problems:
• document → {spam, no spam}
• hand-written digit → {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
• purchase history → recommend {book a, book b, book c, ...}
Probability helps us to:
• handle uncertainty, due to few observations, noisy data, ...
• model features as following certain probability distributions
• make soft predictions ("we are 60% confident that Bob will like a given book, based on his purchase history")

Probabilistic Models
"All models are wrong, but some are useful." (George Box, Statistician)
Probabilistic models
• allow us to reason about random events in a principled way
• allow us to formalize hypotheses as different types of probability distributions, and use the laws of probability to derive predictions

Example: Spam classification
• An email is a random event with two possible outcomes: spam, not spam.
• The probability of observing a spam email is P(spam) = θ, and trivially P(not spam) = 1 − θ.
• We might care about a random variable X, the number of spam emails in an inbox of 100 emails. X is distributed according to the binomial distribution, and depends on the parameters θ and N = 100:
  X ∼ Binomial(θ, N = 100)

Learning Probabilistic Models I
X ∼ Binomial(θ, N = 100)
• In order to make predictions of X we need to know the parameter θ. How do we learn it?
• Typically, θ is unknown, but if we have data available we can estimate θ.
• One common choice is to pick the θ that maximizes the probability of the observed data:
  θ̂ = argmax_θ P(X; θ, N)
  This is the maximum likelihood estimate (MLE) of θ.
• Once we have estimated θ we can use it to predict values for unseen data.

Learning Probabilistic Models II
The maximum likelihood principle:
  θ̂ = argmax_θ P(X; θ, N)      (1)
• Consider a data set consisting of 100 emails, 20 of which are spam.
• Following from the binomial distribution,
  L(θ) = P(X; θ, N) = C(N, x) θ^x (1 − θ)^(N−x),
  the likelihood of the data¹ is ∝ θ^20 (1 − θ)^(100−20).
• What do you think would be a good value for θ = P(spam = 1)? Why?
• Next lecture, we will see how to derive this value in a principled way.

¹ ∝ means 'proportional to'. C(N, x) can be ignored because it is independent of θ.
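
As a numerical preview (a sketch assuming NumPy, not part of the slides), we can simply scan the likelihood of the spam data over a grid of θ values and see where it peaks:

```python
import numpy as np

# Spam example: N = 100 emails, x = 20 of them spam.
# Likelihood (up to a constant) as a function of theta: theta^20 * (1 - theta)^80.
N, x = 100, 20
thetas = np.linspace(0.001, 0.999, 999)

# Work in log space, which is numerically safer than multiplying many small terms.
log_lik = x * np.log(thetas) + (N - x) * np.log(1 - thetas)

theta_mle = thetas[np.argmax(log_lik)]
print(theta_mle)   # ~0.20, the observed fraction of spam x / N
```

The grid search peaks at the observed spam fraction x/N = 0.2; the next lecture derives this result analytically.
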
Learning Probabilistic Models III
Maximum likelihood is only one choice of estimator among many.
• Consider a data set of one inbox with no spam email. MLE: θ = 1, and hence P(not spam) = θ = 1 and P(spam) = 1 − θ = 0 → "spam emails don't exist".
• We could modify this estimate with our prior belief. E.g., we might believe that about 80 of 100 emails are not spam, and 'nudge' θ from θ = 1 towards θ = 0.80.
• We can combine our prior belief with the estimate from the data to arrive at a posterior probability distribution over θ:
  P(θ|x) = P(θ) P(x|θ) / P(x) ∝ P(θ) P(x|θ)
• The maximum a posteriori (MAP) estimate is then
  θ̂ = argmax_θ P(θ) P(x|θ)
  (looks familiar?)

Probability underlies many modern knowledge technologies. Today we covered:
• estimating the (conditional, joint) probability of observations
• Bayes rule
• expectations and marginalization
• probabilistic models
• maximum likelihood estimation (taster)
• maximum a posteriori estimation (taster)

Next Lecture(s):
• Optimization
• Naive Bayes Classification

References
Chris Bishop. Pattern Recognition and Machine Learning. Chapters: 1.2 (intro), 1.2.3, 2 (intro), 2.1 (up to 2.1.1), 2.2 (up to 2.2.1).

Optional / If time permits: Expectations
The expectation of a function (like a probability distribution) is the weighted average of all possible outcomes, weighted by their respective probability.
• For functions with discrete outputs:  E[f(x)] = Σ_{x∈X} f(x) P(x)
• For functions with continuous outputs:  E[f(x)] = ∫ f(x) P(x) dx
Example:
• On sunny days Bob watches 1 movie.
• On rainy days Bob watches 3 movies.
• Bob lives in Melbourne; it rains on 70% of all days.
• What is the expected number of movies Bob watches per day?
  1 × 0.3 + 3 × 0.7 = 2.4
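
The same expectation, computed in plain Python (a trivial sketch, not from the slides):

```python
# Expected number of movies Bob watches per day:
# E[f(X)] = sum_x f(x) P(x) over the weather outcomes.
movies = {"sunny": 1, "rainy": 3}          # f(x): movies watched
p_weather = {"sunny": 0.3, "rainy": 0.7}   # P(x): weather distribution

expectation = sum(movies[w] * p_weather[w] for w in p_weather)
print(expectation)   # 2.4 (up to floating-point rounding)
```
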