
CMPUT 366 F20: Probabilistic Inference
James Wright & Vadim Bulitko
October 29, 2020
CMPUT 366 F20: Probabilistic Inference
1

Lecture Outline
Variable elimination algorithm
PM 8.4

Checking Correctness of Belief Networks
A belief network claims that:
all nodes are independent of their non-descendants given their parents
To check correctness of a belief network, identify all sets of nodes for which the network claims independence, then check each claim against your domain knowledge.
[Figure: four candidate belief networks over nodes T, A, B; the independence claims of every network are correct except the bottom one.]

Answering Queries
Want to answer various queries
Such as P(Tampering | Smoke = true, Report = true)
Could compute a joint distribution
P(Tampering, Fire, Alarm, Smoke, Leaving, Report) =
P(Report | Leaving) P(Leaving | Alarm) P(Smoke | Fire) P(Alarm | Tampering, Fire) P(Tampering) P(Fire), and then use it to answer any query
But we can do better
variable elimination algorithm
uses dynamic programming
exploits conditional independence

Recap: Conditional Probability Tables
Specify P(Y | X1,…,Xk)
must satisfy: ∀x1, …, xk: ∑y P(Y = y | X1 = x1, …, Xk = xk) = 1
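This normalization constraint is easy to check mechanically. A minimal sketch, assuming a CPT stored as a dictionary keyed by full assignments (the layout, variable count, and numbers are illustrative, not from the slides):

```python
from itertools import product

# A CPT for P(Y | X1, X2) stored as {(x1, x2, y): probability}.
# The probability values below are made up for illustration.
cpt = {
    (0, 0, 0): 0.9, (0, 0, 1): 0.1,
    (0, 1, 0): 0.4, (0, 1, 1): 0.6,
    (1, 0, 0): 0.7, (1, 0, 1): 0.3,
    (1, 1, 0): 0.2, (1, 1, 1): 0.8,
}

def is_valid_cpt(cpt, parent_domains, y_domain):
    # For every assignment to the parents, probabilities over Y must sum to 1.
    for xs in product(*parent_domains):
        total = sum(cpt[xs + (y,)] for y in y_domain)
        if abs(total - 1.0) > 1e-9:
            return False
    return True

print(is_valid_cpt(cpt, [(0, 1), (0, 1)], (0, 1)))  # True
```

This check is exactly what distinguishes a CPT from a general nonnegative factor, which need not sum to 1 over any variable.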

Factors
A factor is a function from random variables to scalars:
f(X1 = x1,…,Xn = xn) gives a scalar
P(Y | X1,…,Xk) can be viewed as a factor f(Y,X1,…,Xk)
but has to sum up to 1 over Y: ∀x1, …, xk: ∑y f(Y = y, X1 = x1, …, Xk = xk) = 1
Operations on factors:
conditioning: lock down some variables, get a new factor
product: a new factor on the union of the variables of two factors
summing out: a new factor on a subset of the variables of the original factor
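The three operations are small enough to sketch directly. The representation below (a factor as a list of variable names plus a dictionary from assignments to scalars) is an illustrative choice, not prescribed by the slides:

```python
from itertools import product as iproduct

class Factor:
    """A factor: variable names plus a table from assignments to scalars."""
    def __init__(self, variables, table):
        self.variables = list(variables)   # e.g. ["A", "B"]
        self.table = dict(table)           # e.g. {(0, 0): 0.1, ...}

def condition(f, var, value):
    """Lock down var=value; the result no longer mentions var."""
    i = f.variables.index(var)
    rest = f.variables[:i] + f.variables[i+1:]
    table = {k[:i] + k[i+1:]: v for k, v in f.table.items() if k[i] == value}
    return Factor(rest, table)

def multiply(f, g, domains):
    """Product factor on the union of the two factors' variables."""
    union = f.variables + [v for v in g.variables if v not in f.variables]
    table = {}
    for assignment in iproduct(*(domains[v] for v in union)):
        env = dict(zip(union, assignment))
        fv = f.table[tuple(env[v] for v in f.variables)]
        gv = g.table[tuple(env[v] for v in g.variables)]
        table[assignment] = fv * gv
    return Factor(union, table)

def sum_out(f, var):
    """Sum var out of f; the result covers the remaining variables."""
    i = f.variables.index(var)
    rest = f.variables[:i] + f.variables[i+1:]
    table = {}
    for k, v in f.table.items():
        key = k[:i] + k[i+1:]
        table[key] = table.get(key, 0.0) + v
    return Factor(rest, table)
```

With these in hand, eliminating a variable Z amounts to: multiply every factor that mentions Z, then sum Z out of the product.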

Conditioning Example

Product Example

Summing Out Example

Variable Elimination
Given evidence variables Y1, . . . , Yk and their observed values y1, . . . , yk, we
want to answer the query P(Q | Y1 = y1,…,Yk = yk)
P(Q | Y1 = y1, …, Yk = yk) = P(Q, Y1 = y1, …, Yk = yk) / P(Y1 = y1, …, Yk = yk)
= P(Q, Y1 = y1, …, Yk = yk) / ∑q P(Q = q, Y1 = y1, …, Yk = yk)
View P(Q,Y1 = y1,…,Yk = yk) as a factor to compute
Basic idea of the variable elimination algorithm:
1. Condition on observations by conditioning
2. Construct the joint distribution factor by multiplication
3. Remove unwanted variables (neither query nor observed) by summing out
4. Normalize at the end
Doing these steps strictly in this order is correct but not efficient
Efficiency comes from interleaving the order of operations

Variable Elimination
Suppose a belief network has variables X1, . . . , Xn but our query refers to a subset Q,Y1,…,Yk of them:
{Z1, …, Zm} = {X1, …, Xn} \ {Q, Y1, …, Yk}
need to eliminate Z1, …, Zm
their order is called elimination ordering
Eliminate Zi by summing them out:
P(Q, Y1 = y1, …, Yk = yk) = ∑Z1,…,Zm P(X1, …, Xn)|Y1=y1,…,Yk=yk
Use the belief network structure: P(X1, …, Xn) = ∏i P(Xi | parents(Xi))
Take advantage of the fact that xy + xz = x(y + z)
Example: multiply factors f1(Q, A, B, C) and f2(C, D, E); sum out A, E
naive: f3(Q, A, B, C, D, E) = f1(Q, A, B, C) × f2(C, D, E), then
f4(Q, B, C, D) = ∑A,E f3(Q, A, B, C, D, E)
more efficient: f3(C, D) = ∑E f2(C, D, E) and f3′(Q, B, C) = ∑A f1(Q, A, B, C), then
f4(Q, B, C, D) = f3′(Q, B, C) × f3(C, D)
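The savings from interleaving can be made concrete by counting the table entries each plan materializes. A back-of-envelope sketch (treating all six variables as binary is an assumption made for this illustration; the slides do not state the domains):

```python
# Each factor's table has (domain size)^(number of variables) entries.
# Assume all six variables Q, A, B, C, D, E are binary.

# Naive: build f3(Q, A, B, C, D, E), then sum A and E out into f4(Q, B, C, D).
naive = 2**6 + 2**4

# Interleaved: f3(C, D) = sum over E of f2, f3'(Q, B, C) = sum over A of f1,
# then multiply into f4(Q, B, C, D).
interleaved = 2**2 + 2**3 + 2**4

print(naive, interleaved)  # 80 28
```

The gap widens quickly as variables and domain sizes grow, which is why elimination order matters so much in practice.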

Variable Elimination Algorithm

Example
Query: P(Tampering | Smoke = true, Report = true)
Variable ordering: Smoke, Report, Fire, Alarm, Leaving
Belief network factorization: P(Tampering, Fire, Alarm, Smoke, Leaving, Report) = P(Tampering)P(Fire)P(Alarm|Tampering, Fire)P(Smoke|Fire) P(Leaving|Alarm)P(Report|Leaving)
Construct factors for each CPT:
{f0(Tampering), f1(Fire), f2(Tampering, Alarm, Fire), f3(Smoke, Fire), f4(Leaving, Alarm), f5(Report, Leaving)}
Condition on Smoke: f6 = (f3)Smoke=true: {f0(Tampering), f1(Fire), f2(Tampering, Alarm, Fire), f6(Fire), f4(Leaving, Alarm), f5(Report, Leaving)}
Condition on Report: f7 = (f5)Report=true:
{f0(Tampering), f1(Fire), f2(Tampering, Alarm, Fire), f6(Fire), f4(Leaving, Alarm), f7(Leaving)}

Example: Continued
Query: P(Tampering | Smoke = true, Report = true)
Variable ordering: Smoke, Report, Fire, Alarm, Leaving
List of factors so far:
{f0(Tampering), f1(Fire), f2(Tampering, Alarm, Fire), f6(Fire), f4(Leaving, Alarm), f7(Leaving)}
Sum out Fire from the product of f1, f2, f6: f8 = ∑Fire (f1 × f2 × f6)
{f0(Tampering), f8(Tampering, Alarm), f4(Leaving, Alarm), f7(Leaving)}
Sum out Alarm from the product of f8, f4: f9 = ∑Alarm (f8 × f4)
{f0(Tampering), f9(Tampering, Leaving), f7(Leaving)}
Sum out Leaving from the product of f9, f7: f10 = ∑Leaving (f9 × f7)
{f0(Tampering), f10(Tampering)}
Take a product of remaining factors: f11 = f0 × f10 {f11(Tampering)}
Normalize by division: query(Tampering) = f11(Tampering) / ∑Tampering f11(Tampering)

Optimizing Elimination Order
Variable elimination exploits efficient sums of products on a factored joint distribution
The elimination order of the variables affects the efficiency of the algorithm
Finding an optimal elimination ordering is NP-hard
Heuristics (rules of thumb) for good orderings:
Min-factor: At every stage, select the variable that constructs the smallest new factor
Problem-specific heuristics
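The min-factor heuristic can be expressed as a scoring rule: eliminating a variable Z creates a factor over the union of the scopes of all factors that mention Z, minus Z itself. A sketch (the scope sets and the binary-domain assumption are illustrative, not from the slides):

```python
def new_factor_size(var, scopes, domain_size=2):
    """Table size of the factor created by eliminating `var`."""
    touched = [s for s in scopes if var in s]
    merged = set().union(*touched) - {var}
    return domain_size ** len(merged)

def min_factor_choice(candidates, scopes):
    """Pick the variable whose elimination builds the smallest new factor."""
    return min(candidates, key=lambda v: new_factor_size(v, scopes))

# Hypothetical factor scopes loosely modeled on the fire-alarm network.
scopes = [{"Fire"}, {"Tampering", "Alarm", "Fire"}, {"Leaving", "Alarm"}]
print(min_factor_choice(["Fire", "Alarm", "Leaving"], scopes))  # Leaving
```

Because the heuristic is greedy (it scores one elimination step at a time), it can still miss the globally optimal ordering, consistent with the NP-hardness noted above.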

Summary
Variable elimination is an algorithm for answering queries based on a belief network
Operates by using three operations on factors to reduce graph to a single posterior distribution
1. Conditioning
2. Multiplication
3. Summing out
Distributes operations more efficiently than taking full product and then summing out
Optimal order of operations is NP-hard to compute
Additional optimization techniques: heuristic ordering, pruning, precomputation

CMPUT 366 F20: Supervised Learning
James Wright & Vadim Bulitko
October 29, 2020
CMPUT 366 F20: Supervised Learning
1

Lecture Outline
Supervised Learning
PM 7.1-7.2

How to Specify Desired Behaviour of an Agent
Desired action for each state → supervised learning
Desired goal state(s) → search
Good/bad feedback per action → reinforcement learning
Survival/death → evolution

Supervised Learning
A supervised learning problem consists of:
A set of input features X1, …, Xn
A set of target features Y1,…,Yk
A set of training examples whose Xi and Yj are used by a supervised learning algorithm to construct a mapping f from features X̄ to targets Ȳ
A set of test examples whose Xi are run through f to predict the corresponding Yj
Types:
Classification: Yi are discrete
Regression: Yi are real-valued
In teams, come up with examples of tasks where supervised learning is preferred to search (one example) and to RL (another example)

Regression Example
The goal is to predict the value of target Y based on features X
Both X and Y are real-valued
Exact values of both targets and features may not be in the training set
Example on the right:
e8 is an interpolation problem: X is within the range of the training example values
e9 is an extrapolation problem: X is outside the range of the training example values

Data Representation
For real-valued features, we typically record the feature values
For discrete features, there are multiple options:
Binary features: can code false, true as 0, 1 or −1, 1
Non-binary features: can record numeric values for each possible value
cardinal values: differences are meaningful
ordinal values: order is meaningful
categorical values: neither differences nor order meaningful
Vector of indicator variables: one per feature value, exactly one is true (known as one-hot encoding)

Classification Example
An agent wants to learn a person’s preference for the length of holidays
A holiday can be for 1, 2, 3, 4, 5, or 6 days
Two possible representations:

Generalization
What does it mean for supervised learning to perform well?
We want f to make correct predictions on all data:
find underlying patterns in the training data
encode these patterns in f
hope the patterns remain true for all data
We estimate generalization performance by evaluating f on a test set
The learning algorithm has access to features (X) of the test data but not the targets (Y)
Discuss why measuring performance of f on training data only is a bad idea. If we do so, how can supervised learning game the task?

Generalization Example
Consider the following data set: {(i, e)} where i is a positive integer and e is true when i is even and false when i is odd
the training data is a collection of 10 such pairs where i is drawn randomly between 1 and 1000
a training sample (i, true) is called positive
a training sample (i, false) is called negative
Consider two binary classifiers P and N:
P classifies all positive samples in the training data as true and all others as false
N classifies all negative samples in the training data as false and all others as true
Discuss: (i) which classifier performs better on the training data and (ii) which classifier generalizes better, (iii) what would be a better classifier than P or N
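The two memorizing classifiers can be written out directly. A sketch (the draw of 10 training pairs follows the setup above; the seed is arbitrary):

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility
train = [(i, i % 2 == 0) for i in random.sample(range(1, 1001), 10)]

positives = {i for i, label in train if label}       # examples labeled true
negatives = {i for i, label in train if not label}   # examples labeled false

def P(i):
    # True exactly on the memorized positive training samples.
    return i in positives

def N(i):
    # False exactly on the memorized negative training samples.
    return i not in negatives

# Both classifiers fit the training data perfectly.
print(all(P(i) == label and N(i) == label for i, label in train))  # True

# On unseen integers, P always answers False and N always answers True,
# so each is correct only about half the time.
unseen = [i for i in range(1, 1001) if i not in positives | negatives]
p_acc = sum(P(i) == (i % 2 == 0) for i in unseen) / len(unseen)
n_acc = sum(N(i) == (i % 2 == 0) for i in unseen) / len(unseen)
```

A classifier that beats both is of course f(i) = (i % 2 == 0): it fits the training data and generalizes to every integer.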

Bias
The hypothesis is the function f that we learn
The hypothesis space is the set of possible hypotheses
The preference for one hypothesis over another is called bias
bias can be helpful in machine learning
example: preference for simple hypotheses (e.g., f(i) = isEven(i))
which bias works best for generalization is an empirical question

Learning as Search
Given training data, a hypothesis space, an error measurement and a bias, learning can be reduced to search
The learning process searches the hypothesis space trying to find the hypothesis that best fits the training data given the bias
the hypothesis space is too large to search exhaustively
local search (PM 4.7) methods are used instead
start with a random hypothesis
attempt to improve it by tweaking it; sometimes make random tweaks
pick a new random hypothesis once in a while

Minimizing Cost
The learning algorithm chooses its hypothesis f by:
1. its error (or loss) on the training data
2. some preference over the space of hypotheses (i.e., the learning bias)
Together they represent a cost function defined over the hypothesis space
the learning algorithm is attempting to minimize the cost
Consider them in turn

0/1 Error
The 0/1 error for a dataset E of examples and hypothesis f is the number of examples for which the prediction was incorrect:
∑e∈E (1 if f(e) ≠ Y(e), 0 otherwise)
Works well for binary or categorical data
Does not work well for real-valued data:
usually cannot predict the exact value
0/1 error does not take magnitude of the misprediction into account

Absolute Error
The absolute error for a dataset E of examples and hypothesis f is the sum of absolute distances between the prediction and the actual target:
∑e∈E |f(e) − Y(e)|
Works well for cardinal and possibly ordinal data
takes into account the magnitude of misprediction
Does not make sense for categorical data

Squared Error
The squared error for a dataset E of examples and hypothesis f is the sum of squared distances between the prediction and the actual target:
∑e∈E (f(e) − Y(e))²
Works well for cardinal data
takes into account the magnitude of misprediction
large errors contribute non-linearly more than small errors
Does not make sense for categorical data

Worst-case Error
The worst-case error for a dataset E of examples and hypothesis f is the absolute distance between the prediction and the actual target in the worst case:
maxe∈E |f(e) − Y(e)|
Works well for cardinal data
takes into account the magnitude of misprediction but only on a single sample (i.e., the worst case)
Does not make sense for categorical data
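The four error measures can be compared on a single small dataset. A sketch (the predictions and targets are made-up numbers, chosen so one example is badly mispredicted):

```python
# Hypothetical predictions f(e) and targets Y(e) for four examples.
preds   = [0.9, 2.0, 3.5, 10.0]
targets = [1.0, 2.0, 3.0, 4.0]
pairs = list(zip(preds, targets))

zero_one = sum(1 for f_e, y_e in pairs if f_e != y_e)      # counts misses only
absolute = sum(abs(f_e - y_e) for f_e, y_e in pairs)       # sums magnitudes
squared  = sum((f_e - y_e) ** 2 for f_e, y_e in pairs)     # penalizes big misses
worst    = max(abs(f_e - y_e) for f_e, y_e in pairs)       # only the worst case

print(zero_one, round(absolute, 2), round(squared, 2), worst)  # 3 6.6 36.26 6.0
```

Note how 0/1 error treats the tiny 0.1 miss and the huge 6.0 miss identically, while squared error is dominated by the single worst prediction.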

Probabilistic Predictions
Rather than predicting exactly what a target value will be, many common algorithms predict a probability distribution over possible values
especially for classification tasks
One-hot encoding is the common data representation for this scheme:
target features of training examples have a single 1 for the true value
target values predicted by f are probabilities that sum to 1

Likelihood
The likelihood for a dataset E of examples and hypothesis f is the probability of independently observing the examples according to the probabilities assigned by the hypothesis:
P(E | f) = ∏e∈E f(e = Y(e))
This has a clear interpretation in a Bayesian sense
Numerical stability issues: product of probabilities shrinks exponentially
floating point underflows almost immediately

Log-Likelihood
The log-likelihood for a dataset E of examples and hypothesis f is the log-probability of independently observing the examples according to the probabilities assigned by the hypothesis:
log P(E | f) = log ∏e∈E f(e = Y(e)) = ∑e∈E log f(e = Y(e))
The underflow issue is remedied:
sum of logs shrinks more slowly than product of probabilities
Log is monotonic so maximizing log-likelihood imposes the same preference over hypotheses as maximizing likelihood:
P(E | f1) > P(E | f2) ⟺ log P(E | f1) > log P(E | f2)
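The underflow the log transform remedies is easy to reproduce. Assuming a hypothesis that assigns probability 0.1 to each of 400 observed targets (numbers chosen purely for illustration):

```python
import math

probs = [0.1] * 400  # probability f assigns to each example's observed target

likelihood = math.prod(probs)                     # underflows float64 to 0.0
log_likelihood = sum(math.log(p) for p in probs)  # stays well in range

print(likelihood)       # 0.0
print(log_likelihood)   # about -921.03
```

The log-likelihood is perfectly representable even though the likelihood itself (10⁻⁴⁰⁰) is far below the smallest positive double, so maximizing it is the numerically sane choice.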

Baseline Predictors
Constant
f(X) is the same for any X
Special case: majority class in classification tasks
f(X) gives the most common Y found in the training set, regardless of X
good performance on unbalanced classification data
poor performance on balanced classification data
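A constant majority-class predictor takes only a few lines. A sketch (the spam/ham labels and class balance are made up for illustration):

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda x: most_common  # ignores the input entirely

# Unbalanced data: the baseline looks deceptively strong.
labels = ["spam"] * 90 + ["ham"] * 10
f = majority_baseline(labels)
accuracy = sum(f(None) == y for y in labels) / len(labels)
print(accuracy)  # 0.9
```

Any learned classifier should be compared against this baseline: 90% accuracy here reflects the class imbalance, not any predictive skill.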

Summary
Supervised learning is constructing a hypothesis function from training examples
maps from input features to target features
classification: discrete target features
regression: real-valued target features
Preferences among hypotheses are called bias an important component of learning
A cost function combines prediction error and bias over hypotheses
supervised learning aims to find a hypothesis which minimizes cost
different measures for prediction error for different cases
Trivial predictors can be used as baselines