https://xkcd.com/1725/
Announcements
Mar week 4 – no lecture, Canberra Day!
Assignment 1: submit PDF + code via Gradescope (entry to follow).
More about this 16:30–17:00 today.
● You’re encouraged to typeset the solution with LaTeX, since making your computer “speak math” (and manage a bibliography) is one of the important skills for AI/ML. We liken this to using source control in COMP1100/1110.
→ 2 bonus marks for solutions typeset with LaTeX
● Grading expectations for COMP4670 students – max{COMP4670 scheme, COMP8600 scheme}
Linear models for regression 2
We covered:
● Conditional Gaussians
● Bayesian regression
● Predictive distributions
● Curse of dimensionality
● Info theory 101
We’ll get there via 3 stepping stones – conjugate prior, conditional Gaussian, Bayes theorem for Gaussian variables
What about the conditional distribution?
Intuitions for the conditional and marginal of jointly Gaussian variables
Covariance
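For reference, a minimal statement of the standard partitioned-Gaussian results (as in PRML §2.3.1–2.3.2); the partition of x into (x_a, x_b), with mean blocks (μ_a, μ_b) and covariance blocks Σ_aa, Σ_ab, Σ_ba, Σ_bb, is assumed:

\[
p(x_a) = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa}),
\]
\[
p(x_a \mid x_b) = \mathcal{N}\big(x_a \mid \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\ \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\big).
\]

Note that the conditional covariance does not depend on the observed value x_b.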
“Completing the square” – a useful view for manipulating Gaussian distributions

\[
\log\big(\mathcal{N}(x \mid \mu, \Sigma)\big) = \text{const} + \log\big(\exp(\text{quadratic expression of } x)\big)
\]
“Completing the square”
From the quadratic term: read off the inverse covariance Σ⁻¹.
From the linear term: read off Σ⁻¹μ, and hence the mean μ.
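As a sketch of the bookkeeping: expanding the exponent of a Gaussian,

\[
-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)
= -\tfrac{1}{2}\, x^\top \Sigma^{-1} x + x^\top \Sigma^{-1} \mu + \text{const},
\]

so matching the quadratic term in x identifies Σ⁻¹, and matching the linear term identifies Σ⁻¹μ.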
Bayes theorem for Gaussian variables
General direction: find the joint of p(x, y), and then use results from conditional+marginal Gaussians.
Bayes theorem for Gaussian variables
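For reference, the standard linear-Gaussian results (PRML §2.3.3) are: given

\[
p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}), \qquad
p(y \mid x) = \mathcal{N}(y \mid A x + b, L^{-1}),
\]

the marginal and the posterior are

\[
p(y) = \mathcal{N}\big(y \mid A\mu + b,\ L^{-1} + A \Lambda^{-1} A^\top\big),
\]
\[
p(x \mid y) = \mathcal{N}\big(x \mid \Sigma\{A^\top L (y - b) + \Lambda \mu\},\ \Sigma\big),
\qquad \Sigma = (\Lambda + A^\top L A)^{-1}.
\]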
We have covered:
● Prelude: conjugate prior, conditional of Gaussian variables, Bayes theorem of Gaussian distributions
● Bayesian regression
● Predictive distributions
● Curse of dimensionality
● Info theory 101
Estimate the posterior of w, not just w_ML.
Isotropic Gaussian prior
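For reference, with prior precision α and noise precision β, the standard posterior (PRML §3.3.1) is

\[
p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I), \qquad
p(w \mid \mathbf{t}) = \mathcal{N}(w \mid m_N, S_N),
\]
\[
m_N = \beta\, S_N \Phi^\top \mathbf{t}, \qquad
S_N^{-1} = \alpha I + \beta\, \Phi^\top \Phi.
\]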
We have covered:
● Prelude: conjugate prior, conditional of Gaussian variables, Bayes theorem of Gaussian distributions
● Bayesian regression
● Predictive distributions
● Curse of dimensionality
● Info theory 101
Predictive distribution
Goal of regression: estimate y(x; w) at unobserved values of x.
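A minimal NumPy sketch putting the posterior and predictive computations together; Phi, t, alpha, beta, and phi_x are illustrative names (not from the slides), and the basis-function map is assumed given.

import numpy as np

def posterior(Phi, t, alpha, beta):
    # Posterior N(w | m_N, S_N) for an isotropic Gaussian prior N(w | 0, (1/alpha) I)
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(phi_x, m_N, S_N, beta):
    # Predictive p(t | x) = N(t | m_N^T phi(x), 1/beta + phi(x)^T S_N phi(x))
    mean = phi_x @ m_N
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var

The predictive variance has two parts: observation noise (1/beta) and uncertainty in w (the phi(x)^T S_N phi(x) term), which shrinks as more data is observed.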
Curse of dimensionality
See Fig. 1.22 and Eq. (1.76).
Discussions on fixed basis functions

Pros: the assumption of linearity in the parameters (1) leads to a range of useful properties, including closed-form solutions to the least-squares problem as well as a tractable Bayesian treatment; (2) for a suitable choice of basis functions, we can model arbitrary nonlinearities in the mapping from input variables to targets.

Cons: the basis functions are fixed before the training data is observed, so the number of basis functions needs to grow rapidly, often exponentially, with the dimensionality D of the input space (see the sketch below).

Fixes: (1) use localized basis functions scattered only in regions of input space containing data → RBF networks; neural networks use adaptive basis functions. (2) the target variables may have significant dependence on only a small number of possible directions; neural networks can choose the directions to which they respond.
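A tiny numeric illustration of the exponential growth, assuming a regular grid of basis-function centres with m centres per input dimension (a hypothetical setup, just to make the point):

# number of fixed basis functions on a regular grid grows as m^D
m = 5                      # centres per input dimension
for D in (1, 2, 3, 10):
    print(D, m ** D)       # 5, 25, 125, 9765625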
We have covered:
● Prelude: conjugate prior, conditional of Gaussian variables, Bayes theorem of Gaussian distributions
● Bayesian regression
● Predictive distributions
● Curse of dimensionality
● Info theory 101
Information theory 101
How much information is in a random variable x?
The amount of information: the ‘degree of surprise’ on learning the value of x, expressed as a function h(x).
Assumptions:
● h(x) is monotonic in the probability p(x)
● For independent x ~ p(x) and y ~ p(y): h(x, y) = h(x) + h(y)
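These assumptions lead to the standard definitions of information content and entropy (Shannon):

\[
h(x) = -\log_2 p(x), \qquad
H[x] = \mathbb{E}[h(x)] = -\sum_x p(x) \log_2 p(x).
\]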
(Information) Entropy [Shannon, 1948]
Consider a random variable x with 8 equally likely states.
How to encode the value of x in binary digits? {000, 001, 010, 011, 100, 101, 110, 111}
How much information / entropy? For 8 equally likely states, 3 bits (see the calculation below).
Noiseless coding theorem (Shannon, 1948): the entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
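For the 8-state example, a quick check that the bound is tight:

\[
H[x] = -\sum_{i=1}^{8} \tfrac{1}{8} \log_2 \tfrac{1}{8} = 3 \text{ bits},
\]

which matches the 3-digit binary code above.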
Reading: connection to entropy in physics
Recommended: “The Bit Player” (2018 documentary)
Entropy for continuous variables
What is the distribution p(x) that maximises H[x], given that p(x) has a fixed mean and variance? Hint: Lagrange multipliers!
Reading: deriving why differential entropy is well defined
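For reference, the answer obtained with Lagrange multipliers is the Gaussian, with differential entropy

\[
p(x) = \mathcal{N}(x \mid \mu, \sigma^2), \qquad
H[x] = \tfrac{1}{2} \ln\big(2\pi e \sigma^2\big).
\]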
https://piazza.com/class/kx6v4tv3f7h3u4?cid=77
Convex functions
Jensen’s inequality: f(E[x]) ≤ E[f(x)] for a convex function f.
KL divergence
If we have “incorrectly” represented p(x) with q(x), how much more information do we need to recover p(x)?

\[
\mathrm{KL}(p \,\|\, q) = -\int p(x) \ln\frac{q(x)}{p(x)}\, dx
\]

Apply Jensen’s inequality: −ln(·) is convex, so KL(p ‖ q) ≥ 0, and equality holds iff p(x) = q(x) for all x.
Not symmetric, so not a distance measure! But it has interesting algebraic and geometric properties – see Assignment 1.
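A minimal sketch, assuming two discrete distributions with full support over the same states (names are illustrative):

import numpy as np

def kl(p, q):
    # KL(p || q) = sum_x p(x) * ln(p(x) / q(x)) for discrete p, q (no zero entries)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))   # two different values: KL is not symmetric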
“The Venn diagram of entropy and information”
How much information is carried in x about y? Two different metrics: conditional entropy and mutual information.
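For reference, the two quantities (PRML §1.6.1):

\[
H[y \mid x] = -\iint p(x, y) \ln p(y \mid x)\, dy\, dx,
\]
\[
I[x, y] = \mathrm{KL}\big(p(x, y)\,\|\,p(x)\,p(y)\big) = H[y] - H[y \mid x] = H[x] - H[x \mid y].
\]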
Recap: Linear models for regression
● Prelude: conjugate prior, conditional of Gaussian variables, Bayes theorem of Gaussian distributions
● Bayesian regression
● Predictive distributions + discussions
● Curse of dimensionality + Info theory 101