
https://xkcd.com/410/

announcements
Quiz 1 – open until Fri


Assignment 1

(a bite-sized intro to) Generalisation
High-level questions for today: Why does learning work? Why does over-parameterisation work?
The notion of generalization gap
Overparameterization: empirical phenomena
Prelude: three inequalities
Theories of generalization
Algorithmic stability
Model complexity and uniform convergence
Generalization from algorithms
The MLStory book
https://mlstory.org/generalization.html

by Moritz Hardt and Benjamin Recht (Professor, UC Berkeley)

Notation – loss function and risks
loss(f(x), y) – loss on one data point (x, y) (slightly stretched notation).
R[f] – population risk: an expectation E over the true data distribution (X, Y) – we don’t have access to this.
S – a sample of data points, written as ordered tuples.
R_S[f] – empirical risk for this sample.
Empirical risk minimisation (ERM) seeks to find a predictor f* in a specified class Ƒ that minimizes the empirical risk, i.e. the training error (training loss) – the loss on seen data.
On unseen data we expect this to be worse (larger loss/risk).
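For reference, the standard definitions behind these quantities (a sketch following the MLstory notation, with S = ((x_1, y_1), …, (x_n, y_n)) and a model class Ƒ):

    R[f] \;=\; \mathbb{E}_{(X,Y)}\big[\operatorname{loss}(f(X), Y)\big],
    \qquad
    R_S[f] \;=\; \frac{1}{n}\sum_{i=1}^{n} \operatorname{loss}(f(x_i), y_i),
    \qquad
    f^{*} \;\in\; \arg\min_{f \in \mathcal{F}} R_S[f].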

Generalisation gap
Aka generalisation error, excess risk.
“If we manage to make the empirical risk small through optimization (most of this class and other ML classes), then all that remains to worry about is the generalization gap.”
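In symbols (standard definition):

    \Delta_{\mathrm{gen}}(f) \;=\; R[f] - R_S[f]
    \qquad\Longleftrightarrow\qquad
    R[f] \;=\; R_S[f] + \Delta_{\mathrm{gen}}(f).

So if R_S[f] is small and 𝚫gen(f) is small, the population risk R[f] must also be small.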

From ML practice: overparameterization
Model size / complexity (informally): number of trainable parameters, for a model family.
Traditional view of generalization (old theory): over-parameterisation is harmful – a model with more parameters than data points (M > N) is expected to overfit.
M – number of parameters; N – number of points

From ML practice: double descent
● Complex models can simultaneously achieve close to zero training loss and still generalize well.
● Risk continues to decrease as model complexity grows, down to (nearly) zero training loss, where the training data are interpolated exactly.
● The empirical relationship between overparameterization and risk appears to be robust and manifests in numerous model classes, including overparameterized linear models, ensemble methods, and neural networks.
● Increasing model complexity in the overparameterized regime continues to decrease risk indefinitely, albeit at decreasing marginal returns, toward some convergence point.
Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl. Acad. Sci. U.S.A., 116, 15849–15854.
Loog, M., Viering, T.J., Mey, A., Krijthe, J.H., & Tax, D.M. (2020). A brief prehistory of double descent. Proc. Natl. Acad. Sci. U.S.A., 117, 10625–10626.
Dar, Y., Muthukumar, V., & Baraniuk, R. (2021). A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning.
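A toy illustration in the spirit of these papers (my own sketch, assuming NumPy, not the papers’ exact setup): minimum-norm least squares on random ReLU features typically shows test error peaking near the interpolation threshold (p ≈ n) and then decreasing again as p grows.

    import numpy as np

    rng = np.random.default_rng(0)
    n_train, n_test, d = 50, 1000, 10
    X_tr = rng.normal(size=(n_train, d))
    X_te = rng.normal(size=(n_test, d))
    w_true = rng.normal(size=d)
    y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n_train)
    y_te = X_te @ w_true + 0.5 * rng.normal(size=n_test)

    def relu_features(X, W):
        return np.maximum(X @ W, 0.0)

    # Sweep the number of random features p across the interpolation threshold p = n_train.
    for p in [2, 10, 25, 45, 50, 55, 100, 500, 2000]:
        W = rng.normal(size=(d, p)) / np.sqrt(d)      # random ReLU feature map
        Phi_tr, Phi_te = relu_features(X_tr, W), relu_features(X_te, W)
        beta = np.linalg.pinv(Phi_tr) @ y_tr          # minimum-norm least squares fit
        tr_err = np.mean((Phi_tr @ beta - y_tr) ** 2)
        te_err = np.mean((Phi_te @ beta - y_te) ** 2)
        print(f"p={p:>5}  train MSE={tr_err:8.4f}  test MSE={te_err:8.4f}")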

Single descent: larger models simply work better … e.g. ResNet [He et al. 2016] in computer vision.
*Sometimes we see multiple bumps too …

Optimisation versus generalisation
Training/optimisation is “easy”, i.e. the training error can be driven to zero by the optimisation algorithm when # parameters > # data points. BUT, overparameterization puts the burden on generalisation.
Experiment: training with random (!) labels on CIFAR-10 (10 classes). Here R[f] is known … test accuracy should be 1/10. The training error is still driven to zero by the optimisation algorithm → the model overfits.
(similar observations hold for many overparameterized architectures in the literature)
→ a proof of convergence in optimisation may not reveal insights into the nature of generalisation.
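A minimal sketch of the random-label experiment, assuming PyTorch/torchvision are available; the lecture’s actual setup (an Inception-style network on CIFAR-10) is larger, and the small CNN below is only illustrative.

    import torch
    import torch.nn as nn
    import torchvision
    import torchvision.transforms as T

    # Load CIFAR-10 and REPLACE every label with a uniformly random class.
    train_set = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=T.ToTensor())
    train_set.targets = torch.randint(0, 10, (len(train_set),)).tolist()
    loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

    # A small CNN (stand-in for the Inception model on the slide).
    model = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(64 * 16 * 16, 512), nn.ReLU(),
        nn.Linear(512, 10),
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    # Train long enough and the training error goes to (near) zero even though
    # the labels carry no information: test accuracy stays around 1/10.
    # (Slow on CPU; use a GPU in practice.)
    for epoch in range(50):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()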

Inception illustrated

What about regularization?
Experiment on CIFAR-10 (50K training examples) + Inception (1.5M params):
● Explicit (e.g. ℓ2) regularisation (regression, large neural nets)
● Data augmentation (e.g. random cropping and rotation of training images)
→ yes, regularization helps, but it is by no means necessary for strong generalisation.
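A sketch of what these two knobs typically look like in code (hypothetical values, assuming PyTorch/torchvision; not the exact configuration used in the slide’s experiment):

    import torch
    import torch.nn as nn
    import torchvision.transforms as T

    # Data augmentation: random cropping and rotation of training images.
    augment = T.Compose([
        T.RandomCrop(32, padding=4),
        T.RandomRotation(degrees=15),
        T.ToTensor(),
    ])

    # Explicit l2 regularisation via weight decay in the optimiser
    # (placeholder linear model; the slide's experiment uses Inception).
    model = nn.Linear(3 * 32 * 32, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=5e-4)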

Why learning works
Four views presented in this chapter:
● Algorithmic stability: generalization arises when models are insensitive to perturbations in the data on which they are trained.
● VC dimension and Rademacher complexity: how small generalization gaps can arise when we restrict the complexity of the models we wish to fit to data.
● Margin bounds: whenever the data is easily separable, good generalization will occur.
● Optimization: how the choice of an algorithmic scheme itself can yield models with desired generalization properties.
The four different views of generalization can all arrive at similar results – the UPPER bound on 𝚫gen(f) depends on n (it decreases as n increases) and on the complexity of the ideal predictor (it increases as complexity increases).
Generalization is multifaceted, and multiple perspectives are useful when designing data-driven predictive systems.

Recall the CLT (Central Limit Theorem)
For a fixed prediction function f, with an infinite amount of data, E[empirical risk] = population risk R[f].
CLT: if Z is a random variable with bounded variance, then its (centered) sample mean converges in distribution to a Gaussian random variable with mean zero and variance on the order of 1/n.
We want an upper bound on 𝚫gen(f) that is small with high probability:
P[ |R[f] – R_S[f]| ≥ ε ] ≤ δ, i.e. P[ |R[f] – R_S[f]| ≤ ε ] ≥ 1 – δ.
How does 𝚫gen(f) shrink w.r.t. the number of data points n – how do we expect the gap to scale? E.g. k^n, n^k, O(n), log(n)?
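A quick numerical illustration of how the gap between the sample mean and its expectation shrinks with n (a sketch assuming NumPy; the 0/1 losses of a fixed classifier are simulated as Bernoulli draws with a made-up true risk of 0.3):

    import numpy as np

    rng = np.random.default_rng(0)
    true_risk = 0.3  # hypothetical population risk R[f] of a fixed classifier

    for n in [100, 1_000, 10_000]:
        # 0/1 losses on n i.i.d. samples, repeated over 1000 trials.
        losses = rng.binomial(1, true_risk, size=(1000, n))
        gaps = np.abs(losses.mean(axis=1) - true_risk)   # |R_S[f] - R[f]| per trial
        print(f"n={n:>6}  typical gap ≈ {gaps.mean():.4f}  "
              f"(compare 1/sqrt(n) = {1/np.sqrt(n):.4f})")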

(a bite-sized intro to) Generalisation
The notion of generalization
Overparameterization: empirical phenomena
Prelude: three inequalities that we’ll need later
Theories of generalization
● Algorithmic stability
● Model complexity and uniform convergence
● Margin bounds
● Generalization from algorithms

Concentration inequalities
[thanks: UW CS312]
Second form
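Recall Markov’s inequality (likely what this slide shows), with its “second form”:

    \text{For } X \ge 0 \text{ and } a > 0:\qquad \Pr[X \ge a] \;\le\; \frac{\mathbb{E}[X]}{a};
    \qquad
    \text{second form (set } a = c\,\mathbb{E}[X]\text{):}\qquad \Pr\big[X \ge c\,\mathbb{E}[X]\big] \;\le\; \frac{1}{c}.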

A weighted coin
A coin is weighted so that its probability of landing on heads is 20%, independently of other flips. Suppose the coin is flipped 20 times.
Use Markov’s inequality to bound the probability it lands on heads at least 16 times. Also use Chebyshev’s inequality to upper bound the same probability.
[thanks: UW CS312]

Explains why sample averages are good estimates of the mean.
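Recall Chebyshev’s inequality (likely what this slide shows), and its consequence for sample means of i.i.d. Z_1, …, Z_n:

    \Pr\big[\,|X - \mathbb{E}[X]| \ge t\,\big] \;\le\; \frac{\mathrm{Var}(X)}{t^{2}},
    \qquad\text{so}\qquad
    \Pr\Big[\Big|\tfrac{1}{n}\sum_{i=1}^{n} Z_i - \mathbb{E}[Z]\Big| \ge t\Big] \;\le\; \frac{\mathrm{Var}(Z)}{n\,t^{2}}.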

Proof sketch
[thanks: UW CS312]

A weighted coin (continued)
A coin is weighted so that its probability of landing on heads is 20%, independently of other flips. Suppose the coin is flipped 20 times.
Use Markov’s inequality to bound the probability it lands on heads at least 16 times. Use Chebyshev’s inequality to upper bound the same probability.
[thanks: UW CS312]
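A worked solution with these numbers (assuming the standard setup: X = number of heads, X ~ Binomial(20, 0.2), so E[X] = 4 and Var(X) = 20 · 0.2 · 0.8 = 3.2):

    \text{Markov:}\quad \Pr[X \ge 16] \;\le\; \frac{\mathbb{E}[X]}{16} \;=\; \frac{4}{16} \;=\; 0.25,
    \qquad
    \text{Chebyshev:}\quad \Pr[X \ge 16] \;\le\; \Pr\big[\,|X - 4| \ge 12\,\big] \;\le\; \frac{3.2}{12^{2}} \;\approx\; 0.022.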

Hoeffding’s inequality
When random variables are bounded, sample averages concentrate around their mean value exponentially quickly.
With probability at least 1-δ, the two-sided and one-sided bounds below hold.
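The standard forms for i.i.d. Z_1, …, Z_n ∈ [a, b] with mean μ (the slide’s exact notation may differ): with probability at least 1-δ,

    \text{two-sided:}\quad \Big|\tfrac{1}{n}\sum_{i=1}^{n} Z_i - \mu\Big| \;\le\; (b-a)\sqrt{\frac{\log(2/\delta)}{2n}},
    \qquad
    \text{one-sided:}\quad \tfrac{1}{n}\sum_{i=1}^{n} Z_i - \mu \;\le\; (b-a)\sqrt{\frac{\log(1/\delta)}{2n}}.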

Example application of a concentration inequality
A person’s height is in (0, 9] (unit: feet; 1 foot = 12 inches).
Sample 30,000 individuals (randomly!): {h1, … h30,000}.
Hoeffding’s inequality → with 83% probability, the sample mean μ’h is within one inch of the true mean μh.
When random variables have low variance or are tightly bounded, small experiments quickly reveal insights about the population.
When variances are large or effectively unbounded, the number of samples required for high-precision estimates might be impractical, and our estimators, algorithms and predictions may need to be rethought.

Two scaling regimes
Consider a single prediction function f, chosen independently of the sample S. (Note: this f is not a random quantity depending on S.)
Hoeffding’s inequality → with probability 1-δ, we get an upper bound on 𝚫gen(f) – the looser bound (see the sketch below).
● R_S[f] large – generalisation gap decreasing at 1/sqrt(n)
● R_S[f] small – generalisation gap decreasing at 1/n
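A sketch of the math behind the two regimes, assuming a loss bounded in [0, 1] and a fixed f; the Hoeffding bound is standard, and the refinement is a Bernstein-type bound whose exact constants vary by source (see the MLstory chapter for the version used in class): with probability at least 1-δ,

    \text{Hoeffding (looser):}\quad \Delta_{\mathrm{gen}}(f) \;\le\; \sqrt{\frac{\log(2/\delta)}{2n}} \;=\; O\!\big(1/\sqrt{n}\big),
    \qquad
    \text{Bernstein-type:}\quad \Delta_{\mathrm{gen}}(f) \;\lesssim\; \sqrt{\frac{R[f]\,\log(1/\delta)}{n}} \;+\; \frac{\log(1/\delta)}{n}.

When R_S[f] (and hence R[f]) is close to zero, the 1/n term dominates; when it is large, the 1/sqrt(n) term dominates.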

(a bite-sized intro to) Generalisation
The notion of generalization
Overparameterization: empirical phenomena
Prelude: three inequalities that we’ll need
Theories of generalization
● Algorithmic stability
● Model complexity and uniform convergence
● Margin bounds
● Generalization from algorithms

Algorithmic stability
Generalization arises when models are insensitive to perturbations in the data on which they are trained.
Specifically: how sensitive an algorithm is to changes in a single training example.
Three ways of data perturbation, all yielding similar generalisation bounds.
● Resample a single data point
● Leave one data point out
● A single data point is arbitrarily corrupted (adversarial scenario)

Some notation
S – a sample of n data points.
z = (x, y) – a labeled example; z ~ the data distribution (X, Y).
S^(1), …, S^(n) – n hybrid samples (defined below).
A – a learning algorithm, mapping data samples to the space of functions f.
A is assumed to minimize R_S – optimisation bounds will be mentioned later.
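A sketch of how the hybrid samples are typically constructed (following the MLstory setup; S' is an independent “ghost” sample drawn from the same distribution):

    S = (z_1, \dots, z_n),\qquad S' = (z_1', \dots, z_n'),\qquad
    S^{(i)} = (z_1, \dots, z_{i-1},\, z_i',\, z_{i+1}, \dots, z_n).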

Average stability and its relation to the generalisation gap
[proof omitted, see book]
Two interpretations:
● change w.r.t. one example;
● change w.r.t. seen vs unseen examples.
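For reference, the standard definition of average stability and the identity the slide alludes to (as in the MLstory chapter, up to notation):

    \Delta(A) \;=\; \mathbb{E}_{S,S'}\!\left[\frac{1}{n}\sum_{i=1}^{n}
      \Big(\operatorname{loss}\big(A(S), z_i'\big) - \operatorname{loss}\big(A(S^{(i)}), z_i'\big)\Big)\right],
    \qquad
    \mathbb{E}\big[\Delta_{\mathrm{gen}}(A(S))\big] \;=\; \Delta(A).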

Uniform stability (vs. average stability)
The two sups here describe a worst-case scenario: the worst-case difference in the predictions of the learning algorithm run on two arbitrary datasets that differ in exactly one point.
Why is this called uniform? – it will hold for every A and every (X, Y); z has nothing to do with S or S’, but is sampled from the same distribution (X, Y).
Uniform stability upper bounds the generalization gap (in expectation). See book for proof.
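The corresponding definition and bound, in their standard form (notation may differ slightly from the slide):

    \Delta_{\sup}(A) \;=\; \sup_{S, S'\ \text{differing in one example}}\ \sup_{z}\
      \big|\operatorname{loss}(A(S), z) - \operatorname{loss}(A(S'), z)\big|,
    \qquad
    \mathbb{E}\big[\Delta_{\mathrm{gen}}(A(S))\big] \;\le\; \Delta_{\sup}(A).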

Strongly convex functions
Convex functions: lower-bounded by tangent lines.
Strictly convex functions.
Strongly convex functions (μ > 0).
Nice properties: lower-bounded by another quadratic function, so there is a unique global minimum, and it’s “fast” to find.
[thanks: EPFL OptML, eq. numbers therein]
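The defining inequalities, in standard form (for a differentiable function f and all x, y):

    \text{convex:}\quad f(y) \;\ge\; f(x) + \langle \nabla f(x),\, y - x\rangle
    \qquad\text{(strictly convex: strict inequality for } y \ne x\text{)},

    \mu\text{-strongly convex:}\quad
    f(y) \;\ge\; f(x) + \langle \nabla f(x),\, y - x\rangle + \tfrac{\mu}{2}\|y - x\|^{2},

    \text{so at a stationary point } w^{*}\text{ of a } \mu\text{-strongly convex } \Phi:\quad
    \Phi(w) - \Phi(w^{*}) \;\ge\; \tfrac{\mu}{2}\|w - w^{*}\|^{2}.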

Strongly convex and L-Lipschitz loss
Goal: show that strong convexity of the loss function is sufficient for the uniform stability of empirical risk minimization.
Two assumptions needed:
● the loss function is L-Lipschitz (in the model parameters);
● the loss function is differentiable and strongly convex.
If Φ: R^d → R is μ-strongly convex and w* is a stationary point, then w* is the (unique) global minimum.

Stability of empirical risk minimisation
See proof in MLstory.
There is no explicit reference to the model class – but what is implied here?
What about regularisation? L2 regularisation turns a convex loss into a μ-strongly convex one.
https://math.stackexchange.com/questions/1106154/any-example-of-strongly-convex-functions-whose-gradients-are-lipschitz-continuou
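The result being referenced, as stated in the standard stability argument (the book’s constants may differ slightly): if the loss is L-Lipschitz in the model parameters and the empirical risk is μ-strongly convex, then ERM is uniformly stable with

    \Delta_{\sup}(\mathrm{ERM}) \;\le\; \frac{4L^{2}}{\mu\, n}
    \qquad\Longrightarrow\qquad
    \mathbb{E}\big[\Delta_{\mathrm{gen}}\big] \;\le\; \frac{4L^{2}}{\mu\, n}.

Note the 1/n rate and the absence of any explicit model-class term – which is what the question on this slide is pointing at.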

Model complexity and uniform convergence
Uniform convergence: bounding the generalization gap from above for all functions in a function class.
Model complexity: counting the number of different functions that can be described with the given model parameters.
Assume the loss function is bounded in [0, 1], and apply Hoeffding’s inequality for a data-independent prediction function f, then a union bound over the class. With probability 1-δ (see the sketch below).
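A sketch of the resulting bound for a finite class Ƒ (the standard Hoeffding-plus-union-bound argument; the slide’s constants may differ): with probability at least 1-δ, for every f ∈ Ƒ,

    \Delta_{\mathrm{gen}}(f) \;\le\; \sqrt{\frac{\log|\mathcal{F}| + \log(2/\delta)}{2n}}.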

VC Dimension (Vapnik–Chervonenkis)
Uniform convergence := bounding the generalization gap from above for all functions in a function class. What happens if |F| is infinite?
The VC-dimension measures the ability of the model class to conform to an arbitrary labeling of a set of points.
Example: linear models over R^d have a VC dimension of d – same as the number of parameters.
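One common form of the resulting uniform-convergence bound, stated up to constants (versions differ across texts): with probability at least 1-δ, for all f in a class of VC dimension d_VC,

    \Delta_{\mathrm{gen}}(f) \;\le\; c\,\sqrt{\frac{d_{\mathrm{VC}}\log(n/d_{\mathrm{VC}}) + \log(1/\delta)}{n}}
    \qquad\text{for an absolute constant } c.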

Margin bounds
Consider all hyperplanes in R^d with norm at most 1/𝛾, and data with bounded norm ||x|| <= D. The VC dimension of these hyperplanes is D²/𝛾².
Slower rate here! But d does not appear – ‘free’ from the curse of dimensionality.

Summary of risk and bounds
● Algorithmic stability: generalization arises when models are insensitive to perturbations in the data on which they are trained.
● VC dimension and Rademacher complexity: how small generalization gaps can arise when we restrict the complexity of models we wish to fit to data.
● Margin bounds: whenever the data is easily separable, good generalization will occur.
● Optimization: how the choice of an algorithmic scheme itself can yield models with desired generalization properties.

(a bite-sized intro to) Generalisation
The notion of generalization
Overparameterization: empirical phenomena
Prelude: three inequalities that we’ll need
Theories of generalization
● Algorithmic stability
● Model complexity and uniform convergence
● Margin bounds
● Generalization from algorithms