
COMP3221:
Distributed Systems

Federated Learning

Dr Nguyen Tran
School of Computer Science

Outline

1. Motivation

2. The First Federated Learning Algorithm: FedAvg

3. Personalization: pFedMe


Why Federated Learning?

[Figure: devices send raw data to the cloud, which returns predictions]

Pros:
✓ Quickly incorporate new data
✓ Privacy

Federated Learning: Applications

Language modeling for voice recognition on mobile phones

Adapting to pedestrian behavior in autonomous vehicles

Predicting low blood sugar via wearable devices

Assumptions?
✓ local data is important
✓ labels available
✓ privacy is a concern

Beyond the cloud (data center): Federated Learning

Challenges:

Communication is expensive
► millions of devices, slower networks

Performance is highly variable
► systems: device heterogeneity, network unreliability, fault tolerance
► statistical: unbalanced data

Personalization is key
► non-IID data, underlying statistical structure

Outline

1. Motivation

2. The First Federated Learning Algorithm: FedAvg

3. Personalization: pFedMe


RECALL

Empirical risk minimization

A popular approach in supervised ML.
Given a loss $\ell$ and data $(x_1, y_1), \dots, (x_n, y_n)$, we estimate a predictor $f$ by minimizing the empirical risk
$\hat{R}(f) = \frac{1}{n} \sum_{j=1}^{n} \ell(f(x_j), y_j)$.
We typically restrict this predictor to lie in some class $\mathcal{F}$:

• could reflect our prior knowledge about the task
• or may be for computational convenience

To train: solve the ERM objective via an iterative optimization method.

Mini-batch SGD

Objective we want to solve: $\min_w f(w)$ where $f(w) := \sum_{j=1}^{n} f_j(w)$

Gradient Descent update: $w_{i+1} = w_i - \eta_i \nabla f(w_i)$

Stochastic Gradient Descent (SGD) update: $w_{i+1} = w_i - \eta_i \nabla f_j(w_i)$, with $j$ sampled at random

Mini-batch SGD update: $w_{i+1} = w_i - \eta_i \nabla f_{\mathcal{I}_i}(w_i)$, with $\mathcal{I}_i$ a randomly sampled mini-batch
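To make the three update rules concrete, here is a minimal NumPy sketch on a synthetic least-squares problem (the data, step size, and names are illustrative, not from the lecture): batch size 1 gives SGD, batch size n gives full gradient descent, anything in between is mini-batch SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad(w, idx):
    """Gradient of the average squared loss over the samples in idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

def run(batch_size, eta=0.1, iters=500):
    """batch_size=1 -> SGD, batch_size=n -> full GD, otherwise mini-batch SGD."""
    w = np.zeros(d)
    for _ in range(iters):
        idx = rng.choice(n, size=batch_size, replace=False)  # sampled observations
        w = w - eta * grad(w, idx)                            # w_{i+1} = w_i - eta * grad
    return w

for b in (1, 32, n):  # SGD, mini-batch SGD, GD
    print(b, np.linalg.norm(run(b) - w_true))
```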

RECALL

SGD, GD, and Mini-batch SGD (map-reduce view)

Each of $M$ workers computes an update $\Delta w_m$; reduce step: $w = w + \sum_{m=1}^{M} \Delta w_m$

• SGD: each update uses one observation
• gradient descent: each update uses all observations
• mini-batch SGD: each update uses some observations

One-shot averaging

map: each machine $m$ finds its optimal local model $w^{\star}_m$

average: $w = \frac{1}{M} \sum_{m=1}^{M} w^{\star}_m$
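A minimal sketch of one-shot averaging, assuming each of M workers can solve its local least-squares problem exactly (data and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
M, n_local, d = 10, 200, 5
w_true = rng.normal(size=d)

# map: each machine m solves for its optimal local model w*_m on its own shard
local_models = []
for m in range(M):
    Xm = rng.normal(size=(n_local, d))
    ym = Xm @ w_true + 0.1 * rng.normal(size=n_local)
    w_star_m, *_ = np.linalg.lstsq(Xm, ym, rcond=None)  # exact local ERM solution
    local_models.append(w_star_m)

# reduce (one shot): average the local optima, w = (1/M) * sum_m w*_m
w_avg = np.mean(local_models, axis=0)
print(np.linalg.norm(w_avg - w_true))
```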

Methods for distributed optimization (recall)

A spectrum trading off local computation against communication:

less local computation, more communication ←——————→ more local computation, less communication

stochastic gradient descent · mini-batch SGD · gradient descent · local-updating methods · divide-and-conquer / one-shot averaging

RECALL

Local-updating methods

Pros:

• Fully flexible in terms of computation vs. communication
• Allows re-use of local solvers

Cons:

• Another parameter to tune
• Not well-understood for non-convex objectives

Question: How can we adapt local updating methods for Federated Learning?

Need to address:
• Non-convex objectives
• Systems heterogeneity
• Statistical heterogeneity


Federated Learning (A Local-Updating Method)

• A fast-developing decentralized ML technique

• One global model, many local models in a network

• Pros: no need to send raw data, preserves privacy

Federated Averaging (FedAvg)

At each communication round:

• Server randomly selects a subset of devices & sends the current global model $w_t$

• Each selected device $k$ updates $w_t$ for several epochs of SGD to optimize its local objective $F_k$ & sends the new local model back

Works well in many settings (especially non-convex)!
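A minimal sketch of the FedAvg loop under these assumptions: a synthetic squared loss per device, equally sized local datasets (so the server can use an unweighted average), and an illustrative `local_sgd` helper — not the exact algorithm from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
num_devices, d = 100, 5
datasets = [(rng.normal(size=(50, d)), rng.normal(size=50)) for _ in range(num_devices)]

def local_sgd(w, X, y, epochs=5, batch=10, eta=0.05):
    """Run a few epochs of mini-batch SGD on this device's squared loss F_k."""
    w = w.copy()
    n = len(y)
    for _ in range(epochs):
        for start in range(0, n, batch):
            Xb, yb = X[start:start + batch], y[start:start + batch]
            w -= eta * Xb.T @ (Xb @ w - yb) / len(yb)
    return w

w_global = np.zeros(d)
for t in range(20):                                                # communication rounds
    selected = rng.choice(num_devices, size=10, replace=False)     # random subset of devices
    local_models = [local_sgd(w_global, *datasets[k]) for k in selected]
    w_global = np.mean(local_models, axis=0)                       # average returned local models
```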

Federated Averaging (FedAvg)


FedAvg vs. Mini-batch SGD

[Figure: CIFAR-10 performance vs. communication rounds (0–3000) for FedAvg with learning rates $\eta = 0.05, 0.15, 0.25$ and FedSGD with $\eta = 0.45, 0.6, 0.7$]

• image recognition benchmark (CIFAR-10)
• simple 4-layer NN
• data partitioned over 100 devices
• > 10x speedup over SGD

Outline

1. Motivation

2. Federated optimization: FedAvg

3. Personalization: pFedMe


Issue of FedAvg

– Statistical heterogeneity: devices’ local data are non-identically distributed

– Global model is not well-generalized to clients’ data

– Question: How can we leverage the global model to find a “personalized model”
tailored to each client’s data?

– pFedMe (Personalized Federated Learning with Moreau Envelope):

– Formulate bi-level optimization problem for personalized FL

– Use the Moreau envelope as the local regularized loss (pFedMe)

Personalized Federated Learning with Moreau Envelopes

Personalized Federated Learning with Moreau Envelopes

[Figure: a server connected to several clients, each holding local data; per round: broadcast global model, update local model, aggregation phase; model updates flow over the network connections]

Personalized Federated Learning: Applications

pFedMe: Problem Formulation

Conventional Federated Learning: there are $N$ clients communicating with a server to solve

$\min_{w \in \mathbb{R}^d} \Big\{ f(w) := \frac{1}{N} \sum_{i=1}^{N} f_i(w) \Big\}$   (1)

to find a global model $w$. The function $f_i : \mathbb{R}^d \to \mathbb{R}$ denotes the expected loss over the data distribution of client $i$:

$f_i(w) = \mathbb{E}_{\xi_i}\big[ \tilde{f}_i(w; \xi_i) \big]$,

where $\xi_i$ is a random data sample drawn according to the distribution of client $i$. Since clients’ data may come from different environments, contexts, and applications, clients can have non-i.i.d. data: the distributions of $\xi_i$ and $\xi_j$, $i \neq j$, are distinct.

Reformulated loss function: instead of solving (1), pFedMe uses a regularized loss with an $\ell_2$-norm term for each client:

$f_i(\theta_i) + \frac{\lambda}{2} \| \theta_i - w \|^2$   (2)

$\theta_i$ is the personalized model of each client


$\lambda$ controls how much the global model affects the personalized model: a large $\lambda$ can benefit clients with unreliable data through the aggregated data of others, while a small $\lambda$ lets clients with sufficient useful data prioritize personalization.

Personalized Federated Learning with Moreau Envelopes, NeurIPS 2020

Problem Formulation

Personalized Federated Learning with Moreau Envelopes, NeurIPS 2020

Main idea:

• Global model is a “central point” where all clients meet
• Personalized models adapt to clients’ heterogeneous data from global model

Bi-level optimization problem:

pFedMe: $\min_{w \in \mathbb{R}^d} \Big\{ F(w) := \frac{1}{N} \sum_{i=1}^{N} F_i(w) \Big\}$, where $F_i(w) = \min_{\theta_i \in \mathbb{R}^d} \Big\{ f_i(\theta_i) + \frac{\lambda}{2} \| \theta_i - w \|^2 \Big\}$

While $w$ is found at the outer level by exploiting the data aggregation from multiple clients, $\theta_i$ is optimized at the inner level w.r.t. client $i$’s data distribution and is kept within a bounded distance of $w$. $F_i(w)$ is the well-known Moreau envelope of $f_i$, and the optimal personalized model (the unique solution of the inner problem, also known as the proximal operator) is

$\hat{\theta}_i(w) := \mathrm{prox}_{f_i/\lambda}(w) = \arg\min_{\theta_i \in \mathbb{R}^d} \Big\{ f_i(\theta_i) + \frac{\lambda}{2} \| \theta_i - w \|^2 \Big\}$   (3)

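As an illustration of the inner problem (3), the sketch below approximates the proximal operator $\hat{\theta}_i(w)$ by plain gradient descent on the regularized objective, using an assumed quadratic client loss; all names and constants are illustrative, not the paper's implementation.

```python
import numpy as np

def prox_step(w, grad_fi, lam, steps=50, eta=0.05):
    """Approximate theta_hat_i(w) = argmin_theta f_i(theta) + (lam/2)*||theta - w||^2
    by gradient descent on the regularized objective (the inner problem of pFedMe)."""
    theta = w.copy()
    for _ in range(steps):
        theta -= eta * (grad_fi(theta) + lam * (theta - w))
    return theta

# illustrative client loss f_i(theta) = (1/2n)*||X theta - y||^2
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
grad_fi = lambda theta: X.T @ (X @ theta - y) / len(y)

w = np.zeros(5)
theta_hat = prox_step(w, grad_fi, lam=15.0)
print(theta_hat)
```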

Algorithm Design: pFedMe

Personalized Federated Learning with Moreau Envelopes, NeurIPS 2020

Key steps:
• Server broadcasts global model to all clients
• All clients perform local updates
• Server uniformly samples subset of clients for model averaging

Practical aspects:
• Sample the local gradients using mini-batches

• Iteratively solve for personalized model (rather than closed-form)

Algorithm 1 — pFedMe: Personalized Federated Learning using Moreau Envelope Algorithm

1: input: $T$, $R$, $S$, $\lambda$, $\eta$, $\beta$, $w^0$
2: for $t = 0, \dots, T-1$ do   ▷ Global communication rounds
3:   Server sends $w^t$ to all clients
4:   for all clients $i = 1, \dots, N$ do
5:     $w^t_{i,0} = w^t$
6:     for $r = 0, \dots, R-1$ do   ▷ Local update rounds
7:       Sample a fresh mini-batch $\mathcal{D}_i$ of size $|\mathcal{D}|$ and minimize $\tilde{h}_i(\theta_i; w^t_{i,r}, \mathcal{D}_i)$, defined in (7), up to an accuracy level according to (8), to find a $\delta$-approximate $\tilde{\theta}_i(w^t_{i,r})$
8:       $w^t_{i,r+1} = w^t_{i,r} - \eta \lambda \big( w^t_{i,r} - \tilde{\theta}_i(w^t_{i,r}) \big)$
9:   Server uniformly samples a subset $\mathcal{S}^t$ of $S$ clients; each sampled client $i \in \mathcal{S}^t$ sends its local model $w^t_{i,R}$ to the server
10:  Server updates the global model: $w^{t+1} = (1 - \beta)\, w^t + \beta \frac{1}{S} \sum_{i \in \mathcal{S}^t} w^t_{i,R}$

In practice (line 7), the exact gradient $\nabla f_i(\theta_i)$ is replaced by the unbiased mini-batch estimate $\nabla \tilde{f}_i(\theta_i, \mathcal{D}_i) := \frac{1}{|\mathcal{D}_i|} \sum_{\xi_i \in \mathcal{D}_i} \nabla \tilde{f}_i(\theta_i; \xi_i)$, and since $\hat{\theta}_i(w^t_{i,r})$ is generally not available in closed form, an iterative first-order method is used to obtain a high-accuracy approximation $\tilde{\theta}_i(w^t_{i,r})$ of the minimizer of

$\tilde{h}_i(\theta_i; w^t_{i,r}, \mathcal{D}_i) := \tilde{f}_i(\theta_i; \mathcal{D}_i) + \frac{\lambda}{2} \| \theta_i - w^t_{i,r} \|^2$.   (7)


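A rough sketch of one client's local update rounds (lines 5–8 of Algorithm 1), assuming a quadratic client loss and a fixed number of inner gradient steps in place of the accuracy condition (8); helper names and hyper-parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
X_i = rng.normal(size=(500, 5))     # client i's local data (illustrative)
y_i = rng.normal(size=500)

def approx_theta(w_ir, Xb, yb, lam, steps=30, eta=0.05):
    """Approximately minimize h_i(theta) = f_i(theta; D_i) + (lam/2)*||theta - w_ir||^2."""
    theta = w_ir.copy()
    for _ in range(steps):
        g = Xb.T @ (Xb @ theta - yb) / len(yb) + lam * (theta - w_ir)
        theta -= eta * g
    return theta

def client_update(w_t, R=5, batch=32, lam=15.0, eta=0.01):
    """Local update rounds (Algorithm 1, lines 5-8)."""
    w_ir = w_t.copy()
    for r in range(R):
        idx = rng.choice(len(y_i), size=batch, replace=False)   # fresh mini-batch D_i
        theta_tilde = approx_theta(w_ir, X_i[idx], y_i[idx], lam)
        w_ir = w_ir - eta * lam * (w_ir - theta_tilde)           # line 8
    return w_ir

print(client_update(np.zeros(5)))
```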

pFedMe: Algorithm Design

Personalized Federated Learning with Moreau Envelopes, NeurIPS 2020

Server update (Algorithm 1, line 10): $w^{t+1} = (1 - \beta)\, w^t + \beta \frac{1}{S} \sum_{i \in \mathcal{S}^t} w^t_{i,R}$

Note: when 𝛽 = 1 we have FedAvg’s model averaging

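The server-side step is just a convex combination of the previous global model and the average of the sampled clients' local models; a tiny illustrative sketch (names and values are hypothetical):

```python
import numpy as np

def server_update(w_t, sampled_local_models, beta):
    """w_{t+1} = (1 - beta) * w_t + beta * average of sampled clients' local models.
    With beta = 1 this reduces to FedAvg's plain model averaging."""
    avg = np.mean(sampled_local_models, axis=0)
    return (1.0 - beta) * w_t + beta * avg

w_t = np.zeros(5)
local_models = [np.ones(5), 2 * np.ones(5)]          # e.g. models from S = 2 sampled clients
print(server_update(w_t, local_models, beta=1.0))    # FedAvg-style averaging -> 1.5
print(server_update(w_t, local_models, beta=0.5))    # partial move toward the average -> 0.75
```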


pFedMe: Algorithm Design

Algorithm: Repeat until convergence

1. Server sends global model to all clients

Personalized Federated Learning with Moreau Envelopes, NeurIPS 2020


pFedMe: Algorithm Design

Algorithm: Repeat until convergence

1. Server sends global model to all clients
2. Each client solves for its personalized model $\hat{\theta}_i(w)$ in (3):

$\hat{\theta}_i(w) := \mathrm{prox}_{f_i/\lambda}(w) = \arg\min_{\theta_i \in \mathbb{R}^d} \Big\{ f_i(\theta_i) + \frac{\lambda}{2} \| \theta_i - w \|^2 \Big\}$

Personalized Federated Learning with Moreau Envelopes, NeurIPS 2020

3 Personalized Federated Learning with Moreau Envelopes (pFedMe)89

3.1 pFedMe: Problem Formulation90

In conventional FL, there are N clients communicating with a server to solve the following problem:91

min
w2Rd

n
f(w) :=

1

N

XN
i=1

fi(w)
o

(1)

to find a global model w. The function fi : Rd ! R, i = 1, . . . , N , denotes the expected loss over92
the data distribution of the client i:93

fi(w) = E⇠i

f̃i(w; ⇠i)


,

where ⇠i is a random data sample drawn according to the distribution of client i and f̃i(w; ⇠i) is a94
loss function corresponding to this sample and w. In FL, since clients’ data possibly come from95
different environments, contexts, and applications, clients can have non-i.i.d. data distributions, i.e.,96
the distributions of ⇠i and ⇠j , i 6= j, are distinct.97

Instead of solving the traditional FL problem (1), we take a different approach by using a regularized98
loss function with l2-norm for each client as follows99

fi(✓i) +

2
k✓i � wk

2, (2)

where ✓i denotes the personalized model of client i and � is a regularization parameter that controls100
the strength of w to the personalized model. While large � can benefit clients with unreliable101
data from the abundant data aggregation, small � helps clients with sufficient useful data prioritize102
personalization. Note that � 2 (0,1) to avoid extreme cases of � = 0, i.e., no FL, or � = 1, i.e.,103
no personalized FL. Overall, the idea is allowing clients to pursue their own models with different104
directions, but not to stay far away from the “reference point” w, to which every client contributes.105
Based on this, the personalized FL can be formulated as a bi-level problem:106

pFedMe : min
w2Rd

n
F (w) :=

1

N

XN
i=1

Fi(w)
o
, where Fi(w) = min

✓i2Rd

n
fi(✓i) +

2
k✓i � wk

2
o
.

In pFedMe, while w is found by exploiting the data aggregation from multiple clients at the outer107
level, ✓i is optimized with respect to (w.r.t) client i’s data distribution and is maintained a bounded108
distance from w at the inner level. The definition of Fi(w) is the well-known Moreau envelope,109
which facilitates several learning algorithm designs [39, 40]. The optimal personalized model, which110
is the unique solution to the inner problem of pFedMe and also known as the proximal operator in the111
literature, is defined as follows:112

✓̂i(w) := proxfi/�(w) = argmin
✓i2Rd

n
fi(✓i) +

2
k✓i � wk

2
o
. (3)

For comparison, we consider Per-FedAvg [8], which arguably has the closest formulation to pFedMe:113

min
w2Rd

n
F (w) :=

1

N

XN
i=1

fi

✓i(w)

�o
, where ✓i(w) = w � ↵rfi(w). (4)

Based on the MAML framework [31], Per-FedAvg aims to find a global model w which client i can114
use as an initialization to perform one more step of gradient update (with step size ↵) w.r.t its own115
loss function to obtain its personalized model ✓i(w).116

Compared to Per-FedAvg, our problem has a similar meaning of w as a “meta-model”, but instead of117
using w as the initialization, we parallelly pursue both the personalized and global models by solving118
a bi-level problem, which has several benefits. First, while Per-FedAvg is optimized for one-step119
gradient update for its personalized model, pFedMe is agnostic to the inner optimizer, which means120
(3) can be solved using any iterative approach with multi-step updates. Second, by re-writing the121
personalized model update of Per-FedAvg as122

✓i(w) = w � ↵rfi(w) = argmin
✓i2Rd

n
hrfi(w), ✓i � wi+

1

2↵
k✓i � wk

2
o
, (5)

3


Algorithm 1 pFedMe: Personalized Federated Learning using Moreau Envelope Algorithm
1: input: T, R, S, λ, η, β, w^0
2: for t = 0, ..., T − 1 do   ▷ Global communication rounds
3:     Server sends w^t to all clients
4:     for all i = 1 to N do
5:         w^t_{i,0} = w^t
6:         for r = 0, ..., R − 1 do   ▷ Local update rounds
7:             Sample a fresh mini-batch D_i of size |D| and minimize h̃_i(θ_i; w^t_{i,r}, D_i), defined in (7), up to the accuracy level in (8), to obtain a δ-approximate θ̃_i(w^t_{i,r})
8:             w^t_{i,r+1} = w^t_{i,r} − ηλ (w^t_{i,r} − θ̃_i(w^t_{i,r}))
9:     Server uniformly samples a subset S^t of clients with size S, and each sampled client i ∈ S^t sends its local model w^t_{i,R} to the server
10:    Server updates the global model: w^{t+1} = (1 − β) w^t + (β/S) Σ_{i∈S^t} w^t_{i,R}
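To make the control flow of Algorithm 1 concrete, the following Python/NumPy sketch runs the loop end-to-end on hypothetical quadratic client losses; the losses, the hyper-parameter values, and the fixed-step inner solver are illustrative assumptions rather than the paper's implementation.

import numpy as np

rng = np.random.default_rng(0)
N, d = 10, 5                               # number of clients, model dimension
centers = rng.normal(size=(N, d))          # hypothetical client optima: f_i(x) = 0.5*||x - c_i||^2

def grad_f(i, theta):
    # Gradient of the toy client loss; a real client would return a mini-batch gradient here.
    return theta - centers[i]

def solve_personalized(i, w_local, lam, steps=30, lr=0.05):
    # Approximately solve (3)/(7): min_theta f_i(theta) + (lam/2)*||theta - w_local||^2.
    theta = w_local.copy()
    for _ in range(steps):
        theta -= lr * (grad_f(i, theta) + lam * (theta - w_local))
    return theta

# Illustrative hyper-parameters: T global rounds, R local rounds, S sampled clients,
# lam regularization, eta local step size, beta server averaging parameter.
T, R, S, lam, eta, beta = 50, 5, 4, 15.0, 0.02, 0.5
w = np.zeros(d)                                            # w^0

for t in range(T):                                         # line 2: global rounds
    local = {}
    for i in range(N):                                     # lines 4-8: every client updates
        w_i = w.copy()                                     # line 5: w^t_{i,0} = w^t
        for r in range(R):                                 # line 6: local rounds
            theta_i = solve_personalized(i, w_i, lam)      # line 7: delta-approximate personalized model
            w_i = w_i - eta * lam * (w_i - theta_i)        # line 8: local model update
        local[i] = w_i
    sampled = rng.choice(N, size=S, replace=False)         # line 9: sample S clients
    w = (1 - beta) * w + beta * np.mean([local[i] for i in sampled], axis=0)  # line 10

print("global model w^T      :", np.round(w, 3))
print("mean of client optima :", np.round(centers.mean(axis=0), 3))

With these toy losses the global model drifts toward the average of the clients' optima, while each client's personalized model stays closer to its own optimum, which is the intended effect of the λ-regularized inner problem.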

The local update in line 8 of Algorithm 1 uses the δ-approximate θ̃_i(w^t_{i,r}) rather than the exact θ̂_i(w^t_{i,r}); the reason is two-fold. First, obtaining θ̂_i(w^t_{i,r}) according to (3) usually needs the gradient ∇f_i(θ_i), which, however, requires the distribution of ξ_i. In practice, we use the following unbiased estimate of ∇f_i(θ_i), obtained by sampling a mini-batch of data D_i:

\nabla \tilde{f}_i(\theta_i, \mathcal{D}_i) := \frac{1}{|\mathcal{D}_i|} \sum_{\xi_i \in \mathcal{D}_i} \nabla \tilde{f}_i(\theta_i; \xi_i), \qquad (6)

such that \mathbb{E}[\nabla \tilde{f}_i(\theta_i, \mathcal{D}_i)] = \nabla f_i(\theta_i). Second, in general it is not straightforward to obtain θ̂_i(w^t_{i,r}) in closed form; instead, we usually use an iterative first-order approach to obtain an approximation θ̃_i(w^t_{i,r}) with high accuracy. Defining

\tilde{h}_i(\theta_i; w^t_{i,r}, \mathcal{D}_i) := \tilde{f}_i(\theta_i; \mathcal{D}_i) + \frac{\lambda}{2} \|\theta_i - w^t_{i,r}\|^2, \qquad (7)

and choosing λ such that \tilde{h}_i(\theta_i; w^t_{i,r}, \mathcal{D}_i) is strongly convex with condition number κ (which quantifies how hard it is to optimize (7)), we can apply gradient descent (resp. Nesterov's accelerated gradient descent) to obtain θ̃_i(w^t_{i,r}) such that

\|\nabla \tilde{h}_i(\tilde{\theta}_i; w^t_{i,r}, \mathcal{D}_i)\|^2 \le \nu, \qquad (8)

with the number of ∇\tilde{h}_i computations K := O(κ \log(d/ν)) (resp. O(\sqrt{κ} \log(d/ν))) [50], where d is the diameter of the search space, ν is an accuracy level, and O(·) hides constants. The computation complexity of each client in pFedMe is therefore K times that of FedAvg. In the following lemma, we show how δ can be adjusted by controlling (i) the sampling noise, via the mini-batch size |D|, and (ii) the accuracy level ν.
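As a sketch of this inner solve (under the assumptions above: a differentiable mini-batch loss and λ large enough for strong convexity), one could run gradient descent on (7) until criterion (8) holds, returning θ̃_i together with the number K of gradient evaluations; the quadratic example loss and the constants below are hypothetical.

import numpy as np

def solve_inner(grad_f_batch, w_ref, lam, nu, lr, max_iters=10000):
    # Gradient descent on h(theta) = f_batch(theta) + (lam/2)*||theta - w_ref||^2,
    # stopped once ||grad h(theta)||^2 <= nu, as in criterion (8).
    # Returns the approximate minimizer theta_tilde and the number K of gradient evaluations.
    theta = w_ref.copy()
    for k in range(1, max_iters + 1):
        g = grad_f_batch(theta) + lam * (theta - w_ref)   # gradient of (7)
        if g @ g <= nu:                                    # stopping rule (8)
            return theta, k
        theta -= lr * g
    return theta, max_iters

# Example with a hypothetical mini-batch loss whose gradient is theta - c.
c = np.array([1.0, -2.0, 0.5])
theta_tilde, K = solve_inner(lambda th: th - c, w_ref=np.zeros(3), lam=20.0, nu=1e-8, lr=0.04)
print("theta_tilde:", np.round(theta_tilde, 4), "  gradient evaluations K =", K)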
Lemma 1. Let θ̃_i(w^t_{i,r}) be a solution to (8). Then

\mathbb{E}\big[\|\tilde{\theta}_i(w^t_{i,r}) - \hat{\theta}_i(w^t_{i,r})\|^2\big] \le \delta^2 :=
\begin{cases}
\frac{2}{(\lambda+\mu)^2}\Big(\frac{\sigma_f^2}{|\mathcal{D}|} + \nu\Big), & \text{if Assumption 1(a) holds;} \\
\frac{2}{(\lambda-L)^2}\Big(\frac{\sigma_f^2}{|\mathcal{D}|} + \nu\Big), & \text{if Assumption 1(b) holds and } \lambda > L.
\end{cases}

4 pFedMe: Convergence Analysis

In this section, we present the convergence of pFedMe. We first prove an intermediate result.

Lemma 2. Recall the definition of the Moreau envelope F_i in pFedMe. We have:

(a) Let Assumption 1(a) hold. Then

\frac{1}{N}\sum_{i=1}^{N} \|\nabla F_i(w) - \nabla F(w)\|^2 \le 4 L_F \big(F(w) - F(w^*)\big) + \underbrace{\frac{2}{N}\sum_{i=1}^{N} \|\nabla F_i(w^*)\|^2}_{=:\,\sigma_{F,1}^2}.




Personalized Federated Learning with Moreau Envelopes, NeurIPS 2020

Algorithm: Repeat until convergence

1. Server sends the global model w^t to all clients

2. Each client solves for its personalized model, θ̃_i(w^t_{i,r}), in line 7 of Algorithm 1

3. Each client updates its local model, w^t_{i,r+1}, in line 8 of Algorithm 1

4. Server selects a subset of clients, and these clients send
their updated local models to the server for aggregation

3 Personalized Federated Learning with Moreau Envelopes (pFedMe)89

3.1 pFedMe: Problem Formulation90

In conventional FL, there are N clients communicating with a server to solve the following problem:91

min
w2Rd

n
f(w) :=

1

N

XN
i=1

fi(w)
o

(1)

to find a global model w. The function fi : Rd ! R, i = 1, . . . , N , denotes the expected loss over92
the data distribution of the client i:93

fi(w) = E⇠i

f̃i(w; ⇠i)


,

where ⇠i is a random data sample drawn according to the distribution of client i and f̃i(w; ⇠i) is a94
loss function corresponding to this sample and w. In FL, since clients’ data possibly come from95
different environments, contexts, and applications, clients can have non-i.i.d. data distributions, i.e.,96
the distributions of ⇠i and ⇠j , i 6= j, are distinct.97

Instead of solving the traditional FL problem (1), we take a different approach by using a regularized98
loss function with l2-norm for each client as follows99

fi(✓i) +

2
k✓i � wk

2, (2)

where ✓i denotes the personalized model of client i and � is a regularization parameter that controls100
the strength of w to the personalized model. While large � can benefit clients with unreliable101
data from the abundant data aggregation, small � helps clients with sufficient useful data prioritize102
personalization. Note that � 2 (0,1) to avoid extreme cases of � = 0, i.e., no FL, or � = 1, i.e.,103
no personalized FL. Overall, the idea is allowing clients to pursue their own models with different104
directions, but not to stay far away from the “reference point” w, to which every client contributes.105
Based on this, the personalized FL can be formulated as a bi-level problem:106

pFedMe : min
w2Rd

n
F (w) :=

1

N

XN
i=1

Fi(w)
o
, where Fi(w) = min

✓i2Rd

n
fi(✓i) +

2
k✓i � wk

2
o
.

In pFedMe, while w is found by exploiting the data aggregation from multiple clients at the outer107
level, ✓i is optimized with respect to (w.r.t) client i’s data distribution and is maintained a bounded108
distance from w at the inner level. The definition of Fi(w) is the well-known Moreau envelope,109
which facilitates several learning algorithm designs [39, 40]. The optimal personalized model, which110
is the unique solution to the inner problem of pFedMe and also known as the proximal operator in the111
literature, is defined as follows:112

✓̂i(w) := proxfi/�(w) = argmin
✓i2Rd

n
fi(✓i) +

2
k✓i � wk

2
o
. (3)

For comparison, we consider Per-FedAvg [8], which arguably has the closest formulation to pFedMe:113

min
w2Rd

n
F (w) :=

1

N

XN
i=1

fi

✓i(w)

�o
, where ✓i(w) = w � ↵rfi(w). (4)

Based on the MAML framework [31], Per-FedAvg aims to find a global model w which client i can114
use as an initialization to perform one more step of gradient update (with step size ↵) w.r.t its own115
loss function to obtain its personalized model ✓i(w).116

Compared to Per-FedAvg, our problem has a similar meaning of w as a “meta-model”, but instead of117
using w as the initialization, we parallelly pursue both the personalized and global models by solving118
a bi-level problem, which has several benefits. First, while Per-FedAvg is optimized for one-step119
gradient update for its personalized model, pFedMe is agnostic to the inner optimizer, which means120
(3) can be solved using any iterative approach with multi-step updates. Second, by re-writing the121
personalized model update of Per-FedAvg as122

✓i(w) = w � ↵rfi(w) = argmin
✓i2Rd

n
hrfi(w), ✓i � wi+

1

2↵
k✓i � wk

2
o
, (5)

3

called MOCHA, addressing both systems and statistical heterogeneity. For more details about FL, its
challenges, and personalization approaches, we refer the readers to comprehensive surveys in [38,39].

3 Personalized Federated Learning with Moreau Envelopes (pFedMe)

3.1 pFedMe: Problem Formulation

In conventional FL, there are N clients communicating with a server to solve the following problem:

min
w2Rd

n
f(w) :=

1

N

XN
i=1

fi(w)
o

(1)

to find a global model w. The function fi : Rd ! R, i = 1, . . . , N , denotes the expected loss over
the data distribution of the client i:

fi(w) = E⇠i

f̃i(w; ⇠i)


,

where ⇠i is a random data sample drawn according to the distribution of client i and f̃i(w; ⇠i) is a
loss function corresponding to this sample and w. In FL, since clients’ data possibly come from
different environments, contexts, and applications, clients can have non-i.i.d. data distributions, i.e.,
the distributions of ⇠i and ⇠j , i 6= j, are distinct.

Instead of solving the traditional FL problem (1), we take a different approach by using a regularized
loss function with l2-norm for each client as follows

fi(✓i) +

2
k✓i � wk

2, (2)

where ✓i denotes the personalized model of client i and � is a regularization parameter that controls
the strength of w to the personalized model. While large � can benefit clients with unreliable
data from the abundant data aggregation, small � helps clients with sufficient useful data prioritize
personalization. Note that � 2 (0,1) to avoid extreme cases of � = 0, i.e., no FL, or � = 1, i.e.,
no personalized FL. Overall, the idea is allowing clients to pursue their own models with different
directions, but not to stay far away from the “reference point” w, to which every client contributes.
Based on this, the personalized FL can be formulated as a bi-level problem:

pFedMe : min
w2Rd

n
F (w) :=

1

N

XN
i=1

Fi(w)
o
, where Fi(w) = min

✓i2Rd

n
fi(✓i) +

2
k✓i � wk

2
o
.

In pFedMe, while w is found by exploiting the data aggregation from multiple clients at the outer
level, ✓i is optimized with respect to (w.r.t) client i’s data distribution and is maintained a bounded
distance from w at the inner level. The definition of Fi(w) is the well-known Moreau envelope,
which facilitates several learning algorithm designs [40, 41]. The optimal personalized model, which
is the unique solution to the inner problem of pFedMe and also known as the proximal operator in the
literature, is defined as follows:

✓̂i(w) := proxfi/�(w) = argmin
✓i2Rd

n
fi(✓i) +

2
k✓i � wk

2
o
. (3)

For comparison, we consider Per-FedAvg [8], which arguably has the closest formulation to pFedMe:

min
w2Rd

n
F (w) :=

1

N

XN
i=1

fi

✓i(w)

�o
, where ✓i(w) = w � ↵rfi(w). (4)

Based on the MAML framework [32], Per-FedAvg aims to find a global model w which client i can
use as an initialization to perform one more step of gradient update (with step size ↵) w.r.t its own
loss function to obtain its personalized model ✓i(w).

Compared to Per-FedAvg, our problem has a similar meaning of w as a “meta-model”, but instead
of using w as the initialization, we, in parallel, pursue both the personalized and global models by
solving a bi-level problem, which has several benefits. First, while Per-FedAvg is optimized for
one-step gradient update for its personalized model, pFedMe is agnostic to the inner optimizer, which
means (3) can be solved using any iterative approach with multi-step updates. Second, by re-writing
the personalized model update of Per-FedAvg as

✓i(w) = w � ↵rfi(w) = argmin
✓i2Rd

n
hrfi(w), ✓i � wi+

1

2↵
k✓i � wk

2
o
, (5)

3

Algorithm 1 pFedMe: Personalized Federated Learning using Moreau Envelope Algorithm
1: input: T , R, S, �, ⌘, �, w0
2: for t = 0 to T � 1 do . Global communication rounds
3: Server sends wt to all clients
4: for all i = 1 to N do
5: wti,0 = wt
6: for r = 0 toR� 1 do . Local update rounds
7: Sample a fresh mini-batch Di with size |D| and minimize h̃i(✓i;wti,r,Di), defined in

(7), up to an accuracy level according to (8) to find a �-approximate ✓̃i(wti,r)
8: wti,r+1 = w

t
i,r � ⌘�(w

t
i,r � ✓̃i(w

t
i,r))

9: Server uniformly samples a subset of clients St with size S, and each of the sampled client
sends the local model wti,R, 8i 2 S

t, to the server

10: Server updates the global model: wt+1 = (1� �)wt + �
P

i2St
wti,R
S

(c.f. line 8). The reason for using the �-approximate ✓̃i(wti,r) is two-fold. First, obtaining ✓̂i(w
t
i,r)

according to (3) usually needs the gradient rfi(✓i), which, however, requires the distribution of ⇠i.
In practice, we use the following unbiased estimate of rfi(✓i) by sampling a mini-batch of data Di

rf̃i (✓i,Di) :=
1

|Di|

X
⇠i2Di

rf̃i(✓i; ⇠i)

such that E[rf̃i (✓i,Di)] = rfi(✓i). Second, in general, it is not straightforward to obtain ✓̂i(wti,r)
in closed-form. Instead we usually use iterative first-order approach to obtain an approximate ✓̃i(wti,r)
with high accuracy. Defining

h̃i(✓i;w
t
i,r,Di) := f̃i(✓i;Di) +

2
k✓i � w

t
i,rk

2, (7)

suppose we choose � such that h̃i(✓i;wti,r,Di) is strongly convex with a condition number  (which
quantifies how hard to optimize (7)), then we can apply gradient descent (resp. Nesterov’s accelerated
gradient descent) to obtain ✓̃i(wti,r) such that

krh̃i(✓̃i;w
t
i,r,Di)k

2
 ⌫, (8)

with the number of rh̃i computations K := O

 log


d

���
resp. O

�p
 log


d

���
[50], where d is

the diameter of the search space, ⌫ is an accuracy level, and O(·) hides constants. The computation
complexity of each client in pFedMe is K times that in FedAvg. In the following lemma, we show
how � can be adjusted by controlling the (i) sampling noise using mini-batch size |D| and (ii) accuracy
level ⌫.
Lemma 1. Let ✓̃i(wti,r) be a solution to (8), we have

E
h
k✓̃i(w

t
i,r)� ✓̂i(w

t
i,r)k

2
i
 �2 :=

8
< : 2 (�+µ)2 ⇣ �2f |D| + ⌫ ⌘ , if Assumption 1(a) holds; 2 (��L)2 ⇣ �2f |D| + ⌫ ⌘ , if Assumption 1(b) holds, and � > L.

4 pFedMe: Convergence Analysis

In this section, we present the convergence of pFedMe. We first prove an intermediate result.
Lemma 2. Recall the definition of the Moreau envelope Fi in pFedMe, we have

(a) Let Assumption 1(a) hold, then we have

1

N

XN
i=1

krFi(w)�rF (w)k
2
 4LF (F (w)� F (w

⇤)) + 2
1

N

XN
i=1

krFi(w
⇤)k2

| {z }
=:�2F,1

.

5

called MOCHA, addressing both systems and statistical heterogeneity. For more details about FL, its
challenges, and personalization approaches, we refer the readers to comprehensive surveys in [38,39].

3 Personalized Federated Learning with Moreau Envelopes (pFedMe)

3.1 pFedMe: Problem Formulation

In conventional FL, there are N clients communicating with a server to solve the following problem:

min
w2Rd

n
f(w) :=

1

N

XN
i=1

fi(w)
o

(1)

to find a global model w. The function fi : Rd ! R, i = 1, . . . , N , denotes the expected loss over
the data distribution of the client i:

fi(w) = E⇠i

f̃i(w; ⇠i)


,

where ⇠i is a random data sample drawn according to the distribution of client i and f̃i(w; ⇠i) is a
loss function corresponding to this sample and w. In FL, since clients’ data possibly come from
different environments, contexts, and applications, clients can have non-i.i.d. data distributions, i.e.,
the distributions of ⇠i and ⇠j , i 6= j, are distinct.

Instead of solving the traditional FL problem (1), we take a different approach by using a regularized
loss function with l2-norm for each client as follows

fi(✓i) +

2
k✓i � wk

2, (2)

where ✓i denotes the personalized model of client i and � is a regularization parameter that controls
the strength of w to the personalized model. While large � can benefit clients with unreliable
data from the abundant data aggregation, small � helps clients with sufficient useful data prioritize
personalization. Note that � 2 (0,1) to avoid extreme cases of � = 0, i.e., no FL, or � = 1, i.e.,
no personalized FL. Overall, the idea is allowing clients to pursue their own models with different
directions, but not to stay far away from the “reference point” w, to which every client contributes.
Based on this, the personalized FL can be formulated as a bi-level problem:

pFedMe : min
w2Rd

n
F (w) :=

1

N

XN
i=1

Fi(w)
o
, where Fi(w) = min

✓i2Rd

n
fi(✓i) +

2
k✓i � wk

2
o
.

In pFedMe, while w is found by exploiting the data aggregation from multiple clients at the outer
level, ✓i is optimized with respect to (w.r.t) client i’s data distribution and is maintained a bounded
distance from w at the inner level. The definition of Fi(w) is the well-known Moreau envelope,
which facilitates several learning algorithm designs [40, 41]. The optimal personalized model, which
is the unique solution to the inner problem of pFedMe and also known as the proximal operator in the
literature, is defined as follows:

✓̂i(w) := proxfi/�(w) = argmin
✓i2Rd

n
fi(✓i) +

2
k✓i � wk

2
o
. (3)

For comparison, we consider Per-FedAvg [8], which arguably has the closest formulation to pFedMe:

min
w2Rd

n
F (w) :=

1

N

XN
i=1

fi

✓i(w)

�o
, where ✓i(w) = w � ↵rfi(w). (4)

Based on the MAML framework [32], Per-FedAvg aims to find a global model w which client i can
use as an initialization to perform one more step of gradient update (with step size ↵) w.r.t its own
loss function to obtain its personalized model ✓i(w).

Compared to Per-FedAvg, our problem has a similar meaning of w as a “meta-model”, but instead
of using w as the initialization, we, in parallel, pursue both the personalized and global models by
solving a bi-level problem, which has several benefits. First, while Per-FedAvg is optimized for
one-step gradient update for its personalized model, pFedMe is agnostic to the inner optimizer, which
means (3) can be solved using any iterative approach with multi-step updates. Second, by re-writing
the personalized model update of Per-FedAvg as

✓i(w) = w � ↵rfi(w) = argmin
✓i2Rd

n
hrfi(w), ✓i � wi+

1

2↵
k✓i � wk

2
o
, (5)

3

26

Algorithm 1 pFedMe: Personalized Federated Learning using Moreau Envelopes
1: input: T, R, S, λ, η, β, w^0
2: for t = 0 to T − 1 do        ▷ Global communication rounds
3:   Server sends w^t to all clients
4:   for all i = 1 to N do
5:     w^t_{i,0} = w^t
6:     for r = 0 to R − 1 do    ▷ Local update rounds
7:       Sample a fresh mini-batch D_i of size |D| and minimize h̃_i(θ_i; w^t_{i,r}, D_i), defined in (7), up to the accuracy level in (8), to obtain a δ-approximate θ̃_i(w^t_{i,r})
8:       w^t_{i,r+1} = w^t_{i,r} − ηλ(w^t_{i,r} − θ̃_i(w^t_{i,r}))
9:   Server uniformly samples a subset S^t of clients with size S; each sampled client sends its local model w^t_{i,R}, ∀i ∈ S^t, to the server
10:  Server updates the global model: w^{t+1} = (1 − β) w^t + (β/S) Σ_{i ∈ S^t} w^t_{i,R}

The local model update (line 8 of Algorithm 1) uses a δ-approximate θ̃_i(w^t_{i,r}) instead of the exact θ̂_i(w^t_{i,r}), for two reasons. First, obtaining θ̂_i(w^t_{i,r}) according to (3) usually requires the gradient ∇f_i(θ_i), which in turn requires the distribution of ξ_i. In practice, we use the following unbiased estimate of ∇f_i(θ_i), obtained by sampling a mini-batch of data D_i:

$$\nabla \tilde{f}_i(\theta_i, \mathcal{D}_i) := \frac{1}{|\mathcal{D}_i|} \sum_{\xi_i \in \mathcal{D}_i} \nabla \tilde{f}_i(\theta_i; \xi_i),$$

such that E[∇f̃_i(θ_i, D_i)] = ∇f_i(θ_i). Second, in general it is not straightforward to obtain θ̂_i(w^t_{i,r}) in closed form; instead, we usually use an iterative first-order method to obtain an approximation θ̃_i(w^t_{i,r}) with high accuracy. Defining

$$\tilde{h}_i(\theta_i; w^t_{i,r}, \mathcal{D}_i) := \tilde{f}_i(\theta_i; \mathcal{D}_i) + \frac{\lambda}{2} \|\theta_i - w^t_{i,r}\|^2, \qquad (7)$$

and choosing λ such that h̃_i(θ_i; w^t_{i,r}, D_i) is strongly convex with condition number κ (which quantifies how hard (7) is to optimize), we can apply gradient descent (resp. Nesterov's accelerated gradient descent) to obtain θ̃_i(w^t_{i,r}) such that

$$\|\nabla \tilde{h}_i(\tilde{\theta}_i; w^t_{i,r}, \mathcal{D}_i)\|^2 \le \nu, \qquad (8)$$

with K := O(κ log(d/ν)) (resp. O(√κ log(d/ν))) computations of ∇h̃_i [50], where d is the diameter of the search space, ν is an accuracy level, and O(·) hides constants. The per-client computation complexity of pFedMe is therefore K times that of FedAvg. In the following lemma, we show how δ can be controlled by adjusting (i) the sampling noise, via the mini-batch size |D|, and (ii) the accuracy level ν.

Lemma 1. Let θ̃_i(w^t_{i,r}) be a solution to (8). Then

$$\mathbb{E}\big[\|\tilde{\theta}_i(w^t_{i,r}) - \hat{\theta}_i(w^t_{i,r})\|^2\big] \le \delta^2 := \begin{cases} \dfrac{2}{(\lambda + \mu)^2}\Big(\dfrac{\sigma_f^2}{|\mathcal{D}|} + \nu\Big), & \text{if Assumption 1(a) holds;}\\[2mm] \dfrac{2}{(\lambda - L)^2}\Big(\dfrac{\sigma_f^2}{|\mathcal{D}|} + \nu\Big), & \text{if Assumption 1(b) holds and } \lambda > L. \end{cases}$$
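As a concrete illustration of Algorithm 1 (ours, not the authors' released code), the following NumPy sketch runs the full pFedMe loop on a toy problem with quadratic client losses; the loss, the client count, and all hyperparameter values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10, 5
client_opts = rng.normal(size=(N, d))      # toy client optima c_i; f_i(x) = 0.5*||x - c_i||^2

def grad_f(i, theta):
    return theta - client_opts[i]

def solve_inner(i, w_local, lam, lr=0.05, steps=5):
    """Line 7: delta-approximate theta_tilde_i by gradient descent on h_i from (7)."""
    theta = w_local.copy()
    for _ in range(steps):
        theta -= lr * (grad_f(i, theta) + lam * (theta - w_local))
    return theta

def pfedme(T=100, R=5, S=5, lam=15.0, eta=0.05, beta=1.0):
    w = np.zeros(d)                                          # w^0
    for t in range(T):                                       # line 2: global rounds
        local = np.zeros((N, d))
        for i in range(N):                                   # line 4: every client updates
            w_i = w.copy()                                   # line 5
            for r in range(R):                               # line 6: local rounds
                theta_tilde = solve_inner(i, w_i, lam)       # line 7
                w_i = w_i - eta * lam * (w_i - theta_tilde)  # line 8
            local[i] = w_i
        sampled = rng.choice(N, size=S, replace=False)       # line 9: sample S^t
        w = (1 - beta) * w + beta * local[sampled].mean(axis=0)  # line 10
    return w

print(pfedme())   # the global model drifts toward the average of the sampled clients' optima
```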

4 pFedMe: Convergence Analysis

In this section, we present the convergence of pFedMe. We first prove an intermediate result.

Lemma 2. Recall the definition of the Moreau envelope F_i in pFedMe. We have:

(a) Let Assumption 1(a) hold. Then

$$\frac{1}{N} \sum_{i=1}^{N} \|\nabla F_i(w) - \nabla F(w)\|^2 \le 4 L_F \big(F(w) - F(w^*)\big) + 2 \underbrace{\frac{1}{N} \sum_{i=1}^{N} \|\nabla F_i(w^*)\|^2}_{=: \sigma^2_{F,1}}.$$


Personalized Federated Learning with Moreau Envelopes, NeurIPS 2020

Algorithm: Repeat until convergence

1. Server sends the global model w^t to all clients

2. Each client solves for its personalized model θ_i, as in the inner problem (3)

3. Each client updates its local model w_i, as in line 8 of Algorithm 1

4. Server selects a subset of clients, and these clients send
their updated local models to the server for aggregation


27

Experiments

Datasets:
• Synthetic: 10 classes, 60 dimensions
• Real: MNIST
• 75% training, 25% test

Models: convex and non-convex

Federated setting:
• 100 clients for synthetic
• 20 clients for MNIST
• Data distributed by the power law

Personalized Federated Learning with Moreau Envelopes, NeurIPS 2020

Comparisons with Per-FedAvg and FedAvg

28

Experiments

Personalized Federated Learning with Moreau Envelopes, NeurIPS 2020

Figure 1: Performance comparison of pFedMe, FedAvg, and Per-FedAvg in μ-strongly convex and
nonconvex settings on MNIST (η = 0.005, |D| = 20, S = 5, β = 1 for all experiments).

Remark 2. Theorem 2 shows a similar convergence structure to that of Theorem 1, but with a sublinear rate for the nonconvex case. According to Corollary 2, we are able to obtain the sublinear speedup $O\big(1/(TRN)^{2/3}\big)$, while most existing convergence analyses for nonconvex FL can only achieve a sublinear speedup of $O\big(1/\sqrt{TRN}\big)$ [3, 6, 48].

5 Experimental Results and Discussion

In this section, we validate the performance of pFedMe when the data distributions are heterogeneous and non-i.i.d. We compare pFedMe with FedAvg and Per-FedAvg in both μ-strongly convex and nonconvex settings. The effect of hyperparameters is presented in the supplementary document.

5.1 Experimental Settings

We consider a classification problem using both real (MNIST) and synthetic datasets. MNIST [50] is a handwritten digit dataset containing 10 labels and 70,000 instances. Due to the limited size of MNIST, we distribute the complete dataset to N = 20 clients. To model a heterogeneous setting in terms of local data sizes and classes, each client is allocated a different local data size in the range [1165, 3834] and only has 2 of the 10 labels. For synthetic data, we adopt the data generation and distribution procedure from [15], using two parameters ᾱ and β̄ to control how much the local model and the dataset of each client differ, respectively. Specifically, the dataset serves a 10-class classifier on 60-dimensional real-valued data. We generate a synthetic dataset with ᾱ = 0.5 and β̄ = 0.5. Each client's data size is in the range [250, 25810]. Finally, we distribute the data to N = 100 clients according to the power law in [15].

We fix the subset of clients to S = 5 for MNIST and S = 10 for Synthetic. We compare the algorithms in two regimes: with the same and with fine-tuned learning rates, batch sizes, and numbers of local and global iterations. For the μ-strongly convex setting, we consider an ℓ2-regularized multinomial logistic regression (MLR) model with softmax activation and cross-entropy loss. For the nonconvex case, a two-layer deep neural network (DNN) is implemented with a hidden layer of size 100 for MNIST and 20 for Synthetic, using ReLU activation and a softmax output layer. We use gradient descent to obtain the δ-approximate θ̃_i(w^t_{i,r}) in pFedMe. All datasets are split randomly into 75% training and 25% testing. All experiments were conducted using PyTorch [51] version 1.4.0.
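For reference, here is a minimal PyTorch sketch of the nonconvex DNN described above (784 → 100 → 10 with ReLU for MNIST); the class name, the use of CrossEntropyLoss (which applies the softmax internally), and the synthetic mini-batch are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class MnistDNN(nn.Module):
    """Two-layer DNN for MNIST as described in Sec. 5.1: 784 -> 100 (ReLU) -> 10."""
    def __init__(self, in_dim=784, hidden=100, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),   # softmax is folded into the loss below
        )

    def forward(self, x):
        return self.net(x.view(x.size(0), -1))

model = MnistDNN()
loss_fn = nn.CrossEntropyLoss()               # cross-entropy over softmax outputs
x = torch.randn(20, 1, 28, 28)                # a fake |D| = 20 mini-batch
y = torch.randint(0, 10, (20,))
loss_fn(model(x), y).backward()               # gradients consumed by the local pFedMe solver
```

The μ-strongly convex MLR baseline would replace the hidden layer with a single `nn.Linear(784, 10)` and add ℓ2 weight decay to the optimizer.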

5.2 Performance Comparison

To highlight the empirical performance of pFedMe, we perform several comparisons between pFedMe, FedAvg, and Per-FedAvg. The reasoning behind the chosen hyperparameter ranges is provided in the supplementary files. We first use the same parameters for all algorithms as an initial comparison. As the algorithms behave differently when hyperparameters are changed, we then conduct a grid search over a wide range of hyperparameters to find the fine-tuned combination that achieves the highest test accuracy for each algorithm. We use both the personalized model (PM) and the global model (GM) of pFedMe for comparisons.

The comparisons for the MNIST dataset are shown in Fig. 1 (same hyperparameters) and Table 1 (fine-tuned hyperparameters). Fig. 1 shows that, in the strongly convex setting, pFedMe's personalized model achieves higher test accuracy than its global model, Per-FedAvg, and FedAvg.

29

Experiments

Personalized Federated Learning with Moreau Envelopes, NeurIPS 2020

Figure 7: Performance comparison of pFedMe, FedAvg, and Per-FedAvg in μ-strongly convex and
nonconvex settings on Synthetic (η = 0.005, |D| = 20, S = 10, β = 1 for all experiments).

Table 1: Comparison using fine-tuned hyperparameters. We fix |D| = 20, R = 20, K = 5, and T = 800 for MNIST, T = 600 for Synthetic, and β = 2 for pFedMe (α̂ and β̂ are the learning rates of Per-FedAvg).

Algorithm    Model   MNIST: λ   η / (α̂, β̂)      Accuracy (%)      Synthetic: λ   η / (α̂, β̂)      Accuracy (%)
FedAvg       MLR     –          0.02             93.96 ± 0.02      –              0.02             77.62 ± 0.11
Per-FedAvg   MLR     –          (0.03, 0.003)    94.37 ± 0.04      –              (0.02, 0.002)    81.49 ± 0.09
pFedMe-GM    MLR     15         0.01             94.18 ± 0.06      20             0.01             78.65 ± 0.25
pFedMe-PM    MLR     15         0.01             95.62 ± 0.04      20             0.01             83.20 ± 0.06
FedAvg       DNN     –          0.02             98.79 ± 0.03      –              0.03             83.64 ± 0.22
Per-FedAvg   DNN     –          (0.02, 0.001)    98.90 ± 0.02      –              (0.01, 0.001)    85.01 ± 0.10
pFedMe-GM    DNN     30         0.01             99.16 ± 0.03      30             0.01             84.17 ± 0.35
pFedMe-PM    DNN     30         0.01             99.46 ± 0.01      30             0.01             86.36 ± 0.15

In the nonconvex setting, the corresponding margins over its global model, Per-FedAvg, and FedAvg are 0.9%, 0.9%, and 1.3%, respectively. Table 1 shows that, with fine-tuned hyperparameters, pFedMe's personalized model is the best performer in all settings.

For the Synthetic dataset, the comparisons using the same parameters and the fine-tuned parameters are presented in Fig. 7 and Table 1, respectively. In Fig. 7, even though the global model of pFedMe performs worse than the others in terms of test accuracy and training loss, pFedMe's personalized model still shows its advantage, achieving the highest test accuracy and the smallest training loss. Fig. 7 shows that pFedMe's personalized model is 6.1%, 3.8%, and 5.2% more accurate than its global model, Per-FedAvg, and FedAvg, respectively. The corresponding figures for the nonconvex setting are 3.9%, 0.7%, and 3.1%. In addition, with fine-tuned hyperparameters in Table 1, the personalized model of pFedMe beats the others in all settings, while the global model of pFedMe only performs better than FedAvg.

From the experimental results, when the data among clients are non-i.i.d., both pFedMe and Per-FedAvg achieve higher test accuracy than FedAvg, as they allow the global model to be personalized for a specific client. However, by optimizing the personalized model approximately with multiple gradient updates and avoiding computation of the Hessian matrix, the personalized model of pFedMe is more advantageous than Per-FedAvg in terms of convergence rate and computational complexity.

6 Conclusion

In this paper, we propose pFedMe, a personalized FL algorithm that adapts to statistical diversity to improve FL performance. Our approach uses the Moreau envelope, which decouples personalized model optimization from global model learning: pFedMe updates the global model similarly to FedAvg while, in parallel, optimizing the personalized model w.r.t. each client's local data distribution. Theoretical results showed that pFedMe achieves a state-of-the-art convergence speedup (cf. Remark 2).

30

Additional challenges in Federated Learning

• non-convex optimization: deep learning

• privacy & security: differential privacy, secure aggregation

• personalization: model fine-tuning

• reduced communication: compression, lazy aggregation

• ML pipeline: on-device feature extraction, on-device inference

• . . .

31

Additional Reading

• FedAvg: Communication-Efficient Learning of Deep Networks from Decentralized Data,
McMahan et al, AISTATS 2017

• FEDL: Federated Learning over Wireless Networks: Convergence Analysis and
Resource Allocation Multi-Task Learning, Canh Dinh et al, IEEE/ACM Transactions on
Networking

• pFedMe: Personalized Federated Learning with Moreau Envelopes, Canh Dinh et.al.,
NeuRIPS 2020

• Survey: Federated Advances and Open Problems in Federated Learning. Foundations
and Trends® in Machine Learning, 2021

32