MAST90083 Computational Statistics & Data Mining SDG & LR
Tutorial & Practical 2: Synthetic Dataset Generation
(SDG) and Linear Regression (LR)
In this practical, our aim is to learn how to use the least squares estimator to solve a linear
regression problem of the form $Y = DA + E$, where $Y \in \mathbb{R}^{N \times V}$ is the observed dataset,
$D \in \mathbb{R}^{N \times L}$ contains the set of regressors, $A \in \mathbb{R}^{L \times V}$ measures the relationship between the
observed responses and the regressors, and $E$ accounts for the model error. Here $N$ is the
number of time points, $L$ is the number of regressors, and $V$ is the number of observed
variables. For this purpose, we will first construct a synthetic dataset $X$ (we use $X$ rather
than $Y$ to distinguish the synthetic dataset from an observed one) using temporal sources
called time courses (TC) and spatial sources called spatial maps (SM). Second, we will perform
least squares regression on $X$ using the regressors in $D$ to retrieve the response signal
strengths $A$. Third, we will use $A$ to estimate $D$. This can be accomplished by
performing the following steps in order:
Question 1:
According to the plot in Figure 1
1. Construct a matrix TC of size 240 × 3 consisting of three temporal sources with onsets at
i) 0, 30, 60, 90, 120, 150, 180, 210; ii) 20, 65, 110, 155, 200; iii) 0, 60, 120, 180.
2. Can you tell visually whether the temporal dependence among the three TCs is significant?
3. Standardize each TC by subtracting its mean and dividing it by its standard deviation.
This will make TCs bias free (centered around the origin) and equally important (have
unit variance)
4. Plot these TCs as they have been plotted in Figure 1
You can clearly see that the TCs in Figure 1 are already standardized. As a hint for this step,
you can use these functions in R: seq, numeric, mean, sd (std is the pracma equivalent), plot.
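The construction and standardization steps of Question 1 could be sketched as follows. The onsets come from the question itself, but the duration of each block of ones is not stated in the text (it has to be read off Figure 1), so the vector `dur` below is purely an assumed placeholder, as is the helper name `make_tc`.

```r
# Sketch of Question 1: build three box-car time courses and standardize them.
N <- 240

# Onsets as given in the question (time indices start at 0).
onsets <- list(seq(0, 210, by = 30),
               seq(20, 200, by = 45),
               seq(0, 180, by = 60))

# ASSUMPTION: durations of the "on" periods are not given in the text;
# these values are placeholders to be replaced by what Figure 1 shows.
dur <- c(15, 20, 25)

# Hypothetical helper: a vector of zeros with ones for `d` samples after each onset.
make_tc <- function(on, d, N) {
  tc <- numeric(N)
  for (s in on) tc[(s + 1):min(s + d, N)] <- 1
  tc
}

TC <- sapply(1:3, function(k) make_tc(onsets[[k]], dur[k], N))  # 240 x 3

# Standardize each column: subtract its mean, divide by its standard deviation.
TC <- scale(TC)

# Pairwise correlations give a numerical companion to the visual check in step 2.
R_tc <- cor(TC)
```

Calling `matplot(TC, type = "l")` afterwards overlays the three standardized courses in one panel; separate panels, as in Figure 1, can be produced with `par(mfrow = c(3, 1))` and one `plot` call per column.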
Question 2:
According to the plot in Figure 2
1. Construct an array tmpSM of size 3 × (21 × 21) consisting of ones and zeros, with ones at
pixels i) 14 to 18, ii) 3 to 7, and iii) 8 to 13 along both dimensions of the slice.
2. Is the spatial dependence among the three SMs insignificant?
3. Why is standardization not wanted in this step, unlike for the temporal sources?
4. Plot these SMs as they have been plotted in Figure 2
Figure 1: Time Courses
Figure 2: Spatial Maps
5. Reshape the array tmpSM into a two-dimensional matrix and call it SM.
As a hint for this step, you can use these functions in R: array (to manage three-dimensional
arrays) and c(tmpSM[,,1]) (to reshape one slice of a three-dimensional array into a vector).
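A minimal sketch of the spatial-map construction and reshaping in Question 2 follows. It assumes the three slices are stored along the third dimension of the array (which matches the `tmpSM[,,1]` indexing in the hint); the variable names `rng` and `overlap` are illustrative only.

```r
# Sketch of Question 2: three 21 x 21 binary slices, then reshape to 3 x 441.
tmpSM <- array(0, dim = c(21, 21, 3))

# Pixel ranges from the question, applied along both dimensions of each slice.
rng <- list(14:18, 3:7, 8:13)
for (k in 1:3) tmpSM[rng[[k]], rng[[k]], k] <- 1

# Flatten each slice column-wise into a row of SM.
SM <- t(sapply(1:3, function(k) c(tmpSM[, , k])))  # 3 x 441

# Entry (i, j) counts the pixels shared by maps i and j;
# zero off-diagonals mean the maps do not overlap.
overlap <- SM %*% t(SM)
```

A single slice can be displayed with `image(tmpSM[, , 1])`.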
Question 3:
According to the plot in Figure 3
1. Generate zero-mean white Gaussian noise for the temporal and spatial sources, denoted
$\Gamma_t \in \mathbb{R}^{240 \times 3}$ and $\Gamma_s \in \mathbb{R}^{3 \times 441}$. Besides their dimensions, the two noise
sources differ in variance, which is 0.25 for $\Gamma_t$ and 0.05 for $\Gamma_s$.
2. Using the temporal sources from Question 1 and the spatial sources from Question 2, generate
a synthetic dataset $X$ of size 240 × 441 as $X = (TC + \Gamma_t) \times (SM + \Gamma_s)$.
3. Can the products $TC \times \Gamma_s$ and $\Gamma_t \times SM$ exist?
4. Plot at least 100 randomly selected time series from X.
As a hint, you can use these functions in R: rnorm and sd for noise generation (note that rnorm
takes a standard deviation, not a variance), and data.frame for plotting.
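The noise generation and mixing in Question 3 might look like the sketch below. To keep the block self-contained it rebuilds TC and SM with two hypothetical helpers (`make_tc`, `make_sm`), and the block durations in `dur` are assumed placeholders, since the text does not state them. The seed is arbitrary.

```r
# Sketch of Question 3: noisy sources and the synthetic dataset X.
set.seed(1)  # arbitrary seed, for reproducibility only
N <- 240; V <- 441

# Hypothetical helpers to rebuild the sources from Questions 1 and 2.
make_tc <- function(on, d, N) {
  tc <- numeric(N)
  for (s in on) tc[(s + 1):min(s + d, N)] <- 1
  tc
}
make_sm <- function(r) { m <- matrix(0, 21, 21); m[r, r] <- 1; c(m) }

dur <- c(15, 20, 25)  # ASSUMPTION: durations are not given in the text
TC <- scale(cbind(make_tc(seq(0, 210, by = 30), dur[1], N),
                  make_tc(seq(20, 200, by = 45), dur[2], N),
                  make_tc(seq(0, 180, by = 60), dur[3], N)))
SM <- rbind(make_sm(14:18), make_sm(3:7), make_sm(8:13))

# rnorm() is parameterized by the standard deviation, so pass sqrt(variance).
Gt <- matrix(rnorm(N * 3, mean = 0, sd = sqrt(0.25)), N, 3)   # 240 x 3
Gs <- matrix(rnorm(3 * V, mean = 0, sd = sqrt(0.05)), 3, V)   # 3 x 441

X <- (TC + Gt) %*% (SM + Gs)                                  # 240 x 441

# Both cross terms are conformable: (240 x 3)(3 x 441).
cross1 <- TC %*% Gs
cross2 <- Gt %*% SM

# Column indices of 100 randomly chosen pixel time series.
pick <- sample(V, 100)
```

With `pick` in hand, `matplot(X[, pick], type = "l")` draws the selected series on one set of axes.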
Question 4:
The synthetic dataset $X$ that you generated in Question 3 follows the linear regression
model $X = DA + E$, where the unknown $A$ can be estimated using least squares.
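For reference, the least squares estimate can be derived in a few lines. The sketch below assumes the criterion is the squared Frobenius norm of the residual and that $D$ has full column rank, so that $D^\top D$ is invertible; neither assumption is stated explicitly in the text.

```latex
\begin{aligned}
\hat{A} &= \arg\min_{A} \; \lVert X - DA \rVert_F^2 \\
\frac{\partial}{\partial A} \lVert X - DA \rVert_F^2 &= -2\, D^\top (X - DA) = 0 \\
D^\top D \, \hat{A} &= D^\top X \\
\hat{A} &= (D^\top D)^{-1} D^\top X
\end{aligned}
```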
1. The regressors D are known, since you used them to generate X in Question 3: they are
the TCs, so $D = TC$. Estimate $A$ (retrieval of the SMs) using the least squares
solution $A = (D^\top D)^{-1} D^\top X$.
2. After retrieving the SMs in A, threshold them by replacing values that are below the
threshold with zeros (this leads to a trivial sparse solution). This step is necessary
because it helps remove statistically insignificant pixels. The threshold should be
chosen visually, but you can use max(abs(A))/T with T = 3 as a guide.
3. Similarly, retrieve the TCs in D for a known A using $D = X A^\top$.
4. Plot both A and D side by side as shown in Figure 4.
As a hint: use the pinv function from the pracma library in R to perform least squares.
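One way the estimation steps of Question 4 could be sketched is shown below. It uses base R's `solve` on the normal equations in place of `pracma::pinv`, which gives the same estimate when $D^\top D$ is invertible. The data are regenerated inside the block so that it runs on its own; `make_tc`, `make_sm`, `A_thr`, and `D_hat` are illustrative names, the durations in `dur` are assumed, and applying the threshold to absolute values is an interpretation of the guide max(abs(A))/T.

```r
# Sketch of Question 4: least squares retrieval of SMs, thresholding, then TCs.
set.seed(1)  # arbitrary seed
N <- 240; V <- 441

# Hypothetical helpers and ASSUMED durations, used only to regenerate X.
make_tc <- function(on, d, N) {
  tc <- numeric(N)
  for (s in on) tc[(s + 1):min(s + d, N)] <- 1
  tc
}
make_sm <- function(r) { m <- matrix(0, 21, 21); m[r, r] <- 1; c(m) }
dur <- c(15, 20, 25)
TC <- scale(cbind(make_tc(seq(0, 210, by = 30), dur[1], N),
                  make_tc(seq(20, 200, by = 45), dur[2], N),
                  make_tc(seq(0, 180, by = 60), dur[3], N)))
SM <- rbind(make_sm(14:18), make_sm(3:7), make_sm(8:13))
X <- (TC + matrix(rnorm(N * 3, sd = sqrt(0.25)), N, 3)) %*%
     (SM + matrix(rnorm(3 * V, sd = sqrt(0.05)), 3, V))

# Step 1: A = (D'D)^{-1} D'X, solved without forming the inverse explicitly.
D <- TC
A <- solve(crossprod(D), crossprod(D, X))        # 3 x 441

# Step 2: zero out entries whose magnitude is below max(abs(A)) / T, with T = 3.
thr <- max(abs(A)) / 3
A_thr <- A
A_thr[abs(A_thr) < thr] <- 0

# Step 3: retrieve the time courses as D = X A' (formula as given in the question).
D_hat <- X %*% t(A_thr)                          # 240 x 3
```

Each row of `A_thr` can be viewed as a map with `image(matrix(A_thr[1, ], 21, 21))`, next to `plot(D_hat[, 1], type = "l")` for the matching time course.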
Question 5:
When you compare the estimates of D (the source TCs) and A (the source SMs) in Figure 4
with Figures 1 and 2, you can see how noise has affected the retrieved TCs and SMs
(they clearly differ from the original TCs and SMs). This was
a very simple case in which the TCs had no temporal dependence and the SMs had no spatial
dependence among them.
1. As a further exercise, introduce spatial dependence among the SMs by placing ones at pixels
i) 12 to 18, ii) 3 to 17, and iii) 8 to 14 along both dimensions of the slice. Regenerate X
for the new spatial sources, and then re-estimate D and A. Plot D and A side by side.
2. Have the estimates of D deteriorated? If so, why? How can you solve this issue?
3. What happens if you standardize the generated dataset X (giving equal importance
to the time series of all pixels, even those whose time series consist of just noise)?
Would least squares fail in this case too? How badly? Would there be any false positives
or false negatives? Answer these questions visually, for the same value of T (threshold),
by estimating and then plotting D and A for the standardized X and comparing them
to Question 5.1.
Figure 3: Synthetic dataset
Figure 4: Retrieved Sources
4. Has this question revealed to you the importance of a sparse solution, particularly for
avoiding false positives?
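The experiments in Question 5 could be set up along the following lines. The sketch builds the overlapping spatial maps, regenerates X, and runs the same estimation on both the raw and the column-standardized data. The helper names (`make_tc`, `make_sm`, `estimate`), the durations in `dur`, the seed, and the use of absolute values in the threshold are all assumptions made for illustration.

```r
# Sketch of Question 5: overlapping SMs, then estimation on raw and standardized X.
set.seed(1)  # arbitrary seed
N <- 240; V <- 441

make_tc <- function(on, d, N) {
  tc <- numeric(N)
  for (s in on) tc[(s + 1):min(s + d, N)] <- 1
  tc
}
make_sm <- function(r) { m <- matrix(0, 21, 21); m[r, r] <- 1; c(m) }

dur <- c(15, 20, 25)  # ASSUMPTION: durations are not given in the text
TC <- scale(cbind(make_tc(seq(0, 210, by = 30), dur[1], N),
                  make_tc(seq(20, 200, by = 45), dur[2], N),
                  make_tc(seq(0, 180, by = 60), dur[3], N)))

# New, spatially dependent maps from Question 5.1.
SM <- rbind(make_sm(12:18), make_sm(3:17), make_sm(8:14))
overlap <- SM %*% t(SM)  # off-diagonals count the pixels shared by two maps

X <- (TC + matrix(rnorm(N * 3, sd = sqrt(0.25)), N, 3)) %*%
     (SM + matrix(rnorm(3 * V, sd = sqrt(0.05)), 3, V))

# Hypothetical helper: least squares for A, threshold at max(abs(A)) / Tdiv,
# then D = X A' as in Question 4.
estimate <- function(X, D, Tdiv = 3) {
  A <- solve(crossprod(D), crossprod(D, X))
  A[abs(A) < max(abs(A)) / Tdiv] <- 0
  list(A = A, D = X %*% t(A))
}

raw <- estimate(X, TC)          # Question 5.1
std <- estimate(scale(X), TC)   # Question 5.3: each pixel's series standardized
```

Comparing the nonzero patterns of `raw$A` and `std$A` against `SM` (for example with `image` on each reshaped row) is one way to look for false positives and false negatives at the same threshold.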