UBC-SBME / BMEG310_2021 Public
Code Issues Pull requests Actions Projects Wiki Security Insights
BMEG310_2021 / Assignment 4 / Assignment_4.Rmd
naila53 Update Assignment_4.Rmd …
2 contributors
main
364 lines (217 sloc) 16.5 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
—
title: “BMEG310 2021W – Assignment 4”
output: html_document
—
“`{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
“`
## Instructions
Please fork this to your repo (or copy it there). Make sure your repo is private, but shared wi
Group:
Group members:
Assignment due date: November 30, 2021, 11:59PM
## Introduction
*Epstein-Barr Virus (EBV) is a herpes virus and the cause of mononucleosis. Like other herpes v
*Today, we’re going to look at gene expression data to see how EBV infection changes the transc
## Load data
*Load needed libraries (install any you don’t yet have – edgeR can be installed through BioCond
“`{r pressure, echo=FALSE}
library(ggplot2)
library(edgeR)
library(reshape)
set.seed(416) # please do not change the seed or set a new seed.
https://github.com/UBC-SBME
https://github.com/UBC-SBME/BMEG310_2021
https://github.com/UBC-SBME/BMEG310_2021
https://github.com/UBC-SBME/BMEG310_2021/issues
https://github.com/UBC-SBME/BMEG310_2021/pulls
https://github.com/UBC-SBME/BMEG310_2021/actions
https://github.com/UBC-SBME/BMEG310_2021/projects
https://github.com/UBC-SBME/BMEG310_2021/wiki
https://github.com/UBC-SBME/BMEG310_2021/security
https://github.com/UBC-SBME/BMEG310_2021/pulse
https://github.com/UBC-SBME/BMEG310_2021
https://github.com/UBC-SBME/BMEG310_2021/tree/main/Assignment%204
https://github.com/naila53
https://github.com/naila53
https://github.com/UBC-SBME/BMEG310_2021/commit/362ee85b8da04b763e946197297a1f1391884c3f
https://github.com/UBC-SBME/BMEG310_2021/commits/main/Assignment%204/Assignment_4.Rmd
https://github.com/UBC-SBME/BMEG310_2021/commits/main/Assignment%204/Assignment_4.Rmd?author=Carldeboer
https://github.com/UBC-SBME/BMEG310_2021/commits/main/Assignment%204/Assignment_4.Rmd?author=naila53
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
“`
*Load data:*
“`{r}
rnaSeqData = read.table(textConnection(readLines(gzcon(url(
“https://ftp.ncbi.nlm.nih.gov/geo/series/GSE126nnn/GSE126379/suppl/GSE126379_BLCL_processed.t
sep=”\t”, stringsAsFactors = FALSE, header = TRUE)
“`
*Reshape and re-label data to make it more usable.*
“`{r}
#melt the data frame so that there is one row for every gene-sample-quantification.type (see be
rnaSeqDataMelted = melt(rnaSeqData, id.vars=c(“Chromosome”, “Gene.Name”, “Gene.ID”, “Sum.of.Exo
#label every value as a read count, but…
rnaSeqDataMelted$quantification.type = “counts”;
#relabel those that are really RPKM
rnaSeqDataMelted$quantification.type[grepl(rnaSeqDataMelted$variable,pattern = “RPKM”)] = “RPKM
#get the donor (human) ID
rnaSeqDataMelted$donor = gsub(“.*BLCL_([ABCDE])([12]).*”, “\\1”, rnaSeqDataMelted$variable)
#Get the identifier they use to distinguish LCLs from B-cells
rnaSeqDataMelted$celltype = gsub(“.*BLCL_([ABCDE])([12]).*”, “\\2″, rnaSeqDataMelted$variable)
#They didn’t label the cell types clearly, so we’re going to guess their assignment and verify
rnaSeqDataMelted$celltype[rnaSeqDataMelted$celltype==”1″]=”B”
rnaSeqDataMelted$celltype[rnaSeqDataMelted$celltype==”2″]=”LCL”
#Remove this, since we have already parsed out the needed info
rnaSeqDataMelted$variable=NULL;
rnaSeqDataMelted$Gene.Name=NULL; #we only use Gene.ID
rnaSeqDataMelted$sampleID = paste(rnaSeqDataMelted$celltype, rnaSeqDataMelted$donor, sep=”_”)
#divide data into read counts and RPKM
rnaSeqDataMeltedCounts = rnaSeqDataMelted[rnaSeqDataMelted$quantification.type==”counts”,]
rnaSeqDataMeltedRPKM = rnaSeqDataMelted[rnaSeqDataMelted$quantification.type==”RPKM”,]
#relabel “value” as appropriate, and remove the ‘quantification.type’ column
rnaSeqDataMeltedCounts$counts=rnaSeqDataMeltedCounts$value;
rnaSeqDataMeltedCounts$value=NULL;
rnaSeqDataMeltedCounts$quantification.type=NULL;
rnaSeqDataMeltedRPKM$RPKM=rnaSeqDataMeltedRPKM$value;
rnaSeqDataMeltedRPKM$value=NULL;
rnaSeqDataMeltedRPKM$quantification.type=NULL;
rm(‘rnaSeqDataMelted’) # remove
“`
*Here, `rnaSeqDataMeltedRPKM` contains the gene expression “Reads per kilobase of transcript pe
## Questions
### Question 1
*It is always important to perform Quality Control on your data. *
#### Question 1a (2 pts)
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
*For some reason, the authors labeled the different cell types “1” and “2” in this table. We ne
*Enter your answer below: *
“`{r}
“`
*Explain what you see. Are the celltype labels correct?*
#### Question 1b ( 2pts)
*With RNA-seq data, it is important both that samples have sufficient coverage, and that the sa
“`{r}
“`
*Which sample has the least coverage?*
“`{r}
“`
*Which sample has the most?*
“`{r}
“`
*What is the % difference between the max and min (relative to the min)?*
“`{r}
“`
*In cases where samples have vastly different coverage, you can potentially down-sample the hig
#### Question 1c ( 4.5 pts)
*One easy way to compare the samples, is to calculate the pairwise correlation between them. In
*First we need to create a RPKM matrix. You can use the `reshape` package to change the shape o
“`{r}
# Here, we need fun.aggregate because there are duplicate gene names. We’ll just add everything
rpkmMatrix = cast(rnaSeqDataMeltedRPKM, Gene.ID ~ sampleID, value=”RPKM”, fun.aggregate=”sum”)
#now convert this to a matrix
row.names(rpkmMatrix) = rpkmMatrix$Gene.ID
rpkmMatrix$Gene.ID=NULL;
rpkmMatrix= as.matrix(rpkmMatrix)
head(rpkmMatrix)
“`
*Calculate the pairwise Pearson correlations between samples. Use log(RPKM+1) to calculate the
“`{r}
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
“`
*How similar are the B-cell samples to each other? *
*The LCLs to each other? *
*The B-cells to the LCLs?*
*The donors are labeled [A-E]. What would we expect to see if there were donor-specific effects
#### Question 1d (4 pts)
*Plot the kernel density estimate for RPKM (x axis). Describe the distribution. *
“`{r}
“`
*Plot the kernel density estimate for log(RPKM+1) (x axis). Describe the distribution.*
“`{r}
“`
*Why do we log-transform gene expression data?*
*Why do we use RPKM+1?*
#### Question 1e ( 2pts)
*It is common to exclude some genes from analysis. For instance, if the gene is not expressed,
“`{r}
“`
*By looking at the graph, what fraction of genes will be excluded if we require an average of a
#### Question 1f
*Make a vector of `Gene.ID`s for all the genes we will _include_ in our analysis (e.g. those wi
“`{r}
“`
*How many genes have we retained?*
“`{r}
“`
#### Question 1f ( 2 bonus pts)
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
*`rnaSeqDataMeltedCounts` contains a commmon error introduced by pasting the data having passed
“`{r}
“`
### Question 2
*We want to know what genes are differentially expressed between LCLs and B cells.*
*Today, we’re going to use edgeR. The user guide is here:* https://www.bioconductor.org/package
#### Question 2a (2pts)
*For calculating differential expression, we need a count matrix. There is a lot more informati
“`{r}
“`
#### Question 2b (4pts)
*EdgeR is exceptionally versatile, with many different options for analysis. Today, you’re goin
“`{r}
ct = as.factor(gsub(,”\\1″,colnames(countMatrix)))
= DGEList(counts=, group=)
= calcNormFactors()
= model.matrix(~)
= estimateDisp(, )
= glmQLFit(, )
qlf = glmQLFTest(, coef=2)
allDEStats = as.data.frame(topTags(qlf,n=nrow(countMatrix)))
allDEStats$Gene.ID=row.names(allDEStats)
head(allDEStats)
“`
#### Question 2c (2pts)
*Now, let’s confirm that we did the comparison in the correct direction. We want to ensure that
“`{r}
“`
#### Question 2d (2pts)
*In the case of this experiment, each donor had corresponding B cell and LCL samples – these ar
“`{r}
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
“`
### Question 3
*When performing any bioinformatic analysis, it is critically important to verify your answers
#### Question 3a ( 1pts)
*Compare the results from the paired analysis vs. the original analysis (where donor was not co
*How many genes have a logFC that differs by 1 or more?*
“`{r}
“`
#### Question 3b ( 3pts)
*Plot the logFC values for paired (x-axis) vs unpaired (y-axis) analyses. What is the Pearson r
“`{r}
“`
#### Question 3c
*Why are there some genes whose logFC estimates differ so much between the paired and unpaired
“`{r}
“`
*What are some reasons why paired and unpaired analyses appear to be giving different answers i
### Question 4
*There are many different ways of normalizing RNA seq data and they all make different assumpti
*For more information on RNA-seq normalization methods, see this great post by Harold Pimentel
#### *From now on, we will be using only the paired (donor as blocking variable) analysis*
#### Question 4a ( 3.5 pts)
*edgeR uses TMM normalization. TMM normalization is usually a solid approach, but it assumes th
“`{r}
“`
*Do the P-values calculated by edgeR look like they are sampled uniformly between 0 and 1?*
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
*Estimate the fraction of genes that appear to have differential expression (does not have to b
*Do you think we should use TMM normalization for these samples? Why or why not?*
#### Question 4b (3.5 pts)
*Another way to ensure that P-values are well-behaved is to make a Q-Q plot. A Q-Q plot, in thi
“`{r}
“`
*Do most of the P-values lie along the y=x line?*
*What does this mean about how we did our analysis? What would you do differently if we were to
#### Question 4c (3 pts)
*FDRs are automatically calculated by edgeR. plot the -log10(FDR) (y axis) versus the -log10(P-
“`{r}
“`
*Are the points above or below the line? Why?*
#### Question 4d (3 pts)
*If we took our set of significant DE genes to be those that meet an FDR of 10%, how many genes
“`{r}
“`
*What is the P-value of the gene with the highest FDR that is still less than or equal to FDR < ```{r} ``` *Ignoring any potential flaws in the analysis up to this point and assuming our P-values and FD ### Question 5 *Now we're going to spend some time actually looking at the results of our analysis, trying to #### Question 5a (2 pts) *GOrilla, which we use below, can take a single ranked list as input. Here, you should make two ```{r} 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 ``` #### Question 5b (6 pts) *Now use GOrilla to measure gene set enrichment analysis for both sets of genes.* http://cbl-gorilla.cs.technion.ac.il/ *Use Homo sapiens; a single ranked list of genes (those we just created - run it twice, once pe *When inspecting the results, consider both the significance of the enrichment, the degree of e *What processes are relatively high in B cells?* *Which are relatively high in LCLs?* *What do you think could be happening as B-cells are reprogrammed to LCLs?*