CS计算机代考程序代写 UBC-SBME / BMEG310_2021 Public

UBC-SBME / BMEG310_2021 Public

Code Issues Pull requests Actions Projects Wiki Security Insights

BMEG310_2021 / Assignment 4 / Assignment_4.Rmd

naila53 Update Assignment_4.Rmd …

2 contributors

main

364 lines (217 sloc) 16.5 KB

—

title: “BMEG310 2021W – Assignment 4”

output: html_document

—

“`{r setup, include=FALSE}

knitr::opts_chunk$set(echo = TRUE)

“`

## Instructions

Please fork this to your repo (or copy it there). Make sure your repo is private, but shared wi

Group:

Group members:

Assignment due date: November 30, 2021, 11:59PM

## Introduction

*Epstein-Barr Virus (EBV) is a herpes virus and the cause of mononucleosis. Like other herpes v

*Today, we’re going to look at gene expression data to see how EBV infection changes the transc

## Load data

*Load needed libraries (install any you don’t yet have – edgeR can be installed through BioCond

“`{r pressure, echo=FALSE}

library(ggplot2)

library(edgeR)

library(reshape)

set.seed(416) # please do not change the seed or set a new seed.

https://github.com/UBC-SBME
https://github.com/UBC-SBME/BMEG310_2021
https://github.com/UBC-SBME/BMEG310_2021
https://github.com/UBC-SBME/BMEG310_2021/issues
https://github.com/UBC-SBME/BMEG310_2021/pulls
https://github.com/UBC-SBME/BMEG310_2021/actions
https://github.com/UBC-SBME/BMEG310_2021/projects
https://github.com/UBC-SBME/BMEG310_2021/wiki
https://github.com/UBC-SBME/BMEG310_2021/security
https://github.com/UBC-SBME/BMEG310_2021/pulse
https://github.com/UBC-SBME/BMEG310_2021
https://github.com/UBC-SBME/BMEG310_2021/tree/main/Assignment%204
https://github.com/naila53
https://github.com/naila53
https://github.com/UBC-SBME/BMEG310_2021/commit/362ee85b8da04b763e946197297a1f1391884c3f
https://github.com/UBC-SBME/BMEG310_2021/commits/main/Assignment%204/Assignment_4.Rmd
https://github.com/UBC-SBME/BMEG310_2021/commits/main/Assignment%204/Assignment_4.Rmd?author=Carldeboer
https://github.com/UBC-SBME/BMEG310_2021/commits/main/Assignment%204/Assignment_4.Rmd?author=naila53

“`

*Load data:*

“`{r}

rnaSeqData = read.table(textConnection(readLines(gzcon(url(

“https://ftp.ncbi.nlm.nih.gov/geo/series/GSE126nnn/GSE126379/suppl/GSE126379_BLCL_processed.t

sep=”\t”, stringsAsFactors = FALSE, header = TRUE)

“`

*Reshape and re-label data to make it more usable.*

“`{r}

#melt the data frame so that there is one row for every gene-sample-quantification.type (see be

rnaSeqDataMelted = melt(rnaSeqData, id.vars=c(“Chromosome”, “Gene.Name”, “Gene.ID”, “Sum.of.Exo

#label every value as a read count, but…

rnaSeqDataMelted$quantification.type = “counts”;

#relabel those that are really RPKM

rnaSeqDataMelted$quantification.type[grepl(rnaSeqDataMelted$variable,pattern = “RPKM”)] = “RPKM

#get the donor (human) ID

rnaSeqDataMelted$donor = gsub(“.*BLCL_([ABCDE])([12]).*”, “\\1”, rnaSeqDataMelted$variable)

#Get the identifier they use to distinguish LCLs from B-cells

rnaSeqDataMelted$celltype = gsub(“.*BLCL_([ABCDE])([12]).*”, “\\2″, rnaSeqDataMelted$variable)

#They didn’t label the cell types clearly, so we’re going to guess their assignment and verify

rnaSeqDataMelted$celltype[rnaSeqDataMelted$celltype==”1″]=”B”

rnaSeqDataMelted$celltype[rnaSeqDataMelted$celltype==”2″]=”LCL”

#Remove this, since we have already parsed out the needed info

rnaSeqDataMelted$variable=NULL;

rnaSeqDataMelted$Gene.Name=NULL; #we only use Gene.ID

rnaSeqDataMelted$sampleID = paste(rnaSeqDataMelted$celltype, rnaSeqDataMelted$donor, sep=”_”)

#divide data into read counts and RPKM

rnaSeqDataMeltedCounts = rnaSeqDataMelted[rnaSeqDataMelted$quantification.type==”counts”,]

rnaSeqDataMeltedRPKM = rnaSeqDataMelted[rnaSeqDataMelted$quantification.type==”RPKM”,]

#relabel “value” as appropriate, and remove the ‘quantification.type’ column

rnaSeqDataMeltedCounts$counts=rnaSeqDataMeltedCounts$value;

rnaSeqDataMeltedCounts$value=NULL;

rnaSeqDataMeltedCounts$quantification.type=NULL;

rnaSeqDataMeltedRPKM$RPKM=rnaSeqDataMeltedRPKM$value;

rnaSeqDataMeltedRPKM$value=NULL;

rnaSeqDataMeltedRPKM$quantification.type=NULL;

rm(‘rnaSeqDataMelted’) # remove

“`

*Here, `rnaSeqDataMeltedRPKM` contains the gene expression “Reads per kilobase of transcript pe

## Questions

### Question 1

*It is always important to perform Quality Control on your data. *

#### Question 1a (2 pts)

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

*For some reason, the authors labeled the different cell types “1” and “2” in this table. We ne

*Enter your answer below: *

“`{r}

“`

*Explain what you see. Are the celltype labels correct?*

#### Question 1b ( 2pts)

*With RNA-seq data, it is important both that samples have sufficient coverage, and that the sa

“`{r}

“`

*Which sample has the least coverage?*

“`{r}

“`

*Which sample has the most?*

“`{r}

“`

*What is the % difference between the max and min (relative to the min)?*

“`{r}

“`

*In cases where samples have vastly different coverage, you can potentially down-sample the hig

#### Question 1c ( 4.5 pts)

*One easy way to compare the samples, is to calculate the pairwise correlation between them. In

*First we need to create a RPKM matrix. You can use the `reshape` package to change the shape o

“`{r}

# Here, we need fun.aggregate because there are duplicate gene names. We’ll just add everything

rpkmMatrix = cast(rnaSeqDataMeltedRPKM, Gene.ID ~ sampleID, value=”RPKM”, fun.aggregate=”sum”)

#now convert this to a matrix

row.names(rpkmMatrix) = rpkmMatrix$Gene.ID

rpkmMatrix$Gene.ID=NULL;

rpkmMatrix= as.matrix(rpkmMatrix)

head(rpkmMatrix)

“`

*Calculate the pairwise Pearson correlations between samples. Use log(RPKM+1) to calculate the

“`{r}

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

“`

*How similar are the B-cell samples to each other? *

*The LCLs to each other? *

*The B-cells to the LCLs?*

*The donors are labeled [A-E]. What would we expect to see if there were donor-specific effects

#### Question 1d (4 pts)

*Plot the kernel density estimate for RPKM (x axis). Describe the distribution. *

“`{r}

“`

*Plot the kernel density estimate for log(RPKM+1) (x axis). Describe the distribution.*

“`{r}

“`

*Why do we log-transform gene expression data?*

*Why do we use RPKM+1?*

#### Question 1e ( 2pts)

*It is common to exclude some genes from analysis. For instance, if the gene is not expressed,

“`{r}

“`

*By looking at the graph, what fraction of genes will be excluded if we require an average of a

#### Question 1f

*Make a vector of `Gene.ID`s for all the genes we will _include_ in our analysis (e.g. those wi

“`{r}

“`

*How many genes have we retained?*

“`{r}

“`

#### Question 1f ( 2 bonus pts)

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

*`rnaSeqDataMeltedCounts` contains a commmon error introduced by pasting the data having passed

“`{r}

“`

### Question 2

*We want to know what genes are differentially expressed between LCLs and B cells.*

*Today, we’re going to use edgeR. The user guide is here:* https://www.bioconductor.org/package

#### Question 2a (2pts)

*For calculating differential expression, we need a count matrix. There is a lot more informati

“`{r}

“`

#### Question 2b (4pts)

*EdgeR is exceptionally versatile, with many different options for analysis. Today, you’re goin

“`{r}

ct = as.factor(gsub(,”\\1″,colnames(countMatrix)))

= DGEList(counts=, group=)

= calcNormFactors()

= model.matrix(~)

= estimateDisp(, )

= glmQLFit(, )

qlf = glmQLFTest(, coef=2)

allDEStats = as.data.frame(topTags(qlf,n=nrow(countMatrix)))

allDEStats$Gene.ID=row.names(allDEStats)

head(allDEStats)

“`

#### Question 2c (2pts)

*Now, let’s confirm that we did the comparison in the correct direction. We want to ensure that

“`{r}

“`

#### Question 2d (2pts)

*In the case of this experiment, each donor had corresponding B cell and LCL samples – these ar

“`{r}

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

257

258

259

260

261

262

263

264

265

266

267

268

269

270

271

272

273

274

275

276

277

278

279

280

281

282

283

284

285

286

287

288

289

290

291

292

“`

### Question 3

*When performing any bioinformatic analysis, it is critically important to verify your answers

#### Question 3a ( 1pts)

*Compare the results from the paired analysis vs. the original analysis (where donor was not co

*How many genes have a logFC that differs by 1 or more?*

“`{r}

“`

#### Question 3b ( 3pts)

*Plot the logFC values for paired (x-axis) vs unpaired (y-axis) analyses. What is the Pearson r

“`{r}

“`

#### Question 3c

*Why are there some genes whose logFC estimates differ so much between the paired and unpaired

“`{r}

“`

*What are some reasons why paired and unpaired analyses appear to be giving different answers i

### Question 4

*There are many different ways of normalizing RNA seq data and they all make different assumpti

*For more information on RNA-seq normalization methods, see this great post by Harold Pimentel

#### *From now on, we will be using only the paired (donor as blocking variable) analysis*

#### Question 4a ( 3.5 pts)

*edgeR uses TMM normalization. TMM normalization is usually a solid approach, but it assumes th

“`{r}

“`

*Do the P-values calculated by edgeR look like they are sampled uniformly between 0 and 1?*

293

294

295

296

297

298

299

300

301

302

303

304

305

306

307

308

309

310

311

312

313

314

315

316

317

318

319

320

321

322

323

324

325

326

327

328

329

330

331

332

333

334

335

336

337

338

339

340

341

342

343

344

*Estimate the fraction of genes that appear to have differential expression (does not have to b

*Do you think we should use TMM normalization for these samples? Why or why not?*

#### Question 4b (3.5 pts)

*Another way to ensure that P-values are well-behaved is to make a Q-Q plot. A Q-Q plot, in thi

“`{r}

“`

*Do most of the P-values lie along the y=x line?*

*What does this mean about how we did our analysis? What would you do differently if we were to

#### Question 4c (3 pts)

*FDRs are automatically calculated by edgeR. plot the -log10(FDR) (y axis) versus the -log10(P-

“`{r}

“`

*Are the points above or below the line? Why?*

#### Question 4d (3 pts)

*If we took our set of significant DE genes to be those that meet an FDR of 10%, how many genes

“`{r}

“`

*What is the P-value of the gene with the highest FDR that is still less than or equal to FDR < ```{r} ``` *Ignoring any potential flaws in the analysis up to this point and assuming our P-values and FD ### Question 5 *Now we're going to spend some time actually looking at the results of our analysis, trying to #### Question 5a (2 pts) *GOrilla, which we use below, can take a single ranked list as input. Here, you should make two ```{r} 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 ``` #### Question 5b (6 pts) *Now use GOrilla to measure gene set enrichment analysis for both sets of genes.* http://cbl-gorilla.cs.technion.ac.il/ *Use Homo sapiens; a single ranked list of genes (those we just created - run it twice, once pe *When inspecting the results, consider both the significance of the enrichment, the degree of e *What processes are relatively high in B cells?* *Which are relatively high in LCLs?* *What do you think could be happening as B-cells are reprogrammed to LCLs?*

Related Posts