Ranges and features in Bioconductor
The foundation layers of genomic arithmetic
– 30, 2022
Copyright By PowCoder代写 加微信 powcoder
Prequel: The pieces and layers of Bioconductor Some concepts
Integer ranges of information (IRanges) Genomic Ranges (GRanges)
GenomicFeatures
Prequel: The pieces and layers of Bioconductor
The elephant in the room
Why does Bioconductor reinvent so many wheels using S4?
• and so on
Is this just so much yak shaving?
The core components of Bioconductor classes work together
Orchestration provides power (SummarizedExperiment)
Some concepts
Run Length Encoding (Rle)
• Chromosomes can be large (> 108 base pairs), which make for big vectors • Information on chromosomes tends to come in chunks, and many repeats • We can use compression to make it easier to work with in memory.
• Run-length encoding (RLE) is a common compression technique for storing long sequences with lengthy repeats.
• Rle vectors of a sequence will represent repeats by their length, e.g., A8
• DNA only has four letters, so this is pretty efficient even for sequences
Annotation
• Information on genomes includes lots of metadata, like gene names, peak heights, and so on
• We need to use this information efficiently
• Genomic information in Bioconductor can include annotation, which is a kind of data frame attached to the vector of ranges.
• Chromosomes can be large (> 108 base pairs), which means creating lots of subsets of sequences is very inefficient.
• We need a solution
• Keep the original sequence, but only pass a reference to a location, not the value. • Associate ranges with the sequence to select subsequences
Integer ranges of information (IRanges)
All genome browswer information sits on ranges
The IRanges class (from Bioconductor package IRanges)
ir <- IRanges(c(1, 8, 14, 15, 19, 34, 40),
width=c(12, 6, 6, 15, 6, 2, 7))
IRanges object with 7 ranges and 0 metadata columns:
start end width
[1] 1 12 12
[2] 8 13 6
[3] 14 19 6
[4] 15 29 15
[5] 19 24 6
[6] 34 35 2
[7] 40 46 7
What is an IRanges object?
• A bit like a vector (A vector of integer ranges, built from the S4Vectors framework) • A bit like a data frame (you can add arbitrary metadata to it)
• Very memory-efficient
• Lots of associated methods that operate on IRanges
A convenience function (from the IRanges vignette)
plotRanges <- function(x, xlim=x, main=deparse(substitute(x)),
col="black", sep=0.5, ...)
height <- 1
if (is(xlim, "IntegerRanges"))
xlim <- c(min(start(xlim)), max(end(xlim)))
bins <- disjointBins(IRanges(start(x), end(x) + 1))
plot.new()
plot.window(xlim, c(0, max(bins)*(height + sep)))
ybottom <- bins * (sep + height) - height
rect(start(x)-0.5, ybottom, end(x)+0.5, ybottom + height, col=col, ...)
title(main)
Plotting some ranges
plotRanges(ir)
0 10 20 30 40
Normal ranges
• A normal IRange has no overlaps.
• Any IRange can be normalised using reduce(ir). • A normal IRange is also a set of integers
Operations on a single IRanges object
• In addition to reduce(ir), you may find start(ir), end(ir) and width(ir) useful.
• These produce the corresponding integer vectors that define the IRange.
Many other operations to modify ranges, such as narrow, resize, flank, reflect, restrict, and threebands
Finding overlaps between ranges
findOverlaps(ir,reduce(ir))
Hits object with 7 hits and 0 metadata columns:
queryHits subjectHits
[1] 1 1
[2] 2 1
[3] 3 1
[4] 4 1
[5] 5 1
[6] 6 2
[7] 7 3
queryLength: 7 / subjectLength: 3
Calculating “coverage”
coverage(ir)
coverage(reduce(ir))
integer-Rle of length 46 with 11 runs
Lengths: 7 5 2 4 1 5 5 4 2 4 7
Values : 1 2 1 2 3 2 1 0 1 0 1
integer-Rle of length 46 with 5 runs
Lengths: 29 4 2 4 7
Values : 1 0 1 0 1
Genomic Ranges (GRanges)
What are GRanges?
• Bioconductor package GenomicRanges.
• Builds upon IRanges.
• Includes seqnames (typically chromosomes), and strand information. • Like IRanges, metadata can be added to GRanges objects
An example
gr <- GRanges(
seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
ranges = IRanges(101:110, end = 111:120, names = head(letters, 10)),
strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
score = 1:10,
GC = seq(1, 0, length=10))
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | score GC
a chr1 101-111 -|
b chr2 102-112 +|
c chr2 103-113 +|
d chr2 104-114 *|
e chr1 105-115 *|
1 1.000000
2 0.888889
3 0.777778
4 0.666667
5 0.555556
Power of GRanges
• The seqnames means that IRanges on different chromosomes never overlap
• The IRanges and Rle encoding ensure memory efficiency and speed
• When seqnames correspond to BSgenome chromosome names, GRanges can be used to retrieve sequences efficiently
• Metadata allows building all the types of data that appear in Genome browsers
GenomicFeatures
The GenomicFeatures package
The GenomicFeatures packages defines TxDb objects
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
head(seqlevels(txdb),25)
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY" "chrM"
What’s in a TxDb object?
columns(txdb)
[1] "CDSCHROM" "CDSEND" "CDSID" "CDSNAME" "CDSPHASE"
[6] "CDSSTART" "CDSSTRAND" "EXONCHROM" "EXONEND" "EXONID"
[11] "EXONNAME" "EXONRANK" "EXONSTART" "EXONSTRAND" "GENEID"
[16] "TXCHROM" "TXEND" "TXID" "TXNAME" "TXSTART"
[21] "TXSTRAND" "TXTYPE"
Getting some ranges
genes(txdb)
1662 genes were dropped because they have exons located on both strands
of the same reference sequence or on more than one reference sequence,
so cannot be represented by a single genomic range.
Use 'single.strand.genes.only=FALSE' to get all the genes in a
GRangesList object, or use suppressMessages() to suppress this message.
GRanges object with 29721 ranges and 1 metadata column:
1 chr19
10 chr8
100 chr20
1000 chr18
100008586 chrX
... ...
ranges strand | gene_id
58345178-58362751
18386311-18401218
44619522-44652233
27932879-28177946
49551278-49568218
-| 1 +| 10 -| 100 -| 1000 +| 100008586
Getting some ranges
transcripts(txdb)
GRanges object with 266064 ranges and 2 metadata columns:
[266060] chrUn_GL000220v1
[266061] chrUn_KI270442v1
[266062] chrUn_KI270442v1
[266063] chrUn_KI270744v1
[266064] chrUn_KI270750v1
1 ENST00000456328.2
2 ENST00000450305.2
3 ENST00000473358.1
4 ENST00000469289.1
5 ENST00000607096.1
… …
266060 ENST00000619779.1
266061 ENST00000620265.1
266062 ENST00000611690.1
266063 ENST00000616830.1
266064 ENST00000612925.1
ranges strand | tx_id
11869-14409
12010-13670
29554-31097
30267-31109
30366-30503
155997-156149
380608-380726
217250-217401
51009-51114
148668-148843
+| +| +| +| +|
… . +| +| -| -| +|
Getting some ranges
exons(txdb)
GRanges object with 713360 ranges and 1 metadata column:
[713356] chrUn_GL000220v1
[713357] chrUn_KI270442v1
[713358] chrUn_KI270442v1
[713359] chrUn_KI270744v1
[713360] chrUn_KI270750v1
ranges strand | exon_id
11869-12227
12010-12057
12179-12227
12613-12697
12613-12721
155997-156149
380608-380726
217250-217401
51009-51114
148668-148843
+| 1 +| 2 +| 3 +| 4 +| 5
… . … +| 713356 +| 713357 -| 713358 -| 713359 +| 713360
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com