CS570 Biomedical Science & Health IT
CS544 D1
Foundations of Analytics
Module 6
Guanglan Zhang
Regular Expressions
A regular expression (regex or regexp) is a sequence of characters that defines a search pattern.
This pattern is then used by string searching algorithms for “find” or “find and replace” operations on strings, or for input validation.
The concept came into common use with Unix text-processing utilities.
Regular expressions are used in search engines, in the search-and-replace dialogs of word processors and text editors, in text-processing utilities such as sed and AWK, and in lexical analysis.
Many programming languages provide regex capabilities, built-in or via libraries.
https://en.wikipedia.org/wiki/Regular_expression
https://stringr.tidyverse.org/articles/regular-expressions.html
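The three uses named above (find, find and replace, input validation) can be sketched with stringr. The sample strings and variable names are invented for illustration and are not from the course material.

```r
library(stringr)

# Hypothetical free-text notes, invented for illustration
notes <- c("BP 120/80 on 2021-03-04",
           "no vitals recorded",
           "BP 135/90 on 2021-04-11")

# Pattern: one or more digits, a slash, one or more digits
bp_pattern <- "\\d+/\\d+"

# "Find": which strings contain something shaped like a blood-pressure reading?
has_bp <- str_detect(notes, bp_pattern)

# "Find and replace": mask the first such reading in each string
masked <- str_replace(notes, bp_pattern, "***")

# Input validation: does the whole string look like a YYYY-MM-DD date?
is_date <- str_detect(c("2021-03-04", "03/04/2021"), "^\\d{4}-\\d{2}-\\d{2}$")

print(has_bp)
print(masked)
print(is_date)
```

Note that backslashes are doubled inside R string literals, so the regex `\d+` is written `"\\d+"`.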
R package stringr
There are 4 main families of functions in stringr:
Character manipulation: these functions allow you to manipulate individual characters within the strings in character vectors.
Whitespace tools to add, remove, and manipulate whitespace.
Locale-sensitive operations, whose results vary from locale to locale.
Pattern matching functions. These recognise four engines of pattern description. The most common is regular expressions, but there are three other tools.
https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html
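The four pattern engines mentioned above are selected in stringr with the helper functions regex(), fixed(), coll(), and boundary(). The following is a minimal sketch with made-up inputs, showing how the same pattern text behaves under different engines.

```r
library(stringr)

x <- c("a.b", "axb")   # hypothetical inputs

# regex() is the default engine: "." matches any single character
regex_hits <- str_detect(x, "a.b")            # equivalent to regex("a.b")

# fixed() matches the literal characters: "." matches only a period
fixed_hits <- str_detect(x, fixed("a.b"))

# coll() compares strings using locale-aware collation rules
coll_hits <- str_detect(x, coll("a.b"))

# boundary() matches boundaries between characters, words, lines, or sentences
word_count <- str_count("tidy data is easy", boundary("word"))

print(regex_hits)
print(fixed_hits)
print(coll_hits)
print(word_count)
```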
Functions in stringr
>str_c() #used for joining multiple strings into a single string
>str_flatten() #collapses a character vector into a single string
>str_length() #returns the number of characters in each input string
>str_sub() #extracts substrings from the given input string vector
>str_dup() #duplicates the string by the specified number of times
>str_trim() #removes whitespace from the start and end of each input string
>str_squish() #removes whitespace from both ends of the string and collapses consecutive whitespace characters within it into a single space
>str_pad() #pads the given strings to the given width
>str_trunc() #truncates the given string
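A short sketch of the functions listed above, one call each. The input strings are arbitrary examples chosen for illustration.

```r
library(stringr)

joined    <- str_c("stat", "istics")                         # join strings
flat      <- str_flatten(c("a", "b", "c"), collapse = "-")   # vector -> one string
n_chars   <- str_length(c("R", "stringr"))                   # characters per string
part      <- str_sub("Module 6", 1, 6)                       # characters 1 to 6
repeated  <- str_dup("ab", 3)                                # repeat 3 times
trimmed   <- str_trim("  hi  ")                              # strip both ends
squished  <- str_squish("  too   many  spaces ")             # strip and collapse
padded    <- str_pad("7", width = 3, pad = "0")              # pad on the left by default
truncated <- str_trunc("Foundations of Analytics", 14)       # cut to 14 chars incl. "..."

print(c(joined, flat, part, repeated, trimmed, squished, padded, truncated))
print(n_chars)
```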
Functions in stringr
>str_detect() #detects if the specified pattern is present in the given input string
>str_subset() #returns the input strings that contain a match for the pattern
>str_locate() #locates the start and end positions of the first match of the pattern in each input string
>str_locate_all() #locates all occurrences of the pattern in the input strings
>str_extract() #extracts the first match of the pattern in each input string
>str_extract_all() #extracts all matches of the pattern in each input string
>str_match() #returns a matrix with the matched groups for the first match in the input strings
>str_match_all() #returns a list of matrices showing all the matches in the input strings
>str_replace() #replaces the first matched pattern in each input string with the specified string
>str_replace_all() #replaces all the matched patterns in each input string with the specified string
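The pattern-matching functions above can be compared side by side on one small input. This is an illustrative sketch; the strings and the digit pattern are assumptions made for the example.

```r
library(stringr)

# Hypothetical inputs, invented for illustration
x <- c("apple 12, pear 7", "no numbers", "kiwi 305")
digits <- "\\d+"   # one or more digits

detected  <- str_detect(x, digits)        # logical: is there a match?
matching  <- str_subset(x, digits)        # the strings that contain a match
first_loc <- str_locate(x, digits)        # matrix: start/end of the first match
first_num <- str_extract(x, digits)       # first match, NA when there is none
all_nums  <- str_extract_all(x, digits)   # list: every match per string

# Parentheses define capture groups; str_match() returns the full match
# in column 1 and one further column per group
groups <- str_match(x, "([a-z]+) (\\d+)")

first_replaced <- str_replace(x, digits, "#")       # first match only
all_replaced   <- str_replace_all(x, digits, "#")   # every match

print(first_loc)
print(groups)
print(all_replaced)
```

The `_all` variants return a list (or replace every occurrence) because a single input string can contain several matches.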
Functions in stringr
>str_count() #counts the number of matches of the pattern in the input strings
>str_split() #splits the input string into parts over the specified pattern
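A brief sketch of counting and splitting. The sequences and the date string are made-up examples.

```r
library(stringr)

# Hypothetical sequences, invented for illustration
dna <- c("ATGCGATA", "GGCC")

# Number of matches of the pattern in each string
a_counts <- str_count(dna, "A")

# str_split() returns a list with one character vector per input string
date_parts <- str_split("2021-03-04", "-")

# simplify = TRUE returns a character matrix instead of a list
fields <- str_split("a,b;c", "[,;]", simplify = TRUE)

print(a_counts)
print(date_parts)
print(fields)
```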
Big picture
Data wrangling is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.
https://r4ds.had.co.nz/introduction.html
Tidyverse
The tidyverse is a collection of R packages for data preparation, data cleaning, and data transformation, all of which fall under the broader scope of data wrangling.
A tibble is a variant of the data frame (it inherits from the data frame class) that is easier to work with than a plain data frame.
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen.
glimpse() provides a sneak peek of the data: it prints the number of rows and the number of columns, then one line per column showing as many values as fit in the width of the console.
https://www.datacamp.com/courses/introduction-to-the-tidyverse?tap_a=5644-dce66f&tap_s=213362-c9f98c
https://www.tidyverse.org/
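The tibble behaviour described above can be tried on R's built-in mtcars data set. This is a minimal sketch; the choice of mtcars and the variable name cars are assumptions made for illustration.

```r
library(tibble)

# Convert a built-in data frame to a tibble (row names are dropped by default)
cars <- as_tibble(mtcars)

# A tibble is still a data frame
print(is.data.frame(cars))
print(class(cars))

# Printing shows only the first 10 rows and the columns that fit on screen
print(cars)

# glimpse() prints the dimensions, then one line per column
glimpse(cars)
```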
R package dplyr
The dplyr package provides the functions that are most commonly needed for data preprocessing, transformations, and manipulations.
filter() – extract existing observations (rows) by their values
arrange() – reorder the rows of the data
select() – pick existing variables (columns) by their names
mutate() – create new variables from existing variables
summarize() – collapse many values into a single summary, typically per group when combined with group_by()
https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
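The five verbs above can be chained with the pipe operator. The sketch below uses the built-in mtcars data set; the derived column wt_kg and the result names are hypothetical, chosen only for this example.

```r
library(dplyr)

# filter, select, mutate, and arrange on the built-in mtcars data
small_engines <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%            # keep rows with 4 or 6 cylinders
  select(mpg, cyl, wt) %>%                # keep three columns
  mutate(wt_kg = wt * 1000 * 0.4536) %>%  # wt is in 1000 lb; convert to kg (approx.)
  arrange(desc(mpg))                      # highest mpg first

# summarize after group_by: one row per group
by_cyl <- small_engines %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg), n = n())

print(head(small_engines))
print(by_cyl)
```

Note that R is case-sensitive, so the function names must be written in lowercase exactly as shown.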
Tidy data sets
A tidy data set makes further analysis and exploration easier. Three interrelated rules make a data set tidy:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
Main advantages of having tidy data:
There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. Most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.
https://r4ds.had.co.nz/tidy-data.html
Data Organization with tidyr
The following functions from the tidyr package are frequently used in tidying the data:
gather() – takes multiple columns in the data set and collapses them into key-value pairs. The resulting data set is typically known as the long form of the data.
separate() – takes the values in a column and splits them into multiple columns.
unite() – takes the values from multiple columns and combines them into a single column.
spread() – takes data in long form and spreads the key-value pairs across multiple columns.
https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
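The four tidyr functions above can be sketched on two tiny tables. All data values and column names here are invented for illustration. Recent tidyr releases mark gather() and spread() as superseded by pivot_longer() and pivot_wider(), but both still work as described.

```r
library(tidyr)

# Hypothetical wide table: one row per country, one column per year
wide <- data.frame(country = c("A", "B"),
                   `1999`  = c(10, 20),
                   `2000`  = c(15, 25),
                   check.names = FALSE)

# gather(): collapse the year columns into key-value pairs (long form)
long <- gather(wide, key = "year", value = "cases", `1999`, `2000`)

# spread(): the inverse, one column per key
wide_again <- spread(long, key = year, value = cases)

# Hypothetical table with two values packed into one column
rates <- data.frame(country = c("A", "B"),
                    rate    = c("10/100", "20/400"))

# separate(): split one column into several (results are character columns)
split_rates <- separate(rates, rate, into = c("cases", "population"), sep = "/")

# unite(): combine several columns back into one
united <- unite(split_rates, "rate", cases, population, sep = "/")

print(long)
print(wide_again)
print(split_rates)
print(united)
```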