Portfolio 1: Data Wrangling
To complete these exercises you will need to download the data .csv files and the assignment .Rmd file, titled GUID_RM2_Portfolio1.Rmd, which you will edit. Both can be downloaded from the Moodle site. Once downloaded and unzipped, create a new folder to use as your working directory, put the data files and the .Rmd file in that folder, and set your working directory to that folder through the drop-down menus at the top of RStudio.
Now open the assignment .Rmd file within RStudio. You will see there is a code chunk for each task; follow the instructions on what to edit in each code chunk. This will often mean entering code based on the lab you have just done, rather than just entering a value.
In the labs we learned about data wrangling using the Wickham six verbs, looked at additional functions such as gather() and inner_join(), and piped chains of code together for efficiency using %>%. You will need these skills to complete the following exercises, so please make sure you have carried out the PreClass and InClass activities before attempting them. Remember to follow the instructions and, if you get stuck at any point, to post questions in the #rstats channel on msc-conv-rm2.slack.com. Two useful online resources are:
• Hadley Wickham’s R for Data Science book @ http://r4ds.had.co.nz
• RStudio’s dplyr cheatsheet @ Rstudio.com
Today’s Topic – The Ageing Brain
A key topic in current psychological research, and one which forms a main focus for some of the research in our School, is that of human ageing. In this research we use brain imaging techniques to understand how changes in brain function and structure relate to changes in perception and behaviour. A typical ‘ageing’ experiment will compare a measure (or a number of measures), such as performance on a cognitive or perceptual task, between younger and older adults (i.e. a between-subjects design experiment).
However, to make sure we are studying ‘healthy’ ageing, we first have to ‘screen’ our older participants for symptoms of age-related dementia (Alzheimer’s Disease), where cognitive function can be significantly impaired. We do this using a range of cognitive tests. Some studies will also test participants’ sensory acuity (ability to perceive something) as a function of age (particularly eyesight and hearing).
The data you have downloaded for this lab is example screening data taken from research investigating how the ageing brain processes different types of sounds. The tests used in this study are detailed below. Please note that the links are there to provide you with further information and examples of the tests once you have completed the assignment, if you so wish; you do not have to read them to complete the worksheet.
• Montreal Cognitive Assessment (MoCA): a test specifically devised as a stand-alone screening tool for mild cognitive impairment. It assesses visuospatial skills, memory, language, attention, orientation, and abstraction skills. Example here
• Working Memory Digit Span Test (D-SPAN): measures the capacity of participants’ short-term (working) memory. Example here
• D2 Test of Attention: measures participants’ selective and sustained concentration and visual scanning speed. Example here
• Better Hearing Institute Quick Hearing Check: a self-report questionnaire which measures participants’ subjective experience of their own hearing abilities. Paper version and scoring on pages 8-9. Example here and Main Test page here
The data files
You have just downloaded the three .csv files containing all the data you need. Below is a list of the .csv file names and a description of the variables each contains:
• p_screen.csv contains participants’ demographic information including:
• ID Participant Id number – for confidentiality (no names or other identifying info)
• AGE in years
• SEX M for male, F for female
• HANDEDNESS L for left-handed, R for right-handed
• EDUCATION in years
• MUSICAL whether they have any musical abilities/experience (YES or NO)
• FLANG whether they speak any foreign languages (YES or NO)
• MOCA Montreal Cognitive Assessment score
• D-SPAN Working Memory Digit Span test score
• D2 D2 Test of Attention score
• QHC_responses.csv contains participants’ responses to each question on the Better Hearing Institute Quick Hearing Check (QHC) questionnaire.
• Column 1 represents participants’ ID (matching up to that in p_screen.csv).
• Each column thereafter represents one of the 15 questions from the questionnaire.
• Each row represents a participant and their response to each question.
• QHC_scoring.csv contains the scoring key for each question of the QHC, with the columns:
• RESPONSE the types of responses participants could give (STRONGLY DISAGREE, SLIGHTLY DISAGREE, NEUTRAL, SLIGHTLY AGREE, STRONGLY AGREE)
• SCORE the points awarded for each response type (from 0 to 4). A score for each participant can be calculated by converting their categorical responses to values and summing the values.
Before starting, let’s check:
• The .csv files are saved into a folder on your computer and you have manually set this folder as your working directory.
• The .Rmd file is saved in the same folder as the .csv files. For assessments we ask that you save it with the format GUID_RM2_Portfolio1.Rmd where GUID is replaced with your GUID.
You may want to practise the tasks first to get the correct code and format, and to make sure they work. You can do this in the console or a script, but remember that once you have the correct code, you must edit the necessary parts of this assignment .Rmd file to produce a reproducible Rmd file. This is what you will do from now on for all other portfolio files, so practising this now will really help. In short, go through the tasks, replace only the NULL with what the question asks for, and then make sure that the file knits at the end so that you have fully reproducible code.
When altering code inside the code blocks, do not re-order or rename the code blocks (T1, T2, … etc.). If you do this then the code is no longer consistent across people and this will impact your grade!
Task 1 – Load the tidyverse
In the code chunk below, write and run the code to load the tidyverse.
# hint: library("something")
Task 2 – Load in the data
Now that we have the tidyverse loaded, edit the code below to load in the three data files, p_screen.csv, QHC_responses.csv, and QHC_scoring.csv, and then run the code (make sure you have loaded the tidyverse). You need to replace the NULL values with the code used to load each of the data files. Ensure that the names of the csv files are exactly as they were when you downloaded them from Moodle.
You should use read_csv() to load the files and NOT read.csv(), because read.csv() will change the names of your variables and you won’t get the correct answers for the rest of the tasks!
# hint: name <- read_csv("file_name.csv")
screening <- NULL
responses <- NULL
scoring <- NULL
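To see why the choice of function matters, the sketch below reads the same small file with both functions. The file and its contents are made up for illustration and are not the assignment data; the point is only that base R's read.csv() rewrites column names that are not valid R names, while read_csv() keeps them as written.

```r
library(readr)

# Hypothetical two-row file, written to a temporary location for illustration
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,D-SPAN", "1,7", "2,5"), tmp)

# Base R converts the hyphen to a dot when it checks the column names
base_names <- names(read.csv(tmp))

# readr leaves the header exactly as it appears in the file
# (col_types = cols() just silences the column-type message)
tidy_names <- names(read_csv(tmp, col_types = cols()))

base_names
tidy_names
```

A column name that contains a hyphen has to be wrapped in backticks whenever you refer to it in code.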
View the data
• It is always a good idea to familiarise yourself with the layout of the data that you have just loaded in. You can do this using glimpse() or View() in the Console window, but you must never put these functions in your assignment file.
Task 3 - Oldest Participant
Replace the NULL in the T3 code chunk with the Participant ID of the oldest participant. Store this single value in oldest_participant (e.g. oldest_participant <- 999).
# hint: look at your data, who is oldest?
oldest_participant <- NULL
Task 4 - Arranging D-SPAN
Replace the NULL in the T4 code chunk with code that arranges participants’ D-SPAN performance from highest to lowest using the appropriate one-table dplyr (i.e., Wickham) verb. Store the output in cogtest_sort. (e.g. cogtest_sort <- verb(data, argument))
# hint: arrange your screening data
cogtest_sort <- NULL
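As a reminder of how sorting from highest to lowest looks in general, here is a minimal sketch on a made-up table. The table runners and its columns are hypothetical and have nothing to do with the screening data; it only shows arrange() combined with desc(), and backticks around a column name containing a hyphen.

```r
library(dplyr)

# Hypothetical data: three runners and their lap times in seconds
runners <- tibble(
  name = c("Ann", "Bob", "Cat"),
  `lap-time` = c(62, 58, 65)
)

# desc() reverses the default ascending order, so the largest value comes first
slowest_first <- arrange(runners, desc(`lap-time`))
slowest_first
```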
Task 5 - Foreign Language Speakers
Replace the NULL in each of the two lines of code chunk T5, so that descriptives has a column called n that shows the number of participants that speak a foreign language and number of participants that do not speak a foreign language, and another column called median_age that shows the median age for those two groups. If you have done this correctly, descriptives should have 3 columns and 2 rows of data, not including the header row.
# hint: First need to group_by() foreign language
screen_groups <- NULL
# hint: second need to summarise(). Pay attention to specific column names given.
descriptives <- NULL
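The two-step pattern of grouping and then summarising can be sketched on a made-up table as follows. The table pets and all of its column names are hypothetical; the example only illustrates how n() counts the rows in each group and how a second summary column is added in the same summarise() call.

```r
library(dplyr)

# Hypothetical data: five owners, whether they have a dog, and weekly walks
pets <- tibble(
  owner = 1:5,
  has_dog = c("YES", "NO", "YES", "YES", "NO"),
  walks = c(7, 2, 5, 9, 4)
)

# Step 1: group the rows by the categorical column
pet_groups <- group_by(pets, has_dog)

# Step 2: one row per group, with a count and a median
pet_summary <- summarise(pet_groups, n = n(), median_walks = median(walks))
pet_summary
```

The result has one row per group and three columns: the grouping column plus the two named summaries.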
Task 6 - Creating Percentage MOCA scores
Replace the NULL in the T6 code chunk with code using one of the dplyr verbs to add a new column called MOCA_Perc to the dataframe screening. This new column should contain the MOCA scores converted to percentages. The maximum achievable score on MOCA is 30 and percentages are calculated as (participant score / max score) * 100. Store this output in screening.
# hint: mutate() something using MOCA and the percentage formula
screening <- NULL
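The percentage formula can be illustrated on a made-up table. The table quiz, its columns, and the maximum of 20 points are assumptions chosen for this sketch only; it shows how mutate() adds a column computed from an existing one.

```r
library(dplyr)

# Hypothetical data: two students and their points on a 20-point quiz
quiz <- tibble(student = c("A", "B"), points = c(15, 20))

# (score / max score) * 100, stored in a new column alongside the old ones
quiz <- mutate(quiz, points_perc = (points / 20) * 100)
quiz
```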
Task 7 - Remove the MOCA column
Now that we have our MoCA score expressed as a percentage MOCA_Perc we no longer need the raw scores held in MOCA. Replace the NULL in the T7 code chunk using a one-table dplyr verb to keep all the columns of screening, with the same order, but without the MOCA column. Store this output in screening.
# hint: select your columns
screening <- NULL
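Dropping a single column while keeping the rest in their original order can be sketched like this. The table quiz and its columns are hypothetical; the minus sign inside select() is one common way to exclude a column.

```r
library(dplyr)

# Hypothetical data with a raw column and its percentage version
quiz <- tibble(
  student = c("A", "B"),
  points = c(15, 20),
  points_perc = c(75, 100)
)

# A minus sign keeps every column except the one named
quiz_trimmed <- select(quiz, -points)
quiz_trimmed
```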
The next set of tasks focus on merging two tables.
You suspect that the older adults with musical experience might report more finely-tuned hearing abilities than those without musical experience. You therefore decide to check whether this trend exists in your data. You measured participants’ self-reported hearing abilities using the Better Hearing Institute Quick Hearing Check Questionnaire. In this questionnaire participants rated the extent to which they agree or disagree with a list of statements (e.g. ‘I have a problem hearing over the telephone’) using a 5-point Likert scale (Strongly Disagree, Slightly Disagree, Neutral, Slightly Agree, Strongly Agree).
Each participant’s response to each question is contained in the responses dataframe in your environment. Each response type is worth a certain number of points (e.g. Strongly Disagree = 0, Strongly Agree = 4), and the scoring key is contained in the scoring dataframe. A score for each participant is calculated by totalling up the number of points across all the questions to arrive at an overall score. The lower the overall score, the better the participant’s self-reported hearing ability.
In order to score the questionnaire we first need to perform a couple of steps.
Task 8 - Gather the Responses together
Replace the NULL in the T8 code chunk using code to gather the responses to all the questions of the QHC from wide format to tidy/long format. Name the first column Question and the second column RESPONSE. Store this output in responses_long.
# hint: gather the question columns (Q1:Q15) in responses
responses_long <- NULL
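For reference, the reshaping from wide to long format can be sketched on a made-up table. The table wide and the names person, item and answer are hypothetical; the example shows how gather() stacks a range of columns into one key column (the old column names) and one value column (the cell contents).

```r
library(tidyr)

# Hypothetical wide data: one row per person, one column per item
wide <- data.frame(
  person = c(1, 2),
  item_a = c("YES", "NO"),
  item_b = c("NO", "NO"),
  stringsAsFactors = FALSE
)

# Stack item_a and item_b: their names go into `item`, their values into `answer`
long <- gather(wide, key = "item", value = "answer", item_a:item_b)
long
```

Two people and two items give four rows in the long table, and the column that was not gathered (person) is repeated for each item.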
Task 9 - Joining the data
Now we need to join the number of points for each response in scoring to the participants’ responses in responses_long.
Replace the NULL in the T9 code chunk using inner_join() to combine responses_long and scoring into a new variable called responses_points.
# hint: join them by the column common to both scoring and responses_long
responses_points <- NULL
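Attaching a lookup table to a long table can be sketched as follows. The tables orders and prices are made up for illustration; the example shows inner_join() matching rows on a column that has the same name in both tables.

```r
library(dplyr)

# Hypothetical long table: one row per order
orders <- tibble(
  customer = c(1, 1, 2),
  size = c("SMALL", "LARGE", "SMALL")
)

# Hypothetical lookup table: one row per category with its numeric value
prices <- tibble(size = c("SMALL", "LARGE"), price = c(2, 5))

# Each order picks up the price whose `size` matches its own
orders_priced <- inner_join(orders, prices, by = "size")
orders_priced
```

Every row of the long table is kept because each of its categories appears in the lookup table; a row whose category had no match would be dropped by inner_join().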
Task 10 - Working the Pipes
Below we have given you a code chunk with 5 lines of code. The code takes the data in its current long format and then creates a QHC score for each participant, before calculating a mean QHC score for the two groups of participants - those that play musical instruments and those that don’t - and stores it in a variable called musical_means.
participant_groups <- group_by(responses_points, ID)
participant_scores <- summarise(participant_groups, Total_QHC = sum(SCORE))
participant_screening <- inner_join(participant_scores, screening, "ID")
screening_groups_new <- group_by(participant_screening, MUSICAL)
musical_means <- summarise(screening_groups_new, mean_score = mean(Total_QHC))
Replace the NULL in the T10 code chunk with the code above converted into a functioning pipeline using pipes. Put each function on a new line, one under the other. This pipeline should result in the mean QHC values of musical and non-musical people stored in musical_means, which should be made of two rows by two columns.
# hint: function1 %>% function2
# hint: in pipes, the output of the previous function is the input of the subsequent function.
musical_means <- NULL
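The general idea of converting step-by-step code into a pipeline can be sketched on a made-up table. The table orders_priced and its columns are hypothetical; the example shows that naming each intermediate result and piping the steps together give the same answer.

```r
library(dplyr)

# Hypothetical data: prices of individual orders per customer
orders_priced <- tibble(customer = c(1, 1, 2), price = c(2, 5, 2))

# Step-by-step version: each intermediate result gets its own name
grouped <- group_by(orders_priced, customer)
totals_steps <- summarise(grouped, total = sum(price))

# Piped version: the output of each line becomes the first argument of the next,
# so the data argument is written only once, at the top
totals_piped <- orders_priced %>%
  group_by(customer) %>%
  summarise(total = sum(price))

totals_piped
```

Because the pipe supplies the data, the first argument of every function after the first line is left out.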
Task 11 - Difference in Musical Means
Finally, replace the NULL in the T11 code chunk with a single value, to two decimal places, that shows how much higher the mean QHC score of people who play music is than that of people who don’t play music (e.g. 2.93).
# hint: look in musical_means and enter the difference between the two means.
QHC_diff <- NULL
Finished
Congratulations! You have finished the first worksheet! Make sure to save this worksheet on a secure drive. You will submit all five worksheets for the portfolio on the 12th March and instructions on how to submit will be given on Moodle nearer the time. Remember that you can ask questions on the Slack forum in the #rstats channel at any time.