CS代考 XLS-76K-92907.xlsx. I suggest that you organize the data files for the proj

Maths Skills 2 – Statistics Assessment University of York
Spring Term 2022
This assignment concerns data, from the National Health Service, on consultant-led referral to treatment waiting times. At first you will consider summary time series data, aggregated at national level and type of treatment, contained in a single spreadsheet. Then, you will collate several spreadsheets containing data at finer regional level and distinguishing by treatment. In both cases, you will download data from a website, read them into R, perform some computations on them, and produce graphical displays.
The process and results of the analysis should be documented in an R Markdown file, producing an output document in PDF format. Code alone is not acceptable: at each step, you need to explain what you are doing and comment (briefly!) on the results. Your submission should consist of a single zipped folder containing only:

Copyright By PowCoder代写 加微信 powcoder

1. the R Markdown file (.Rmd file) and 2. the output PDF document.
Absence of one of the files will incur a very high penalty. Inclusion in the zipped folder of other unrequested files, including the data files, will also attract a penalty.
The maximum number of pages of the PDF document is 10, all inclusive. Because of this, it is best to do without a table of contents; however, sectioning of the document is strongly encouraged. In particular, separate sections should be devoted to the two parts of the assignment.
You are allowed to use and adapt for your purposes all the materials presented in the lectures and posted on Moodle, including the R Markdown input files, without any need of referencing them. However, you should provide, at the end of your report, references to the R packages you are using and to the data sources.
You may informally discuss with fellow students how to perform a task, but you are not allowed to share your code, and you should write the report on your own.
The following is a more detailed description of what the assignment entails.
Referral to Treatment (RTT) data
Data on consultant-led referral to treatment (RTT) waiting times in England are available at the following NHS website https://www.england.nhs.uk/statistics/statistical-work-areas/rtt- waiting-times/. The goal of the assignment is to display how the numbers of referred patients, still waiting for their treatment to begin at the end of each month, have changed from January 2017 to December 2021. We will also plot (a) two quantiles of the distribution of waiting

times and (b) the percentages of such patients who have been waiting for no more than 18 weeks (the corresponding NHS operational standard is 92%) and for more than one year.
Part A: Incomplete RTT Pathways – National time series
1. The section Latest Data in the above website contains links to pages with the data, one page for each financial year. Follow the link 2021-22 RTT waiting times data and download the file under England-level time series: at the time of writing
(21/2/2022) the file was RTT-Overview-Timeseries-Dec21-XLS-76K-92907.xlsx. I suggest that you organize the data files for the project as follows. Say that you have a folder named MS2-Stats containing all materials for the Statistics part of Maths Skills 2. Create a subfolder Project where you put all the work for this assignment, with a sub- folder DATA. Move in the DATA folder the file you have downloaded. Its location will then be MS2-Stats/Project/DATA/RTT-Overview-Timeseries-Dec21-XLS-76K-92907.xlsx.
2. Use the function read_excel() in package readxl to read the data into R, assigning it to an object called overview. You will need to skip the first 9 lines and set the na argument so that it handles character strings “-” present in the xlsx file (by default read_excel() only reads empty cells as NAs). Keep in overview only the columns from the second to the tenth. As you can see from the xlsx file, columns 3 to 10 contain data on Incomplete RTT Pathways, i.e., waiting times for patients that have been referred for treatment but are, at the end of the month, still waiting for it to start. Change the names of the variables to the following shorter ones: Time, Median_wait, 92nd_percentile, IPs_within_18_weeks, %_within_18_weeks, IPs_>18_weeks, IPs_>52_weeks, %_>52_weeks, Total_IPs.
3. The function read_excel() has read the second column in the xlsx file as class POSIXct, a more general class than Date, as it can accommodate both dates and times. Since the variable Time only contains date information, convert it to class Date.
The variable 92nd_percentile is of class character and contains some instances of the string “52+”. In order to convert the variable to numeric, first change the strings “52+” to “52”. The functions str_replace() and fixed() will be useful for this. Columns 6 and 9 in the xlsx file contain numbers followed by the % symbol. The corresponding columns in the tibble overview no longer contain the % symbols, while the numbers have been divided by 100, e.g. 57.2% has been read as 0.572. Restore the percentages, by multiplying the relevant columns of overview by 100.
Keep in overview only the rows corresponding to months from January 2017 to December 2021.
4. Use ggplot2 to reproduce the three plots reported in the following page. For the third plot you will need to create an additional variable, say %_>18-52_weeks, containing the percentages of IPs with a wait between 18 weeks (excluded) and 52 weeks (included).

50th and 92nd percentile of wait in weeks
Quantiles Median
92nd Percentile
Numbers of incomplete pathways by weeks waited
IPs Numbers (mil) Total
<= 18 weeks > 18 weeks > 52 weeks
Percentages of pathways by time waited in weeks
Percentages <= 18 weeks > 52 weeks
>18 − 52 weeks
Percentage of pathways Numbers waiting (millions) Wait time in weeks

Part B: Incomplete RTT Pathways by Region and Treatment
The aim of this part is to produce plots similar to the one at the bottom of the previous page, but for regional rather than national level, and accounting for the different treatments. Achieving this aim requires assembling together data from different spreadsheets, and a fair amount of data cleaning, because of changes in data recording that have occurred through time.
1. Return to the section Latest data in the NHS website, follow the links listed there and download the Excel files with names beginning with Incomplete Commissioner from December 2021 back to January 2017. Each Excel file contains several sheets, but in this assignment only the data from the Region sheet will be used. Save the downloaded xls files in separate folders named Year-Mo: for instance the Jan- uary 2017 file Incomplete-Commissioner-Jan17-revised-XLS-4346K.xls should be saved into a folder named 2017-01. Put all the folders Year-Mo into the folder MS2-Stats/Project/DATA created in the previous part.
2. Read the xls files into R using read_excel() from package readxl. Begin by using list.files() to create a vector fullfilenames containing the full names of the files you need to read (you may need to specify a pattern argument). Then set a tibble rtt to NULL and loop over the entries in fullfilenames. Within each pass in the loop:
• use str_extract() to extract from the entry in fullfilenames the Year-Mo string;
• call read_excel() with the appropriate sheet and skip arguments, put the result in a
tibble called data;
• add to the tibble data another variable Time with all entries equal to the Year-Mo string,
then use relocate() to move Time to the first column;
• use bind_rows() to append the tibble data to the tibble rtt containing the data already
The above works for all months prior to April 2021. Since April 2021, the number of columns in the spreadsheet has increased from 62 to 115. This would not be much of a problem, as bind_rows() matches columns by name. However, the old column 52 plus was also renamed as Total 52 plus weeks. Because of this, one has to modify the above procedure: before using bind_rows(), if the Year-Mo string is equal to or larger than “2021-04”, change the name of the variable Total 52 plus weeks to 52 plus.
3. Only keep in the tibble rtt the six variables Time, Region Name, Treatment Function, Total number of incomplete pathways, Total within 18 weeks, and 52 plus. Change the names to the last five variables to Region, Treatment, Incomplete_Pathways, IPs_within_18_weeks, and IPs_>52_weeks.
Create, in the tibble rtt, a new variable IPs_>18-52_weeks containing the numbers of IPs between 18 weeks (excluded) and 52 weeks (included). Then relocate the variable IPs_>52_weeks after the variable IPs_>18-52_weeks.
Transform Time into a variable of class Date, then use the function days_in_month() in package lubridate to use the correct last day of the month (e.g. 2019-02-28, but 2020-02-29).
4. Transform Treatment from a character variable into a factor.
Another change that took effect in April 2021 concerns the levels of Treatment: these were modified, often by adding Service to a previously used name. In order to make the

notation consistent through the whole period, change the levels of the factor Treatment so that only the following 25 levels are present:
General Surgery, Urology, Trauma & Orthopaedics, Ear, Nose & Throat, Ophthalmology, Oral Surgery, Neurosurgery, Plastic Surgery, Cardiothoracic Surgery, General Medicine, Gastroenterology, Cardiology, Dermatology, Thoracic Medicine, Neurology, Rheumatology,
Geriatric Medicine, Gynaecology, Other, Total,
Other – Medical Services, Other – Mental Health Services,
Other – Paediatric Services, Other – Surgical Services,
Other – Other Services.
Most of the changes are obvious, here are a few that are possibly not:
General Internal Medicine Service → General Medicine
Respiratory Medicine Service → Thoracic Medicine
Elderly Medicine Service → Geriatric Medicine
ENT → Ear, Nose & Throat
Another modification to Treatment that occurred in April 2021 was the replacement of the level Other with five levels called “Other – …”, as reported in the list above. This change has to be undone, if one wants to have a complete time series for Other: create a new factor called Treat_20lev, by recoding the levels of Treatment so that all the “Other – …” levels are mapped to Other. Then move the new variable Treat_20lev just after the variable Treatment, in the tibble rtt, so that visual checks of the two become easier.
5. The variable Region has 11 different values but, as we will see shortly, the classification has changed through time. Begin by making Region into a factor, then change its levels with fct_recode() to the following shorter ones:
NHS England, North, Midlands & East, London, South, South West,
South East, Midlands, East, North West, N. East & Yorkshire.
As hinted at above, until March 2018 Region had only five levels:
NHS England, North, Midlands & East, London, and South.
Starting from April 2018 the category South was split into two: South West and South East. Then from April 2019, Midlands & East was split into Midlands and East, while North was split into North West and N. East & Yorkshire.
In order to produce time series plots for the whole period, these changes need to be undone. Create a new variable Region_5lev, by changing the levels of Region so that South, South West and South East are all mapped to the same level South. Handle similarly Midlands & East and North.
In order to make comparisons easier, move the factor Region_5lev just after the factor Region.
6. At this stage the tibble rtt has 9 columns and 8508 rows. But
20 treatments × 60 months × 5 levels of Region = 6000 rows !
The discrepancy is due to there being multiple rows in rtt with the same combinations of Month, Treat_20lev, and Region_5lev (but different values of Treatment and/or Region), because of the alignment of Treatment and Region levels performed in the previous two points. The corresponding entries of the four variables in rtt have to be summed. In order to do that, use group_by() and summarize(), saving the result

to a tibble named rtt_Reg5_Trt20. This new tibble should have seven columns and 6000 rows.
Then, add to rtt_Reg5_Trt20 three additional variables containing the percentages of IPs within 18 weeks, between 18 and 52 weeks, and larger than 52 weeks: call them %_within_18_weeks, %_>18-52_weeks, and %_>52_weeks.
7. Consider the data in tibble rtt_Reg5_Trt20 and choose a level trt of Treat_20lev and a level reg of Region_5lev.
Send me an email at with subject Maths Skills 2 – TRT, to inform me of your choices: I may ask you to change them, if they turn out to be too popular.
Using ggplot2, reproduce the plot at the bottom of page 3 for your chosen levels trt and reg.
Use facet_wrap() to produce the same plot:
• for all regions in Region_5lev, at the chosen level trt of treatment;
• for all treatments in Treat_20lev, at the chosel level reg of region.

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com