# First, do all of the coding in an R script.
# When it is all working, you will use it to
# create an HTML file using R Markdown.
# Instructions for that part are at the bottom
# of this script.

# Hint: A ‘parametric’ test assumes a distributional
# shape (normal, for example), as opposed to
# the randomization and bootstrap methods that
# we have used recently in class.

#################################################

# I. Setup tasks:

# Load ggplot2 and dplyr

# Download the genetic counseling data, which is
# the same as the data from Test 1.
# Both data sets are from a sample of patients who came to
# the UVM Medical Center for genetic counseling.

# Read in Test2_gc.csv as gc, using the
# stringsAsFactors = FALSE argument.

# Read in Test2_gc_payment.csv as gcp; don’t use
# the above argument for this file.

# Use dplyr to merge gc and gcp by matching cases.
# Call the result gcall, and make sure all of the
# cases in gc are in gcall.

# For the tasks below, use gcall.
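
# A minimal sketch of the setup steps, shown on toy data.
# Assumptions: the two files share a patient identifier;
# the key name "ID" and the *_demo objects are hypothetical
# stand-ins, not names taken from the real data.
suppressPackageStartupMessages({
  library(ggplot2)
  library(dplyr)
})

# With the real files, the reads would look something like:
#   gc  <- read.csv("Test2_gc.csv", stringsAsFactors = FALSE)
#   gcp <- read.csv("Test2_gc_payment.csv")

gc_demo  <- data.frame(ID = 1:4, Sex = c("F", "M", "F", "M"),
                       stringsAsFactors = FALSE)
gcp_demo <- data.frame(ID = c(1L, 2L, 4L), Charges = c(120, 300, 80))

# left_join() keeps every row of its first argument, so each
# case in gc_demo appears in the result; a case with no match
# in gcp_demo gets NA for Charges.
gcall_demo <- left_join(gc_demo, gcp_demo, by = "ID")
print(gcall_demo)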

#################################################

# II. Data cleaning tasks:

# Recode the responses to ResState so they say
# VT, NH, NY or Other.

# Make the age vector into a factor, and apply
# meaningful value labels: Under 1, 1 to 17,
# 18 to 24, 25 to 29, 30 to 34 …. 70 to 74,
# and ‘75 and up’.

# Set values of Charges that are negative
# or greater than 5000 to missing (NA).
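
# One way the three cleaning steps might look, on a toy data
# frame. Assumptions (not confirmed by the data): ResState
# already holds two-letter state codes, age is stored as
# integer codes 1 to 14, and the labels elided above follow
# five-year bands. clean_demo and age are hypothetical names.
clean_demo <- data.frame(
  ResState = c("VT", "NH", "MA", "NY", "ME"),
  age      = c(1L, 3L, 5L, 14L, 2L),
  Charges  = c(250, -10, 6200, 480, NA),
  stringsAsFactors = FALSE
)

# Keep VT, NH and NY; anything else (including NA) becomes "Other".
clean_demo$ResState <- ifelse(clean_demo$ResState %in% c("VT", "NH", "NY"),
                              clean_demo$ResState, "Other")

# Build the 14 labels, then attach them to the integer codes.
age_labels_demo <- c("Under 1", "1 to 17", "18 to 24",
                     paste(seq(25, 70, by = 5), "to", seq(29, 74, by = 5)),
                     "75 and up")
clean_demo$age <- factor(clean_demo$age, levels = 1:14,
                         labels = age_labels_demo)

# Negative charges and charges above 5000 become NA.
# which() skips rows where Charges is already NA.
bad_rows_demo <- which(clean_demo$Charges < 0 | clean_demo$Charges > 5000)
clean_demo$Charges[bad_rows_demo] <- NA
print(clean_demo)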

#################################################

# III. Descriptive Stats

# Use dplyr to create a list of the 10 diagnoses
# (ccsdx) that appear most often in the data,
# in descending order of appearance. Include
# the mean charges for each diagnosis, and the
# number of patients with that diagnosis.

# Use dplyr to create a list of the 10 diagnoses
# (ccsdx) that incur the highest charges, on average,
# in descending order of mean charges. Include
# the mean charges for each diagnosis, and the
# number of patients with that diagnosis.
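
# A sketch of both top-10 lists on toy data. It assumes one
# row per patient, so n() counts patients; dx_demo and the
# summary column names are hypothetical.
suppressPackageStartupMessages(library(dplyr))

dx_demo <- data.frame(
  ccsdx   = c("A", "A", "A", "B", "B", "C"),
  Charges = c(100, 200, NA, 1000, 3000, 500),
  stringsAsFactors = FALSE
)

# One summary table serves both lists.
dx_summary_demo <- dx_demo %>%
  group_by(ccsdx) %>%
  summarise(n_patients   = n(),
            mean_charges = mean(Charges, na.rm = TRUE))

# Most frequent diagnoses first.
top_by_count_demo <- dx_summary_demo %>%
  arrange(desc(n_patients)) %>%
  head(10)
print(top_by_count_demo)

# Highest mean charges first.
top_by_charge_demo <- dx_summary_demo %>%
  arrange(desc(mean_charges)) %>%
  head(10)
print(top_by_charge_demo)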

# Using dplyr, print the mean Charges for
# males and females in the data set.

# Using base R, create a vector, called
# mcharges, that contains only the charges for males.
# Also, create a vector, called fcharges, that
# contains only the charges for females.
# Have R print the mean for males, then the mean for females
# on the console, using mean(). Also print the difference
# between the two. The values should match the
# dplyr results above.
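
# A sketch of the dplyr and base R versions side by side, on
# toy data. It assumes Sex is coded "M" and "F", which the
# real data may not do; the *_demo names are hypothetical.
suppressPackageStartupMessages(library(dplyr))

sex_demo <- data.frame(
  Sex     = c("M", "F", "M", "F", "F"),
  Charges = c(100, 400, 300, NA, 200),
  stringsAsFactors = FALSE
)

# dplyr version: one mean per group.
sex_demo %>%
  group_by(Sex) %>%
  summarise(mean_charges = mean(Charges, na.rm = TRUE)) %>%
  print()

# Base R version: which() avoids picking up rows where Sex is NA.
mcharges_demo <- sex_demo$Charges[which(sex_demo$Sex == "M")]
fcharges_demo <- sex_demo$Charges[which(sex_demo$Sex == "F")]

mean_m_demo <- mean(mcharges_demo, na.rm = TRUE)
mean_f_demo <- mean(fcharges_demo, na.rm = TRUE)
print(mean_m_demo)
print(mean_f_demo)
print(mean_m_demo - mean_f_demo)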

# Use ggplot to create a series of boxplots showing charges by
# age groups, so that each boxplot is a different color.
# Give your graph a title, label the y axis with
# “Medical Charges”, and label the legend to say
# ‘Age Groups’
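
# A sketch of the colored boxplots on simulated data. Mapping
# the grouping variable to fill gives each box its own color
# and creates the legend. box_demo and age_group are
# hypothetical names, and the charges are made up.
suppressPackageStartupMessages(library(ggplot2))

set.seed(1)
box_demo <- data.frame(
  age_group = factor(rep(c("Under 1", "1 to 17", "18 to 24"), each = 20),
                     levels = c("Under 1", "1 to 17", "18 to 24")),
  Charges   = rexp(60, rate = 1 / 500)
)

box_plot_demo <- ggplot(box_demo,
                        aes(x = age_group, y = Charges, fill = age_group)) +
  geom_boxplot() +
  labs(title = "Charges by age group (toy data)",
       x = "Age group", y = "Medical Charges", fill = "Age Groups")
print(box_plot_demo)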

# Briefly describe the trend you observe in terms
# of center, spread, and skewness.

#################################################

# IV. Inference, Test 1

# Create an ANOVA object for comparing the mean
# charges for different methods of payment.
# Summarize the model object, so that you can
# see the p-value.
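
# A sketch of a one-way ANOVA with aov() on simulated data.
# The column name payment and its three groups are
# hypothetical; the real payment variable may be named and
# coded differently.
set.seed(2)
pay_demo <- data.frame(
  payment = rep(c("Insurance", "Medicaid", "Self-pay"), each = 15),
  Charges = c(rnorm(15, 900, 200), rnorm(15, 700, 200), rnorm(15, 400, 200)),
  stringsAsFactors = FALSE
)
pay_demo$Charges[c(3, 20)] <- NA  # aov() drops rows with missing values

anova_demo <- aov(Charges ~ payment, data = pay_demo)
print(summary(anova_demo))

# The p-value can also be pulled out of the summary table.
p_value_demo <- summary(anova_demo)[[1]][["Pr(>F)"]][1]
print(p_value_demo)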

# To go with the analysis, create a plot of several
# boxplots showing charges by payment method groups,
# so that each boxplot is a different color.
# Add a title, and change the y axis and legend labels.

# Summarize your results, including the p-value,
# commenting on whether this suggests a
# difference in the mean charges for different
# payment methods.

#################################################

# V. Inference, Test 2

# Use ggplot to create a density plot of charges
# by Sex, making use of facets and color.
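
# A sketch of a faceted density plot on simulated data.
# Stacking the panels in one column keeps the x axes aligned,
# which makes the two shapes easier to compare. dens_demo is a
# hypothetical name, and the "F"/"M" coding is an assumption.
suppressPackageStartupMessages(library(ggplot2))

set.seed(3)
dens_demo <- data.frame(
  Sex     = rep(c("F", "M"), each = 50),
  Charges = c(rexp(50, rate = 1 / 600), rexp(50, rate = 1 / 400)),
  stringsAsFactors = FALSE
)

density_plot_demo <- ggplot(dens_demo, aes(x = Charges, color = Sex)) +
  geom_density() +
  facet_wrap(~ Sex, ncol = 1) +
  labs(title = "Density of charges by sex (toy data)", x = "Charges")
print(density_plot_demo)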

# Describe the distribution shapes,
# and suggest a reason for the difference:
# Why might charges for females look this way,
# and charges for males look this way?

# Run an appropriate ‘parametric’ statistical test
# (see hint at beginning of script)
# to determine if the mean charges are different
# for males versus females. Also find the corresponding
# (parametric) 95% CI for the difference between mean charges.
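
# One common parametric choice here is a two-sample t-test;
# this sketch runs it on simulated vectors. m_demo and f_demo
# are hypothetical stand-ins for the male and female charges.
set.seed(4)
m_demo <- c(rnorm(30, mean = 500, sd = 150), NA)
f_demo <- c(rnorm(35, mean = 650, sd = 150), NA, NA)

# t.test() removes NAs itself and defaults to the Welch
# (unequal variance) version of the test.
ttest_demo <- t.test(m_demo, f_demo, conf.level = 0.95)
print(ttest_demo)

# 95% CI for mean(m_demo) - mean(f_demo).
print(ttest_demo$conf.int)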

# State your CI in a sentence in terms of the problem:
# “I’m 95% sure that…”
# Comment on the results of your statistical test.
# Is the difference statistically significant?
# What can you conclude about the true difference?

#################################################

# VI. Writing Functions

# Put your code here for a function making a 95% Bootstrap
# Confidence interval for one mean. Run your function,
# using the data on Charges. Be sure that the function
# prints the point estimate (the observed mean), and
# the CI with a description: “I’m 95% sure that…”
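
# A sketch of a bootstrap CI function for one mean. It uses
# the percentile method, which is an assumption; the version
# built in class may differ. The function name, the reps
# argument and its default are hypothetical.
boot_mean_ci_demo <- function(x, reps = 10000) {
  x <- x[!is.na(x)]                      # drop missing values
  obs_mean <- mean(x)

  # Resample indices with replacement, keeping the sample size.
  boot_means <- replicate(
    reps, mean(x[sample.int(length(x), replace = TRUE)])
  )

  # The middle 95% of the bootstrap means is the interval.
  ci <- quantile(boot_means, c(0.025, 0.975), names = FALSE)

  cat("Observed mean:", round(obs_mean, 2), "\n")
  cat("I'm 95% sure that the true mean is between",
      round(ci[1], 2), "and", round(ci[2], 2), "\n")
  invisible(list(mean = obs_mean, ci = ci))
}

# Usage on a small made-up vector:
set.seed(5)
one_mean_demo <- boot_mean_ci_demo(c(1, 2, 3, 4, 5, NA), reps = 2000)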

# Using your CI function above as a start, create
# a NEW function that will find a 95% Bootstrap CI
# for the *Difference Between Two Means*.
# The user will provide two vectors with quantitative
# data (which may have missing values), and
# the function will print the observed means,
# the observed difference, and a CI for the difference
# along with descriptive text: “I’m 95% confident…”

# The procedure:
# First, remove missing values from each vector.
# Next, calculate the means and the observed difference.
# Take a bootstrap sample from each vector, separately,
# and find the difference.
# Repeat many times, and accumulate the differences in
# a vector. Calculate the CI, as we have in other functions.

# Once you have your function, apply it to
# the vectors mcharges and fcharges that you
# created in Part III above.
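
# The procedure above can be sketched as follows. As before,
# the percentile method is an assumption, and the function
# name, the reps argument and its default are hypothetical.
boot_diff_ci_demo <- function(x, y, reps = 10000) {
  # Remove missing values from each vector.
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]

  # Observed means and observed difference.
  mean_x <- mean(x)
  mean_y <- mean(y)
  obs_diff <- mean_x - mean_y

  # Resample each vector separately and store the difference.
  boot_diffs <- numeric(reps)
  for (i in seq_len(reps)) {
    boot_x <- x[sample.int(length(x), replace = TRUE)]
    boot_y <- y[sample.int(length(y), replace = TRUE)]
    boot_diffs[i] <- mean(boot_x) - mean(boot_y)
  }

  ci <- quantile(boot_diffs, c(0.025, 0.975), names = FALSE)

  cat("Observed means:", round(mean_x, 2), "and", round(mean_y, 2), "\n")
  cat("Observed difference:", round(obs_diff, 2), "\n")
  cat("I'm 95% confident that the true difference in means is between",
      round(ci[1], 2), "and", round(ci[2], 2), "\n")
  invisible(list(means = c(mean_x, mean_y), diff = obs_diff, ci = ci))
}

# Usage on two small made-up vectors:
set.seed(6)
diff_demo <- boot_diff_ci_demo(c(10, 12, 14, NA), c(1, 2, 3), reps = 2000)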

#################################################

# Finally, put your code in an Rmd script.
# The script should knit successfully to
# produce a good-looking HTML file
# that shows all of the requested R code,
# results, and text.

# Your html file should begin with text briefly
# describing the data set.
# Your Rmd script should have six code chunks, each
# named as noted above. Each code chunk should have
# text: a title, and description of results,
# where requested above.

# Do include all setup code, and make sure all code chunks
# are printed in the final html document.
# Prevent all of the notifications after loading ggplot2
# and dplyr.
# Prevent the following from showing for plots:
# *Removed 6 rows containing non-finite values (stat_boxplot).*
# (It is OK if this message stays in the ANOVA output:
# *6 observations deleted due to missingness*)
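
# One way to meet the display requirements is to set knitr
# chunk options once, in the first chunk of the Rmd file.
# This is a sketch that assumes the knitr package, which
# R Markdown uses to run the chunks.
knitr::opts_chunk$set(
  echo    = TRUE,   # print the code of every chunk
  message = FALSE,  # hide package start-up notifications
  warning = FALSE   # hide the "Removed 6 rows ..." plot warning
)

# The ANOVA missingness line is part of the printed summary,
# not a message or warning, so these options leave it in place.
# An alternative for the plots alone is geom_boxplot(na.rm = TRUE),
# which removes missing values silently.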