F71SM Statistical Methods Heriot-Watt University Computer Lab 1: Data summaries, plots and functions in R
1. The data in this task refer to the amount of claims (in £1000) arising in a general insurance portfolio. You can find the data in the file claim amounts data.txt on Vision under Learning Materials → Labs.
Read data into R
Save the file on your computer and then read the data into R:
claim.amounts = scan(file=”H:/claim_amounts_data.txt”,skip=2)
Remember that you need to tell R where to look for the file, so you must specify the
full path of the file (here H:/claim amounts data.txt). Note the following:
• If you have saved the file in a different folder on your computer you need to make sure that you know where it is (e.g. on Windows based systems, if the full path is not displayed on the toolbar, use right-click → properties on the file).
• Also, in some cases the use of forward or backward slash for specifying the path is system-specific. If the file cannot be found, try file=”H:\claim amounts data.txt” in the command above.
If you want to keep all your files in the same folder (e.g. for a particular project), it will be convenient to run R in the same folder so that you don’t need to repeatedly specify the folder. You can change the working folder in R at File → Change dir.
Notice also the skip=2 option in the scan() command above. Look at the data file and check ?scan to see what this option does.
If you are used to working with spreadsheets, R has a convenient data editor: try data.entry(claim.amounts) or de(claim.amounts).
Numerical summaries
For summary statistics, check how the commands summary(), mean(), median(), quantile(), var(), sd() etc. work. For example, the 3rd quartile of the claim amounts data can be obtained using
quantile(claim.amounts,prob=0.75)
Now, put the data in ascending order (use sort()) and try to determine which of the
two definitions of the quartiles that we have given in the lectures R uses.
Let’s now compute the coefficient of symmetry β = i (xi −x ̄)3 /n . R doesn’t have a
function to do this automatically, but it is straightforward to write our own code.
First, let’s look at how arrays work in R. The data claim.amounts are stored as a one- dimensional array (vector) in R, with the individual values (elements) being indexed from 1 to 200. For example, you can access (and see) the value of the 5th amount
using claim.amounts[5], or the last 10 values using claim.amounts[191:200]. Note that the indexing corresponds (by default) to the order by which we entered the data. Now, for computing β1 we can exploit R’s array operations. Try:
sum.dev3 = sum( (claim.amounts – mean(claim.amounts))^3 )
beta1 = (sum.dev3/length(claim.amounts))/(sd(claim.amounts))^3
cat(“This gives:”, “\n”, “beta_1 =”, beta1, “\n”)
Note that R is very good for working with arrays of data – see how we computed the value of i(xi −x ̄)3 in the first line of the code above. The second line then computes {i(xi − x ̄)3/n}/s3. Finally, the third line prints the answer on the screen; note that “\n” starts a new line for printing.
Let’s now compute the sum of the cubed deviations for β1 with the use of a for loop:
new.sum.dev3 = 0
dev3 = rep(0,length(claim.amounts))
for (i in 1:length(claim.amounts)){
# initialise variable to hold the sum
# initialise array of cubed deviations
# start loop
dev3[i] = (claim.amounts[i] – mean(claim.amounts))^3
new.sum.dev3 = new.sum.dev3 + dev3[i]
print(new.sum.dev3)
Again, make sure you understand how this works!
# compute sum
# iteratively
Check if the answer is the same as with sum.dev3 above. Using loops will most times be slower than using array operations, but it often provides clearer understanding – and in some cases it might be the easier option for coding. Notice in the code above, that it is important to make sure that the variable new.sum.dev3 is equal to zero before the loop starts (line 1 of the code), and that dev3 has to be ‘defined’ as an array outside the loop (line 2 of code).
Graphical summaries
For producing various summary plots, see the use of commands hist(), stripchart(), boxplot() etc. on the tutorial sheets. Here, as an example, we consider how to show two plots on the same graph or scale (e.g. when we want to compare two data sets). Let’s assume that the claim amounts data can be divided to two different groups (e.g. for gender of policyholder) and we want separate boxplots per group on the same graph. We will first ‘assign’ each claim to group ‘1’ or ‘2’ (here completely at random by alternating group membership):
group.ind = rep(c(1:2),100); group.ind
group.1 = claim.amounts[group.ind == 1]
group.2 = claim.amounts[group.ind == 2]
# create index by
# repeating (1,2) 100 times
# choose amounts with index = 1
# same for index = 2
We are now ready to draw the boxplots:
boxplot(group.1,group.2,horizontal=T,col=”orange”,names=c(“Group 1″,”Group 2″))
title(main=”Boxplots of claim amounts per group”,xlab=”Amount (1000s pounds)”)
Note that an alternative way to do this would be:
boxplot(claim.amounts ~ group.ind,horizontal=T,col=”orange”,
names=c(“Group 1″,”Group 2″))
but the first command is more useful when we have two separate data sets.
We can also draw histograms for the two groups side-by-side. For comparison purposes we need to ensure that the two plots are drawn on the same scale (for the x and y axes). Try the following:
par(mfrow=c(1,2)) # Specifies layout for plotting window
amount.breaks = seq(0.9,2.3,by=0.1)
hist(group.1, breaks=amount.breaks, right=FALSE, col=”gray”, xlim=c(0.9,2.3),
ylim=c(0,25), main=”Claim amounts: Group 1″, xlab=”Amount (1000)”)
hist(group.2, breaks=amount.breaks, right=FALSE, col=”gray”, xlim=c(0.9,2.3),
ylim=c(0,25), main=”Claim amounts: Group 2″, xlab=”Amount (1000)”)
Notice how we have determined the cells for the histograms (both range and whether they are right-closed), and controlled the range of the axes.
It is important to know how you can export your graphs for use in other programs (e.g. when you are preparing reports). It is straightforward to copy and paste the graph on an editor – e.g. right-click on the R graphics window and choose Copy as. However, it is often useful to save the graph for later use. To save your graph as a PDF file, make the graphics window active (by clicking on it) and then choose File → Save as. Try to save your boxplot graph and then open it on your computer (make sure you remember where you saved the file!). Later you may also want to check the postscript() command which gives you more control over the saving options – e.g. size of the graph.
2. R is a programming environment and you can use it to write functions that can perform complex mathematical and statistical operations. Functions are a very useful and portable tool, especially for computations that you use repeatedly or when you want to write a longer piece of R code that can be broken down to smaller ‘tasks’. Here we will consider a simple task – writing a function for computing the symmetry coefficient β1 which was also computed before without a user defined function.
The code for defining the function is given below. As with the for(…) command in task 1 (which is a system-defined function itself), note that the body of the function (included in curly brackets) must be entered in R as a block of commands.
beta1.fn <- function(x){ # name and input of function sum.dev3 = sum( (x - mean(x))^3 ) # see Lab 1 ... b1 = (sum.dev3/length(x))/(sd(x))^3 cat("\n","Coefficient of symmetry for these data is: ", "beta_1 =", b1, "\n\n") Note that the input to this function is the array x, which is the data to be used. Also recall the use of the cat(...) command from before (use ?cat for help). Load the claim.amounts data on your R session again (use the scan(...) command as before if required), and call the beta1.fn function using the claim amounts as its argument: beta1.fn(claim.amounts) Check that you get the same answer as in Task 1. Now change your data, for example by using claim.amounts.new = 0.6 + claim.amounts or claim.amounts.new = 1.3 * claim.amounts and call the beta1.fn function again to compute the new value of β1. Comment on how the symmetry coefficient has changed with the new data.