____________________Name
_______________________Name
Fall 2018
1. (10 pts) With multi-category data, we often have to choose the type of correlation matrix
we generate to describe relationships. In a British Journal of Mathematical and Statistical
Psychology article, Conor Dolan indicated that a critical cutoff for determining whether a Pearson
correlation is appropriate occurs when you have a variable with around 5 ordinal categories
(Dolan, 1994). Given that, I would like you to consider the data set factdata.xls, which includes
data in which students answered questions related to food preferences using four different
scaling approaches (true/false, Likert from Strongly disagree to Strongly agree, semantic
differential, Likert from Disagree to Agree). If you’re having trouble reading in an Excel
spreadsheet, the first Likert scale necessary for this question is in the raw data file, fLikert.dat.
The header includes the variable names. I’ve included the original questionnaire so you can see
what the questions looked like.
For the first type of Likert variables flkrt1 to flkrt20 (on the second page of the questionnaire), I
would like you to pick variables from one of the following four subgroups:
Seafood I: flkrt1-flkrt5
Fast food: flkrt6-flkrt10
Challenging food: flkrt11-flkrt15
Seafood II: flkrt16-flkrt20
and create a function to generate descriptive statistics appropriate to an interval level variable:
mean and standard deviation, and statistics appropriate to an ordinal level variable: median,
minimum, maximum, and range, along with the N.
I would like you to put these statistics into an object similar to descripstat2() in the program
scndprog.cowdata.R. Make sure to return a matrix and label the dimensions of the matrix
appropriately. I’m including the formula for the median and some other statistics we will need
later in the semester in a file called add.stats.R. When calculating the median, I want you to
use this computational formula, not the median() function. The same for the other descriptive
statistics. Please use computational formulas, not the functions that are preprogrammed.
Compare your results to describe() in the “psych” package.
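As a rough illustration of the kind of function described above, here is a minimal sketch under stated assumptions. The name descrip_sketch and the simulated responses are hypothetical, and the median rule used (the middle sorted value, or the average of the two middle values) is only a stand-in for whatever formula add.stats.R supplies. It is not descripstat2(), which is not reproduced here.

```r
# Hypothetical helper: descriptive statistics from computational formulas.
# The median rule below is an assumption; substitute the formula from add.stats.R.
descrip_sketch <- function(x) {
  stats <- matrix(NA, nrow = ncol(x), ncol = 7,
                  dimnames = list(colnames(x),
                                  c("N", "Mean", "SD", "Median",
                                    "Min", "Max", "Range")))
  for (j in seq_len(ncol(x))) {
    v <- sort(x[!is.na(x[, j]), j])          # drop missing values, then order
    n <- length(v)
    m <- sum(v) / n                          # mean
    s <- sqrt(sum((v - m)^2) / (n - 1))      # sample standard deviation
    med <- if (n %% 2 == 1) v[(n + 1) / 2] else (v[n / 2] + v[n / 2 + 1]) / 2
    stats[j, ] <- c(n, m, s, med, v[1], v[n], v[n] - v[1])
  }
  stats
}

# Simulated 5-point Likert responses standing in for flkrt1-flkrt5
set.seed(1)
fake <- matrix(sample(1:5, 27 * 5, replace = TRUE), ncol = 5,
               dimnames = list(NULL, paste0("flkrt", 1:5)))
out <- descrip_sketch(fake)
out
```

The rows are labeled by variable and the columns by statistic, so the result can be compared column by column with the output of describe().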
Now, also in R, I would like you to create a table including three kinds of correlations: Pearson,
Spearman, and Kendall correlations. You can create this table by stacking the correlation
matrices. Once you have all of the correlations in a single table, you will have to rename the
dimensions (rows and columns) to let the reader know what is what.
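One possible way to stack the three matrices is sketched below. The data are simulated stand-ins for one flkrt subgroup, and the object names (items, cor_table) are hypothetical. The sketch relies on the method argument of cor() and on rbind().

```r
set.seed(2)
# Simulated 5-point items standing in for one flkrt subgroup (an assumption)
items <- as.data.frame(matrix(sample(1:5, 27 * 5, replace = TRUE), ncol = 5))
names(items) <- paste0("flkrt", 1:5)

methods <- c("pearson", "spearman", "kendall")

# One correlation matrix per method, stacked row-wise into a single table
cor_table <- do.call(rbind,
                     lapply(methods, function(m) cor(items, method = m)))

# Relabel the rows so each block names its method and its variable
rownames(cor_table) <- paste(rep(methods, each = ncol(items)),
                             names(items), sep = ".")
round(cor_table, 2)
```

Each block of five rows holds one type of correlation, so the same pair of items can be read off three times and compared directly.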
The difference between the Spearman and Kendall coefficients involves assumptions regarding
the underlying distributions of the variables. Spearman ρ assumes that the ranks are interval
scales, while Kendall τ relies only on the ordering of the observations. So, does it matter here? Are the correlation coefficients different?
What about the central tendency, does the median differ from the mean? What do you
conclude?
Dolan, C. V. (1994). Factor analysis of variables with 2, 3, 5 and 7 response categories: A
comparison of categorical variable estimators using simulated data. British Journal of
Mathematical and Statistical Psychology, 47, 309-326.
2. (6 pts) I would like you to write a program for intensive regression to simultaneously
regress dep4 of the portroy data onto dep1, dep2, and dep3. You can pull numbers representing
regression weights from a random uniform distribution. Make sure to vary each coefficient
between -1 and +1. Remember that you can place restrictions on the coefficients within the
runif() function. Use the lm() function to check your results. Note that if you are using three
variables to predict a fourth, the regression function would be:
lm(dep4 ~ dep1 + dep2 + dep3)
The standardized regression would be:
lm(scale(dep4) ~ scale(dep1) + scale(dep2) + scale(dep3))
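The sketch below shows one reading of the approach, and that reading is an assumption: draw many candidate sets of weights from runif(), keep the set with the smallest sum of squared residuals, and compare it with lm(). The portroy data are not reproduced here, so the variables are simulated, and the number of draws (50000) is an arbitrary choice. The variables are standardized so that the range -1 to +1 is a sensible search range and the intercept is zero.

```r
set.seed(3)
# Simulated stand-in for the portroy data (the real file is not reproduced here)
n <- 100
dep1 <- rnorm(n); dep2 <- rnorm(n); dep3 <- rnorm(n)
dep4 <- 0.5 * dep1 - 0.3 * dep2 + 0.2 * dep3 + rnorm(n)

# Standardize predictors and outcome so the intercept is zero
z <- scale(cbind(dep1, dep2, dep3))
y <- as.vector(scale(dep4))

n_draws <- 50000            # arbitrary number of random candidates
best_sse <- Inf
best_b <- rep(NA, 3)
for (i in seq_len(n_draws)) {
  b <- runif(3, min = -1, max = 1)   # candidate weights restricted to [-1, 1]
  sse <- sum((y - z %*% b)^2)        # sum of squared residuals for this draw
  if (sse < best_sse) {              # keep the best candidate so far
    best_sse <- sse
    best_b <- b
  }
}
names(best_b) <- c("dep1", "dep2", "dep3")

# Check against the standardized regression (intercept dropped from the output)
fit <- lm(scale(dep4) ~ scale(dep1) + scale(dep2) + scale(dep3))
round(rbind(random_search = best_b, lm = coef(fit)[-1]), 3)
```

Because the search is random, the weights only approximate the lm() solution. More draws, or a narrower search range around the current best, would tighten the match.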
3. (6 pts) For the 20 flkrt items, either factdata.xls or fLikert.dat, take the wide data set and
create a long data set creating the dependent variable flkrt. There will be 20 observations per
person, on four types of food: seafood_1, fast_food, challenging_food, and seafood_2.
Create a new indexing variable that identifies which type of food each item is measuring.
So, the long data set should have
id food_type flkrt
This requires one trick that we didn’t discuss in class. The varying variables are going to be all
20 items. Pick a good name for v.names. I picked flkrt. I called the timevar variable food_type.
Rather than times, you want to provide the levels (values) for the food_type variable. There are
20 of them with 5 of each type. You can create that variable using
times = c(rep(1,5),rep(2,5),rep(3,5),rep(4,5)),
Once you create this long data set, print it. You will see that the data are sorted by food type and
not by id. While the data are sorted by food_type, calculate the mean for each food type
pooling across items of that food type and individuals. You can use the indexing ability in R
and the mean() function to calculate the mean of the first 135 lines (food_type 1), the next 135
lines (food_type 2), etc., to get the means for the four different food types. What are the means?
Which one is smallest, which one is largest?
Finally, sort the data by individual id, making all of each individual’s data contiguous. Make
sure to include your R output to show that you have done all of this successfully.
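The steps in this question can be sketched as follows, using simulated data in place of factdata.xls or fLikert.dat. The object names (wide, long, long_sorted) are hypothetical. The 27 respondents are an assumption chosen so that each food type occupies 135 lines. The new.row.names argument is an optional addition that keeps the row labels unique when the times values repeat.

```r
set.seed(4)
n_id <- 27   # assumption: 27 respondents x 5 items = 135 lines per food type
wide <- data.frame(id = 1:n_id,
                   matrix(sample(1:5, n_id * 20, replace = TRUE), ncol = 20))
names(wide)[-1] <- paste0("flkrt", 1:20)

# Wide to long: all 20 items become one variable, indexed by food_type
long <- reshape(wide,
                direction = "long",
                varying = paste0("flkrt", 1:20),
                v.names = "flkrt",
                timevar = "food_type",
                times = c(rep(1, 5), rep(2, 5), rep(3, 5), rep(4, 5)),
                idvar = "id",
                new.row.names = seq_len(n_id * 20))
head(long)

# Means by food type using row indexing: each type fills one block of rows
block <- n_id * 5
food_means <- sapply(1:4, function(k) {
  mean(long$flkrt[((k - 1) * block + 1):(k * block)])
})
names(food_means) <- c("seafood_1", "fast_food", "challenging_food", "seafood_2")
food_means

# Sort by id so that each individual's 20 rows are contiguous
long_sorted <- long[order(long$id), ]
head(long_sorted, 20)
```

With simulated responses the four means carry no substantive meaning. The point is only the mechanics of reshaping, indexing, and sorting.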
4. (5 pts) I would like you to write a function that will calculate a running sum for the two
series given below. You will initialize each sum at the value of the first number in each series, then start the
loop counter for each loop at 2 [Note: that is a hint]. If you look at the two series, you will see
that they diverge. One simple way to show that they diverge is to calculate the difference
between the two series and show that the difference increases. I would like you to write a
function that returns the running sum for each series (the string of sums, not just the final sum),
and the running set of differences. You can then look at the differences and see that they
increase. Use the following strings:
First: 1 2 3 5 4 3 6 4 3 5 7 7 9 8
Second: 2 4 5 8 7 10 10 11 11 14 17 18 21 24
Make sure to subtract the first string from the second, so that the differences are positive.
There are many different ways to do this. Any one that gets the correct answer is ok.
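One of those ways is sketched below, as an illustration rather than the expected answer. The function name running_sums is hypothetical, and the sketch takes the "running set of differences" to mean the second running sum minus the first, which is an assumption about the intended reading.

```r
# Hypothetical function: running sums for two series and their differences
running_sums <- function(first, second) {
  n <- length(first)
  sum1 <- numeric(n)
  sum2 <- numeric(n)
  sum1[1] <- first[1]      # initialize each sum at the first value
  sum2[1] <- second[1]
  for (i in 2:n) {         # loop counter starts at 2
    sum1[i] <- sum1[i - 1] + first[i]
    sum2[i] <- sum2[i - 1] + second[i]
  }
  # Second minus first, so that the differences are positive
  list(first = sum1, second = sum2, difference = sum2 - sum1)
}

first  <- c(1, 2, 3, 5, 4, 3, 6, 4, 3, 5, 7, 7, 9, 8)
second <- c(2, 4, 5, 8, 7, 10, 10, 11, 11, 14, 17, 18, 21, 24)
res <- running_sums(first, second)
res
```

The function returns the full string of running sums for each series along with the running differences, so divergence can be checked by looking at whether res$difference keeps growing.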