/*TOPICS for September 20, 2021: CONVERTING DATA FROM ONE FORMAT TO ANOTHER –
WIDE TO LONG, AND LONG TO WIDE, COMBINING DATA SETS, CONVERTING VARIABLE TYPES*/
%let pathname=C:\BIS679A\Lecture3and4\Data_9_14_2020\;
/*What is the difference between long and wide format?
“One reason for setting up the data in one format or the other is simply that different analyses in SAS
require different set ups. For example, for many types of software, the wide format is required for MANOVA
and repeated measures procedures.
Many data manipulations are much, much easier as well when data are in the wide format.
Likewise, mixed models and many survival analysis procedures require data to be in the long format.
Beyond software requirements, each approach has analytical implications.
For example, in the wide format, the unit of analysis is the subject–(the county)–whereas in the long format,
the unit of analysis is each measurement occasion for each county.
The practical difference is that when the occasion is the unit of analysis, you can use each decade’s college education
rate as a covariate for the same decade’s Jobs value. In the wide format, when the unit of observation is the county,
there is no way to do this. You can use any of the college rates as covariates for all years, but you can’t have decade-specific
covariates.
Another implication is that in the wide format, those repeated outcomes are considered different and non-interchangeable variables. Each can have its own distribution. Each is distinct. This makes sense in the county example where each observation occurred in the same four years for every county. But if each county had been measured a different number of times, or measured in different years, this set up doesn’t make a lot of sense.
So it’s important to think about the implications before you enter data.”
Reference: http://www.theanalysisfactor.com/wide-and-long-data/
*/
/*CONVERTING DATA FROM SINGLE RECORD PER INDIVIDUAL TO MULTIPLE RECORDS PER INDIVIDUALS (LONGITUDINAL)*/
*use proc transpose to convert between long and wide (will do later today and show
how used in longitudinal data analyses;
*can also use the following instructions to do;
proc import datafile=”&pathname.MouseExperiments1.xlsx” out=exp2 replace;
sheet=”Exp2$”;
run;
*Note that the observations (mice) are different in the two files (sheets)-i.e. two separate
experiments were run-the outcome here is Paw Swelling (ps);
proc import datafile=”&pathname.MouseExperiments1.xlsx” out=exp3 replace;
sheet=”Exp3$”;
run;
/*NOTE: YOU CANNOT USE AN ARRAY ON DIFFERENT DATA TYPES – so want to make sure that all
the variables in the array will be of the same type-for example if studyid is a number in one file
and a character in another file*/
title “Contents of the Mouse Data – Sheet 2”;
proc contents data=exp2;
run;
title “Contents of the Mouse Data – Sheet 3”;
proc contents data=exp3;
run;
*See any problems here? How to fix?;
*We want to combine the two experiments worth of data (CONCATENATE THE DATA SETS);
*Create a variable to distinguish the data set;
data ex2; set exp2;
experiment=2;
run;
data ex3; set exp3;
experiment=3;
run;
/* USE ARRAYS TO CONVERT THE DATA FROM A SINGLE RECORD (WIDE FORMAT) TO MULTIPLE RECORDS
PER INDIVIDUAL (LONG FORMAT) */
/* Going to use an array and a do loop here*/
/*The use of arrays allows us to simplify our processing. We can use arrays to help read
and analyze repetitive data with a minimum of coding. An array and a loop make the program smaller.
For example, suppose we have a file where each record contains
24 values with the temperatures for each hour of the day. These temperatures are in Fahrenheit
and we need to convert them to 24 Celsius values. Without arrays we need to repeat the same calculation for all 24 temperature
variables*/
/*Question
When reading in array below which terminology is correct: temp1-temp24, temp1-24, or temp1_temp24?
*/
data temp;
input ctemp1-ctemp24; cards;
5 5 7 4 8 5 3 3 5 5 6 8 9 9 8 11 11 12 6 5 3 4 7 6
;
Ftemp1 = 9/5*(ctemp1+32);
Ftemp2 = 9/5*(ctemp2+32);
. . .
Ftemp24 = 9/5*(cemp24+32);
run;
proc print; run;
/*An alternative is to define arrays and use a loop to process the calculation for all variables*/
data temp;
input ctemp1-ctemp24; cards;
5 5 7 4 8 5 3 3 5 5 6 8 9 9 8 11 11 12 6 5 3 4 7 6
; run;
data temp1; set temp;
array Ctemp [24] ctemp1-ctemp24;
array Ftemp [24] Ftemp1-Ftemp24;
do i = 1 to 24;
Ftemp[i] = 5/9*(Ctemp[i]+32);
end;
run;
proc print data=temp1; run;
/*can also create multi-dimentional arrays or matrices
i.e. the command “array sale_array {3, 12} sales1-sales12 exp1-exp12 comm1-comm12;”
will give a 3X12 matrix*/
/* So back to the mice*/
data experiments2;
set ex2 ex3; /*how many observations in this new data set? how many variables*/
array pst[65] pst0-pst64;
do i=1 to 65;
ps=pst[i];
time=i-1;
output;
end;
drop i pst0-pst64;
label ps=”Paw Swelling”;
run;
proc print; run;
*Remove those data points where paw swelling is missing;
data experiments2a; set experiments2;
if ps = . then delete;
run;
proc print; run;
*What if we wanted the mean by time and treatment group;
/*Before we can do anything BY a variable, we must sort the variable*/
proc sort data=experiments2a;
by group time;
run;
title ‘Mean of paw swelling by time and group’;
proc means data=experiments2a; *Noprint keeps the output from being printed;
by group time ;
var ps;
output out=means mean=meantime;
run;
*Print the means data set;
title “Mean of paw swelling by time and group”;
proc print data=means;
run;
*What are the additional variables in the data set?;
*https://support.sas.com/resources/papers/proceedings/proceedings/sugi23/Coders/p73.pdf;
*”The variables _FREQ_ and _TYPE_ are
automatically added to the summary data set when
the OUTPUT statement is used. _FREQ_ is the count
of the number of observations available for use and
_TYPE_ is a numeric flag which indicates the
subgroup of the CLASS variables summarized by that
observation in the output data set. When no CLASS
statement is present the resulting data set will have
one observation and _TYPE _ will equal 0.”;
/*NOTE: YOU CANNOT USE AN ARRAY ON DIFFERENT DATA TYPES*/
proc import datafile=”&pathname.MouseExperiments.xlsx” out=exp2 replace;
sheet=”Exp2$”;
run;
proc import datafile=”&pathname.MouseExperiments.xlsx” out=exp3 replace;
sheet=”Exp3$”;
run;
*We want to combine the two experiments worth of data (CONCATENATE THE DATA SETS);
data exp2; set exp2;
experiment=2;
run;
data exp3; set exp3;
experiment=3;
run;
*What is the issue? How can we solve this issue?;
*look at for example pst3;
data experiments; set exp2 exp3;
run;
title “Contents of the experiment data set”;
proc contents data=exp2;
run;
proc contents data=exp3;
run;
*CONVERTING VARIABLE TYPES:
http://blogs.sas.com/content/sgf/2015/05/01/converting-variable-types-do-i-use-put-or-input/;
*”PUT() always creates character variables
INPUT() can create character or numeric variables based on the informat
The source format must match the source variable type in PUT()
The source variable type for INPUT() must always be character variables”;
*(1) convert the character variables to numeric variables;
data exp2a; set exp2;
v1=input(pst14,1.);
v2=input(pst34,1.);
v3=input(pst37,1.);
v4=input(pst53,1.);
drop pst14 pst34 pst37 pst53;
rename v1=pst14 v2=pst34 v3=pst37 v4=pst53;
run;
*(2) Use it in a numeric context as long as it only includes numbers;
data exp2b; set exp2;
v1=pst14*1;
v2=pst34*1;
v3=pst37*1;
v4=pst53*1;
drop pst14 pst34 pst37 pst53;
rename v1=pst14 v2=pst34 v3=pst37 v4=pst53;
run;
data exp3a; set exp3;
v1=input(pst3,1.);
v2=input(pst10,1.);
v3=input(pst13,1.);
v4=input(pst31,1.);
v5=input(pst35,1.);
v6=input(pst38,1.);
v7=input(pst52,1.);
v8=input(pst57,1.);
v9=input(pst60,1.);
v10=input(pst64,1.);
drop pst3 pst10 pst13 pst31 pst35 pst38 pst52 pst57 pst60 pst64;
rename v1=pst3 v2=pst10 v3=pst13 v4=pst31 v5=pst35 v6=pst38 v7=pst52 v8=pst57 v9=pst60 v10=pst64;
run;
data experiments; set exp2a exp3a;
run;
*How does the data get combined? What might be some issues with this;
title “Combined experiment data set”;
proc print data=experiments;
run;
*Reorder the variables in the data set;
*Retain causes a variable that is created by an INPUT or assignment statement to retain its value
from one iteration of the DATA step to the next.;
data experiments1;
retain tagid group experiment pst0-pst13 pst14 pst15-pst33 pst34
pst35 pst36 pst37 pst38-pst52 pst53 pst54-pst64;
set exp2a exp3a;
run;
proc print data=experiments1;
run;
*What is wrong with this code?;
data experiments2;
set experiments1;
array pst[65] pst0-pst64;*hint-look at the variable types going into the array;
do i=1 to 65;
ps=pst[i];
time=i-1;
output;
end;
drop pst0-pst64;
run;
proc contents data=experiments1; run;
*How could we fix this?;
/* TOPIC: MERGING DATA SETS
MUST BE CAREFUL WHEN MERGING DATA SETS, ESPECIALLY IF DATA SETS SHARE VARIABLES
(EXCEPT FOR BY VARIABLE(S)).
THE SECOND DATA SET WILL OVERWRITE (MASTER) ANY VARIABLES HAVING THE SAME NAME
IN THE FIRST DATA SET-again NEVER EVER OVERWRITE ORIGINAL DATA SET!!!!!!!
*NOTE: BOTH DATA SETS NEED TO BE SORTED BASED ON THE BY VARIABLE – OTHERWISE, CANNOT MERGE.
*/
data means; set means; run;
proc sort; by group time; run;
data experiments2a; set experiments2a; run;
proc sort; by group time; run;
data experiments3; merge means experiments2a;
by group time;
run;
title “Merged data set – subset”;
proc print data=experiments3(obs=30);
run;
/* IF YOU WANT ONE FILE TO FILL IN INFORMATION IN ANOTHER FILE, USE UPDATE INSTEAD OF MERGE.
IN THIS CASE, THE FIRST FILE IS THE MASTER FILE, THE SECOND FILE IS THE TRANSACTION FILE.
YOU CAN ONLY HAVE TWO DATA SETS HERE
MISSING VALUES IN THE TRANSACTION DATA SET DO NOT OVERWRITE EXISTING VALUES IN THE MASTER DATA SET
*/
data experiments4; update means experiments2a;
by group time;
run;
title “Updated data set”;
proc print data=experiments4(obs=30);
run;
/*CONVERTING DATA FROM LONG FORMAT (REPEATED OBSERVATIONS) TO WIDE FORMAT (SINGLE RECORD)*/
*Interested in difference between total scores for grades 3, 4 and 5;
proc import datafile=”&pathname.Psychiatry.xlsx” out=psych replace;
sheet=”Data Setup$”;
label V=”Vocabulary”;
label R=”Reading Comprehension”;
label L_1=”Spelling”;
label L_2=”Capitalization”;
label L_3=”Punctuation”;
label L_4=”Usage and Expression”;
label LT=”Language Total”;
label W_1=”Maps”;
label W_2=”Graphs”;
label W_3=”References”;
label WT=”Work Study Total”;
label M_1=”Math Concepts”;
label M_2=”Math Problem Solving”;
label MT=”Math Total”;
label CC=”Complete Composite”;
run;
title “Contents of the Psychiatry Data Set”;
proc contents data=psych;
run;
title “Psych data set”;
proc print data=psych;
run;
*Split the data into two separate data sets by scoretype;
data ige ipr; set psych;
if scoretype=”IGE” then output ige;
else output ipr;
run;
title “IGE data set”;
proc print data=ige;
run;
title “IPR data set”;
proc print data=ipr;
run;
*There are several ways that we can convert the data from long to wide format.
*ONE WAY – using IGE data set – using proc transpose;
*https://stats.idre.ucla.edu/sas/modules/how-to-reshape-data-long-to-wide-using-proc-transpose/;
*Converting the data from long format to wide format;
proc transpose data=ige out=ige1 prefix=lt;
by id;
id grade;
var lt;
run;
title “Converted IGE data set”;
proc print data=ige1;
run;
*How do we fix the errors?;
proc sort data=ige;
by id grade;
run;
proc transpose data=ige out=ige1 prefix=lt;
by id;
id grade;
var lt;
run;
*What is placed into the ige1 data set?;
*How are the variables created?;
title “Converted IGE data set”;
proc print data=ige1;
run;
*Transposing multiple variables using proc transpose?;
*Do we need to sort the data again?;
proc transpose data=ige out=ige1 prefix=lt;
by id;
id grade;
var lt;
run;
proc print; run;
proc transpose data=ige out=ige2 prefix=wt;
by id;
id grade;
var wt;
run;
proc print; run;
proc transpose data=ige out=ige3 prefix=mt;
by id;
id grade;
var mt;
run;
proc print; run;
*Combine the three data sets;
data full_ige; merge ige1(drop=_name_) ige2(drop=_name_) ige3(drop=_name_);
by id;
run;
title “Fully transformed IGE data”;
proc print data=full_ige;
run;
*Drop the label as well;
data full_ige1; merge ige1(drop=_name_ _label_) ige2(drop=_name_ _label_) ige3(drop=_name_ _label_);
by id;
run;
title “Fully transformed IGE data”;
proc print data=full_ige1;
run;
*ANOTHER WAY – using IPR data set – using arrays;
*Using arrays to create the same data set for igf;
proc sort data=ipr;
by id grade;
run;
*Introduction of the do statement;
data full_ipr (keep = id scoretype lt3-lt5 wt3-wt5 mt3-mt5); set ipr;
by ID grade ;
retain i lt3-lt5 wt3-wt5 mt3-mt5; *Keeps variable value from one iteration to the next.;
array a {*} lt3-lt5;
array b {*} wt3-wt5;
array c {*} mt3-mt5;
if first.id then i=0;
i+1;
if grade=3 then do;
a{1}=lt;
b{1}=wt;
c{1}=mt;
end;
if grade=4 then do;
a{2}=lt;
b{2}=wt;
c{2}=mt;
end;
if grade=5 then do;
a{3}=lt;
b{3}=wt;
c{3}=mt;
end;
if last.id then output;
run;
title “The complete IPR data set”;
proc print data=full_ipr;
run;
*How can we make the above code more efficient?;
*One possible way;
data full_ipr1 (keep = id scoretype lt3-lt5 wt3-wt5 mt3-mt5); set ipr;
by ID grade ;
retain i lt3-lt5 wt3-wt5 mt3-mt5; *Keeps variable value from one iteration to the next.;
array a {*} lt3-lt5;
array b {*} wt3-wt5;
array c {*} mt3-mt5;
if first.id then i=3;
if grade=i then do;
a{i-2}=lt;
b{i-2}=wt;
c{i-2}=mt;
end;
i+1;
if last.id then output;
run;
title “Full IPR data set”;
proc print data=full_ipr1;
run;
*Getting the full data set back together;
data full; set full_ige full_ipr;
run;
title “Full data set”;
proc print data=full;
run;
*If want to correct the scoretype and to drop the label;
data full1; set full;
if scoretype=” ” then scoretype=”IGE”;
drop _LABEL_;
run;
title “Edited full data set”;
proc print data=full1;
run;
/* TOPIC: USING THE IN FUNCTION: (1) Create Variables; (2) TO HELP SUBSET DATA*/
*A data frame with 50 observations on 4 variables.
Murder numeric Murder arrests (per 100,000)
Assault numeric Assault arrests (per 100,000)
UrbanPop numeric Percent urban population
Rape numeric Rape arrests (per 100,000);
proc import datafile=”&pathname.UsArrests.csv” DBMS=CSV out=arrests replace;
run;
title ‘First observations of the Arrest Data set’;
proc print data=arrests(obs=5);
run;
*Use the IN function to subset to the tri-state area – instead of a string of if/else statements;
data arrests1; set arrests;
if state IN(” “, “Connecticut”, ” “) then tristate=1; else tristate=0;
run;
*Create a subset of the data set using IN function;
data arrests2; set arrests;
if state IN(“Oregon”, “Washington”, “California”, ‘Hawaii’, “Alaska”);
run;
title “Pacific Coast Subset”;
proc print data=arrests2;
run;
*TOPIC: WHERE STATEMENT;
title “Tri-State data”;
proc print data=arrests1;
where tristate=1;
run;