Exploring and Transforming Data
Exploring and Transforming Data
Dr. Randall R. Rojas
Note: To access your textbook resources type the following on the console:
#library(car)
#carWeb()
I. Univariate Characterizations
In a regression setting we first start by looking at a univariate characterization of the data. This entails
looking at the respective statistical summaries, features, and distributions of all the variables we have.
Among the popular univariate graphical tools, we will consider histograms, density plots, qqplots, and
boxplots.
I.0 Statistical Summary
Looking at the summary statistics of our data allows to identify any potential issues we may encounter when
building our regression model. For example, there may NAs, values outside the expected range, etc. Below
are two examples using the pastecs and psych libraries.
library(pastecs)
##
## Attaching package: ‘pastecs’
## The following object is masked from ‘package:tidyr’:
##
## extract
## The following objects are masked from ‘package:xts’:
##
## first, last
## The following objects are masked from ‘package:dplyr’:
##
## first, last
# Summary for 1 variable:
stat.desc(rnorm(100),norm=TRUE)
## nbr.val nbr.null nbr.na min max
## 100.000000000 0.000000000 0.000000000 -2.270031399 3.158588298
## range sum median mean SE.mean
## 5.428619697 11.061751029 -0.009273364 0.110617510 0.096806935
## CI.mean.0.95 var std.dev coef.var skewness
## 0.192085961 0.937158263 0.968069348 8.751501870 0.375694605
## skew.2SE kurtosis kurt.2SE normtest.W normtest.p
## 0.778223027 0.009654453 0.010091809 0.984344744 0.284579109
# Summary for an entire datacube:
library(car)
## Loading required package: carData
1
##
## Attaching package: ‘car’
## The following object is masked from ‘package:dplyr’:
##
## recode
attach(Prestige)
## The following object is masked from package:datasets:
##
## women
stat.desc(Prestige, norm=TRUE)
## education income women prestige
## nbr.val 1.020000e+02 1.020000e+02 1.020000e+02 102.0000000
## nbr.null 0.000000e+00 0.000000e+00 5.000000e+00 0.0000000
## nbr.na 0.000000e+00 0.000000e+00 0.000000e+00 0.0000000
## min 6.380000e+00 6.110000e+02 0.000000e+00 14.8000000
## max 1.597000e+01 2.587900e+04 9.751000e+01 87.2000000
## range 9.590000e+00 2.526800e+04 9.751000e+01 72.4000000
## sum 1.095280e+03 6.933860e+05 2.955860e+03 4777.0000000
## median 1.054000e+01 5.930500e+03 1.360000e+01 43.6000000
## mean 1.073804e+01 6.797902e+03 2.897902e+01 46.8333333
## SE.mean 2.701562e-01 4.204089e+02 3.141236e+00 1.7034979
## CI.mean.0.95 5.359173e-01 8.339783e+02 6.231368e+00 3.3792816
## var 7.444408e+00 1.802786e+07 1.006471e+03 295.9943234
## std.dev 2.728444e+00 4.245922e+03 3.172493e+01 17.2044856
## coef.var 2.540915e-01 6.245930e-01 1.094755e+00 0.3673556
## skewness 3.247507e-01 2.129019e+00 8.987765e-01 0.3287011
## skew.2SE 6.791988e-01 4.452731e+00 1.879744e+00 0.6874610
## kurtosis -1.028405e+00 6.287745e+00 -6.757897e-01 -0.7925793
## kurt.2SE -1.085203e+00 6.635016e+00 -7.131135e-01 -0.8363533
## normtest.W 9.495767e-01 8.150513e-01 8.157943e-01 0.9719786
## normtest.p 6.772830e-04 5.633526e-10 5.956847e-10 0.0287498
## census type
## nbr.val 1.020000e+02 NA
## nbr.null 0.000000e+00 NA
## nbr.na 0.000000e+00 NA
## min 1.113000e+03 NA
## max 9.517000e+03 NA
## range 8.404000e+03 NA
## sum 5.509810e+05 NA
## median 5.135000e+03 NA
## mean 5.401775e+03 NA
## SE.mean 2.618934e+02 NA
## CI.mean.0.95 5.195260e+02 NA
## var 6.995989e+06 NA
## std.dev 2.644993e+03 NA
## coef.var 4.896527e-01 NA
## skewness 1.116090e-01 NA
## skew.2SE 2.334243e-01 NA
## kurtosis -1.490291e+00 NA
## kurt.2SE -1.572599e+00 NA
## normtest.W 9.008953e-01 NA
2
## normtest.p 1.270617e-06 NA
library(psych)
##
## Attaching package: ‘psych’
## The following object is masked from ‘Prestige’:
##
## income
## The following object is masked from ‘package:car’:
##
## logit
## The following objects are masked from ‘package:scales’:
##
## alpha, rescale
describe(Prestige)
## vars n mean sd median trimmed mad min
## education 1 102 10.74 2.73 10.54 10.63 3.15 6.38
## income 2 102 6797.90 4245.92 5930.50 6161.49 3060.83 611.00
## women 3 102 28.98 31.72 13.60 24.74 18.73 0.00
## prestige 4 102 46.83 17.20 43.60 46.20 19.20 14.80
## census 5 102 5401.77 2644.99 5135.00 5393.87 4097.91 1113.00
## type* 6 98 1.79 0.80 2.00 1.74 1.48 1.00
## max range skew kurtosis se
## education 15.97 9.59 0.32 -1.03 0.27
## income 25879.00 25268.00 2.13 6.29 420.41
## women 97.51 97.51 0.90 -0.68 3.14
## prestige 87.20 72.40 0.33 -0.79 1.70
## census 9517.00 8404.00 0.11 -1.49 261.89
## type* 3.00 2.00 0.40 -1.36 0.08
# Note: The flag norm allows us to also include normal
# distribution statistics such as skewness, kurtosis, etc.
I.1 Histograms
Histograms should be one our first analysis tool since they can help us learn about each variable’s properties.
Things to look for:
• Outliers: This can influene the regression fit in many ways
• Shape: Is the distribution symmetric, skewed, etc?
• Range: Is the range of values very large? For exmaple several orders of magnitude?
• Density of observaions and/or clustering feautures. Are the data all clustered in certain regions? In high density
regions your regression fit may be robust but then become unstable in sless dense ones.
The near optimal number of bins to use based on the number of observations n is given by:
k = 1 + log2(n)
However, more sample-specific formulas exist such as Freedman’s and Diaconis (FD)’ suggested:
n1/3(max−min)
2(Q3 −Q1)
3
For example, we can compare the two methods for 2 different data sets:
#Compare histograms using k vs FD:
quartz()
par(mfrow=c(2,2))
n = 1000;k = 1 + log2(n)
hist(rnorm(n),breaks = k,col=”skyblue4″, main = “Normal Distr. Using k “)
hist(rnorm(n),breaks = “FD”,col=”skyblue4″, main = “Normal Distr. Using FD”)
n = 1000;k = 1 + log2(n)
hist(rexp(n,0.5),breaks = k,col=”skyblue2″, main =”Exp Distr. Using k”)
hist(rexp(n,0.5),breaks = “FD”,col=”skyblue2″, main =”Exp Distr. Using FD”)
Normal Distr. Using k
rnorm(n)
F
re
q
u
e
n
cy
−3 −2 −1 0 1 2 3
0
1
0
0
Normal Distr. Using FD
rnorm(n)
F
re
q
u
e
n
cy
−3 −2 −1 0 1 2 3
0
4
0
8
0
Exp Distr. Using k
rexp(n, 0.5)
F
re
q
u
e
n
cy
0 5 10 15
0
3
0
0
Exp Distr. Using FD
rexp(n, 0.5)
F
re
q
u
e
n
cy
0 2 4 6 8 10 12 14
0
1
0
0
I.2 Density Estimation
Histograms are very informative but being able to visualize the density plot, along with the histogram and
1D scatterplot (rug-plot) can be more informative. A common kernel-density (nonparametric) estimate used
is given by:
p̂(x) =
1
nh
n
∑
i=1
K
(
x− xi
h
)
where K is kernel function (e.g., Normal distribution), h is the bandwidth (amount of smoothness), and n the
number of observations.
quartz()
hist(rexp(n,0.5),breaks =”FD”,col=”skyblue2″, freq = FALSE, ylab = “Density”)
lines(density(rexp(n,0.5)),lwd = 2, col =”red”)
4
lines(density(rexp(n,0.5), adjust =0.5),lwd = 2, col = “blue”, type =’l’, lty=2)
rug(rexp(n,0.5))
## Warning in rug(rexp(n, 0.5)): some values will be clipped
Histogram of rexp(n, 0.5)
rexp(n, 0.5)
D
e
n
si
ty
0 2 4 6 8 10 12
0
.0
0
.1
0
.2
0
.3
0
.4
#You can also show the denisty with color
library(ggplot2)
##
## Attaching package: ‘ggplot2’
## The following objects are masked from ‘package:psych’:
##
## %+%, alpha
## The following object is masked from ‘package:NLP’:
##
## annotate
data=data.frame(value=rnorm(10000))
ggplot(data, aes(x=value))+geom_histogram(binwidth = 0.2, aes(fill = ..count..) )
5
0
200
400
600
800
−4 −2 0 2 4
value
co
u
n
t
0
200
400
600
800
count
I.3 Quantile Plots
These plots are very useful for comparing the distribution of a variable to a theoretical reference distribution.
By default the Normal distribution is used in the function ‘qqPlot’ (Quantile – Quantile) but we can provide
any other distribution.
#The dataset Prestige from the car package has 102 observations
library(car)
attach(Prestige)
## The following object is masked from package:psych:
##
## income
## The following objects are masked from Prestige (pos = 5):
##
## census, education, income, prestige, type, women
## The following object is masked from package:datasets:
##
## women
qqPlot(~ income, data=Prestige, id=list(n=3))
6
−2 −1 0 1 2
0
5
0
0
0
1
5
0
0
0
2
5
0
0
0
norm quantiles
in
co
m
e
general.managersphysicians
lawyers
## general.managers physicians lawyers
## 2 24 17
qqPlot(income ~ type, data=Prestige, layout=c(1, 3))
−2 −1 0 1 2
5
0
0
0
1
0
0
0
0
1
5
0
0
0
2
0
0
0
0
2
5
0
0
0
type = prof
norm quantiles
in
co
m
e
general.managers
physicians
−2 −1 0 1 2
5
0
0
0
1
0
0
0
0
1
5
0
0
0
2
0
0
0
0
2
5
0
0
0
type = bc
norm quantiles
in
co
m
e
farm.workers
firefighters
−2 −1 0 1 2
5
0
0
0
1
0
0
0
0
1
5
0
0
0
2
0
0
0
0
2
5
0
0
0
type = wc
norm quantiles
in
co
m
e
commercial.travellers
insurance.agents
# Type = Professional (Prof), White Collar (WC), “Blue Collar (BC)”
7
qqPlot(lm(prestige ~ income + education + type, data=Duncan),
envelope=.99)
−2 −1 0 1 2
−
1
0
1
2
3
4
t Quantiles
S
tu
d
e
n
tiz
e
d
R
e
si
d
u
a
ls
(l
m
(p
re
st
ig
e
~
in
co
m
e
+
e
d
u
ca
tio
n
+
t
yp
e
,
d
a
ta
=
D
u
n
ca
n
))
minister
machinist
## minister machinist
## 6 28
I.4 Boxplots
Boxplots allow us know more quantitative aspects of a variables distribution such as the min, max, Q1,
Q3, IQR, and Median. Another helpful property is that they are very convenient when comparing many
distributions simultaneously and/or conditioning on other variables.
#Boxplot of Income
Boxplot(~income, data=Prestige)
8
0
5
0
0
0
1
5
0
0
0
2
5
0
0
0
in
co
m
e
general.managers
lawyers
physicians
veterinarians
osteopaths.chiropractors
## [1] “general.managers” “lawyers”
## [3] “physicians” “veterinarians”
## [5] “osteopaths.chiropractors”
# Note: For more choices of boxplots, look at: https://www.r-graph-gallery.com/boxplot/
# These are knownn as Parallel Boxplots
Boxplot(Ornstein$interlocks~Ornstein$nation)
CAN OTH UK US
0
2
0
4
0
6
0
8
0
1
0
0
Ornstein$nation
O
rn
st
e
in
$
in
te
rl
o
ck
s
1
2
3
56
9
13
27
## [1] “1” “2” “3” “5” “6” “9” “13” “27”
# Note: interlocks = Number of interlocking director and executive positions shared with other major firms.
library(“plotrix”)
##
## Attaching package: ‘plotrix’
9
## The following object is masked from ‘package:psych’:
##
## rescale
## The following object is masked from ‘package:rgl’:
##
## mtext3d
## The following object is masked from ‘package:scales’:
##
## rescale
means <- Tapply(interlocks ~ nation, mean, data=Ornstein)
sds <- Tapply(interlocks ~ nation, sd, data=Ornstein)
plotCI(1:4, means, sds, xaxt="n", xlab="Nation of Control",
ylab="interlocks", ylim=range(Ornstein$interlocks))
lines(1:4, means)
axis(1, at=1:4, labels = names(means))
0
2
0
4
0
6
0
8
0
1
0
0
Nation of Control
in
te
rl
o
ck
s
CAN OTH UK US
II Bivariate Characterizations
Once you have the univariate description of the variables, we can then exam the pairwise associations via
e.g., scatterplots
II.1 Scatterplots
library(car)
attach(Prestige)
## The following objects are masked from Prestige (pos = 4):
##
10
## census, education, income, prestige, type, women
## The following object is masked from package:psych:
##
## income
## The following objects are masked from Prestige (pos = 7):
##
## census, education, income, prestige, type, women
## The following object is masked from package:datasets:
##
## women
scatterplot(prestige ~income, lwd=3, id=TRUE)
0 5000 10000 15000 20000 25000
2
0
4
0
6
0
8
0
income
p
re
st
ig
e
2
24
## [1] 2 24
#Note: Lowess smoother is a 'locally weighted smoother'
#We may want to condition on other variables, for exmaple, on type:
scatterplot(prestige ~income|type, lwd=3,col=c("blue", "green", "red"),id=TRUE)
11
5000 10000 15000 20000 25000
2
0
3
0
4
0
5
0
6
0
7
0
8
0
income
p
re
st
ig
e
82
58
2
24
31
51
type
bc prof wc
## 31 58 51 82
## 1 2 3 19 24 25
#If have sevaral predictors in our model, it is more practical to simply look at as scatterplot matrix of the data. For example:
scatterplotMatrix(~ prestige + income + education + women)
prestige
0
1
5
0
0
0
20 40 60 80
0
4
0
8
0
0 10000 20000
income
education
6 8 10 12 14 16
0 20 40 60 80
2
0
6
0
6
1
0
1
4
women
12
#We can also look at all the variables conditioned on type:
scatterplotMatrix(~ prestige + income + education + women | type,smooth=FALSE, ellipse=list(levels=0.5))
#We can also look at 3D visualizations
scatter3d(prestige ~ income + education)
## Loading required namespace: mgcv
axis(1, at=1:4, labels = names(means))
bc
prof
wc
prestige
5
0
0
0
2
0
0
0
0
20 40 60 80
0
4
0
8
0
5000 15000 25000
income
education
6 8 10 12 14 16
0 20 40 60 80
2
0
5
0
8
0
6
1
0
1
4
women
CAN
II.2 Jittering Scatterplots
Jittering the data (adding noise) helps separate overplotted observations
plot(Vocab$vocabulary~Vocab$education)
13
0 5 10 15 20
0
2
4
6
8
1
0
Vocab$education
V
o
ca
b
$
vo
ca
b
u
la
ry
#Add some random noise to the variables --> jitter the data
plot(jitter(Vocab$vocabulary)~jitter(Vocab$education))
0 5 10 15 20
0
2
4
6
8
1
0
jitter(Vocab$education)
jit
te
r(
V
o
ca
b
$
vo
ca
b
u
la
ry
)
# Increse the amount of jitter
plot(jitter(Vocab$vocabulary,factor =2)~jitter(Vocab$education, factor =2),col=”gray”,cex=0.5)
abline(lm(Vocab$vocabulary~ Vocab$education),lwd= 3, lty=”dashed”)
lines(lowess(Vocab$education,Vocab$vocabulary,f=0.2), lwd =3, col=”red”)
14
0 5 10 15 20
0
2
4
6
8
1
0
jitter(Vocab$education, factor = 2)
jit
te
r(
V
o
ca
b
$
vo
ca
b
u
la
ry
,
fa
ct
o
r
=
2
)
III. Transforming Data
III.1 Logarithms
In econometrics logarithms are very useful when dealing with variables that have span different orders
of magnitude yet must be included in the regression model. For example consider a model that includes
the predictors GDP ( 1010) and real interest rates ( 10−2). Logarithms can help us mitigate violations to the
constant variance assumption often used in Least Squares. By default R uses natural logs when you use the
log function.
# Here we look at the Ornstein data which includes financial assets of 248 Canadian companies.
par(mfrow=c(1, 2), mar=c(5, 4, 6, 2) + 0.1)
densityPlot(~ assets, data=Ornstein, xlab=”assets”, main=”Untransformed”)
densityPlot(~ log10(assets), data=Ornstein, adjust=0.65,
xlab=expression(log[10]~”(assets)”), main=”Log-Log Plot”)
basicPowerAxis(0, base=10, side=”above”, at=10^(2:5),
axis.title=””)
15
0 50000 150000
0
.0
0
0
0
0
0
.0
0
0
1
5
0
.0
0
0
3
0
Untransformed
assets
D
e
n
si
ty
2 3 4 5
0
.0
0
.2
0
.4
0
.6
0
.8
Log−Log Plot
log10 (assets)
D
e
n
si
ty
100 10000
scatterplot(infantMortality ~ ppgdp, data=UN, xlab=”GDP per Capita”,
ylab=”Infant Mortality Rate (per 1000 births)”, main=”Untransformed”)
0e+00 2e+04 4e+04 6e+04 8e+04 1e+05
0
2
0
4
0
6
0
8
0
1
0
0
1
2
0
Untransformed
GDP per Capita
In
fa
n
t
M
o
rt
a
lit
y
R
a
te
(
p
e
r
1
0
0
0
b
ir
th
s)
scatterplot(infantMortality ~ ppgdp, data=UN, xlab=”GDP per capita”,
ylab=”Infant Mortality Rate (per 1000 births)”, main=”Log-Log Plot”,
log=”xy”, id=list(n=3))
16
1e+02 5e+02 5e+03 5e+04
2
5
1
0
2
0
5
0
1
0
0
Log−Log Plot
GDP per capita
In
fa
n
t
M
o
rt
a
lit
y
R
a
te
(
p
e
r
1
0
0
0
b
ir
th
s)
Equatorial GuineaAngola
Gabon
## Angola Equatorial Guinea Gabon
## 4 54 62
III.2 Power Transformations
Power transformations of the predictors and/or response variable(s) can help improve their respective
distributions for modeling/estimation purposes (e.g., Least Squares estimates of regression coefficients) and
sampling/robustness (e.g., Bootstrapping and Bayesian inference). A popular transformation is the Box-Cox
scale power transformation given by:
TBC(x, λ) = x
(λ) =
{
xλ−1
λ
when λ 6= 0;
logex when λ = 0.
The R function symbox automatically produces boxplots of the variable to transform for different values of λ
to quickly gauge which power-law is more appropriate. For example, we can use this function on the Per
Person GDP (ppgdp) as follows:
library(car)
symbox(~ppgdp,data=UN)
17
−1 −0.5 log 0.5 1
Powers
Tr
a
n
sf
o
rm
a
tio
n
s
o
f
p
p
g
d
p
# Or we can look at income from the Prestige dataset
symbox(~ income, data=Prestige)
−1 −0.5 log 0.5 1
Powers
Tr
a
n
sf
o
rm
a
tio
n
s
o
f
in
co
m
e
Often times our variables take on negative values (e.g., S&P 500 returns), yet transformations such as logs
would not work. Instead we can use the Yeo-Johnson transformations TBC(x + 1, λ) for nonnegative values
of x and TBC(−x + 1, 2− λ) for negative values of x.
# The function yjPower computes takes as inputs your variable (x) and lambda, and outputs the respective transformed variable.
x=-5:5
yjPower(x, 3)
## [1] -0.8333333 -0.8000000 -0.7500000 -0.6666667 -0.5000000 0.0000000
18
## [7] 2.3333333 8.6666667 21.0000000 41.3333333 71.6666667
For regression purposes, transformations for normality, linearity and/or constant variance can be easily
implemented and tested statistically with the powerTransform function. For example, consider the variable
infant mortality from the UN dataset.
hist(UN$infantMortality)
Histogram of UN$infantMortality
UN$infantMortality
F
re
q
u
e
n
cy
0 20 40 60 80 100 120
0
1
0
2
0
3
0
4
0
5
0
6
0
# The histogram shows that the distribution is left-skewed. Therefore, we can try and (1) test if a transformation is needed and (2) if it is needed, transform it.
p1 <- powerTransform(infantMortality ~ 1, data=UN, family="bcPower")
summary(p1)
## bcPower Transformation to Normality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1 0.0468 0 -0.0879 0.1814
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 0.4644634 1 0.49555
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 172.8143 1 < 2.22e-16
testTransform(p1, lambda=1/2)
## LRT df pval
## LR test, lambda = (0.5) 41.95826 1 9.3243e-11
In the case of Multiple Regression, we may need to transform more than one variable at the same time (each
with its own power-law). For example, we can look the Multivariate Box Cox Highway1 data. The following
19
example if from the bcPower document ion:
# Multivariate Box Cox uses Highway1 data
summary(powerTransform(cbind(len, adt, trks, sigs1) ~ 1, Highway1))
## bcPower Transformations to Multinormality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## len 0.1439 0 -0.2728 0.5606
## adt 0.0876 0 -0.1712 0.3464
## trks -0.6954 0 -1.9046 0.5139
## sigs1 -0.2654 0 -0.5575 0.0267
##
## Likelihood ratio test that transformation parameters are equal to 0
## (all log transformations)
## LRT df pval
## LR test, lambda = (0 0 0 0) 6.014218 4 0.19809
##
## Likelihood ratio test that no transformations are needed
## LRT df pval
## LR test, lambda = (1 1 1 1) 127.7221 4 < 2.22e-16
# Multivariate transformation to normality within levels of 'htype'
summary(a3 <- powerTransform(cbind(len, adt, trks, sigs1) ~ htype, Highway1))
## bcPower Transformations to Multinormality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## len 0.1451 0.00 -0.2733 0.5636
## adt 0.2396 0.33 0.0255 0.4536
## trks -0.7336 0.00 -1.9408 0.4735
## sigs1 -0.2959 -0.50 -0.5511 -0.0408
##
## Likelihood ratio test that transformation parameters are equal to 0
## (all log transformations)
## LRT df pval
## LR test, lambda = (0 0 0 0) 13.1339 4 0.01064
##
## Likelihood ratio test that no transformations are needed
## LRT df pval
## LR test, lambda = (1 1 1 1) 140.5853 4 < 2.22e-16
# test lambda = (0 0 0 -1)
testTransform(a3, c(0, 0, 0, -1))
## LRT df pval
## LR test, lambda = (0 0 0 -1) 31.12644 4 2.8849e-06
# save the rounded transformed values, plot them with a separate
# color for each highway type
transformedY <- bcPower(with(Highway1, cbind(len, adt, trks, sigs1)),
coef(a3, round=TRUE))
scatterplotMatrix( ~ transformedY|htype, Highway1)
## Warning in smoother(x[subs], y[subs], col = smoother.args$col[i], log.x =
## FALSE, : could not fit smooth
20
FAI
MA
MC
PA
len.0
0
4
8
1.0 2.0 3.0
−
6
−
2
0 2 4 6 8
adt.0.33
trks.0
1.8 2.0 2.2 2.4 2.6
−6 −4 −2 0
1
.0
2
.0
3
.0
1
.8
2
.2
2
.6
sigs1..0.5
III.3 Transforming Restricted-Range Variables
In regression models such as probit and logistic where the response variable is a probability, their range is
restricted to the interval [0, 1], however, this can present problems for such models because values can cluster
near the end points. Therefore, transformations that can spread the values such as the arcsine square-root are
good choices for these cases. This transformation is known for its variance stabilizing properties and is given
by:
Tasnisqrt(x) = sin
−1(
√
x)
par(mfrow=c(2,2))
hist(seq(0,1,length=100),main="Original", xlab="x")
hist(asin(sqrt(seq(0,1,length=100))),main="Transformed", xlab="x")
plot(seq(0,1,length=100),main="Original",type='l')
plot(asin(sqrt(seq(0,1,length=100))),main="Transformed", type='l')
21
Original
x
F
re
q
u
e
n
cy
0.0 0.2 0.4 0.6 0.8 1.0
0
4
8
Transformed
x
F
re
q
u
e
n
cy
0.0 0.5 1.0 1.5
0
1
0
2
0
0 20 40 60 80 100
0
.0
0
.6
Original
Index
se
q
(0
,
1
,
le
n
g
th
=
1
0
0
)
0 20 40 60 80 100
0
.0
1
.0
Transformed
Indexa
si
n
(s
q
rt
(s
e
q
(0
,
1
,
le
n
g
th
=
1
0
0
))
)
Another popular transformation choice is the logit transformation:
Tlogit = logit(x) = ln
(
x
1− x
)
#library(car)
#attach(Prestige)
#par(mfrow=c(1, 3))
#densityPlot(~women,main="(a) Untransformed")
#densityPlot(~logit(women),main="(b) Logit")
#densityPlot(~asin(sqrt(women/100)),main="(c) Arcsine square-root")
III.4 Transformations to Equalize Spread
Consider again the Ornstein dataset showing the number of interlocks by country.
Boxplot(interlocks ~ nation, data=Ornstein)
22
0 50 100
0
.0
0
0
0
.0
1
0
0
.0
2
0
(a) Untransformed
women
D
e
n
s
it
y
−6 −4 −2 0 2 4
0
.0
0
0
.1
0
0
.2
0
(b) Logit
logit(women)
D
e
n
s
it
y
−0.5 0.0 0.5 1.0 1.5
0
.0
0
.4
0
.8
1
.2
(c) Arcsine square−root
asin(sqrt(women/100))
D
e
n
s
it
y
Figure 1: The bulge points in the South-West direction: p = 2, q = 1
23
CAN OTH UK US
0
2
0
4
0
6
0
8
0
1
0
0
nation
in
te
rl
o
ck
s
1
2
3
56
9
13
27
## [1] "1" "2" "3" "5" "6" "9" "13" "27"
From the figure we can see that it would be helpful to even out the observed spread across countries. A
measure of the spread-level can easily be obtain with Tukey’s Spread-Level plot which also provides a
suggested power-law transformation to stabilize the variance:
spreadLevelPlot(interlocks+1 ~ nation,Ornstein)
6 8 10 12 14 16
1
0
1
2
1
4
1
6
1
8
2
2
Spread−Level Plot for interlocks + 1 by nation
Median
H
in
g
e
−
S
p
re
a
d
CAN
US
OTH
UK
## LowerHinge Median UpperHinge Hinge-Spread
## US 2 6.0 13 11
24
## UK 4 9.0 14 10
## CAN 6 13.0 30 24
## OTH 4 15.5 24 20
##
## Suggested power transformation: 0.1534487
According to the output, the choice of λ = 0.15 would be an appropriate one. Since this value is close to 0,
which would correspond to a log function, we can try a log to see if it works:
Boxplot(log10(interlocks+1) ~ nation, data=Ornstein)
CAN OTH UK US
0
.0
0
.5
1
.0
1
.5
2
.0
nation
lo
g
1
0
(i
n
te
rl
o
ck
s
+
1
)
III.5 Transformations Toward Linearity
We can consider transforming both x and y in an effort to linearize the relationship between. For example, we
can find the values of the postie exponents p and q such that the linear regression model Yqi = β0 + β1X
q
i + ε
exhibits a more linear relationship. The choice of these exponents can be guided by Mosteller & Tukey’s
Bulging Rule for Linearization as shown in the figure below.
For example, below is a visualization of this rule take from (https://dzone.com/articles/tukey-and-mosteller\
T1\textquoterights-bulging)
25
https://dzone.com/articles/tukey-and-mosteller\T1\textquoteright s-bulging
https://dzone.com/articles/tukey-and-mosteller\T1\textquoteright s-bulging
fakedataMT=function(p=1,q=1,n=99,s=.1){
set.seed(1)
X=seq(1/(n+1),1-1/(n+1),length=n)
Y=(5+2*X^p+rnorm(n,sd=s))^(1/q)
return(data.frame(x=X,y=Y))}
par(mfrow=c(2,2))
plot(fakedataMT(p=.5,q=2),main="(p=1/2,q=2)")
plot(fakedataMT(p=3,q=-5),main="(p=3,q=-5)")
plot(fakedataMT(p=.5,q=-1),main="(p=1/2,q=-1)")
plot(fakedataMT(p=3,q=5),main="(p=3,q=5)")
0.0 0.2 0.4 0.6 0.8 1.0
2
.3
2
.5
(p=1/2,q=2)
x
y
0.0 0.2 0.4 0.6 0.8 1.0
0
.6
8
0
.7
1
(p=3,q=−5)
x
y
0.0 0.2 0.4 0.6 0.8 1.0
0
.1
4
0
.1
8
(p=1/2,q=−1)
x
y
0.0 0.2 0.4 0.6 0.8 1.0
1
.3
8
1
.4
4
(p=3,q=5)
x
y
26
I. Univariate Characterizations
I.0 Statistical Summary
I.1 Histograms
I.2 Density Estimation
I.3 Quantile Plots
I.4 Boxplots
II Bivariate Characterizations
II.1 Scatterplots
II.2 Jittering Scatterplots
III. Transforming Data
III.1 Logarithms
III.2 Power Transformations
III.3 Transforming Restricted-Range Variables
III.4 Transformations to Equalize Spread
III.5 Transformations Toward Linearity