EM 623
Homework 2
Lars Petter Bryn
Table of Contents
Homework 2 – Analysing dataset “Churn” using rattle
1. Missing values
2. Comparing area code and state field
3. Outliers
4. Normalization of Night Minute Calls
5. Analyze and interpret the correlations of all the variables with the variable “Churn?”
Homework 2 – Analysing dataset “Churn” using rattle
1. Missing values
Missing values in a dataset can be explored using rattle. Under the “explore” tab, check the
“show missing” option and press “execute”. At the bottom of the explore field, a table of the missing
values is presented. In the table, a 1 indicates a present value, whereas a 0 indicates a missing
value. The left column presents the number of observations with the corresponding pattern of
missing values. The rightmost column presents the number of missing values within that pattern. If
there are no missing values, this will show 0. The bottom row shows the number of missing values
for each variable in the dataset.
In the case of the “Churn” dataset, there are no missing values.
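The same kind of summary can be sketched outside rattle. The example below is only an illustration in Python using the standard library: the sample rows, the empty Night.Mins field and the choice of "" and "NA" as missing markers are assumptions made for the example and are not taken from the Churn file, which has no missing values.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the spirit of the "Churn" data; the second row
# deliberately has an empty Night.Mins field to illustrate the output.
SAMPLE_CSV = """State,Area.Code,Night.Mins,CustServ.Calls
KS,415,244.7,1
OH,415,,1
NJ,408,162.6,0
"""

MISSING_MARKERS = {"", "NA"}  # assumption: how missing values are encoded


def missing_summary(rows):
    """Return (missing count per column, number of rows per 1/0 pattern)."""
    per_column = Counter()
    patterns = Counter()
    for row in rows:
        # 1 = value present, 0 = value missing, as in rattle's table
        pattern = tuple(
            0 if (value or "").strip() in MISSING_MARKERS else 1
            for value in row.values()
        )
        patterns[pattern] += 1
        for column, flag in zip(row.keys(), pattern):
            per_column[column] += 1 - flag
    return dict(per_column), dict(patterns)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
per_column, patterns = missing_summary(rows)
print(per_column)  # missing values per variable (the "bottom row")
print(patterns)    # observations per missing-value pattern (the "left column")
```

For a dataset without missing values, every count in the first result is 0 and only the all-ones pattern appears in the second.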
2. Comparing area code and state field
The bar plot is made with rattle. First, the variable “area code” is set as target in the data
tab and the execute button is pressed. Then, in the explore tab, the “Distributions” option is
selected. Under the categorical variables, the bar plot option for the state variable is marked, and
the execute button is pressed again.
Unfortunately, the bar plot does not show all the state code labels along the x-axis.
Nevertheless, it shows that the area codes are evenly spread over the states. Area code
415 appears to be more heavily represented than the other area codes across the states. West Virginia is the state
with the most entries, and this is also the state where area code 415 seems to be most heavily
represented. All the area codes are from California, around the San Francisco Bay Area. The area
code represents the area where the phone number is registered. It is not clear what the state code
variable in the dataset represents, but given that this dataset is from a telecom company, it could be
that this variable represents the state of the billing address of the subscribers.
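Because the bar plot hides some of the state labels, a plain count table can serve as a complement. The sketch below cross-tabulates state against area code in Python; the (state, area code) pairs are hypothetical and only illustrate the layout, they are not counts from the Churn dataset.

```python
from collections import Counter

# Hypothetical (state, area code) pairs; the real dataset has one pair
# per subscriber.
records = [
    ("WV", 415), ("WV", 415), ("WV", 408),
    ("CA", 510), ("OH", 415), ("OH", 408),
]


def crosstab(pairs):
    """Count how often each area code occurs within each state."""
    table = {}
    for state, area_code in pairs:
        table.setdefault(state, Counter())[area_code] += 1
    return table


table = crosstab(records)
area_codes = sorted({code for _, code in records})

# One row per state, one column per area code
print("State", *area_codes)
for state in sorted(table):
    print(state, *(table[state][code] for code in area_codes))
```

A table like this shows every state, so the spread of the area codes can be checked even where the plot labels are missing.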
3. Outliers
The histogram plot is made with rattle. This is done by setting the phone variable as target and then
pressing the execute button. Next, enter the explore tab, press the distribution button, mark the
CustServ.Calls variable with the histogram option and press execute. The histogram plot will now
show in RStudio. Unfortunately, I was not able to change the labels and tick marks on the x- and y-axes
to be more accurate. However, the plot shows that the number of customer service calls varies
from 0 to 9, and that the most common number of service calls is 1 per subscriber (the phone
variable). Only a few subscribers have made as many as 8 or 9 service calls, and these can
be seen as outliers in this case. On the other hand, in a range as short as 0 to 9, it is debatable whether 8
and 9 should be treated as outliers.
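One way to make the outlier question less subjective is to apply a numeric rule. The sketch below builds the frequency table behind a histogram and then applies the common 1.5 × IQR rule. Both the rule and the sample call counts are assumptions introduced for illustration: the report identifies outliers by eye, and the list below is not the real CustServ.Calls column.

```python
import statistics
from collections import Counter

# Hypothetical customer service call counts, one value per subscriber
calls = [0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 4, 8, 9]

# Frequency table: the bar heights of a histogram with one bin per value
counts = Counter(calls)
for value in sorted(counts):
    print(value, "#" * counts[value])

# 1.5 * IQR rule (an assumption, not the method used in the report)
q1, _, q3 = statistics.quantiles(calls, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [c for c in calls if c < lower_fence or c > upper_fence]
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

With this sample, the values 8 and 9 fall above the upper fence. Whether the same holds for the real data depends on its quartiles.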
4. Normalization of Night Minute Calls
Rattle can normalize data with different methods. The method used for this assignment is the z-score
method. When doing this in rattle, the first step is to load the dataset under the data-tab. The
next step is to go to the transform-tab and mark the rescale-button, and the recenter-button for the
normalize-option. Make sure to mark the variable to be normalized, in this case the
Night.Mins-variable. Press the execute-button. Note that a new variable, called RRC_Night.Mins, is shown
at the bottom of the list of variables. The text under the “Data type and Number Missing”-
column shows that the z-scores have a range of -3.51 to 3.84, that there are 1591 unique values, that
the mean value is 0.0 and that the median value is 0.01. To make a plot of the normalized values,
first go to the Data-tab. Mark the Phone-variable at the target-option, the Night.Mins-variable at the
ignore-option, and make sure that the new variable, RRC_Night.Mins, is marked at the input-option.
Press the execute-button. Go to the explore-tab, and select the distribution-button. Scroll
down the numeric variable list, and mark the RRC_Night.Mins-variable with the histogram-option.
Press execute. A histogram-plot of the normalized night minute calls values is now shown in RStudio.
The plot is shown below.
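The z-score transformation itself is a short calculation: each value has the mean subtracted and is then divided by the standard deviation. The sketch below shows this in Python. The night minute values are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption about how the recenter option is computed.

```python
import statistics


def z_scores(values):
    """Recenter to mean 0 and rescale to standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mean) / sd for v in values]


# Hypothetical night minute values, not the real Night.Mins column
night_mins = [244.7, 254.4, 162.6, 196.9, 186.9]
z = z_scores(night_mins)

print([round(v, 2) for v in z])
print("mean:", round(statistics.mean(z), 6))
print("sd:", round(statistics.stdev(z), 6))
```

After the transformation the mean is 0 and the standard deviation is 1, up to rounding, which matches the mean of 0.0 reported by rattle for RRC_Night.Mins.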
5. Analyze and interpret the correlations of all the variables with the variable “Churn?”
Rattle can be used to find correlations between variables in a dataset. This can be done by loading
the dataset, entering the exploration tab, pressing the correlation button and pressing execute. Rattle will
generate a table in which all the variables are compared to each other, with a correlation coefficient between -1
and 1. A color plot of the correlations will also be generated in RStudio. In the color plot, dark blue
represents a correlation of 1 (perfect positive correlation), white represents 0 (no
correlation) and red represents -1 (perfect negative correlation). The purpose of this task
was to find the correlation between the variable “Churn” and all the other variables in the dataset.
The values of the churn variable were recorded in the dataset as either “True.” or “False.”, which are
categorical values. In order for rattle to process and generate correlation values between the churn
variable and the other variables, the categorical values must be transformed to numerical values.
There are several ways to do this. One of the easiest ways is to open the dataset in Notepad or
a similar text editor and use the find and replace function. In this case, the value “True.” (it is important to write
the value exactly as it is written in the dataset) was replaced by the numerical value 1, and the value
“False.” was replaced by the numerical value 0. After saving the dataset, it was loaded into rattle.
Before running the correlation-function, it is important to set the churn variable as input and uncheck
the partition-option (under the data tab). The correlation values and the color plot generated are
shown below. The correlation of the churn variable with the other variables seems to be very low;
this can be seen both in the generated values and in the color plot. Not surprisingly, the
variable with the highest correlation to the churn variable is the customer service calls variable. A
plausible explanation is that churn is often initiated during a service call. The other most correlated variables are
the day minute calls and the day minute charge. This is not surprising either, as service calls are likely to
be made during the day, and the day charge depends on the day minute calls. More
surprising is the fact that there seems to be a small negative correlation with the voice mail
message variable. It is not easy to find a rational explanation for this, and it could be that this (very low)
correlation is accidental.
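The recoding and the correlation calculation can also be done in a few lines of code instead of with find and replace. The sketch below is an illustration in Python: the labels “True.” and “False.” are the ones used in the dataset, while the sample values and the function names recode_churn and pearson are hypothetical and chosen only for this example.

```python
import math


def recode_churn(label):
    """Map the dataset's categorical churn labels to 1 and 0."""
    mapping = {"True.": 1, "False.": 0}
    return mapping[label.strip()]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical sample rows, not taken from the Churn dataset
churn_labels = ["False.", "False.", "True.", "False.", "True.", "False."]
custserv_calls = [1, 0, 4, 2, 5, 1]

churn = [recode_churn(label) for label in churn_labels]
r = pearson(custserv_calls, churn)
print("churn as numbers:", churn)
print("correlation with customer service calls:", round(r, 3))
```

Recoding in code leaves the original file unchanged, and a label that is written differently from “True.” or “False.” raises an error instead of being silently skipped.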