Andrew Falcone
EM-623 A
Homework #1
I pledge my honor that I have abided by the Stevens Honor System.
Andrew Falcone 10/3/2016
Homework #1: Analysis of Churn.txt
1.) Explore whether there are missing values for any of the variables.
Excel:
o In order to check for missing values in the document, the following measures
were taken:
1) Utilizing an “IF” function to check for blank cells
A copy of each column of data was created prior to employing the
“IF” function
The “IF” function states that if the original cell is blank, it will
return a value of “No Value” in the new cell, otherwise it will
return the original value
A search of the file was then completed in order to find any cells
which contain “No Value”
From the search completed above, one can conclude that there is
no missing values in the document
Rattle
o There are two methods that can be utilized in Rattle in order to determine if there
are any missing values, both of which will be illustrated.
o First, the dataset “churn.txt” must be imported into Rattle—this is completed by
going to the “Data” tab and browsing within the “Filename” section until the data
is found
1) Note that the separator for this data set is a comma, so a comma is inserted
into the Separator box
o Next, the Explore tab is chosen, and a report can be ran: in this case, the options
“Summary”, “Describe”, and “Show Missing” are utilized
By doing so, both the “Description” and “Show Missing” options will show how many
values are missing per column of data
o
From the “Description” method, it can be seen that there are 0 missing numbers
o
Similarly, the “Show Missing” function outputs a value of 0 missing values (in [2,] row)
for each variable
2.) Compare the area code and state fields. Discuss any apparent abnormalities.
Excel
o In order to properly compare the two fields, new columns of data were first
created that isolated them
o Next, the State column was sorted A through Z in order to observe the data on a
state-to-state basis
o After analyzing the data, it seems as if every state within the file contains a
random combination of the same three Area Codes, namely 408, 415, and 510
(snippets of Alaska and Delaware can be seen above). This seems like a clear
discrepancy, as every state usually boasts their own unique set of area codes, and
not instead share area codes throughout the entire country.
o After doing further research on actual Area Codes by state (source:
http://www.worldatlas.com/na/us/area-codes.html) it was discovered that the three
apparent area codes (408,415, and 510) are all Area Codes specific to one state—
California
o Therefore, the abnormality is confirmed, and the Area Codes are incorrect for
every state besides California
Rattle
o Another “Describe” exploration was executed, and the State and Area Code
variables were focused on:
o By looking at the results for the Area Code variable, it can be seen that there are
only 3 unique values, when in reality each state should have at least one unique
Area Code. Thus, the unique value should be greater than or equal to 51, which is
the unique value for the State variable—this is the abnormality
o As discovered before, all three unique values within the Area Code variable (408,
415, 510) are all Area Codes for California; thus, the abnormality is that the other
states have incorrect inputs for Area Code
http://www.worldatlas.com/na/us/area-codes.html
3.) Use a graph to determine visually whether there are any outliers among the number of calls to
customer service.
Excel
o The first step to building a histogram is to determine the range of the Customer
Service calls data set; this was completed by calculating the Min and Max values
of the data set by utilizing the MIN() and MAX() functions
o Next, the “Histogram” function within the Data Analysis plug-in was utilized in
order to construct the graph
In the above snippet, the Input Range is the entire data set for Customer Service
calls, and the Bin Range is the range of the data set (0-9). After selecting the
“Chart Output” option, the graph was generated, as well as the associated chart
o
o From the graph above, it can be assumed that the sub-datasets represented by 7, 8,
and 9 Calls to Customer Service are definitely outliers
Rattle
o Under the Describe exploration, the following information was obtained for the
Customer Service Calls variable:
o From this information, it seems like there are a few values which can be
considered outliers; however a graphical representation is better
o To generate graphs for this information, the first step was to transform the
Customer Service Calls variable from numerical to categorical—this is completed
by going to the “Transform” tab, then selecting “Recode” and “As Categoric”,
and finally choose the variable to complete the transformation
0
200
400
600
800
1000
1200
1400
F
re
q
u
e
n
cy
Number of Calls Made to Customer Servcie
Calls to Customer Service
Frequency
o Next, the graph is generated by returning to the “Explore” tab and selecting
“Distributions”, then going down to the Categoric section and selecting “Bar Plot”
next to the transformed Customer Service Call variable
o
o From the above histogram, it can be concluded that 7, 8, and 9 Customer Service
Calls are almost certainly outliers
4.) Normalize the night minutes attribute using either Z-score or Min-Max transformation
Excel
o For this scenario, Min-Max will be utilized in order to normalize the data;
meaning that the following formula will be employed:
o First, the Min and Max of the dataset was generated using the MIN() and MAX()
functions as seen below:
o Next, the Range was calculated by subtracting the Max value by the Min value
o Finally, the dataset was normalized by using the Min-Max formula as seen above:
In this case, “BG3” is the original data point, “BJ2” is the Min value, and “BJ4” is
the range. The dollar signs ($) are utilized for the Min and Range values in order
to keep these values constant while carrying out the normalization for the entire
dataset. A snippet of the normalized data can be seen below:
Note: the Min Value (33) will have a value of 0, and Max (175) will be 1.
Rattle
o Normalizing data is relatively simple in Rattle, as it can be done by selecting
“Rescale” under the Transform tab, and then selecting “Scale [0-1]” before
clicking on the Night Minutes variable and clicking Execute
o Once executed, Rattle will generate a new variable, which is the normalized Night
Minutes variable
o Below are the original values, followed by the normalized dataset