EX03
1. Explore whether there are missing values for any of the variables
Story_1: Rattle is a good tool for exploring missing values in a dataset.
When loading the “Hazardous_Alcohol_Consumption” dataset, select all
variables as input and leave the “Partition” option unchecked, since it
randomly splits the dataset into train/validate/test datasets. Then go to
the “Explore” tab and choose the “Summary” type. Check the “Describe” and
“Show Missing” options and press the “Execute” button (shown in Fig 1.1)
to produce the result. The result is shown in Fig 1.2 (n is the number of
valid values; missing is the number of missing values). According to
Fig 1.2, ten variables (“day”, “day_flag”, “mth”, “week”, “week_flag”,
“lt1m”, “lt1m_flag”, “nvr_occ_flag”, “nbinge” and “nbinge_flag”) have
missing values.
Fig 1.1
Fig 1.2
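The same n/missing summary can be sketched outside Rattle. The snippet below is a minimal illustration in Python with pandas, not the method used in this exercise. The small data frame is made up, borrowing only a few column names from the dataset, and the helper name missing_summary is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset: the values are invented for illustration only.
df = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],
    "day": [2.0, np.nan, 1.0, np.nan],
    "nbinge": [np.nan, 3.0, 0.0, 5.0],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return, per column, n (valid values) and missing (missing values)."""
    return pd.DataFrame({
        "n": frame.notna().sum(),
        "missing": frame.isna().sum(),
    })


summary = missing_summary(df)
print(summary)

# Names of the columns that contain at least one missing value.
cols_with_missing = summary.index[summary["missing"] > 0].tolist()
print(cols_with_missing)
```

For the real data, df would be read from the csv file instead of being typed in by hand.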
2. Explore the metadata file and comment on its role in understanding the
content of the csv file.
Story_2: Excel is a convenient tool for displaying the metadata file and
performing simple operations on it. The content of the metadata file is
shown in Fig 2. As shown there, the file mainly describes the name of the
data table, the source of the data and the information about each column
(name, description and id). Most importantly, the table name tells us the
purpose and use of the dataset, and the column descriptions explain what
each variable means, which helps us mine the data better.
Fig 2
3. Determine whether there are outliers among the numeric variables.
Story_3: KNIME works well for detecting outliers. Use a File Reader node
to load the dataset and a Numeric Outliers node to detect the outliers.
Set the Numeric Outliers node parameters as in Fig 3.1. Execute the two
nodes; the result is shown in Fig 3.2. In Fig 3.2, values that are less
than the lower bound or greater than the upper bound are defined as
outliers. According to Fig 3.2, five variables (“day”, “mth”, “week”,
“lt1m”, “nbinge”) have outliers.
【Explanation of the Numeric Outliers node from KNIME’s help: This node
detects and treats the outliers for each of the selected columns
individually by means of interquartile range (IQR). To detect the outliers
for a given column, the first and third quartile (Q1, Q3) is computed. An
observation is flagged an outlier if it lies outside the range R = [Q1 –
k(IQR), Q3 + k(IQR)] with IQR = Q3 – Q1 and k >= 0. Setting k = 1.5
the smallest value in R corresponds, typically, to the lower end of a
boxplot’s whisker and largest value to its upper end.】
Fig 3.1
Fig 3.2
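The IQR rule quoted above can be sketched in a few lines of Python with pandas. This is an illustration of the rule only, not KNIME's implementation: the function names and sample values are hypothetical, and pandas' default quantile interpolation may give slightly different quartiles from the estimation method chosen in the KNIME node.

```python
from typing import Tuple

import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> Tuple[float, float]:
    """Return (lower, upper) = (Q1 - k*IQR, Q3 + k*IQR), ignoring NaN."""
    clean = values.dropna()
    q1 = clean.quantile(0.25)
    q3 = clean.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


def flag_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values outside [lower, upper]; missing values are never flagged."""
    lower, upper = iqr_bounds(values, k)
    return (values < lower) | (values > upper)


# Invented sample: one value is far away from the rest.
sample = pd.Series([0, 1, 1, 2, 2, 3, 3, 4, 30], dtype=float)
print(iqr_bounds(sample))
print(sample[flag_outliers(sample)].tolist())
```

With k = 1.5 the bounds match the usual boxplot whiskers; a larger k flags fewer values.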
However, this result is not very intuitive, so let’s draw histograms of
these variables with Rattle (Rattle is chosen because KNIME’s histogram
is less user-friendly and less attractive). Using the “Recode” function
under the “Transform” tab in Rattle (shown in Fig 3.3), the numeric
variables are divided into 10 levels. Then choose the “Distributions”
function to draw the histograms of these variables (shown in Fig 3.4).
KNIME’s outlier detection and Rattle’s histograms are broadly
consistent.
Fig 3.3
Fig 3.4
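Dividing a numeric variable into 10 levels and counting the values in each level can be sketched with NumPy. The snippet assumes equal-width bins, which may not be the binning option selected in Rattle, and the sample values are invented.

```python
import numpy as np

# Invented sample with one extreme value.
sample = np.array([0, 1, 1, 2, 2, 3, 3, 4, 30], dtype=float)

# Ten equal-width bins between the minimum and the maximum of the sample.
counts, edges = np.histogram(sample, bins=10)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    # Print one text bar per bin so that isolated values stand out.
    print(f"{left:5.1f} to {right:5.1f}: {'#' * int(count)}")
```

A value that sits alone in a far bin, separated from the bulk of the data by empty bins, is the kind of point the IQR rule tends to flag.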
4. Normalize the nbinge attribute using either Z-score or Min-Max
transformation.
Story_4: KNIME is used for the normalization. Choose the Normalizer node,
select the Min-Max Normalization method and include the “nbinge”
variable (shown in Fig 4.1). A Histogram node visualizes the
normalization result (shown in Fig 4.2). In Fig 4.2, the values of
“nbinge” range from 0 to 1, which meets the normalization requirement.
Fig 4.1
Fig 4.2
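Both transformations named in the task are short formulas: Min-Max maps x to (x - min) / (max - min), and Z-score maps x to (x - mean) / standard deviation. The Python sketch below illustrates them on invented values. The function names are hypothetical, and the use of the population standard deviation (ddof=0) is an assumption that may differ from KNIME's choice.

```python
import pandas as pd


def min_max(values: pd.Series) -> pd.Series:
    """Rescale to [0, 1]; assumes the column is not constant."""
    low, high = values.min(), values.max()
    return (values - low) / (high - low)


def z_score(values: pd.Series) -> pd.Series:
    """Center on the mean and scale by the population standard deviation."""
    return (values - values.mean()) / values.std(ddof=0)


# Invented values standing in for the "nbinge" column.
nbinge = pd.Series([0.0, 2.0, 4.0, 10.0])

scaled = min_max(nbinge)
standardized = z_score(nbinge)
print(scaled.tolist())
print(standardized.round(3).tolist())
```

Min-Max keeps the shape of the distribution but is sensitive to outliers, because a single extreme value fixes the upper end of the range.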
5. Analyze and interpret the correlations of all the variables with the
variables “sex” and “age”.
Story_5: Rattle is used for the correlation task. Because only numeric
variables can be analyzed for correlations, the categoric variables
“sex” and “age” must first be transformed into numeric variables. Under
the “Transform” tab, use the “As Numeric” function of the “Recode” type
to transform these two variables (operations and result shown in
Fig 5.1). Go back to the “Data” tab and select all numeric variables as
input except the “time” variable, which has a constant value. Finally,
go to the “Explore” tab and choose the “Correlation” type (shown in
Fig 5.2). Press the “Execute” button; the result is shown in Fig 5.3. In
Fig 5.3, the “TNM_age” row and the “TNM_sex” row give the correlations
of all the variables with “age” and “sex”. In the “TNM_sex” row, no
variable other than “TNM_sex” itself is clearly correlated with
“TNM_sex”. In the “TNM_age” row, the variable most correlated with
“TNM_age”, apart from itself, is “serialid”. After checking the dataset
carefully, this finding is meaningless, since “serialid” is just an
identification number and its correlation with “TNM_age” is a
coincidence. So no variable is clearly correlated with “age” or “sex”
except the variables themselves.
Fig 5.1
Fig 5.2
Fig 5.3
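The recode-then-correlate steps can be sketched in Python with pandas. This is an illustration on invented data, not a reproduction of Fig 5.3: the age bands and all values are hypothetical, and integer category codes stand in for Rattle's “As Numeric” transformation. Only the TNM_ prefix and the column names are taken from the text above.

```python
import pandas as pd

# Invented records: the age bands and values are hypothetical.
df = pd.DataFrame({
    "sex": ["M", "F", "F", "M", "F", "M"],
    "age": ["18-24", "25-34", "18-24", "35-44", "25-34", "35-44"],
    "nbinge": [4, 1, 2, 5, 0, 3],
    "week": [2, 1, 3, 2, 1, 3],
})

# Replace each category by an integer code (sorted order of the labels).
df["TNM_sex"] = df["sex"].astype("category").cat.codes
df["TNM_age"] = df["age"].astype("category").cat.codes

# Pearson correlation matrix over the numeric columns only.
corr = df.select_dtypes("number").corr()

# The two rows of interest: every variable against sex and against age.
print(corr.loc[["TNM_sex", "TNM_age"]].round(2))
```

The codes impose an order on the categories, so a correlation with a recoded variable is only meaningful when that order is meaningful, as it is for age bands.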