EX03
1. Explore whether there are missing values for any of the variables
Story_1: Rattle is a good tool to explore missing values in a dataset. While load the “Hazardous_Alcohol_Consumption” dataset, don’t forget to choose all variable s as the input and not to choose partition button which split the dataset into train/validate/test datesets randomly. Then, let’s go to the “Explore” tab and choose “Summary” type. Check the “Describe” and “Show Missing” subtypes and remember to press the “Explore” button(shown in Fig 1.1), which makes the final result. The result is shown in Fig 1.2(n means the number of valid values; missing means the number of missing values). In Fig 1.2, the ten variables(“day”, “day_flag”, “mth”, “week”, “week_flag”, “lt1m”, “lt1m_flag”, “nvr_occ_flag”, “nbinge” and “nbinge_flag”) have missing values.
Fig 1.1
Fig 1.2
2. Explore the metadata file and comment its role in understanding the content of the csv file.
Story_2: Excel would be the best tool to display the metafile and complete some simple operations. The metafile’s content is shown in Fig 2. As shown in Fig 2, the metafile mainly introduces the name of the data table, the source of the data and column’s information(name , describe and id). The most important is that the table name could tell us the aim and use of the dataset and the colum_description could make us understand the variables’ meaning to data mine better.
Fig 2
3. Determine whether there are outliers among the numeric variables.
Story_3: To deal with outliers, KNIME is a good friend. Choose File Reader Node to load the dataset and choose Numeric Outliers Node to detect the outliers. Set Numeric Outlier Node parameters like Fig 3.1. Execute these two Node and the result is shown in Fig 3.2. In Fig 3.2, the values which less than the lower bound or more than the upper bound is defined as the outliers. Obviously, the five variables(“day”, “mth”, “week”, “lt1m”, “nbinge”) have outliers according to the Fig 3.2. 【Explain Numeric Outliers Node from KNIME’s help : This node detects and treats the outliers for each of the selected columns individually by means of interquartile range (IQR). To detect the outliers for a given column, the first and third quartile (Q1, Q3) is computed. An observation is flagged an outlier if it lies outside the range R = [Q1 – k(IQR), Q3 + k(IQR)] with IQR = Q3 – Q1 and k >= 0. Setting k = 1.5 the smallest value in R corresponds, typically, to the lower end of a boxplot’s whisker and largest value to its upper end.】
Fig 3.1
Fig 3.2
However, it’s not intuitive. So let’s draw these variables’ histogram plot by using rattle(The reason why choose rattle to draw is that KNIME’s histogram isn’t friendly and beautiful). Using “Recode” function under “Transform” tab in rattle(shown in Fig 3.3), the numeric variables are divided 10 levels. Then, choose “Distributions” function to draw these variables’ histogram plot(shown in Fig 3.4). Comparing KNIME’s outliers detection with rattle’s histogram plots, they are consistent basically.
Fig 3.3
Fig 3.4
4. Normalize the nbinge attribute using either Z-sore or Min-Max transformation.
Story_4: Using KNINE completes the normalize function. Choose Normalizer node and set Min-Max Normalization methods with including “nbinge” variable(shown in Fig 4.1). Using Histogram Node visualizes the normalization result.(shown in Fig 4.2). In Fig 4.2, the “nbinge” variable’s value has range from 0 to 1, which meet the normalization requirement.
Fig 4.1
Fig 4.2
5. Anlyze and interpret the correlations of all the variables with the variables “sex” and “age”.
Story_5: Using rattle completes the correlation assignment. Because only numeric variables can be analyzed for correlations, the categoric variables “sex” and “age” should be transformed to numeric variables firstly. Under the “Transform” tab, Using “As Numeric” function in “Recode” type transform these two variables(operations and result shown in Fig 5.1). Come back to the “Data” tab and make all numeric variables as input except the “time” variable which has the constant value. Finally, go to the “Explore” tab and choose “Correlation” type(shown in Fig 5.2). Press “Execute” button and the result is shown in Fig 5.3. In Fig 5.3, let’s look at the “TNM_age” row and the “TNM_sex” row that represent the correlations of all the variables with the variables “sex” and “age”. In the “TNM_age” row, there are not any other variables correlated with the “TNM_age” variable obviously except for itself. In the “TNM_age” row, the most correlated variables are the “TNM_age” variable and the “serialid” variable. After checking the dataset carefully, it’s meaningless of this finding since the “serialid” variable is just a identification number and its correlation with the “TNM_age” variable is accidental coincidence. So there are not any other variables correlated with the “age” variable and the “sex” variable obviously except for themselves.
Fig 5.1
Fig 5.2
Fig 5.3