程序代写代做代考 Excel EX03

EX03

1. Explore whether there are missing values for any of the variables

Story_1: Rattle is a good tool to explore missing values in a dataset. While

load the “Hazardous_Alcohol_Consumption” dataset, don’t forget to

choose all variables as the input and not to choose partition button which

split the dataset into train/validate/test datesets randomly. Then, let’s go to

the “Explore” tab and choose “Summary” type. Check the “Describe” and

“Show Missing” subtypes and remember to press the “Explore”

button(shown in Fig 1.1), which makes the final result. The result is shown

in Fig 1.2(n means the number of valid values; missing means the number

of missing values). In Fig 1.2, the ten variables(“day”, “day_flag”, “mth”,

“week”, “week_flag”, “lt1m”, “lt1m_flag”, “nvr_occ_flag”, “nbinge” and

“nbinge_flag”) have missing values.

Fig 1.1

Fig 1.2

2. Explore the metadata file and comment its role in understanding the

content of the csv file.

Story_2: Excel would be the best tool to display the metafile and complete

some simple operations. The metafile’s content is shown in Fig 2. As

shown in Fig 2, the metafile mainly introduces the name of the data table,

the source of the data and column’s information(name , describe and id).

The most important is that the table name could tell us the aim and use of

the dataset and the colum_description could make us understand the

variables’ meaning to data mine better.

Fig 2

3. Determine whether there are outliers among the numeric variables.

Story_3: To deal with outliers, KNIME is a good friend. Choose File

Reader Node to load the dataset and choose Numeric Outliers Node to

detect the outliers. Set Numeric Outlier Node parameters like Fig 3.1.

Execute these two Node and the result is shown in Fig 3.2. In Fig 3.2, the

values which less than the lower bound or more than the upper bound is

defined as the outliers. Obviously, the five variables(“day”, “mth”,

“week”, “lt1m”, “nbinge”) have outliers according to the Fig 3.2.

【Explain Numeric Outliers Node from KNIME’s help : This node

detects and treats the outliers for each of the selected columns

individually by means of interquartile range (IQR). To detect the outliers

for a given column, the first and third quartile (Q1, Q3) is computed. An

observation is flagged an outlier if it lies outside the range R = [Q1 –

k(IQR), Q3 + k(IQR)] with IQR = Q3 – Q1 and k >= 0. Setting k = 1.5

the smallest value in R corresponds, typically, to the lower end of a

boxplot’s whisker and largest value to its upper end.】

Fig 3.1

Fig 3.2

However, it’s not intuitive. So let’s draw these variables’ histogram

plot by using rattle(The reason why choose rattle to draw is that

KNIME’s histogram isn’t friendly and beautiful). Using “Recode”

function under “Transform” tab in rattle(shown in Fig 3.3), the numeric

variables are divided 10 levels. Then, choose “Distributions” function to

draw these variables’ histogram plot(shown in Fig 3.4). Comparing

KNIME’s outliers detection with rattle’s histogram plots, they are

consistent basically.

Fig 3.3

Fig 3.4

4. Normalize the nbinge attribute using either Z-sore or Min-Max

transformation.

Story_4: Using KNINE completes the normalize function. Choose

Normalizer node and set Min-Max Normalization methods with including

“nbinge” variable(shown in Fig 4.1). Using Histogram Node visualizes the

normalization result.(shown in Fig 4.2). In Fig 4.2, the “nbinge” variable’s

value has range from 0 to 1, which meet the normalization requirement.

Fig 4.1

Fig 4.2

5. Anlyze and interpret the correlations of all the variables with the

variables “sex” and “age”.

Story_5: Using rattle completes the correlation assignment. Because only

numeric variables can be analyzed for correlations, the categoric variables

“sex” and “age” should be transformed to numeric variables firstly. Under

the “Transform” tab, Using “As Numeric” function in “Recode” type

transform these two variables(operations and result shown in Fig 5.1).

Come back to the “Data” tab and make all numeric variables as input

except the “time” variable which has the constant value. Finally, go to the

“Explore” tab and choose “Correlation” type(shown in Fig 5.2). Press

“Execute” button and the result is shown in Fig 5.3. In Fig 5.3, let’s look

at the “TNM_age” row and the “TNM_sex” row that represent the

correlations of all the variables with the variables “sex” and “age”. In the

“TNM_age” row, there are not any other variables correlated with the

“TNM_age” variable obviously except for itself. In the “TNM_age” row,

the most correlated variables are the “TNM_age” variable and the

“serialid” variable. After checking the dataset carefully, it’s meaningless

of this finding since the “serialid” variable is just a identification number

and its correlation with the “TNM_age” variable is accidental

coincidence. So there are not any other variables correlated with the

“age” variable and the “sex” variable obviously except for themselves.

Fig 5.1

Fig 5.2

Fig 5.3

Related Posts