MS6711 Data Mining
Exercise 2

• In real-world data, observations with missing values for some attributes are a common occurrence. Describe various methods for handling this problem.

• Suppose that the data for analysis include the attribute age. The age values for the data are: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
Group the above values into 5 bins with approximately equal frequency.
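
One way to approach equal-frequency binning is to sort the values and cut the sorted list into consecutive groups whose sizes differ by at most one. The following is a minimal Python sketch of that idea, not a prescribed solution; the function name `equal_frequency_bins` is hypothetical. Because 27 is not divisible by 5, the bins hold 6, 6, 5, 5, and 5 values, and tied values (such as 20 and 25) can land on both sides of a boundary, which is why the frequencies are only approximately equal.

```python
def equal_frequency_bins(values, k):
    """Split values into k consecutive bins whose sizes differ by at most one."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), k)
    bins, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` bins get one more value
        bins.append(ordered[start:start + size])
        start += size
    return bins


ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

age_bins = equal_frequency_bins(ages, 5)
for number, members in enumerate(age_bins, start=1):
    print(f"Bin {number} ({len(members)} values): {members}")
```

A stricter variant would move a boundary so that equal values always share a bin, at the cost of less even bin sizes.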

• Data were collected on the nutritional information and consumer rating of 77 breakfast cereals. For each cereal the data include 16 variables. These variables are described in Figure 3a.
Figure 3a
Variable    Description
CALORIES    Calories per serving
CARBO       Grams of complex carbohydrates
CUPS        Number of cups in one serving
FAT         Grams of fat
FIBER       Grams of dietary fibre
NAME        Name of cereal
MFR         Manufacturer of cereal
POTASS      Milligrams of potassium
PROTIEN     Grams of protein
RATING      Rating (1.00-100.00) of the cereal calculated by Consumer Council
SHELF       Display shelf (1, 2, or 3, counting from the floor)
SODIUM      Milligrams of sodium
SUGARS      Grams of sugars
TYPE        Cold or hot
VITAMINS    Vitamins and minerals: 0, 25, or 100, indicating the typical percentage of FDA recommended
WEIGHT      Weight in ounces of one serving

• Which variables are at binary level? Write ‘Nil’ if none of them is binary.
• Which variables are at nominal level? Write ‘Nil’ if none of them is nominal.
• Which variables are at ordinal level? Write ‘Nil’ if none of them is ordinal.
• Based on the summary statistics given in Figure 3b, name three variables that seem to be skewed to the right. Justify your answer.

Figure 3b
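
A common rule of thumb for reading skewness from summary statistics is to compare the mean with the median: a mean noticeably larger than the median suggests a long right tail. The sketch below illustrates that comparison in Python. Since the contents of Figure 3b are not reproduced in this text, it uses the age values from the earlier binning question as stand-in data; that substitution is an assumption made purely for illustration.

```python
from statistics import mean, median

# Stand-in data (assumption): the age values from the binning question,
# used here only because the Figure 3b statistics are not available in the text.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

age_mean = mean(ages)
age_median = median(ages)
print(f"mean = {age_mean:.2f}, median = {age_median}")

# Rule of thumb only: a mean above the median hints at a right (positive) skew.
if age_mean > age_median:
    print("Mean exceeds median: the distribution may be skewed to the right.")
else:
    print("Mean does not exceed median: no sign of right skew from this check.")
```

The same comparison can be applied to each variable in a summary table; a maximum that lies far above the upper quartile points in the same direction.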

• Explain the differences between equal depth binning (quantile) and equal width binning (bucket).
• Figure 3c shows the distribution of CALORIES. If you are asked to bin the variable CALORIES for the purpose of identifying clusters of cereal brands, which binning method would you apply to the variable? Why?

Figure 3c
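
The contrast between the two binning methods can be seen on a small example. The Python sketch below is illustrative only: it reuses the age values from the earlier question because the CALORIES data are not included in this text, and the function names `equal_width_counts` and `equal_depth_ranges` are hypothetical. Equal-width binning fixes the interval width and lets the counts vary; equal-depth binning fixes the counts and lets the interval widths vary.

```python
def equal_width_counts(values, k):
    """Count how many values fall into each of k intervals of identical width."""
    low, high = min(values), max(values)
    width = (high - low) / k
    counts = [0] * k
    for v in values:
        index = min(int((v - low) / width), k - 1)  # the maximum goes in the last bin
        counts[index] += 1
    return counts


def equal_depth_ranges(values, k):
    """Return the (min, max) of each of k bins holding nearly equal numbers of values."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), k)
    ranges, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        chunk = ordered[start:start + size]
        ranges.append((chunk[0], chunk[-1]))
        start += size
    return ranges


ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

print("Equal-width counts :", equal_width_counts(ages, 5))
print("Equal-depth ranges :", equal_depth_ranges(ages, 5))
```

With these values the equal-width bins are very uneven in count, while the equal-depth bins cover ranges of very different width.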

Exercises for SAS EM
• Data sets Adult1a, Adult1b, Adult2a, and Adult2b contain US census data. When the data were prepared, different sets of variables for the same record were stored in two separate text files, Adult1a and Adult1b. Records in the two files can be matched on the ID variable. Adult2a and Adult2b have the same columns as Adult1a and Adult1b respectively; the difference is that Adult1a and Adult1b contain records with income higher than 50K, whereas Adult2a and Adult2b contain records with income not more than 50K. The variables are described in the following tables.

Adult1a.sas7bdat, Adult2a.sas7bdat
Variable          Type
ID                Identification (Max length: 10)
Age               Continuous
Marital Status    Married-civ-spouse, Divorced, Never-married, …, etc. (Max length: 30)
Race              White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black (Max length: 30)
Sex               Female, Male

Adult1b.sas7bdat, Adult2b.sas7bdat
Variable                       Type
ID                             Identification (Max length: 10)
Education                      Bachelors, Some-college, 11th, …, etc. (Max length: 30)
Education (number of years)    Continuous
Occupation                     Tech-support, Craft-repair, Other-service, Sales, …, etc. (Max length: 30)
Relationship                   Wife, Own-child, Husband, …, etc. (Max length: 30)
Capital-gain                   Continuous
Capital-loss                   Continuous
Hours-per-week                 Continuous
Native-country                 United-States, Cambodia, England, Puerto-Rico, Canada, …, etc. (Max length: 30)
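
The relationship between the four files can be sketched outside SAS EM as follows: match the "a" and "b" parts on ID, label each merged set with its income level, and stack the two results. This Python sketch is an illustration under stated assumptions, not the SAS EM procedure itself. The records are made up, only a few of the variables are shown, the helper `merge_by_id` is hypothetical, and it keeps only IDs found in both parts, which is a simplifying choice.

```python
def merge_by_id(part_a, part_b):
    """Combine records that share an ID; IDs missing from either part are dropped."""
    lookup = {row["ID"]: row for row in part_b}
    return [{**row, **lookup[row["ID"]]} for row in part_a if row["ID"] in lookup]


# Hypothetical records (not taken from the real data sets).
adult1a = [{"ID": "A1", "Age": 45, "Sex": "Male"},
           {"ID": "A2", "Age": 38, "Sex": "Female"}]
adult1b = [{"ID": "A2", "Education": "Bachelors"},
           {"ID": "A1", "Education": "Some-college"}]
adult2a = [{"ID": "B1", "Age": 23, "Sex": "Female"}]
adult2b = [{"ID": "B1", "Education": "11th"}]

# Merge each pair on ID, then add the INCOME label for that pair.
high = [dict(row, INCOME="More than 50K") for row in merge_by_id(adult1a, adult1b)]
low = [dict(row, INCOME="Not more than 50K") for row in merge_by_id(adult2a, adult2b)]

# Stack the two labelled sets into one.
stacked = high + low
for row in stacked:
    print(row)
```

The row order of the "b" file does not matter here because the match is made through a dictionary keyed on ID.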

(a) Create a new SAS EM project named ‘Exercise 2’. Import the four SAS data sets into the project. Adjust the assigned Measurement Level if necessary.
(b) Merge the observations with the same ID in the two SAS data sets Adult1a.sas7bdat and Adult1b.sas7bdat using the Merge node. (Connect the two Data Source nodes to a Merge node. In the Property panel of the Merge node, set the Merging property to Match. In the Variables window, change the Merge Role of ID to By. Run the Merge node.)
(c) Create a new variable named INCOME of length 20 with the value ‘More than 50K’ for all observations in the merged data set created in (b).
(d) Repeat (b) and (c) for the two SAS data sets Adult2a.sas7bdat and Adult2b.sas7bdat, but change the value of INCOME to ‘Not more than 50K’ for all observations in the merged data set.
(e) Stack the two merged data sets into a single SAS data set.
(f) Briefly describe the distributions of the interval variables contained in the data set. Are there any missing values? Did you observe anything unusual with these variables?
(g) Briefly describe the distributions of the categorical variables contained in the data set. Are there any missing values? Did you observe anything unusual with these variables?
(h) The variables Capital_gain and Capital_loss contain many missing values. The actual values of these missing values should be 0. Use the Impute node to replace the missing values of these two variables by 0. (In the Property panel of the Impute node, set the default imputation method for interval and class variables to None; type 0 for the Default Number Value of the Default Constant Value property. In the Variables window, click the Method cell of each of the two variables and select the Constant imputation method. Run the node.)
(i) Use another Impute node to replace the missing values for all other variables. For interval variables, replace the missing values by the average of the respective variable. For all categorical variables, replace the missing values by the label ‘Unknown’.
(j) Use a Transform node to bin the variables Age, Education_year, Capital_gain, Capital_loss, and Hours_per_week into 10 bins of equal frequency. Did you get 10 such bins for each variable? If not, explain why you could not derive 10 bins of equal frequency for some of the variables.
(k) Use a Sample node to select a simple random sample of 10,000 observations from the last modified data set derived in (j). (In the Property panel of the Sample node, select the Random method from the drop-down list of the Sample Method property; select Number of Observations from the drop-down list of the Type property; enter 10000 into the Observations property; run the node.)
(l) Use another Sample node to select another sample from the last modified data set derived in (j). This sample should have 3,000 observations from each of the two income levels. (In the Property panel of the Sample node, select the Stratify method from the drop-down list of the Sample Method property; select Number of Observations from the drop-down list of the Type property; enter 6000 into the Observations property; select the Equal criterion from the drop-down list of Criterion in the Stratified property; click the Sample Role cell of Income in the Variables window and select Stratification; run the node.)
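
For readers who want to see the logic behind the imputation and sampling steps above, here is a rough Python sketch using only the standard library. It is not what SAS EM runs internally. The six records are made up, missing values are represented by `None`, the helper names are hypothetical, and the sample sizes are scaled down from 10,000 and 3,000 per level to 4 and 2 per level so the example stays small.

```python
import random
from statistics import mean


def impute_constant(rows, column, value):
    """Replace missing entries (None) in one column with a fixed value."""
    for row in rows:
        if row[column] is None:
            row[column] = value


def impute_mean(rows, column):
    """Replace missing entries (None) with the mean of the observed values."""
    fill = mean(row[column] for row in rows if row[column] is not None)
    impute_constant(rows, column, fill)


def stratified_sample(rows, column, per_level, rng):
    """Draw the same number of rows from every level of the given column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row)
    sample = []
    for level in sorted(groups):
        sample.extend(rng.sample(groups[level], per_level))
    return sample


# Hypothetical records (not taken from the real data sets).
rows = [
    {"Age": 40, "Capital_gain": None, "Occupation": "Sales", "INCOME": "More than 50K"},
    {"Age": None, "Capital_gain": 5000, "Occupation": None, "INCOME": "More than 50K"},
    {"Age": 50, "Capital_gain": None, "Occupation": "Tech-support", "INCOME": "More than 50K"},
    {"Age": 30, "Capital_gain": None, "Occupation": "Craft-repair", "INCOME": "Not more than 50K"},
    {"Age": 20, "Capital_gain": 0, "Occupation": None, "INCOME": "Not more than 50K"},
    {"Age": 24, "Capital_gain": None, "Occupation": "Sales", "INCOME": "Not more than 50K"},
]

impute_constant(rows, "Capital_gain", 0)        # missing gains are really zero
impute_mean(rows, "Age")                        # interval variable: use the average
impute_constant(rows, "Occupation", "Unknown")  # categorical variable: use a label

rng = random.Random(0)                          # fixed seed so the run is repeatable
simple = rng.sample(rows, 4)                    # simple random sample
balanced = stratified_sample(rows, "INCOME", 2, rng)  # equal count per income level

print(len(simple), "rows in the simple random sample")
print([row["INCOME"] for row in balanced])
```

Imputing Capital_gain with 0 before the mean imputation mirrors the order of the steps above, so the constant fill is not overridden by an average.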