CS 412: Fall’22
Introduction To Data Mining
Assignment 4
(Due Friday, December 02, 11:59 pm)
• The homework is due at 11:59 pm on the due date. We will be using Gradescope for the
homework assignments. You should join Gradescope using the entry code shared on Aug 3.
Please do NOT email a copy of your solution. Contact the TAs if you are having technical
difficulties in submitting the assignment. We will NOT accept late submissions!
• Please use Slack or Canvas first if you have questions about the homework. You can also
come to our (Zoom) office hours and/or send us e-mails. If you are sending us emails with
questions on the homework, please start the subject with “CS 412 Fall’22: ” and send the
email to all of us (Arindam, Mukesh, Chandni, Mayank, Hang, and Shiliang) for a faster response.
• Please write your code entirely by yourself.
Programming Assignment Instructions
– All programming needs to be in Python 3.
– The homework will be graded using Gradescope. You will be able to submit your code
as many times as you want.
– Two Python files named homework4_q1.py and homework4_q2.py containing starter code
are available on Canvas.
– For submitting on Gradescope, you need to upload both files: homework4_q1.py and
homework4_q2.py.
1. (50 points) The assignment will focus on developing code for Bayes’ theorem applied to
estimating the type of candy bag we have, based on candies drawn from the bag, following
the example discussed in class.
Suppose there are five types of bags of candies:
• π1 fraction are h1: p1 fraction “cherry” candies, (1− p1) fraction “lime” candies
• π2 fraction are h2: p2 fraction “cherry” candies, (1− p2) fraction “lime” candies
• π3 fraction are h3: p3 fraction “cherry” candies, (1− p3) fraction “lime” candies
• π4 fraction are h4: p4 fraction “cherry” candies, (1− p4) fraction “lime” candies
• π5 fraction are h5: p5 fraction “cherry” candies, (1− p5) fraction “lime” candies
For the specific example discussed in class:
π1 = 0.1 , π2 = 0.2 , π3 = 0.4 , π4 = 0.2 , π5 = 0.1 ,
p1 = 1 , p2 = 0.75 , p3 = 0.5 , p4 = 0.25 , p5 = 0 .
A bag is given to us, but we do not know which type of bag h ∈ {h1, h2, h3, h4, h5} it is. We
will be drawing a sequence of candies c1, c2, . . . , ci ∈ {“cherry”, “lime”} from the given bag
and, using Bayes rule, maintain posterior probabilities of the type of bag conditioned on the
candies which have been drawn, i.e.,
After drawing c1 : p(πh|c1) , h = 1, . . . , 5
After drawing c1, c2 : p(πh|c1, c2) , h = 1, . . . , 5
. . . . . .
We will make the following assumptions regarding the setup:
• The probabilities ph, h = 1, . . . , 5 of drawing “cherry” candies from the bag do not change
as candies are being drawn from the bag. As a result, the probabilities (1 − ph), h = 1, . . . , 5
of drawing “lime” candies from the bag also do not change as candies are being drawn.
• Candy draws from a bag are conditionally independent given the bag type, i.e.,
p(c1, c2|πh) = p(c1|πh)p(c2|πh) ,
p(c1, c2, c3|πh) = p(c1|πh)p(c2|πh)p(c3|πh) ,
and so on.
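Together, these assumptions let the posterior be maintained sequentially: after each draw, the
previous posterior is multiplied by the likelihood of the new candy and renormalized. Concretely,
for t = 1, . . . , 10,

p(πh|c1, . . . , ct) ∝ p(ct|πh) p(πh|c1, . . . , ct−1) , h = 1, . . . , 5 ,

where p(ct = cherry|πh) = ph, p(ct = lime|πh) = 1 − ph, the base case of the recursion is the
prior πh, and the proportionality constant is fixed by requiring the five posteriors to sum to 1.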
You will have to develop code for the following function:
my_Bayes_candy(π, p, c1:10), which uses the prior probabilities π, the conditional probabilities p,
and a sequence of 10 candy draws c1:10 to compute the posterior probability of each type of bag.
The function will have the following input:
• Prior probability of each type of bag: π = [π1, π2, π3, π4, π5]
• Conditional probability of cherry candies in each type of bag: p = [p1, p2, p3, p4, p5]
• Sequence of 10 candies drawn from the bag under consideration: c1:10 = [c1, c2, . . . , c10]
where ci ∈ {0, 1}, with “0” denoting a cherry candy and “1” denoting a lime candy. For the
specific example discussed in class:
c = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] .
The function will have the following output:
(a) A list of posterior probabilities of each type of bag πh, h = 1, . . . , 5 after every subse-
quence of candy draws, i.e., c1:t = [c1, . . . , ct], t ∈ {1, . . . , 10}.
In particular, the output will be the following posterior probabilities:
After drawing c1 : p(πh|c1) , h = 1, . . . , 5
After drawing c1, c2 : p(πh|c1, c2) , h = 1, . . . , 5
. . . . . .
After drawing c1, . . . , c10 : p(πh|c1, . . . , c10) , h = 1, . . . , 5
The function my_Bayes_candy will return a two-dimensional list of size 10 × 5 containing the
aforementioned posterior probabilities. Specifically, the returned list must contain probabilities
in the format:
[[p(π1|c1), p(π2|c1), . . . , p(π5|c1)],
[p(π1|c1, c2), . . . , p(π5|c1, c2)],
. . .
[p(π1|c1, . . . , c10), . . . , p(π5|c1, . . . , c10)]]
Note that your code should work for any given π, p, c1:10 and we will always consider 5 types
of bags and 2 types of candies in each bag. The specific example discussed in class can serve
as a test case.
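As a rough sketch of this computation (not the official starter code; the signature in
homework4_q1.py on Canvas is authoritative), the sequential update above could be implemented
along these lines, with pi, p, c standing in for π, p, c1:10:

def my_Bayes_candy(pi, p, c):
    # pi: prior probabilities [pi_1, ..., pi_5] of the five bag types
    # p:  probabilities [p_1, ..., p_5] of drawing a cherry candy (draw value 0)
    # c:  sequence of 10 draws, each 0 (cherry) or 1 (lime)
    posteriors = []
    current = list(pi)  # the posterior starts at the prior
    for ct in c:
        # likelihood of this draw under each bag type
        like = [p[h] if ct == 0 else 1 - p[h] for h in range(len(pi))]
        unnorm = [like[h] * current[h] for h in range(len(pi))]
        total = sum(unnorm)
        current = [u / total for u in unnorm]  # renormalize
        posteriors.append(current)
    return posteriors

On the class example above, the posterior after the first lime draw is [0, 0.1, 0.4, 0.3, 0.2], and
the mass concentrates on h5 as further limes are drawn, which makes a convenient sanity check.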
2. (50 points) The assignment will focus on developing your own code for random train-test
split validation.
Your code will be evaluated using five standard classification models applied to a multi-class
classification dataset.
Dataset: We will be using the following dataset for the assignment.
Digits: The Digits dataset comes prepackaged with scikit-learn (sklearn.datasets.load_digits).
The dataset has 1797 points, 64 features, and 10 classes corresponding to ten numbers
0, 1, . . . , 9. The dataset was (likely) created from the following dataset:
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
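For reference, the dataset can be loaded with the standard scikit-learn call:

from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)  # X has shape (1797, 64); y holds labels 0-9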
Classification Methods. We will consider five classification methods from scikit-learn:
– Linear support vector classifier: LinearSVC,
– Support vector classifier: SVC,
– Logistic Regression: LogisticRegression,
– Random Forest Classifier: RandomForestClassifier, and
– Gradient Boosting Classifier: XGBClassifier.
Use the following parameters for these methods:
– LinearSVC: max_iter=2000
– SVC: gamma=‘scale’, C=10
– LogisticRegression: penalty=‘l2’, solver=‘lbfgs’, multi_class=‘multinomial’
– RandomForestClassifier: max_depth=20, random_state=0, n_estimators=500
– XGBClassifier: max_depth=5
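For reference, one way to instantiate the five models with these parameters is sketched below.
Note that XGBClassifier ships with the separate xgboost package rather than scikit-learn; the
import paths are the standard ones:

from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    "LinearSVC": LinearSVC(max_iter=2000),
    "SVC": SVC(gamma="scale", C=10),
    "LogisticRegression": LogisticRegression(penalty="l2", solver="lbfgs",
                                             multi_class="multinomial"),
    "RandomForestClassifier": RandomForestClassifier(max_depth=20, random_state=0,
                                                     n_estimators=500),
    "XGBClassifier": XGBClassifier(max_depth=5),
}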
Develop code for my_train_test(method, X, y, π, k), which performs random splits on the data
(X, y) so that a π ∈ [0, 1] fraction of the data is used for training with method and the rest is
used for testing, repeats the process k times, and then returns the error rate for each such
train-test split. Your my_train_test will be tested with π = 0.75 and k = 10 on the five
methods, LinearSVC, SVC, LogisticRegression, RandomForestClassifier, and XGBClassifier,
applied to the Digits dataset.
You will have to develop code for the following function:
my_train_test(method, X, y, π, k), which performs random train-test split based evaluation of
method, with a π fraction of the data used for training in each split.
The function will have the following input:
(1) method, which specifies the (class) name of one of the five classification methods under
consideration,
(2) X,y which is data for the classification problem,
(3) π, the fraction of data chosen randomly to be used for training,
(4) k, the number of times the train-test split will be repeated.
The function will have the following output:
(a) A list of the test set error rates for each of the k splits.
Error rate should be calculated as

error rate = (# of wrong predictions) / (# of total predictions).

The (auto)grader will compare the
mean and standard deviation of your list with our solution; it must be within three standard
deviations.
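As a rough sketch of the required logic (homework4_q2.py on Canvas is authoritative, and
details such as seeding and how method is passed may differ in the grader), one possible
implementation:

import numpy as np

def my_train_test(method, X, y, pi, k):
    # method: a scikit-learn-style classifier class (e.g., LinearSVC)
    # X, y:   data as numpy arrays; pi: training fraction; k: number of splits
    rng = np.random.default_rng()
    n = len(y)
    n_train = int(pi * n)
    error_rates = []
    for _ in range(k):
        perm = rng.permutation(n)  # a fresh random split each repetition
        train_idx, test_idx = perm[:n_train], perm[n_train:]
        model = method()  # in your submission, pass the parameters listed above
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        error_rates.append(float(np.mean(y_pred != y[test_idx])))
    return error_rates

Because the grader only checks that the mean and standard deviation of the returned list fall
within the stated tolerance of the reference solution, the splits do not need to reproduce any
particular random seed.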