FTEC2101/ESTR2520 Optimization Methods Spring 2022
Project Specification – Binary Classification
Last Updated: April 25, 2022, Deadline: May 13, 2022, 23:59 (HKT)
Thus far we have learnt a number of optimization methods, ranging from the simplex method for linear programming, modeling techniques for integer programming, gradient/Newton methods for unconstrained optimization, KKT conditions, SOCP formulation, etc. Besides the theory, please remember that optimization methods are practical tools for solving real-world problems. The aim of this project is to practice our skills on the latter aspect. We will make use of the Julia1 environment.
The project’s theme is about practical solutions for binary classification. It can be considered as an extension to what we have experimented with in Computing Lab 2. We will explore additional aspects of binary classification and, more importantly, implement and compare these ideas on a real-world dataset.
Background. Let us recall that the binary classification task aims at learning a linear classifier that distinguishes a feature vector as positive ($+1$) or negative ($-1$). Let $d \in \mathbb{N}$ be the dimension of the feature vector; the (linear) classifier is described by the tuple $(w, b)$, where $w \in \mathbb{R}^d$ is the direction parameter and $b \in \mathbb{R}$ is the bias parameter2, both specifying the linear model. Given $(w, b)$, a feature vector $x \in \mathbb{R}^d$ is classified into a label $y \in \{\pm 1\}$ such that

$$y = \begin{cases} +1, & \text{if } w^\top x + b = w_1 x_1 + \cdots + w_d x_d + b \ge 0, \\ -1, & \text{if } w^\top x + b = w_1 x_1 + \cdots + w_d x_d + b < 0. \end{cases} \tag{1.1}$$
As an illustration, the following figure shows a case with d = 2 such that the classifier is described by w = (w1, w2) and the scalar b:
[Figure: a 2-D illustration of the classifier, showing the separating line $w_1 x_1 + w_2 x_2 + b = 0$.]
According to the rule in (1.1), all the feature vectors that lie above the dashed line shall be classified as $+1$ (blue points); those that lie below the dashed line shall be classified as $-1$ (red points).
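As a minimal illustration of the rule (1.1) in Julia (a sketch only; the function name `classify` and the example values are for illustration and are not part of the project template):

```julia
# Classify a feature vector x using the linear classifier (w, b),
# following rule (1.1): +1 if w'x + b >= 0, and -1 otherwise.
function classify(w::Vector{Float64}, b::Float64, x::Vector{Float64})
    return (w' * x + b >= 0) ? 1 : -1
end

# Example with d = 2: here w'x + b = 0.2 - 0.4 + 0.5 >= 0, so the label is +1.
classify([1.0, -1.0], 0.5, [0.2, 0.4])
```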
1As an alternative, you are welcome to use Python with optimization modeling packages supporting SOCP and MI-NLP such as cvxpy. However, in the latter case you have to notify the instructor about the choice and the plan on how you wish to accomplish the project on or before May 1, 2022, i.e., two weeks before the deadline. In the project, you are not allowed to use any pre-built solver for binary classification such as MLJ.jl, ScikitLearn.jl, Flux.jl, or scikit-learn in your final submission (though you are encouraged to try out these packages as extra work to support your solution).
2Note that this differs slightly from Lecture 16 as we include the bias parameter in the classifier design.
Dataset & Folder Setting We have $m \in \mathbb{N}$ samples of training data described by $\{x^{(i)}, y^{(i)}\}_{i=1}^m$, where $x^{(i)} \in \mathbb{R}^d$ is the $i$-th feature vector and $y^{(i)} \in \{\pm 1\}$ is the associated label. We will use a curated version of the GiveMeSomeCredits dataset taken from https://www.kaggle.com/c/GiveMeSomeCredit. It includes customer data such as their debt ratios, credit utilization, age, etc., and the history of defaulting. With this information, our goal is to learn a classifier that predicts if a customer has any risk of defaulting before the bank approves loan(s) for him/her.
To prepare for the project, please retrieve the Jupyter notebook template ftec2101-project-template-2022.ipynb and the zip archive ftec project files.zip from the Blackboard. Place the *.ipynb files in a working directory, extract the *.zip file, and move the *.csv files inside the same working directory. Your working directory should have the following content:
Notice that:
• ftec-cs-groupi-train.csv – the training dataset for students in group i, i = 1, 2, 3, 4, 5. This is a csv file that contains 20 samples of customer data to be used in the Compulsory Tasks.
• ftec-cs-groupi-test.csv – the testing dataset for students in group i, i = 1, 2, 3, 4, 5. This is a csv file that contains 30 samples of customer data to be used in the Compulsory Tasks.
• ftec-cs-full-train.csv – the training dataset that contains 16657 samples of customer data to be used in the Competitive Tasks.
• ftec-cs-full-test.csv – the testing dataset that contains 34870 samples of customer data to be used in the Competitive Tasks.
Lastly, the Jupyter notebook ftec2101 project 2022.ipynb provides detailed descriptions and helper codes to guide you through the tasks required by this project. Please pay attention to the comments provided inside the code cells of the Jupyter notebook as well.
1.1 Compulsory Tasks (50%)
The compulsory tasks of this project are divided into two parts. You are required to
(A) answer questions on the optimization theory and modeling related to binary classification; and
(B) implement the modeled optimization problems on a computer and solve them with the GiveMeSomeCredits dataset.
Theory and Modeling Denote the training dataset with $m$ samples as $\{x^{(i)}, y^{(i)}\}_{i=1}^m$, where $x^{(i)} \in \mathbb{R}^d$ is the $i$-th feature vector and $y^{(i)} \in \{\pm 1\}$ is the associated label. Let $\lambda > 0$ be a fixed scalar. The following optimization problem designs what is known as the hard-margin classifier [Sa, 2022]:

$$\min_{w \in \mathbb{R}^d, \, b \in \mathbb{R}} \; \frac{\lambda}{2} \|w\|^2 \quad \text{s.t.} \quad y^{(i)} \big( (x^{(i)})^\top w + b \big) \ge 1, \quad i = 1, \ldots, m. \tag{1.2}$$

It can be easily shown that (1.2) is a convex optimization problem.
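For illustration, a minimal JuMP sketch of (1.2) is given below. It is a sketch only: it assumes the training data are stored as an m-by-d matrix `X` (one row per sample) and a label vector `y` in {+1, -1}, and it will only succeed when the data are linearly separable (cf. Task 1-(c)):

```julia
using JuMP, Ipopt

# A sketch of the hard-margin problem (1.2); lambda > 0 is a fixed scalar.
function hard_margin(X::Matrix{Float64}, y::Vector{Float64}, lambda::Float64)
    m, d = size(X)
    model = Model(Ipopt.Optimizer)
    @variable(model, w[1:d])
    @variable(model, b)
    # Objective: (lambda / 2) * ||w||^2
    @objective(model, Min, (lambda / 2) * sum(w[j]^2 for j in 1:d))
    # Margin constraints: y_i * ((x_i)' w + b) >= 1 for every sample i
    @constraint(model, [i in 1:m], y[i] * (X[i, :]' * w + b) >= 1)
    optimize!(model)
    return value.(w), value(b)
end
```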
Task 1: Hard-margin Formulation (7.5%)
Answer the following questions:
(a) Explain how solving (1.2) finds a classifier $(w^\star, b^\star)$ that can correctly distinguish the $m$ training samples into the $+1$ or $-1$ labels. (Hint: focus on the constraints in (1.2) and compare them to (1.1) as well as the figure that explains it.)
(b) Derive the KKT conditions for (1.2). Using the KKT conditions, show that the optimal $w^\star$ is a linear combination of the $x^{(i)}$ for the samples $i$ such that $1 = y^{(i)} \big( (x^{(i)})^\top w^\star + b^\star \big)$.
(c) Give an example of a training dataset with $d = 2$ for which (1.2) is infeasible. You may describe such a dataset by drawing it on a 2-D plane.
Due to the issue of infeasibility as investigated in part (c) of Task 1, the hard-margin formulation (1.2) is seldom used in practice. Instead, we consider a soft-margin formulation, which introduces a non-negative ‘auxiliary decision variable’3 $\xi_i$ for each of the constraints, yielding:

$$y^{(i)} \big( (x^{(i)})^\top w + b \big) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, m. \tag{1.3}$$

Instead of forcing $y^{(i)} \big( (x^{(i)})^\top w + b \big) \ge 1$ for all training samples, i.e., requiring that all the training samples are correctly classified, the above constraint with the auxiliary variable $\xi_i$ allows for erroneous classification. The auxiliary variable $\xi_i$ is interpreted as the amount of error for the $i$-th sample. Notice that $\xi_i > 0$ if and only if the margin constraint is violated for the $i$-th sample, i.e., $y^{(i)} \big( (x^{(i)})^\top w + b \big) < 1$. Ideally, we should find $(w, b)$ such that the number of samples $i$ with $\xi_i > 0$ is minimized.
Below, you will formulate a mixed-integer program (in fact, a mixed-integer non-linear program, MI-NLP) for the soft-constraint optimization.
Task 2: Mixed Integer Formulation (5%)
Answer the following question: Based on the soft-margin constraint (1.3) [as well as the max-margin problem (1.2)], formulate an optimization problem with the following specifications:
• The objective is to minimize the total number of auxiliary decision variables $\xi_i$ with $\xi_i > 0$ (i.e., the total number of errors).
• The amount of error is bounded such that

$$0 \le \xi_i \le M, \quad i = 1, \ldots, m,$$

where $M$ is a known/given constant.

3Note that it bears some resemblance to the artificial variable technique we learnt in the Big-M simplex method.
• The soft-margin constraints (1.3) are satisfied.
• The directional parameter $w \in \mathbb{R}^d$ satisfies the following shaping constraint:

$$w^\top \Sigma w + c^\top w \le 1,$$

where $\Sigma \in \mathbb{R}^{d \times d}$ is a given symmetric, positive definite matrix, and $c \in \mathbb{R}^d$ is a given vector.
The MIP is difficult to solve for large-scale problems with $m \gg 1$. As a remedy, we shall approximate the objective function in Task 2 using a linear function in $\{\xi_i\}_{i=1}^m$. Consider the following task:
Task 3: SOCP Formulation (5%)
Answer the following questions:
(a) Based on the soft-margin constraint (1.3) [and the max-margin problem (1.2)], formulate an optimization problem with the following specifications (notice that unlike in Task 2, we do not impose the constraint $\xi_i \le M$):
• The objective is to minimize the weighted sum of constraint violations over all training samples:

$$\frac{1}{m} \sum_{i=1}^m \ell_i \, \xi_i,$$

where $\ell_i > 0$, $i = 1, \ldots, m$, is a set of given weights.
• The soft-margin constraints (1.3) are satisfied.
• The directional parameter $w \in \mathbb{R}^d$ satisfies the following shaping constraint:

$$w^\top \Sigma w + c^\top w \le 1,$$

where $\Sigma \in \mathbb{R}^{d \times d}$ is a given symmetric, positive definite matrix and $c \in \mathbb{R}^d$ is a given vector.
(b) Show that the problem in part (a) can be reformulated as an SOCP.
Computational We next focus on solving the optimization problems formulated in Tasks 2 & 3 on computers. We structure the tasks into 3 stages: data analysis, optimization, and interpretation of results. The Jupyter notebook provides descriptions and helper codes to guide you through most of the following tasks. Please pay attention to the comments in the Jupyter notebook as well.
Data Analysis. In the compulsory tasks, we will focus on a small dataset that consists of only $m = 20$ samples (stored in ftec-cs-groupi-train.csv). The training dataset contains the information of
”RevolvingUtilizationOfUnsecuredLines”, ”age”, ”DebtRatio”, ”MonthlyIncome”, ”NumberOfOpenCreditLinesAndLoans”, ”NumberRealEstateLoansOrLines”, ”NumberOfDependents”
for $m = 20$ customers. Notice that the dataset has been preprocessed such that the above features are normalized, with typical values ranging between 0 and 1 (there are some exceptions though).
Moreover, the training dataset contains the information of whether the customer has a history of “Default” or not. We treat the latter information as the label $y^{(i)}$ to train our classifier to decide if the customer is worth granting the loan or not.
We shall first get ourselves familiar with the dataset. In particular, we shall visualize the dataset to observe the behavior pattern of a customer with or without a ‘Default’ history.
Task 4: Warmup Exercise (5%)
(a) Inspect the dataset by making a 2-D scatter plot of the 20 samples over the features ‘RevolvingUtilizationOfUnsecuredLines’ and ‘NumberOfOpenCreditLinesAndLoans’.
Mark the data points of ‘Default’ (resp. ‘Non-default’) customers in red (resp. blue). Comment on the pattern observed for ‘Default’ customers.
(b) Try 2-3 more combinations of pairs of features and make comments on the observations.
Remark: The program template has provided the relevant helper codes for this task, but you may have to ‘tweak’ the template to examine other pairs of features in part (b).
For part (a), you may observe a similar output to the following:

[Figure: example 2-D scatter plot of the 20 training samples over the two features, with ‘Default’ customers in red and ‘Non-default’ customers in blue.]
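A minimal plotting sketch for part (a) is given below. It is a sketch only: the label column is assumed to be named `Default` with values 0/1 (check the actual header of your group's csv file), and `group1` should be replaced by your group number:

```julia
using CSV, DataFrames, Plots

# Load the group training data (replace group1 with your group number).
df = CSV.read("ftec-cs-group1-train.csv", DataFrame)
is_def = df.Default .== 1     # assumed label column name

scatter(df.RevolvingUtilizationOfUnsecuredLines[is_def],
        df.NumberOfOpenCreditLinesAndLoans[is_def];
        color = :red, label = "Default")
scatter!(df.RevolvingUtilizationOfUnsecuredLines[.!is_def],
         df.NumberOfOpenCreditLinesAndLoans[.!is_def];
         color = :blue, label = "Non-default",
         xlabel = "RevolvingUtilizationOfUnsecuredLines",
         ylabel = "NumberOfOpenCreditLinesAndLoans")
```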
Classifier Optimization. In the following tasks, you will implement classifier designs based on the MI-NLP and SOCP from Tasks 2 & 3.
As a first step, we shall design the classifier based on only two features in the data, namely, ‘RevolvingUtilizationOfUnsecuredLines’ and ‘NumberOfOpenCreditLinesAndLoans’. (1.4)
In other words, we consider the classifier design problems [cf. Task 2 & 3] with d = 2 first.
Task 5: Optimization-based Formulation (17.5%)
(a) Based on the selected features in (1.4), implement and solve the MI-NLP from Task 2 with the following parameters:
$$M = 100, \quad \Sigma = \frac{1}{200} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad c = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \ell_i = 1, \; i = 1, \ldots, m.$$
You may use the solver Juniper (with Ipopt) in JuMP for tackling the MI-NLP.
(b) Based on the selected features in (1.4), implement and solve the SOCP from Task 3 with the following parameters:
$$\Sigma = \frac{1}{200} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad c = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \ell_i = 1, \; i = 1, \ldots, m.$$
You may use the solver ECOS in JuMP for tackling the SOCP.
(c) Make a scatter plot of the training dataset for the features in (1.4) as in Task 4-(a). Then, overlay the linear classifiers found in part (a) and part (b) on top of this scatter plot. Comment on whether the classifiers found are reasonable.
Notice that it may take a long time to solve the MI-NLP in Task 5-(a) since an MI-NLP problem is quite challenging to solve in general. The scatter plot in Task 5-(c) may look like the following, which indicates the classifiers found by the MIP and SOCP formulations:

[Figure: scatter plot of the training data with the two linear classifiers overlaid.]
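As a hedged illustration of the SOCP machinery in JuMP (this shows only the generic pattern, not the full Task 3 formulation, which is your own answer): a constraint of the form $\|A w\|_2 \le t$ can be posed with ECOS as follows, where `A` is a placeholder matrix:

```julia
using JuMP, ECOS, LinearAlgebra

d = 2
A = Matrix{Float64}(I, d, d)   # placeholder, e.g., a Cholesky factor of Sigma
model = Model(ECOS.Optimizer)
@variable(model, w[1:d])
@variable(model, t)
# [t; A * w] in SecondOrderCone() encodes t >= ||A * w||_2
@constraint(model, [t; A * w] in SecondOrderCone())
@objective(model, Min, t)      # trivial objective, for demonstration only
optimize!(model)
```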
Similar to Computing Lab 2, the performance of a classifier can be evaluated by the error rate when it is applied on a certain set of data. For instance, recalling that $\{x^{(i)}, y^{(i)}\}_{i=1}^m$ is the training dataset, the training error rate for a given classifier $(w, b)$ can be estimated as
$$\text{Training Error Rate} = \frac{1}{m} \sum_{i=1}^m \mathbb{1}\{ y^{(i)} \neq \hat{y}^{(i)} \}, \quad \text{where} \quad \hat{y}^{(i)} = \begin{cases} +1, & \text{if } w^\top x^{(i)} + b \ge 0, \\ -1, & \text{if } w^\top x^{(i)} + b < 0, \end{cases}$$

i.e., it is the number of errors made by the classifier divided by the number of samples. Notice that $0 \le \text{Training Error Rate} \le 1$.
As our aim is to design a classifier that makes predictions on whether a future customer (i.e., one not found in the training dataset) will ‘default’ on his/her loans, it is necessary to evaluate the error rate on a testing dataset that is unseen by the program. Denote the testing dataset with $m_{\text{test}}$ samples as $\{x^{(i)}_{\text{test}}, y^{(i)}_{\text{test}}\}_{i=1}^{m_{\text{test}}}$; the testing error rate for a classifier $(w, b)$ can be estimated as

$$\text{Testing Error Rate} = \frac{1}{m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} \mathbb{1}\{ y^{(i)}_{\text{test}} \neq \hat{y}^{(i)}_{\text{test}} \}, \quad \text{where} \quad \hat{y}^{(i)}_{\text{test}} = \begin{cases} +1, & \text{if } w^\top x^{(i)}_{\text{test}} + b \ge 0, \\ -1, & \text{if } w^\top x^{(i)}_{\text{test}} + b < 0. \end{cases}$$
To this end, we consider the following task:
Task 6: Error Performance (5%)
Notice that for our project, the testing dataset is stored in ftec-cs-groupi-test.csv.

(a) Write a function to evaluate the testing/training error rate as defined in the above.
(b) Evaluate and compare the error rate performances for the two classifiers you have found in Task 5.
Remark: Try to make the function for evaluating the error in (a) general, such that it takes a dataset of any size and features of any dimension. You will have to reuse the same function in Tasks 7 & 8.
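For reference, a minimal sketch of such a general routine is given below; the function signature is an illustration (it is not the template's helper code) and it implements the error-rate formulas above for a dataset of any size $m$ and dimension $d$:

```julia
# Error rate of the classifier (w, b) on a dataset (X, y);
# X is m-by-d (one row per sample), y holds labels in {+1, -1}.
function error_rate(w::Vector{Float64}, b::Float64,
                    X::Matrix{Float64}, y::Vector{Float64})
    yhat = [(X[i, :]' * w + b >= 0) ? 1.0 : -1.0 for i in 1:size(X, 1)]
    return sum(yhat .!= y) / length(y)
end
```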
Intuitively, the performance of a classifier can be improved when more features are included. As the last task in the compulsory section, we shall use all the $d = 7$ features of the GiveMeSomeCredit dataset to design our classifier:
Task 7: Full Classifier (5%)
(a) Repeat Task 5-(a), 5-(b), but design the classifiers based on all the d = 7 features with the following parameters:
$$M = 100, \quad \Sigma = \frac{1}{200} I_7, \quad c = 0, \quad \ell_i = 1, \; i = 1, \ldots, m,$$

where $I_7$ is the $7 \times 7$ identity matrix, and $0$ is a $7 \times 1$ all-zero vector. By inspecting the optimized classifier from the SOCP, describe which features will be important for deciding if a customer is likely to default. Comment on whether this observation is reasonable or not.
(b) Evaluate and compare the error rate performances for the two classifiers you have found. Is the training error rate better than the classifier based on only d = 2 features in Task 5? What about the testing error rate? Comment on your observations.
(c) Change parameters such as adjusting the shaping constraint, e.g., change $\Sigma$ to a differently scaled identity matrix, and observe its effects on the error rate, etc. Comment on your observations.
1.2 Competitive Tasks (30%)
Unlike the compulsory tasks, the goal of this competitive task is to implement your own (approximate) solver for the binary classification problem, without relying on JuMP and its optimizers such as ECOS, Juniper, etc. To motivate this, we observe that while optimization packages such as JuMP are convenient to use, they are often limited in scalability for large-scale problems where the number of training samples $m \gg 1$ and/or the feature dimension $d \gg 1$. The task would require considerably more advanced coding skills.
Our objectives are to find the linear classifier with the best training/testing error and the lowest computation cost, using a custom-made iterative algorithm such as projected gradient descent, with the shaping constraint specialized to:

$$\Sigma = I_7, \quad c = 0.$$
Under the above specialization, the SOCP problem in Task 3-(a) will be equivalent to:

$$\min_{w \in \mathbb{R}^d, \, b \in \mathbb{R}} \; f(w, b) := \frac{1}{m} \sum_{i=1}^m \ell_i \max\left\{ 0, \, 1 - y^{(i)} \big( (x^{(i)})^\top w + b \big) \right\} \quad \text{s.t.} \quad w^\top w \le 1. \tag{1.5}$$

Note that the above problem admits a constraint set for $(w, b) \in \mathbb{R}^{d+1}$ given by

$$\mathcal{X} = \{ (w, b) : w \in \mathbb{R}^d, \; b \in \mathbb{R}, \; w^\top w \le 1 \}. \tag{1.6}$$
An obstacle in applying gradient-based algorithms to (1.5) lies in the non-differentiability of $f(w, b)$. Taking this viewpoint, we shall consider a differentiable function $\hat{f}(w, b)$ that approximates $f(w, b)$ (to be specified later) and solve the constrained optimization:

$$\min_{w \in \mathbb{R}^d, \, b \in \mathbb{R}} \; \hat{f}(w, b) \quad \text{s.t.} \quad (w, b) \in \mathcal{X}. \tag{1.7}$$
It should be easy to apply iterative algorithms such as projected gradient descent (PGD), conditional gradient, etc. (see Appendix A), to tackle the above problem.
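For instance, PGD requires the Euclidean projection onto $\mathcal{X}$; since $b$ is unconstrained, the projection simply rescales $w$ onto the unit ball whenever $w^\top w > 1$. A minimal sketch (the function name `project_X` is for illustration):

```julia
using LinearAlgebra

# Euclidean projection of (w, b) onto X = {(w, b) : w'w <= 1};
# b is left untouched, w is rescaled when it lies outside the unit ball.
function project_X(w::Vector{Float64}, b::Float64)
    nw = norm(w)
    return (nw > 1 ? w / nw : w), b
end
```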
Among other options, a potential candidate for approximating $f(w, b)$ is the logistic loss4, as we have done in Lecture 16 / Computing Lab 2:
$$\hat{f}(w, b) = \frac{1}{m} \sum_{i=1}^m \ell_i \log\Big( 1 + \exp\big( -y^{(i)} \big( (x^{(i)})^\top w + b \big) \big) \Big). \tag{1.8}$$

Minimizing the above function leads to a solution $(w, b)$ such that $y^{(i)} \big( (x^{(i)})^\top w + b \big)$ is large for all $i = 1, \ldots, m$, which is a desired feature for a good classifier.
Task 8: Customized Solver for Classifier Optimization (30%)
Using the training data with $m = 16657$ samples in ftec-cs-full-train.csv, implement an iterative algorithm to tackle (1.7). Notice that due to (1.6), the optimized classifier must satisfy $w^\top w \le 1$. Moreover, you are required to initialize your algorithm with $w^0 = 0$, $b^0 = 0$.
Suggestion: As a first attempt, you may consider the projected gradient descent (PGD) method using a constant step size, with $\hat{f}(w, b)$ selected as the logistic function (1.8); see the sketch after this task box. You may need to tune the weights $\{\ell_i\}_{i=1}^m$ to obtain a better result, while paying attention to the ‘size’ (i.e., $\|x^{(i)}\|$) of each feature vector. Other candidates are also suggested in Appendix A.
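For concreteness, a minimal PGD sketch along the above suggestion is given below. It is a sketch only: the step size `gamma` and the number of iterations are illustrative values to be tuned, the weights `ell` are taken as given, and each call to `grad_f` counts as one access to the training dataset:

```julia
using LinearAlgebra

# Gradient of the logistic objective (1.8) at (w, b); one dataset access.
function grad_f(w, b, X, y, ell)
    m = size(X, 1)
    gw, gb = zeros(length(w)), 0.0
    for i in 1:m
        z = y[i] * (X[i, :]' * w + b)
        s = -y[i] / (1 + exp(z))   # = y_i times d/dz of log(1 + exp(-z))
        gw .+= (ell[i] / m) .* s .* X[i, :]
        gb += (ell[i] / m) * s
    end
    return gw, gb
end

# PGD with constant step size gamma, initialized at w0 = 0, b0 = 0 as required.
function pgd(X, y, ell; gamma = 0.1, iters = 1000)
    w, b = zeros(size(X, 2)), 0.0
    for _ in 1:iters
        gw, gb = grad_f(w, b, X, y, ell)
        w, b = w .- gamma .* gw, b - gamma * gb
        nw = norm(w)               # projection onto w'w <= 1; b is free
        w = nw > 1 ? w ./ nw : w
    end
    return w, b
end
```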
Assessment You will receive a maximum of 10% for correctly implementing at least one numerical algorithm (e.g., projected gradient), together with

1. plotting the trajectory of the algorithm to show that the objective value in (1.7) decreases to a certain value asymptotically, and providing comments on the algorithm(s) implemented,

2. providing derivations and justifications on why the implemented algorithm is used.
Besides achieving low training/testing error rates, a good binary classifier should also be efficiently computable. As we are dealing with a relatively large dataset, the complexity of your algorithm will be measured by the number of accesses to the training dataset5. Every time you evaluate the value of the objective function $\hat{f}(w, b)$, or the gradient of $\hat{f}(w, b)$, you will incur one access to the training dataset.

For instance, suppose you are applying a projected GD method with a constant step size for 1000 iterations. As the algorithm evaluates the gradient of $\hat{f}(w, b)$ only once per iteration, the total number of accesses to the training dataset will be equal to 1000.

4Notice that the logistic objective function can be interpreted alternatively as a formulation for the classifier design task with the maximum-likelihood (ML) principle from statistics. This is beyond the scope of this project specification.

5This is sometimes known as the sample complexity.
Secondly, another desired property is that the binary classifier should not be biased towards a particular feature. In particular, here we wish to design a classifier that does not decide on the risk of default for a customer entirely based on the feature ‘RevolvingUtilizationOfUnsecuredLines’. Supposing this feature is denoted by the first coordinate of $x$, i.e., $[x]_1$, in your program, we can measure the biasedness towards it by:
$$\% \text{ weight of classifier on ‘RevolvingUtilizationOfUnsecuredLines’} = \frac{|w_1|}{\sum_{i=1}^{7} |w_i|}.$$
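In code, this quantity is a one-liner (assuming `w` is your optimized direction vector with the feature as its first coordinate):

```julia
# Fraction of total absolute weight placed on the first feature.
bias_weight(w) = abs(w[1]) / sum(abs.(w))
```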
Finally, the remaining 20% of your marks in this task will be calculated according to the following formula:
$$\text{Score} = 5\% \times \exp\Big( 10 \cdot \max\{0.25, \text{Lowest Training Error Rate}\} - 10 \cdot \max\{0.25, \text{Your Training Error Rate}\} \Big)$$
$$+ \, 5\% \times \exp\Big( 10 \cdot \max\{0.3, \text{Lowest Testing Error Rate}\} - 10 \cdot \max\{0.3, \text{Your Testing Error Rate}\} \Big)$$
+ 5% × max{0.25, Lowest % weight of class