Math 160 (Spring Quarter 2020) Due date: 06/05/2020, 12:30pm
Final Project
INSTRUCTIONS
• This is the final project of the course (40 points in total). No collaboration is allowed. You need to develop everything on your own. If you have difficulty understand the meaning of some questions, you can ask Prof. Ma or the TAs during office hours, or by sending emails. But note that this is only for clarification purpose. We don’t answer math or coding questions.
• You are not allowed to post the questions to Piazza, or anywhere on internet.
• If you use a source outside the notes (slides) of class you have to cite it properly.
• You need to submit BOTH PDF and CODES to Canvas. You need to write comments to your codes so that they are understandable. Do not forget to submit the data you used (especially if you reformatted things)! Make sure we can run the code from a simple call. E.g., all data files should be included. We will not download files for you or retype or copy paste your code. If it does not run straight out of the box, you will receive zero points!!.
• Be organized and use the notation appropriately. Show your work on every problem. Correct answers with no support work will not receive full credit.
1. How to influence the opinion of people when on a budget. (15 points)
Apparently it is in fashion to use social networks, like facebook, to influence voters in elections. In this problem, you will consider a simplified model of influence in social networks: the social network is represented by an undirected graph G = (V, E) and for a set of nodes S ⊆ V , we denote by N(S) the set of their neighbors:
N(S) = {v ∈ V |∃u ∈ S, (u,v) ∈ E}. The influence I(S) of a set of nodes is measured by I(S) = |N(S)|.
Imagine you work now for facebook and you are given the dataset facebookgraph.txt (availa- ble in the same canvas folder). This dataset is in fact a subgraph of the real Facebook social graph. Each line in the file contains the id of two users, indicating that these two users are friends on Facebook.
a. As the designer of a marketing campaign to influence the opinion of voters, your goal is to find a subset S ⊆ V of at most K nodes whose influence is maximal. Write a mathematical model to solve this problem. Can you solve the problem for the data you were given using MATLAB or AMPL or SCIP? If not, can you write a practical method to give an approximation to the optimum?
b. Using any of your models/methods from part (a) write a computer program which, given the social network described in the dataset and a budget K ∈ N+, returns an approximately optimal set of nodes S for the influence function I(S). The function should return both the users to influence and the value (total amount of influence) obtained.
c. Plot the influence I(S) obtained by your function as a function of the budget K. E.g., what happens when K = 1 and K = |V |?
d. In the definition of I(S) a node with largest degree is one of the most influential persons, i.e., he/she is a VIP. Can you think of a way to adapt the pagerank algorithm to identify VIP nodes in the graph? Can you invent of a third additional way to define who are the VIPs in this graph? What else could you use to measure influence?
e. A vertex-cover of a graph G is a set S of vertices of G such that each edge of G is incident with at least one vertex of S. The vertex cover number τ(G) is the size of a minimum vertex cover in G. A dominating set for a graph is a subset D of vertices such that every vertex not in D is adjacent to at least one member of D. The domination number γ(G) is the smallest dominating set. Are there any relationships among the influence function I(S), τ(G), and γ(G)? Explain.
Formulate a discrete models that given a graph find a vertex cover with smallest number of vertices and a smallest dominating set. Explain the reasoning on your variables and constraints.
2. Clustering and pairing senators by their political ideas (12 points)
The zip-file 2-congress-datafiles.zip (in the same canvas folder) contains two csv files with (real!) voting data. One file, I call F114, has information of how all USA senators (in the 114th congress) voted in 15 different bills. 1 means voted YES, 0 means voted NO, 0.5 means ABSTAINED. The other file, I call FYemen, contains votes for the 116 congress on single issue (the legality of the US involvement on the war in Yemen). This will be used later.
We always assume senators of the same party vote similarly, but let us analyze the votes using mathematics and uncover some information!
Clustering is an operation in data science where we divide data up into a fixed number of clusters while trying to ensure that the items in each cluster are as similar as possible (e.g., if we cluster by political party we presumably get similar voting records, similar opinions). K- means is a clustering algorithm that allows you to cluster into K clusters based on distances. It is implemented in MATLAB already.
a. Use K-means on F114, with K = 2 to get a clustering in two groups. Obviously, do not use the labels of the real affiliations I gave you! Choose one to be called demo- crats, the other to be the republicans (do not look at the original file). Make your best mathematical guess of the party affiliation.
You have obtained a classification of senators into two classes. Does your classification coincide exactly with the real party affiliation? Or are there surprises? To measure this compute the false-positive rate and false-negative rate of your classification:
false-positive rate := number of senators incorrectly labeled as democrats number of democratic senators
false-negative rate := number of senators incorrectly labeled as republicans number of republican senators
b. What if you cluster with K = 3? That would after all mean there are three factions in the senate (moderates, conservatives, liberals?).
c. Use part of the data to train a SVM model to classify the senators, and test the model using the unused data. Report your results.
d. Clustering can also be model as an integer optimization problem. Write such a model and code it in AMPL or SCIP. Run it in the senator’s data.
3. Recommendation Systems. (13 points). The Movielens (https://movielens.org/) is a website for movie recommendations. It is run by a research lab at University of Minnesota. The zip file ml-latest-small.zip (in the same canvas folder) contains 100836 ratings from 610 users to 9742 movies. More information of this dataset can be found here: http:// files.grouplens.org/datasets/movielens/ml-latest-small-README.html. Rearrange the ratings to a matrix with size 610 × 9724 (there are some movies that did not receive any rating and they were not included). Denote this matrix as M. Each row of M corresponds to a user, and each column of M corresponds to a movie. In this matrix there are 100836 entries (ratings) are given, that is only 100836/(610 × 9724) = 1.7% of the entries. So, 98.3% ratings are missing. You are asked to predict the missing ratings. Denote the set of indices of the known entries in the matrix as Ω, and denote the indices set of the unknowns as Ω ̄. That is:
Ω := {(i,j) | Mij is known}, Ω ̄ := {(i,j) | Mij is not known}
You can’t use all known entries in Ω to train your model, because you need some testing data to test whether your model and algorithm perform well. Therefore, you need to partition Ω to two parts: Ωtrain and Ωtest, i.e., Ω = Ωtrain Ωtest. Entries corresponding to Ωtrain are used as training data, and entries corresponding to Ωtest are used as testing data.
a. The first model is a matrix factorization model. Denote m = 610 and n = 9724. The model is:
min ∥PΩtrain(UV −M)∥2F, U ∈Rm×r ,V ∈Rr×n
where r < m, and the operator PΩtrain : Rm×n → Rm×n is defined as: Zij, if(i,j)∈Ωtrain
Implement the gradient method (with a fixed step size) to solve (MatFac). Randomly choose p% indices of Ω as Ωtrain and the remaining (1−p%) as Ωtest. Measure the qualify of (U∗,V∗) returned by your algorithm using the following mean squared error:
MSE:= 1 ((U∗V∗)ij −Mij)2 |Ωtest| (i,j)∈Ωtest
For different combinations of (p,r), report MSE in table(s). You can choose p = {99, 95, 90, 80, 70, 60, 50, . . .}, and r = {1, 10, 20, 50, 100, 200, 300}. Moreover, report the effect of the step size of the gradient method to the results.
b. In (MatFac) we used Frobenius norm in the objective. We can change it to l1 norm. min ∥PΩtrain(UV −M)∥1. (MatFacL1)
U ∈Rm×r ,V ∈Rr×n
The l1 norm of a matrix Z is defined as ∥Z∥1 = ij |Zij|, that is, the sum of absolute values of all entries. (MatFacL1) can’t be solved by gradient method, because the
[PΩtrain(Z)]ij = 0, otherwise
(MatFac)
absolute value function is not differentiable. But you can still solve it approximately by slightly perturbing the absolute value function. For a scalar z, you can approximate
√
|z| by |z| ≈ z2 + ε with a very small positive ε. Note that the latter function is
differentiable for z. Now by replacing all absolute values in the objective of (MatFacL1) by this formula, you get a differentiable objective function. Implement the gradient method (with a fixed step size) to solve (MatFacL1). Report the same everything as what you did in part (a).
c. Discuss the two models (MatFac) and (MatFacL1), which one is better? Under what situation (MatFac) is better? Under what situation (MatFacL1) is better?