CSE 4502/5717 Big Data Analytics Fall 2020 Exam 3 Helpsheet
1. Let f : R^n → R be any function on n variables. Given a series of examples to learn f, we can fit them using a linear model: f(x1, x2, …, xn) = w1x1 + w2x2 + ··· + wnxn. Linear regression computes the optimal values for the parameters by equating the gradient to zero. Let the examples be (x1^i, x2^i, …, xn^i; yi) for 1 ≤ i ≤ m. Let w = (w1 w2 ··· wn)^T be the parameter vector. Also, let
    X = [ x1^1  x2^1  ···  xn^1
          x1^2  x2^2  ···  xn^2
           ⋮     ⋮           ⋮
          x1^m  x2^m  ···  xn^m ]
Then, we showed that the optimal value for w is (X^T X)^(-1) X^T y, where y = (y1 y2 ··· ym)^T.
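For illustration, a minimal NumPy sketch of this normal-equation computation (the data values below are made up):

```python
import numpy as np

# Hypothetical data: m = 4 examples, n = 2 features per example.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
y = np.array([5.0, 4.5, 7.5, 11.0])

# Normal equations: w = (X^T X)^(-1) X^T y.
# Solving the linear system is preferred over forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # fitted parameter vector (w1, w2)
```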
2. An artificial neural network (ANN) can be thought of as a directed leveled graph G(V, E) where each node corresponds to a neuron. If there is a directed edge from node i to node j, then, a signal flows from node i to node j.
A simple example is a perceptron, which has a single neuron. There are k inputs x1, x2, …, xk to the neuron. Input xi is weighted by wi, for 1 ≤ i ≤ k. If the weighted input to the neuron, i.e., Σ_{i=1}^{k} wi xi, is ≥ τ (τ being a threshold), then the output of the neuron will be 1; otherwise the output will be 0. We showed that any Boolean function can be represented with an NN consisting of multiple layers of perceptrons.
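As a small illustration, a sketch of a single perceptron realizing the Boolean AND and OR functions (the specific weights and thresholds below are just one valid choice):

```python
def perceptron(x, w, tau):
    """Output 1 if the weighted input sum reaches the threshold tau, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= tau else 0

# AND of two Boolean inputs: fires only when both inputs are 1.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w=(1, 1), tau=2))

# OR of two Boolean inputs: fires when at least one input is 1.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w=(1, 1), tau=1))
```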
3. We considered forward and backward propagations in a general feed-forward NN G(V, E) and showed that both steps can be completed in O(|V| + |E|) time (for each example). We also briefly discussed convolutional neural networks (CNNs), recurrent neural networks (RNNs), and generative adversarial networks (GANs).
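To illustrate the O(|V| + |E|) bound, a sketch of the forward pass that processes the nodes in topological order and touches each edge exactly once; the backward pass is symmetric. The graph encoding and the sigmoid activation are assumptions made here for concreteness:

```python
import math
from collections import defaultdict

def forward(edges, weights, inputs, order):
    """Forward propagation over a feed-forward NN given as a DAG.

    edges:   dict mapping node -> list of successor nodes
    weights: dict mapping (i, j) -> weight on edge i -> j
    inputs:  dict mapping input node -> its input value
    order:   topological order of all nodes

    Each node is visited once and each outgoing edge is relaxed once,
    so the total work is O(|V| + |E|).
    """
    net = defaultdict(float, inputs)   # accumulated weighted input per node
    out = {}
    for v in order:
        # Input nodes pass their value through; others apply a sigmoid.
        out[v] = net[v] if v in inputs else 1.0 / (1.0 + math.exp(-net[v]))
        for u in edges.get(v, []):     # push v's output along each outgoing edge
            net[u] += weights[(v, u)] * out[v]
    return out

# Tiny example: two inputs feeding one output neuron.
edges = {"x1": ["o"], "x2": ["o"]}
weights = {("x1", "o"): 0.5, ("x2", "o"): -0.3}
print(forward(edges, weights, inputs={"x1": 1.0, "x2": 2.0}, order=["x1", "x2", "o"]))
```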
4. Association Rules Mining. An itemset is a set of items. A k-itemset is an itemset of size k. A transaction is an itemset. A rule is represented as X → Y where X ≠ ∅, Y ≠ ∅, and X ∩ Y = ∅.
We are given a database DB of transactions and the number of transactions in the database is n. Let I be the set of distinct items in the database and let d = |I|.
For an itemset X, we define σ(X) as the number of transactions in which X occurs, i.e., σ(X) = |{T ∈ DB | X ⊆ T}|. The support of any rule X → Y is σ(X ∪ Y)/n. The confidence of any rule X → Y is σ(X ∪ Y)/σ(X).
Association Rules Mining is defined as follows.
Input: A DB of transactions and two numbers: minSupport and minConfidence.
Output: All rules X → Y whose support is ≥ minSupport and whose confidence is ≥ minConfidence.
An itemset X is frequent if σ(X) ≥ n · minSupport.
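As a quick sanity check of these definitions, a small sketch that computes σ(X), the support, and the confidence of one rule over a made-up five-transaction database:

```python
def sigma(itemset, db):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in db if itemset <= t)

# Hypothetical database of n = 5 transactions.
db = [{"bread", "milk"},
      {"bread", "diapers", "beer", "eggs"},
      {"milk", "diapers", "beer", "cola"},
      {"bread", "milk", "diapers", "beer"},
      {"bread", "milk", "diapers", "cola"}]
n = len(db)

X, Y = {"milk", "diapers"}, {"beer"}
support = sigma(X | Y, db) / n                  # σ(X ∪ Y)/n    = 2/5
confidence = sigma(X | Y, db) / sigma(X, db)    # σ(X ∪ Y)/σ(X) = 2/3
print(support, confidence)
```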
We discussed the Apriori algorithm for finding all the frequent itemsets. This algorithm is based on the a priori principle: if X is not frequent, then no superset of X is frequent. Also, if X is frequent, then every subset of X is also frequent.
The pseudocode for the Apriori algorithm is given next.

Algorithm 1: Apriori algorithm
  k := 1;
  Compute F1 = {i ∈ I | σ(i) ≥ n · minSupport};
  while Fk ≠ ∅ do
      k := k + 1;
      Generate candidates Ck from Fk−1 and set σ(C) := 0 for every C ∈ Ck;
      for T ∈ DB do
          for C ∈ Ck do
              if C ⊆ T then
                  σ(C) := σ(C) + 1;
      Fk := ∅;
      for C ∈ Ck do
          if σ(C) ≥ n · minSupport then
              Fk := Fk ∪ {C};
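A compact Python sketch of the same loop is given below; the candidate-generation step (joining frequent (k−1)-itemsets and pruning by the a priori principle) is one common realization, not necessarily the exact variant presented in class:

```python
from itertools import combinations

def apriori(db, min_support):
    """Return all frequent itemsets of a list-of-sets transaction database."""
    n = len(db)
    threshold = n * min_support
    items = {i for t in db for i in t}
    # F1: frequent 1-itemsets.
    F = [{frozenset([i]) for i in items
          if sum(1 for t in db if i in t) >= threshold}]
    k = 1
    while F[-1]:
        k += 1
        prev = F[-1]
        # Candidate generation: unions of two frequent (k-1)-itemsets of size k,
        # pruned so that every (k-1)-subset is frequent (a priori principle).
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Support counting: one pass over the database.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        F.append({c for c, cnt in counts.items() if cnt >= threshold})
    return [s for level in F for s in level]

# Usage on a tiny made-up database:
db = [{"bread", "milk"}, {"bread", "beer"}, {"milk", "beer"}, {"bread", "milk", "beer"}]
print(apriori(db, min_support=0.5))
```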
We can use a hash tree to compute the support for each candidate itemset. We also discussed a randomized algorithm for finding frequent itemsets.
5. Clustering: We showed how to implement the hierarchical clustering algorithm in O(n^2) time on any input of n points. To begin with, each input point is a cluster on its own; the algorithm then repeatedly merges the two closest clusters until a target number of clusters is reached.
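For reference, a naive agglomerative sketch of this idea; it rescans all cluster pairs in every round, so it runs in O(n^3) time rather than the O(n^2) implementation discussed in class, and single-linkage distance is an assumption made here:

```python
def hierarchical_clustering(points, target_k):
    """Naive agglomerative clustering: repeatedly merge the two closest clusters.

    points: list of (x, y) tuples; each point starts as its own cluster.
    Cluster distance is single linkage (closest pair of points, squared distance).
    """
    clusters = [[p] for p in points]

    def dist(a, b):
        return min((px - qx) ** 2 + (py - qy) ** 2 for px, py in a for qx, qy in b)

    while len(clusters) > target_k:
        # Find the pair of clusters at minimum distance and merge them.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters[j])
        clusters.pop(j)
    return clusters

print(hierarchical_clustering([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], target_k=3))
```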