1. Find the most popular users from Yelp’s user records dataset.
2. Find all businesses that these popular users have rated.
3. Apply endogenous statistical techniques to compute a trust rating for each business obtained in step 2.
4. Propose validation techniques to measure the trustworthiness of each popular user’s rating against the business’s trust rating computed in step 3. This process is repeated for all businesses obtained in step 2.
5. Collect statistics and analyze the results.
The following sections explain these steps in detail.
4. FINDING THE MOST POPULAR USERS
4.1 Features Selection and Clustering Technique
The first step is to find the most popular users among roughly 362K user records. Because the number of users is large, a tractable way to find the most popular users is to cluster them. Clustering was also chosen because many features (variables) can be used to determine a user’s popularity, and opinions differ on which of them matter. For example, some may argue that a user’s popularity is derived from the number of friends that the user has. Others may argue that a user is popular only if they have received a large number of compliments, and so on. Clustering is therefore a sensible way to group users, as it can take several features as its inputs.
The user records dataset provided by Yelp offers a number of features to work with. However, some features cannot be used, and others need to be transformed before use. Features such as ‘user_id’, ‘type’ and ‘name’ are non-numeric and do not correlate with the objective, which is to find the most popular users. These features were therefore omitted as inputs to the clustering process. The ‘elite’ and ‘friends’ features are collections of strings that list, respectively, the years in which a user received the ‘elite’ badge and the ‘user_id’ values of all users who have become friends with this user. We transformed these features by counting the number of entries in each prior to clustering. The ‘votes’ and ‘compliments’ features contain sub-features for the three types of votes (i.e. funny, cool and useful) and the ten types of compliments (i.e. writer, funny, cool, hot, plain, profile, note, photos, more and list) respectively. Individually, these sub-features do not correlate with finding the popular users. However, the sums of these sub-features (i.e. ‘total_votes’ and ‘total_compliments’) do help to find the most popular users in Yelp. The final features that we use to cluster Yelp’s users are:
1. yelping_since – the year in which a user started yelping.
2. average_star – the average rating that a user has given to businesses in Yelp.
3. elite_count – the total number of years in which a user has been part of Yelp’s elite squad. This number is produced by counting all years listed in the “elite” feature.
4. fans_count – the total number of fans that a user has.
5. friends_count – the total number of friends that a user has. This number is obtained by counting all user_ids in the “friends” feature.
6. total_votes – the total number of votes received by a user from other users. This number is obtained by summing the three vote subtypes.
7. total_compliments – the total number of compliments received by a user from other users. This number is obtained by summing the ten compliment subtypes.
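The transformations behind this feature list can be sketched in Python as follows. This is a minimal illustration rather than the authors’ code: the raw field names average_stars, fans and yelping_since, the "YYYY-MM" date format, and the sample record are assumptions made for the example; only the derived feature names come from the list above.

```python
from typing import Any, Dict


def derive_features(user: Dict[str, Any]) -> Dict[str, float]:
    """Turn one raw user record into the seven clustering features."""
    return {
        # Assumed format "YYYY-MM": keep only the year.
        "yelping_since": int(str(user["yelping_since"])[:4]),
        "average_star": float(user["average_stars"]),
        # Count the entries of the collection-type features.
        "elite_count": len(user.get("elite", [])),
        "fans_count": int(user.get("fans", 0)),
        "friends_count": len(user.get("friends", [])),
        # Sum the sub-features of 'votes' and 'compliments'.
        "total_votes": sum(user.get("votes", {}).values()),
        "total_compliments": sum(user.get("compliments", {}).values()),
    }


# Hypothetical record, invented purely for illustration.
example_user = {
    "user_id": "u_001", "name": "Example", "type": "user",
    "yelping_since": "2008-03", "average_stars": 3.9,
    "elite": [2010, 2011, 2012], "fans": 12,
    "friends": ["u_002", "u_003"],
    "votes": {"funny": 4, "cool": 6, "useful": 10},
    "compliments": {"writer": 2, "photos": 1},
}

features = derive_features(example_user)
print(features)
```

The identifier-like fields (user_id, name, type) are simply ignored, which mirrors their omission from the clustering inputs.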
X-means [44] was chosen as the clustering algorithm to group users in this study, mainly because of its clustering performance and its accurate evaluation of cluster fitness. X-means is a variant of the k-means algorithm [45] that accelerates the iterative process and finds the best k. It searches over a range of k values and selects the best k clusters based on the Bayesian Information Criterion (BIC) [46] score. The algorithm loops the following operations until completion:
1. Perform k-means starting from the lower bound of k (2 by default) and continuing up to the chosen upper bound of k. This gives the clusters C_1, C_2, …, C_k.
2. Apply k-means with k = 2 to each cluster c_k in C. This gives the child clusters c_k^1 and c_k^2.
3. Calculate the BIC for each cluster c_k and evaluate its relevance, that is, whether splitting the cluster into its two children improves the score.
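Steps 2 and 3 can be sketched in pure Python as below. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the helper names (init_centers, kmeans, bic, should_split, grid) are hypothetical, a deterministic farthest-first initialisation replaces random restarts so the example is reproducible, and the variance is estimated as the pooled squared distance divided by (R − k). The score is the spherical-Gaussian BIC that the remainder of this section defines.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def init_centers(points: List[Point], k: int) -> List[List[float]]:
    """Farthest-first initialisation (an assumption of this sketch)."""
    mean = centroid(points)
    centers = [list(max(points, key=lambda p: sq_dist(p, mean)))]
    while len(centers) < k:
        centers.append(list(max(
            points, key=lambda p: min(sq_dist(p, c) for c in centers))))
    return centers


def kmeans(points: List[Point], k: int,
           max_iter: int = 100) -> Tuple[List[List[float]], List[int]]:
    """Plain Lloyd's k-means; returns (centers, cluster index per point)."""
    centers = init_centers(points, k)
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = centroid(members)
    return centers, labels


def bic(points: List[Point], centers: List[List[float]],
        labels: List[int]) -> float:
    """BIC(D, k) = l(D|k) - (p/2) log R for spherical Gaussian clusters."""
    R, d, k = len(points), len(points[0]), len(centers)
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    # Assumed estimator: pooled variance, floored to avoid log(0).
    variance = max(sse / max(R - k, 1), 1e-12)
    log_lik = 0.0
    for j in range(k):
        r_j = labels.count(j)
        if r_j == 0:
            continue
        log_lik += (-r_j / 2 * math.log(2 * math.pi)
                    - r_j * d / 2 * math.log(variance)
                    - (r_j - 1) / 2
                    + r_j * math.log(r_j / R))
    n_params = (k - 1) + d * k + 1
    return log_lik - n_params / 2 * math.log(R)


def should_split(cluster: List[Point]) -> bool:
    """Keep the two children only if their BIC beats the parent's."""
    parent_bic = bic(cluster, [centroid(cluster)], [0] * len(cluster))
    child_centers, child_labels = kmeans(cluster, 2)
    return bic(cluster, child_centers, child_labels) > parent_bic


def grid(offset: int) -> List[Point]:
    """A 5 x 5 block of points, used here as a toy 'cluster'."""
    return [(offset + x, offset + y) for x in range(5) for y in range(5)]


one_blob = grid(0)               # one compact group
two_blobs = grid(0) + grid(20)   # two well-separated groups

print(should_split(one_blob))    # False
print(should_split(two_blobs))   # True
```

On the toy data, the compact block is left intact because the lower distortion of its two children does not outweigh the extra parameters and the cluster-size term, while the two distant blocks are split.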
BIC is used to approximate the posterior probabilities of the clusterings, in other words, the “goodness of fit” of a clustering to a dataset. It is an approximation to the probability of the clustering given the data that has been clustered. Thus, the higher the BIC value, the higher the probability of the clustering being a “good fit” to the data being clustered. BIC is measured through the likelihood of how well the clusters model the data. Computing this likelihood assumes that each cluster is produced by a spherical Gaussian distribution. BIC is applicable to x-means because x-means builds on k-means, which carries the same spherical Gaussian assumption. The following formula is used to compute BIC:
$$\mathrm{BIC}(D, k) = l(D \mid k) - \frac{p_j}{2} \log(R)$$
where l(D | k) is the log-likelihood, R is the number of points in the data, and p_j is the number of parameters to estimate, which is (k − 1) + dk + 1: (k − 1) cluster probabilities, dk centroid coordinates and one variance estimate. To compute l(D | k) we use
$$l(D \mid k) = \sum_{i=1}^{k} \left[ -\frac{R_i}{2} \log(2\pi) - \frac{R_i d}{2} \log(\delta^2) - \frac{R_i - 1}{2} + R_i \log\left(\frac{R_i}{R}\right) \right]$$
where R_i is the number of points in the i-th cluster, d is the number of dimensions (features), and δ² is the average variance of the Euclidean distance from each data point to its cluster center. In x-means, BIC is computed locally in all centroid split tests (step 3 above) and globally when choosing the best model.
4.2 Clustering Results – Finding the Most Popular Users
We performed our clustering and visualization with the R programming language [47] and the RapidMiner [48] predictive analysis tool. X-means was chosen as the clustering algorithm and the BIC score was chosen to evaluate the clusters. The clustering ran from k = 2 to k = 60. For each k, a maximum of 10 runs was set up, with a maximum of 100 iterations per run. After these runs, x-means selected k = 4 as the optimal number of clusters. The resulting clusters and their average values on each feature are shown in Table I.
TABLE I. CLUSTERING RESULTS FOR YELP’S USER RECORDS
Features/Clusters    Cluster_0   Cluster_1   Cluster_2   Cluster_3
yelping_since_year     2011.25     2008.37     2007.77     2007.95
average_stars             3.72        3.78        3.85        3.83
elite_count               0.19        4.97        6.58        6.01
fans                      1.05       58.68      507.16      198.25
friends_count             5.33      210.17     1030.44      554.91
review_count             28.00      583.22     1703.93     1093.00
total_votes              68.53     5440.45    59598.49    22809.95
total_compliments         9.02     1578.48    24432.07     7949.99
total_users             359650        2170          43         240
Out of the four clusters, the third cluster (Cluster_2) is the most dominant. This cluster is worthy of our attention because it has the highest average value on every feature when compared to the other clusters. Users in this cluster have, on average, been yelping the longest and have been accepted to Yelp’s elite squad the most times. These users also have the highest average number of fans and friends, and have provided the most reviews of businesses on the Yelp platform. Additionally, they have received the highest average number of votes and compliments from other Yelp users. Fig. 1 shows the four clusters of Yelp users with various features in two-dimensional space. This figure clearly shows that Cluster_2 has the highest number of friends, total votes, total compliments and fans.
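The comparison above can be checked mechanically. The short sketch below takes the cluster averages as read from Table I and reports which cluster leads on each feature. It is only an illustration: the names TABLE_I, LOWER_IS_BETTER and leading_cluster are hypothetical, and treating the earliest yelping_since_year as the "leading" value is an assumption that matches the text’s reading of "yelping the longest".

```python
from typing import Dict, List

CLUSTERS: List[str] = ["Cluster_0", "Cluster_1", "Cluster_2", "Cluster_3"]

# Cluster averages as read from Table I (one list entry per cluster).
TABLE_I: Dict[str, List[float]] = {
    "yelping_since_year": [2011.25, 2008.37, 2007.77, 2007.95],
    "average_stars": [3.72, 3.78, 3.85, 3.83],
    "elite_count": [0.19, 4.97, 6.58, 6.01],
    "fans": [1.05, 58.68, 507.16, 198.25],
    "friends_count": [5.33, 210.17, 1030.44, 554.91],
    "review_count": [28.00, 583.22, 1703.93, 1093.00],
    "total_votes": [68.53, 5440.45, 59598.49, 22809.95],
    "total_compliments": [9.02, 1578.48, 24432.07, 7949.99],
}

# For the starting year, the earliest (smallest) value leads.
LOWER_IS_BETTER = {"yelping_since_year"}


def leading_cluster(feature: str) -> str:
    """Return the cluster with the leading average for one feature."""
    values = TABLE_I[feature]
    pick = min if feature in LOWER_IS_BETTER else max
    return CLUSTERS[values.index(pick(values))]


leaders = {feature: leading_cluster(feature) for feature in TABLE_I}
for feature, cluster in leaders.items():
    print(f"{feature:20s} {cluster}")
```

Every feature resolves to Cluster_2, which is the basis for treating its members as the most popular users.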

Fig. 2. Clusters of Yelp’s users with various features.
Table I shows that Cluster_2 is a group of the forty-three most popular users in Yelp’s user records dataset. Of these 43 users, 42, denoted as P = {P1,…, P42}, are selected for further study to investigate their trustworthiness in providing credible ratings. We are particularly concerned with the ratings given by these most popular users to the businesses listed in Yelp’s business records. The reason for omitting one user from this group is discussed in the next section.
5. DETERMINING POPULAR USERS’ TRUSTWORTHINESS
In this section we propose several statistical techniques to determine the trustworthiness of Yelp’s most popular users in giving credible ratings. We first find all businesses that these popular users have rated. To increase the validity of the results, we decided to focus on those businesses (denoted as B) that have
