• THE DATASET
The assignment employs a Facebook dataset provided as a courtesy by the Max Planck Institute for Software Systems. The data is described in more detail in the following publication:
Viswanath, B., Mislove, A., Cha, M. and Gummadi, K.P., 2009, August. On the Evolution of User Interaction in Facebook. In Proceedings of the 2nd ACM Workshop on Online Social Networks (pp. 37-42). ACM.
The Facebook dataset contains exactly 63,731 unique Facebook users and 817,090 friendship relationships (links) among them. Do not be alarmed if importing the dataset into Neo4j takes a while, since the number of friendships is large. However, if the import takes longer than 10 minutes, you are probably doing something wrong; for example, you may not have indexed the right attributes in your database. Also, do not try to draw the whole Facebook graph at once, as this may cause your computer to freeze. Even if you did manage to draw the entire graph, you would not learn much from a tangled mess of 817,090 links.
The Facebook dataset for your coursework can be downloaded from the module DLE website under the Coursework Section; look for the files stored in a folder named Facebook Dataset. Although the dataset is spread across several files, you are expected to import every single data item into the database that you will submit. Marks will be deducted for failing to include all the data provided.
• THE PROBLEM
4.1 Recommend by number of common friends
For every Facebook user in the dataset with an ID number that is a multiple of 980, provide a list containing the first 10 friend recommendations, as determined by the number of common friends. If there are fewer than 10 recommendations, provide all the recommendations. For each case, you will have to provide the Neo4j CQL query that generated the friend recommendations, as well as the actual recommendations. Note that the recommendations alone will not receive any marks, and the Neo4j CQL queries must be provided in text format (screenshots will be awarded zero marks). Also, note that your queries will be tested. Hence, queries that do not produce the submitted results will invalidate the answers and zero marks will be awarded.
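To make the ranking rule concrete, the following is a minimal Python sketch of recommending by number of common friends on a small in-memory graph. It is not a substitute for the required Neo4j CQL query. The function name, the toy graph, and the tie-breaking rule (ascending user ID) are assumptions introduced here for illustration only.

```python
from collections import Counter


def recommend_by_common_friends(graph, user, limit=10):
    """Rank non-friends of `user` by how many friends they share with `user`.

    `graph` maps each user ID to the set of that user's friends.
    """
    friends = graph.get(user, set())
    counts = Counter()
    for friend in friends:
        for candidate in graph.get(friend, set()):
            # Skip the user themselves and anyone who is already a friend.
            if candidate != user and candidate not in friends:
                counts[candidate] += 1
    # Assumption: ties are broken by ascending user ID.
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return ranked[:limit]


# Hypothetical toy graph (undirected, so every friendship appears twice).
toy_graph = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2, 4, 5},
    4: {2, 3},
    5: {3},
}

print(recommend_by_common_friends(toy_graph, 1))
```

In this toy graph, user 4 shares two friends with user 1 and user 5 shares one, so user 4 is ranked first.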
An example of the format in which your results should be delivered is shown in Table 1 below. Please note that the CQL queries and the recommendations listed in Table 1 are shown only to illustrate the format; they are not actual solutions to the problem in question.
| User ID | CQL Query | Friend recommendation |
|---|---|---|
| … | … | … |
| 13720 | Match (friend:FacebookUser {id: '13790'})-[r:FRIENDS_WITH]-(user:FacebookUser) RETURN friend | 17125, 7033, 15462, 33049, 51105, 16424, 23, 7996, 1539, 17420 |
| 14700 | Match (friend:FacebookUser {id: '14775'})-[r:FRIENDS_WITH]-(user:FacebookUser) RETURN friend | 14473, 14495, 17951, 19611, 22749, 23259, 30002, 3154, 8269, 862 |
| 15680 | Match (friend:FacebookUser {id: '15760'})-[r:FRIENDS_WITH]-(user:FacebookUser) RETURN friend | 28606 |
| … | … | … |

Table 1
4.2 Recommend by influence scoring
For every Facebook user in the dataset with an ID that is a multiple of 980, provide a list containing the first 10 friend recommendations, as determined by influence scoring. If there are fewer than 10 recommendations, provide all the recommendations.
Once again, the output format should be the same as that shown in Table 1. You are expected to provide the Neo4j CQL query that generated the friend recommendations and the actual recommendations. Note that the recommendations alone will not receive any marks, and the Neo4j CQL queries must be provided in text format (screenshots will be awarded zero marks). Also, note that your queries will be tested. Hence, queries that do not produce the submitted results will invalidate the answers and zero marks will be awarded.
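This brief does not define influence scoring, so the sketch below rests on an assumption: it uses one commonly taught definition, in which each common friend contributes 1 / (number of that friend's friends) to a candidate's score. A common friend with few friends therefore counts for more than one with many. Check this against the definition given in your module materials before relying on it. The function name, the toy graph, and the tie-breaking rule (ascending user ID) are also hypothetical.

```python
from collections import defaultdict


def recommend_by_influence(graph, user, limit=10):
    """Rank non-friends of `user` by an assumed influence score.

    Assumed definition: each common friend f adds 1 / degree(f) to the
    score of every candidate that f links `user` to.
    """
    friends = graph.get(user, set())
    scores = defaultdict(float)
    for friend in friends:
        friends_of_friend = graph.get(friend, set())
        if not friends_of_friend:
            continue  # Guard against an incomplete (non-symmetric) graph.
        weight = 1.0 / len(friends_of_friend)
        for candidate in friends_of_friend:
            if candidate != user and candidate not in friends:
                scores[candidate] += weight
    # Assumption: ties are broken by ascending user ID.
    return sorted(scores, key=lambda c: (-scores[c], c))[:limit]


# Hypothetical toy graph (undirected, so every friendship appears twice).
toy_graph = {
    1: {2, 3, 4},
    2: {1, 5},
    3: {1, 6, 7, 8, 9},
    4: {1, 6, 7, 8, 9},
    5: {2},
    6: {3, 4},
    7: {3, 4},
    8: {3, 4},
    9: {3, 4},
}

print(recommend_by_influence(toy_graph, 1))
```

On this toy graph, user 5 scores 0.5 through a single low-degree common friend, while users 6 to 9 each score 0.4 through two high-degree common friends. User 5 is therefore ranked first here, although it would be ranked last when counting common friends.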
4.3 Algorithm comparison
Considering only the 65 Facebook users with an ID that is a multiple of 980, identify the Facebook users who have the same first 10 friend recommendations under both recommendation algorithms, and the Facebook users who have different first 10 friend recommendations under the two algorithms. Your answers should appear in the format depicted in Table 2 (shown below). Please note that the answers listed in Table 2 are shown only to illustrate the format in which you should submit the solutions; they may not coincide with the actual solutions to the problem in question.
The code used for comparing the output of both algorithms has to be implemented using C# or Python (no other choices will be allowed), and it has to be included as part of your submission. If you do it manually, without using Python or C#, your solution will not be accepted and zero marks will be awarded.
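As an illustration of the kind of comparison required, the following Python sketch splits user IDs into those with the same output and those with different output under the two algorithms. It assumes the recommendations have already been collected into two dictionaries keyed by user ID, and it treats two lists as "the same" only when they contain the same IDs in the same order. Both the data layout and that interpretation are assumptions. The recommendation values shown are made-up placeholders, not real results.

```python
def compare_recommendations(by_common_friends, by_influence):
    """Split user IDs into (same, different) according to their two lists.

    Each argument maps a user ID to that user's ordered recommendation list.
    Assumption: lists count as "the same" only if they match element by
    element, in the same order.
    """
    same, different = [], []
    for user_id in sorted(by_common_friends.keys() | by_influence.keys()):
        first = by_common_friends.get(user_id, [])
        second = by_influence.get(user_id, [])
        if first == second:
            same.append(user_id)
        else:
            different.append(user_id)
    return same, different


# Made-up placeholder data, for illustration only.
common = {980: [1, 2, 3], 1960: [4, 5], 2940: [7, 8]}
influence = {980: [1, 3, 2], 1960: [4, 5], 2940: [7, 8]}

same_ids, different_ids = compare_recommendations(common, influence)
print("Same output:", same_ids)
print("Different output:", different_ids)
```

If the order of recommendations should not matter, the two lists could be compared as sets instead; whichever interpretation you choose should be stated explicitly in your submission.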
4.4 Report: Which algorithm and database model do you propose to use?
Produce a report (1,000 words maximum) explaining which algorithm you propose for recommending new friends on Facebook, and whether you would implement it using a relational or a non-relational database. Your report must be supported by evidence derived from the output you obtained in Section 4.1, Section 4.2 and Section 4.3. For example, if you propose to use influence scoring, you will have to use the results of the previous sections to explain why this is the best option.
| IDs of users having the same output under both algorithms | IDs of users having different output under both algorithms |
|---|---|
| 1960 | 980 |
| 4900 | 2940 |
| 5880 | 3920 |
| … | … |

Table 2
Before reaching a conclusion about which algorithm to propose, you may want to compare both algorithms using more than the 65 Facebook users whose ID is a multiple of 980. If you do, you must refer to the additional results as evidence to support your decision. You may also propose a combination of the two algorithms explored above. While this is a suitable alternative, you have to explain very clearly how you intend to combine the algorithms (which one you will apply first, how much weight is assigned to each algorithm, what happens if the algorithms agree or disagree, etc.).
Finally, you may want to combine the two algorithms explored with other ideas of your own (or ideas you have researched separately). You will still have to be very clear in terms of how your proposed algorithm will work and what evidence you present to support this.
• DELIVERABLES
Submit the following two files via the DLE.
• A Word document (documents that are not in Word format will not be reviewed and zero marks will be awarded) containing:
    • The solutions for Section 4.1 in the format illustrated by Table 1.
    • The solutions for Section 4.2, also in the format illustrated by Table 1.
    • The solutions for Section 4.3 in the format illustrated by Table 2.
    • A report explaining which algorithm you propose to use to make friend recommendations and whether you will implement it using a relational or non-relational database.
• The C# or Python code employed to obtain the answers for Section 4.3.
• ASSESSMENT AND GRADE CRITERIA
| Task | Description | Marks |
|---|---|---|
| 1 | Recommend by number of common friends. Explanation: See Section 4.1. Note: This is a core module for the MSc Data Science and Business Analytics; thus, if the submitted Neo4j CQL queries do not execute, marks will not be granted. Learning Outcomes: ALO2, ALO3. | 25 |
| 2 | Recommend by influence scoring. Explanation: See Section 4.2. Note: This is a core module for the MSc Data Science and Business Analytics; thus, if the submitted Neo4j CQL queries do not execute, marks will not be granted. Learning Outcomes: ALO2, ALO3. | 25 |
| 3 | Algorithm comparison (answers) and the source code (C# or Python) employed to compare the results of both algorithms. Explanation: See Section 4.3. Note: The code you submit is expected to execute correctly. If the submitted code does not compile and execute, the mark for this part of the assignment will be zero (0). Learning Outcomes: ALO2, ALO3. | 25 |
| 4 | Report: Which algorithm and database model do you propose to use? Explanation: See Section 4.4. Learning Outcomes: ALO1, ALO2. | 25 |