Project 1: MapReduce
We have a set of 5 documents containing HTML links to other documents. For example, a document, Doc1, may contain the word ‘manufacturer’ as a hyperlink to www.apple.com, the landing page of Apple, Inc. (the company). In this case we regard Doc1 as the source, the word ‘manufacturer’ as the anchor in Doc1, and the landing page (which we will call Apple Landing Page) as the target. So, we can represent this information as
Suppose you are given 5 documents which may contain hyperlinks to some targets. The MapReduce algorithm is used to compile a list of
Show the following:
• a.) Information stored on nodes n1, n2, n3 after Map has been executed
• b.) Information stored on nodes n4, n5 after Reduce has been executed
• c.) The final output
Also provide a diagram showing what information is sent to which node from which node.
You may assume that the 5 documents contain these and only these links on these anchors:
Doc1: Named Doc1, contains anchor ‘Columbia University’ which points to Columbia University Landing Page, anchor ‘SPS’ which points to the SPS Landing Page, anchor ‘NYU’ which points to NYU Landing Page, and ‘Columbia’ which points to Columbia University Landing Page
Doc2: Named Doc2, contains ‘Ivy League school’ which points to Columbia University Landing Page, ‘Apple’ which points to Apple Landing Page
Doc3: SPS Landing Page contains the anchor ‘the university’ which points to Columbia University Landing Page, the anchor ‘APAN’ which points to the Applied Analytics Program page
Doc4: Named Doc4, contains ‘the university’ which points to NYU Landing Page
Doc5: Named Doc5, contains ‘iOS’ which points to Apple Landing Page, and contains ‘windows’ which points to Microsoft Landing Page.
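The following is a minimal single-process sketch of how the Map, shuffle, and Reduce steps could be simulated for the links listed above. Several things here are assumptions and not part of the assignment text: that the job groups (anchor, source) pairs by target, that Doc1 and Doc2 sit on n1, Doc3 and Doc4 on n2, and Doc5 on n3, and that targets are routed to n4 or n5 with a CRC32-based partitioner. The function names are hypothetical.

```python
import zlib
from collections import defaultdict

# (anchor, target) pairs per source, transcribed from the document list.
# Doc3 is the SPS Landing Page, so that name is used as its source name.
LINKS = {
    "Doc1": [("Columbia University", "Columbia University Landing Page"),
             ("SPS", "SPS Landing Page"),
             ("NYU", "NYU Landing Page"),
             ("Columbia", "Columbia University Landing Page")],
    "Doc2": [("Ivy League school", "Columbia University Landing Page"),
             ("Apple", "Apple Landing Page")],
    "SPS Landing Page": [("the university", "Columbia University Landing Page"),
                         ("APAN", "Applied Analytics Program page")],
    "Doc4": [("the university", "NYU Landing Page")],
    "Doc5": [("iOS", "Apple Landing Page"),
             ("windows", "Microsoft Landing Page")],
}

# ASSUMPTION: which documents each map node holds.
PLACEMENT = {"n1": ["Doc1", "Doc2"],
             "n2": ["SPS Landing Page", "Doc4"],
             "n3": ["Doc5"]}
REDUCERS = ["n4", "n5"]


def map_phase(placement):
    """Each map node emits (target, (anchor, source)) for its documents."""
    return {node: [(target, (anchor, source))
                   for source in docs
                   for anchor, target in LINKS[source]]
            for node, docs in placement.items()}


def shuffle(mapped, reducers):
    """Route every pair to one reducer; all pairs for a target land together."""
    inbox = {r: defaultdict(list) for r in reducers}
    for pairs in mapped.values():
        for target, value in pairs:
            # ASSUMPTION: a deterministic CRC32 partitioner picks the reducer.
            idx = zlib.crc32(target.encode("utf-8")) % len(reducers)
            inbox[reducers[idx]][target].append(value)
    return inbox


def reduce_phase(inbox):
    """Each reduce node sorts the (anchor, source) list of every target it owns."""
    return {r: {t: sorted(v) for t, v in groups.items()}
            for r, groups in inbox.items()}


mapped = map_phase(PLACEMENT)                      # part a.)
reduced = reduce_phase(shuffle(mapped, REDUCERS))  # part b.)
final = {t: v for groups in reduced.values() for t, v in groups.items()}  # part c.)

for target in sorted(final):
    print(target, "<-", final[target])
```

Printing `mapped` and `reduced` gives the per-node contents asked for in parts a.) and b.) under the assumed placement; a different placement or partitioner changes which node holds which pairs but not the final output.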
Output:
Code in a Word file
Project 2
Create an application which uses the Twitter Search API (https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html) to ingest tweets about the following list of celebrities from April 1 to the day before the project is due. Your application should be able to determine which of these celebrities is the most tweeted about and, for that most tweeted-about celebrity, which user has tweeted the most about them in this time frame.
Celebrities:
• Katy Perry (@katyperry)
• Justin Bieber (@justinbieber)
• Barack Obama (@BarackObama)
• Rihanna (@rihanna)
• Taylor Swift (@taylorswift13)
• Lady Gaga (@ladygaga)
• Ellen DeGeneres (@TheEllenShow)
• Cristiano Ronaldo (@cristiano)
• Justin Timberlake (@timberlake)
• Ariana Grande (@ArianaGrande)
Output:
• Describe the architecture of your system (which components, how they interact, what the inputs and outputs of each component are, what type of database is used) and your justification for this architecture for solving the problem specified. (in Chinese)
• Hadoop application. (code in Word file)
• The query (or queries) to compute which celebrity has been the most tweeted about in the specified time frame. (code in Word file)
• The query (or queries) to compute, for the most tweeted-about celebrity, who has tweeted the most about that celebrity in the specified time frame. (code in Word file)
• The results of executing these queries. Additionally, in a table, show the total number of tweets for each of these celebrities. (in Word file)
Additionally, provide a link to the Google Cloud account in which the application is running, along with the credentials needed to access that account.
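The logic behind the two queries in the output list above can be sketched in plain Python before it is ported to the chosen database or Hadoop job. This is only an illustration under stated assumptions: tweets are taken to be already ingested as records with hypothetical `user` and `text` fields, a tweet counts as being "about" a celebrity when it @-mentions that celebrity's handle, and the sample records are made up for demonstration and are not real results.

```python
import re
from collections import Counter

# Handles taken from the celebrity list in the assignment.
HANDLES = ["katyperry", "justinbieber", "BarackObama", "rihanna",
           "taylorswift13", "ladygaga", "TheEllenShow", "cristiano",
           "timberlake", "ArianaGrande"]
LOOKUP = {h.lower(): h for h in HANDLES}
MENTION = re.compile(r"@(\w+)")


def mentions(text):
    """Return the set of listed handles mentioned in a tweet (case-insensitive)."""
    found = {m.lower() for m in MENTION.findall(text)}
    return {LOOKUP[m] for m in found if m in LOOKUP}


def tweets_per_celebrity(tweets):
    """Count tweets per celebrity; a tweet counts once per celebrity it mentions."""
    counts = Counter({h: 0 for h in HANDLES})
    for tweet in tweets:
        counts.update(mentions(tweet["text"]))
    return counts


def top_tweeter(tweets, handle):
    """Return (user, tweet_count) for the user who mentions `handle` most often."""
    users = Counter(t["user"] for t in tweets if handle in mentions(t["text"]))
    return users.most_common(1)[0] if users else None


# Made-up sample records, for illustration only.
SAMPLE = [
    {"user": "fan_a", "text": "Loved the new single @rihanna"},
    {"user": "fan_a", "text": "@Rihanna and @ladygaga on one stage?"},
    {"user": "fan_b", "text": "Great goal by @cristiano"},
    {"user": "fan_b", "text": "@rihanna tour dates when"},
    {"user": "fan_a", "text": "no mentions here"},
]

counts = tweets_per_celebrity(SAMPLE)
most_tweeted, total = counts.most_common(1)[0]
print(most_tweeted, total)                 # most tweeted-about celebrity
print(top_tweeter(SAMPLE, most_tweeted))   # who tweeted most about them
for handle in HANDLES:                     # per-celebrity totals for the table
    print(handle, counts[handle])
```

Matching on @-mentions is one possible definition of "tweeted about"; matching on the celebrity's name or on hashtags would change the counts and should be stated in the architecture description if used.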