python代写:Assignment 3: Exploring the Twitterverse

Assignment 3: Exploring the Twitterverse

Purpose

The purpose of this assignment is to give you practise working with files, building and using dictionaries, designing functions, and writing unit tests.

Introduction: The Twitterverse

Twitter is a social networking website where users can post very short messages know as “tweets”. Each Twitter user can choose to “follow” other users, which means that they see those users’ tweets. A Twitter user sees the tweets of users they are “following”, and their tweets are seen by their “followers” (the users who follow them).

All the “follow” connections define a network among Twitter users, and it’s quite interesting to look for patterns in the connections. Tools like Twiangulate let you explore questions like “what connections do my two friends have in common?”. In this assignment, you’ll write a program that lets you ask questions (or “queries”) about a Twitter dataset.

Any tool for exploring the Twitterverse must get its data from Twitter itself. Twitter provides an API to allow programmers to write programs that interact with Twitter and extract data from it. In general, an API is a module that defines functions for accessing underlying data and performing other tasks without having to know how that data is actually stored and retrieved.

To make this assignment more manageable for you, we will assume that the information we need has already been extracted from Twitter and stored in a file.

How to tackle this assignment

This is your first experience designing a program of this size. We are providing detailed advice to help you break the task down into manageable pieces.

Make sure your twitterverse_functions.py module runs without error before submitting. If you have syntax errors in your module, comment them out before submitting, or we will not be able to test your functions!

The Twitter Data File

A Twitter data file contains a series of one or more user profiles, one after the other. Each user profile has the following elements, in this order:

  • A line containing a non-blank, non-empty username. You may assume that usernames are unique, that is, a single username will not occur more than once in the file, and that usernames do not contain any whitespace.
  • A line for the user’s actual name. If they did not provide a name, this line will be blank.
  • A line for the user’s location, or a blank line if they did not provide one.
  • A line for the URL of a website, or a blank line if they did not provide one.
  • Zero or more lines for the user’s bio, then a line with nothing but the keyword ENDBIO on it. This marks the end of the bio, and is not considered part of it. (You may assume that no bio has the string ENDBIO within it.) If the user did not provide a bio, the ENDBIO line will come immediately after the website line, with no blank line in between.
  • Zero or more lines each containing the username of someone that this user is following, then a line with the keyword END on it. (You may assume that no one has END as their username.) A user cannot be on his or her own following list. You may assume that every user on a following list has a user profile in the Twitter data file.

Notice that the keywords act as separators in this file. All of their letters are capitalized, and the keywords contain no punctuation.

Examples

Here is a sample user profile that might occur among many in a file:

tomCruise
Tom Cruise
Los Angeles, CA
http://www.tomcruise.com
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
ENDBIO
katieH
NicoleKidman
END

The file data.txt is a smallish example of a complete Twitter data file (and was made by hand) and the file rdata.txt is a much larger example (and is made from real data extracted from Twitter). These should help you confirm your understanding of the file format and will also be useful in testing your program.

Cycles in the data

Although a user cannot be on their own following or followers lists, there can be “loops” (we call them “cycles”) such as this: user A can be following B who is following A. This is the shortest possible cycle. Of course, cycles can be longer.

The Query File

Note that the word “query” just means “question”. In computer science, we use it to mean a request for information. For this assignment, a query will be provided in a file. Below we will review the high level parts of the query, look at an example, and then describe the format of the query file.

Overview

A query has three components: a search specification, a filter specification, and a presentation specification.

The search specification describes how to generate a list of Twitter usernames, starting with an initial username (a list of length one) and then finding their followers or people they are following, then people that are those people’s followers or who they are following, and so on. When processing the search specification, don’t try to do anything to avoid cycles. For instance, if the search specification says to find the people who user A is following, and from there the people they are following, you could find yourself back at user A. Don’t try to avoid that.

After processing the search specification, we have a list of Twitter usernames. Its length could be zero. For example, if the initial username is 'dianelynnhorton' and the search specification contains a single 'followers' keyword, then the length of the list will be zero if 'dianelynnhorton' has no followers.

The filter specification describes how to filter the list of usernames produced by the search specification. The filtering can be based on

  • whether or not they are following a particular user,
  • whether or not a particular user is their follower,
  • whether their name contains a particular string (case-insensitive), or
  • whether their location contains a particular string (case-insensitive).

After processing the filter specification, we have a possibly reduced list of usernames.

Once the search results have been found and filtered, the presentation specification describes how the output should be presented. It specifies on what basis the results should be sorted, and whether the results should be presented in a short or long format.

Example query

Here is an example query:

SEARCH
tomCruise
following
following
following
FILTER
following KatieH
location-includes USA
PRESENT
sort-by popularity
format long

The search specification in this particular query has four steps.

  1. Start with a list containing the username to start the search from; ie. ['tomCruise']. Let’s call that list L1.
  2. The search keyword 'following' says to replace each username p in L1 with the usernames of the users who p is following. This yields a new list, L2.
  3. For the next 'following' keyword, we start with L2 and repeat the same operation as in the previous step, yielding another list, L3.
  4. For the final 'following' keyword, we start with L3 and repeat that operation one last time, yielding list L4.

Notice that each step yields a list of zero or more usernames that is the input to the next step. There should be no duplicates in the final results list. Duplicates should be removed after each step.

The Twitter data file diagram_data.txt contains the follower/following relationships as represented by this diagram. For those relationships, the search specification above would yield this list of usernames: ['i', 'j', 'h', 'k', 'tomCruise']. Make sure that you can see how the four lists, ending with this final one, are generated. Notice that the final list contains the users you can get to in three “steps” of the “following” relationship, starting from 'tomCruise'.

The final list generated by the search specification becomes the input to the filter specification. For our current example, the filter specification says that the list should be filtered in this way: a user should be kept only if they are following 'KatieH' and has a location that includes the string 'USA'. The presentation specification says to present the results in long format and to order the users according to their popularity.

Now let’s look in detail at the exact format and meaning of a query.

Overall format of a query

The format of the query data file is as described below, in this order:

  • The keyword SEARCH alone on a line.
  • The search specification.
  • The keyword FILTER alone on a line.
  • The filter specification.
  • The keyword PRESENT alone on a line.
  • The presentation specification.

Notice that the keywords above all act as separators in this file and are in all capital letters. There are other keywords (such as following which can be part of a search specification) that are not separators; they are not capitalized.

Format and meaning of the search specification

The format of a search specification is as shown below, in this order:

  • A single username
  • Zero or more lines, in any order, each containing one of:
    • the keyword followers
    • the keyword following

Each step in the search specification involves modifying the current list of users by going through each user in the current list and finding their followers or who they are following. Importantly, at each step, we want to create the new results list by replacing each user from the current list with the users that they are following or who are their followers.

The possible search operations for a given user are:

  • followers operation: a list of all usernames for users who are following the given user (ie. that have the given username in the value list paired with the key'following' in their data in the Twitterverse dictionary).
  • following operation: a list of all usernames for users who the given user is following (ie. the value list paired with the key 'following' in the data associated with the given user in the Twitterverse dictionary).

Note that, if the list of usernames is ['A'], and we perform a 'followers' or 'following' operation, 'A' will not be in the resulting list, since users cannot follow themselves. If 'A' is one of multiple usernames in the current list, then 'A' can appear in the results list, but only if one of the other usernames produces 'A' from the search operation.

The final list of results should not contain any duplicates. It is most efficient to remove duplicates after each operation is performed.

Format and meaning of the filter specification

The format of a filter specification is:

  • Zero or more lines, in any order, each containing one of the following:
    • the keyword name-includes, a space, and a string to match
    • the keyword location-includes, a space, and a string to match
    • the keyword follower, a space, and a string which is a valid username
    • the keyword following, a space, and a string which is a valid username

Each of the lines has exactly one blank within it (the space separating the keyword from the next string), and each keyword will appear at most once.

The filters are “additive” in that each filter step should be applied to the list that resulted from the previous filter step. That is, you start with the list from the search specification, filter it based on the first filter step to get a new list L1, filter L1 in the second filter step to get L2, and so on.

The 'following' and 'follower' filters will keep users in the list if they are following a particular user, or if a particular user is their follower. These filters are defined as:

  • The 'following' filter keeps only users who have the provided username in their 'following' list.
  • The 'follower' filter keeps only users who appear in the 'following' list for the provided username.

The filters that have 'includes' as part of the keyword will do a simple substring search on the string representing the relevant data for the user in the Twitterverse dictionary. This means that if the given string occurs anywhere in a user’s name (for a 'name-includes filter') or a user’s location (for a 'location-includes' filter), then that user is to be kept in the list. The substring search should be case-insensitive. For example, if the filter specifies users whose locations include the string “USA”, then users with location “USA”, “Modesto, California, USA”, “kansas, usa” or “Musala, Bulgaria” would be kept, and users with location “United States of America” would be excluded. This is far from perfect, but don’t try to improve on it.

Format and meaning of the presentation specification

The output from your program must include every user whose username is still in the list after the search and filter specifications have been processed.

The presentation specification describes how the output should be presented. It consists of these two lines, in order:

  • The keyword 'sort-by', then a space, then one of these keywords: 'username', 'name', 'popularity'
  • Either 'format short' or 'format long'.

The users must be listed in the order indicated by the 'sort-by' portion of the presentation specification. Output that is sorted by username or name should be in alphabetical order, so that strings starting with ‘a’ are at the beginning of the output. Although usernames are unique, names are not. If any users have the same name, sort them by username. When sorting users by popularity, the most popular users should come first. A user’s popularity is defined to be the number of followers they have. (Not the number of people they are following!) If any users are tied for popularity, sort them by username.

If the presentation specification is for short format output, the output should consist of the usernames as a Python list of strings. If the presentation specification is for long format output, the output for each user should consist of a series of lines about the user, as shown in the example below. The long format output must include a line of ten dashes between each pair of users, as well as a line of dashes at the very top and very bottom of the output.

If there are no users to be presented (i.e., the 2nd parameter in get_present_string represents an empty list), then the long format output should just include two lines of dashes.

For the long format, the series of lines describing a single user should be formatted as shown below:

tomCruise
name: Tom Cruise
location: Los Angeles, CA
website: http://www.tomcruise.com
bio:
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
following: ['katieH', 'NicoleKidman']

Notice that:

  • The name, location, and website text is reproduced exactly as it was in the Twitter data file, with the addition of the labels at the start of the line.
  • The bio begins on the line after the line containing 'bio:', unlike the other components that appear on the same line as their “tag”.
  • The bio itself is also reproduced exactly as it was in the Twitter data file, including upper and lower case, newline characters, etc.
  • The “following” list is produced simply by converting the list of username strings to a string.

Full example

If the user were to enter the name of the Twitter data file data.txt and the query query3.txt, the entire interaction with the Twitterverse program should be:

Data file: data.txt
Query file: query3.txt
----------
PerezHilton
name: Perez Hilton
location: Hollywood, California
website: http://www.PerezH...
bio:
Perez Hilton is the creator and writer of one of the most famous websites
in the world. And he also loves music - a lot!
following: ['tomCruise', 'katieH', 'NicoleKidman']
----------
tomCruise
name: Tom Cruise
location: Los Angeles, CA
website: http://www.tomcruise.com
bio:
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
following: ['katieH', 'NicoleKidman']
----------

If the query were changed to request short-format output, the entire interaction with your program should be:

Data file: data.txt
Query file: query3.txt
['PerezHilton', 'tomCruise']

Required Functions

In the starter code file twitterverse_functions.py, complete the following function definitions. In addition, you must add some helper functions to aid with the implementation of these required functions. You will find more details on the Twitterverse dictionary, and query dictionary format in the next section, and in the starter code file.

Function name:
(Parameter types) -> Return type
Full Description (paraphrase for your docstring)
process_data:
(file open for reading) -> Twitterverse dictionary
The parameter represents a Twitter data file that is already open for reading. Read the file and return the data in the Twitterverse dictionary format.

Note: in the docstring, do not provide example calls for functions that read files.

process_query:
(file open for reading) -> query dictionary
The parameter represents a query file that is already open for reading. Read the file and return the query in the query dictionary format.

Note: in the docstring, do not provide example calls for functions that read files.

all_followers:
(Twitterverse dictionary, str) -> list of str
The first parameter represents the data in the Twitterverse dictionary, and the second parameter represents a username.

Identify all the usernames that are following the user specified by the second parameter and return them as a list. This is a helper function for the sorting code below. You are also free to call it in the code that you write too.

get_search_results:
(Twitterverse dictionary, search specification dictionary) -> list of str
The first parameter represents the data in the Twitterverse dictionary format, and the second parameter represents the search specification in the search specification dictionary format.

Perform the specified search on the given Twitter data, and return a list of strings representing usernames that match the search criteria. The parameters should not be modified by the function.

get_filter_results:
(Twitterverse dictionary, list of str, filter specification dictionary) -> list of str
The first parameter represents the data in the Twitterverse dictionary format, the second parameter represents a list of usernames, and the third parameter represents the filter specification in the filter specification dictionary format.

Apply the specified filters to the given username list to determine which usernames to keep, and return the resulting list of usernames. The parameters should not be modified by the function.

get_present_string:
(Twitterverse dictionary, list of str, presentation specification dictionary) -> str
The first parameter represents the data in the Twitterverse dictionary format, the second parameter represents a list of usernames, and the third parameter represents the presentation specification in the presentation specification dictionary format.

Format the results for presentation based on the given presentation specification and return the formatted string. The parameters should not be modified by the function.

 

Data Structures

To increase the readability of our code, we will use dictionaries to represent both the Twitter data and the query in our program. In this section, we’ll look at the format of the data and query dictionaries.

The Twitterverse Data Dictionary

The type of the Twitterverse dictionary is dict of {str: dict of {str: object}}. More specifically, each key in the Twitterverse dictionary is the username of a Twitter user, and the value associated with that key is a dictionary containing additional information about the user. Each of these inner dictionaries will contain the keys 'name', 'location', 'web', 'bio', and 'following'. The value associated with the 'following' key is a list of zero or more strings, and the values associated with the rest of the keys are strings (and may be the empty string).

For example, if the following user information is included in our Twitter data file…

tomCruise
Tom Cruise
Los Angeles, CA
http://www.tomcruise.com
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
ENDBIO
katieH
NicoleKidman
END

… then this key/value pair will be in our Twitterverse dictionary.

'tomCruise': {'name': 'Tom Cruise',
            'bio': 'Official TomCruise.com crew tweets. We love you guys!\nVisit us at Facebook!',
            'location': 'Los Angeles, CA',
            'web': 'http://www.tomcruise.com',
            'following': ['katieH', 'NicoleKidman']}

And another example: if the following user information is included in our Twitter data file (note the blank lines for name, location, and web, and zero lines for bio and following)…

quietTweeter



ENDBIO
END

… then this key/value pair will be in our Twitterverse dictionary.

'quietTweeter': {'name': '',
            'bio': '',
            'location': '',
            'web': '',
            'following': []}

Note that dictionaries are not ordered, so don’t worry if your dictionary output is in a different order, as long as the items are the same.

The Query Dictionary

We will also use dictionaries to represent queries in our program. The query dictionary contains three items; the key 'search', whose value is a dictionary representing the search specification, the key 'filter', whose value is a dictionary representing the filter specification, and the key 'present', whose value is a dictionary representing the presentation specification.

The search specification dictionary contains the keys 'username' and 'operations', whose values are the string representing the username to start at, and a list of strings representing the operations to perform, respectively. Note that the list that is the value associated with the key 'operations' may be empty, if there is only one line between the SEARCH and FILTER keywords in the query file. If there is only one line between SEARCH and FILTER, then it represents the username start at, and there are no operations to be performed in the search step. The operations must be performed in the order that they are listed in the file. (Note the change in results if the two operations are reversed in the Full Example using query3.txt.)

The filter specification is a dictionary with strings representing filter keywords as keys, and the strings to match as values. Note that this dictionary may be empty if there are no filters to apply, and that each filter keyword will appear at most once in the filter section of a query file.

The presentation specification is a dictionary with keys 'sort-by' and 'format', and string values representing the order to sort the results, and the format to display them in, respectively.

For example, if our query file looks like this…

SEARCH
tomCruise
following
following
following
FILTER
following KatieH
location-includes USA
PRESENT
sort-by popularity
format long

… then the query dictionary will be this (again, the order of items may vary).

{ 'search': {'username': 'tomCruise', 'operations': ['following', 'following', 'following']},
  'filter': {'following': 'KatieH', 'location-includes': 'USA'},
  'present': {'sort-by': 'popularity', 'format': 'long'}}

Note that there are other ways that we could have chosen to represent the data and the query in our program. However using dictionaries instead of, for example, lists, means that our code contains string literal keys that are descriptive of what is being represented. For example, using the format above, we will refer to the search specification in the query dictionary using query['search'], rather than query[0], as we might if we were using a list or tuple of query information.

The main program

Once you have correctly implemented the functions in twitterverse_functions.py, execution of the main program (twitterverse_program.py) will:

  1. Read a data file and produce the Twitterverse data dictionary.
  2. Read a query file and produce the query dictionary.
  3. Compute the results of the search operations.
  4. Apply the filters to the search results.
  5. Create the presentation string.

Required Testing (unittest)

Write (and submit) a set of unittests for the get_filter_results function. Name this file test_get_filter_results.py. For each test method, include a brief docstring description specifying what is being tested. For unittest methods, the docstring description should not include a type contract or example calls.

Files to Download

All of the files included in the download for the assignment are listed in this section. These files must all be placed in the same folder. Download the files by right-clicking on the link and using your web-browser’s “Save as” option. Do not use “cut and paste” as sometimes this can cause undesirable formatting characters to be added to our Python code.

Please download the Assignment 3 Files and extract the zip archive. The following paragraphs explain the files you have been given.

  • Python Code
    • The starter code for this assignment is in the file twitterverse_functions.py. Download this file and complete it.
    • Download the main program file twitterverse_program.py. It is complete and must not be changed.
    • Download the a3 type checker (more details below): a3_typechecker.py. It is complete and must not be changed.
    • The starter code for the required unittests is in the file test_get_filter_results.py. Download and complete the file.
  • Sample Twitter Data and Sample Queries
    • We have provided a number of text files containing Twitter data, and queries on that data. All files in the provided zip file that contain data in their filename are Twitter data files, and all files in the provided zip file that contain query in their filename are query files. Put these files in the same directory as your other A3 files.
    • The query files query1.txt, query2.txt, and query3.txt are queries on the data in data.txt, and the file query4.txt is a query on the data in rdata.txt. The files small_data.txt and typecheck_query.txt are for use with the a3_typechecker.py.

a3_typechecker.py

We are providing a type-check module that can be used to test whether your functions in twitterverse_functions.py have the correct parameter and return types. To use the type checker, place a3_typechecker.py in the same folder (directory) as your twitterverse_functions.py and run it.

If the type-checks pass: the output will tell you that the typechecker passed (and what it means for the typechecker to pass!). If the typechecker passes, then the parameters and return types match the assignment specification for each of the functions.

If the type-checks fail: Look carefully at the message provided. One or more of your parameter or return types does not match the assignment specification. Fix your code and re-run the tests. Make sure the tests pass before submitting.

We will do a much more thorough job of testing your code once you hand it in, so be sure that you have thoroughly tested it yourself. With the functions in this assignment, there are many more possible cases to test (and cases where your code could go wrong). If you want to get a great mark on the correctness of your functions, do a great job of testing your functions under all possible conditions. Then we won’t be able to find any errors that you haven’t already fixed!

Additional Requirements

  • Do not change any of the existing code. Add to it as specified in the comments.
  • Do not call print, input, or open in the code that you write. Notice that two of the required functions take an open file, not a string.
  • Do not use any break or continue statements. Any functions that do will receive a mark of zero. We are imposing this restriction (and we have not even taught you these statements) because they are very easy to “abuse,” resulting in terrible code.
  • Do not import any Python modules, other than unittest and twitterverse_functions.

Error checking

You may assume that the files actually contain a valid Twitter data file and a valid query file, respectively. You do not need to check either file to make sure that it is a valid Twitter or query file.

You may be wondering how you can be sure that the content of the Twitter data file is valid. For example, there are probably rules about legal Twitter usernames and there are certainly rules about valid URLs. Don’t confirm that the username, URL, or any other elements of the Twitter data file are valid. Simply read whatever string is in the relevant position in the file and use it as is.

Reading the Twitter data file

Typically, the easiest way to read through a data file is with a for-loop, but because of the structure of a Twitter data file, you will need to use while loops.

Watch out for how you handle \n (newline) characters. If you use readline to get a line of the file, it will have a \n on the end (except possibly on the last line in the file). You need to be aware of that before you try to compare it to keywords like END. On the other hand, there are times when you really do need a \n. For instance, when you store the contents of a user’s bio, it must include any newline characters that appear in the middle of the bio, so that you are able to display it in the same way when using the long format presentation.

Sorting your output

Part of processing the presentation specification requires you to sort the list of users according to username, name, or popularity. Since you will be sorting a list (of string usernames), you can use list.sort to accomplish the task. However, by default list.sort uses simple < to compare pairs of list elements. We know that <on strings compares them alphabetically. So list.sort will simply sort the strings into alphabetical order. In some cases, that is not the ordering you want.

Fortunately, there are ways to get around this. In the starter code for twitterverse_functions.py, we have provided our own implementation of a sorting function that has a parameter that you can use to control the ordering. All you have to do is send tweet_sort a function that takes two parameters of the same type as the elements in your list (in this case, the two parameters would be username strings) and returns:

  • -1 if the first one should appear before the second one in a sorted list
  • 1 if the first one should appear after the second one in a sorted list, and
  • 0 if they are “tied”, so that it doesn’t matter which one comes first.

We have also provided some functions that return -1, 1, or 0 for each possible sorting case.

The simple module sort_demo.py demonstrates how this can be done. It is a good example of passing a function to a function. We strongly recommend you spend some time understanding this code before you work on the sorting part of the presentation specification.

Other ways to work with functions

There will be other places where you can eliminate a great deal of duplication in your code by passing a function as an argument to a function.

Functions can be used similarly to variables. In particular, we can use them as values in dictionaries. You may find the example in the moduledict_of_funcs_demo.py useful in understanding how to call different functions in different cases, without having to write lengthy if/elif/else blocks.

Marking

These are the aspects of your work that we will focus on in the marking:

  • Correctness: Your code should meet the assignment requirements. Correctness, as measured by our tests, will count for the largest single portion of your marks.
  • Docstrings and comments: We want to see great docstrings and internal comments. See the Assignment Rules for more details.
  • Program design: Your program should be broken down into functions, both to avoid repetitive code and to make the program easier to read. Each function should have a coherent purpose. Functions should use parameters to get all the information they need from the caller, and should use a return value to give back any relevant information to the caller.
  • Programming style: Your variables names should be meaningful and your code as simple and clear as possible.
  • Formatting style: Make sure that you read the Assignment Rules page to review the rules for formatting your code.

What to Hand In

The very last thing you do before submitting should be to run the typechecker one last time. Otherwise, you could make a small error in your final changes before submitting that causes your code to receive zero for correctness.

You must hand in your work electronically, using the MarkUs online system. Instructions for doing so are posted on the Assignments page of the course website.

For this assignment, you will hand in two files:

  • The module twitterverse_functions.py.
    This should contain implementations of the required functions and all helper functions that you have written.
  • The module test_get_filter_results.py.
    This should contain your unittests for the get_filter_results function.

Once you have submitted, be sure to check that you have submitted the correct version; new or missing files will not be accepted after the due date. Remember that spelling of filenames, including case, counts. If your file is not named exactly as above, your code will receive zero for correctness.

Future Thoughts

To make your program realistic, the only piece that is missing is code to get the Twitter data directly from Twitter, rather than through a data file. As you do more computer science, you will get more experience using APIs, like the one used to make the connection to Twitter.

While the query format we used in this assignment is fairly general, there are many things you can’t express with our queries. Even the simple query “who do I follow who doesn’t follow me?” can’t be expressed because we have no way to negate. You may find it interesting to think about other sorts of queries that can’t be expressed. The format we invented for our queries defines a kind of formal “language”. (We call it a formal language to contrast with “natural” languages like Cantonese or French: a language that humans grow up speaking.) The most popular query language in computer science is called SQL. It is much more powerful than our query language, and you will learn about it if you take CSC343 (Introduction to Databases).