assignment-2
October 26, 2018
1 Assignment 2
1.1 Problem A (25 points)
A1. (5 points) Get the real-time schedule for Suburban Station using SEPTA’s API (as shown in the example in Section 3.2.2 in the lecture notes, use the correct station code) and put this in a list.
Note: Your output should look something like:
schedule = [["EMG=' No Emg Message"], ['R4S=10:55',
'Airport', '3B', ' 3 LATE', 'LOCAL ', '427 ', '<_NEXT_MSG>11:25', 'Airport',
'3B', 'ON TIME', 'LOCAL ', '429 ', '<_NEXT_MSG>11:55', 'Airport', '3B', 'ON TIME', 'LOCAL ', '431 ', '<_NEXT_MSG>12:25', 'Airport', '3B', 'ON TIME', 'LOCAL ', '433 ', ''], ...]
In [3]: import csv
        import pprint

        import requests

        schedule_response = requests.get("http://www3.septa.org/ccstations/ss/sched_data.csv")
        schedule_text = schedule_response.text.strip().split("\n")
        schedule_reader = csv.reader(schedule_text)
        schedule = list(schedule_reader)
        pprint.pprint(schedule)
[["EMG=' No Emg Message"],
 ['R4S=06:25', 'Airport', '5B', 'ON TIME', 'LOCAL ', '1457 ',
  '<_NEXT_MSG>06:26', '30th St', '4B', ' 7 LATE', 'LOCAL ', '6455 ',
  '<_NEXT_MSG>06:55', 'Airport', '3B', 'ON TIME', 'LOCAL ', '8459 ',
  '<_NEXT_MSG>07:25', 'Airport', '3B', 'ON TIME', 'LOCAL ', '461 ',
  ''],
 ['R4N=06:35', 'Glenside', '2A', 'ON TIME', 'LOCAL ', '454 ',
  '<_NEXT_MSG>07:05', 'Warminster', '1A', 'ON TIME', 'LOCAL ', '456 ',
  '<_NEXT_MSG>07:35', 'Glenside', '2A', 'ON TIME', 'LOCAL ', '458 ',
  '<_NEXT_MSG>WEEKDAY BUSING BTWN WARMINSTER ', '& GLENSIDE DURING MIDDAY',
  ''],
 ['R2S=06:41', 'Newark', '3A', ' 1 LATE', 'LOCAL ', '5265 ',
  '<_NEXT_MSG>07:11', '30th St', '3A', 'ON TIME', 'LOCAL ', '6267 ',
  '<_NEXT_MSG>07:40', 'Wilmington', '3A', 'ON TIME', 'LOCAL ', '4269 ',
  '<_NEXT_MSG>08:06', '30th St', '4A', 'ON TIME', 'LOCAL ', '6271 ',
  ''],
 ['R3N=06:39', 'West Trenton', '2A', 'ON TIME', 'LOCAL ', '2386 ',
  '<_NEXT_MSG>06:46', 'Suburban Sta', 'O', 'ON TIME', 'LOCAL ', '1386 ',
  '<_NEXT_MSG>07:22', 'West Trenton', '2A', ' 2 LATE', 'LOCAL ', '388 ',
  '<_NEXT_MSG>FROM MON OCT 15 WKDY WTR TRAINS ', 'DEPARTS 10 MIN EARLY 9:05A-2:05P',
  ''],
 ['R3S=06:35', 'Elwyn', '4A', 'ON TIME', 'EXP TO FERNWOOD-YEADON ', '2371 ',
  '<_NEXT_MSG>07:28', 'Elwyn', '4A', ' 3 LATE', 'LOCAL ', '373 ',
  '<_NEXT_MSG>08:31', 'Elwyn', '3A', 'ON TIME', 'LOCAL ', '383 ',
  '<_NEXT_MSG>09:16', 'Elwyn', '3A', 'ON TIME', 'LOCAL ', '387 ',
  ''],
 ['R5N=07:00', 'Doylestown', '1B', ' 3 LATE', 'LOCAL ', '584 ',
  '<_NEXT_MSG>07:20', 'Lansdale', '1B', 'ON TIME', 'LOCAL ', '586 ',
  '<_NEXT_MSG>07:50', 'Doylestown', '1B', 'ON TIME', 'LOCAL ', '588 ',
  '<_NEXT_MSG>08:50', 'Doylestown', '1B', 'ON TIME', 'LOCAL ', '592 ',
  ''],
 ['R5S=06:25', 'Malvern', '4B', ' 3 LATE', 'LOCAL ', '573 ',
  '<_NEXT_MSG>06:52', 'Thorndale', '4B', ' 1 LATE', 'LOCAL ', '575 ',
  '<_NEXT_MSG>07:10', 'Thorndale', '4B', 'ON TIME', 'LOCAL ', '577 ',
  '<_NEXT_MSG>07:45', 'Malvern', '4B', 'ON TIME', 'LOCAL ', '579 ',
  ''],
 ['R2N=06:42', 'Norristown', '1A', '15 LATE', 'LOCAL ', '264 ',
  '<_NEXT_MSG>06:56', 'Temple U', '2A', ' 1 LATE', 'EXP ', '9294 ',
  '<_NEXT_MSG>07:37', 'Norristown', '1A', 'ON TIME', 'LOCAL ', '8266 ',
  '<_NEXT_MSG>07:45', 'Temple U', '1A', 'ON TIME', 'LOCAL ', '9268 ',
  ''],
 ['R0=06:31', 'Cynwyd', '6B', 'ON TIME', 'LOCAL ', '1087 ',
  '<_NEXT_MSG>07:15', 'Cynwyd', '6B', 'ON TIME', 'LOCAL ', '1089 ',
  '<_NEXT_MSG>08:00', 'Cynwyd', '6B', 'ON TIME', 'LOCAL ', '1091 ',
  '<_NEXT_MSG>', '', '', '', '', '', ''],
 ['R7N=06:45', 'Temple U', '2B', ' 4 LATE', 'LOCAL ', '9766 ',
  '<_NEXT_MSG>07:03', 'Chestnut H East', '2B', 'ON TIME', 'LOCAL ', '8764 ',
  '<_NEXT_MSG>07:23', 'Chestnut H East', '1B', ' 1 LATE', 'LOCAL ', '768 ',
  '<_NEXT_MSG>08:02', 'Temple U', '2B', 'ON TIME', 'LOCAL ', '9770 ',
  ''],
 ['R7S=06:31', 'Trenton', '4A', ' 5 LATE', 'EXP TO HOLMESBURG ', '3755 ',
  '<_NEXT_MSG>06:48', 'Trenton', '4A', ' 3 LATE', 'LOCAL ', '757 ',
  '<_NEXT_MSG>07:19', 'Trenton', '4A', 'ON TIME', 'LOCAL ', '759 ',
  '<_NEXT_MSG>Inbound delays btwn 5-15 mins. ', 'from 8PM Fri til end of svc. Sun',
  ''],
 ['R8N=07:23', 'Fox Chase', '1B', 'ON TIME', 'LOCAL ', '6856 ',
  '<_NEXT_MSG>08:16', 'Fox Chase', '1B', 'ON TIME', 'LOCAL ', '860 ',
  '<_NEXT_MSG>09:23', 'Fox Chase', '1B', 'ON TIME', 'LOCAL ', '864 ',
  '<_NEXT_MSG>10:30', 'Fox Chase', '1B', 'ON TIME', 'LOCAL ', '866 ',
  ''],
 ['R8S=06:44', 'Chestnut H West', '3B', 'ON TIME', 'LOCAL ', '845 ',
  '<_NEXT_MSG>07:15', 'Chestnut H West', '3B', 'ON TIME', 'LOCAL ', '851 ',
  '<_NEXT_MSG>07:49', 'Chestnut H West', '3B', 'ON TIME', 'LOCAL ', '853 ',
  '<_NEXT_MSG>Inbound delays btwn 5-15 mins. ', 'from 8PM Fri til end of svc. Sun',
  ''],
 ['CC-N=06:35', 'Glenside', '2A', 'ON TIME', 'LOCAL ', '454 ',
  '<_NEXT_MSG>06:39', 'West Trenton', '2A', 'ON TIME', 'LOCAL ', '2386 ',
  '<_NEXT_MSG>06:42', 'Norristown', '1A', '15 LATE', 'LOCAL ', '264 ',
  '<_NEXT_MSG>06:45', 'Temple U', '2B', ' 4 LATE', 'LOCAL ', '9766 ',
  ''],
 ['CC-S=06:25', 'Airport', '5B', 'ON TIME', 'LOCAL ', '1457 ',
  '<_NEXT_MSG>06:25', 'Malvern', '4B', ' 3 LATE', 'LOCAL ', '573 ',
  '<_NEXT_MSG>06:26', '30th St', '4B', ' 7 LATE', 'LOCAL ', '6455 ',
  '<_NEXT_MSG>06:31', 'Cynwyd', '6B', 'ON TIME', 'LOCAL ', '1087 ',
  ''],
 ['SERVICE=For schedule and travel information go to SEPTA.org'],
 ['TIMESTAMP=10/24/2018 18:24:41 PM']]
A2. (10 points) Investigate this list by printing it out. Extract three pieces of information for each train: its scheduled arrival time, destination, and its lateness/timeliness status. Store these in a list that looks like the following.
[['01:55', 'Airport', ' 1 LATE'],
 ['02:25', 'Airport', 'ON TIME'],
 ['02:55', 'Airport', 'ON TIME'],
 ['03:25', 'Airport', 'ON TIME'],
 ['02:05', 'Warminster', ' 1 LATE'],
 ['02:35', 'Glenside', 'ON TIME'],
 ['03:05', 'Warminster', 'ON TIME'],
 ['02:37', 'Marcus Hook', 'ON TIME'],
 ['03:10', 'Newark', 'ON TIME'],
 ['03:14', 'Wilmington', 'ON TIME'],
 ['04:09', 'Newark', 'ON TIME']]
Note: If you’re still working on A1 you can earn full credit for this part by operating on the sample output for A1.
[HINT: You will need to use regular expressions to extract the time. Each train is on a separate newline, and a variable number of train information is reported on each line. Consider using the modulus operator (%), which provides the remainder when one number is divided by another: remainder = numerator % denominator. Each of the variable number of trains takes up a fixed number of columns.]
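The modulus idea in the hint can be sketched as follows. This is only an illustration under stated assumptions, not the expected solution: the row is abridged from the A1 sample output, `FIELDS_PER_TRAIN = 6` is read off that sample, and the helper name `trains_in_row` is made up for this example.

```python
import re

# One route's row, abridged from the A1 sample output.
row = ['R4S=10:55', 'Airport', '3B', ' 3 LATE', 'LOCAL ', '427 ',
       '<_NEXT_MSG>11:25', 'Airport', '3B', 'ON TIME', 'LOCAL ', '429 ', '']

# Assumption read off the sample: each train occupies six columns
# (time, destination, track, status, service type, train number).
FIELDS_PER_TRAIN = 6


def trains_in_row(row):
    """Return [time, destination, status] for each complete train in a row."""
    trains = []
    current = []
    for index, field in enumerate(row):
        position = index % FIELDS_PER_TRAIN  # column within the current train
        if position == 0:
            # The time follows '=' (first train) or '>' (later trains).
            match = re.search(r'[>=](\d{2}:\d{2})$', field)
            if match is None:
                break  # a service message or trailing empty field, not a train
            current = [match.group(1)]
        elif position == 1:
            current.append(field)  # destination
        elif position == 3:
            current.append(field)  # lateness status
            trains.append(current)
    return trains


print(trains_in_row(row))
```

Rows that carry no train at all, such as the `EMG=` line, fall through the first check and yield an empty list.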
In [14]: # QUESTION: Do we have to include next message?
         import re

         # Use regex to extract the time
         def extract_time(s):
             pattern = re.compile(r'[>=]\d{2}:\d+')
             match = pattern.search(s)
             if match:
                 return match.group().strip('>=')
             else:
                 return None

         # Extract three pieces of information for each train: its scheduled
         # arrival time, destination, and lateness status
         def A2(msg):
             result = list()
             # Each train occupies six consecutive fields of the row
             for i in range(0, len(msg), 6):
                 if i + 6 > len(msg):
                     break
                 else:
                     train_info = msg[i : (i + 6)]
                     extract_each_train = list()
                     extract_each_train.append(extract_time(train_info[0]))
                     extract_each_train.append(train_info[1])
                     extract_each_train.append(train_info[3])
                     result.append(extract_each_train)
             # Print once, after the loop, rather than on every iteration
             pprint.pprint(result)
             return result

         # Note: only the second row of the schedule (route R4S) is processed here
         extracted_train_info = A2(schedule[1])

[['06:25', 'Airport', 'ON TIME'],
 ['06:26', '30th St', ' 7 LATE'],
 ['06:55', 'Airport', 'ON TIME'],
 ['07:25', 'Airport', 'ON TIME']]
A3. (5 points) Create a new list and use dateutil.parser (Section 4.4.4) to convert the time values into datetime objects. Store all three values for each train in the new list. Sort the list according to arrival time.
Note: If you’re still working on A2 you can earn full credit for this part by operating on the sample output (above).
In [17]: # QUESTION: why I just only has these four data? I mean it should be 7 data.
         from dateutil import parser

         def A3(info):
             new_info = []
             for u in info:
                 # Replace the time string with a datetime; keep destination and status
                 _u = [parser.parse(u[0])]
                 _u += u[1:]
                 new_info.append(_u)
             # Sort by arrival time; the key must use the lambda's own argument
             new_info.sort(key=lambda _u: _u[0])
             pprint.pprint(new_info)
             return new_info

         A3_extracted_train_info = A3(extracted_train_info)

[[datetime.datetime(2018, 10, 24, 6, 25), 'Airport', 'ON TIME'],
 [datetime.datetime(2018, 10, 24, 6, 26), '30th St', ' 7 LATE'],
 [datetime.datetime(2018, 10, 24, 6, 55), 'Airport', 'ON TIME'],
 [datetime.datetime(2018, 10, 24, 7, 25), 'Airport', 'ON TIME']]
BONUS. (10 points) Notice that one crucial piece of information missing from the arrival times is AM/PM information. This leads dateutil.parser to treat the 12-hour format timestrings as 24-hour format timestrings.
To solve this problem, utilize tools from the datetime module. Go through the original list created in A1 in a loop and use the datetime module to convert the timestrings into datetime objects containing the correct date and 24-hour time. Put these new arrival times, the destination, and lateness information in a new list.
Note: This part does not depend on A3. You can earn full credit for this part by operating on the sample output for A2.
[HINT: You can use the current system time and the fact that the schedule information only contains trains arriving in the next few hours to fix the AM/PM problem.]
In [ ]: from datetime import datetime
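One possible reading of the hint, sketched under assumptions: build the AM and PM candidates for today and tomorrow, then keep the earliest one that is not clearly in the past. The function name `resolve_arrival` and the 30-minute `grace` window (to tolerate trains that are slightly overdue) are hypothetical choices for this illustration, not part of the assignment.

```python
from datetime import datetime, timedelta


def resolve_arrival(timestr, now, grace=timedelta(minutes=30)):
    """Turn a 12-hour 'HH:MM' string into the nearest upcoming datetime."""
    hour, minute = (int(part) for part in timestr.split(':'))
    hour %= 12  # treat 12:xx as 0:xx so that adding 12 hours yields noon
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # AM and PM readings for today and for tomorrow.
    candidates = [
        midnight + timedelta(days=day, hours=hour + half, minutes=minute)
        for day in (0, 1)
        for half in (0, 12)
    ]
    # Keep the earliest reading that is not more than `grace` in the past.
    return min(c for c in candidates if c >= now - grace)


# Reference time taken from the TIMESTAMP row of the A1 output.
now = datetime(2018, 10, 24, 18, 24, 41)
print(resolve_arrival('06:25', now))
```

In real use `now` would come from `datetime.now()`; it is fixed here so that the example is reproducible.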
A4. (5 points) Create hourly log files with train information in data/trains/ named with the appropriate timestamp containing date and hour, so that when sorted by name, they are also sorted chronologically. The files should contain the 24-hour format arrival time, destination, and lateness for trains scheduled to arrive in that hour, with one train per line.
For example, some of the lines from a log file for 7 PM would look like this:
19:11, 30th St, ON TIME
19:15, Cynwyd, ON TIME
19:23, Fox Chase, ON TIME
Note: Even if you don’t complete the BONUS, this part does not depend on it. You can again earn full credit for this part by operating on the sample output for A2.
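A file name that sorts chronologically can be built by writing the date and hour from most to least significant unit. The sketch below groups trains by such a name without touching the disk; the helper names, the `%Y-%m-%d-%H` pattern, the date, and the 20:02 train are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def log_name(arrival):
    """Year-month-day-hour ordering makes name order equal time order."""
    return arrival.strftime('data/trains/%Y-%m-%d-%H.log')


def lines_by_log(trains):
    """Map each hourly log file name to the lines it should contain."""
    logs = defaultdict(list)
    for arrival, destination, status in trains:
        line = '{}, {}, {}'.format(
            arrival.strftime('%H:%M'), destination, status.strip())
        logs[log_name(arrival)].append(line)
    return dict(logs)


# Illustrative data: the first two trains echo the 7 PM example above,
# the date and the third train are made up.
trains = [
    [datetime(2018, 10, 24, 19, 11), '30th St', 'ON TIME'],
    [datetime(2018, 10, 24, 19, 15), 'Cynwyd', 'ON TIME'],
    [datetime(2018, 10, 24, 20, 2), 'Airport', ' 3 LATE'],
]

for name, lines in sorted(lines_by_log(trains).items()):
    print(name, lines)
```

Writing the files then reduces to creating `data/trains/` once and appending each list of lines to its file.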
In [ ]: import os
        from dateutil import parser

        def check_file_exist(filename):
            # Create an empty file (and its directory) if it does not exist yet
            if os.path.exists(filename):
                return
            else:
                os.makedirs(os.path.dirname(filename), exist_ok=True)
                with open(filename, 'w') as f:
                    f.write('')

        def A4(train_infos):
            for u in train_infos:
                print(u)
                arrival_datetime = parser.parse(u[0])
                print(arrival_datetime)
                # Keep the date and hour only, e.g. "2018-10-24 06"
                filename = 'data/trains/' + str(arrival_datetime).split(':')[0] + '.log'
                print(filename)
                check_file_exist(filename)
                with open(filename, 'a+') as f:
                    f.write(', '.join(u) + '\n')

        A4(extracted_train_info)
1.2 Problem B (25 points)
B1. (10 points) For this problem, you will be using the Sportradar Soccer v3 API. Sign up for a Sportradar developer account and obtain an API key for the “Soccer Europe Trial v3” API. Use this key to download the results of matches in JSON format from the current season of the English Premier League. Consult the API documentation in order to construct the correct request for getting tournament results. You will also need the tournament ID for the Premier League competition in England (coverage table). Examine the API response.
Note: the data for a single match in the json response should look something like:
results['results'][0] = {
    'sport_event': {
        'competitors': [{'abbreviation': 'MUN',
                         'country': 'England',
                         'country_code': 'ENG',
                         'id': 'sr:competitor:35',
                         'name': 'Manchester United',
                         'qualifier': 'home'},
                        {'abbreviation': 'LEI',
                         'country': 'England',
                         'country_code': 'ENG',
                         'id': 'sr:competitor:31',
                         'name': 'Leicester City',
                         'qualifier': 'away'}],
        'id': 'sr:match:14735957',
        'scheduled': '2018-08-10T19:00:00+00:00',
        'season': {'end_date': '2019-05-13',
                   'id': 'sr:season:54571',
                   'name': 'Premier League 18/19',
                   'start_date': '2018-08-10',
                   'tournament_id': 'sr:tournament:17',
                   'year': '18/19'},
        'start_time_tbd': False,
        'tournament': {'category': {'country_code': 'ENG',
                                    'id': 'sr:category:1',
                                    'name': 'England'},
                       'id': 'sr:tournament:17',
                       'name': 'Premier League',
                       'sport': {'id': 'sr:sport:1', 'name': 'Soccer'}},
        'tournament_round': {'number': 1,
                             'phase': 'regular season',
                             'type': 'group'},
        'venue': {'capacity': 75635,
                  'city_name': 'Manchester',
                  'country_code': 'ENG',
                  'country_name': 'England',
                  'id': 'sr:venue:9',
                  'map_coordinates': '53.463150,-2.291444',
                  'name': 'Old Trafford'}},
    'sport_event_status': {
        'away_score': 1,
        'home_score': 2,
        'match_status': 'ended',
        'period_scores': [{'away_score': 0,
                           'home_score': 1,
                           'number': 1,
                           'type': 'regular_period'},
                          {'away_score': 1,
                           'home_score': 1,
                           'number': 2,
                           'type': 'regular_period'}],
        'status': 'closed',
        'winner_id': 'sr:competitor:35'}}
In [ ]:
B2. (5 points) The goal of this problem is to construct a standings table. Your final output should look like the table stored in data/sample-table.txt:
In [19]: print(open("data/sample-table.txt", "r").read())
Team  G  W  D  L  GF  GA   GD   P
FCB   6  4  1  1  17   7   10  13
MAD   6  4  1  1  12   6    6  13
ATM   6  3  2  1   8   4    4  11
SFC   6  3  1  2  13   6    7  10
ALA   5  3  1  1   8   5    3  10
ESP   6  3  1  2   6   4    2  10
CEL   6  2  3  1  11   9    2   9
VIL   6  2  2  2   5   3    2   8
RSO   6  2  2  2   9   9    0   8
GIR   5  2  2  1   7   8   -1   8
GET   5  2  1  2   4   4    0   7
EIB   6  2  1  3   5   7   -2   7
ATH   5  1  3  1   7   9   -2   6
RBB   5  1  3  1   3   5   -2   6
VCF   6  0  5  1   4   6   -2   5
LUD   5  1  1  3   8  11   -3   4
LEG   6  1  1  4   6  11   -5   4
RVC   5  1  1  3   5  12   -7   4
HUE   6  1  1  4   6  16  -10   4
RVA   5  0  3  2   3   5   -2   3
Here, the fields are: abbreviation of team name, games played, games won, games drawn, games lost, goals for, goals against, goal difference, and total points. Important: in soccer leagues, winning a match results in 3 points for the winner and 0 points for the loser. Drawing results in 1 point for each team. Goal difference is the difference between goals for and goals against.
First, collect the required information from each match result and store these individual result records in one data structure.
Note: If you are still working on B1 you can earn full credit by working with B1’s sample output.
In [ ]:
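As a rough sketch of what a per-match record could look like, the example below pulls the team abbreviations and final scores out of a match shaped like the B1 sample. The `match` dictionary is abridged from that sample; the function name `match_record` and the keys of the returned record are hypothetical.

```python
# Abridged from the B1 sample output; only the fields used here are kept.
match = {
    'sport_event': {
        'competitors': [
            {'abbreviation': 'MUN', 'qualifier': 'home'},
            {'abbreviation': 'LEI', 'qualifier': 'away'},
        ],
    },
    'sport_event_status': {'home_score': 2, 'away_score': 1, 'status': 'closed'},
}


def match_record(match):
    """Reduce one API match entry to teams and goals (hypothetical layout)."""
    teams = {c['qualifier']: c['abbreviation']
             for c in match['sport_event']['competitors']}
    status = match['sport_event_status']
    return {
        'home': teams['home'],
        'away': teams['away'],
        'home_goals': status['home_score'],
        'away_goals': status['away_score'],
    }


print(match_record(match))
```

Looking teams up by their `qualifier` avoids relying on the order of the `competitors` list.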
B3. (5 points) Now, using the stored data for each match, construct aggregate result data for each team and store this in a list.
In [ ]:
B4. (5 points) In the English Premier League, standings are calculated on the basis of points earned. Sort (Section 2.1.1.7) the standings list by the number of points earned. Print the standings out in the format shown in B2.
Note: If you are still working on B3 you can use (load) the sample standings table (as a list of lists) to practice sorting this kind of data structure.
In [ ]:
BONUS. (5 points) If two teams have the same points, the team with the higher goal difference is ranked higher. If the goal differences are also same, the team with the higher number of goals scored is ranked higher. Perform the sort considering all three factors (points, goal difference, goals for).
Note: If you are still working on B3 you can use (load) the sample standings table (as a list of lists) to practice sorting this kind of data structure.
In [ ]:
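Sorting on several criteria at once can be done with a tuple key, since Python compares tuples element by element. The sketch below uses a few rows copied from data/sample-table.txt in shuffled order; the column positions follow the header shown in B2, and the variable names are illustrative.

```python
# Rows from the sample table: Team, G, W, D, L, GF, GA, GD, P.
table = [
    ['RBB', 5, 1, 3, 1, 3, 5, -2, 6],
    ['GIR', 5, 2, 2, 1, 7, 8, -1, 8],
    ['MAD', 6, 4, 1, 1, 12, 6, 6, 13],
    ['ATH', 5, 1, 3, 1, 7, 9, -2, 6],
    ['VIL', 6, 2, 2, 2, 5, 3, 2, 8],
    ['FCB', 6, 4, 1, 1, 17, 7, 10, 13],
    ['RSO', 6, 2, 2, 2, 9, 9, 0, 8],
]

# Points first, then goal difference, then goals for; all descending.
ranked = sorted(table, key=lambda row: (row[8], row[7], row[5]), reverse=True)

print('Team G W D L GF GA GD P')
for row in ranked:
    print(' '.join(str(value) for value in row))
```

ATH and RBB tie on both points and goal difference, so their order is decided by goals for (7 against 3).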
1.3 Problem C (20 points)
C1. (5 points) Read the text from data/tempest.txt. Investigate the text and split it up into scenes using a pattern that matches the scene headers as a delimiter in the re.split() function (Section 4.4.1.2). Also, explain when the first element returned by re.split() function is undesirable in the response box below. [Hint: make sure that your regular expression captures your delimiters (scene headings) with grouping parentheses: – re.split(“(delim)”, test)]
Your output should look like:
scene_texts = ['\ufeffThis Etext file is presented by Project Gutenberg, in\ncooperation...', ' "\n\nOn a ship at sea; ...\n\n MASTER. Boatswain!\n BOATSWAIN. Here, master; w
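To see what the capturing group changes, here is a small sketch on a made-up string (not the actual play text). With the group, re.split() keeps each matched heading in the result, and the first element is whatever precedes the first heading.

```python
import re

# Made-up miniature text standing in for data/tempest.txt.
text = "Front matter\nSCENE 1\nOn a ship at sea\nSCENE 2\nThe island\n"

parts = re.split(r"(SCENE \d+)", text)
# parts alternates: preamble, heading, body, heading, body, ...

preamble, rest = parts[0], parts[1:]
# Pair each heading with the text that follows it.
scenes = dict(zip(rest[0::2], rest[1::2]))

print(repr(preamble))
print(scenes)
```

The first element is the text before the first match (an empty string if the text starts with a heading), which is why it usually has to be set aside before pairing headings with scene bodies.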
Response.
In [27]: # Why just only print these results
         import re
         import pprint

         # Read the text from data/tempest.txt
         with open("data/tempest.txt", "r") as file:
             contents = file.read()

         # split the text into header, scene and dialogue
         # (indexing the result selects a single element, so only one scene is printed)
         scene_texts = re.split(r"(SCENE \d)", contents)[2]
         pprint.pprint(scene_texts)
('\n'
 '\n'
 'On a ship at sea; a tempestuous noise of thunder and lightning\n'
 'heard\n'
 '\n'
 'Enter a SHIPMASTER and a BOATSWAIN\n'
 '\n'
 ' MASTER. Boatswain!\n'
 ' BOATSWAIN. Here, master; what cheer?\n'
 " MASTER. Good! Speak to th' mariners; fall to't yarely, or\n"
 ' we run ourselves aground; bestir, bestir. Exit\n'
 '\n'
 ' Enter MARINERS\n'
 '\n'
 ' BOATSWAIN. Heigh, my hearts! cheerly, cheerly, my hearts!\n'
 " yare, yare! Take in the topsail. Tend to th' master's\n"
 ' whistle. Blow till thou burst thy wind, if room enough.\n'
 '\n'
 ' Enter ALONSO, SEBASTIAN, ANTONIO, FERDINAND\n'
 ' GONZALO, and OTHERS\n'
 '\n'
 " ALONSO. Good boatswain, have care. Where's the master?\n"
 ' Play the men.\n'
 ' BOATSWAIN. I pray now, keep below.\n'
 ' ANTONIO. Where is the master, boson?\n'
 ' BOATSWAIN. Do you not hear him? You mar our labour;\n'
 ' keep your cabins; you do assist the storm.\n'
 ' GONZALO. Nay, good, be patient.\n'
 ' BOATSWAIN. When the sea is. Hence! What cares these\n'
 ' roarers for the name of king? To cabin! silence! Trouble\n'
 ' us not.\n'
 ' GONZALO. Good, yet remember whom thou hast aboard.\n'
 ' BOATSWAIN. None that I more love than myself. You are\n'
 ' counsellor; if you can command these elements to\n'
 ' silence, and work the peace of the present, we will not\n'
 ' hand a rope more. Use your authority; if you cannot, give\n'
 " thanks you have liv'd so long, and make yourself ready\n"
 ' in your cabin for the mischance of the hour, if it so\n'
 ' hap.-Cheerly, good hearts!-Out of our way, I say.\n'
 ' Exit\n'
 ' GONZALO. I have great comfort from this fellow. Methinks\n'
 ' he hath no drowning mark upon him; his complexion is\n'
 ' perfect gallows. Stand fast, good Fate, to his hanging;\n'
 ' make the rope of his destiny our cable, for our own doth\n'
 " little advantage. If he be not born to be hang'd, our\n"
 ' case is miserable. Exeunt\n'
 '\n'
 ' Re-enter BOATSWAIN\n'
 '\n'
 ' BOATSWAIN. Down with the topmast. Yare, lower, lower!\n'
 " Bring her to try wi' th' maincourse. [A cry within] A\n"
 ' plague upon this howling! They are louder than the\n'
 ' weather or our office.\n'
 '\n'
 ' Re-enter SEBASTIAN, ANTONIO, and GONZALO\n'
 '\n'
 " Yet again! What do you here? Shall we give o'er, and\n"
 ' drown? Have you a mind to sink?\n'
 " SEBASTIAN. A pox o' your throat, you bawling, blasphemous,\n"
 ' incharitable dog!\n'
 ' BOATSWAIN. Work you, then.\n'
 ' ANTONIO. Hang, cur; hang, you whoreson, insolent noisemaker;\n'
 " we are less afraid to be drown'd than thou art.\n"
 " GONZALO. I'll warrant him for drowning, though the ship were\n"
 ' no stronger than a nutshell, and as leaky as an unstanched\n'
 ' wench.\n'
 ' BOATSWAIN. Lay her a-hold, a-hold; set her two courses; off\n'
 ' to sea again; lay her off.\n'
 '\n'
 ' Enter MARINERS, Wet\n'
 ' MARINERS. All lost! to prayers, to prayers! all lost!\n'
 ' Exeunt\n'
 ' BOATSWAIN. What, must our mouths be cold?\n'
 ' GONZALO. The King and Prince at prayers!\n'
 " Let's assist them,\n"
 ' For our case is as theirs.\n'
 ' SEBASTIAN. I am out of patience.\n'
 ' ANTONIO. We are merely cheated of our lives by drunkards.\n'
 " This wide-chopp'd rascal-would thou mightst lie drowning\n"
 ' The washing of ten tides!\n'
 " GONZALO. He'll be hang'd yet,\n"
 ' Though every drop of water swear against it,\n'
 " And gape at wid'st to glut him.\n"
 ' [A confused noise within: Mercy on us!\n'
 ' We split, we split! Farewell, my wife and children!\n'
 ' Farewell, brother! We split, we split, we split!]\n'
 " ANTONIO. Let's all sink wi' th' King.\n"
 " SEBASTIAN. Let's take leave of him.\n"
 ' Exeunt ANTONIO and SEBASTIAN\n'
 ' GONZALO. Now would I give a thousand furlongs of sea for\n'
 ' an acre of barren ground-long heath, brown furze, any\n'
 ' thing. The wills above be done, but I would fain die\n'
 ' dry death. Exeunt\n'
 '\n'
 '\n'
 '\n'
 '\n')
C2. (10 points) Ignore the text before the start of the first scene. For each scene, use regular expressions to separate the (newline-delimited) lines into speakers and speeches, and utilize defaultdict to store each scene’s speaker/speech data as a list of tuples, i.e., lists are values and scene headings are keys.
Note: If you’re still working on C1 you can earn full credit for C2 using C1’s sample output.
Note: Your output for this part should look something like:
scenes = {"SCENE 1": [('MASTER', 'Boatswain!'), ('BOATSWAIN', 'Here, master; what cheer?'),
('MASTER', "Good! Speak to th' mariners; fall to't yarely, or"), ('BOATSWAIN', 'Heigh, my hearts! cheerly, cheerly, my hearts!'),
('ALONSO', "Good boatswain, have care. Where's the master?"), ('BOATSWAIN', 'I pray now, keep below.'), ('ANTONIO', 'Where is the master, boson?'), ('BOATSWAIN', 'Do you not hear him? You mar our labour;'), ('GONZALO', 'Nay, good, be patient.'),
('BOATSWAIN', 'When the sea is. Hence! What cares these'), ('GONZALO', 'Good, yet remember whom thou hast aboard.'), ('BOATSWAIN', 'None that I more love than myself. You are'), ('GONZALO', 'I have great comfort from this fellow. Methinks'), ('BOATSWAIN', 'Down with the topmast. Yare, lower, lower!'), ('SEBASTIAN', "A pox o' your throat, you bawling, blasphemous,"), ('PERMISSION', ' ELECTRONIC AND MACHINE READABLE COPIES MAY BE'), ('COMMERCIALLY', ' PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY'),
"SCENE 2": [...], ...}
In [23]: import pprint

         def extract_multi_dialog(s):
             # Match lines of the form "SPEAKER. speech"
             pattern = re.compile(r"[A-Z]+\.\s.+")
             matched_list = pattern.findall(s)
             if matched_list:
                 pprint.pprint(matched_list)
                 print(len(matched_list))
                 # Split each line once, at the first ". ", into (speaker, speech)
                 return list(map(lambda each_matched_line: tuple(each_matched_line.split(". ", 1)),
                                 matched_list))

         scenes = extract_multi_dialog(scene_texts)
['MASTER. Boatswain!', 'BOATSWAIN. Here, master; what cheer?', "MASTER. Good! Speak to th' mariners; fall to't yarely, or", 'BOATSWAIN. Heigh, my hearts! cheerly, cheerly, my hearts!', "ALONSO. Good boatswain, have care. Where's the master?", 'BOATSWAIN. I pray now, keep below.', 'ANTONIO. Where is the master, boson?', 'BOATSWAIN. Do you not hear him? You mar our labour;', 'GONZALO. Nay, good, be patient.', 'BOATSWAIN. When the sea is. Hence! What cares these', 'GONZALO. Good, yet remember whom thou hast aboard.', 'BOATSWAIN. None that I more love than myself. You are', 'GONZALO. I have great comfort from this fellow. Methinks', 'BOATSWAIN. Down with the topmast. Yare, lower, lower!', "SEBASTIAN. A pox o' your throat, you bawling, blasphemous,", 'BOATSWAIN. Work you, then.', 'ANTONIO. Hang, cur; hang, you whoreson, insolent noisemaker;', "GONZALO. I'll warrant him for drowning, though the ship were", 'BOATSWAIN. Lay her a-hold, a-hold; set her two courses; off', 'MARINERS. All lost! to prayers, to prayers! all lost!', 'BOATSWAIN. What, must our mouths be cold?',
'GONZALO. The King and Prince at prayers!', 'SEBASTIAN. I am out of patience.', 'ANTONIO. We are merely cheated of our lives by drunkards.', "GONZALO. He'll be hang'd yet,", "ANTONIO. Let's all sink wi' th' King.", "SEBASTIAN. Let's take leave of him.", 'GONZALO. Now would I give a thousand furlongs of sea for']
28
In [24]: pprint.pprint(scenes)
[('MASTER', 'Boatswain!'), ('BOATSWAIN', 'Here, master; what cheer?'), ('MASTER', "Good! Speak to th' mariners; fall to't yarely, or"), ('BOATSWAIN', 'Heigh, my hearts! cheerly, cheerly, my hearts!'), ('ALONSO', "Good boatswain, have care. Where's the master?"), ('BOATSWAIN', 'I pray now, keep below.'), ('ANTONIO', 'Where is the master, boson?'), ('BOATSWAIN', 'Do you not hear him? You mar our labour;'), ('GONZALO', 'Nay, good, be patient.'), ('BOATSWAIN', 'When the sea is. Hence! What cares these'), ('GONZALO', 'Good, yet remember whom thou hast aboard.'), ('BOATSWAIN', 'None that I more love than myself. You are'), ('GONZALO', 'I have great comfort from this fellow. Methinks'), ('BOATSWAIN', 'Down with the topmast. Yare, lower, lower!'), ('SEBASTIAN', "A pox o' your throat, you bawling, blasphemous,"), ('BOATSWAIN', 'Work you, then.'), ('ANTONIO', 'Hang, cur; hang, you whoreson, insolent noisemaker;'), ('GONZALO', "I'll warrant him for drowning, though the ship were"), ('BOATSWAIN', 'Lay her a-hold, a-hold; set her two courses; off'), ('MARINERS', 'All lost! to prayers, to prayers! all lost!'), ('BOATSWAIN', 'What, must our mouths be cold?'), ('GONZALO', 'The King and Prince at prayers!'), ('SEBASTIAN', 'I am out of patience.'), ('ANTONIO', 'We are merely cheated of our lives by drunkards.'), ('GONZALO', "He'll be hang'd yet,"), ('ANTONIO', "Let's all sink wi' th' King."), ('SEBASTIAN', "Let's take leave of him."), ('GONZALO', 'Now would I give a thousand furlongs of sea for')]
In [ ]: # utilize defaultdict to store each scene's speaker/speech data as a list of tuples
        from collections import defaultdict
        # Question: Not sure how to add the key into dictionary??
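On the question of adding keys: with `defaultdict(list)`, a key is created the first time it is indexed, so appending to `scenes[heading]` is enough. The sketch below assumes `parts` alternates scene headings and scene texts (as a capturing re.split() would produce once the preamble is dropped); the first scene text is excerpted from the C1 output, while the second scene's line is a made-up placeholder.

```python
import re
from collections import defaultdict

# Alternating headings and scene texts; the SCENE 2 body is a placeholder.
parts = [
    'SCENE 1',
    "\n MASTER. Boatswain!\n BOATSWAIN. Here, master; what cheer?\n"
    " MASTER. Good! Speak to th' mariners; fall to't yarely, or\n"
    " we run ourselves aground; bestir, bestir. Exit\n",
    'SCENE 2',
    "\n FIRST. A placeholder speech.\n",
]

# A speaker line: optional indent, an all-caps name, a period, then the speech.
speaker_line = re.compile(r"^[ \t]*([A-Z]+)\.[ \t]+(.+)$", re.MULTILINE)

scenes = defaultdict(list)
for heading, body in zip(parts[0::2], parts[1::2]):
    for speaker, speech in speaker_line.findall(body):
        # Indexing a missing key creates an empty list automatically.
        scenes[heading].append((speaker, speech))

print(dict(scenes))
```

Continuation lines such as "we run ourselves aground" do not start with an all-caps name, so this pattern skips them; only the first line of each speech is captured, as in the sample output.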
C3. (5 points) Count up the number of times each character spoke in the entire book. Print out the speakers and their speech counts from most to least. Remark on any limitations of your work in this exercise’s preprocessing in the response box below. Do you see any artifacts of imprecision in your regex?
Note: You can complete this part even without having completed C2. To do so, just use the sample output provided in C2’s prompt.
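Counting speakers across all scenes might look like the sketch below, which assumes `scenes` maps scene headings to lists of (speaker, speech) tuples as in C2’s sample output. Only the first few tuples of that sample are reproduced here.

```python
from collections import Counter

# First entries of the C2 sample output.
scenes = {
    'SCENE 1': [
        ('MASTER', 'Boatswain!'),
        ('BOATSWAIN', 'Here, master; what cheer?'),
        ('MASTER', "Good! Speak to th' mariners; fall to't yarely, or"),
        ('BOATSWAIN', 'Heigh, my hearts! cheerly, cheerly, my hearts!'),
        ('ALONSO', "Good boatswain, have care. Where's the master?"),
        ('BOATSWAIN', 'I pray now, keep below.'),
    ],
}

# One count per speech, pooled over every scene.
counts = Counter(
    speaker for speeches in scenes.values() for speaker, _ in speeches
)

for speaker, count in counts.most_common():
    print(speaker, count)
```

On the full text, entries such as 'PERMISSION' and 'COMMERCIALLY' in C2’s sample output show the kind of artifact to look for: license boilerplate in capitals followed by a period matches the same pattern as a speaker name.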
Response.
In [ ]: from collections import Counter
        scenes = ["Scene 1", "Scene 2"]
        each_character = Counter(scenes)
1.4 Problem D (20 points)
D1. (5 points) Download and extract the loan.csv file from the Lending Club Loan Dataset and put it in the data/ directory. Read the csv in (the first line contains headers). Create a dictionary named statuses whose keys are the entries in the loan_status and values are boolean values, describing ’good’ or ’bad’ loans, where loans that have “Current”, “Fully Paid”, or “Issued” in the loan_status field are ’good’ loans, and others are ’bad’ loans.
In [ ]:
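The shape of the statuses dictionary can be sketched on a tiny in-memory CSV. The three rows below are made up for illustration (including the "Charged Off" status value), and the sketch assumes that "good" means an exact match with one of the three listed statuses; with the real file, the `io.StringIO` object would be replaced by an opened data/loan.csv.

```python
import csv
import io

GOOD_STATUSES = {'Current', 'Fully Paid', 'Issued'}

# Made-up stand-in for data/loan.csv; the first line contains headers.
sample_csv = (
    "id,loan_status,desc\n"
    "1,Current,Debt consolidation\n"
    "2,Charged Off,Need cash\n"
    "3,Fully Paid,\n"
)

statuses = {}
for row in csv.DictReader(io.StringIO(sample_csv)):
    status = row['loan_status']
    # True marks a 'good' loan status, False a 'bad' one.
    statuses[status] = status in GOOD_STATUSES

print(statuses)
```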
D2. (10 pts) The desc field contains text descriptions of loans. Tokenize each loan description and count the words for ’good’ and ’bad’ loan descriptions, putting counts into two separate Counter() data structures according to each loan’s status in statuses. Print out the 50 most common words used to describe ’good’ and ’bad’ loans, respectively.
Note: If you’re still working on D1, for partial (near-full) credit you can still count all words, regardless of loan good/bad status, and print out the 50 most common.
In [ ]:
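One simple tokenization choice, shown here only as a sketch, is to lowercase each description and keep runs of letters and apostrophes. The loan rows and the `statuses` mapping below are made up for illustration, and the regex tokenizer is an assumption rather than the required approach.

```python
import re
from collections import Counter

# Made-up (loan_status, desc) pairs standing in for rows of loan.csv.
loans = [
    ('Current', 'Consolidate my credit card debt'),
    ('Charged Off', 'Need money for credit card bills'),
    ('Fully Paid', 'Pay off credit card'),
]
statuses = {'Current': True, 'Charged Off': False, 'Fully Paid': True}


def tokenize(text):
    """Lowercase and split into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


good_words, bad_words = Counter(), Counter()
for status, desc in loans:
    target = good_words if statuses[status] else bad_words
    target.update(tokenize(desc))

print(good_words.most_common(50))
print(bad_words.most_common(50))
```

A tokenizer like this drops digits and punctuation and does not remove stop words, which is the sort of trade-off D3 asks about.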
D3. (5 points) Discuss your output and the choice of tokenization that you used. Do the two different high-frequency word printouts make sense? What impacts do you think your choice had on the output? What were the challenges and what could have worked better or worse?
Note: even if you are still working on D1 you can count words as discussed in D2 and comment on your choices for tokenization.
Response.
In [ ]: