CIT 594 Group Project
In this project, you will apply what you’ve learned this semester about data structures, design principles, and design patterns to develop a Java application that reads text files as input and performs some analysis.
This project builds on what you implemented in the Solo Project and, as with that assignment, design is a significant portion of the assessment.
1 Background
  1.1 COVID Data
  1.2 Properties Data
2 Input Data Format
  2.1 COVID-19 Data
  2.2 Property Values
  2.3 Population Data
3 Functional Specifications
  3.0 General Functionality
  3.1 Available Data Sets
  3.2 Total Population for All ZIP Codes
  3.3 Total Partial or Full Vaccinations Per Capita
  3.4 Average Market Value
  3.5 Average Total Livable Area
  3.6 Total Market Value Per Capita
  3.7 Additional Feature
  3.8 Logging
4 Design Specification
  4.1 N-Tier Architecture
  4.2 Design Patterns
  4.3 Efficiency
5 Project Report
  5.1 Additional Feature
  5.2 Use of Data Structures
  5.3 Lessons Learned
6 Resources
  6.1 Using Ed Discussion
  6.2 Testing
  6.3 GitHub
7 Grading
8 Submission
A CSV Parsing
  A.1 General Structure
  A.2 Headers
  A.3 Peculiarities In The Provided CSV Files
  A.4 General CSV Processing Advice
1 Background
The OpenDataPhilly portal [1] offers, for free, more than 300 data sets, applications, and APIs related to the city of Philadelphia. This resource enables government officials, researchers, and the general public to gain a deeper understanding of what is happening in our fair city. The available data sets cover topics such as the environment, real estate, health and human services, transportation, and public safety. The United States Census Bureau [2] publishes similar information (and much more) for the nation as a whole.
For this assignment, you will use course-provided files containing data from these sources. Specifically, you will be given:
• “COVID” data, from the Philadelphia Department of Public Health
• “Properties” data (information about land parcels in the city), from the Philadelphia Office of Property Assessment
• 2020 populations of Philadelphia ZIP Codes, from the US Census Bureau
1.1 COVID Data
This data set tracks reported COVID cases, hospitalizations, vaccinations, and deaths in the city of Philadelphia for each day, updated daily. [3] OpenDataPhilly has pointers to explore the data on the department’s own site as well as a GitHub repository [4] that stores historic snapshots of the data. All three sites have more details about the collection methodology and other information about the data sets.
The files provided with the assignment include these COVID data in a combined form (all 4 sets as a single file) indexed by recording time and ZIP Code. Note that the ZIP Codes in these data sets are for the reporting locations, which may not match the patients’ home ZIP Codes. For simplicity, we will ignore this issue and assume reporting ZIP Codes and home ZIP Codes are the same.
1.2 Properties Data
Your program will also use a data set of property values of houses and other properties in Philadelphia. This data set includes details about each property including its ZIP Code and current market value (the estimated dollar value of the property, which is used by the city to calculate property taxes). It also includes the total livable area for the property, a measure of the floor space of the structure[s] on the property in square feet.
2 Input Data Format
As the OpenDataPhilly data sets are very large and have quite a lot of extra data you do not need, we will provide somewhat simplified versions for you to use for this assignment. You do not need to download anything from the OpenDataPhilly site.
[1] https://www.opendataphilly.org/
[2] https://www.census.gov
[3] The reporting frequency was reduced to weekly starting April 4th, 2022.
[4] https://github.com/ambientpointcorp/covid19-philadelphia
Your program will need to support reading all three types of data from CSV (Comma-Separated Values) files, as well as an additional JSON file for the COVID data. All valid CSV files will start with a header row that will include all of the designated fields for each data set. Your program should use the header row to determine the order of the columns at runtime.
See Appendix A for more details about parsing CSV files for this assignment.
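As a rough illustration of using the header row to find columns at runtime, the sketch below builds a name-to-position map. It assumes the header line has already been split into fields by a CSV parser of the kind described in Appendix A; the class and method names (HeaderIndex, fromHeader) are hypothetical and not part of the assignment.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HeaderIndex {
    // Maps each column name found in the header row to its position in a row.
    static Map<String, Integer> fromHeader(List<String> headerFields) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < headerFields.size(); i++) {
            index.put(headerFields.get(i), i);
        }
        return index;
    }

    public static void main(String[] args) {
        // The column order here deliberately differs from the order in the handout.
        List<String> header = Arrays.asList("population", "zip_code");
        List<String> row = Arrays.asList("1234", "19104");
        Map<String, Integer> index = fromHeader(header);
        // Look the field up by name instead of by a fixed position.
        System.out.println(row.get(index.get("zip_code")));
    }
}
```

Looking fields up by name in this way means a file with reordered columns can be read without code changes.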
2.1 COVID-19 Data
Your program needs to be able to read the set of vaccination data from both CSV and JSON files; the type should be inferred from the file name extension (the portion of the name following the last “.”). The format only determines the organization of the data and is independent of the actual contents (the provided CSV and JSON files contain the same information). Each invocation of the program will be given at most one COVID data file, which will be in one of these two formats.
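One possible way to infer the format from the file name is sketched below. The names CovidFileFormat and fromFileName are hypothetical, and treating the extension case-insensitively is an assumption that the text above does not state.

```java
import java.util.Locale;

final class CovidFileFormat {
    // Returns "csv" or "json", or null when the format cannot be determined.
    static String fromFileName(String fileName) {
        int lastDot = fileName.lastIndexOf('.');
        if (lastDot < 0) {
            return null; // no "." at all, so there is no extension
        }
        // Assumption: "CSV" and "csv" are treated the same.
        String extension = fileName.substring(lastDot + 1).toLowerCase(Locale.ROOT);
        if (extension.equals("csv") || extension.equals("json")) {
            return extension;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(fromFileName("covid-data.json"));
        System.out.println(fromFileName("covid_data.txt"));
    }
}
```

A null result here corresponds to the case in which the program cannot determine the format and should report an error.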
Each record contains statistics relating to COVID-19 for a single ZIP Code on a single day. The fields include:
• The ZIP Code where the vaccinations were provided.
• The timestamp at which the vaccination data for that ZIP Code were reported, in “YYYY-MM-DD hh:mm:ss” format.
• The total number of persons who have received their first dose in the ZIP Code but not their second dose (“partially vaccinated”), as of the reporting date.
• The total number of persons who have received their second dose (“fully vaccinated”) in the ZIP Code, as of the reporting date.
The record for each ZIP Code also contains statistics for the total number of COVID infection tests conducted as of each date (both positive and negative results), the total number of booster doses administered as of each date, the total number of COVID patients hospitalized as of that date (including previously hospitalized persons who have recovered or died), and the total number of deaths attributed to the disease to date. You may, if you wish, use these additional fields for the free-form analysis in subsection 3.7.
Note that all of the above-described data fields are cumulative, with two caveats. First, when a person who is “partially vaccinated” receives their second dose, they are removed from that count and added to the “fully vaccinated” count, which may result in overall decreases in the “partially vaccinated” count. Second, the reporting agencies may have made occasional data corrections or errors which result in one of the other cumulative fields temporarily decreasing in value.
You should ignore any records where the ZIP Code is not 5 digits or the timestamp is not in the specified format. For any other fields, an empty value should be interpreted as being 0. For example, if the record for ZIP Code 19000 on 2021-06-01 has an empty field for fully_vaccinated then this should be interpreted as meaning there were no fully vaccinated people as of this date in that ZIP Code.
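The rules in the preceding paragraph could be checked along the following lines. This is only a sketch: the names are hypothetical, the timestamp check looks only at the shape of the string (it does not verify that the date exists on the calendar), and letting a non-empty, non-numeric count raise an exception is an assumption, since the text above only covers empty values.

```java
import java.util.regex.Pattern;

final class CovidFieldRules {
    private static final Pattern ZIP = Pattern.compile("\\d{5}");
    private static final Pattern TIMESTAMP =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}");

    // A record is usable only if both the ZIP Code and the timestamp are well formed.
    static boolean isUsableRecord(String zip, String timestamp) {
        return zip != null && timestamp != null
                && ZIP.matcher(zip).matches()
                && TIMESTAMP.matcher(timestamp).matches();
    }

    // An empty count field is interpreted as 0.
    static int countOrZero(String field) {
        if (field == null || field.trim().isEmpty()) {
            return 0;
        }
        return Integer.parseInt(field.trim()); // throws NumberFormatException on non-numeric text
    }

    public static void main(String[] args) {
        System.out.println(isUsableRecord("19000", "2021-06-01 17:20:02"));
        System.out.println(countOrZero(""));
    }
}
```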
The JSON format is an array of objects much like the flu tweets in the Solo Project. You will need to use the same JSON.simple library (included with the starter files) for parsing the JSON file. Review your solution to that assignment if you do not recall how to set up and use that library.
2.2 Property Values
The property values data set will only be provided as a CSV file; there is no JSON file for these data. Each row of the CSV file represents data about one property (residential, commercial, vacant land, etc.). For the prescribed activities you will need three fields:
• market_value
• total_livable_area
• zip_code
You may also use any of the other fields and records in the included properties.csv for your free-form activity. We would not recommend having your program store fields that you will not use in your analysis since this file is quite large and doing so would take up a lot of memory.
The zip_code field of the property values data may make use of extended forms of ZIP Codes. In your analysis you should use only the first 5 characters. For example, if the value read is “19104-3333” or “191043333”, it should be interpreted as “19104”. If the ZIP Code has fewer than 5 characters or the first 5 characters are not all numeric then you should ignore that record entirely.
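A minimal sketch of this ZIP Code rule follows. The names PropertyZip and normalize are hypothetical, and returning null to signal "ignore this record" is just one possible convention.

```java
final class PropertyZip {
    // Returns the first 5 characters if they are all digits, otherwise null.
    static String normalize(String raw) {
        if (raw == null || raw.length() < 5) {
            return null; // fewer than 5 characters: ignore the record
        }
        String prefix = raw.substring(0, 5);
        for (int i = 0; i < prefix.length(); i++) {
            char c = prefix.charAt(i);
            if (c < '0' || c > '9') {
                return null; // a non-numeric character in the first 5: ignore the record
            }
        }
        return prefix;
    }

    public static void main(String[] args) {
        System.out.println(normalize("19104-3333"));
        System.out.println(normalize("191043333"));
    }
}
```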
Because this is real world data, sometimes there will be errors in the data sets, such as missing ZIP Codes, market values that are non-numeric, etc. For the property file, if your program encounters data that is malformed but is needed for a particular calculation, then your program should ignore it for the purposes of that calculation and produce the result based only on the well-formed data.
For instance, let’s say these are the entries for ZIP Code 19000:
zip_code  market_value  total_livable_area
19000     100000        1000
19000                   2000
19000     200000        dog
If your program were attempting to calculate the average market value, it should ignore the second entry because its market value is not listed, but should consider the third entry even though its total livable area is non-numeric since for this calculation we don’t need the total livable area. Thus, the program should produce 150000 as the average market value in this case. However, if your program were attempting to calculate the average livable area then it would include the second entry but ignore the third, and should produce 1500.
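The worked example above can be reproduced with a small sketch like the one below. The names are hypothetical, and two choices are assumptions rather than requirements: returning 0 when no well-formed values exist, and using Double.valueOf as the test for "numeric" (it accepts a few forms, such as "NaN" or a trailing "d", that a stricter check might reject).

```java
import java.util.Arrays;
import java.util.List;

final class FieldAverage {
    // Averages only the entries that parse as numbers; all others are skipped.
    static double averageOfWellFormed(List<String> rawValues) {
        double sum = 0;
        int count = 0;
        for (String raw : rawValues) {
            Double value = parseOrNull(raw);
            if (value != null) {
                sum += value;
                count++;
            }
        }
        return count == 0 ? 0 : sum / count; // assumption: 0 when nothing is usable
    }

    // Returns the parsed number, or null for missing or non-numeric text.
    static Double parseOrNull(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null;
        }
        try {
            return Double.valueOf(raw.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The three entries for ZIP Code 19000 from the example above.
        List<String> marketValues = Arrays.asList("100000", "", "200000");
        List<String> livableAreas = Arrays.asList("1000", "2000", "dog");
        System.out.println(averageOfWellFormed(marketValues)); // 150000.0
        System.out.println(averageOfWellFormed(livableAreas)); // 1500.0
    }
}
```

Each calculation filters on its own field only, which is why the same three records yield two usable values in both cases.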
Do not check for “semantic” problems related to the meaning of the data, e.g., market values or total livable areas that are zero or negative, ZIP Codes that are not in Philadelphia, etc. Your program should consider those to be valid, as long as they exist in the data file and are of the right type.
Note: The inputs used for grading may differ from the included files even when they represent the same data. You must use general CSV parsing for this and the other input files. Do not assume that different files (such as the ones that will be used for grading) will have the same column order or record order as the provided files, or that optional quoting of fields will remain the same.
2.3 Population Data
Your program will need to do some computations using the populations of Philadelphia’s ZIP Codes. This information will be provided in a CSV file with the columns: “zip_code” and “population”.
You should ignore any records where the ZIP Code is not exactly 5 digits or the population figure is not an integer.
3 Functional Specifications
This section describes the specification that your program must follow. Some parts may be underspecified; you are free to interpret those parts any way you like, within reason, but you should ask a member of the instruction staff if you feel that something needs to be clarified. Your program must be written in Java (you may use language features up to and including Java 11).
3.0 General Functionality
There are 4 optional runtime arguments to the program (the String array passed to main):
• covid: The name of the COVID data file
• properties: The name of the property values file
• population: The name of the population data file
• log: The name of the log file (described below)
Runtime arguments should be in the form “--name=value”. This is explicitly defined by this regular expression which you may use in your code: “^--(?
For example:
java edu.upenn.cit594.Main --population=example_population_file.csv \
    --log=events.log --covid=covid-data.json \
    --properties=house_hunting_info.csv
Given that invocation, in main(String[] args) args would be populated with the array:
[ "--population=example_population_file.csv", "--log=events.log",
  "--covid=covid-data.json", "--properties=house_hunting_info.csv" ]
You can also include these arguments in an IDE run configuration. In the arguments box you might put:
--population=pop.csv --covid=cov.csv --properties=props.csv --log=log.txt
Or, if you want to invoke your main from another function:
edu.upenn.cit594.Main.main(new String[]{
    "--population=pop.csv", "--covid=covdat.json"
});
Do not prompt the user for the file information! It should only be specified as part of the invocation (i.e., when the program is started).
The program should display an error message and immediately terminate under any of the following conditions:
• Any arguments to main do not match the form “--name=value”.
• The name of an argument is not one of the names listed above.
• The name of an argument is used more than once (e.g., “--log=a.log --log=a.log”).
• The logger cannot be correctly initialized (e.g., the given log file cannot be opened for writing).
• The format of the COVID data file cannot be determined from the filename extension (“csv” or “json”).
• The specified input files do not exist or cannot be opened for reading (e.g., because of file permissions). [5]
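The first three conditions in the list above could be checked with a sketch like the following. The regular expression is an assumed reading of the form name=value preceded by two ASCII hyphens, not necessarily the official expression, and the names ArgumentParser and parse are hypothetical. A caller would catch the exception, print an error message, and terminate.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ArgumentParser {
    // Assumed pattern: two hyphens, a name, "=", then a non-empty value.
    private static final Pattern ARGUMENT =
            Pattern.compile("^--(?<name>.+?)=(?<value>.+)$");
    private static final Set<String> KNOWN_NAMES =
            Set.of("covid", "properties", "population", "log");

    // Returns name -> value, or throws IllegalArgumentException on a bad argument.
    static Map<String, String> parse(String[] args) {
        Map<String, String> parsed = new HashMap<>();
        for (String arg : args) {
            Matcher m = ARGUMENT.matcher(arg);
            if (!m.matches()) {
                throw new IllegalArgumentException("Malformed argument: " + arg);
            }
            String name = m.group("name");
            if (!KNOWN_NAMES.contains(name)) {
                throw new IllegalArgumentException("Unknown argument name: " + name);
            }
            // putIfAbsent returns the earlier value when the name was already seen.
            if (parsed.putIfAbsent(name, m.group("value")) != null) {
                throw new IllegalArgumentException("Duplicate argument name: " + name);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, String> parsed =
                parse(new String[]{"--population=pop.csv", "--covid=covdat.json"});
        System.out.println(parsed.get("covid"));
    }
}
```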
For simplicity, you may assume that the input files are well-formed according to the specified formats if they exist and are readable. However, some data may be missing for some records. This is intentional and is something your program must handle as described in section 2.
Assuming the provided input files exist and can be read, the program should then display a menu of possible actions and prompt the user to specify the action to be performed. The user should be able to do this by typing the action number (from the list below) and hitting return.
0. Exit the program.
1. Show the available data sets (subsection 3.1).
2. Show the total population for all ZIP Codes (subsection 3.2).
3. Show the total vaccinations per capita for each ZIP Code for the specified date (subsection 3.3).
4. Show the average market value for properties in a specified ZIP Code (subsection 3.4).
5. Show the average total livable area for properties in a specified ZIP Code (subsection 3.5).
6. Show the total market value of properties, per capita, for a specified ZIP Code (subsection 3.6).
7. Show the results of your custom feature (subsection 3.7).
[5] Hint: take a look at the documentation for the java.io.File class, at: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/File.html.
The text menu explaining the actions listed above should be followed by an input prompt line. The prompt line, which should be displayed any time the program wants data from the user (not just after the menu), should have the form of a new line which begins with a greater-than sign followed by a space (“> ”). In order to ensure that the prompt actually appears for the user, you must flush the output buffer after printing it, using the command “System.out.flush()”. If the user enters anything other than an integer from 0 to 7, the program should show an error message and prompt the user for another selection. This includes inputs such as:
• “1 2”
• “ 1”
• “4dog”
• “1.0”
Please do not spend too much time worrying about how to handle these inputs, or which inputs you do and do not need to handle. This is definitely a very minor part of the program!
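For illustration only, the prompt and menu validation might look like the sketch below. The names are hypothetical, and two behaviors are assumptions: only a single digit from 0 to 7 is accepted (so an input such as "07" is rejected), and reaching the end of input is treated as a request to exit.

```java
import java.util.Scanner;

final class MenuPrompt {
    // True only for a line consisting of exactly one digit from 0 to 7.
    static boolean isValidMenuChoice(String line) {
        return line != null && line.matches("[0-7]");
    }

    // Prompts until a valid choice is entered, then returns it.
    static int readMenuChoice(Scanner in) {
        while (true) {
            System.out.print("> ");
            System.out.flush(); // make sure the prompt is visible before blocking on input
            if (!in.hasNextLine()) {
                return 0; // assumption: end of input behaves like "Exit"
            }
            String line = in.nextLine();
            if (isValidMenuChoice(line)) {
                return Integer.parseInt(line);
            }
            System.out.println("Please enter a whole number from 0 to 7.");
        }
    }

    public static void main(String[] args) {
        int choice = readMenuChoice(new Scanner(System.in));
        System.out.println("Selected action " + choice);
    }
}
```

Matching the whole line against a pattern rejects all four example inputs above without any special cases.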
Some of the actions in this program will require additional input from the user which you should prompt them for when necessary. If the input to that prompt does not match the form described in that respective section you should reprompt them until a valid input is provided.
After an action is completed you should redisplay the main action menu and prompt for the next action. If the user requests a valid operation for which the data is not available (i.e., the corresponding file was not provided on the command line), the program should display an error message to the user, redisplay the main menu, and then reprompt. [6]
To separate calculation outputs from all other outputs (including user interactions), your program should start a response with a line containing only “BEGIN OUTPUT” and follow the response with a line containing “END OUTPUT”. Between those markers, the formats must match those specified below. [7] Outside of the markers and the prompts, formatting will be ignored by automated evaluation, but please keep things reasonable for comfortable human interaction.
NOTE: do not write any other output between “BEGIN OUTPUT” and “END OUTPUT” (even an extra blank line). Doing so may result in failure to process the results and loss of points for the affected functions.
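One way to keep stray text out of the marked region is to build the whole block in a single place, as in this sketch. The names OutputBlock and format are hypothetical, and ending each line with "\n" is an assumption about what the automated evaluation accepts.

```java
import java.util.Arrays;
import java.util.List;

final class OutputBlock {
    // Wraps the result lines between the two marker lines, with nothing else in between.
    static String format(List<String> resultLines) {
        StringBuilder block = new StringBuilder("BEGIN OUTPUT\n");
        for (String line : resultLines) {
            block.append(line).append('\n');
        }
        return block.append("END OUTPUT\n").toString();
    }

    public static void main(String[] args) {
        // Hypothetical result lines; real ones come from the selected action.
        System.out.print(format(Arrays.asList("covid", "population")));
    }
}
```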
3.1 Available Data Sets
If the user enters a 1 when prompted for input at the main menu, the program should display (to the console, using System.out) a sorted list of the names of the data sets considered available by your application. These should be the data sets (but not the log file!) that were in the arguments passed to main. They should be printed with one name per line, sorted by Java’s “natural string ordering”. The “name” of a data set in this context is not the file name, but the name of the command line argument. For example if the command line included:
--population=example_population_file.csv
Then “population” should be one of the lines of your output. Any data sets which were not given in the command line should not be listed. For example, if the command line included only a population data set and a COVID data set, “properties” should not appear on the list.
[6] Example response: https://www.youtube.com/watch?v=ARJ8cAGm6JE&t=55s
[7] Inhuman graders will be processing some of those outputs.
Also, be careful to avoid violating the N-tier architecture (subsection 4.1). The user interface will need to get the list of available datasets from the processor tier, which in turn must get that information from the data management tier.
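The sorted listing described in this subsection can be sketched as follows, assuming the command line arguments have already been collected into a map from argument name to file name. The names AvailableDataSets and names are hypothetical, and the sketch says nothing about which tier should own this logic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class AvailableDataSets {
    // Returns the data set names in natural String order, leaving out the log file.
    static List<String> names(Map<String, String> parsedArguments) {
        TreeSet<String> sorted = new TreeSet<>(parsedArguments.keySet());
        sorted.remove("log");
        return new ArrayList<>(sorted);
    }

    public static void main(String[] args) {
        Map<String, String> parsed = new HashMap<>();
        parsed.put("population", "example_population_file.csv");
        parsed.put("log", "events.log");
        parsed.put("covid", "covid-data.json");
        for (String name : names(parsed)) {
            System.out.println(name);
        }
    }
}
```

A TreeSet keeps its elements in natural String order, so no separate sort call is needed.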
3.2 Total Population for All ZIP Codes
If the user enters a 2 at the main menu, the program should display the total population for all of the ZIP Codes in the population input file.
Your program must not write any other information to the console. It must only display the total population, i.e., the sum of the populations of all ZIP Codes in the input file, and then it should display the main menu and await the next input.
Hint! For this feature, your program should print 1603797 when run on the data files we have provided. If it does not print this, then your program is not working correctly. This is the only feature for which we will provide the correct output in advance! Each group must determine for themselves what the correct output should be for other parts of this assignment.
3.3 Total Partial or Full Vaccinations Per Capita
If the user enters a 3 at the main menu, your program should prompt the user to type “partial” or “full” by printing a question to this effect followed by an input prompt line (as specified above). Once the user inputs a valid response, your program should then prompt the user to type in a date in the format: YYYY-MM-DD. After receiving a valid response, your program should display (to the console) the total number of partial or full vaccinations per capita for each ZIP Code on that day, i.e., the total number of vaccinations for the specified day divided by the population of that ZIP Code, as provided in the population input file.
When writing to the screen, write one ZIP Code per line and list the ZIP Code, then a single space, then the vaccinations per capita, like this:
BEGIN OUTP