Due date: Sunday, November 1 at 11:59PM
This assignment is worth 20 marks, and will count as 20% of your final mark in this subject.
This is an individual project and your submission for this project is to be your own individual work. While you may discuss the problems with your classmates in a very general sense, you must not show written solutions to another student, use written solutions from another student, or develop code together. Furthermore, reuse of code or excessive influence in algorithm choice and development will be considered misconduct. We will check submissions for originality and will invoke the University’s Academic Misconduct policy (http://academichonesty.unimelb.edu.au/policy.html) where inappropriate collusion or plagiarism appears to have taken place. Your code will be passed through our plagiarism detection software, which can also detect code that has high structural similarity, not just identical code.
Late submissions: If you would like to submit after the deadline, please contact [yujing.jiang@unimelb.edu.au] to request to re-open GROK and discuss corresponding penalties.
Question / Available mark
Submission
Q1. Clean It Up
(5 marks)
Correctness of Output (2)
Computational Approach (2)
Coding Style, Comments (1)
Q2. Expenditure Statistics
(5 marks)
Correctness of Output (2)
Computational Approach (2)
Coding Style, Comments (1)
Q3. Money, Money, Money
(5 marks)
Correctness of Output (2)
Computational Approach (2)
Coding Style, Comments (1)
Q4. Top 5
(5 marks)
Correctness of Output (1)
Computational Approach (3)
Coding Style, Comments (1)
Total Mark (20)
Things to look out for in solving the questions are:
don’t be afraid to create extra variables, e.g. to break up the code into conceptual sub-parts, improve readability, or avoid redundancy in your code
we also encourage you to write helper functions to simplify your code – you can write as many functions as you like, as long as one of them is the function you are asked to write
commenting of code is one thing that you will be marked on; get some practice writing comments in your code, focusing on:
describing key variables when they are first defined, but not things like index variables in for loops; and
describing what ‘chunks’ of code do, i.e. not every line, but chunks of code that perform a particular operation, such as
#find the maximum value in the list
or
#count the number of vowels.
be aware of tradeoffs with extra variables, functions, and comments – whilst we encourage these, if we can’t see the important parts of your code easily, it is a style issue and may be penalized.
start early and seek help from the Canvas Discussion Forum if you have trouble understanding what the question asks you to do, or if your output does not match expectations.
The health expenditure data is given to you in one or more comma-separated values (CSV) files.
CSV is a simple file format which is widely used for storing tabular data (data that consists of columns and rows). In CSV, columns are separated by commas, and rows are separated by newlines (so every line of the text file corresponds to a row of the data).
Usually, the first row of the file is a header row which gives names for the columns.
If CSV uses commas to separate columns, what do we do when our data itself contains commas? The solution is to put quotation marks around any cell containing a comma. Python's csv library does this automatically.
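For example, the quoting round trip can be seen with Python's csv module (this snippet is illustrative only and uses made-up data):

```python
import csv
import io

# A row whose last field contains a comma
row = ["1", "1997-98", "NSW", "Public hospitals, general"]

# Writing: csv.writer automatically quotes the field containing the comma
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
print(buffer.getvalue().strip())
# 1,1997-98,NSW,"Public hospitals, general"

# Reading: csv.reader strips the quotes again and keeps the field intact
print(next(csv.reader(io.StringIO(buffer.getvalue()))))
```

Note that the comma inside the quoted field is not treated as a column separator when reading the line back.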
The health expenditure data contains the following columns:
ID
A unique ID assigned to each expenditure (row).
fin_year
The financial year of the expenditure.
state
The state allocated the funds.
area
The health area where the expenditure was provided.
funding
The broad source of the funding (either Government or Non-government).
detailed_funding
The organisation that provided funding.
expenditure
The amount provided in millions of dollars (rounded to whole numbers).
Here is a sample of the health expenditure data provided to you by the AIHW:
You’ll notice that in this assignment there are multiple files in the Grok workspace (the area to the right of this slide, where you write your code). A quick explanation of these files is in order:
program.py
The file where you will write your code. We have included a little bit of code in program.py to get you started.
header.py
A file containing some useful functions and constants. We have already imported the relevant functions and constants for you in each question.
Various CSV files
You will see some files in the workspace called noisy_sample.csv, noisy_data.csv, cleaned_sample.csv, or cleaned_data.csv. (The exact files vary from question to question.) These are health expenditure CSV files provided to you by the AIHW. You can use them to test your functions as you work through the questions. The sample files are quite small (only a few lines), while the data files are relatively long (1000 lines).
Problem 1:
You have been provided with large CSV files containing health expenditure data about Australia. Unfortunately, the data is “noisy”: some people have made data entry mistakes, or intentionally entered incorrect data. Your first task as a programmer-analyst is to clean up the noisy data for later analysis.
There are a few particular errors in this data:
Typos have occurred in the expenditure column, resulting in some non-numeric values.
People have entered expenditure areas that are out-of-date and no longer valid. The valid areas are listed in a variable called VALID_AREAS, which is given to you.
Some people have formatted the financial year incorrectly: using words instead of digits, e.g. entering 'twenty-ten to eleven' instead of '2010-11', or using too many or too few digits, e.g. '10-11'. Others have entered years outside the range of the dataset; the data provided covers financial years from 1997-98 to 2011-12 inclusive.
Write a function clean_data(data) which takes one argument, a dictionary of data in the format returned by read_data. This data has been read directly from a CSV file, and is noisy! Your function should construct and return a new data dictionary which is identical to the input dictionary, except that invalid data values (as defined by the three points above) have been replaced with None. You should not modify the argument dictionary, data.
There is also one further type of potential error in the noisy data that needs to be fixed. Australian states/territories in the state column are represented by the standard abbreviations: "ACT", "NSW", "NT", "QLD", "SA", "TAS", "VIC", "WA". However, values in the state column could be incorrectly spelt or contain extra characters. Examples of such incorrect values are "ACTT", "NSQ", "NTxx", "aQLD", "SOZ", "TAZ", "VIK", "AW".
Your program should attempt to replace incorrect values with their correct state abbreviation value by using the following set similarity measure, which we use to provide a number representing the similarity between two strings. Suppose we have the following two example strings:
string1 = "Aaa bBb ccC"  # The set representation of string1 is {'a', 'b', 'c', ' '}
string2 = "bbb ccc ddd"  # The set representation of string2 is {'b', 'c', 'd', ' '}
Notice that for the purposes of this definition, case does not matter (e.g. 'A' is the same as 'a'), and that space also counts as a character.
The set similarity (Sim) measure for string1 and string2 is given by the formula:
Sim(string1, string2) = |S1 ∩ S2| / |S1 ∪ S2|
where ∩ is set intersection, ∪ is set union, and |X| is the size of set X.
So, for example:
Sim(string1, string2) = |{a, b, c, (space)} ∩ {b, c, d, (space)}| / |{a, b, c, (space)} ∪ {b, c, d, (space)}|
= |{b, c, (space)}| / |{a, b, c, d, (space)}| = 3/5 = 0.6
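The measure can be computed directly with Python's set operations; the snippet below is just a sketch of the formula above, and the function name sim is our own:

```python
def sim(string1, string2):
    """Set similarity of two strings: |S1 & S2| / |S1 | S2|.

    Case is ignored, and space counts as a character."""
    s1 = set(string1.lower())
    s2 = set(string2.lower())
    return len(s1 & s2) / len(s1 | s2)

print(sim("Aaa bBb ccC", "bbb ccc ddd"))  # 0.6, matching the worked example
```

Lower-casing before building the sets is what makes 'A' and 'a' count as the same character.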
Now, when your program comes across an incorrect state value, for example, “NSQ”, that is not one of ACT, NSW, NT, QLD, SA, TAS, VIC or WA, it should compare that incorrect value with all of the correct state strings using the Sim function, and then replace the incorrect string with the valid state value that it has the highest Sim measure with.
For example, when the incorrect value is “NSQ”, the measure comparisons are:
Sim("ACT", "NSQ") = 0
Sim("NSW", "NSQ") = 0.5
Sim("NT", "NSQ") = 0.25
Sim("QLD", "NSQ") = 0.2
Sim("SA", "NSQ") = 0.25
Sim("TAS", "NSQ") = 0.2
Sim("VIC", "NSQ") = 0
Sim("WA", "NSQ") = 0
Since Sim("NSW", "NSQ") is the highest of all these values, the incorrect "NSQ" would be replaced by "NSW".
In cases where an incorrect string has the same similarity score with two or more correct state values, the program can't decide which one is correct, so it replaces the incorrect value with None. For example, when the incorrect state string value is "AA", the two highest similarities with correct state values are:
Sim("WA", "AA") = Sim("SA", "AA") = 0.5
Since the program has no further way of deciding between the two, it just replaces "AA" with None.
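Putting the two rules together, the repair step for a single value could be sketched as follows (sim and repair_state are our own names, and the real clean_data must of course apply this within the full data dictionary):

```python
VALID_STATES = ["ACT", "NSW", "NT", "QLD", "SA", "TAS", "VIC", "WA"]

def sim(string1, string2):
    """Set similarity: |S1 & S2| / |S1 | S2|, ignoring case."""
    s1, s2 = set(string1.lower()), set(string2.lower())
    return len(s1 & s2) / len(s1 | s2)

def repair_state(value):
    """Return the valid state most similar to value, or None on a tie."""
    if value in VALID_STATES:
        return value
    scores = [(sim(state, value), state) for state in VALID_STATES]
    best = max(score for score, _ in scores)
    matches = [state for score, state in scores if score == best]
    # A unique best match is a repair; a tie is unresolvable
    return matches[0] if len(matches) == 1 else None

print(repair_state("NSQ"))  # NSW
print(repair_state("AA"))   # None (SA and WA tie at 0.5)
```

A list of (score, state) pairs keeps the scores and their candidates together, so finding the maximum and then collecting the ties is straightforward.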
You can assume the following:
the input data dictionary does not contain None values;
all numeric expenditure values are valid.
Problem 2:
Write a function called expenditure_stats_by_state(data, start, end) which takes three arguments: a dictionary of data in the format returned by read_data, a start financial year in the format XXXX-XX, and an end financial year in the same format. You can assume the financial year inputs start and end are valid.
For each state, the function calculates the following four statistics within the provided range of financial years:
average expenditure, rounded to the closest whole number.
sum total expenditure
minimum instance of expenditure
maximum instance of expenditure
If start is greater than end, the function should return -1.
The result is a dictionary with this information, as shown in the example below. If a state does not have this information, then the value is set to None.
You may assume the health expenditure data in data is ‘clean’, that is all invalid values have been replaced by None or empty string. If a nested dictionary contains a None value for the fin_year key or expenditure key, you should ignore it in your calculation. (If the dictionary has None for a different key, e.g. area, you should still include it in the calculation.)
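Once the rows for a state have been filtered down to the expenditure values inside the year range, the four statistics reduce to a small helper. This is only a sketch: stats_for_values is our own name, the filtering itself is not shown, and the sample values below are hypothetical numbers chosen to be consistent with the NSW row of the first example output:

```python
def stats_for_values(values):
    """Summarise one state's expenditure values within the year range.

    values: list of numeric expenditures (may be empty).
    Returns the four statistics, or all-None when there are no values."""
    if not values:
        return {"avg": None, "max": None, "min": None, "sum": None}
    return {"avg": round(sum(values) / len(values)),  # nearest whole number
            "max": max(values),
            "min": min(values),
            "sum": sum(values)}

print(stats_for_values([12, 315, 314]))
# {'avg': 214, 'max': 315, 'min': 12, 'sum': 641}
print(stats_for_values([]))
# {'avg': None, 'max': None, 'min': None, 'sum': None}
```

Handling the empty case first means the min/max/sum calls never see an empty list, which would otherwise raise an error.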
Here are some examples of what your function should return:
>>> data_cleaned = read_data("cleaned_sample.csv")
>>> expenditure_stats_by_state(data_cleaned, '1997-98', '2011-12')
{'ACT': {'avg': None, 'max': None, 'min': None, 'sum': None},
'NSW': {'avg': 214, 'max': 315, 'min': 12, 'sum': 641},
'NT': {'avg': None, 'max': None, 'min': None, 'sum': None},
'QLD': {'avg': None, 'max': None, 'min': None, 'sum': None},
'SA': {'avg': None, 'max': None, 'min': None, 'sum': None},
'TAS': {'avg': None, 'max': None, 'min': None, 'sum': None},
'VIC': {'avg': None, 'max': None, 'min': None, 'sum': None},
'WA': {'avg': None, 'max': None, 'min': None, 'sum': None}}
>>> data_cleaned = read_data("cleaned_data.csv")
>>> expenditure_stats_by_state(data_cleaned, '2000-01', '2010-11')
{'ACT': {'avg': 44, 'max': 499, 'min': -3, 'sum': 4699},
'NSW': {'avg': 578, 'max': 5982, 'min': -185, 'sum': 53792},
'NT': {'avg': 30, 'max': 340, 'min': 0, 'sum': 2559},
'QLD': {'avg': 377, 'max': 3827, 'min': -53, 'sum': 37351},
'SA': {'avg': 169, 'max': 1778, 'min': -22, 'sum': 15852},
'TAS': {'avg': 44, 'max': 368, 'min': -7, 'sum': 4275},
'VIC': {'avg': 359, 'max': 3879, 'min': -125, 'sum': 32309},
'WA': {'avg': 206, 'max': 2068, 'min': 0, 'sum': 1
Problem 3:
Your employers are interested in the distribution of expenditure across different areas of the health sector. One way to analyse this is to divide the range of possible expenditures into a number of equal-sized 'bins' (where a bin is just a subset of the overall range), then count the number of expenditures falling into each bin (if you've ever worked with histograms before, this should be very familiar).
For example, we could divide the total expenditure range [0, 5000] into ten bins: [0, 499], [500, 999], [1000, 1499], and so on, up to [4500, 5000]. The distribution of expenditures would then be summarised by 10 integers corresponding to the ten bins. In general, we experiment with the number of bins to find the number that gives the most informative distribution.
Write a function called funding_dist(data, n_bins, area); its full specification follows the examples below.
Here is an example of how your function should behave:
>>> data = read_data("cleaned_data.csv")
>>> funding_dist(data, 3, "Research")
[47, 3, 2]
>>> funding_dist(data, 8, "Private hospitals")
[56, 14, 3, 4, 2, 2, 1, 2]
>>> funding_dist(data, 8, "Research")
[39, 6, 3, 1, 0, 1, 1, 1]
>>> funding_dist(data, 12, "Public hospitals")
[75, 6, 2, 5, 1, 2, 3, 1, 0, 0, 1, 1]
funding_dist(data, n_bins, area) calculates the distribution of expenditures for a given area, over the range from the minimum expenditure to the maximum expenditure (inclusive at both ends), by dividing that range into n_bins bins and counting the number of expenditures that fall into each bin. The bin width should be an integer. Your function should return a list of ints, with each integer representing the number of expenditures falling in the corresponding bin.
If a nested dictionary in data contains a None value for the expenditure or area key, you should ignore it in your calculation. (If the dictionary has None for a different key, you should still include it in the calculation.)
You may assume that n_bins is a positive integer. Notice that including the maximum expenditure in the last bin may make the last bin slightly 'wider' than the others. For example, if
max_expend == 101, min_expend == -20, and n_bins == 6,
then bin_width == (101 - (-20)) // 6 == 20,
and the bins would be [-20, -1], [0, 19], [20, 39], [40, 59], [60, 79], and [80, 101]. Note that in this example, the last bin is wider than all the others.
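Under these definitions, the counting logic can be sketched as follows (the helper name and the clamping approach are our own, and the degenerate case where the range is smaller than n_bins, which would make the bin width zero, is ignored here):

```python
def funding_dist_counts(values, n_bins):
    """Count values into n_bins integer-width bins over [min, max]."""
    min_e, max_e = min(values), max(values)
    width = (max_e - min_e) // n_bins
    counts = [0] * n_bins
    for v in values:
        # Integer division gives the bin index; the maximum value would
        # land one past the end, so clamp it into the last (wider) bin
        index = min((v - min_e) // width, n_bins - 1)
        counts[index] += 1
    return counts

print(funding_dist_counts([-20, -1, 0, 19, 20, 80, 101], 6))
# [2, 2, 1, 0, 0, 2]
```

With the example range [-20, 101] and 6 bins, the width is 20, and the maximum value 101 is clamped into the last bin [80, 101].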
Problem 4:
Write a function called area_expenditure_counts(data, lower_spent, upper_spent) which creates a dictionary of the number of expenditures in the given expenditure amount bracket for each area. That is, each key in the dictionary should be an area name, and the value for that area should be an int counting the expenditures in that area that fall in the bracket specified by lower_spent and upper_spent (inclusive). Your dictionary should have a key for every area in VALID_AREAS, even the ones that have no expenditures in the bracket. The function should return the top 5 areas by expenditure count as a list of tuples [(area, expenditure_count), ...]. The top 5 areas should be listed in descending order by count, with ties broken by alphabetical order.
In this question, you should ignore any nested dictionary with a None value for the area key or the expenditure key; None values for other keys are acceptable. You may assume that lower_spent and upper_spent are integers. If lower_spent > upper_spent, no expenditure can fall in the bracket, so every area's count is 0 and the function should return the first five areas in alphabetical order, each paired with a count of 0 (as in the last example below).
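The required ordering (count descending, area name ascending for ties) can be expressed with a single sort key. The counts below are made up purely for illustration:

```python
# Hypothetical per-area counts for some bracket
counts = {"Public hospitals": 15, "Medical services": 15,
          "Dental services": 3, "Community health": 3,
          "Administration": 0, "Aids and appliances": 0}

# Negating the count sorts it descending, while the area name (the
# second element of the key) breaks ties alphabetically; keep the top 5
top5 = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:5]
print(top5)
# [('Medical services', 15), ('Public hospitals', 15),
#  ('Community health', 3), ('Dental services', 3), ('Administration', 0)]
```

Negating an integer count inside the key tuple avoids needing reverse=True, which would also reverse the alphabetical tie-break.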
Here are some examples of how your function should behave:
>>> data_cleaned = read_data("cleaned_data.csv")
>>> area_expenditure_counts(data_cleaned, 1000, 5000)
[('Medical services', 15), ('Public hospitals', 15),
('Benefit-paid pharmaceuticals', 6), ('All other medications', 4),
('Dental services', 3)]
>>>
>>> area_expenditure_counts(data_cleaned, 0, 1000)
[('Patient transport services', 88), ('Dental services', 85),
('Private hospitals', 82), ('Aids and appliances', 81),
('Public hospitals', 80)]
>>>
>>> area_expenditure_counts(data_cleaned, 4000, 8000)
[('Public hospitals', 3), ('Medical services', 2), ('Administration', 0),
('Aids and appliances', 0), ('All other medications', 0)]
>>>
>>> area_expenditure_counts(data_cleaned, -1000, 0)
[('Community health', 22), ('Medical expense tax rebate', 20),
('Patient transport services', 16), ('Dental services', 5),
('All other medications', 4)]
>>>
>>> area_expenditure_counts(data_cleaned, 5000, 1000)
[('Administration', 0), ('Aids and appliances', 0),
('All other medications', 0), ('Benefit-paid pharmaceuticals', 0),
('Capital expenditure', 0)]