Department of Computing and Information Systems
COMP20005 Engineering Computation Semester 1, 2015
Assignment 2
Learning Outcomes
In this project you will demonstrate your understanding of two-dimensional arrays, structs, and the engineering processes used to assemble and test a non-trivial program.
Understanding Data
A very wide range of telemetry, sensor, scientific, and experimental data is maintained in a format known as a “CSV” file, where CSV stands for Comma Separated Values. General CSV files can include strings as well as numbers; in this project you will add functionality to a tool that can be used to analyze the data stored in purely-numeric CSV files.
Each input file will start with a header line that gives symbolic names to the data values stored in the file. It will have anywhere between 1 and 50 words, separated by commas. An example appears below. A set of as many as 1,000 data lines follows the header line. Each data line contains nothing but numbers, again separated by commas. Each data line has exactly the number of fields specified by the words in the header row. For example, the following is a valid input file, and will be used in the examples below:
year,month,day,location,mintemp,maxtemp
2015,4,28,18,6.7,12.9
2015,4,28,22,12.7,19.1
2015,4,29,18,7.6,15.3
2015,4,29,22,13.4,21.9
2015,4,30,18,7.3,21.8
2015,4,30,22,13.2,23.2
2015,5,1,18,9.4,15.9
2015,5,1,22,16.1,27.2
2015,5,2,18,8.7,16.3
2015,5,2,22,14.2,21.4
In this example, the first three fields can be thought of as describing a date, the fourth as a location flag (“18” for Melbourne, “22” for Sydney, perhaps), and the last two fields as being temperature data over a five-day period. All of the numbers in all of the data columns should be treated as being double, even if they do not include decimal points; two numbers should be regarded as being equal if they differ by less than $10^{-6}$.
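The equality rule above can be captured in a small helper. The following is only a sketch: the names EPSILON and nearly_equal() are illustrative assumptions, and the skeleton may already provide something equivalent under a different name.

```c
#include <math.h>

/* Assumed name for the tolerance; the handout gives only the value 10^-6. */
#define EPSILON 1e-6

/* Hypothetical helper: returns 1 if a and b differ by less than EPSILON,
   and 0 otherwise. */
int
nearly_equal(double a, double b) {
    return fabs(a - b) < EPSILON;
}
```

A test such as nearly_equal(x, 18.0) is safer than x == 18.0, because values that have been through arithmetic may differ from the expected value in their last few bits.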
Copy the skeleton file that is linked from the LMS (http://people.eng.unimelb.edu.au/ammoffat/teaching/20005/ass2/ass2-skel.c). You will need to read through it carefully; once you understand its structure, you will be in a position to add further functionality. There are several quite complex things that you need to note:
- The use of fopen() and fscanf() in the function read_csv_file() to read data from the CSV file named as the first argument on the command line (use of fopen(), fscanf(), and related functions will not be part of the assessment for this subject). Function read_csv_file() is finished, and you should not need to change it.
- The details of the process of reading the CSV data into the structure D, counting the rows and columns, and also capturing the column headings in a string and breaking that string up into parts to make an array of strings. Note also that read_csv_file() converts any non-numeric fields in the CSV file into nan values so that any downstream computations will get infected. You may assume that there will not be any non-numeric data in any of the data files used for assessment testing.
- The function reassign_input(), which diverts stdin if there is a second file named on the command line, and sets fileinput to be true, which then causes the input to be echoed in the remainder of the program. The function freopen() will also not be examined in this subject.
- The way in which the main program loops, reading “commands” from stdin, and executing them one at a time.
As well as reading in the CSV file, the skeleton program implements two “commands”: typing “i” to the prompt “>” generates an “i”ndex listing of the fields in the input file; and typing “d” generates a complete “d”ump of the CSV data.
mac: ass2-soln test0.csv
file test0.csv:
6 columns and 10 rows of data
> i
col data
1 year
2 month
3 day
4 location
5 mintemp
6 maxtemp
> ^D
bye
mac:
Further example output will be shown in a lecture, and can be found on the LMS.
Stage 1 – Column Averages (marks up to 5/20)
The skeleton program includes stubs for several further commands. The first of these is the “a” command, which computes the “a”verage value in the column specified by its argument – with, in all cases, column numbering starting from one from the point of view of the user, as shown in the index listing. Implement the body of the function do_average(). For the example data, sample interactions include:
> a 5
average mintemp is 10.93 (over 10 values)
> a 6
average maxtemp is 19.50 (over 10 values)
All data values that are printed, and all values derived from data values, are to be printed to two decimal places.
Stage 1 requires that fewer than a dozen lines be added to the program. The marks assigned to this stage are primarily a reward for you spending several hours reading through the skeleton, and understanding how it is structured.
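As a rough illustration of the computation behind the “a” command, the sketch below averages one column of a two-dimensional array held in a struct. The type csv_t, its field names, the constants MAXROWS and MAXCOLS, and the function column_average() are all assumptions made for this example; the skeleton's own structure D and the signature of do_average() may well differ.

```c
/* Limits taken from the handout: at most 1,000 data lines, 50 columns.
   The constant names themselves are assumptions. */
#define MAXROWS 1000
#define MAXCOLS 50

/* Hypothetical stand-in for the skeleton's data structure. */
typedef struct {
    int nrows, ncols;
    double vals[MAXROWS][MAXCOLS];
} csv_t;

/* Returns the average of column col, where col is numbered from one as
   seen by the user. Returns 0.0 if there are no rows of data. */
double
column_average(const csv_t *D, int col) {
    double sum = 0.0;
    int r;
    if (D->nrows == 0) {
        return 0.0;
    }
    for (r = 0; r < D->nrows; r++) {
        sum += D->vals[r][col - 1];   /* user columns are 1-based */
    }
    return sum / D->nrows;
}
```

The only subtle point is the shift from the user's one-based column number to the zero-based array index. A result obtained this way would be printed with a "%.2f" format to meet the two-decimal-place rule.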
Stage 2 – Graphing Distributions (marks up to 10/20)
The “g” command generates a “g”raph of the values in a specified column. The two bounding values of that data column, max and min, should be computed, and then the range between $\min - \varepsilon$ and $\max + \varepsilon$ broken into 20 equal regions, where $\varepsilon = 10^{-6}$. The number of values in each of the buckets is then computed, and plotted as a scaled graph of at most 60 columns. For the same test data, but with the number of graph rows set to 5 rather than 20 to generate a shorter output for this handout:
> g 5
graph of mintemp, scaled by a factor of 1
14.22-- 16.10 [ 1]:*
12.34-- 14.22 [ 4]:****
10.46-- 12.34 [ 0]:
8.58-- 10.46 [ 2]:**
6.70-- 8.58 [ 3]:***
The exact details of the horizontal scaling process you incorporate in to your program will not be checked as part of the assessment testing, but the correctness of the vertical bucketing process will be. Output examples with the correct number of rows are available on the LMS.
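One possible way of carrying out the vertical bucketing is sketched here. The function bucket_counts() and the constant name EPSILON are assumptions made for illustration, and the data is passed as a plain array rather than via the skeleton's structure; this is not necessarily how the sample solution does it.

```c
#define EPSILON 1e-6   /* assumed name; value as given in the handout */

/* Counts how many of the n values fall into each of nbuckets equal-width
   regions spanning min - EPSILON to max + EPSILON. counts[0] is the
   lowest region; the printed graph lists the highest region first. */
void
bucket_counts(const double vals[], int n, int nbuckets, int counts[]) {
    double lo, hi, width;
    int i, b;
    for (b = 0; b < nbuckets; b++) {
        counts[b] = 0;
    }
    if (n <= 0) {
        return;
    }
    lo = hi = vals[0];
    for (i = 1; i < n; i++) {
        if (vals[i] < lo) lo = vals[i];
        if (vals[i] > hi) hi = vals[i];
    }
    lo -= EPSILON;
    hi += EPSILON;
    width = (hi - lo) / nbuckets;
    for (i = 0; i < n; i++) {
        b = (int)((vals[i] - lo) / width);
        if (b >= nbuckets) {
            b = nbuckets - 1;   /* guard against floating-point rounding */
        }
        counts[b]++;
    }
}
```

Widening the range by EPSILON at each end means that the minimum and maximum values fall strictly inside the first and last regions, so neither can land on a boundary. On the mintemp column with five buckets this gives counts of 3, 2, 0, 4, and 1 from lowest to highest, matching the example graph.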
Stage 3 – Category Averages (marks up to 15/20)
The “c” command computes “c”ategory averages, where one column is used to define the categories:
> c 4 5
location   average mintemp
18.00      7.94 (over 5 values)
22.00      13.92 (over 5 values)
The output listing should be in ascending order of the category specified by the first column index. Any of the input columns can be used to define the categories, and any of the input columns can be the one that gets averaged. That means that you may not employ category values (such as 18 or 22) to index arrays, even if they are ints in the input file; you have to be more careful than that. You may assume that there will be at most MAXCATS categories required. Further output examples are available on the LMS.
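Because category values cannot be used as array subscripts, one option is to keep a small table of the distinct values seen so far and search it for each row. The sketch below does that, inserting new categories in ascending order as it goes. The type cat_t, the function category_sums(), the name EPSILON, and the value 100 given to MAXCATS are all assumptions; the handout names MAXCATS but does not state its value.

```c
#include <math.h>

#define EPSILON 1e-6   /* assumed name; value as given in the handout */
#define MAXCATS 100    /* assumed value; the skeleton defines the real one */

/* Hypothetical record for one category. */
typedef struct {
    double key;    /* the category value, for example 18.0 */
    double sum;    /* running total of the averaged column */
    int count;     /* number of rows in this category */
} cat_t;

/* Groups vals[] by keys[], keeping cats[] in ascending key order.
   Returns the number of categories found, or -1 if more than MAXCATS
   distinct categories would be needed. */
int
category_sums(const double keys[], const double vals[], int n,
        cat_t cats[]) {
    int ncats = 0, i, j;
    for (i = 0; i < n; i++) {
        /* linear search, using the tolerance rather than == */
        for (j = 0; j < ncats; j++) {
            if (fabs(cats[j].key - keys[i]) < EPSILON) {
                break;
            }
        }
        if (j == ncats) {
            /* new category: shift larger keys up to make room */
            if (ncats == MAXCATS) {
                return -1;
            }
            while (j > 0 && cats[j - 1].key > keys[i]) {
                cats[j] = cats[j - 1];
                j--;
            }
            cats[j].key = keys[i];
            cats[j].sum = 0.0;
            cats[j].count = 0;
            ncats++;
        }
        cats[j].sum += vals[i];
        cats[j].count++;
    }
    return ncats;
}
```

Each average is then cats[j].sum / cats[j].count. Keeping the table sorted while inserting avoids a separate sorting pass, at the cost of a little shifting whenever a new category appears.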
Stage 4 – Correlation Coefficients (marks up to 18/20)
Category averages can be helpful if the values in the column being used to create the divisions are discrete, for example, locations. If the values are numeric, such as the dates or the temperatures, it is helpful to be able to compute a correlation coefficient. The “k” command computes Kendall’s tau-a correlation coefficient for n paired values, defined as:

$$\tau_A = \frac{2(n_c - n_d)}{n(n-1)}$$

where
- $n_c$ = the number of concordant pairs of values
- $n_d$ = the number of discordant pairs of values
A concordant pair is two rows of data where the values are in the same order in both of the columns being compared. A discordant pair is one where the two values are in reverse order. For example, in the dataset shown above, when comparing the final two columns, rows one and two (for April 28) are concordant, because 6.7 < 12.7 and 12.9 < 19.1. On the other hand, the April 29 and 30 rows for location 18 are discordant, because 7.6 > 7.3, whereas 15.3 < 21.8. In total, comparing the final two columns of the input data, there are 36 concordant pairs, and 9 discordant pairs. Row pairs in which the two values in either of the columns are equal to each other are counted as neither concordant nor discordant. All row pairs are checked when counting the number of concordant and discordant pairs. Wikipedia (http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient) provides more information.
If two sets of paired values completely agree in their ordering, then a value of τ = 1 will be reported. If they are oppositely ordered, τ = −1. When τ = 0, there is no correlation between the two columns.
> k 5 6
tau coefficient between mintemp and maxtemp = 0.60
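The definition can be turned into code fairly directly by checking every pair of rows. The following is a sketch only: the function kendall_tau_a() and the constant name EPSILON are illustrative assumptions, and the two columns are passed as plain arrays rather than via the skeleton's structure.

```c
#include <math.h>

#define EPSILON 1e-6   /* assumed name; value as given in the handout */

/* Computes Kendall's tau-a for the n paired values x[i], y[i] by
   checking all n(n-1)/2 row pairs. Returns 0.0 if n < 2, since no
   pairs exist in that case. */
double
kendall_tau_a(const double x[], const double y[], int n) {
    int i, j, nc = 0, nd = 0;
    double dx, dy;
    if (n < 2) {
        return 0.0;
    }
    for (i = 0; i < n; i++) {
        for (j = i + 1; j < n; j++) {
            dx = x[i] - x[j];
            dy = y[i] - y[j];
            if (fabs(dx) < EPSILON || fabs(dy) < EPSILON) {
                continue;   /* tied in a column: counts as neither */
            }
            if ((dx > 0) == (dy > 0)) {
                nc++;       /* same order in both columns */
            } else {
                nd++;       /* opposite order */
            }
        }
    }
    return 2.0 * (nc - nd) / ((double)n * (n - 1));
}
```

For the mintemp and maxtemp columns of the example data this counts 36 concordant and 9 discordant pairs, giving 2(36 − 9)/(10 × 9) = 0.60. The nested loop is quadratic in the number of rows, which is not a concern at 1,000 rows.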
Stage 5 – Correlation Plots (marks up to 20/20)
For the last two marks, generate a two-dimensional density “p”lot with two columns both split into buckets. If there are x data values falling into a given cell of the plot, then the value $\lfloor \log_2(x + 1) \rfloor$ is plotted, with ‘a’ considered to come after ‘9’, and then ‘b’ after ‘a’, and so on. Cells with no data values falling into them are plotted as a “.” character. Examples of the required output are shown on the LMS.
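The mapping from a cell count to the character that gets plotted might be sketched as follows. The function name cell_symbol() is an assumption, and integer halving is used in place of a call to log2() so that no floating-point rounding is involved.

```c
/* Hypothetical helper: maps the number of data values x in one cell to
   its plot character. Empty cells give '.', levels 1 to 9 give the
   digits, and level 10 onwards gives 'a', 'b', and so on. Assumes a
   character set in which the lowercase letters are contiguous. */
char
cell_symbol(int x) {
    int level = 0, v;
    if (x <= 0) {
        return '.';
    }
    /* count the halvings needed to bring x + 1 down to 1, which is
       the integer value of floor(log2(x + 1)) */
    for (v = x + 1; v > 1; v /= 2) {
        level++;
    }
    if (level <= 9) {
        return (char)('0' + level);
    }
    return (char)('a' + (level - 10));
}
```

Under this mapping a cell holding one or two values plots as ‘1’, three to six values as ‘2’, and seven to fourteen as ‘3’. With at most 1,000 data lines the level cannot exceed 9, so the letter branch is there only for completeness.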
The boring stuff…
This project is worth 20% of your final mark. A rubric explaining the marking expectations will be provided on the LMS.
You need to submit your program for assessment; detailed instructions on how to do that will be posted on the LMS once submissions are opened. Submission will not be done via the LMS; instead you will need to log in to a Unix server and submit your files to a software system known as submit. You can (and should) use submit both early and often – to get used to the way it works, and also to check that your program compiles correctly on our test system, which has some different characteristics to the lab machines. Failure to follow this simple advice is highly likely to result in tears. Only the last submission that you make before the deadline will be marked.
You may discuss your work during your workshop, and with others in the class, but what gets typed into your program must be individual work, not copied from anyone else. So, do not give hard copy or soft copy of your work to anyone else; do not “lend” your memory stick to others; and do not ask others to give you their programs “just so that I can take a look and get some ideas, I won’t copy, honest”. The best way to help your friends in this regard is to say a very firm “no” when they ask for a copy of, or to see, your program, pointing out that your “no”, and their acceptance of that decision, is the only thing that will preserve your friendship. A sophisticated program that undertakes deep structural analysis of C code identifying regions of similarity will be run over all submissions in “compare every pair” mode. See https://academichonesty.unimelb.edu.au for more information.
Deadline: Programs not submitted by 10:00am on Monday 25 May will be penalized at the rate of two marks per day or part day late. Students seeking extensions for medical or other “outside my control” reasons should email ammoffat@unimelb.edu.au as soon as possible after those circumstances arise. If you attend a GP or other health care professional as a result of illness, be sure to take a Health Professional Report form with you (get it from the Special Consideration section of the Student Portal); you will need this form to be filled out if your illness develops into something that later requires a Special Consideration application to be lodged. You should scan the HPR form and send it in connection with any non-Special Consideration assignment extension requests.
Marks and a sample solution will be available on the LMS by Monday 8 June.
And remember, programming is fun!