Assignment 2 – CSC2062 “AIDA”
Worth 25% of the module assessment. Assignment is marked out of 100 marks. Deadline: 11pm Friday, 19th March 2021.
This version: 2021-01-22.
Changelog:
2021-01-22: corrected a few minor typos
Introduction
In this assignment, you will:
(a) Create a dataset of handwritten symbols (which you will use for your analyses and experiments in the rest of Assignment 2, and in Assignment 3).
(b) Perform feature engineering, i.e. calculate features (variables) from the handwritten symbols which may be useful for identifying the handwritten symbols automatically.
(c) Perform statistical analysis of the datasets, using methods of statistical inference.
(d) Implement and evaluate some introductory machine learning models that perform classification on the dataset.
When you use a procedure that has an element of randomness, please use the seed value 42 (your code should give the same results each time it runs). This assignment must be completed in R. You may not use Microsoft Excel to complete any part of this assignment.
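As a minimal sketch of the seed requirement, the snippet below fixes the seed before a random draw and shows that resetting it reproduces the same result. The `sample()` call and the variable names are only illustrative assumptions, not part of the assignment.

```r
# Fix the seed before any random procedure so that reruns give identical results.
set.seed(42)
first_draw <- sample(1:168, 5)

# Resetting the seed reproduces exactly the same draw.
set.seed(42)
second_draw <- sample(1:168, 5)

identical(first_draw, second_draw)  # TRUE
```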
Please read carefully the information about the assessment criteria and marking process at the end of this document.
Section 1 (10 marks): Creating a dataset
This section asks you to build a dataset of images composed of written numbers, letters and mathematical symbols. Each image is represented by a black & white matrix with size 25 rows by 25 columns. In the matrix, the number “1” represents black pixels and “0” represents white pixels. As such, one image can be stored in a plaintext “.csv” file containing the matrix (and no headers), as in these examples:
Class a b
Example Image
one three
Page 1 of 8

Image Matrix csv file
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0 0,0,0,0,0,1,1,1,0,0,0,0,1,1,0,0 0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0 0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0 0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0 0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0 0,0,0,1,1,0,0,0,0,0,0,0,1,1,0,0 0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0 0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0 0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0 0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0 0,0,0,0,0,0,1,1,0,1,1,1,0,1,0,0 0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0 0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0 0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0 0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0 0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0 0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0 0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0 0,0,0,1,1,0,0,1,1,1,0,0,0,0,0,0 0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0 0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0 0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0 0,0,1,0,0,1,1,1,0,0,0,1,0,0,0,0 0,1,0,0,0,0,0,0,1,1,1,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0 0,0,0,0,1,1,1,0,1,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0 0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0 0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0 0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0 0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Figure 1: Examples of handwritten images and their 25×25 matrix representation in plaintext.
The goal is to create a dataset containing eight handwritten images of each of the digits {1,2,3,4,5,6,7}, eight handwritten images of each of the letters {a,b,c,d,e,f,g}, and eight handwritten images of each of the mathematical symbols {<, >, =, ≤, ≥, ≠, ≈}. We will refer to these as the digit, letter and math datasets, respectively. Each image should be obtained by writing the hand-written symbol yourself (as described below). The quality of the drawing is not essential, as long as the images can easily be read by a human. The images will vary from sample to sample, due to your handwriting; however, each character should fit reasonably well in the 25 x 25 box (i.e. do not draw a tiny character in one corner of the 25 x 25 box; this will make your life easier when it comes to doing analyses!). In total, you are creating 168 images.
Each image is represented by a black & white matrix with size 25 rows by 25 columns. In the matrix, the number “1” represents black pixels and “0” represents white pixels. As such, one image can be stored in a comma-delimited plaintext “.csv” file containing the matrix (and no header row); see Fig. 1 above.
You may use whatever means you prefer to obtain the 168 .csv files, provided they are handwritten and are in the comma-delimited .csv format specified above. However, it is strongly recommended that you use the software GIMP (http://www.gimp.org). GIMP is available for free for all PC OSs, and is also installed on the lab machines and the EEECS virtual machines. Using GIMP, you can create a new image with 25 by 25 points (px), advanced options 1 pixel/pt, color space grayscale, fill with background colour. This will give you a small white square, which you can magnify to e.g. 2000% in order to make it easier to draw on. To draw on the image, you can select the pencil tool and adjust the brush size to 1 pixel.
Figure 2: Creating the blank canvas of size 25 x 25 pixels in the GIMP interface.
The standard file formats of GIMP are useful to save the images, but we need a more easily readable format. One good option is to export as PGM, type ASCII. This PGM file can be opened in GIMP, but it is also simply a text file that can be opened in a text editor (or read as text file by R code). The PGM text file has a header consisting of the following four lines:
P2
# CREATOR: …
25 25
255
The third and fourth lines of the header above specify the pixel array size and the maximum allowed pixel value, respectively. (The images are greyscale, with 0 representing fully black and 255 representing fully white).¹
The remaining lines of the file specify the pixel values, with one value on each line; the total number of pixel values should correspond to the specified array size (i.e. 25*25=625).
For our purposes, a number < 128 represents a black pixel, while a number >= 128 represents a white one. Such a format can be easily converted into a matrix containing ones and zeros, as presented in Figure 1 above (you can write some R code to do this; reading in the PGM file and writing the .csv file). You shall save each image matrix as a csv file following the specification above, and using the filename STUDENTNR_LABEL_INDEX.csv, where STUDENTNR is your student number (e.g. 4123456), INDEX is a two-digit code from ‘01’ to ‘08’, indexing the set of 8 images you must create for each symbol, and LABEL is the name of the symbol in the image, as specified in Fig 3 below.
Symbol | Label
1 | one
2 | two
3 | three
4 | four
5 | five
6 | six
7 | seven

Symbol | Label
a | a
b | b
c | c
d | d
e | e
f | f
g | g

Symbol | Label
< | less
> | greater
= | equal
≤ | lessequal
≥ | greaterequal
≠ | notequal
≈ | approxequal
Figure 3: labels to use for the created images
For example, if your student number is 4123456, then 4123456_notequal_08.csv would be the eighth image you created for the ≠ symbol. (As well as creating the csv files, you may also want to keep the PGM files, in case you need to inspect the data later on).
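The PGM-to-csv conversion and the file naming described above could be sketched in R roughly as follows. This is only one possible approach: the function names `pgm_to_csv` and `image_filename` are hypothetical, and the sketch assumes the four-line header shown earlier followed by one pixel value per line.

```r
# Convert an ASCII (P2) PGM file to a 0/1 matrix and write it as a
# header-less, comma-delimited csv file.
# Assumes a four-line header, then one pixel value per line, stored row by row.
pgm_to_csv <- function(pgm_path, csv_path, size = 25) {
  lines <- readLines(pgm_path)
  pixels <- as.integer(lines[-(1:4)])        # drop the four header lines
  stopifnot(length(pixels) == size * size)   # 625 values expected for 25 x 25
  grey <- matrix(pixels, nrow = size, ncol = size, byrow = TRUE)
  black <- ifelse(grey < 128, 1L, 0L)        # < 128 is black (1), otherwise white (0)
  write.table(black, csv_path, sep = ",",
              row.names = FALSE, col.names = FALSE)
  invisible(black)
}

# Build a file name of the form STUDENTNR_LABEL_INDEX.csv with a two-digit index.
image_filename <- function(student_nr, label, index) {
  sprintf("%s_%s_%02d.csv", student_nr, label, index)
}

image_filename("4123456", "notequal", 8)  # "4123456_notequal_08.csv"
```

In a real run the function would be called in a loop over all 168 PGM files, writing into the image directory rather than handling one file at a time.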
As part of your submission, upload the csv files that you create in a directory called “section1_images”, along with any code you wrote to create the csv files, in a folder called “section1_code” (see submission instructions at the end of this document).
It is very important to upload the images in the correct csv format as these files will be used to verify your calculations in the next section. The .csv files should be comma delimited, not tab-delimited or anything else. File encoding should be UTF8 (not UTF8-BOM or anything else). You can check the encoding in Notepad++.
In your report, very briefly (2-3 sentences) explain in your own words how you created the images and obtained the matrices from them.
¹ For further information about this image format, see https://en.wikipedia.org/wiki/Netpbm_format

Section 2 (30 marks): Feature Engineering
Using each 25×25 matrix obtained from an image as described above, you must create an array of characteristics that describe some features of the image. Each feature will be a number (i.e. each feature is a numeric variable). There are 14 features in total.
Before we describe the features, let us define some vocabulary. An n-tile is an n×n selection from the image array. This is an example of a 2-tile: [image not reproduced]
Features to be calculated (corresponding to columns of your features output file):
Feature Index | Feature Short Name | Feature Description
- | label | The true name of the symbol in the image (i.e. one of the 21 symbol names given in Fig. 3). The label is not a true feature, and should not be used as a feature for statistical tests or during model training.
- | index | The index of this image instance (a number from 1 to 8). The index is not a true feature, and should not be used as a feature for statistical tests or during model training.
1 | nr_pix | The number of black pixels in the image.
2 | rows_with_2 | Number of rows with exactly 2 black pixels.
3 | cols_with_2 | Number of columns with exactly 2 black pixels.
4 | rows_with_3p | Number of rows with 3 or more black pixels.
5 | cols_with_3p | Number of columns with 3 or more black pixels.
6 | height | The vertical distance, in pixels, between the topmost and bottommost black pixels in the image (measuring from the pixel centre).
7 | width | The horizontal distance, in pixels, between the leftmost and rightmost black pixels in the image (measuring from the pixel centre).
8 | left2tile | The number of unique 2-tiles in the image where the leftmost two entries are black and the rightmost two entries are white: ◧. Tiles may overlap.
9 | right2tile | The number of unique 2-tiles in the image where the rightmost two entries are black and the leftmost two entries are white: ◨.
10 | verticalness | The sum of the previous two features, divided by the number of black pixels in the image.
11 | top2tile | The number of unique 2-tiles in the image where the top two entries are black and the bottom two entries are white.
12 | bottom2tile | The number of unique 2-tiles in the image where the bottom two entries are black and the top two entries are white.
13 | horizontalness | The sum of the previous two features, divided by the number of black pixels in the image.
14 | [your label] | Define a custom feature based on 3-tiles. Explain and justify the rationale for your feature (it should capture information not captured by other features).
Your task in this section is to write code to calculate each of the features above.
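As a rough illustration of features 1 to 13, the sketch below computes them for a single 0/1 image matrix. The helper names `count_2tiles` and `calc_features` are hypothetical, and the code rests on two assumptions that you should check against your own reading of the definitions: a "unique 2-tile" is taken to mean each distinct 2×2 window position (windows may overlap), and the image is assumed to contain at least one black pixel. The custom feature 14 is deliberately left out.

```r
# Count the 2x2 windows of `img` that match `pattern` exactly.
# Every window position is visited once, so overlapping tiles are counted.
count_2tiles <- function(img, pattern) {
  count <- 0
  for (r in seq_len(nrow(img) - 1)) {
    for (j in seq_len(ncol(img) - 1)) {
      if (all(img[r:(r + 1), j:(j + 1)] == pattern)) count <- count + 1
    }
  }
  count
}

# Features 1-13 for one 0/1 image matrix (1 = black pixel).
# Assumes the image has at least one black pixel.
calc_features <- function(img) {
  row_sums <- rowSums(img)
  col_sums <- colSums(img)
  black_rows <- which(row_sums > 0)
  black_cols <- which(col_sums > 0)
  nr_pix <- sum(img)

  # matrix() fills column by column, so c(1, 1, 0, 0) is a black left column.
  left2tile   <- count_2tiles(img, matrix(c(1, 1, 0, 0), nrow = 2))
  right2tile  <- count_2tiles(img, matrix(c(0, 0, 1, 1), nrow = 2))
  top2tile    <- count_2tiles(img, matrix(c(1, 0, 1, 0), nrow = 2))
  bottom2tile <- count_2tiles(img, matrix(c(0, 1, 0, 1), nrow = 2))

  c(nr_pix = nr_pix,
    rows_with_2 = sum(row_sums == 2),
    cols_with_2 = sum(col_sums == 2),
    rows_with_3p = sum(row_sums >= 3),
    cols_with_3p = sum(col_sums >= 3),
    height = max(black_rows) - min(black_rows),
    width = max(black_cols) - min(black_cols),
    left2tile = left2tile,
    right2tile = right2tile,
    verticalness = (left2tile + right2tile) / nr_pix,
    top2tile = top2tile,
    bottom2tile = bottom2tile,
    horizontalness = (top2tile + bottom2tile) / nr_pix)
}

# Made-up test image: a vertical bar, 10 pixels tall and 1 pixel wide.
img <- matrix(0, nrow = 25, ncol = 25)
img[5:14, 10] <- 1
f <- calc_features(img)
f[["nr_pix"]]        # 10
f[["height"]]        # 9
f[["verticalness"]]  # 1.8
```

The nested loop is slow but easy to verify by hand, which fits the statement later in this document that code efficiency is not important.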
Save your calculated features in a file called STUDENTNR_features.csv, where STUDENTNR is your student number. This file will consist of 169 rows. The first row gives the column names (i.e. the strings in the Feature Short Name column in the table above, comma-delimited). The remaining 168 rows list the comma-separated feature values for each of your 168 images. The first entry in the row will be the LABEL word, the second will be the image INDEX, and the remaining 14 entries will be the calculated features.
For example, the features for your first “=” image may be as follows:
equal,1,24,0,12,2,0,10,15,0,0,0,24,24,48,0
The 8 rows that correspond to the 8 instances of a particular symbol should be grouped together in the features file, and the order of those 8 rows should correspond to the INDEX used in the image filenames. In other words, the 168 data rows of STUDENTNR_features.csv should be sorted first by the label (alphabetical order) and secondly by the index.
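The row ordering and file layout described above might be produced along the following lines. The data frame here is a tiny made-up stand-in for the real 168-row table, and the output goes to a temporary directory purely so the sketch can run anywhere; in the assignment the file would be written under the features folder instead.

```r
# Made-up stand-in for the real feature table (only one feature column shown).
features <- data.frame(
  label  = c("two", "a", "two", "a"),
  index  = c(2, 2, 1, 1),
  nr_pix = c(31, 40, 29, 42)   # invented values, for illustration only
)

# Sort first by label (alphabetical order), then by index.
features <- features[order(features$label, features$index), ]

# Write a comma-delimited file whose first row holds the column names.
out_file <- file.path(tempdir(), "STUDENTNR_features.csv")  # placeholder location
write.csv(features, out_file, row.names = FALSE, quote = FALSE)

readLines(out_file)
```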
If you cannot calculate a particular feature, you may use a random integer between 0 and 10 for the feature values instead. (You will lose marks for not calculating the feature, but you can use the random values in the analyses that follow in the subsequent section. You should report that you have done this in the assignment report).
In your report, very briefly describe and explain the code you have written to calculate the features above. If you ran into difficulties, you should still explain your thought processes and attempts to calculate the features. In the case of the custom feature, you should explain your rationale for choosing the feature you did, as well as how it is calculated (i.e. you should give a justification for why you think this feature should be useful).
You should put the file STUDENTNR_features.csv in a folder called section2_features. Put code for this section in a folder called section2_code. The working directory for the code should be the section2_code directory. Your code should use relative paths; i.e. it should read the image matrices from “../section1_images” and save the feature file to “../section2_features”.
Section 3 (35 marks): Statistical analyses of feature data
In this section, you will perform statistical analyses of the feature data, in order to explore which features are important for distinguishing between different kinds of handwritten symbols.
You shall use descriptive statistics (mean, variance, etc.), hypothesis testing, and suitable visualisation to perform your analysis of the data. You are encouraged to provide tables, figures, and/or graphs in the report to support your discussions and findings. When performing tests, always consider whether multiple test correction is needed.
It is your responsibility to define the appropriate assumptions to run the tests, and to choose an appropriate test according to the data characteristics and the question that you are studying. In general, you will not be told what tests or R functions to use; it is up to you to explain and justify your choice. You are not necessarily restricted to the hypothesis tests that were discussed in the lectures. You may assume a significance level of 0.05 for the analyses when running hypothesis testing.
In particular, in the report you should address each of the following subtasks, using appropriate statistical tests, tables, graphs, etc.
1. Construct suitable histograms for the nr_pix, height, and cols_with_3p features, for the full set of 168 items. Briefly describe the shape of the distributions and comment on any interesting patterns across the datasets.
2. Suppose you randomly sample a digit image from the full set of images. What is the probability that the number of pixels in the image is greater than 20?
3. Present useful summary statistics (e.g. mean and standard deviation) about all the features, for (a) the full set of letters, (b) the full set of digits (c) the entire set of 168 items. Briefly discuss the summary statistics, and whether they already suggest which features may be useful for discriminating letters and digits. For features you feel may be interesting for discrimination between groups, consider suitable visualisations (e.g. histogram of feature values for the groups²).
4. Investigate the relationship between the “height” variable and the “verticalness” variable. Are these variables linearly associated? Consider suitable visualisation. Describe and conduct a suitable statistical test to measure the degree of linear association between these two variables.
5. Are there features which are useful to discriminate between the set of digits and the set of letters? (Consider a statistical test which tests for differences between two groups). List the three most useful features. Consider suitable visualisation. Briefly interpret your findings (i.e. why might these features be useful) and the validity of any assumptions.
For all questions above, you shall explain your reasoning, assumptions and steps of the procedure (including the statistical analysis) when preparing the report. Use statistics to justify your reasoning. If you are generating p-values for analysing the statistical significance of some features, make sure to explain how they were obtained. It is your task to decide and justify what the most appropriate inference to be performed in each case is, and to discuss the results you obtained.
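To make the mechanics of some of these subtasks concrete, here is a hedged sketch on made-up data. The particular choices (a Pearson correlation test, a Wilcoxon rank-sum test per feature, and Holm correction) are assumptions for illustration only; the assignment leaves the choice and justification of tests to you, and real feature data may violate the assumptions these tests make.

```r
set.seed(42)

# Made-up feature values standing in for 56 digit and 56 letter images.
group <- rep(c("digit", "letter"), each = 56)
feats <- data.frame(
  nr_pix = c(rnorm(56, 30, 5), rnorm(56, 45, 5)),
  height = c(rnorm(56, 18, 2), rnorm(56, 15, 2)),
  width  = rnorm(112, 10, 2)
)

# Empirical probability that a randomly sampled digit image has nr_pix > 20.
p_gt_20 <- mean(feats$nr_pix[group == "digit"] > 20)

# Test for linear association between two numeric variables.
ct <- cor.test(feats$height, feats$width, method = "pearson")

# One two-group test per feature, followed by a multiple-test correction.
p_raw <- sapply(feats, function(x) wilcox.test(x ~ group)$p.value)
p_adj <- p.adjust(p_raw, method = "holm")
sort(p_adj)
```

Ranking the adjusted p-values gives one possible basis for naming the most useful features, though effect sizes and plots are usually worth inspecting alongside them.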
Put code for this section in a folder called section3_code. Your code should use relative paths; i.e. it should read the feature data from “../section2_features”.
Section 4 (25 marks): Regression and Machine Learning
1. Suppose that instead of calculating horizontalness in Section 2, you instead would like to predict the horizontalness value from the feature variables 1-12. Fit a multiple regression model to predict horizontalness as best you can from a subset of these variables (consider approaches to feature selection). Give the results table of the regression model.
2. Using any 3 features that you think should be useful (justifying your choices, e.g. on the basis of results and visualisations from Section 3.5 above), use logistic regression to build a classifier that discriminates between the “letter” and “digit” classes. Use 5-fold cross-validation to evaluate the accuracy of your fitted model. Briefly (1-2 sentences) interpret your results.
3. Does your model in subsection 4.2 distinguish between the “letter” and “digit” categories for the 112 images significantly more accurately than a “random” model that just randomly responds “letter” 50% of the time and “digit” 50% of the time? Perform a suitable statistical analysis using the binomial distribution.
Put code for this section in a folder called section4_code. Your code should use relative paths; i.e. it should read the feature data from “../section2_features”.
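A possible shape for subtasks 2 and 3 is sketched below on made-up data: logistic regression evaluated with 5-fold cross-validation, followed by a one-sided binomial test against a 50/50 guesser. The column names `f1`, `f2`, `f3` and `is_letter` are placeholders for whichever three features and class coding you choose, and the 0.5 probability threshold is an assumption.

```r
set.seed(42)

# Made-up data standing in for the 112 letter and digit images.
n <- 112
dat <- data.frame(
  is_letter = rep(c(0, 1), each = n / 2),
  f1 = c(rnorm(n / 2, 0), rnorm(n / 2, 1.5)),
  f2 = rnorm(n),
  f3 = rnorm(n)
)

# 5-fold cross-validation: every image is held out exactly once.
k <- 5
fold <- sample(rep(1:k, length.out = n))
correct <- logical(n)
for (i in 1:k) {
  test_rows <- fold == i
  fit <- glm(is_letter ~ f1 + f2 + f3, data = dat[!test_rows, ],
             family = binomial)
  prob <- predict(fit, newdata = dat[test_rows, ], type = "response")
  correct[test_rows] <- (prob > 0.5) == (dat$is_letter[test_rows] == 1)
}
accuracy <- mean(correct)

# One-sided binomial test: is the number of correct predictions larger
# than expected from guessing each class with probability 0.5?
bt <- binom.test(sum(correct), n, p = 0.5, alternative = "greater")
bt$p.value
```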
Assessment criteria and marking process
The most important criterion in marking is the completeness, accuracy, quality and clarity of your report (approximately 75% weighting, across the full submission). In your report, you should clearly demonstrate that you understand the methods used in each sub-task. Explain your chosen approach, your reasoning, and the assumptions and steps of the procedures used. You should explain and interpret your results, demonstrating understanding and independent thinking. What are your results telling you? Are the results what you would expect? If you ran into difficulties, explain what they were and the efforts you made to try to overcome them.
² A nice example: https://stackoverflow.com/questions/36049729/r-ggplot2-get-histogram-of-difference-between-two-groups
Code has a weighting in marking of approximately 15%. Your code should be clear and logically organised, and do what is required, but code efficiency and code sophistication are not important (this assignment does not require complex programming). However, you should use loops and variables (rather than hard-coded values) where appropriate. If you use freely licenced code, packages, or libraries (which is encouraged), these should be appropriately referenced (e.g. by citing a URL in a comment). For example, using StackOverflow code snippets is fine, provided you acknowledge the use and provide the URL to the code snippets in the comments, and follow the MIT licence. The code must be easy to use and the comments must include information about the required steps to replicate the results that you have obtained and are presenting in your report (transparency and replicability are essential in data analysis).
Attention to detail and following the assignment instructions accurately will also be considered in marking (approximately 10% weighting). Each sub-task has a precise specification. Make sure you carefully follow the instructions, and use the features specified for each task, the specified procedures (seed value, data file specifications and file names, directory structure and names, etc). Make sure you upload your deliverable files precisely in the specified formats.
Your report should explain how you have performed the analysis, but do not explain details of code implementation in your report – use the source code comments for that.
It should be straightforward for the assessor to rerun your code to produce the same results as presented in your report. Ensure that the different subsections are clearly labelled in the source code, and in the report. Each of these subtasks should be addressed in a separate subsection of your report, following the report template.
Deliverables
You must submit your assignment online, using the module webpage, by 11pm Friday, 19th March 2021. The online uploaded file must be a ZIP file called assignment2_STUDENTNR.zip, containing multiple files and directories. The contents of the zip file are specified below (bold text indicates folder names):
• STUDENTNR_assignment2_report.pdf
• section1_code
  o [your code files; a single code file is preferred]
• section1_images
  o [168 .csv files with the following naming format: STUDENTNR_LABEL_INDEX.csv]
  o Optionally, also include the PGM files used to create the csv files, with the same naming format
• section2_code
  o [your code files; a single code file is preferred]
• section2_features
  o STUDENTNR_features.csv
• section3_code
  o [your code files; a single code file is preferred]
• section4_code
  o [your code files; a single code file is preferred]
Please use the provided report template for preparing your report (or create an equivalent LaTeX format). Ensure that the header and footer information (student name, student number) is clearly visible on the PDF. The word limit for the report is 4000 words (excluding tables and figures; you can include as many tables and figures as you feel is appropriate).
A RAR file is not a ZIP file. A broken or corrupt ZIP file is not a ZIP file. Do not include .Rdata files in your upload; these may make your zip file very large.
It is your responsibility to ensure the assignment is uploaded and double-checked in good time before the deadline. Standard university penalties apply for late submission.
By submitting this assignment, you acknowledge that it is your own work and that you are aware of university regulations regarding academic offences, including (but not restricted to) plagiarism and collusion. Collusion/plagiarism will be manually and algorithmically checked for.