Developing an OCR system
COM2004/3004 Assignment
Due: 3:00pm on Wednesday 12th December

Contents

• 1. Objective
• 2. Background
• 3. The task
• 4. What you are given
  – 4.1. The data
  – 4.2. The code
• 5. How to proceed
  – Step 1: Read and understand the code provided
  – Step 2: Test the code provided
  – Step 3: Processing the training data
  – Step 4: Implement the dimensionality reduction
  – Step 5: Implement the classifier
  – Step 6: Error correction (Difficult)
  – Additional rules
• 6. Submission
• 7. How your work will be assessed
  – Code quality (10 Marks)
  – Feature extraction (10 Marks)
  – Classification (10 Marks)
  – Error correction (10 Marks)
  – Overall performance (10 Marks)
• 8. Lateness penalty
1. Objective
• To build and evaluate an optical character recognition system that can process scanned book pages and turn them into text.
2. Background
In the lab classes in the second half of the course you will be experimenting with nearest neighbour based classification and dimensionality reduction techniques. In this assignment you will use the experience you have gained in the labs to implement the classification stage of an optical character recognition (OCR) system for processing scanned book pages.
OCR systems typically have two stages. The first stage, document analysis, finds a sequence of bounding boxes around paragraphs, lines, words and then characters on the page. The second stage looks at the content of each character bounding box and performs the classification, i.e., mapping a set of pixel values onto a character code. In this assignment the first stage has been done for you, so you will be concentrating on the character classification step.
The data in this assignment comes from pages of books. The test data has been artificially corrupted, i.e., random offsets have been added to the pixel values to simulate the effect of a poor-quality image.
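For intuition, corruption of this kind could be simulated along the lines of the sketch below. This is an illustration only: the handout does not specify the noise process beyond "random offsets", so the Gaussian offsets and the 0–255 greyscale range are assumptions.

import numpy as np

def add_noise(page_image, noise_level):
    """Add random offsets to a greyscale page image (illustrative only).

    noise_level is the standard deviation of the offsets; larger values
    simulate poorer quality pages.
    """
    offsets = np.random.normal(0.0, noise_level, size=page_image.shape)
    return np.clip(page_image + offsets, 0, 255)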
3. The task
Your task is to design a classifier that:
1. uses a feature vector containing no more than 10 dimensions;
2. operates robustly even on low quality, ‘noisy’ image data.
4. What you are given
You have been given data for training and testing your systems and some code to get you started.
4.1. The data
The data is stored in a subfolder named data and is split into data for training and data for evaluation. The data comes from pages of novels. There are 10 pages for training and 6 pages for testing. The testing pages have progressive amounts of noise added to them, i.e., test page 1 is the best quality and test page 6 is the poorest quality. For each page there are three files:
- a png format image file containing an image of the page. You should be able to view these files in any standard image viewing software.
- a file ending in the extension .bb.csv. This is a comma-separated values file giving the bounding box coordinates of each successive character on the page. Each line represents the position of a single character.
- a file ending in the extension .label.txt giving the correct ASCII label for each character on the page. There is a direct correspondence between the lines in the .bb.csv file and the .label.txt file.
4.2. The code
The code is organised into four Python files: train.py, evaluate.py, utils.py and system.py. The first three of these should not be changed. Your task is to rewrite the code in system.py to produce a working OCR system.
In brief, the files have the following function:
- train.py – this runs the training stage. It will read the complete set of training data, process it and store results in a file called model.json.gz in the data folder. It uses functions in system.py that you will need to modify and extend.
- evaluate.py – this runs the evaluation code. It will first read the results of the training stage stored in model.json.gz. It will then perform OCR on the test pages and evaluate the results. It will print out a percentage correct for each page. Again, it uses functions in system.py that you will need to modify and extend.
- utils.py – these are utility functions for reading image and label data and for reading and writing the model.json.gz files.
- system.py – the code in this file is used by both train.py and evaluate.py. It stores the dimensionality reduction and classification code and is the part of the software that you need to develop. The current version has dummy code which will run but which will produce poor results. The dummy dimensionality reduction simply truncates the feature vector to be 10 elements long (i.e., the first 10 pixels of the image). The dummy classifier outputs the first label in the list of valid labels regardless of the input.
Your task is to write a new version of system.py. Your solution must not change train.py, evaluate.py or utils.py. Once you are finished you will run train.py to generate your own version of model.json.gz. You will then submit the system.py along with the model.json.gz file. The program evaluate.py will then be run by the assignment assessors with the code and data that you have submitted. It will be run on a new set of test pages that you have not seen during development. The performance on these unseen test pages will form part of the assessment of your work.
5. How to proceed
The steps below should help you get started with implementing the system. Steps 3 to 6 are not necessarily sequential. Read through this section carefully before considering your approach.
Step 1: Read and understand the code provided
The code provided does all the file handling and feature extraction for you. However, it is important for you to understand how it is working so that you can develop your solution appropriately.
Step 2: Test the code provided
Check that you can run the code provided. Open a terminal in CoCalc. Navigate to the directory containing the assignment code,
cd com2004_labs/OCR_assignment/code/
Run the training step,
python3 train.py
Then run the evaluation step,
python3 evaluate.py dev
The code should print out the percentage of correctly classified characters for each page. The dummy code will produce results in the range 3% to 5% correct for each page.
Step 3: Processing the training data
The function process_training_data in system.py processes the training data and returns results in a dictionary called model_data. The program train.py calls process_training_data and saves the resulting model_data dictionary to the file model.json.gz. This file is then used by the classifier when evaluate.py is called. So, any data that your classifier needs must go into this dictionary. For example, if you are using a nearest neighbour classifier then the dictionary must contain the feature vectors and labels for the complete training set. If you are using a parametric classifier then the dictionary must contain the classifier’s parameters. The function is currently written with a nearest neighbour classifier in mind. Read it carefully and understand how to adapt it for your chosen approach.
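To make this concrete, the sketch below shows the general shape process_training_data might take for a nearest neighbour system. The parameter names are illustrative (check the actual signature in the supplied system.py), and the dimensionality reduction here is just the dummy truncation, to be replaced in Step 4. Since the dictionary is stored in model.json.gz, numpy arrays are converted to plain lists before being returned.

import numpy as np

def process_training_data(fvectors_train, labels_train):
    """Assemble the dictionary of data that the classifier will need.

    fvectors_train: one row of pixel features per training character.
    labels_train: the corresponding character labels.
    (Both parameter names are illustrative.)
    """
    model_data = {}
    model_data["labels_train"] = list(labels_train)
    # Placeholder reduction: keep the first 10 pixels, exactly as the
    # supplied dummy code does. Replace this with your Step 4 reduction.
    fvectors_reduced = np.asarray(fvectors_train)[:, :10]
    # The dictionary is saved as JSON, so store arrays as plain lists.
    model_data["fvectors_train"] = fvectors_reduced.tolist()
    return model_data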
Step 4: Implement the dimensionality reduction
You are free to use any dimensionality reduction technique of your choosing. PCA should perform well but is not necessarily the best approach. Start by looking at the function reduce_dimension in the existing system.py code provided. This function currently simply returns the first 10 pixels of each image and will not work well. It will need to be rewritten.
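For example, a PCA-based version might look something like the sketch below. The argument list of reduce_dimension in the supplied system.py may differ, and storing the mean and principal components in the model dictionary is only one possible way of carrying the PCA parameters from training to evaluation.

import numpy as np

def reduce_dimension(fvectors, model, n_dims=10):
    """Project feature vectors onto the first n_dims principal components.

    The model dictionary carries the PCA parameters from the training
    run to the evaluation run (key names here are assumptions).
    """
    fvectors = np.asarray(fvectors, dtype=float)
    if "mean" not in model:
        # Training run: estimate and store the PCA parameters.
        mean = np.mean(fvectors, axis=0)
        cov = np.cov(fvectors, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        # eigh returns eigenvalues in ascending order, so take the last
        # n_dims columns and reverse them (largest variance first).
        components = eigvecs[:, -n_dims:][:, ::-1]
        model["mean"] = mean.tolist()
        model["components"] = components.tolist()
    mean = np.array(model["mean"])
    components = np.array(model["components"])
    return np.dot(fvectors - mean, components)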
Step 5: Implement the classifier
You are free to use any classification technique of your choosing. A nearest neighbour classifier should work well but is not necessarily the best approach. Start by looking at the function classify_page in the existing system.py code provided. This function is currently just returning the first character in the list of valid labels regardless of the input. It will need to be rewritten.
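A simple nearest neighbour version of classify_page might look something like the sketch below, assuming the model dictionary holds the training vectors and labels under the illustrative key names used in Step 3; the real parameter names may differ.

import numpy as np

def classify_page(page, model):
    """Label each row of 'page' using a 1-nearest-neighbour rule.

    page: one reduced feature vector per character on the page.
    model: the dictionary produced by process_training_data.
    """
    train = np.array(model["fvectors_train"])
    labels = np.array(model["labels_train"])
    page = np.asarray(page, dtype=float)
    # Cosine similarity between each test vector and each training vector.
    dot = np.dot(page, train.T)
    norms = np.outer(np.linalg.norm(page, axis=1),
                     np.linalg.norm(train, axis=1)) + 1e-10
    nearest = np.argmax(dot / norms, axis=1)
    return labels[nearest]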
Step 6: Error correction (Difficult)
There is potential to fix classification errors by using the fact that sequences of characters must form valid words. This can be done by checking the classifier outputs for the characters making up a word against a dictionary of valid English words. If the word doesn’t appear in the list, it is possibly because there has been a classification error. Errors can then be fixed by looking for the closest matching word. For example, if the classifier outputs the sequence ‘Comquter’, this won’t be in the word list, but it can be corrected to the closest match, i.e., ‘Computer’. This simple approach is not without its problems, so feel free to experiment with this stage in order to come up with a better solution.
A suitable word list can be found at http://www-01.sil.org/linguistics/wordlists/english/.
This step is made more difficult by the fact that it may not be clear where a word starts and ends. You may try to infer this by looking at the spacing of the bounding boxes.
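One simple way of implementing the ‘closest match’ idea is sketched below, assuming the page’s labels have already been grouped into words (for example, by using the bounding box spacing as suggested above). The function name and the word_list argument are illustrative, case is ignored, matching is restricted to same-length words, and a correction is only accepted when a single character differs, which limits the risk of introducing new errors.

def correct_word(labels, word_list):
    """Replace a classified word with its closest same-length dictionary
    entry if it is not already a valid word (illustrative sketch).

    labels: list of single-character labels making up one word.
    word_list: a set of lower-case valid English words.
    """
    word = "".join(labels).lower()
    if word in word_list:
        return labels
    candidates = [w for w in word_list if len(w) == len(word)]
    if not candidates:
        return labels

    # Hamming distance: number of character positions that differ.
    def distance(w):
        return sum(a != b for a, b in zip(w, word))

    best = min(candidates, key=distance)
    # Only accept the correction if at most one character changes.
    return list(best) if distance(best) <= 1 else labels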
Additional rules
Some additional rules have been imposed that must be obeyed:
- The file model.json.gz must not be bigger than 3 MB.
- The evaluate.py program should not take more than 120 seconds to produce a result when run on the CoCalc servers.
- You may make use of any code that has been developed in the lab classes (even code appearing in the solutions – but you may want to improve it!).
- You may not use code from the Python scikit-learn module.

6. Submission
Deadline: 3:00pm Wednesday 12th December.
Submission will be via MOLE. You will be asked to submit the following.
- A copy of your system.py
- A copy of your data file model.json.gz
- A form (which will appear on MOLE) consisting of several questions asking you to:
- report the performance of your system on the development set;
- explain/justify the design of your feature selection;
- explain/justify the design of the classifier;
- explain/justify the design of the error correction code.
7. How your work will be assessed
The assignment is worth 50% of the module mark. We will be looking at the Python code quality, the overall design and the general performance of your program. You will be awarded a mark out of 50 made up from the following five 10-mark components.
Code quality (10 Marks)
• Is the code well presented?
• Is it easy to read?
• Does it make appropriate use of Python’s features?
• Is the code clearly documented?

Feature extraction (10 Marks)
- Has an appropriate feature extraction technique been employed?
- How has the choice and design of the feature extraction been justified?
- Has the chosen technique been well implemented?
Classification (10 Marks)
• Has an appropriate classification technique been employed?
• How has the choice and design of the classifier been justified?
• Has the chosen technique been well implemented?
Error correction (10 Marks)
- Has any attempt been made at error correction?
- Has the choice and design of the error correction code been justified?
- Has the chosen technique been well implemented?
Overall performance (10 Marks)
- Does the code run correctly?
- How does the performance compare to that achieved using a standard nearest neighbour and PCA approach?
The figures below give an indication of the approximate performance that you should expect using a basic nearest neighbour and PCA based approach.
Page    Score
1       98%
2       98%
3       83%
4       58%
5       39%
6       29%
8. Lateness penalty
There will be the standard 5% penalty for each working day late.
This is an individual assignment. Do not share your code with other students. Collusion will result in a loss of marks for all students involved.
(COM2004/3004 2018-19 Assignment Handout v1.0)