Semester 2 2021
Lecture 3, Part IV: Data Formats (cont.)

Semi-structured Data – Cont.
PDF

PDF – Portable Document Format
• A format for presenting documents independently of application software / OS
• Introduced by Adobe in 1993, standardized in 2008
• Large amounts of useful data stored in PDF format (e.g. forms, invoices, etc.). Legacy data in particular.

Challenges of PDF format
• Format is very flexible, limited consistency
• May contain different types of data; similar-looking documents can be represented very differently
• Text may be stored as images, e.g., for scanned documents
• Data is unstructured
• As a result, automated extraction is difficult
• May require substantial preprocessing to extract useable data

Approaches to PDF data
Convert the PDF to text using a tool or library (e.g. pdftotext from poppler, PDFMiner):
pdftotext nvsr65_05.pdf nvsr65_05.txt
The resulting text file begins:
National Vital Statistics Reports Volume 65, Number 5
June 30, 2016
Deaths: Leading Causes for 2014
by , Ph.D., Division of Vital Statistics
Abstract
, , ,
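
The same conversion can be scripted. The following is a minimal sketch, assuming poppler's pdftotext is installed and on the PATH; the helper names pdftotext_command and pdf_to_text are hypothetical and chosen only for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Build the argument list for poppler's pdftotext command-line tool."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical page layout as far as it can
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def pdf_to_text(pdf_path, txt_path=None, keep_layout=False):
    """Run pdftotext on a PDF and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler first")
    pdf_path = Path(pdf_path)
    # Default output name: same as the PDF, with a .txt extension
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Calling pdf_to_text("nvsr65_05.pdf") would write nvsr65_05.txt next to the PDF and return its contents. Passing keep_layout=True may help with multi-column or tabular pages, at the cost of extra whitespace in the output.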

Approaches to PDF data
• Problem: plain-text conversion loses layout and formatting information
• Alternative: convert the PDF to HTML or XML using a tool like pdftohtml
• Example: http://pdftohtml.sourceforge.net/
• Retains layout and formatting, but the resulting file is not easy to parse

Approaches to PDF data
Can identify the location of text on PDFs
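
One way to work with text locations is to parse XML output that records a position for each text fragment. The sketch below assumes input in the style produced by pdftohtml with its -xml option (page elements containing text elements with top and left attributes). SAMPLE_XML is a hypothetical, hand-written sample rather than real tool output, and the function names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the style of `pdftohtml -xml` output (not real tool output)
SAMPLE_XML = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="72" left="54" width="410" height="16" font="0">National Vital Statistics Reports</text>
    <text top="98" left="54" width="120" height="14" font="1">June 30, 2016</text>
    <text top="130" left="54" width="300" height="20" font="2"><b>Deaths: Leading Causes for 2014</b></text>
  </page>
</pdf2xml>"""


def text_boxes(xml_string):
    """Return one dict per text fragment, with its page number and position."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        for node in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "top": int(node.get("top")),
                "left": int(node.get("left")),
                # itertext() also collects text nested inside <b>, <i>, etc.
                "text": "".join(node.itertext()).strip(),
            })
    return boxes


def text_in_region(boxes, page, top_min, top_max):
    """Select the text whose top coordinate falls inside a vertical band of a page."""
    return [b["text"] for b in boxes
            if b["page"] == page and top_min <= b["top"] <= top_max]
```

For the sample above, text_in_region(text_boxes(SAMPLE_XML), 1, 90, 100) selects only the date line. Position-based selection like this tends to work when documents share a fixed template, and breaks when the layout shifts.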

Approaches to PDF data
And/or identify important data by pattern matching
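
Pattern matching can be sketched with regular expressions applied to the extracted text. The example below uses the report header lines from the pdftotext output shown earlier. The patterns and the parse_header helper are assumptions made for illustration, and would need adjusting for documents with a different header style.

```python
import re
from datetime import datetime

# Header lines taken from the pdftotext example earlier in the lecture
HEADER_TEXT = """National Vital Statistics Reports Volume 65, Number 5
June 30, 2016
Deaths: Leading Causes for 2014"""

# "Volume 65, Number 5" -> volume and issue number
ISSUE_RE = re.compile(r"Volume\s+(?P<volume>\d+),\s+Number\s+(?P<number>\d+)")
# A line consisting only of a date such as "June 30, 2016"
DATE_RE = re.compile(r"^(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})$", re.MULTILINE)


def parse_header(text):
    """Extract volume, issue number and publication date, or None where absent."""
    issue = ISSUE_RE.search(text)
    date = DATE_RE.search(text)
    return {
        "volume": int(issue.group("volume")) if issue else None,
        "number": int(issue.group("number")) if issue else None,
        "date": datetime.strptime(date.group("date"), "%B %d, %Y").date() if date else None,
    }
```

Returning None for a missing field, rather than raising an error, makes it easier to run the same patterns over many documents and review the ones that did not match.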

Approaches to PDF data
• Need to use Optical Character Recognition (OCR) software to process images into text
Sample Company INVOICE
123 Fake Street
Phone: 0440 000 444 INVOICE #100 DATE: 1/1/2020
To:

University of Melbourne
Parkville
COMMENTS OR SPECIAL INSTRUCTIONS: Your comments
QUANTITY   DESCRIPTION   UNIT PRICE   TOTAL
1          Computer      $998.50      $998.50
2          Monitor       $223.10      $446.20
SUBTOTAL   $1445.70
GST        $144.57
TOTAL DUE  $1590.27
Make all checks payable to Sample Company.
If you have any questions concerning this invoice, contact: Chris at 0440 000 444,
THANK YOU FOR YOUR BUSINESS!

OCR: Optical Character Recognition
• The roots of modern OCR technology date back over 100 years, but OCR remains an active research area
• Modern OCR implementations rely on machine learning
• Many OCR packages available:
• Tesseract (first developed by HP in the 1980s, currently open source and sponsored by Google). A Python wrapper, pytesseract, is also available
• Many commercial packages (e.g. Adobe Acrobat, ABBYY)
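
A minimal sketch of OCR from Python follows, assuming the Tesseract engine plus the pytesseract and Pillow packages are installed. Because OCR output can contain misread characters, the sketch also shows one possible sanity check on the invoice text from the earlier slide: confirming that subtotal plus GST equals the total due. The helper names and the AMOUNT_RE pattern are hypothetical.

```python
import re
from decimal import Decimal


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on an image file and return the recognised text."""
    # Imported lazily so the parsing helpers below work without the OCR stack installed
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


# Totals section of the sample invoice text shown in the lecture
OCR_TEXT = """1 Computer $998.50 $998.50 2 Monitor $223.10 $446.20 SUBTOTAL $1445.70
GST $144.57
TOTAL DUE $1590.27"""

# A label followed by a dollar amount, e.g. "GST $144.57"
AMOUNT_RE = re.compile(r"(SUBTOTAL|GST|TOTAL DUE)\s+\$([\d,]+\.\d{2})")


def labelled_amounts(text):
    """Map each recognised label to its amount, using Decimal for exact arithmetic."""
    return {label: Decimal(value.replace(",", ""))
            for label, value in AMOUNT_RE.findall(text)}


def totals_consistent(amounts):
    """Check that subtotal plus GST equals the total due."""
    return amounts["SUBTOTAL"] + amounts["GST"] == amounts["TOTAL DUE"]
```

In practice the text passed to labelled_amounts would come from ocr_image("invoice.png"), where the file name is a placeholder. A failed consistency check is a useful signal that a digit was misread and the document needs manual review.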

Unstructured Data
Text

Unstructured data – Text
Text files…
• No structure
• Harder to index
• Harder to organise
• Lacks regularity and decomposable internal structure
• How can we process and search for textual information?
More on text data next week.

What you should know
• Why do we have different data formats and why do we wish to transform between different formats?
• Motivation for using relational databases to manage information
• What is a CSV file, what is a spreadsheet, and what is the difference?
• Difference between HTML and XML and when to use each
• Be able to read and write data in XML (elements, attributes)
• Be able to read and write data in JSON
• Difference between XML and JSON; applications where each can be used.
• What PDF is, the challenges it poses, and approaches to processing PDF data.