File Formats
INF 551
Wensheng Wu
File Formats
• Specify what information the bits in a file encode
• Example: text file
– String of characters with particular encoding
scheme, e.g., ASCII and Unicode
– E.g., TXT, HTML, JSON, XML
• Others: xls, ppt, pdf, jpg, gif, mp3, png, etc.
Roadmap
• Character encoding
– ASCII
– Unicode
• JSON (done earlier)
• XML (will talk about it next)
Code space & points
• Code space
– A range of numerical values available for encoding
characters
– E.g., 0 to 10FFFF for Unicode, 0 to 7F for ASCII
• Code point
– A value for a character in a code space
• Unicode code point
– Written as U+ followed by its hexadecimal value, e.g., U+0058 for
capital letter ‘X’
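As a small illustration of the code point notation, here is a sketch in Python 3 (an assumption; the deck's own snippets use Python 2). The helper name `u_plus` is made up for this example; `ord` and `chr` are Python built-ins that convert between a character and its code point.

```python
# A code point is just an integer in the code space.
def u_plus(ch):
    """Format a single character's code point in U+XXXX notation."""
    return "U+{:04X}".format(ord(ch))

print(u_plus("X"))       # code point of capital letter X
print(chr(0x58))         # back from code point to character
print(u_plus("\u20ac"))  # the euro sign
```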
Encoding (of code points)
• Code unit: the smallest unit (comprising a
number of bits) used to construct an encoding
for a code point
– Code unit for UTF-8: 8-bit
– Code unit for UTF-16: 16-bit
• UTF (Unicode Transformation Format)
encoding
– E.g., UTF-8 and UTF-16
Variable-length encoding
• Characters are encoded using codes of different
lengths
• In Unicode, a code point may be represented
using multiple code units
– E.g., 1-4 code units in UTF-8, 1-2 code units in UTF-16
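The code unit counts above can be checked with a short Python 3 sketch. The function name `code_units` and the sample characters are chosen for illustration; `utf-16-be` is used so that no byte order mark is added to the output.

```python
def code_units(ch):
    """Return (UTF-8 code units, UTF-16 code units) for one character."""
    utf8_units = len(ch.encode("utf-8"))            # each unit is 8 bits
    utf16_units = len(ch.encode("utf-16-be")) // 2  # each unit is 16 bits
    return utf8_units, utf16_units

samples = [
    "X",           # U+0058, ASCII letter
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside plane 0
]
for ch in samples:
    print("U+{:04X}".format(ord(ch)), code_units(ch))
```

Only the last sample, which lies outside plane 0, needs two code units in UTF-16.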
ASCII
• American Standard Code for Information
Interchange
• 128 characters: 7-bit code (code points: 0~7F)
– Digits: 0-9 (0x30 – 0x39)
– Uppercase letters: A-Z (0x41 – 0x5A)
– Lowercase letters: a-z (0x61 – 0x7A)
– White space (0x20)
– Punctuation symbols
– Control characters (e.g., Ctrl-C: 0x03)
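The ranges listed above can be turned into a small classifier. This is a Python 3 sketch; the function name `ascii_class` and its category labels are made up for the example, and everything printable that is not a digit, letter, or space is lumped into "punctuation".

```python
def ascii_class(ch):
    """Classify one character using the ASCII ranges from the slide."""
    cp = ord(ch)
    if cp > 0x7F:
        return "not ASCII"
    if 0x30 <= cp <= 0x39:
        return "digit"
    if 0x41 <= cp <= 0x5A:
        return "uppercase"
    if 0x61 <= cp <= 0x7A:
        return "lowercase"
    if cp == 0x20:
        return "space"
    if cp < 0x20 or cp == 0x7F:
        return "control"      # e.g., 0x03 (Ctrl-C)
    return "punctuation"

for ch in ["7", "Q", "q", " ", "\x03", "!"]:
    print(hex(ord(ch)), ascii_class(ch))
```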
Windows-1253
• Windows code page for Latin + Greek
characters
• Uses 8 bits per character
– Code values 0x00 ~ 0xFF
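To see the 8-bit code page in action, the following Python 3 sketch encodes a Greek letter with the standard library's `cp1253` codec (Python's name for Windows-1253). The variable names are illustrative only.

```python
alpha = "\u03b1"                  # Greek small letter alpha
encoded = alpha.encode("cp1253")  # Windows-1253: one byte per character

print(len(encoded))                      # number of bytes used
print(encoded.decode("cp1253") == alpha)  # round trip back to the character
print(len(alpha.encode("utf-8")))         # the same letter in UTF-8

# The lower half of the code page is shared with ASCII.
print("A".encode("cp1253") == "A".encode("ascii"))
```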
Unicode
• Unicode supports more characters than ASCII
and the various code pages
• Unicode separates code points from encoding
– In contrast to ASCII, where code point = encoding
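The separation of code point and encoding can be illustrated in Python 3: one code point, two different byte sequences. This is a sketch with illustrative variable names; `utf-16-be` is chosen to avoid a byte order mark.

```python
euro = "\u20ac"                   # one code point: U+20AC

utf8 = euro.encode("utf-8")       # three 8-bit code units
utf16 = euro.encode("utf-16-be")  # one 16-bit code unit (two bytes)
print(utf8.hex(), utf16.hex())

# In ASCII the stored byte is simply the code point itself.
ascii_x = "X".encode("ascii")
print(ascii_x[0] == ord("X"))
```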
Unicode
• Code space is divided into 17 planes
• Each plane = 2^16 contiguous code points
• Recall that code points range from 0 to 10FFFF
• Total code points = 17 * 2^16 = 1,114,112
• Note: 2^16 = 65,536
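The plane arithmetic can be verified with a few lines of Python 3. The constant and function names here are made up for the sketch; since each plane holds 2^16 code points, the plane number is simply the code point shifted right by 16 bits.

```python
PLANE_SIZE = 2 ** 16   # code points per plane
NUM_PLANES = 17

total = NUM_PLANES * PLANE_SIZE
print(total)                  # size of the whole code space
print(total == 0x10FFFF + 1)  # matches the range 0 to 10FFFF

def plane_of(cp):
    """Return the plane number (0-16) that contains code point cp."""
    return cp >> 16

print(plane_of(0x20AC))   # euro sign
print(plane_of(0x1F600))  # an emoji
```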
Planes in Unicode
Plane 0: BMP (Basic Multilingual Plane)
• Block 00 represents code points 0000~00FF
• Each block represents 256 code points
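With 256 code points per block, the two-digit block label is the high byte of a BMP code point. A minimal Python 3 sketch, with hypothetical helper names `bmp_block` and `block_range`:

```python
BLOCK_SIZE = 256  # code points per block, as stated above

def bmp_block(cp):
    """Return the block number (0x00-0xFF) of a code point in plane 0."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("code point is outside the BMP")
    return cp >> 8

def block_range(block):
    """Return the first and last code point covered by a block."""
    start = block * BLOCK_SIZE
    return start, start + BLOCK_SIZE - 1

print("{:02X}".format(bmp_block(0x58)))    # block holding 'X'
print("{:02X}".format(bmp_block(0x20AC)))  # block holding the euro sign
print(["{:04X}".format(v) for v in block_range(0)])
```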
UTF-8
• Encoding scheme for Unicode code space
• Code unit = 8 bits
• Variable length
– Code point may be represented using 1-4 code
units
UTF-8 Design
• ASCII characters use one code unit
– First bit is zero
• Other Unicode characters use up to 4 units
UTF-8 Features
• Backward compatibility
– One byte for ASCII, leading bit of byte is zero
• Clear distinction between single- and multi-byte
characters
– Bytes of a single-byte character start with 0; bytes of a multi-byte character start with 1
• Multi-byte sequences
– A leading byte starts with two or more 1’s, followed by a
0, e.g., ‘110’, ‘1110’, etc.
– One or more continuation bytes follow, each starting with ‘10’
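The bit patterns above are enough to classify any byte of a UTF-8 stream. The following Python 3 sketch does so with bit masks; the function name `byte_kind` and its labels are made up for the example.

```python
def byte_kind(b):
    """Classify one byte (an int 0-255) by its leading bits."""
    if (b & 0b10000000) == 0b00000000:
        return "single"        # 0xxxxxxx: ASCII
    if (b & 0b11000000) == 0b10000000:
        return "continuation"  # 10xxxxxx
    if (b & 0b11100000) == 0b11000000:
        return "lead-2"        # 110xxxxx: starts a 2-byte sequence
    if (b & 0b11110000) == 0b11100000:
        return "lead-3"        # 1110xxxx: starts a 3-byte sequence
    if (b & 0b11111000) == 0b11110000:
        return "lead-4"        # 11110xxx: starts a 4-byte sequence
    return "invalid"

for b in "X\u20ac".encode("utf-8"):
    print("{:08b}".format(b), byte_kind(b))
```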
UTF-8 Features
• Clear indication of code sequence length
– By # of 1’s in leading byte (for multi-byte)
• Self-synchronization
– Can find start of characters by backing up at most
3 bytes
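Both features can be sketched in Python 3. The helper names `sequence_length` and `char_start` are hypothetical, and the sketch assumes well-formed UTF-8 input: the first function counts the leading 1 bits of a leading byte, and the second backs up over continuation bytes (at most 3 of them) to find where a character starts.

```python
def sequence_length(lead):
    """Length of the sequence announced by a leading byte (not a continuation byte)."""
    if lead < 0x80:
        return 1              # single-byte (ASCII) character
    n, mask = 0, 0x80
    while lead & mask:        # count leading 1 bits
        n += 1
        mask >>= 1
    return n

def char_start(data, i):
    """Back up from index i to the first byte of the character containing it."""
    while i > 0 and (data[i] & 0b11000000) == 0b10000000:
        i -= 1                # skip continuation bytes (10xxxxxx)
    return i

data = "a\u20ac\U0001F600".encode("utf-8")  # 1 + 3 + 4 bytes
print(char_start(data, 3), sequence_length(data[1]))  # inside the euro sign
print(char_start(data, 7), sequence_length(data[4]))  # last byte of the emoji
```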
Example
• Encode ‘€’ using UTF-8
• Code point = U+20AC
– Binary: 10 0000 1010 1100 (14 significant bits)
• Needs 3 bytes in UTF-8 (too many bits for the 11-bit payload of the 2-byte form): E2 82 AC
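The 3-byte layout is 1110xxxx 10xxxxxx 10xxxxxx, which leaves 4 + 6 + 6 = 16 bits for the code point. A minimal Python 3 sketch of that bit packing follows; the function name `utf8_encode_3byte` is made up for the example, and it only handles the 3-byte range and does not reject surrogate code points.

```python
def utf8_encode_3byte(cp):
    """Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("code point does not use the 3-byte form")
    b1 = 0b11100000 | (cp >> 12)               # top 4 bits
    b2 = 0b10000000 | ((cp >> 6) & 0b111111)   # middle 6 bits
    b3 = 0b10000000 | (cp & 0b111111)          # low 6 bits
    return bytes([b1, b2, b3])

encoded = utf8_encode_3byte(0x20AC)
print(" ".join("{:08b}".format(b) for b in encoded))
print(encoded.hex())
print(encoded == "\u20ac".encode("utf-8"))  # agrees with the built-in codec
```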
Unicode in Python (Python 2)
• >>> a = u'\u20AC'  # note the u before the opening quote
• >>> print a
• €
• >>> e = u'€'
• >>> e
• u'\u20ac'
– The u prefix indicates a Unicode string
Unicode in Python
• >>> b = '€'
• >>> b
• '\xe2\x82\xac'
– UTF-8 encoding of € (a plain string in Python 2 holds bytes)
• >>> u'€'.encode('utf-8')
• '\xe2\x82\xac'
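The sessions above use Python 2 syntax (`print a`, the `u` prefix). For comparison, here is a rough Python 3 equivalent, offered as a sketch: in Python 3 every `str` is a Unicode string, and encoding produces a separate `bytes` object.

```python
e = "\u20ac"                 # no u prefix needed in Python 3
print(e)                     # prints the euro sign

encoded = e.encode("utf-8")  # str -> bytes
print(encoded)               # shown with a b prefix
print(encoded.decode("utf-8") == e)  # bytes -> str round trip

# One character, but three bytes once encoded as UTF-8.
print(len(e), len(encoded))
```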
Resources
• UTF-8
– https://en.wikipedia.org/wiki/UTF-8
• UTF-16
– https://en.wikipedia.org/wiki/UTF-16