File Formats

INF 551

Wensheng Wu

File Formats

• Specify what information the bits in a file encode

• Example: text file

– String of characters with a particular encoding
scheme, e.g., ASCII or Unicode

– E.g., TXT, HTML, JSON, XML

• Others: xls, ppt, pdf, jpg, gif, mp3, png, etc.

Roadmap

• Character encoding

– ASCII

– Unicode

• JSON (done earlier)

• XML (will talk about it next)

Code space & points

• Code space
– A range of numerical values available for encoding
characters

– E.g., 0 to 10FFFF for Unicode, 0 to 7F for ASCII

• Code point
– A value for a character in a code space

• Unicode code point
– U+ followed by its hexadecimal value, e.g., U+0058 for
capital letter ‘X’
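As a small illustration of the code point notation above, the following Python 3 sketch converts between characters and their U+ labels. The helper name `code_point_label` is made up for this example.

```python
# Convert between characters and Unicode code points (Python 3).

def code_point_label(ch):
    """Return the U+XXXX label of a single character (hypothetical helper)."""
    return "U+{:04X}".format(ord(ch))  # ord() gives the code point as an int

print(code_point_label("X"))       # capital letter X
print(code_point_label("\u20ac"))  # euro sign
print(chr(0x58))                   # chr() maps a code point back to a character
```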

Encoding (of code points)

• Code unit: the smallest unit (comprising a
number of bits) used to construct an encoding
for a code point
– Code unit for UTF-8: 8-bit

– Code unit for UTF-16: 16-bit

• UTF (Unicode Transformation Format)
encoding
– E.g., UTF-8 and UTF-16

Variable-length encoding

• Characters are encoded using codes of different
lengths

• In Unicode, a code point may be represented
using multiple code units

– E.g., 1-4 in UTF-8, 1-2 in UTF-16
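The code unit counts above can be checked with a short Python 3 sketch. The function name `code_units` and the sample characters are chosen only for illustration.

```python
# Count how many code units one character needs in UTF-8 and UTF-16 (Python 3).

def code_units(ch):
    """Return (UTF-8 code units, UTF-16 code units) for one character."""
    utf8_units = len(ch.encode("utf-8"))            # one unit = 8 bits
    utf16_units = len(ch.encode("utf-16-be")) // 2  # one unit = 16 bits, no BOM
    return utf8_units, utf16_units

# X (U+0058), e-acute (U+00E9), euro sign (U+20AC), emoji (U+1F600)
for ch in ["X", "\u00e9", "\u20ac", "\U0001F600"]:
    print("U+{:04X}".format(ord(ch)), code_units(ch))
```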

ASCII

• American Standard Code for Information
Interchange

• 128 characters: 7-bit code (code points: 0~7F)
– Digits: 0-9 (0x30 – 0x39)
– Uppercase letters: A-Z (0x41 – 0x5A)
– Lowercase letters: a-z (0x61 – 0x7A)
– White space (0x20)
– Punctuation symbols
– Control characters (e.g., Ctrl-C: 0x03)
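The ranges listed above can be turned into a simple classifier. This is a minimal Python 3 sketch; the function `ascii_class` and its category names are assumptions made for this example, and 0x7F (DEL) is treated as a control character.

```python
# Classify an ASCII character by its code point range (Python 3).

def ascii_class(ch):
    """Return a rough category for a single ASCII character."""
    cp = ord(ch)
    if cp > 0x7F:
        raise ValueError("not an ASCII character")
    if 0x30 <= cp <= 0x39:
        return "digit"
    if 0x41 <= cp <= 0x5A:
        return "uppercase"
    if 0x61 <= cp <= 0x7A:
        return "lowercase"
    if cp == 0x20:
        return "space"
    if cp < 0x20 or cp == 0x7F:
        return "control"
    return "punctuation"  # everything else in 0x21..0x7E

for ch in ["7", "Q", "q", " ", "\x03", "!"]:
    print(hex(ord(ch)), ascii_class(ch))
```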

Windows-1253

• Windows code page for Latin + Greek
characters

• Uses 8 bits per character

– 0x00 ~ 0xFF
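To see the one-byte-per-character property, Python 3's built-in `cp1253` codec can be used. The sample text is an arbitrary choice for this sketch.

```python
# Encode Greek letters with the Windows-1253 code page (Python 3).

text = "\u03b1\u03b2"            # Greek small letters alpha and beta
encoded = text.encode("cp1253")  # cp1253 is Python's name for Windows-1253

print(encoded)                    # one byte per character
print(len(encoded))
print(len(text.encode("utf-8")))  # the same text needs more bytes in UTF-8
print(encoded.decode("cp1253") == text)
```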

Unicode

• Unicode supports more characters than ASCII
and various codepages

• Unicode separates code points from encoding

– In contrast to ASCII, where code point = encoding
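The separation of code point and encoding can be illustrated in Python 3: one code point yields different byte sequences under different encodings, while for ASCII the byte value equals the code point. The chosen characters and encodings are just examples.

```python
# One code point, several encodings (Python 3).

euro = "\u20ac"  # euro sign, code point U+20AC

encodings = {}
for name in ("utf-8", "utf-16-be", "utf-32-be"):
    encodings[name] = euro.encode(name).hex()  # bytes shown as hex digits
    print(name, encodings[name])

# In ASCII the encoded byte is simply the code point.
ascii_byte = "X".encode("ascii")[0]
print(ascii_byte == ord("X"))
```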

Unicode

• Code space is divided into 17 planes

• Each plane = contiguous 2^16 code points

• Recall that code points range from 0 to 10FFFF

Total code points = 17 * 2^16 = 1,114,112 code points

Note 2^16 = 65,536
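The plane arithmetic can be verified with a few lines of Python 3. The names `PLANE_SIZE`, `NUM_PLANES`, and `plane_of` are made up for this sketch.

```python
# Plane arithmetic for the Unicode code space (Python 3).
import sys

PLANE_SIZE = 2 ** 16  # code points per plane
NUM_PLANES = 17

def plane_of(ch):
    """Return the plane number (0..16) that a character's code point falls in."""
    return ord(ch) >> 16  # drop the low 16 bits

total = NUM_PLANES * PLANE_SIZE
print(total)                         # size of the whole code space
print(total == sys.maxunicode + 1)   # sys.maxunicode is the largest code point
print(plane_of("\u20ac"))            # euro sign
print(plane_of("\U0001F600"))        # an emoji outside plane 0
```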

Planes in Unicode

Plane 0: BMP (Basic Multilingual Plane)

Block 00 represents code points 0000~00FF

Each block represents 256 code points
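Since each block here covers 256 code points, the block number of a BMP code point is its high byte. A minimal Python 3 sketch, with hypothetical helper names `block_index` and `block_range`:

```python
# Map BMP code points to 256-code-point blocks (Python 3).

def block_index(cp):
    """Return the block number (0x00..0xFF) of a code point in plane 0."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("code point is outside the BMP")
    return cp >> 8  # the high byte selects the block

def block_range(block):
    """Return the (first, last) code point covered by a block."""
    start = block << 8
    return start, start + 0xFF

print("{:02X}".format(block_index(0x0058)))   # block of 'X'
print("{:02X}".format(block_index(0x20AC)))   # block of the euro sign
print(["{:04X}".format(cp) for cp in block_range(0x00)])
```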

UTF-8

• Encoding scheme for Unicode code space

• Code unit = 8 bits

• Variable length

– Code point may be represented using 1-4 code
units

UTF-8 Design

• ASCII characters use one code unit

– First bit is zero

• Other Unicode characters use up to 4 units

UTF-8 Features

• Backward compatibility
– One byte for ASCII, leading bit of byte is zero

• Clear distinction between single- and multi-byte
characters
– A single-byte character starts with a 0 bit; every byte of a multi-byte character starts with a 1 bit

• Multiple lengths
– A leading byte starts with two or more 1’s followed by a 0, e.g., ‘110’, ‘1110’, etc.
– One or more continuation bytes follow, each starting with ‘10’
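The bit patterns above are enough to tell the role of any byte in a UTF-8 stream. The following Python 3 sketch uses a made-up function `classify_byte`; it does not check for byte values that are invalid in UTF-8 (such as 0xF8 to 0xFF).

```python
# Classify UTF-8 bytes by their leading bits (Python 3).

def classify_byte(b):
    """Return the role of one byte value (0..255) in a UTF-8 stream."""
    if b & 0b10000000 == 0:
        return "single"        # 0xxxxxxx: ASCII character
    if b & 0b11000000 == 0b10000000:
        return "continuation"  # 10xxxxxx
    return "leading"           # 11xxxxxx: starts a multi-byte sequence

data = "X\u20ac".encode("utf-8")  # 'X' followed by the euro sign
roles = [classify_byte(b) for b in data]
for b, role in zip(data, roles):
    print(format(b, "08b"), role)
```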

UTF-8 Features

• Clear indication of code sequence length

– By the number of leading 1’s in the leading byte (for multi-byte)

• Self-synchronization

– Can find start of characters by backing up at most
3 bytes
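Both features can be sketched in Python 3 as follows. The helper names (`leading_ones`, `sequence_length`, `char_start`) are invented for this example, and the code assumes the input is well-formed UTF-8.

```python
# Sequence length and self-synchronization in UTF-8 (Python 3).

def leading_ones(byte):
    """Count the 1 bits at the top of a byte before the first 0 bit."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count

def sequence_length(lead):
    """Return the number of bytes in the sequence started by a leading byte."""
    ones = leading_ones(lead)
    if ones == 0:
        return 1           # single-byte (ASCII) character
    if 2 <= ones <= 4:
        return ones        # 110..., 1110..., 11110...
    raise ValueError("not a valid UTF-8 leading byte")

def char_start(data, i):
    """Back up from index i (at most 3 bytes) to the start of its character."""
    start = i
    while start > 0 and i - start < 3 and (data[start] & 0xC0) == 0x80:
        start -= 1         # skip continuation bytes (10xxxxxx)
    return start

data = "a\u20ac\U0001F600".encode("utf-8")  # 1-byte, 3-byte, 4-byte characters
print(sequence_length(data[1]))  # length of the euro sign's sequence
print(char_start(data, 3))       # last byte of the euro sign
print(char_start(data, 7))       # last byte of the emoji
```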

Example

• Encode ‘€’ using UTF-8

• Code point = U+20AC

– Binary: 0010 0000 1010 1100

• Need 3 bytes in UTF-8: 1110xxxx 10xxxxxx 10xxxxxx

– 11100010 10000010 10101100 = 0xE2 0x82 0xAC
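The bit packing behind this example can be written out by hand. The function `utf8_encode` below is a teaching sketch in Python 3, not a replacement for `str.encode`; in particular it does not reject surrogate code points (U+D800 to U+DFFF).

```python
# Hand-rolled UTF-8 encoder for a single code point (Python 3).

def utf8_encode(cp):
    """Encode one code point as UTF-8 bytes by packing its bits."""
    if cp < 0x80:                      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                 # 4 bytes: 11110xxx then three 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point is outside the Unicode code space")

encoded = utf8_encode(0x20AC)          # euro sign
bits = " ".join(format(b, "08b") for b in encoded)
print(bits)
print(encoded == "\u20ac".encode("utf-8"))  # compare with the built-in codec
```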

Unicode in Python

• >>> a = u'\u20AC' # note: the u prefix is required before the quote (Python 2)

• >>> print a

• €

• >>> e = u'€'

• >>> e

• u'\u20ac'

The u prefix indicates a Unicode string (Python 2 syntax; in Python 3, every str is Unicode)

Unicode in Python

• >>> b = '€'

• >>> b

• '\xe2\x82\xac'

– UTF-8 encoding of € (a Python 2 byte string, assuming a UTF-8 terminal)

• >>> u'€'.encode('utf-8')

• '\xe2\x82\xac'
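The session above is Python 2. As a rough Python 3 counterpart (a sketch, not part of the original slides), text lives in `str`, encoded data lives in `bytes`, and the conversion between them is always explicit:

```python
# Python 3 counterpart of the Python 2 session above.

euro = "\u20ac"                   # str is always Unicode; no u prefix needed
print(euro)

encoded = euro.encode("utf-8")    # str -> bytes
print(encoded)

decoded = encoded.decode("utf-8") # bytes -> str
print(decoded == euro)
print(len(euro), len(encoded))    # 1 code point, 3 bytes
```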

Resources

• UTF-8

– https://en.wikipedia.org/wiki/UTF-8

• UTF-16

– https://en.wikipedia.org/wiki/UTF-16
