Unicode and UTF-8
A standard for the discrete representation of written text
Computer Science and Engineering College of Engineering The Ohio State University
The Big Picture
characters → code points → binary encoding
Character            Code point   UTF-8 octets
Latin M (m)          U+006D       6D
Cyrillic ef (ф)      U+0444       D1 84
Euro sign (€)        U+20AC       E2 82 AC
Apostrophe (’)       U+2019       E2 80 99
Tei chou ten (奵)    U+5975       E5 A5 B5
Text: A Sequence of Glyphs
Glyph: “An individual mark on a written medium that contributes to the meaning of what is written.”
See foyer floor in main library
One character can have many glyphs
Example: Latin E can be e, e, e, e, e, e, e… (the same letter in different typefaces)
One glyph can be different characters
Example: A is both (capital) Latin A and Greek Alpha
One unit of text can consist of multiple glyphs
An accented letter (é) is two glyphs
The ligature of f+i (fi) is two glyphs
Glyphs vs Characters
[Figure: glyphs mapped to the characters Latin small E, Greek capital alpha, and Latin capital A]
Security Issue
Visual homograph: Two different characters that look the same
Would you click here: www.paypаl.com ?
Oops! The second ‘a’ is actually CYRILLIC SMALL LETTER A
This site was successfully registered in 2005
Other examples: combining characters
ñ = LATIN SMALL LETTER N WITH TILDE (U+00F1)
ñ = LATIN SMALL LETTER N + COMBINING TILDE (U+006E U+0303)
“Solution”
Heuristics that warn users when languages are mixed and homographs are possible
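The homograph examples above can be checked in a few lines of JavaScript. This is only a sketch: the helper name mixesLatinAndCyrillic is hypothetical, and a real browser heuristic would be considerably more involved than one script test.

```javascript
// Latin "a" and Cyrillic "а" look alike but are different characters.
const latinA = "a";          // LATIN SMALL LETTER A
const cyrillicA = "\u0430";  // CYRILLIC SMALL LETTER A

// Hypothetical heuristic: flag a host name that mixes Latin and Cyrillic letters.
function mixesLatinAndCyrillic(host) {
  return /\p{Script=Latin}/u.test(host) && /\p{Script=Cyrillic}/u.test(host);
}

// Precomposed ñ versus n followed by COMBINING TILDE.
const precomposed = "\u00F1";
const combined = "n\u0303";

console.log(latinA === cyrillicA);                          // false
console.log(mixesLatinAndCyrillic("www.payp\u0430l.com"));  // true
console.log(mixesLatinAndCyrillic("www.paypal.com"));       // false
console.log(precomposed === combined);                      // false
console.log(precomposed === combined.normalize("NFC"));     // true
```

Normalizing to NFC before comparing handles the combining-character case, but it does nothing for the Latin/Cyrillic case, which is why a separate mixed-script warning is needed.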
Unicode Code Points
Each character is assigned a unique code point
A code point is defined by an integer value, and is also given a name
one hundred and nine (109, or 0x6d): LATIN SMALL LETTER M
Convention: Write code points as U+hex
Example: U+006D
As of May 2019, v12 (see unicode.org):
Contains almost 138,000 code points
Covers 150 scripts (and counting…)
unicode.org/charts/
Example Recent Addition (v11)
emoji-versions.html#2019
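As a small illustration of the integer/name/U+hex convention, the sketch below converts between a character, its integer value, and the U+hex notation. The helper formatCodePoint is a hypothetical name introduced here, not part of any library.

```javascript
// Write an integer code point in the U+hex convention,
// padding to at least four hex digits.
function formatCodePoint(n) {
  return "U+" + n.toString(16).toUpperCase().padStart(4, "0");
}

console.log("m".codePointAt(0));           // 109
console.log(formatCodePoint(109));         // U+006D
console.log(String.fromCodePoint(0x6d));   // m
console.log(formatCodePoint(0x20ac));      // U+20AC
```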
Unicode: Mapping to Code Points
characters → code points
Latin M → U+006D
Cyrillic ef → U+0444
Euro sign → U+20AC
Apostrophe → U+2019
Tei chou ten → U+5975
Organization
Code points are grouped into categories
Basic Latin, Cyrillic, Arabic, Cherokee, Currency, Mathematical Operators, …
Standard allows for 17 x 2^16 code points
0 to 1,114,111 (i.e., > 1 million)
U+0000 to U+10FFFF
Each group of 2^16 called a plane
U+ppnnnn: same leading digits pp ==> same plane
Plane 0 called basic multilingual plane (BMP)
Has (practically) everything you could need
Convention: code points in BMP written U+nnnn (i.e., with leading 0’s if needed)
Other code points written without leading 0’s
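The plane arithmetic above can be sketched as follows. The names PLANE_SIZE, PLANE_COUNT, and planeOf are hypothetical, chosen only for this illustration.

```javascript
const PLANE_SIZE = 2 ** 16;  // code points per plane
const PLANE_COUNT = 17;      // planes 0 through 16

// Return the plane number of a code point (plane 0 is the BMP).
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 ||
      codePoint >= PLANE_COUNT * PLANE_SIZE) {
    throw new RangeError("not a Unicode code point");
  }
  return Math.floor(codePoint / PLANE_SIZE);
}

console.log(PLANE_COUNT * PLANE_SIZE - 1);                 // 1114111
console.log((PLANE_COUNT * PLANE_SIZE - 1).toString(16));  // 10ffff
console.log(planeOf(0x20ac));                              // 0
console.log(planeOf(0x1f916));                             // 1
console.log(planeOf(0x10ffff));                            // 16
```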
Basic Multilingual Plane
UTF-8
Encoding of code point (integer) in a sequence of bytes (octets)
Other encodings might not use 8 bits (more general term: code unit)
Standard: all caps, with hyphen (UTF-8)
Variable length
Some code points require 1 octet
Others require 2, 3, or 4
Consequence: Cannot infer number of characters from size of file!
No endian-ness: just a sequence of octets
D0 BF D1 80 D0 B8 D0 B2 D0 B5 D1 82…
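To see why file size does not give the character count, the sketch below decodes the first twelve octets of the example sequence using the standard TextDecoder and TextEncoder classes (assumed available, as in current browsers and Node.js).

```javascript
// The first twelve octets of the example sequence.
const octets = new Uint8Array([
  0xd0, 0xbf, 0xd1, 0x80, 0xd0, 0xb8,
  0xd0, 0xb2, 0xd0, 0xb5, 0xd1, 0x82,
]);

const text = new TextDecoder("utf-8").decode(octets);

console.log(text);              // привет
console.log(octets.length);     // 12 octets
console.log([...text].length);  // 6 code points

// The same number of ASCII characters needs only half as many octets.
console.log(new TextEncoder().encode("hello!").length);  // 6
```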
UTF-8: Code Points & Octets
glyphs → characters → code points → binary encoding
U+006D → 6D
U+0444 → D1 84
U+20AC → E2 82 AC
U+2019 → E2 80 99
U+5975 → E5 A5 B5
UTF-8 Encoding Recipe
1-byte encodings
First bit is 0
Example: 0110 1101 (encodes U+006D)
2-byte encodings
First byte starts with 110…
Second byte starts with 10…
Example: 1101 0000 1011 1111
Payload (bits after the 110 and 10 prefixes): 10000 111111
= 100 0011 1111 = 0x043F
Code point: U+043F
i.e. п, Cyrillic small letter pe
UTF-8 Encoding Recipe
Generalization: An encoding of length k (for k ≥ 2):
First byte starts with k 1’s, then 0
Example: 1110 0110 ==> first byte of a 3-byte encoding
Subsequent k-1 bytes each start with 10
Remaining bits are payload
Example: E2 82 AC
11100010 10000010 10101100
Payload: 0010 000010 101100 = 0x20AC (i.e., U+20AC, €)
Consequence: Stream is self-synchronizing
A dropped byte affects only that character
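The recipe can be written out by hand as a short function. This is a sketch for illustration rather than production code: encodeUtf8 and hex are hypothetical names, and the range boundaries (0x80, 0x800, 0x10000) follow from how many payload bits each length can carry (7, 11, 16, and 21).

```javascript
// Encode one code point as an array of UTF-8 octets, following the recipe:
// a prefix on the first byte, then 10xxxxxx continuation bytes.
function encodeUtf8(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff ||
      (cp >= 0xd800 && cp <= 0xdfff)) {
    throw new RangeError("not an encodable code point");
  }
  if (cp < 0x80) return [cp];                        // 0xxxxxxx
  if (cp < 0x800) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];   // 110xxxxx 10xxxxxx
  }
  if (cp < 0x10000) {
    return [0xe0 | (cp >> 12),                       // 1110xxxx
            0x80 | ((cp >> 6) & 0x3f),
            0x80 | (cp & 0x3f)];
  }
  return [0xf0 | (cp >> 18),                         // 11110xxx
          0x80 | ((cp >> 12) & 0x3f),
          0x80 | ((cp >> 6) & 0x3f),
          0x80 | (cp & 0x3f)];
}

// Format octets as two-digit upper-case hex, separated by spaces.
const hex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(hex(encodeUtf8(0x006d)));  // 6D
console.log(hex(encodeUtf8(0x043f)));  // D0 BF
console.log(hex(encodeUtf8(0x20ac)));  // E2 82 AC
```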
UTF-8 Encoding Summary
(from wikipedia)
Octets   First code point   Last code point   Bit pattern
1        U+0000             U+007F            0xxxxxxx
2        U+0080             U+07FF            110xxxxx 10xxxxxx
3        U+0800             U+FFFF            1110xxxx 10xxxxxx 10xxxxxx
4        U+10000            U+10FFFF          11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
For the following UTF-8 encoding, what is the corresponding code point?
F0 A4 AD A2
11110000 10100100 10101101 10100010
Payload: 000 100100 101101 100010 = 0 0010 0100 1011 0110 0010
Code point: U+24B62
For the following Unicode code point, what is its UTF-8 encoding?
U+20AC
0010 0000 1010 1100
11100010 10000010 10101100
Encoding: E2 82 AC
Security Issue
Not all octet sequences are encodings
“overlong” encodings are illegal
example: C0 AF
= 1100 0000 1010 1111
= U+002F (should be encoded 2F)
Classic security bug (IIS 2001)
Should reject URL requests with “../..”
Scanned for 2E 2E 2F 2E 2E (in encoding)
Accepted “..%c0%af..” (doesn’t contain x2F)
2E 2E C0 AF 2E 2E
After accepting, server then decoded
2E 2E C0 AF 2E 2E decoded into “../..”
Moral: Work in “code point” space!
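The overlong case can be reproduced with a deliberately naive 2-byte decoder, contrasted with a stricter one. Both functions are hypothetical sketches written for this example; they are not the code of any real server.

```javascript
// Naive: strip the 110 and 10 prefixes and join the payload bits,
// with no validity checks at all.
function decodeTwoBytesNaive(b1, b2) {
  return ((b1 & 0x1f) << 6) | (b2 & 0x3f);
}

// Strict: also reject a payload that would have fit in one byte.
function decodeTwoBytesStrict(b1, b2) {
  if ((b1 & 0xe0) !== 0xc0 || (b2 & 0xc0) !== 0x80) {
    throw new Error("not a 2-byte sequence");
  }
  const cp = decodeTwoBytesNaive(b1, b2);
  if (cp < 0x80) {
    throw new Error("overlong encoding");
  }
  return cp;
}

// C0 AF slips past a byte-level scan for 2F, yet naively decodes to "/".
console.log(String.fromCodePoint(decodeTwoBytesNaive(0xc0, 0xaf)));  // /

try {
  decodeTwoBytesStrict(0xc0, 0xaf);
} catch (e) {
  console.log(e.message);  // overlong encoding
}

console.log(decodeTwoBytesStrict(0xd0, 0xbf).toString(16));  // 43f
```

Validating the decoded code points, rather than scanning the raw octets, is what "work in code point space" amounts to here.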
Recall: URL encoding
Concrete invariant (convention)
No space, ;, :, & in representation
Recall: correspondence relation
To represent these characters, use %hh instead (hh is ASCII code in hex)
%20 for space
Q: What about % in abstract value?
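JavaScript's built-in encodeURIComponent and decodeURIComponent apply this %hh convention, so they offer one way to experiment with it. The sketch below shows their behavior on the characters listed above; it is an illustration, not a statement about how any particular server handles URLs.

```javascript
// Reserved characters are replaced by % followed by two hex digits.
console.log(encodeURIComponent(" "));        // %20
console.log(encodeURIComponent("a;b:c&d"));  // a%3Bb%3Ac%26d

// A literal percent sign must itself be escaped,
// otherwise "%25" would be ambiguous on decoding.
console.log(encodeURIComponent("100%"));     // 100%25
console.log(decodeURIComponent("100%25"));   // 100%
```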
Other (Older) Encodings
In the beginning…
Character sets were small
ASCII: only 128 characters (i.e., 2^7)
1 byte/character, leading bit always 0
Globalization means more characters…
But 1 byte/character seems fundamental
Solutions:
Use that leading bit!
Text data now looks just like binary data
Use more than 1 encoding!
Must specify data + encoding used
ASCII: 128 Codes
4B = Latin capital K
ISO-8859 family (eg -1 Latin)
0-7F match ASCII
80-9F (control characters)
A0-FF differ, eg:
-1 “Western”
-2 “East European”
-9 “Turkish”
Windows Family (eg 1252 Latin)
92 = apostrophe
HTML 5 Standard
Early Unicode and UTF-16
Unicode started as 2^16 code points
The BMP of modern Unicode
Bottom 256 code points match ISO-8859-1
Simple 1:1 encoding (UTF-16)
Code point <-> 2-byte code unit (16 bits, 1 word)
Simple, but leads to bloat of ASCII text
Later added code points outside of BMP
A pair of words (surrogate pairs) carry 20-bit payload (code point minus 0x10000) split, 10 bits in each word
First: 1101 10xx xxxx xxxx (xD800-DBFF)
Second: 1101 11yy yyyy yyyy (xDC00-DFFF)
Consequence: U+D800 to U+DFFF became reserved code points in Unicode
And now we are stuck with this legacy, even for UTF-8
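The surrogate-pair split can be sketched directly from the two bit patterns above. The function names toSurrogatePair and fromSurrogatePair are hypothetical, introduced only for this example.

```javascript
// Split a code point outside the BMP into two 16-bit code units.
function toSurrogatePair(cp) {
  if (!Number.isInteger(cp) || cp < 0x10000 || cp > 0x10ffff) {
    throw new RangeError("code point is not outside the BMP");
  }
  const payload = cp - 0x10000;               // 20 bits
  const first = 0xd800 | (payload >> 10);     // 1101 10xx xxxx xxxx
  const second = 0xdc00 | (payload & 0x3ff);  // 1101 11yy yyyy yyyy
  return [first, second];
}

// Recombine the two 10-bit halves and add the offset back.
function fromSurrogatePair(first, second) {
  return 0x10000 + ((first - 0xd800) << 10) + (second - 0xdc00);
}

const pair = toSurrogatePair(0x1f916);
console.log(pair.map((u) => u.toString(16)));        // [ 'd83e', 'dd16' ]
console.log(fromSurrogatePair(...pair).toString(16));  // 1f916
```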
JavaScript and UTF-16
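let x = "\u{1f916}";  // robot face
x.length;             // 2 (UTF-16 code units, not characters)
x.charCodeAt(0);      // 55358 (0xD83E, first surrogate)
x.charCodeAt(1);      // 56598 (0xDD16, second surrogate)
x.charAt(0);          // a lone surrogate, not a printable character
x.codePointAt(0);     // 129302 (0x1F916)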
Ruby and string encodings
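x = "\u{1f916}"                  # robot face
x.bytes.map { |b| b.to_s(16) }   # => ["f0", "9f", "a4", "96"]
x.encoding                       # => #<Encoding:UTF-8>
x.encode!(Encoding::UTF_16)      # re-encode the string in place
x.bytes.map { |b| b.to_s(16) }   # now the UTF-16 bytes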
UTF-16 and Endianness
A multi-byte representation must distinguish between big & little endian
Example: 00 25 00 25 00 25
“%%%” if BE, “───” if LE
One solution: Specify encoding in name
UTF-16BE or UTF-16LE
Another solution: require byte order mark (BOM) at the start of the file
U+FEFF (ZERO WIDTH NO-BREAK SPACE)
U+FFFE is not a character (it is a reserved noncharacter)
So FE FF -> BigE, while FF FE -> LittleE
Not considered part of the text
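The byte-order ambiguity can be demonstrated with a DataView, which reads 16-bit values in either order. This is a sketch that ignores surrogate validation; decodeUtf16 and byteOrderFromBom are hypothetical helper names.

```javascript
// Decode octets as 16-bit code units in the given byte order.
function decodeUtf16(bytes, littleEndian) {
  const view = new DataView(Uint8Array.from(bytes).buffer);
  let out = "";
  for (let i = 0; i + 1 < view.byteLength; i += 2) {
    out += String.fromCharCode(view.getUint16(i, littleEndian));
  }
  return out;
}

// Inspect the first two octets for a byte order mark.
function byteOrderFromBom(bytes) {
  if (bytes[0] === 0xfe && bytes[1] === 0xff) return "BE";
  if (bytes[0] === 0xff && bytes[1] === 0xfe) return "LE";
  return null;  // no BOM present
}

const sample = [0x00, 0x25, 0x00, 0x25, 0x00, 0x25];
console.log(decodeUtf16(sample, false));  // %%%  (big endian: U+0025)
console.log(decodeUtf16(sample, true));   // ───  (little endian: U+2500)
console.log(byteOrderFromBom([0xfe, 0xff, 0x00, 0x25]));  // BE
console.log(byteOrderFromBom([0xff, 0xfe, 0x25, 0x00]));  // LE
```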
BOM and UTF-8
Should we add a BOM to the start of UTF-8 files too?
UTF-8 encoding of U+FEFF is EF BB BF
Advantages:
Forms magic-number for UTF-8 encoding
Disadvantages:
Not backwards-compatible to ASCII
Existing programs may no longer work
E.g., In Unix, shebang (#!, i.e. 23 21) at start of file is significant: file is a script
#! /bin/bash
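The shebang problem is easy to see at the octet level. In the sketch below, UTF8_BOM and hasUtf8Bom are hypothetical names; the point is only that a leading BOM changes the first two octets a Unix kernel would inspect.

```javascript
const UTF8_BOM = [0xef, 0xbb, 0xbf];

// True if the octets begin with the UTF-8 encoding of U+FEFF.
function hasUtf8Bom(bytes) {
  return UTF8_BOM.every((b, i) => bytes[i] === b);
}

const script = new TextEncoder().encode("#! /bin/bash\n");
const withBom = Uint8Array.from([...UTF8_BOM, ...script]);

console.log(script[0].toString(16), script[1].toString(16));    // 23 21
console.log(withBom[0].toString(16), withBom[1].toString(16));  // ef bb
console.log(hasUtf8Bom(script));   // false
console.log(hasUtf8Bom(withBom));  // true
```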
ZWJ: Zero Width Joiner
Using U+FEFF as ZWNBSP deprecated
Reserved for BOM uses (at start of file)
Alternative: U+200D (“zwidge”)
Joined characters may be rendered as a single glyph
Co-opted for use with emojis
Example: (1 “character” in Twitter)
U+1F3F4 U+200D U+2620
WAVING BLACK FLAG, ZWJ, SKULL AND CROSSBONES
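The joined sequence above makes a useful test of what "length" means. The sketch below counts the same string three ways; how it is rendered (one glyph or several) depends on the font and platform and is not something the code can show.

```javascript
// WAVING BLACK FLAG, ZWJ, SKULL AND CROSSBONES
const flag = "\u{1F3F4}\u200D\u2620";

console.log(flag.length);                            // 4 UTF-16 code units
console.log([...flag].length);                       // 3 code points
console.log(new TextEncoder().encode(flag).length);  // 10 UTF-8 octets
```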
What is a “text” file? (vs “binary”)
Given a file, how can you tell which it is?
A JavaScript program reads in a 5MB file of English prose into a string. How much memory does the string need?
How many characters does s contain?
let s = . . . // JavaScript
assert(s.length === 7) // true
Which is better: UTF-8 or UTF-16?
What’s so scary about: ..%c0%af..
Summary
Text vs binary
In pre-historic times: most significant bit
Now: data is data
Unicode code points
Integers U+0000..U+10FFFF
BMP: Basic Multilingual Plane
Bottom 256 code points match ISO 8859-1, and hence ASCII too
UTF-8
A variable-length, self-synchronizing encoding of Unicode code points
Backwards compatible with ASCII (but not with ISO 8859-1: U+0080 to U+00FF take 2 octets)