8/21/2020
CSC 252:
Numbers & Representations
Dr. Jonathan Misurda
jmisurda@arizona.edu
Numbers
You encounter a form (taxes, graduation, etc.) and you see a field labeled year like so:
What is the smallest year you could put in the box?
What is the largest year you could put in the box?
Why?
Well we simply can pick all of the smallest digits and all of the largest digits:
10 digits 0123456789
0
0
0
0
9
9
9
9
12
Range
How many total values can we put in the box?
Range = High – Low +1 Range = 9999 – 0000 + 1 Range = 10,000
Let’s write that a different way:
104
We see a 10 (the number of digits) and a 4 (the number of boxes). Is this a
coincidence?
No. It’s a property of how we write numbers.
Positional Number Systems
This is what we learned in grade school.
There is a ones’ place and a tens’ place and a hundreds’ place … etc.
1×103 +5×102 +2×101 +3×100
We call this a positional number system because the position of each digit tells
us the magnitude of the value.
Each position is a higher exponent on a base. In daily life, we typically use base 10, also known as decimal.
1
5
2
3
34
Why Decimal (Base 10)?
Digit
Roman Numerals
There are, in fact, non-positional number systems.
Decimal
Decimal
I
1
C
100
V
5
D
500
X
10
M
1000
L
50
Why have they fallen out of use?
MDV + XLV MDL
(Multiplication is even worse)
versus
1505 + 45 1550
56
1
8/21/2020
Do Other Bases Make Sense?
Can we still have a positional number system with a base other than 10?
Yes. Any number can be a base, but for our purposes some are more useful than others.
Base 2 – Binary Base 8 – Octal Base 16 – Hexadecimal
Base 2: Binary
When we have a base N, the allowable digits are [0,N-1].
So for base 2, we only use 0 and 1.
A binary digit is known as a bit (a contraction of binary digit).
Decimal
Binary
Decimal
Binary
0
0
8
1000
1
1
9
1001
2
10
10
1010
3
11
11
1011
4
100
12
1100
5
101
13
1101
6
110
14
1110
7
111
15
1111
78
Bits, Nibbles, and Bytes
Each 0 or 1 in a binary string is a bit. It is designated with a lowercase b.
Binary strings of length 4 are called a nibble (or nybble).
Binary strings of length 8 are called a byte. It is designated with an uppercase B.
Bytes can be aggregated into groups called words. Word size can vary depending on the computer architecture. Often a word will be 16 bits or 32 bits (2 or 4 bytes).
Oftentimes, the byte is the element that can hold one character of text in English. The byte is usually the smallest addressable memory element on a machine.
The size of a byte being 8 bits was not common until the 1970s and the term octet was sometimes used to avoid confusion.
Binary to Decimal Conversion
Take each position that has a 1 in it and add up the corresponding powers of two: 512 + 64 + 8 + 4 + 1 = 589
Quick Sanity Check: If the ones’ place (20) is 1, then the number must be odd!
1
0
0
1
0
0
1
1
0
1
512
256
128
64
32
16
8
4
2
1
29
28
27
26
25
24
23
22
21
20
9 10
Powers of Two
Knowing the right table from memory is very useful for your CS/COE future.
These are numbers you will commonly run into and are useful, but we have a trick for estimating them if we forget.
Each power of 2 that is a multiple of 10 (210, 220, etc.) represents about a thousand, a million, a billion, etc.
Use the laws of exponents to help: How big is 232 bytes?
232 =22+30 =22 ×230 So about 4 billion bytes (or 4 GiB)
20
1
21
2
22
4
23
8
24
16
25
32
26
64
27
128
28
256
29
512
210
1024
211
2048
212
4096
213
8192
214
16,384
215
32,768
216
65,536
210
Ki
220
Mi
230
Gi
240
Ti
The Kilo vs Kibi Fiasco
Technically, the SI prefixes (kilo, mega, giga, etc.) represent increasing powers of ten.
Thus you get ambiguity with a quantity such as 1 kilobye (1 KB). It could mean: 1024 (1 × 210) or 1000 (1 × 103)
The kibi (ki), mebi (mi), gigi (gi), etc., prefixes were introduced to avoid the ambiguity.
1 KiB = 1024 (1 × 210) bytes
No one uses them consistently. Context clues sometimes help. Hard drives are sold usually using powers of 10. Operating Systems usually report that space in terms of powers of two.
11 12
2
8/21/2020
Decimal to Binary Conversion
For a decimal input called value:
1. Find the biggest power of 2 smaller than value
2. if value/(that power) == 1
a) Output a “1”
b) Subtract that power from value and store back in value
3. Else
a) Output a “0”
4. Move to the next smaller power of 2
5. Go to 2 while we haven’t done the one’s place
Base 8: Octal
Bit strings can be very long and so sometimes we wish to represent numbers in a more compact representation, while still easily converting in and out of binary.
Octal is base 8.
The valid digits are then [0,7]
Every 3 bits can be represented with one octal digit.
Programming languages usually denote octal literals with a leading 0 prefix.
13 14
Base 16: Hexadecimal
Much more common than octal is base 16, called hexadecimal or just hex for short. (Even though this ends up being a bit of misnomer, hex meaning 6).
Every sequence of 4 bits is represented with a single hexadecimal digit. Thus, 32-bit numbers are compactly displayed in 8 hex digits.
Each digit ranges from [0, 15??]
We cannot use two digits for one as that will destroy the positional number system.
We need new “digits” for 10, 11, 12, 13, 14, and 15. Solution? Use letters.
Counting in the Bases
Decimal
Binary
Octal
Hex
Decimal
Binary
Octal
Hex
0
0000
0
0
8
1000
10
8
1
0001
1
1
9
1001
11
9
2
0010
2
2
10
1010
12
A
3
0011
3
3
11
1011
13
B
4
0100
4
4
12
1100
14
C
5
0101
5
5
13
1101
15
D
6
0110
6
6
14
1110
16
E
7
0111
7
7
15
1111
17
F
15 16
Recognizing the Right Base
Binary:
Octal: Decimal: Hexadecimal:
10b or 102
010 or 108
1010
0x10 or 0x00000010 or 1016 or 10H
Base 1: Unary
Often used for quick counting (hash marks):
Number of “digits” is equal to the value of the number.
This leads to long encodings for big numbers and is inconvenient to work with.
The linear relationship between representation and value is replaced by a logarithmic one when the base is greater than 1.
17 18
3
8/21/2020
Representations
How do we represent data in a computer?
Integers
For the positive numbers, assign binary strings to them in a logical way. 0 → 0, 1 → 1, 10 → 2, 11 → 3, etc.
What about negative numbers?
Positive/negative is a binary relationship. Thus, we could use a single bit to represent the sign.
20 21
Sign-Magnitude
Let us say we have an 8-bit (1-byte) storage type:
We could devote 7 bits to the magnitude (absolute value) and one bit to sign:
MSB LSB
We put the sign in the most significant bit, which is the one that represents the highest power of 2.
With 7 bits left for the magnitude, we can represent [0,127]. Thus with the sign bit, we get a full range of [-127, 127].
Sign
Pros:
•Easy
•Range is symmetric
Cons:
•Negative zero •(-X) + X ≠ 0
Ones’ Complement
Try to solve the (-X)+X ≠ 0 problem.
Choose encoding of 1 and -1 so that their sum is zero (positive or negative).
1 → 0001-1 → 1110 2 → 0010 -2 → 1101
Simply invert the bits for the negative number. The MSB is still the sign bit.
Pros:
• Equal number of signed and unsigned numbers • (-X)+(X)=0
• Encoding is simple, just a bitwise inversion (NOT)
Con:
• X + 0 = X only works for positive zero
22 23
Two’s Complement
Force there to be only one zero.
To encode -value:
1. Write value in binary
2. Invert each of the bits
3. Add1
A quick way to remember this is that -1 is the binary string that contains all 1s:
1
1
1
1
1
1
1
1
Pros:
MSB
• (-X)+(X)=0
• X+0=X
• Simple encoding
• MSB is sign bit
Cons:
•
LSB
Unequal number of positive and negative numbers
• 0 counts as a “positive”
Summary
Code
Sign-Magnitude
1’s Complement
2’s Complement
000
+0
+0
+0
001
+1
+1
+1
010
+2
+2
+2
011
+3
+3
+3
100
-0
-3
-4
101
-1
-2
-3
110
-2
-1
-2
111
-3
-0
-1
24 25
4
8/21/2020
Floating Point
We sometimes want to represent numbers like π=3.141592653589… Real numbers can have infinite representations, like pi or 1/3 in base 10.
100 . 10-1 10-2 10-3 10-4 10-5 … Positional number system still makes sense with a decimal point.
We can also do this in other bases. In base 2, we’d have a binary point:
21 20 . 2-1 2-2 2-3 2-4 …
Some numbers that have finite representations in base 10, do not in base 2 (and vice versa), such as 0.110
3
.
1
4
1
5
9
…
1
1
.
0
0
1
0
…
Storing the Infinite
How do we store the infinite?
As a bit string, we cannot. But if we want a numerical value stored (Rather than a symbolic value like π or a generating function), we have to represent it somehow.
We can view a floating point number as two parts:
• the digits that make it up and
• where the decimal/binary point is.
Each of these things can be viewed as an integer, and so we already know how to represent them.
26 27
Mantissa and Exponent
1.75
=
175 × 10-2
The value is referred to as the mantissa, significand , or coefficient. The power is referred to as the exponent.
175
-2
IEEE 754
All binary strings as we have defined them start with a leading 1, and so the leading 1 is omitted to get an extra bit of mantissa storage.
Certain encodings are reserved:
• store -∞ and +∞
• Quiet and signaling NaN (not a number, e.g., 0/0) • Denormals
32-bit floats have about 7 digits of decimal precision 64-bit floats have about 15 digits of decimal precision
28 29
Offset (Bias) Encoding
String of all zeros is lowest negative number String of all ones is highest positive number
8-bit example, with K=128:
0000 0000 = -128 1000 0000 = 0 1111 1111 = 127
Subtract K from encoding to get value.
Can be easily converted into 2’s complement by inverting the MSB.
Representing Text
Non-numerical data must be encoded using numbers. So, number the various letters.
Not quite that easy though, as there are upper and lower case letters, numbers, punctuation, and symbols.
Also control codes to represent things like whitespace.
Two basic standards for text: • ASCII
• Unicode
30 31
5
8/21/2020
ASCII
American Standard Code for Information Interchange
A 7-bit code that is product of 1960s era computing: •American-centric
•Teletypes
ASCII (2)
Has some nice properties despite having only a Latin-alphabet focus:
• Capital letters numerically sort before lowercase letters.
• Case conversion is a single bit flip (bit 6 is 0 for uppercase, 1 for lowercase)
• Conversion of a digit character into it’s numeric value is a subtract of 0x30:
‘0’ – 0x30 (48), ‘1’ – 0x31 (49)
Control characters represent nonprintable glyphs (space, tab, end of line) as well as control operations of teletypes and communication:
• BEL – Rings a bell
• CR – Carriage return
• LF – Line feed
• FF – Form feed
32 33
Unicode
What about other languages?
Could use code pages, typically by extending 7-bit ASCII to 8-bits.
Unfortunately these are limited, sometimes conflicting, and often poorly documented.
In 1988, an effort to unify all the various code pages into one standard encoding began, called Unicode.
Unicode 8.0 came out in June 2015.
• 120,737 glyphs
• Alphabets, typographical symbols, mathematical operators
• Emoji
• Both current and historical languages
• Constructed to be a superset of ASCII
Unicode (2)
120,737 >> 28
All of these glyphs do not fit into a single byte. Nor even two-bytes anymore. Unicode has reserved 1,112,064 total code points for its eventual use.
UTF-32 is a fixed-width encoding that can handle the full range. UTF-8 and UTF-16 are variable-width encodings.
UTF-8:
If the first byte starts with a leading 1 (meaning it’s outside of ASCII), the encoding spans multiple bytes with the first indicating in its high-order bits how many further.
Java’s char is 16-bit and uses UTF-16 natively.
34 35
Strings
Need to assemble characters together into strings.
Strings are variable-length data. Thus, we need to encode not only the text, but the length.
Two ways to represent length of variable-length data: 1. An explicit length variable.
2. A sentinel value.
Pascal Strings
Use an explicit length variable to store the length. Update it on each operation. Pascal put this in the first element of the data structure (array):
5
H
e
l
l
o
36 37
6
8/21/2020
C Strings
Use a sentinel value to demarcate the end of the string.
Problem with sentinel values: Choice of which one to use. The choice means that you cannot use that character in your input (easily).
C chooses to use an ASCII control character, NUL, represented by the bit string: 0000 0000
H
e
l
l
o
\0
38
7