Floating Point
Agenda
• Fixed point representations
• Big and small numbers
• Scientific notation
• IEEE 754 floating point standard
  – Special symbols
  – Underflow, overflow
• Floating point addition and multiplication
• Material from Section 3.5 of textbook
How to Represent Real Numbers?
Real Numbers
• Positional notation allows for fractions
$a_n a_{n-1} \dots a_1 a_0 \,.\, a_{-1} a_{-2} \dots a_{-m}$
• Let's start with fixed point representation
  – Choose n and m
  – Radix point is always in the same position
  – Easy to implement
  – Limited range
• $152.3_{10}$
• $1011.01_2$
Real Numbers
Binary to Decimal
• Integers scaled by an appropriate factor
• Direct expansion with positional weights
• $0.11001_2$
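The direct expansion with positional weights can be sketched in Java as follows. This is a minimal illustration, not part of the original slides; the class and method names are made up for the example. Each bit k places after the binary point contributes 2^-k.

```java
class BinaryFraction {
    // Expands a binary string such as "0.11001" with positional weights.
    static double toDecimal(String binary) {
        int point = binary.indexOf('.');
        String intPart = point < 0 ? binary : binary.substring(0, point);
        String fracPart = point < 0 ? "" : binary.substring(point + 1);

        double value = 0.0;
        for (char c : intPart.toCharArray()) {
            value = value * 2 + (c - '0');   // integer digits: weights ..., 2^1, 2^0
        }
        double weight = 0.5;                 // first fraction digit has weight 2^-1
        for (char c : fracPart.toCharArray()) {
            value += (c - '0') * weight;
            weight /= 2;                     // next digit: half the weight
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(toDecimal("0.11001"));  // 0.5 + 0.25 + 0.03125
        System.out.println(toDecimal("1011.01"));
    }
}
```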
Binary to Hexadecimal
• Use the same trick as before
$0.110101001_2$
$0.2BE_{16}$
Decimal to Binary
• Multiply by 2 and note the integer part
• Subtract the integer part and repeat until no fraction is left
$0.625_{10}$
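The multiply-by-2 procedure above could look like this in Java. It is a sketch under one assumption that is not on the slide: a `maxBits` cutoff (a hypothetical parameter) so that fractions with no finite binary expansion still terminate.

```java
class DecimalToBinary {
    // Converts a fraction in [0, 1) to binary digits, stopping after maxBits digits.
    static String fractionToBinary(double fraction, int maxBits) {
        StringBuilder bits = new StringBuilder("0.");
        int produced = 0;
        while (fraction != 0.0 && produced < maxBits) {
            fraction *= 2;               // multiply by 2
            if (fraction >= 1.0) {       // note the integer part
                bits.append('1');
                fraction -= 1.0;         // subtract it and repeat
            } else {
                bits.append('0');
            }
            produced++;
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(fractionToBinary(0.625, 32));
        System.out.println(fractionToBinary(0.1, 8));   // cut off after 8 digits
    }
}
```

For 0.625 the loop stops by itself after three digits; for 0.1 only the cutoff ends it.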
Decimal to Binary
• Can all decimal fractions be expressed exactly in binary?
$0.1_{10}$
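One way to explore this question in Java is to print the exact value that a `float` actually stores. The sketch below relies on `new BigDecimal(double)`, which converts the stored binary value to decimal without rounding; the class and method names are illustrative only.

```java
import java.math.BigDecimal;

class TenthIsInexact {
    // Returns the exact decimal expansion of the binary value held in f.
    static BigDecimal exactValue(float f) {
        return new BigDecimal(f);   // float widens to double without changing its value
    }

    public static void main(String[] args) {
        System.out.println(exactValue(0.625f));  // has a finite binary expansion
        System.out.println(exactValue(0.1f));    // stored value is only close to 0.1
    }
}
```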
How to Represent Small and Big Numbers in Decimal?
How big is Coronavirus?
Particle                       Size (meter)
PM10                           0.00001
Red Blood Cell                 0.000007
PM2.5                          0.0000025
Bacteria                       0.0000005
Coronavirus                    0.0000001
Particles filtered by masks    0.000000007
What numbers do we need?
Value                          Meaning
3.141592…                      π
2.71828…                       e
$1.0 \times 10^{-9}$           Seconds per nanosecond
$3.15576 \times 10^{9}$        Seconds per century
$1.47 \times 10^{13}$          US National Debt
$2.99792458 \times 10^{10}$    Speed of light in cm/s
$6.67300 \times 10^{-11}$      Gravitational constant
$1.98892 \times 10^{30}$       Mass of sun in kilograms
$2.08 \times 10^{22}$          Distance to Andromeda in m
$1.0 \times 10^{-15}$          Size of a proton in meters
Scientific Notation for Decimal
• We use scientific notation for big and small numbers
  – Use a single digit to the left of the decimal point
  – Multiplied by base (e.g., 10) raised to some exponent
  – Use e or E to denote the exponent part
$1.0 \times 10^{-15}$   1.0e-15   1.0E-15
• A normalized number has exactly one nonzero digit to the left of the decimal point (no leading zero)
  – $1.0_{10} \times 10^{-9}$: normalized
  – $0.1_{10} \times 10^{-8}$: not normalized
  – $10.0_{10} \times 10^{-10}$: not normalized
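Java accepts the same e/E notation in literals and when parsing text. The sketch below (illustrative names, not from the slides) checks that both spellings denote the same value and prints a number back in normalized scientific form.

```java
import java.util.Locale;

class ScientificLiterals {
    // True if "e" and "E" spellings and the literal all denote the same double.
    static boolean sameValue() {
        double a = Double.parseDouble("1.0e-15");
        double b = Double.parseDouble("1.0E-15");
        double c = 1.0e-15;   // literal in source code
        return a == b && b == c;
    }

    // Formats x with one digit before the point and five after it.
    static String normalized(double x) {
        return String.format(Locale.ROOT, "%.5e", x);
    }

    public static void main(String[] args) {
        System.out.println(sameValue());
        System.out.println(normalized(6.673e-11));
    }
}
```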
Scientific Notation for Binary
• How do we represent very small and big numbers in binary?
• Binary numbers can be written in scientific notation too
$1.0_2 \times 2^{-1}$   $1.1_2 \times 2^{11}$
How to Represent Floating Points?
Floating Point
• The binary point is not fixed, but instead can move based on the exponent
• A normalized binary number always has the form:
$1.xxxxxxx_2 \times 2^{yyyy}$
  – x is the fraction / significand / coefficient / mantissa
  – y is the exponent
  – There is always a one to the left of the binary point
Floating Point Standards
• Many options for representing floating point
  – Number of bits for the fraction
  – Number of bits for the exponent
  – How to represent zero?
  – How to represent negative numbers?
• Standards are important for exchanging data
Floating Point Standards
• IEEE 754 is used in nearly all computers today
  – Defines two representations
    • single precision (32 bits)
    • double precision (64 bits)
• In high level languages, data of this type is called
  – float (for single precision)
  – double (for double precision)
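In Java, the two types and their widths can be inspected directly. This is a small sketch with an illustrative class name; `Float.SIZE` and `Double.SIZE` are standard library constants.

```java
class FloatVsDouble {
    // The float and double roundings of 0.1 are different binary values.
    static boolean sameTenth() {
        float f = 0.1f;
        double d = 0.1;
        return (double) f == d;
    }

    public static void main(String[] args) {
        System.out.println(Float.SIZE);    // bits in a single precision value
        System.out.println(Double.SIZE);   // bits in a double precision value
        System.out.println(sameTenth());
    }
}
```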
Single Precision
Sign    Exponent    Fraction
1 bit   8 bits      23 bits
S       E           F
A real number can be described as $(-1)^S \times (1+F) \times 2^E$
• IEEE 754 does not use 2's complement
• Clarification:
  – Fraction refers to the 23-bit number F
  – Mantissa refers to the 24-bit number 1+F
• Numbers are in normalized form. Why? In base 2 the leading digit of a normalized number is always 1, so it does not need to be stored
Single Precision
Sign    Exponent    Fraction
1 bit   8 bits      23 bits
S       E           F
A real number can be described as $(-1)^S \times (1+F) \times 2^E$
Without normalization, one value has several encodings:
$0.0011_2 \times 2^{0}$     0 0000000 001100…
$0.011_2 \times 2^{-1}$     0 1111111 011000…
$0.11_2 \times 2^{-2}$      0 1111110 110000…
All equivalent to the same real number. The encoding is wasteful.
Biased Notation
Sign    Exponent    Fraction
1 bit   8 bits      23 bits
S       E           F
• In IEEE 754, the actual representation is
$(-1)^S \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})}$
• In single precision, bias = 127
• Represent negative exponents
• Want easy integer-style comparison/sorting
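The effect of the bias can be observed from Java with `Float.floatToIntBits`. The sketch below uses illustrative helper names and only applies to normalized numbers (exponent field 1 to 254). It also shows the integer-style comparison: for positive floats, the raw bit patterns order the same way as the values.

```java
class BiasedExponent {
    static final int BIAS = 127;   // single precision bias

    // The 8-bit exponent field exactly as stored.
    static int exponentField(float x) {
        return (Float.floatToIntBits(x) >>> 23) & 0xFF;
    }

    // The actual power of two: stored field minus the bias.
    static int actualExponent(float x) {
        return exponentField(x) - BIAS;
    }

    public static void main(String[] args) {
        float[] samples = {0.25f, 0.5f, 1.0f, 2.0f};
        for (float x : samples) {
            System.out.println(x + ": field=" + exponentField(x)
                    + " actual=" + actualExponent(x));
        }
        // Positive floats sort the same way as their bit patterns read as ints.
        int small = Float.floatToIntBits(0.5f);
        int large = Float.floatToIntBits(2.0f);
        System.out.println(small < large);
    }
}
```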
Single Precision Floating Point
Sign    Exponent    Fraction
1 bit   8 bits      23 bits
S       E           F
$(-1)^S \times (1+F) \times 2^E$
• Largest number?
• Smallest number?
• How many numbers can we represent?
Single Precision Floating Point
• Convert -0.75 from decimal to single precision
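A hand conversion can be checked against what Java stores. By hand, $-0.75_{10} = -0.11_2 = -1.1_2 \times 2^{-1}$, so S = 1, the exponent field is $-1 + 127 = 126$, and the fraction is 100…0. The sketch below (illustrative names) prints the three fields of any float.

```java
class EncodeFloat {
    // Returns the bit pattern as "S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF".
    static String fields(float x) {
        int bits = Float.floatToIntBits(x);
        String s = String.format("%32s", Integer.toBinaryString(bits)).replace(' ', '0');
        return s.substring(0, 1) + " " + s.substring(1, 9) + " " + s.substring(9);
    }

    public static void main(String[] args) {
        System.out.println(fields(-0.75f));
    }
}
```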
Single Precision Floating Point
• Convert 1000000101000…0000 from single precision to decimal:
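Decoding follows the formula $(-1)^S \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - 127)}$. The sketch below applies it to a hypothetical bit pattern chosen for illustration (it is not the pattern from the slide) and compares the result with Java's own `Float.intBitsToFloat`. It handles normalized numbers only.

```java
class DecodeFloat {
    // Decodes a normalized single precision bit pattern by the formula.
    static double decode(int bits) {
        int sign = bits >>> 31;                  // 1 bit
        int exponent = (bits >>> 23) & 0xFF;     // 8 bits, biased by 127
        int fraction = bits & 0x7FFFFF;          // 23 bits
        double mantissa = 1.0 + fraction / (double) (1 << 23);   // 1 + F
        double magnitude = mantissa * Math.pow(2, exponent - 127);
        return sign == 0 ? magnitude : -magnitude;
    }

    public static void main(String[] args) {
        int example = 0x41200000;   // hypothetical: 0 10000010 0100...0
        System.out.println(decode(example));
        System.out.println(Float.intBitsToFloat(example));   // library check
    }
}
```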
Double Precision
Sign    Exponent    Fraction
1 bit   11 bits     52 bits
S       E           F
• More bits!
• More precision
• Double precision uses a bias of 1023
• Can do more before underflow / overflow
  – Approximately 1E-308 to 1E308
Double Precision
• Convert 3.25 from decimal to double precision
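As with single precision, the hand result can be compared with Java's stored bits. By hand, $3.25_{10} = 11.01_2 = 1.101_2 \times 2^{1}$, so the exponent field is $1 + 1023 = 1024$ and the fraction starts with 101. The helper below is an illustrative sketch built on `Double.doubleToLongBits`.

```java
class EncodeDouble {
    // Returns the bit pattern as "S EEEEEEEEEEE F...F" (1, 11 and 52 bits).
    static String fields(double x) {
        long bits = Double.doubleToLongBits(x);
        String s = String.format("%64s", Long.toBinaryString(bits)).replace(' ', '0');
        return s.substring(0, 1) + " " + s.substring(1, 12) + " " + s.substring(12);
    }

    public static void main(String[] args) {
        System.out.println(fields(3.25));
        System.out.println(Long.toHexString(Double.doubleToLongBits(3.25)));
    }
}
```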
Tricky Questions
• What is the largest number that can be represented in single precision?
• What is the smallest number that can be represented in single precision?
Floating Point Arithmetic
Floating Point Addition
• Align the radix points
  – Make the smaller number match the larger
• Add the significands
• Normalize the result
  – What if one number is positive and the other negative?
  – May need to shift a lot!
  – Check for overflow or underflow when shifting!
• Round so the number fits in the available digits/bits
  – If bad luck when rounding, renormalize
Floating Point Addition
9.999e1 + 1.610e-1 with 4 digits of precision
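Worked by hand: aligning gives 9.999e1 + 0.016e1, the sum is 10.015e1, normalizing gives 1.0015e2, and rounding to 4 digits gives 1.002e2. The Java sketch below checks the final value with `BigDecimal`. Note that this swaps in an exact decimal sum followed by one rounding to 4 significant digits, rather than modelling each hardware step; the names are illustrative.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

class DecimalFloatAdd {
    // Four significant decimal digits, round half to even.
    static final MathContext FOUR_DIGITS = new MathContext(4, RoundingMode.HALF_EVEN);

    static BigDecimal add(String a, String b) {
        BigDecimal exact = new BigDecimal(a).add(new BigDecimal(b));  // exact sum
        return exact.round(FOUR_DIGITS);                              // keep 4 digits
    }

    public static void main(String[] args) {
        System.out.println(add("9.999e1", "1.610e-1"));
    }
}
```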
Floating Point Multiplication
• Add the exponents
• Multiply the significands
• Normalize the result (check for overflow)
• Round to fit in available digits/bits
  – Normalize again if necessary
• Compute sign of result
  – Positive if signs of operands match, negative otherwise
Floating Point Multiplication
1.110e10 × 9.200e-5 with 4 digits of precision
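Worked by hand: the exponents add to 10 + (-5) = 5, the significands multiply to 1.110 × 9.200 = 10.212, normalizing gives 1.0212e6, and rounding to 4 digits gives 1.021e6. The sketch below confirms the final value using `BigDecimal`, again as an exact product followed by a single rounding rather than a step-by-step hardware model; names are illustrative.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

class DecimalFloatMultiply {
    // Four significant decimal digits, round half to even.
    static final MathContext FOUR_DIGITS = new MathContext(4, RoundingMode.HALF_EVEN);

    static BigDecimal multiply(String a, String b) {
        BigDecimal exact = new BigDecimal(a).multiply(new BigDecimal(b));  // exact product
        return exact.round(FOUR_DIGITS);                                   // keep 4 digits
    }

    public static void main(String[] args) {
        System.out.println(multiply("1.110e10", "9.200e-5"));
        // Signs differ, so the result is negative.
        System.out.println(multiply("-1.110e10", "9.200e-5"));
    }
}
```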
Special Cases?
Special symbols
Exponent   Fraction   Object represented
0          0          0
0          Nonzero    ± denormalized number
1-254      Anything   ± floating point number
255        0          ± infinity
255        Nonzero    NaN (Not a Number)
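The table translates directly into a classifier over the exponent and fraction fields. The Java sketch below is illustrative (the class name, method name and returned labels are made up for the example) and ignores the sign bit, since each category covers both signs.

```java
class SpecialPatterns {
    // Classifies a single precision bit pattern by its exponent and fraction fields.
    static String classify(int bits) {
        int exponent = (bits >>> 23) & 0xFF;
        int fraction = bits & 0x7FFFFF;
        if (exponent == 0) {
            return fraction == 0 ? "zero" : "denormalized";
        }
        if (exponent == 255) {
            return fraction == 0 ? "infinity" : "NaN";
        }
        return "normalized";   // exponent field 1..254
    }

    public static void main(String[] args) {
        int[] examples = {0x00000000, 0x00000001, 0x3F800000, 0x7F800000, 0x7FC00000};
        for (int bits : examples) {
            System.out.println(String.format("%08x", bits) + " " + classify(bits)
                    + " " + Float.intBitsToFloat(bits));
        }
    }
}
```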
Denormalized Numbers
• The exponent 00000000 is used to represent a set of numbers in the tiny interval $(-2^{-126}, 2^{-126})$
• This includes the number 0
• Called denormalized numbers
  – Smallest normalized is $1.0 \times 2^{-126} = 2^{-126}$
  – Smallest denormalized is $0.000\dots01 \times 2^{-126} = 2^{-149}$
• Allows us to squeeze more precision out of a floating point operation
• Tricky to implement. We will come back to this topic later
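Java exposes both limits as constants, so the two powers of two can be checked directly. The sketch below uses `Math.scalb` to build $2^{-126}$ and $2^{-149}$; the class and method names are illustrative.

```java
class Denormals {
    static float smallestNormalized() {
        return Math.scalb(1.0f, -126);     // 1.0 x 2^-126
    }

    static float smallestDenormalized() {
        return Math.scalb(1.0f, -149);     // 2^-23 x 2^-126
    }

    public static void main(String[] args) {
        System.out.println(smallestNormalized() == Float.MIN_NORMAL);
        System.out.println(smallestDenormalized() == Float.MIN_VALUE);
        // Halving the smallest normalized number still gives a nonzero (denormalized) value.
        System.out.println(Float.MIN_NORMAL / 2 > 0.0f);
        // Halving the smallest denormalized number leaves nothing to represent.
        System.out.println(Float.MIN_VALUE / 2 == 0.0f);
    }
}
```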
Unusual events
• Nonzero divided by zero
  – Not the end of the world!
  – Results in positive or negative infinity
• 0/0 (invalid), or subtracting infinity from infinity
  – Results in NaN
• Notes on NaN
  – Using NaN in math always results in NaN
  – Allows us to avoid tests or decisions until a later time in our program
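These events can be reproduced in Java, where floating point division by zero does not throw. The sketch below uses an illustrative `divide` helper; note that NaN compares unequal even to itself, so the deferred test uses `Float.isNaN`.

```java
class UnusualEvents {
    static float divide(float a, float b) {
        return a / b;   // no exception for float division by zero
    }

    public static void main(String[] args) {
        float posInf = divide(1.0f, 0.0f);
        float negInf = divide(-1.0f, 0.0f);
        float invalid = divide(0.0f, 0.0f);
        System.out.println(posInf);
        System.out.println(negInf);
        System.out.println(invalid);
        System.out.println(posInf - posInf);    // infinity minus infinity
        float later = invalid * 2.0f + 1.0f;    // NaN propagates through arithmetic
        System.out.println(Float.isNaN(later)); // test once, at the end
        System.out.println(invalid == invalid); // NaN is not equal to itself
    }
}
```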
What can go wrong?
Overflow / Underflow
• Largest number that can be represented in single precision:
Approximately $\pm 2.0 \times 2^{127} \approx \pm 3.4 \times 10^{38}$
• Smallest normalized fraction that can be represented in single precision:
Approximately $\pm 1.0 \times 2^{-126} \approx \pm 1.2 \times 10^{-38}$
• Overflow: the magnitude of a result is larger than the largest number above
• Underflow: the magnitude of a nonzero result is smaller than the smallest number above
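Both events are easy to trigger in Java. The sketch below uses the standard constants `Float.MAX_VALUE` and `Float.MIN_NORMAL`; the class and method names are illustrative.

```java
class OverflowUnderflow {
    // Doubling the largest finite float exceeds the representable range.
    static float overflow() {
        return Float.MAX_VALUE * 2.0f;
    }

    // The product of two tiny numbers is too small for even a denormalized float.
    static float underflow() {
        return Float.MIN_NORMAL * Float.MIN_NORMAL;
    }

    public static void main(String[] args) {
        System.out.println(Float.MAX_VALUE);    // largest finite float
        System.out.println(Float.MIN_NORMAL);   // smallest positive normalized float
        System.out.println(overflow());
        System.out.println(underflow());
    }
}
```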
Loss of Precision
https://imgur.com/r/totallynotrobots/lsNcv
Compare these for loops
for ( int i = 0; i <= 10; i += 1 ) {
    System.out.println( i/10f );
}
for ( float y = 0; y <= 1; y += 0.1f ) {
    System.out.println( y );
}
Same or different?
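One way to investigate is to count how many times each loop body runs. The sketch below wraps the two loops in illustrative helper methods and also accumulates 0.1f ten times to show where the float loop ends up.

```java
class LoopComparison {
    // Integer counter: the step is exact, the division happens once per value.
    static int countIntLoop() {
        int iterations = 0;
        for (int i = 0; i <= 10; i += 1) {
            iterations++;
        }
        return iterations;
    }

    // Float counter: every addition of 0.1f may round, and the errors accumulate.
    static int countFloatLoop() {
        int iterations = 0;
        for (float y = 0; y <= 1; y += 0.1f) {
            iterations++;
        }
        return iterations;
    }

    // The value of y after ten additions of 0.1f.
    static float tenTenths() {
        float y = 0;
        for (int k = 0; k < 10; k++) {
            y += 0.1f;
        }
        return y;
    }

    public static void main(String[] args) {
        System.out.println(countIntLoop());
        System.out.println(countFloatLoop());
        System.out.println(tenTenths());
    }
}
```

In this sketch the float loop runs one time fewer, because the accumulated value after ten additions is slightly greater than 1 and fails the `y <= 1` test.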
Questions
• Represent $0.1_{10}$ in IEEE 754 single precision floating point
• Represent $1.1_{10}$ in IEEE 754 single precision floating point
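After working these by hand, the result can be compared with the bits Java stores. Neither value has a finite binary expansion, so the last fraction bit is rounded. The sketch below (illustrative names) prints the pattern in hexadecimal and reports whether the stored value landed above the decimal one.

```java
import java.math.BigDecimal;

class QuestionBits {
    // The 32-bit pattern as eight hexadecimal digits.
    static String hexBits(float x) {
        return String.format("%08x", Float.floatToIntBits(x));
    }

    // True if the stored binary value is larger than the given decimal string.
    static boolean roundedUp(float x, String decimal) {
        return new BigDecimal(x).compareTo(new BigDecimal(decimal)) > 0;
    }

    public static void main(String[] args) {
        System.out.println(hexBits(0.1f) + " roundedUp=" + roundedUp(0.1f, "0.1"));
        System.out.println(hexBits(1.1f) + " roundedUp=" + roundedUp(1.1f, "1.1"));
    }
}
```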
Review and more information
• Big and small numbers
• Scientific notation
• IEEE 754 floating point standard
• Floating point addition and multiplication
• Material from Section 3.5 of textbook