Introduction to Computer Systems 15-213/18-243, spring 2009
CSE 2421
Linking and Relocation
Required Reading: Computer Systems: A Programmer’s Perspective, 3rd Edition
Chapter 7 through 7.6.3 (inclusive)
Reminder – C Compilation Workflow
Option 1: Complete all stages of compilation
% gcc -o hello hello.c
Option 2: Complete 1st three phases first:
– Preprocessor: .c to .i
– Compiler: .i to .s
– Assembler phase: .s to .o
.o is a “relocatable” object file
% gcc -c hello.c
Produces a .o file with unresolved references to symbols
Then, complete the linker phase afterwards:
% gcc hello.o -o hello
Produces an executable by resolving references to any symbols
What are linking and relocation?
Linking is the process of collecting and combining various pieces of code and data into a single file that can be loaded (copied) into memory and executed (that is, an executable).
Relocation is the process of adjusting addresses in object modules when the modules are linked with other modules to create an executable.
Why should I care?
Will help you build large programs
Will help with missing modules/linker error resolution
Will help you avoid “dangerous programming errors”
Should you choose to use global variables
Will help you understand language scoping
Will help you understand important system concepts you will see next semester
Virtual memory/paging/memory mapping (Systems II)
Will help you exploit shared libraries
Linking can be done:
At compile time
gcc command
At load time
When an executable loads into main memory
At run time
While an executable is running from main memory
How do we (actually, the operating system) decide?
What makes the most sense with respect to what is being linked and how it’s being used?
Is the code a library function? Individual program?
Related OS concepts
When a process is running, it enhances security if the address space of the process is divided into parts whose access permissions are set and enforced by the OS:
Read only space:
Read only data (e.g. format strings used with printf or scanf in C) and
Code (i.e., instructions)
Read-write space: data which can be both read and written.
Therefore, when the linker (ld, invoked by the gcc driver) does linking and relocation, it divides the address space of the executable into these parts.
Example C Program
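main.c:
int sum(int *a, int n);
int array[2] = {1, 2};
int main()
{
    int val = sum(array, 2);
    return val;
}
sum.c:
extern int array[];
int sum(int *a, int n)
{
    int i, s = 0;
    for (i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}
Global: array and main are defined in main.c, and sum is defined in sum.c
External: sum is external to main.c, and array is external to sum.c
Linker knows nothing of the local variables val, i, and s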
Static Linking
Programs are translated and linked using a compiler driver:
linux> gcc -Og -o prog main.c sum.c
linux> ./prog
[Figure: static linking. The source files main.c and sum.c each pass through the translators (cpp, cc1, as) to produce the separately compiled relocatable object files main.o and sum.o. The linker (ld) combines them into prog, a fully linked executable object file that contains code and data for all functions defined in main.c and sum.c.]
Why Linkers?
Reason 1: Modularity
Program can be written as a collection of smaller source files, rather than one monolithic mass.
Can build libraries of common functions (more on this later)
e.g., Math library, standard C library
Why Linkers? (cont)
Reason 2: Efficiency
Time: Separate compilation
Change one source file, compile, and then relink.
No need to recompile other source files.
Consider the function of makefiles…
Space: Libraries
Common functions can be aggregated into a single file…
Yet executable files and running memory images contain only code for the functions they actually use.
What Do Linkers Do?
Step 1: Symbol resolution
Programs define and reference symbols (global variables and functions):
void swap() {…} /* define symbol swap */
swap(); /* reference symbol swap */
int *xp = &x; /* define symbol xp, reference x */
Symbol definitions are stored in object file (by assembler) in a symbol table.
Symbol table is an array of structs
Each entry includes name, size, and location of symbol among other things.
During symbol resolution step, the linker associates each symbol reference with exactly one symbol definition.
What Do Linkers Do? (cont)
Step 2: Relocation
Merges separate code and data sections into single sections
Relocates symbols from their relative locations in the .o files to their final absolute memory locations in the executable.
Updates all references to these symbols to reflect their new positions.
Let’s look at these two steps in more detail….
Three Kinds of Object Files (Modules)
Relocatable object file (.o file)
Contains code and data in a form that can be combined with other relocatable object files to form executable object file.
Each .o file is produced from exactly one source (.c) file
Executable object file (a.out file)
Contains code and data in a form that can be copied directly into memory and then executed.
Shared object file (.so file)
Special type of relocatable object file that can be loaded into memory and linked dynamically, at either load time or run-time.
Called Dynamic Link Libraries (DLLs) by Windows
Executable and Linkable Format (ELF)
Standard binary format for object files
One unified format for
Relocatable object files (.o),
Executable object files (a.out)
Shared object files (.so)
Generic name: ELF binaries
Object File Format/Organization –
ELF Object File Format (used in Unix/Linux)
The object file formats provide parallel views of a file’s contents, reflecting the differing needs of the linker and the loader
ELF header (Executable and Linkable Format)
-Resides at the beginning and holds a “road map” describing the file’s organization.
Program (or Segment) header table
-Tells the system how to create a process image
-Object files used to build a process image (that is, executables, which the loader reads) must have a program header table; relocatable files do not need one.
-Object files used for linking must have a section header table (because it holds location and size information for each section); executable object files do not need one.
http://www.sco.com/developers/gabi/2000-07-17/ch4.intro.html
http://docs.oracle.com/cd/E19455-01/806-3773/6jct9o0bs/index.html
http://docs.oracle.com/cd/E19082-01/819-0690/chapter6-46512/index.html
Object File Format/Organization (cont)
Section header table
-Contains information describing the file’s sections
-Every section has an entry in the table
-Each entry gives information such as the section name, the section size (needed to compute address information), and so on.
Sections
-Hold the bulk of object file information for the linking view: instructions, data, symbol table, relocation information, etc.
-Object files used during linking must have a section header table; other object files may or may not have one.
ELF Object File Format
Elf header
Word size, byte ordering, file type (.o, exec, .so), machine type, etc.
Segment header table
Page size, virtual addresses of memory segments (sections), segment sizes.
.text section
Code
.rodata section
Read only data: jump tables, …
.data section
Initialized global variables
.bss section
Uninitialized global variables
“Block Started by Symbol”
Better Save Space
Has a section header but occupies no space in the object file
[Figure: layout of an ELF object file, starting at offset 0: ELF header; segment header table (required for executables); .text section; .rodata section; .data section; .bss section; .symtab section; .rel.text section; .rel.data section; .debug section; section header table.]
ELF Object File Format (cont.)
.symtab section
Symbol table
Procedure and static variable names
Section names and locations
.rel.text section
Relocation info for .text section
Addresses of instructions that will need to be modified in the executable during relocation step
.rel.data section
Relocation info for .data section
Addresses of pointer data that will need to be modified in the merged executable
.debug section
Info for symbolic debugging (gcc -g)
Section header table
Offsets and sizes of each section
Linker Symbols
Global symbols
Symbols defined by module m that can be referenced by other modules.
E.g.: non-static C functions and non-static global variables. (external linkage)
External symbols
Global symbols that are referenced by module m but defined by some other module.
Local symbols
Symbols that are defined and referenced exclusively by module m.
E.g.: C functions and global variables defined with the static attribute. (internal linkage)
Local linker symbols are not local program variables
Step 1: Symbol Resolution
main.c:
int sum(int *a, int n);
int array[2] = {1, 2};
int main()
{
    int val = sum(array, 2);
    return val;
}
sum.c:
int sum(int *a, int n)
{
    int i, s = 0;
    for (i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}
Defining a global: main.c defines main and array; sum.c defines sum
Referencing a global that's defined here: main references array, which is defined in main.c
Referencing a global that's defined elsewhere: main references sum, which is defined in sum.c
Linker knows nothing of val
Linker knows nothing of i or s
Local Symbols
Local non-static C variables vs. local static C variables
local non-static C variables: stored on the stack
local static C variables: stored in either .bss, or .data
int f()
{
static int x = 0;
return x;
}
int g()
{
static int x = 1;
return x;
}
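Compiler allocates space in .data or .bss for each definition of x (the zero-initialized x in f typically goes in .bss, the x in g in .data)
C variables in .bss take up no space in the object file; their memory is allocated and zeroed when the program is loaded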
Creates local symbols in the symbol table with unique names, e.g., x.1 and x.2 or, perhaps, x.f and x.g
How Linker Resolves Duplicate Symbol Definitions
Program symbols are either strong or weak
Strong: procedures and initialized globals
Weak: uninitialized globals
p1.c:
int foo=5;    /* strong */
p1() {        /* strong */
}
p2.c:
int foo;      /* weak */
p2() {        /* strong */
}
Linker’s Symbol Rules
Rule 1: Multiple strong symbols are not allowed
Each item can be defined only once
Otherwise: Linker error
Rule 2: Given a strong symbol and multiple weak symbols, choose the strong symbol
References to the weak symbol resolve to the strong symbol
Rule 3: If there are multiple weak symbols, pick an arbitrary one.
Linker Puzzles
Puzzle 1
Module 1: int x; p1() {}
Module 2: p1() {}
Link time error: two strong symbols (p1)
Puzzle 2
Module 1: int x; p1() {}
Module 2: int x; p2() {}
References to x will refer to the same uninitialized int. Is this what you really want?
Puzzle 3
Module 1: int x; int y; p1() {}
Module 2: double x; p2() {}
Writes to x in p2 might overwrite y!
Puzzle 4
Module 1: int x=7; int y=5; p1() {}
Module 2: double x; p2() {}
Writes to x in p2 will overwrite y!
Puzzle 5
Module 1: int x=7; p1() {}
Module 2: int x; p2() {}
References to x will refer to the same initialized variable.
Nightmare scenario: two identical weak structs, compiled by different compilers with different alignment rules.
Global Variables
Avoid if you can
Otherwise
Use static if you can
Initialize if you define a global variable
Use extern if you reference an external global variable
Step 2: Relocation
Relocation merges the input modules and assigns run-time addresses to each symbol
When an assembler generates an object module, it does not know where the code and data will ultimately be stored in main memory or the locations of any externally defined functions or global variables referenced by the module
A “relocation entry” is generated when the assembler encounters a reference to a data object, function, or jump label whose ultimate location is unknown
2 basic types (x86-64 names, as used in CS:APP 3rd Edition)
R_X86_64_PC32 For PC-relative relocation (for the targets of call and jump instructions)
R_X86_64_32 Absolute relocation (for references to data, such as global variables in the .data section)
A PC-relative “address” is not an address at all! It is a 32-bit displacement that is added to the current PC (the address of the next instruction) to get the target address. Call and jump instructions use PC-relative addressing.
Absolute relocation stores the address of the symbol itself. R_X86_64_32 uses a 32-bit absolute address; pointer data in the .data section (e.g., int *xp = &x;) uses the 64-bit absolute type R_X86_64_64.
Static linking – What do linkers do?
Step 2. Relocation
-Merges separate code and data sections into single sections
-Take the code section from each of the relocatable object files, main.o and sum.o, and merge them into a single code section.
-Take the .rodata sections from each of the relocatable object files, and merge them into a single .rodata section.
-Take the .data sections from each of the relocatable object files, and merge them into a single .data section.
-Take the .bss (uninitialized file scope variables) sections from individual relocatable object files, and merge them into a single .bss section
-Relocates symbols from their relative locations in the .o files to their final absolute memory locations in the executable.
-Updates all references to these symbols (i.e., any encoded instructions which have the addresses of these symbols) to reflect their new positions.
Relocation
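[Figure: relocation. Relocatable object files on the left: system code (.text) and system data (.data); main.o with main() in .text and int array[2]={1,2} in .data; sum.o with sum() in .text. Executable object file on the right, starting at offset 0: headers; a merged .text containing system code, main(), sum(), and more system code; a merged .data containing system data and int array[2]={1,2}; then .symtab and .debug.]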
Packaging Commonly Used Functions
How to package functions commonly used by programmers?
Math, I/O, memory management, string manipulation, etc.
Awkward, given the linker framework so far:
Option 1: Put all functions into a single source file
Programmers link big object file into their programs
Space and time inefficient
Option 2: Put each function in a separate source file
Programmers explicitly link appropriate binaries into their programs
More efficient, but burdensome on the programmer
Old-fashioned Solution: Static Libraries
(You may still have to work with these)
Static libraries (.a archive files)
Concatenate related relocatable object files into a single file with an index (called an archive).
Enhance linker so that it tries to resolve unresolved external references by looking for the symbols in one or more archives.
If an archive member file resolves reference, link it into the executable.
Creating Static Libraries
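[Figure: creating a static library. atoi.c, printf.c, ..., random.c each pass through a translator to produce atoi.o, printf.o, ..., random.o. The archiver (ar) combines these object files into libc.a, the C standard library.]
unix> ar rs libc.a \
atoi.o printf.o … random.o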
Archiver allows incremental updates
Recompile function that changes and replace .o file in archive.
Commonly Used Libraries
libc.a (the C standard library)
4.6 MB archive of 1496 object files.
I/O, memory allocation, signal handling, string handling, date and time, random numbers, integer math
libm.a (the C math library)
2 MB archive of 444 object files.
floating point math (sin, cos, tan, log, exp, sqrt, …)
% ar -t libc.a | sort
…
fork.o
…
fprintf.o
fpu_control.o
fputc.o
freopen.o
fscanf.o
fseek.o
fstab.o
…
% ar -t libm.a | sort
…
e_acos.o
e_acosf.o
e_acosh.o
e_acoshf.o
e_acoshl.o
e_acosl.o
e_asin.o
e_asinf.o
e_asinl.o
…
Linking with Static Libraries
main2.c:
#include <stdio.h>
#include "vector.h"
int x[2] = {1, 2};
int y[2] = {3, 4};
int z[2];
int main()
{
    addvec(x, y, z, 2);
    printf("z = [%d %d]\n",
           z[0], z[1]);
    return 0;
}
libvector.a is built from addvec.c and multvec.c:
addvec.c:
void addvec(int *x, int *y,
            int *z, int n) {
    int i;
    for (i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}
multvec.c:
void multvec(int *x, int *y,
             int *z, int n)
{
    int i;
    for (i = 0; i < n; i++)
        z[i] = x[i] * y[i];
}
Linking with Static Libraries
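[Figure: linking with static libraries. main2.c and vector.h pass through the translators (cpp, cc1, as) to produce the relocatable object file main2.o. The archiver (ar) builds the static library libvector.a from addvec.o and multvec.o. The linker (ld) combines main2.o with addvec.o from libvector.a and with printf.o (and any other modules called by printf.o) from libc.a to produce prog2c, a fully linked executable object file. The "c" in prog2c stands for "compile-time".]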
Using Static Libraries
Linker’s algorithm for resolving external references:
Scan .o files and .a files in the command line order.
During the scan, keep a list of the current unresolved references.
As each new .o or .a file, obj, is encountered, try to resolve each unresolved reference in the list against the symbols defined in obj.
If any entries in the unresolved list at end of scan, then error.
Problem:
Command line order matters!
Moral: put libraries at the end of the command line.
A real pain in the backside if there is a circular dependency
unix> gcc -L. libtest.o -lmine
unix> gcc -L. -lmine libtest.o
libtest.o: In function `main':
libtest.o(.text+0x4): undefined reference to `libfun'
Modern Solution: Shared Libraries
Static libraries have the following disadvantages:
Duplication in the stored executables (nearly every program needs libc)
Duplication in the running executables
Minor bug fixes of system libraries require each application to explicitly relink (and sometimes restart)
Modern solution: Shared Libraries
Object files that contain code and data that are loaded and linked into an application dynamically, at either load-time or run-time
Also called: dynamic link libraries, DLLs, .so files
Shared Libraries (cont.)
Dynamic linking can occur when executable is first loaded and run (load-time linking).
Common case for Linux, handled automatically by the dynamic linker (ld-linux.so).
Standard C library (libc.so) usually dynamically linked.
Dynamic linking can also occur after program has begun
(run-time linking).
In Linux, this is done by calls to the dlopen() interface.
Useful for distributing software, high-performance web servers, and run-time library interpositioning.
No explicit requirement to recompile/relink after a library function update
Shared library routines can be shared by multiple processes.
Think of all running processes using the same spot in memory for the printf() function
More on this when you learn about virtual memory in Systems II
Dynamic Linking at Load-time
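[Figure: dynamic linking at load time. main2.c and vector.h pass through the translators (cpp, cc1, as) to produce the relocatable object file main2.o. The linker (ld) combines main2.o with relocation and symbol table info from libc.so and libvector.so to produce prog2l, a partially linked executable object file. When the loader (execve) runs prog2l, the dynamic linker (ld-linux.so) brings in the code and data of libc.so and libvector.so, giving a fully linked executable in memory.]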
unix> gcc -shared -o libvector.so addvec.c multvec.c
Linking Summary
Linking is a technique that allows programs to be constructed from multiple object files.
Linking can happen at different times in a program’s lifetime:
Compile time (when a program is compiled)
Load time (when a program is loaded into memory)
Run time (while a program is executing)
Understanding linking can help you avoid nasty errors and make you a better programmer.