-6

I'm just curious: I'd like to understand how compiled code works from the moment I run an executable file. Some time ago I found a very well-written article that showed how to use a hex editor to read a binary file and, for example, find the references to external functions in external static libraries. But I cannot find it any more; I only find tutorials that explain the compilation and linking pipeline.

EDIT

Many thanks to all who have answered so far, but maybe I have to be clearer: I already know how a compiler works and all the main steps (source -> compiling -> linking, etc.).

What I've never had is the opportunity to learn in more depth how the OS interacts with a binary executable.

Thanks again.

Daniele
  • Try [CppCon 2018: Matt Godbolt “The Bits Between the Bits: How We Get to main()”](https://www.youtube.com/watch?v=dOfucXtyEsU). That talk should get you started; I don't mean it to be a full answer. You will find other CppCon talks about linkers et al. – Theraot Jun 22 '20 at 08:51
  • You probably mean operating system, not "SO" – Basile Starynkevitch Jun 22 '20 at 13:14
  • Oh, yes, my fault. I'm Italian and in my mind I swapped the abbreviations. – Daniele Jun 23 '20 at 14:48
  • Maurice J. Bach's "The Design of the UNIX Operating System" or Tanenbaum's MINIX book. But honestly, unless you are **really** interested in the internals, I think one of them will suffice. – greenoldman Jun 23 '20 at 15:49

3 Answers

4

There are a few pieces to the puzzle of executables.

First, obviously, is the code itself. To understand this, you need to understand your target CPU's instruction set, including its binary encoding.

Second, how the code is packaged into files. Different OSs use different formats for this, typically ELF on Linux and PE on Windows. This file has different sections: some contain code, some contain static data, and some contain references to other libraries (DLLs/SOs) and their functions.

Finally, you need to understand how your platform's loader works, i.e. how exactly references to external functions are resolved. This is tightly interlinked with both previous parts.

In addition, it's probably helpful, but not completely necessary, to understand how processes and threads work in your OS.

After that, it's basically this: the loader parses the executable, puts the code and data into memory, patches some locations to resolve external references, and does all of this recursively for any dependencies. Then it sets up a process to start execution at some point in the code (typically specified by the executable).

Sebastian Redl
-1

Understanding how compiled code works

You need to read several books.

Several programming language specifications, e.g. for C11 its standard n1570, for C++11 its standard n3337, for Scheme the R5RS.

A book covering several programming languages. Read also about executables and object files formats, e.g. about ELF and ABIs.

A book about homoiconic languages, including Lisp or Scheme. I recommend Queinnec's Lisp in Small Pieces book.

A book on compilation, such as the Dragon book. Another about linkers and loaders.

A book on operating systems, since most programs run under some OS (except e.g. the code for an Arduino). See also OSdev.org

A book on Computer architecture

A book about the instruction set of your computer. Perhaps an x86-64 one. Or (for a tablet or a Raspberry Pi) an ARM one.

You may need to read something about garbage collection and about bytecode, in particular as it relates to the JVM.

Be aware that many compilers are open-source.

So study the source code of e.g. nwcc, GCC, Clang, Ocaml, SBCL, OpenJDK, etc.

Study also the source code of operating systems. Several references to open source OSes are on the OSDev wiki, and the wikipage on OS gives more references.

ACM SIGOPS and SIGPLAN conferences are relevant, and past conference proceedings are useful. See also ACM Queue and Phoronix.

PS. I recommend using Linux on your computer. See also http://linuxfromscratch.org/

Since the OP is Italian: in Italy, you might attend seminars or webinars at Scuola Normale Superiore, or courses from Università di Pisa or Università di Parma. Roberto Bagnara teaches such topics very well, and is a really nice person.

Basile Starynkevitch
-1

Here I want to explain, using an example, what happens when a program gets executed. Let's say we have written the following code in a file:

#include <stdio.h>

int main(void) {
    /* string literals take double quotes in C; single quotes are for character constants */
    printf("hello world");
    return 0;
}

Now, the computer doesn't understand this code, because it only understands binary (zeros and ones), which we call machine language. So this source has to be translated into machine code. That's where the compiler comes in. I named my file mgh.c (my name) and ran this command:

gcc mgh.c -o mgh

With this one command, gcc has carried out several steps:

1. It has preprocessed the code.
2. It has compiled the code, i.e. translated it to assembly language (if we ran the steps one by one, the resulting file would have the .s extension).
3. The assembler invoked by gcc has created the object file from the assembly code (the object file has the .o extension).
4. The linker invoked by gcc has turned the object file into a binary, i.e. an executable file (on Linux, without an extension).

Now the system has something it can execute. This is where I want to dig deeper.

After we have created the executable file, we can run it using:

./mgh

Or we can analyze it further by running:

strace ./mgh

This command shows all the system calls that the program invokes. printf is a function from the C standard library (declared in the stdio.h header) that prints hello world on the screen; in order to do that, some system calls have to be executed, and you can see them with the strace command.

The full answer could fill ten more books, but in summary that is what has to happen for code to become understandable to the system. How each process is loaded into memory and executed, how the operating system creates the PCB for each process, puts the process in the ready queue, schedules it with the scheduler, dispatches it to the CPU, and updates the registers from the PCB of that process (which has been saved on the kernel stack of that process), plus many more steps, has to be learned from books, some of which have been suggested above.

[UPDATE] OK, now that I have explained how an executable is created, I want to explain how it gets executed.

Let's say that a function like printf is about to be executed. printf itself is made up of several other routines, and each of these consists of a sequence of instructions that the CPU has to execute.

In general, a process is a program that has been loaded into memory; when it is not loaded we just call it a program (or application). When a process is loaded, it is put in a queue called the ready queue. There may be a lot of different processes in the ready queue, so how does the operating system decide which one to run? By running a scheduling algorithm; the component that does this is called the scheduler. The scheduler picks the next process that gets to use the CPU. After scheduling, the operating system dispatches the process to the CPU.

But we know that the CPU looks at the PC (program counter) register to find out which instruction to execute next. Right now the PC holds an address belonging to the previous process (or, in better words, to the current process, which is about to be taken off the CPU, i.e. its state is about to change from running to waiting), so some information, such as the data in the registers, has to be swapped. For this purpose every process has a block of information called the PCB, or process control block, which is kept in kernel memory. (The kernel is the core code of the operating system; after we turn the system on and the boot loader has run, this code is loaded from non-volatile storage such as a hard disk into volatile RAM, which we simply call memory.)

Now, when the process is put into the running state to be executed by the CPU, its PCB is loaded and the information in it is copied into the registers. The PC, for example, now points to the next instruction of this process.

Likewise, the state of the previous process, such as its PC register, is saved into that process's PCB. So the next time that process is scheduled, the PC is restored from the PCB data and the CPU knows that it should resume executing instructions from there.

One last note about the execution of processes. We said the CPU executes the instructions at the addresses the PC register points to. But what are these addresses? Could we change the PC of another process, or its PCB in general, to change which instructions get executed and thereby hack other processes? The answer is no. The reason is that when a process is loaded into memory, a separate region is assigned to it, called the address space of that process. The process can only read and write data within that address space, and the address space is completely virtual. In this way two processes can both use addresses from, say, 0 to 32000, and the operating system maps these addresses to different real physical addresses.

Mgh Gh