Why is it necessary to have an instruction set for processors and controllers? Can't we simply convert high-level language programs, like those written in C, directly into binaries without the need for an instruction set?
- Program binaries _are using_ the instruction set. By analogy: "Why do we need words, if we can write characters directly?" – Alexander Apr 27 '23 at 01:18
- Heh, sure. Just means every instruction is a move instruction. Oh and any other work has to be done by some other chip at some other address. An instruction set is what gives those binaries their meaning. Without that it's just a bunch of numbers. – candied_orange Apr 27 '23 at 01:19
- Does this answer your question? [How Do Computers Work?](https://softwareengineering.stackexchange.com/questions/81624/how-do-computers-work) – gnat Apr 27 '23 at 05:17
- The binaries you mention basically are a series of instructions. Instructions the targeted processor can read and process. The processor eats its way through these instructions. Hence, each will have to be meaningful to that processor, make it perform some planned action. The collection of possible instructions is the instruction set. A processor without an instruction set would be capable of nothing. – Martin Maat Apr 27 '23 at 06:24
- What do you think is in the binaries? – pjc50 Apr 27 '23 at 09:05
3 Answers
Are you confusing "instruction set" with "assembly language"?
Higher level languages often get compiled into assembly language, and then that gets "compiled" into actual machine code.
For example this is an instruction in an assembly language for an X86 processor:
XOR CL, [12H]
And the corresponding machine code would be:
00110010 00001100 00100101 00010010 00000000 00000000 00000000
(According to a random Wikibooks page I Googled; I don't actually know any assembly languages or machine code specs)
Note that even the machine code there is actually encoded into a language that you and I can read (in this case `0` and `1` characters representing digits of a binary string). You can't really "see" actual machine words; they're always encoded into something when a human is looking at them.
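To make that concrete, here is a small, non-portable C sketch that prints the first few bytes of one of its own compiled functions in two human-readable encodings of the same bits. Casting a function pointer to a data pointer like this is not strictly portable C, but it typically works on desktop platforms:

```c
#include <stdio.h>

int add(int a, int b) { return a + b; }

int main(void)
{
    /* Treat the compiled body of add() as raw bytes and print each one in
       hexadecimal and in binary. Same machine code, two encodings that
       humans can read. (Not strictly portable C.) */
    const unsigned char *code = (const unsigned char *)(void *)add;
    for (int i = 0; i < 4; i++) {
        printf("%02x = ", code[i]);
        for (int bit = 7; bit >= 0; bit--)
            putchar((code[i] >> bit & 1) ? '1' : '0');
        putchar('\n');
    }
    return 0;
}
```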
Machine code is no less "code" than assembly or higher level languages. There's nothing inherent in that bitstring that says to do the `XOR` operation. Someone has merely defined a language where `00110010` is the "name" of a bitwise-exclusive-or operation, and designed the hardware of the CPU so that it will perform the operation when it is fed the "name" `00110010`. Other CPUs may well use the same string of bits to refer to totally different operations. You have to know what "language" a CPU speaks to know how to compile higher level code into a binary for that device.
The "instruction set" for a CPU is nothing more and nothing less than documentation of the language that the CPU understands. It doesn't make sense to imagine skipping the instruction set and compile "directly into binaries"; a program compiled to binary is encoded in the instruction set. (The binary codes of the instruction set is in fact exactly what is "binary" about compiled programs, and thus the only reason we call them "binaries" to begin with)
But if by "instruction set" you actually meant assembly language, then your question makes sense. Many compilers work by first compiling their source language into assembly language, and then further translating that into machine code. (Often it's actually separate programs doing these steps, and many compilation processes in fact go through several stages of intermediate language, whether or not these intermediate languages are written out into files where you could see them)
Assembly languages are designed to be a human-readable (ish) representation of machine code. The opcodes are represented in more memorable letter codes like `XOR` instead of near-arbitrary numbers like `00110010`, they have formatting like spaces and line breaks so the code is laid out more nicely for humans, and they allow things like comments. They do often have a few features that are not quite a direct representation of machine code (like labels, which need to be substituted with actual addresses once the assembler decides where to put the final code), but for the most part mapping from textual assembly code into binary machine code is very straightforward. If you like, you can think of it a little like translating numbers between bases (e.g. `13` in decimal, `d` in hexadecimal, and `1101` in binary are all encodings of the same number; changing between those forms doesn't change the information stored, only the encoding). Translating assembly language to machine code is more involved than changing between numeric bases, but it's closer to that sort of translation than it is to compiling a high level language into assembly language.
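As a minimal sketch of that "mostly a re-encoding" idea, assembling a one-byte mnemonic for a made-up machine is little more than a table lookup (the mnemonics and opcode values here are invented for illustration, not a real instruction set):

```c
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mnemonic-to-opcode table for a made-up machine. Assembling
   does not change the information, only its encoding: text for humans,
   a number for the CPU. */
static const struct { const char *mnemonic; unsigned char opcode; } table[] = {
    { "NOP", 0x00 }, { "ADD", 0x04 }, { "XOR", 0x32 }, { "HLT", 0xFF },
};

static int assemble(const char *mnemonic)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].mnemonic, mnemonic) == 0)
            return table[i].opcode;
    return -1; /* unknown mnemonic */
}

int main(void)
{
    const char *source[] = { "XOR", "ADD", "HLT" };
    for (size_t i = 0; i < 3; i++)
        printf("%s -> 0x%02x\n", source[i], (unsigned)assemble(source[i]));
    return 0;
}
```

A real assembler also handles operands, addressing modes, and labels, but the heart of it is this kind of direct mapping.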
Note that this means that each assembly language is very closely tied to one specific instruction set (and there can in fact be more than one assembly language per instruction set). That's why I've been using "assembly languages", plural. There is no one single assembly language, there are many many different ones; the reason we group them all into one category and can talk about them in general is that there are a lot of commonalities stemming from the shared purpose of being a human-readable representation of some instruction set.
Most programmers today never need to know about assembly language. They can just know that the compilation process takes in their high level language and produces machine code. So it's reasonable to ask about why the assembly language step exists at all.
When assembly languages were first produced, they were the high level languages. Writing textual assembly language was a step up from deciding the individual bits that would get fed into a computer (as was done with the likes of punched card input).
When compilers for higher level languages started being created, there was already a lot of code being written in assembly languages. It made sense to translate higher level code into assembly, so that programmers could inspect the output of the compiler, combine it with other assembly, etc. And assembly code is barely any more complicated than machine code; it's almost the same thing, just represented in a form humans can read more easily. Directly compiling to machine code would not have made the compilers any easier to write, but it would have made the output less useful.
Even when most code began to be written in higher level languages, for a long time it was unlikely for compilers to produce machine code that was better than what a human assembly coder would produce. If constraints were tight (the code needed to go as fast as possible, or be as compact as possible, etc.), people would write their program in assembly language, not a high level language. Even today it's still not uncommon for certain types of code to be written manually in assembly (though now it's much more likely to be small fragments of a larger program that is mostly written in a higher level language). And when new features first appear in a new version of a CPU, if they require specific machine code to use, that machine code will be accessible via assembly language long before it becomes incorporated as a feature of a higher level language.
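For example, with GCC on x86-64 a small hand-written fragment can sit inside an otherwise ordinary C program via extended inline assembly. This is just a sketch, and it assumes that particular compiler and architecture:

```c
#include <stdint.h>
#include <stdio.h>

/* Read the x86-64 time-stamp counter with GCC extended inline assembly:
   the kind of tiny, architecture-specific fragment that still tends to be
   written by hand inside a mostly high-level program. */
static uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t start = read_tsc();
    /* ... work being measured would go here ... */
    printf("cycles elapsed: %llu\n",
           (unsigned long long)(read_tsc() - start));
    return 0;
}
```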
The assembly language step remains, as it has always been, a part of the interface of the compiler. People use it. Most programmers don't need it but some occasionally do, so if the assembly language step was removed from the compilation process it would make the compiler less useful (and again, without gaining anything significant in reducing the complexity of the compiler, as you have to have done pretty much all the same work to translate high level code to machine code as you would to translate it to assembly).
So "instruction sets" exist because whatever the CPU understands will be documented as its instruction set; if you don't have an instruction set you don't have a CPU. Assembly languages are mostly just a non-binary representation of the instruction set. They exist partly for historical reasons, but also because they are useful to some programmers and removing them is of very little benefit.

- Most mainstream compilers (GCC, Clang/LLVM, javac/HotSpot, Roslyn/.NET) do not generate assembly any more as part of the compilation process. There is usually still a way to generate assembly, but it is either a separate pass from the intermediate representation, parallel to machine code generation, or just a disassembly tool provided alongside the compiler. Machine code is just easier and faster for compilers to work with, so there are significant complexity advantages. – user1937198 Apr 27 '23 at 13:16
The “instruction set” is literally what tells the processor what binary instructions mean. Without an instruction set, binary instructions would be useless because the processor would have no idea what the binary instructions mean.
Why do we need Morse code when we can just use dots and dashes? Because without Morse code we have no idea what the dots and dashes mean.

Programming languages need a language specification to ensure all parties agree on the behavior for any given program. The "instruction set" is one such specification that is very rigorously defined.
Could we make a CPU that used a high level language source code, like C, as input? Yes, but it would be horrible.
There is a huge mismatch between operations that are easy to implement in hardware and operations used by a high level language. Something simple, like a method call, is actually really complex if you think about what it actually involves. Implementing all of this complexity with just transistors would be many orders of magnitude more difficult than implementing simple operations, like adding and comparing numbers, and moving most of the complexity to software.
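To see why, consider what even a trivial call involves at the machine level. The comments below describe a typical register machine in rough terms; the exact steps depend on the calling convention:

```c
#include <stdio.h>

static int square(int x)
{
    return x * x;
}

int main(void)
{
    /* At the C level this is one line. At the machine level it becomes,
       roughly: place the argument in a register or on the stack, save the
       return address, jump to square's code, set up its stack frame, do the
       multiply, put the result where the caller expects it, tear down the
       frame, and jump back. Each step is one or more simple instructions
       that are cheap to implement in hardware. */
    int y = square(7);
    printf("%d\n", y);
    return 0;
}
```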
You could embed such a software layer in the processor, but then you have essentially just moved the compiler down to the processor firmware layer, without really gaining much. This technique is sometimes used, but mostly for things that are still fairly simple, like trigonometry instructions. This is called "microcode".
A somewhat related technique is used by GPUs, where the shaders are written in a fairly high level language and compiled to your specific GPU's instruction set by the GPU driver. But this has also been a source of problems, since this compilation takes a fair bit of time.
But if we ignore all of these problems, what language would you pick? You had better pick a good one, because switching language, or even language version, would require all your users to buy new processors and rewrite all the software they use.
The separation between high level languages and low level instruction sets allows for a great deal of flexibility. You can easily create a new high level language as long as you write a compiler targeting a low level instruction set; there is no need to create an entirely new CPU. And if you did create a new CPU, you could just write a compiler to allow it to run existing programs.
Many modern languages/compilers even use an additional intermediate representation; LLVM, .NET, and Java are prominent examples. The purpose of this is even more flexibility, and to reduce the need to rewrite code when creating new languages or targeting new CPUs. There have been ideas for a CPU that could take Java bytecode as input. I'm not sure how far it got, but I do not think it was ever launched.
