6

TL;DR

What procedure is followed when selecting bytes to represent opcodes? Are byte(s) for opcodes just randomly chosen, and them mapped to mnemonics?


I recently learned from this answer that bytecode usually consists of instructions which have one opcode, which consists of a fixed number of bytes, and operands. Specifically this snippet from ratchet freak's answer:

The bytecode itself is very often a very simple syntax. Where the first few bytes indicate what operation has to be performed and what operands are needed. The bytecode will be designed so that when reading byte per byte there is a unambiguous interpretation of the instructions.

I took that advice and began designing my bytecode instruction set. But I soon came to a problem. Before I asked this question, I had tried to create opcodes using methods such as:

# pseudo code

opcodes = {
    'PUSH': convertToBytes(5),
    'POP': convertToBytes(10),
    'ADD': converToBytes(15),
     etc...
}

As you can probably tell, in the above example I used integers that were multiples of five, and converted them to byte form. I was trying to create a way in which I could orderly map my opcodes, to something such as integers. This of course did not work because as each integer became larger, So did the number of bytes relative to each integer. Which meant that my opcodes would've been variable lengths.

I then began to wonder if I was going about this the wrong way. I did some research to see how other languages design there opcodes.

I found this webpage about CPU opcodes that said:

The x86 CPU has a set of 8-16 bit codes that it recognizes and responds to. Each different code causes a different operation to take place inside the registers of the CPU or on the buses of the system board.

Here are three examples showing the bit patterns of three actual x86 opcodes, each followed by their one or more bytes of operands:

And proceed to give an example:

  Bit Pattern  ; operation performed by CPU
  -----------  -------------------------------------------------------
1. 10111000    ; MOVe the next two bytes into 16-bit register AX
2. 00000101    ; ...the LSB of the number (goes in AL)
3. 00000000    ; ...the MSB of the number (goes in AH)

1. 00000001    ; ADD to the BX register
2. 11000011    ; ...the contents of the AX register

1. 10001001    ; (2-byte opcode!) MOVe the contents of BX to
2. 00011110    ; ...the memory location pointed to
3. 00100000    ; ...by these last
4. 00000001    ; ...two bytes

This leads me to my question: What procedure is followed when selecting bytes to represent opcodes?. As in the above example, each instruction consists of one byte. How though, was that specific pattern of a byte picked?. Are byte(s) for opcodes just randomly chosen, and them mapped to mnemonics? eg:

# pseudo code

opcodes = {
    'PUSH': selectRandomByte(),
    'POP': selectRandomByte(),
    'ADD': selectRandomByte(),
     etc...
}

Note: Let me clarify: When I say opcode, I am referring to the opcodes found in Virtual Machine bytecode, not CPU's. I apologize if this was not clear before. The example I gave with the CPU opcodes was only for illustration purposes only.


Sources

Christian Dean
  • 2,790
  • 1
  • 22
  • 38
  • I'm confused as to why you would want to make them multiples of 5. Why not use multiples of 1? Start at 0 and work up to 255. That allows for 256 different opcodes. Adding a multiple will reduce the possible values. – JimmyJames Nov 10 '16 at 22:51
  • @JimmyJames I thought up that idea about 2 months ago when I had no idea how binary worked. But I think I'll end up going with something like you described. – Christian Dean Nov 10 '16 at 22:54
  • 2
    The x86 instruction set and instruction encoding are really bad examples. First, they haven't been "designed", rather they have "grown" over the course of over 40 years. Secondly, (partly as a consequence of #1) they are widely considered to be bad and extremely complex. Something like Donald Knuth's MMIX (for an artificially designed CPU), the MIPS R1000 (for a highly-orthogonal, simple, real-world CPU), the Burroughs B5000, or the Pascal P-Code system might be better examples. Also, CPU instruction encodings are designed for fast decoding in hardware; something which is not relevant for you. – Jörg W Mittag Nov 10 '16 at 23:24

3 Answers3

10

It's not random, but it may not be immediately apparent. And it may have evolved over time: the x86, architecture, for example, has been with us for nearly 40 years (1977), and has evolved from 16-bits to 32, to 64, with additional operations (such as MMX and SSE) added in that time.

Architectures that have been around for a long time generally have some relationship between opcodes and the actual electrical signals used to control the CPU.

In some architectures, such as the PDP-11, there is a clear plan to the opcodes. All opcodes fit into a 16-bit word, generally divided into 3-bit octal digits, with the following general usage:

  • Bit 15: 1 for a byte operation, 0 for a word operaton
  • Bits 14-12: 0 indicates an instruction that takes a single operand, 1-6 indicates an instruction that takes two operands, 7 is for operations that don't follow the standard operand encoding.
  • Bits 11-6: source operand for two-operand instructions, otherwise denotes the specific single-operand instruction.
  • Bits 5-0: destination operand for two-operand instructions, otherwise the sole operand for single-operand instructions.

In the case of Java bytecode, there's no underlying electrical architecture, and no real reason to put intelligence inside the structure of the bytecode, so related operations are grouped together and assigned sequential numbers. So while there's a reason that iconst_0 and iconst_1 have adjacent opcodes, there's (probably) no reason why those appear before the integer math opcodes.

kdgregory
  • 5,220
  • 23
  • 27
3

I have never designed a bytecode, but here are some things I would consider if I did:

  1. I think that it does not matter much. If people will want to work with your bytecode, they will do that using mnemonics, not the byte values. And whatever scheme you devise likely won't hold up when you add new opcodes in the future.

  2. Plan for the future. Make sure that you can easily add new opcodes in the future, and stay backwards compatible.

  3. If you plan to implement your instruction set in hardware, it could make sense to design the opcode values in a way that easy to decode for the hardware. But I think this will matter only for the most simple of CPUs.

  4. Group related opcodes together.

  5. If you have some instructions that have a built-in values (e.g. ldc.i4.1 in CIL or iconst_1 in Java), then it might be nice if the value was visible in the opcode value. Or at least have the same opcodes only with different values sorted next to each other (the value of iconst_2 should be one more than inconst_1).

svick
  • 9,999
  • 1
  • 37
  • 51
  • _"And whatever scheme you devise likely won't hold up when you add new opcodes in the future."_ - I'm not quite sure what your getting at here. Are you saying any method I devise will fail? or are you saying if I don't design my byte code right then I'll fail? – Christian Dean Nov 10 '16 at 22:25
  • 2
    What I mean is that if you say e.g. "all opcodes that load values on the stack have values 0x00-0x20" and then need to add new opcodes that load values, it won't hold true anymore. – svick Nov 10 '16 at 22:27
2

If your byte code is just for a software interpreter, you probably will not need any rule. Probably...

Originally byte codes came from hardware design. And the rules were dictated by the hardware. Certain bit patterns were reserved for, say, branch codes, load codes, store codes, etc. In most cases this also helped humans to decode machine code.

The above rule did fit for most of the opcodes. But of course, also in hardware people had to patch design. So certain codes were not orthogonal, because they were designed after most of the HW was already finished.

If you need to read your byte code manually from the bytes, try to create a certain pattern that can help to decipher it. For the software reading your byte code it will not matter which bit pattern is used and you could well choose a random one.