4

How is Bytecode "parsed"?

It is my understanding that Bytecode is a binary, intermediate representation of the syntax of a given programming language. Certain programming languages convert their source text into Bytecode, which is then written to a file. How do the Virtual Machines of those languages "parse" their Bytecode?

To narrow down this question, take Python's Bytecode, for instance. When the Python Virtual Machine begins reading Bytecode from a *.pyc file, how does it translate the stream of bytes it is reading into specific instructions?

When the Virtual Machine reads bytecode from a file, it is my understanding that the bytecode is one long stream of bytes. How, then, is the bytecode broken into useful chunks? How is it transformed into an opcode and the opcode's arguments?

For example, say the Virtual Machine was reading in the bytecode to add two numbers. The Virtual Machine sees the instruction 0x05, which would mean "add two numbers".

Each number could be represented by a different number of bytes, so how would the Virtual Machine know how many bytes it would need to read ahead to gather the arguments for the op 0x05?

Christian Dean
  • The same way any other code is parsed: by reading it and applying rules. The only difference is that this code is in a non-textual format. Are you looking for any specific details of the process? – Mason Wheeler Nov 03 '16 at 14:44
  • @MasonWheeler I'm really just confused about whether and how bytecode is parsed. But are you saying parsing bytecode is the same as parsing source code? How would that work? I can see how ASCII source code is parsed using rules, but how would you apply rules to binary? – Christian Dean Nov 03 '16 at 14:45
  • Generally something along the lines of "this byte is `0x38`. The parse table says that operator `0x38` takes 3 bytes for arguments, so read the next 3 bytes into the data structure. Then read the next byte as an opcode..." – Mason Wheeler Nov 03 '16 at 14:49
  • ASCII is an alphabet with 128 "letters", Unicode is an alphabet with thousands of "letters", binary is an alphabet with 256 "letters", why would parsing one be significantly different than the other? – Jörg W Mittag Nov 03 '16 at 14:54
  • @MasonWheeler I think my problem comes from the fact that my opcode can be an arbitrary number of bytes. I am converting numbers into bytes, which act as my opcode. Should I try to limit my opcodes to one byte? That seems as if it would be the only way to parse it. – Christian Dean Nov 03 '16 at 14:57
  • @JörgWMittag Because you're trying to apply grammar rules to arbitrary bytes? – Christian Dean Nov 03 '16 at 14:58
  • The bytes aren't arbitrary. They're an *encoding* of your textual code and they also follow grammatical rules. – MetaFight Nov 03 '16 at 14:59
  • @MetaFight Ok, I think I get how that might work. I thought that parsing ascii would be different from parsing binary. Apparently it is not. – Christian Dean Nov 03 '16 at 15:01
  • human readable text can be harder to parse - variable numbers of tabs/spaces, blank lines, multiple characters 'f', 'o', 'r', 'e', 'a', 'c', 'h' to represent a single "token", and so on. If you don't need to care about human readability, you can condense the source considerably and make it easier to parse – Dan Pichelman Nov 03 '16 at 15:04
  • Byte code class files are *decoded* and/or interpreted more than *parsed* in the classical sense. By decoding I refer to interpreting a file format, whereas parsing refers to deserializing text: inflating data structures from that text. – Erik Eidt Nov 03 '16 at 15:07
  • @ErikEidt That makes more sense to me. Shouldn't all the VM needs to do be decoding the instructions, not making an AST? – Christian Dean Nov 03 '16 at 15:08
  • Yes, though some still choose to use an AST/tree in their JIT (translating bytecode to CPU instructions) and to perform other optimizations. When a JIT makes an AST it is still more interpreting the bytecode than classically parsing text. – Erik Eidt Nov 03 '16 at 15:10
  • In computer science & with computer languages (be it for machine consumption only or for both humans and machine) the same applies as in chemistry: nothing ever gets really created, nothing ever gets really lost, everything gets transformed. Parsing bytecode differs from parsing textual source code only in the details relevant to obtain the transformation target(s) in the most cost effective way one can devise. But in essence it's still all about applying rewriting rules to go from one artifact to the other. Check out the Tombstone diagrams also: https://en.wikipedia.org/wiki/Tombstone_diagram – YSharp Nov 04 '16 at 03:05

3 Answers

11

I think your confusion comes from thinking of bytecode as a language that is being interpreted by the virtual machine. While this is technically a correct way to describe it, it's leading you to some incorrect assumptions.

The first thing to understand is that bytecode is a type of machine code. The only thing that makes it different from the machine code that your CPU understands is that the machine in this case is virtual (hardware that uses bytecode directly is possible). This might seem like a big distinction, but if you consider what emulators do, whether the target machine is virtual or not isn't really of much importance in the context of the machine language.

Machine code is easy for computers to parse because it is expressly built to make that easy. The main distinction between machine languages and the higher-level languages most people are familiar with is that the latter are generally built to be easy for humans to use.

This 1997 article on Java bytecode might help. Let's walk through an example from that text:

84 00 01

The first byte (called the opcode) is 84. We can look up what that opcode means and find that it's iinc (increment local variable #index by signed byte const) and that the two following bytes indicate the index of the variable and the amount, respectively. The JVM then takes that instruction and translates it (while following the language specification) into the machine instructions that correspond to the bytecode instruction.
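
To make this concrete, here is a minimal sketch in Python of that kind of table-driven decoding. It is not the JVM's actual implementation; the table holds just two real JVM opcodes, with the operand counts the specification assigns them:

```python
# A tiny opcode table: opcode byte -> (mnemonic, number of operand bytes).
OPCODES = {
    0x84: ("iinc", 2),  # operands: local variable index, signed const
    0x60: ("iadd", 0),  # works on the operand stack, no operand bytes
}

def decode(code: bytes):
    i = 0
    while i < len(code):
        mnemonic, n_operands = OPCODES[code[i]]
        operands = code[i + 1 : i + 1 + n_operands]
        yield mnemonic, operands
        i += 1 + n_operands  # step over the opcode and its operands

# The instruction from the article: iinc local variable #0 by 1.
for instruction in decode(bytes([0x84, 0x00, 0x01])):
    print(instruction)  # ('iinc', b'\x00\x01')
```

Because each opcode's operand count is fixed by the specification, the decoder always knows how many bytes to read before the next opcode begins.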

JimmyJames
  • I upvoted, even though I think your argument is circular. "A language that is being interpreted by the virtual machine" is exactly what is happening with byte code; calling it machine code seems like a distinction without a difference, unless you believe "interpretation" and "execution" are meaningfully different. – Robert Harvey Nov 03 '16 at 16:54
  • 2
    @RobertHarvey As I said in the answer "this isn't really an incorrect way to describe it." Doesn't that address your complaint or is there more to it? My point here is that bytecode is more like machine code (actually it *is* machine code) than a language you or I would write code in. If you understand how machine code is processed, then it's more-or-less the same. If you aren't familiar with that, it helps give context for more research. – JimmyJames Nov 03 '16 at 17:04
  • @RobertHarvey I switched out "not incorrect" for "correct". It was confusingly worded. – JimmyJames Nov 03 '16 at 17:07
  • Yeah, that is better. – Robert Harvey Nov 03 '16 at 17:08
  • Hardware **decodes** instructions before/while executing them; **parsing** is done at the software level. So I would imagine that a VM (byte-code interpreter) does both, depending on which level of abstraction it is being viewed at (its decoding routines for byte-code instructions are certainly doing some parsing). – samus Dec 08 '16 at 19:23
  • @SamusArin I'm not sure what you are getting at. Semantics? – JimmyJames Dec 08 '16 at 19:26
  • Paragraph 3, sentence 1. – samus Dec 08 '16 at 19:28
4

Byte codes are decoded. They are designed like a processor instruction set. Because the byte codes are variable length, even when we know where the code starts, finding instruction boundaries means decoding from the beginning (usually of a method).

When you reach a branch instruction (especially a conditional one) you might choose to follow the branch target or the fall-through (next instruction). If you were interpreting, you'd do the former; when JITing, you'd likely do the latter.

Each encoded byte says something about the instruction to execute as well as its length. Simple, common operations are encoded within a single byte. Other operations use additional bytes. The decoder looks at the values of the bytes so far and can then determine, at a minimum, whether the instruction is complete or takes one more byte. Some encodings may indicate multiple additional bytes.
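
As an illustration, here is a Python sketch of that kind of length determination. The encoding is made up for the example (the opcode ranges and the 0xFF "wide" prefix are invented, not taken from any real instruction set):

```python
# Invented encoding, for illustration only:
#   0x00-0x7F: complete one-byte instructions
#   0x80-0xFE: opcode followed by one operand byte
#   0xFF:      "wide" prefix; the real opcode follows with two operand bytes
def instruction_length(code: bytes, i: int) -> int:
    op = code[i]
    if op == 0xFF:   # prefix + opcode + 2 operand bytes
        return 4
    if op >= 0x80:   # opcode + 1 operand byte
        return 2
    return 1         # simple, common operation: a single byte

def instructions(code: bytes):
    # Lengths vary, so instruction boundaries can only be found by
    # walking forward from the beginning (e.g. of a method).
    i = 0
    while i < len(code):
        n = instruction_length(code, i)
        yield code[i : i + n]
        i += n

print(list(instructions(bytes([0x01, 0x90, 0x05, 0xFF, 0x02, 0x00, 0x10]))))
# [b'\x01', b'\x90\x05', b'\xff\x02\x00\x10']
```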


Have a look at the Java bytecode class file format, and also the VAX instruction set architecture, which is variable length and highly regular. Java bytecode uses a stack architecture and is fairly high level (as it is bytecode), while the VAX is a register machine and low level. (You could also look at x86, but that is less regular and thus more complicated, IMHO.)

Erik Eidt
3

The file will have a small header with information about the version, where the executable bytecode is located (plus maybe information about the functions contained in it), and where the constant data (like strings) is located. On Stack Overflow, the question about Python's bytecode has already been asked.
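
As a concrete, version-dependent illustration: in CPython 3.7 and later the .pyc header is 16 bytes, and everything after it is a marshalled code object that the standard dis module can disassemble. A minimal sketch, which only works when run under the same interpreter version that produced the file (the marshal format is version-specific), with a hypothetical path:

```python
import dis
import marshal

# Sketch of reading a CPython .pyc (3.7+ layout; earlier versions differ).
# The 16-byte header holds a 4-byte magic number, a 4-byte flags word,
# and 8 bytes of source mtime/size (or source hash) used for invalidation.
def disassemble_pyc(path: str) -> None:
    with open(path, "rb") as f:
        f.read(16)                       # skip the header
        code = marshal.loads(f.read())   # the module's code object
    dis.dis(code)                        # print opcodes and their arguments

disassemble_pyc("__pycache__/example.cpython-312.pyc")  # hypothetical path
```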

The bytecode itself very often has a very simple syntax, where the first few bytes indicate what operation has to be performed and what operands are needed. The bytecode will be designed so that, when reading byte per byte, there is an unambiguous interpretation of the instructions.

To give an example that makes the bytes per operation very explicit, there is SPIR-V. The first 4-byte word of each instruction is constructed as a 2-byte length plus a 2-byte opcode.
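
A Python sketch of walking that layout (per the SPIR-V specification, each instruction's first 32-bit word packs the instruction's total word count into the high 16 bits and the opcode into the low 16 bits; the module header that precedes the instruction stream is assumed to have been skipped already):

```python
def decode_words(words):
    i = 0
    while i < len(words):
        word_count = words[i] >> 16      # total words in this instruction
        opcode = words[i] & 0xFFFF       # which operation to perform
        yield opcode, words[i + 1 : i + word_count]
        i += word_count

# Example: "OpCapability Shader" is encoded as the single instruction
# 0x00020011 0x00000001 (2 words, opcode 17, one operand).
print(list(decode_words([0x00020011, 0x00000001])))  # [(17, [1])]
```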

ratchet freak