<?xml version="1.0"?>
<!--
    * Licensed to the Apache Software Foundation (ASF) under one
    * or more contributor license agreements.  See the NOTICE file
    * distributed with this work for additional information
    * regarding copyright ownership.  The ASF licenses this file
    * to you under the Apache License, Version 2.0 (the
    * "License"); you may not use this file except in compliance
    * with the License.  You may obtain a copy of the License at
    *
    *   http://www.apache.org/licenses/LICENSE-2.0
    *
    * Unless required by applicable law or agreed to in writing,
    * software distributed under the License is distributed on an
    * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    * KIND, either express or implied.  See the License for the
    * specific language governing permissions and limitations
    * under the License.
-->
<document>
  <properties>
    <title>The Java Virtual Machine</title>
  </properties>

  <body>
    <section name="The Java Virtual Machine">
      <p>
        Readers already familiar with the Java Virtual Machine and the
        Java class file format may want to skip this section and proceed
        with <a href="bcel-api.html">section 3</a>.
      </p>

      <p>
        Programs written in the Java language are compiled into a portable
        binary format called <em>byte code</em>. Every class is
        represented by a single class file containing class-related data
        and byte code instructions. These files are loaded dynamically
        into an interpreter, the <a
              href="http://docs.oracle.com/javase/specs/">Java
        Virtual Machine</a> (JVM), and executed.
      </p>

      <p>
        <a href="#Figure 1">Figure 1</a> illustrates the procedure of
        compiling and executing a Java class: The source file
        (<tt>HelloWorld.java</tt>) is compiled into a Java class file
        (<tt>HelloWorld.class</tt>), loaded by the byte code interpreter
        and executed. In order to implement additional features,
        researchers may want to transform class files (drawn with bold
        lines) before they are actually executed. This application area
        is one of the main topics of this article.
      </p>

      <p align="center">
        <a name="Figure 1">
          <img src="../images/jvm.gif"/>
          <br/>
          Figure 1: Compilation and execution of Java classes</a>
      </p>

      <p>
        Note that the general term "Java" in fact has two meanings: on
        the one hand, Java as a programming language; on the other hand,
        the Java Virtual Machine, which is not necessarily targeted by
        the Java language exclusively, but may be used by <a
              href="http://www.robert-tolksdorf.de/vmlanguages.html">other
        languages</a> as well. We assume that the reader is familiar with
        the Java language and has a general understanding of the
        Virtual Machine.
      </p>

    <subsection name="Java class file format">
      <p>
        Giving a full overview of the design issues of the Java class file
        format and the associated byte code instructions is beyond the
        scope of this paper. We will just give a brief introduction
        covering the details that are necessary for understanding the rest
        of this paper. The format of class files and the byte code
        instruction set are described in more detail in the <a
              href="http://docs.oracle.com/javase/specs/">Java
        Virtual Machine Specification</a>. In particular, we will not deal
        with the security constraints that the Java Virtual Machine has to
        check at run-time, i.e., the byte code verifier.
      </p>

      <p>
        <a href="#Figure 2">Figure 2</a> shows a simplified example of the
        contents of a Java class file: It starts with a header containing
        a "magic number" (<tt>0xCAFEBABE</tt>) and the version number,
        followed by the <em>constant pool</em>, which can be roughly
        thought of as the text segment of an executable, the <em>access
        rights</em> of the class encoded by a bit mask, a list of
        interfaces implemented by the class, lists containing the fields
        and methods of the class, and finally the <em>class
        attributes</em>, e.g., the <tt>SourceFile</tt> attribute giving
        the name of the source file. Attributes are a way of putting
        additional, user-defined information into class file data
        structures. For example, a custom class loader may evaluate such
        attribute data in order to perform its transformations. The JVM
        specification declares that unknown, i.e., user-defined,
        attributes must be ignored by any Virtual Machine implementation.
      </p>

      <p align="center">
        <a name="Figure 2">
          <img src="../images/classfile.gif"/>
          <br/>
          Figure 2: Java class file format</a>
      </p>
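
      <p>
        As a rough illustration of the header layout described above,
        the following sketch checks the magic number and reads the
        version number from the first eight bytes of a class file. The
        class name <tt>ClassFileHeader</tt> and the method
        <tt>readVersion</tt> are made up for this example and are not
        part of any library; the byte array in <tt>main</tt> is a
        hand-made header, not a real class file.
      </p>

      <source><![CDATA[
```java
import java.nio.ByteBuffer;

final class ClassFileHeader {
    static final int MAGIC = 0xCAFEBABE;

    /** Returns {minor, major}, or throws if the magic number is missing. */
    static int[] readVersion(byte[] classBytes) {
        // ByteBuffer is big-endian by default, matching the class file format.
        ByteBuffer in = ByteBuffer.wrap(classBytes);
        if (in.remaining() < 8 || in.getInt() != MAGIC) {
            throw new IllegalArgumentException("not a Java class file");
        }
        int minor = in.getShort() & 0xFFFF; // unsigned 16-bit minor version
        int major = in.getShort() & 0xFFFF; // unsigned 16-bit major version
        return new int[] { minor, major };
    }

    public static void main(String[] args) {
        // Hand-made header: magic number, minor version 0, major version 52.
        byte[] header = { (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52 };
        int[] version = readVersion(header);
        System.out.println("major=" + version[1] + ", minor=" + version[0]);
    }
}
```
]]></source>

      <p>
        The constant pool would follow directly after these eight bytes;
        parsing it is left out here to keep the sketch short.
      </p>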

      <p>
        Because all of the information needed to dynamically resolve the
        symbolic references to classes, fields and methods at run-time is
        coded with string constants, the constant pool makes up the
        largest portion of an average class file, approximately 60%.
        This makes the constant pool an easy target for code
        manipulation. The byte code instructions themselves account for
        only about 12%.
      </p>

      <p>
        The upper right box shows a "zoomed" excerpt of the constant pool,
        while the rounded box below depicts some instructions that are
        contained within a method of the example class. These
        instructions represent the straightforward translation of the
        well-known statement:
      </p>

      <p align="center">
        <source>System.out.println("Hello, world");</source>
      </p>

      <p>
        The first instruction loads the contents of the field <tt>out</tt>
        of class <tt>java.lang.System</tt> onto the operand stack. This is
        an instance of the class <tt>java.io.PrintStream</tt>. The
        <tt>ldc</tt> ("load constant") instruction pushes a reference to
        the string "Hello, world" onto the stack. The next instruction
        invokes the instance method <tt>println</tt>, which takes both
        values as parameters (instance methods always implicitly take an
        instance reference as their first argument).
      </p>

      <p>
        Instructions, other data structures within the class file, and
        constants themselves may refer to constants in the constant pool.
        Such references are implemented via fixed indexes encoded directly
        into the instructions. This is illustrated for some items of the
        figure, which are emphasized with a surrounding box.
      </p>

      <p>
        For example, the <tt>invokevirtual</tt> instruction refers to a
        <tt>MethodRef</tt> constant that contains information about the
        name of the called method, the signature (i.e., the encoded
        argument and return types), and the class to which the method
        belongs. In fact, as emphasized by the boxed value, the
        <tt>MethodRef</tt> constant itself just refers to other entries
        holding the real data, e.g., it refers to a
        <tt>ConstantClass</tt> entry containing a symbolic reference to
        the class <tt>java.io.PrintStream</tt>.
        To keep the class file compact, such constants are typically
        shared by different instructions and other constant pool
        entries. Similarly, a field is represented by a <tt>Fieldref</tt>
        constant that includes information about the name, the type and
        the containing class of the field.
      </p>

      <p>
        The constant pool basically holds the following types of
        constants: references to methods, fields and classes, strings,
        integers, floats, longs, and doubles.
      </p>
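
      <p>
        The chain of indexes behind a method reference can be sketched
        with a small hand-built pool. This is only a simplified model:
        the class <tt>ConstantPoolSketch</tt>, its array layout and the
        entry numbers are assumptions made for the example, and only the
        tag values are taken from the JVM specification.
      </p>

      <source><![CDATA[
```java
final class ConstantPoolSketch {
    // Tag values as defined by the JVM specification.
    static final int CLASS = 7, METHODREF = 10, NAME_AND_TYPE = 12;

    // Simplified pool: a String models a string constant, an int[] models
    // {tag, firstIndex, secondIndex}. Index 0 is unused, as in real class files.
    static final Object[] POOL = {
        null,
        new int[] { METHODREF, 2, 3 },      // #1: class #2, name and type #3
        new int[] { CLASS, 4, 0 },          // #2: class name #4
        new int[] { NAME_AND_TYPE, 5, 6 },  // #3: name #5, signature #6
        "java/io/PrintStream",              // #4
        "println",                          // #5
        "(Ljava/lang/String;)V",            // #6
    };

    static String text(int index) {
        return (String) POOL[index];
    }

    /** Follows the index chain of a method reference down to its strings. */
    static String resolveMethod(int index) {
        int[] ref = (int[]) POOL[index];
        if (ref[0] != METHODREF) {
            throw new IllegalArgumentException("#" + index + " is not a method reference");
        }
        int[] owner = (int[]) POOL[ref[1]];
        int[] nameAndType = (int[]) POOL[ref[2]];
        return text(owner[1]) + "." + text(nameAndType[1]) + " " + text(nameAndType[2]);
    }

    public static void main(String[] args) {
        // Prints: java/io/PrintStream.println (Ljava/lang/String;)V
        System.out.println(resolveMethod(1));
    }
}
```
]]></source>

      <p>
        Because entries only point at each other by index, a second
        method of <tt>java.io.PrintStream</tt> could reuse entries #2
        and #4, which is the kind of sharing that keeps class files
        compact.
      </p>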

    </subsection>

    <subsection name="Byte code instruction set">
      <p>
        The JVM is a stack-oriented interpreter that creates a local stack
        frame of fixed size for every method invocation. The size of the
        local stack has to be computed by the compiler. Values may also be
        stored intermediately in a frame area containing <em>local
        variables</em>, which can be used like a set of registers. These
        local variables are numbered from 0 to 65535, i.e., a method may
        have at most 65536 local variables. The stack frames of the
        caller and the callee overlap, i.e., the caller pushes arguments
        onto the operand stack and the called method receives them in
        local variables.
      </p>

      <p>
        The byte code instruction set currently consists of 212
        instructions; 44 opcodes are marked as reserved and may be used
        for future extensions or intermediate optimizations within the
        Virtual Machine. The instruction set can be roughly grouped as
        follows:
      </p>

      <p>
        <b>Stack operations:</b> Constants can be pushed onto the stack
        either by loading them from the constant pool with the
        <tt>ldc</tt> instruction or with special "short-cut"
        instructions where the operand is encoded into the instruction,
        e.g., <tt>iconst_0</tt> or <tt>bipush</tt> (push byte value).
      </p>

      <p>
        <b>Arithmetic operations:</b> The instruction set of the Java
        Virtual Machine distinguishes its operand types by using
        different instructions to operate on values of a specific type.
        Arithmetic operations starting with <tt>i</tt>, for example,
        denote integer operations; <tt>iadd</tt>, for instance, adds two
        integers and pushes the result back onto the stack. The Java
        types <tt>boolean</tt>, <tt>byte</tt>, <tt>short</tt>, and
        <tt>char</tt> are handled as integers by the JVM.
      </p>

      <p>
        <b>Control flow:</b> There are branch instructions like
        <tt>goto</tt> and <tt>if_icmpeq</tt>, which compares two integers
        for equality. There is also a pair of instructions, <tt>jsr</tt>
        (jump to sub-routine) and <tt>ret</tt>, that is used to implement
        the <tt>finally</tt> clause of <tt>try-catch</tt> blocks.
        Exceptions may be thrown with the <tt>athrow</tt> instruction.
        Branch targets are coded as offsets from the current byte code
        position, i.e., with an integer number.
      </p>

      <p>
        <b>Load and store operations:</b> Local variables are read and
        written with instructions like <tt>iload</tt> and
        <tt>istore</tt>. There are also array operations like
        <tt>iastore</tt>, which stores an integer value into an array.
      </p>

      <p>
        <b>Field access:</b> The value of an instance field may be
        retrieved with <tt>getfield</tt> and written with
        <tt>putfield</tt>. For static fields, there are
        <tt>getstatic</tt> and <tt>putstatic</tt> counterparts.
      </p>

      <p>
        <b>Method invocation:</b> Static methods are called with
        <tt>invokestatic</tt>, while instance methods are bound
        virtually with the <tt>invokevirtual</tt> instruction. Super
        class methods and private methods are invoked with
        <tt>invokespecial</tt>. Interface methods are a special case and
        are invoked with <tt>invokeinterface</tt>.
      </p>

      <p>
        <b>Object allocation:</b> Class instances are allocated with the
        <tt>new</tt> instruction, arrays of basic type like
        <tt>int[]</tt> with <tt>newarray</tt>, arrays of references like
        <tt>String[][]</tt> with <tt>anewarray</tt> or
        <tt>multianewarray</tt>.
      </p>

      <p>
        <b>Conversion and type checking:</b> For stack operands of basic
        type there are casting operations like <tt>f2i</tt>, which
        converts a float value into an integer. The validity of a type
        cast may be checked with <tt>checkcast</tt>, and the
        <tt>instanceof</tt> operator maps directly to the instruction of
        the same name.
      </p>

      <p>
        Most instructions have a fixed length, but there are also some
        variable-length instructions: In particular, the
        <tt>lookupswitch</tt> and <tt>tableswitch</tt> instructions, which
        are used to implement <tt>switch()</tt> statements. Since the
        number of <tt>case</tt> clauses may vary, these instructions
        contain a variable number of branch targets.
      </p>

      <p>
        We will not list all byte code instructions here, since these are
        explained in detail in the <a
              href="http://docs.oracle.com/javase/specs/">JVM
        specification</a>. The opcode names are mostly self-explanatory,
        so understanding the following code examples should be fairly
        intuitive.
      </p>

    </subsection>

    <subsection name="Method code">
      <p>
        Non-abstract (and non-native) methods contain an attribute
        "<tt>Code</tt>" that holds the following data: the maximum size of
        the method's operand stack, the number of local variables and an
        array of byte code instructions. Optionally, it may also contain
        information about the names of local variables and source file
        line numbers that can be used by a debugger.
      </p>

      <p>
        Whenever an exception is raised during execution, the JVM performs
        exception handling by looking into a table of exception
        handlers. The table marks handlers, i.e., code chunks, as
        responsible for exceptions of certain types that are raised within
        a given area of the byte code. When there is no appropriate
        handler, the exception is propagated back to the caller of the
        method. The handler information is itself stored in an attribute
        contained within the <tt>Code</tt> attribute.
      </p>

    </subsection>

    <subsection name="Byte code offsets">
      <p>
        Targets of branch instructions like <tt>goto</tt> are encoded as
        relative offsets in the array of byte codes. Exception handlers
        and local variables refer to absolute addresses within the byte
        code. The former contain references to the start and the end of
        the <tt>try</tt> block, and to the handler code. The latter mark
        the range in which a local variable is valid, i.e., its scope.
        This makes it difficult to insert or delete code areas on this
        level of abstraction, since one has to recompute the offsets
        every time and update the referring objects. We will see
        in <a href="bcel-api.html#ClassGen">section 3.3</a> how <font
              face="helvetica,arial">BCEL</font> remedies this restriction.
      </p>
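
      <p>
        The bookkeeping this implies can be sketched in a few lines. The
        class <tt>BranchOffsets</tt> and its methods are hypothetical
        helpers written for this example, and the sketch assumes that an
        insertion at position <tt>at</tt> shifts every address at or
        after <tt>at</tt>. The numbers model a branch at position 1 with
        target 8 and a branch at position 5 with target 16.
      </p>

      <source><![CDATA[
```java
final class BranchOffsets {
    /** Absolute target of a branch at position pc with the given relative offset. */
    static int target(int pc, int offset) {
        return pc + offset;
    }

    /** New position of a byte code address after len bytes are inserted at position at. */
    static int shift(int address, int at, int len) {
        return address >= at ? address + len : address;
    }

    /** Relative offset that the branch needs after the insertion. */
    static int patchedOffset(int pc, int offset, int at, int len) {
        return shift(target(pc, offset), at, len) - shift(pc, at, len);
    }

    public static void main(String[] args) {
        // A branch at position 1 to position 8 is stored as the relative offset 7.
        System.out.println(target(1, 7));               // 8
        // Inserting 3 bytes at position 4 moves only the target: new offset 10.
        System.out.println(patchedOffset(1, 7, 4, 3));  // 10
        // A branch at position 5 moves together with its target: offset stays 11.
        System.out.println(patchedOffset(5, 11, 4, 3)); // 11
    }
}
```
]]></source>

      <p>
        Every branch that jumps across the insertion point has to be
        patched this way, and the absolute addresses stored for
        exception handlers and local variable ranges need the same
        shift.
      </p>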

    </subsection>

    <subsection name="Type information">
      <p>
        Java is a type-safe language and the information about the types
        of fields, local variables, and methods is stored in so-called
        <em>signatures</em>. These are strings stored in the constant pool
        and encoded in a special format. For example, the argument and
        return types of the <tt>main</tt> method
      </p>

      <p align="center">
        <source>public static void main(String[] argv)</source>
      </p>

      <p>
        are represented by the signature
      </p>

      <p align="center">
        <source>([Ljava/lang/String;)V</source>
      </p>

      <p>
        Classes are internally represented by strings like
        <tt>"java/lang/String"</tt>, basic types like <tt>float</tt> by an
        integer number. Within signatures, basic types are represented by
        single characters, e.g., <tt>I</tt> for <tt>int</tt>, and class
        names are enclosed between <tt>L</tt> and <tt>;</tt>, as in
        <tt>Ljava/lang/String;</tt>. Arrays are denoted with a <tt>[</tt>
        in front of the element type.
      </p>
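
      <p>
        One way to experiment with this encoding is the JDK class
        <tt>java.lang.invoke.MethodType</tt>, whose
        <tt>toMethodDescriptorString()</tt> method produces such strings
        from <tt>Class</tt> objects. The wrapper class
        <tt>SignatureDemo</tt> and its <tt>signature</tt> method are
        only illustrative names chosen for this sketch.
      </p>

      <source><![CDATA[
```java
import java.lang.invoke.MethodType;

final class SignatureDemo {
    /** Builds the signature string for the given return and parameter types. */
    static String signature(Class<?> returnType, Class<?>... parameterTypes) {
        return MethodType.methodType(returnType, parameterTypes).toMethodDescriptorString();
    }

    public static void main(String[] args) {
        // void main(String[] argv)
        System.out.println(signature(void.class, String[].class)); // ([Ljava/lang/String;)V
        // int fac(int n)
        System.out.println(signature(int.class, int.class));       // (I)I
    }
}
```
]]></source>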

    </subsection>

    <subsection name="Code example">
      <p>
        The following example program prompts for a number and prints its
        factorial. The <tt>readLine()</tt> method reading from the
        standard input may raise an <tt>IOException</tt>, and if a
        malformed number is passed to <tt>parseInt()</tt>, it throws a
        <tt>NumberFormatException</tt>. Thus, the critical area of code
        must be encapsulated in a <tt>try-catch</tt> block.
      </p>

      <source>
import java.io.*;

public class Factorial {
    private static BufferedReader in = new BufferedReader(new InputStreamReader(System.in));

    public static int fac(int n) {
        return (n == 0) ? 1 : n * fac(n - 1);
    }

    public static int readInt() {
        int n = 4711;
        try {
            System.out.print("Please enter a number&gt; ");
            n = Integer.parseInt(in.readLine());
        } catch (IOException e1) {
            System.err.println(e1);
        } catch (NumberFormatException e2) {
            System.err.println(e2);
        }
        return n;
    }

    public static void main(String[] argv) {
        int n = readInt();
        System.out.println("Factorial of " + n + " is " + fac(n));
    }
}
      </source>

      <p>
        This code example typically compiles to the following chunks of
        byte code:
      </p>

      <source>
        0:  iload_0
        1:  ifne            #8
        4:  iconst_1
        5:  goto            #16
        8:  iload_0
        9:  iload_0
        10: iconst_1
        11: isub
        12: invokestatic    Factorial.fac (I)I (12)
        15: imul
        16: ireturn

        LocalVariable(start_pc = 0, length = 16, index = 0:int n)
      </source>

      <p><b>fac():</b>
        The method <tt>fac</tt> has only one local variable, the argument
        <tt>n</tt>, stored at index 0. This variable's scope ranges from
        the start of the byte code sequence to the very end. If the value
        of <tt>n</tt> (the value fetched with <tt>iload_0</tt>) is not
        equal to 0, the <tt>ifne</tt> instruction branches to the byte
        code at offset 8; otherwise a 1 is pushed onto the operand stack
        and the control flow branches to the final return. For ease of
        reading, the offsets of the branch instructions, which are
        actually relative, are displayed as absolute addresses in these
        examples.
      </p>

      <p>
        If recursion has to continue, the arguments for the multiplication
        (<tt>n</tt> and <tt>fac(n - 1)</tt>) are evaluated and the results
        pushed onto the operand stack. After the multiplication has been
        performed, the function returns the computed value from the top
        of the stack.
      </p>
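
      <p>
        To make the stack behaviour tangible, the listing for
        <tt>fac</tt> can be replayed by hand in plain Java. The
        following is a toy sketch, not how a real Virtual Machine is
        implemented: the class <tt>FacInterpreter</tt> is invented for
        this example, each <tt>case</tt> label stands for the byte code
        offset of one instruction, and a fresh pair of arrays models the
        stack frame created for every invocation.
      </p>

      <source><![CDATA[
```java
final class FacInterpreter {
    static int fac(int arg) {
        int[] locals = { arg };   // local variable 0 holds n
        int[] stack = new int[3]; // operand stack; this method never needs more than 3 slots
        int sp = 0;               // number of values currently on the stack
        int pc = 0;               // byte code offset of the next instruction
        while (true) {
            switch (pc) {
                case 0:  stack[sp++] = locals[0]; pc = 1; break;    // iload_0
                case 1:  pc = (stack[--sp] != 0) ? 8 : 4; break;    // ifne #8
                case 4:  stack[sp++] = 1; pc = 5; break;            // iconst_1
                case 5:  pc = 16; break;                            // goto #16
                case 8:  stack[sp++] = locals[0]; pc = 9; break;    // iload_0
                case 9:  stack[sp++] = locals[0]; pc = 10; break;   // iload_0
                case 10: stack[sp++] = 1; pc = 11; break;           // iconst_1
                case 11: {                                          // isub
                    int b = stack[--sp];
                    int a = stack[--sp];
                    stack[sp++] = a - b;
                    pc = 12;
                    break;
                }
                case 12:                                            // invokestatic fac
                    // Pops the argument and pushes the result of the call.
                    stack[sp - 1] = fac(stack[sp - 1]);
                    pc = 15;
                    break;
                case 15: {                                          // imul
                    int b = stack[--sp];
                    int a = stack[--sp];
                    stack[sp++] = a * b;
                    pc = 16;
                    break;
                }
                case 16: return stack[--sp];                        // ireturn
                default: throw new IllegalStateException("no instruction at offset " + pc);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(fac(5)); // 120
    }
}
```
]]></source>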

      <source>
        0:  sipush        4711
        3:  istore_0
        4:  getstatic     java.lang.System.out Ljava/io/PrintStream;
        7:  ldc           "Please enter a number&gt; "
        9:  invokevirtual java.io.PrintStream.print (Ljava/lang/String;)V
        12: getstatic     Factorial.in Ljava/io/BufferedReader;
        15: invokevirtual java.io.BufferedReader.readLine ()Ljava/lang/String;
        18: invokestatic  java.lang.Integer.parseInt (Ljava/lang/String;)I
        21: istore_0
        22: goto          #44
        25: astore_1
        26: getstatic     java.lang.System.err Ljava/io/PrintStream;
        29: aload_1
        30: invokevirtual java.io.PrintStream.println (Ljava/lang/Object;)V
        33: goto          #44
        36: astore_1
        37: getstatic     java.lang.System.err Ljava/io/PrintStream;
        40: aload_1
        41: invokevirtual java.io.PrintStream.println (Ljava/lang/Object;)V
        44: iload_0
        45: ireturn

        Exception handler(s) =
        From    To      Handler Type
        4       22      25      java.io.IOException(6)
        4       22      36      NumberFormatException(10)
      </source>

      <p><b>readInt():</b> First the local variable <tt>n</tt> (at index 0)
        is initialized to the value 4711. The next instruction,
        <tt>getstatic</tt>, loads the reference held by the static
        <tt>System.out</tt> field onto the stack. Then a string is loaded
        and printed, and a number is read from the standard input and
        assigned to <tt>n</tt>.
      </p>

      <p>
        If one of the called methods (<tt>readLine()</tt> and
        <tt>parseInt()</tt>) throws an exception, the Java Virtual Machine
        calls one of the declared exception handlers, depending on the
        type of the exception. The <tt>try</tt>-clause itself does not
        produce any code; it merely defines the range in which the
        subsequent handlers are active. In the example, the specified
        source code area maps to a byte code area ranging from offset 4
        (inclusive) to 22 (exclusive). If no exception has occurred
        ("normal" execution flow), the <tt>goto</tt> instruction at
        offset 22 branches past the handler code. There the value of
        <tt>n</tt> is loaded and returned.
      </p>

      <p>
        The handler for <tt>java.io.IOException</tt> starts at
        offset 25. It simply prints the error and branches back to the
        normal execution flow, i.e., as if no exception had occurred.
      </p>
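
      <p>
        The handler lookup described above can be approximated as
        follows. This is a simplified sketch under the assumption that
        the table is searched from top to bottom and the first matching
        row wins; the class <tt>ExceptionTableSketch</tt> and its
        members are illustrative names, and the rows are copied from the
        exception handler table of <tt>readInt()</tt>.
      </p>

      <source><![CDATA[
```java
import java.io.IOException;

final class ExceptionTableSketch {
    // One row of the table: [from, to) is the guarded range, handler the jump target.
    static final class Row {
        final int from, to, handler;
        final Class<? extends Throwable> type;

        Row(int from, int to, int handler, Class<? extends Throwable> type) {
            this.from = from;
            this.to = to;
            this.handler = handler;
            this.type = type;
        }
    }

    // The exception handler table of readInt().
    static final Row[] TABLE = {
        new Row(4, 22, 25, IOException.class),
        new Row(4, 22, 36, NumberFormatException.class),
    };

    /** Returns the handler offset, or -1 if the exception propagates to the caller. */
    static int findHandler(int pc, Throwable thrown) {
        for (Row row : TABLE) {
            boolean inRange = row.from <= pc && pc < row.to; // "to" is exclusive
            if (inRange && row.type.isInstance(thrown)) {
                return row.handler;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(findHandler(15, new IOException()));           // 25
        System.out.println(findHandler(18, new NumberFormatException())); // 36
        System.out.println(findHandler(18, new IllegalStateException())); // -1
    }
}
```
]]></source>

      <p>
        The last call returns -1 because no row covers an
        <tt>IllegalStateException</tt>; in that case the exception would
        be propagated back to the caller of the method.
      </p>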

    </subsection>
    </section>
  </body>

</document>