Attention: The Purifier project has been re-tasked as a commercial product, Purifier2. The original, open source, Purifier1 version is still available. However, for technical reasons, it cannot be made to guarantee that the results are always identical to Sun's preverifier. The Purifier2 commercial version, which is still under development, uses a significantly different architecture to provide better compatibility. See the Project Status section of the Home page for more details.
The following documentation is from the original Purifier1 project and is provided for those who are interested in the project despite its current status.
The Developers Guide provides information on the inner workings of the Purifier1 for those who wish to understand how it works or develop it further. For information on how to use the Purifier1, see the Users Guide.
Much of the information in this section is included in the copious "/* ... */" documentation in the actual code. However, such documentation is harder to follow because all of the special cases and details are mixed in with the main routines. The following documentation, though not as detailed as the code documentation, is written in such a way as to make the overall algorithm clear before digging into the details.
The Connected Limited Device Configuration Specification provides details of the preverification process. The Purifier1 uses Apache BCEL to read and manipulate Java class files according to the CLDC specification. Other J2ME related documentation can be found at: http://java.sun.com/j2me/docs/.
The largest task in J2ME preverification is to generate a Stack Map. The basic concept of Stack Map generation is fairly simple, but there are lots of special cases and esoteric issues that make the details more complex than the basic concept. Consequently, this documentation starts with the big picture and works down into the special cases.
A Stack Map is essentially a table of snapshots of the state of the Virtual Machine at each branch instruction of the corresponding method. Consequently, Stack Maps are created on a method by method basis. Each of these VM states consists of a record of the data types that are stored in the operand stack and the local variables array of the VM at the corresponding branch instruction. The operand stack and local variables array are collectively known as the frame. Details regarding the operand stack, local variables array and other VM related issues are available from The Java™ Virtual Machine Specification (note: the KVM is a subset of the JVM).
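For illustration only, the following sketch shows the information that a single Stack Map entry conceptually records. The class and field names here are hypothetical and are not the Purifier1's actual types:

    import java.util.List;

    // Hypothetical sketch of what one Stack Map entry records: the frame
    // state (local variable types plus operand stack types) at one branch target.
    class FrameSnapshot {
        final int byteCodeOffset;      // offset of the branch target being described
        final List<String> localTypes; // types in the local variables array
        final List<String> stackTypes; // types on the operand stack, bottom first

        FrameSnapshot(int byteCodeOffset, List<String> localTypes, List<String> stackTypes) {
            this.byteCodeOffset = byteCodeOffset;
            this.localTypes = localTypes;
            this.stackTypes = stackTypes;
        }
    }

A complete Stack Map for a method is then simply a list of such snapshots, sorted by bytecode offset.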
Generation of a Stack Map requires an analysis that can go through the instructions in a method in the correct order, determine the effects that those instructions have on the frame and resolve conflicts between frames that are generated by different paths through the instructions. This process is known as Data Flow Analysis. Determining the correct flow of control through the instructions is known as Control Flow Analysis and is the basis upon which the DFA is built. Most implementations of verifiers do a complete CFA that maps out all of the possible execution paths through the code and use a DFA process that determines what the data types should be for every instruction in that map. The Purifier1 uses a reduced CFA/DFA process that traverses most branches of the CFA only once rather than multiple times and stores type data only at the branch points where a Stack Map entry would be required.
The algorithm used by the Purifier1 project is much faster and less resource intensive than an algorithm that uses the full CFA/DFA approach. However, it is believed that Sun's preverifier is actually just a modified version of their regular verifier, whose algorithm is a full CFA/DFA and is described in Section 4.9.2 of The Java™ Virtual Machine Specification. Although the algorithms do produce the same Stack Maps most of the time, various idiosyncrasies of the different algorithms make it impossible to guarantee that they produce identical results all of the time. This is the primary reason that the Purifier1 development was halted and the Purifier2 was started.
Control Flow Analysis is the core of Stack Map generation. Most implementations of CFA go through every possible execution path of the instructions. For a full J2SE verifier, this is required because such a verifier needs to be able to guarantee the integrity of the code with absolute certainty. However, the Purifier1 uses a simplified CFA that attempts to visit each branch target only once. This reduces the work that the CFA has to do, simplifies the DFA and reduces memory requirements as well. The Purifier1's CFA can get away with this because it assumes that the incoming class file has already been checked with a J2SE verifier or has been created with a trusted tool such as the javac compiler. The Purifier1's job is not to verify the integrity of a class file, but rather to take an existing, trusted, class file and add a Stack Map to it (along with inlining subroutines and checking CLDC specific requirements). Consequently, the Purifier1 does not have to check for integrity. Instead, it can simply analyze the instructions and calculate the Stack Map. It can even use the assumption of integrity to help it resolve conflicting frames, because it can use the knowledge that the data must be consistent and that any inconsistencies it detects must, therefore, be inconsequential and safe to discard. Although this does work extremely well, as pointed out above, it is not possible to guarantee that it will always produce the same results as Sun's preverifier.
The CFA works by stepping through each instruction one at a time. If an instruction happens to be a branch instruction (one that can cause execution to go to something other than the next instruction), then it pushes all of the BranchTargets of the instruction onto a BranchStack so they can be examined later. The CFA continues to step through the instructions like this until it reaches an instruction, such as return, that terminates execution. At that point, the CFA examines the BranchStack and repeats this process for each BranchTarget on the BranchStack. In this way, the CFA visits every branch of the execution tree.
The CFA reduces the amount of work it has to do by using some extra code in the BranchStack class to keep track of which branches have already been visited. Whenever the main CFA loop attempts to push a BranchTarget onto the BranchStack, it is checked against the existing BranchTargets on the stack. If the target has been visited before, then it is not added to the stack a second time. This prevents the CFA from going through each branch target for every branch source (instruction that points to a branch target).
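A minimal sketch of this driver loop and the visited check is shown below. Instruction offsets stand in for real instruction handles, and isTerminal, isBranch, isUnconditional, targetsOf and next are assumed classification helpers rather than actual Purifier1 methods:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    abstract class CfaSketch {
        // BranchStack behaviour: push() silently drops targets seen before.
        private final Deque<Integer> branchStack = new ArrayDeque<>();
        private final Set<Integer> visited = new HashSet<>();

        void push(int target) {
            if (visited.add(target)) {
                branchStack.push(target);
            }
        }

        void analyze() {
            push(0); // the normal method entry point
            while (!branchStack.isEmpty()) {
                int pc = branchStack.pop();
                while (!isTerminal(pc)) {              // e.g. return or athrow
                    if (isBranch(pc)) {
                        for (int target : targetsOf(pc)) {
                            push(target);              // duplicates filtered above
                        }
                        if (isUnconditional(pc)) {
                            break;                     // goto never falls through
                        }
                    }
                    pc = next(pc);                     // fall through
                }
            }
        }

        // Stand-ins for instruction classification against the real code array.
        abstract boolean isTerminal(int pc);
        abstract boolean isBranch(int pc);
        abstract boolean isUnconditional(int pc);
        abstract List<Integer> targetsOf(int pc);
        abstract int next(int pc);
    }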
Note that this mechanism does not absolutely guarantee that each instruction is checked only once. Some branches are in the middle of normal code flow that can "fall through" to the next instruction, which just happens to be the target of some other branch instruction. Consequently, some sections of the code will be visited more than once by the CFA... once for the normal fall through case and again for each branch target that is upstream of the code section. This is not a problem, but is important when accounting for data in the Data Flow Analysis.
The CFA is complicated by the fact that there are several possible entry points to any method. The primary entry point is normal method invocation, which starts at the first instruction in the method code attribute. However, caught exceptions and finally blocks are also included in the method code block and represent alternate paths through the method. To deal with these cases, the CFA goes through the exception table attribute for the method and determines what these alternate entry points are. Then, the CFA creates BranchTargets for each of these entry points and adds them to the BranchStack before starting the normal method entry point. In this way, these alternate entry points are checked by the normal flow of the CFA when it pops the corresponding BranchTargets off of the BranchStack.
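In BCEL terms, this seeding step might look roughly like the following sketch, which reuses the CfaSketch class from above; Code.getExceptionTable and CodeException.getHandlerPC are BCEL methods:

    import org.apache.bcel.classfile.Code;
    import org.apache.bcel.classfile.CodeException;

    // Sketch: seed the BranchStack with one target per exception handler
    // before the CFA starts walking from the normal entry point.
    void seedExceptionEntryPoints(Code code, CfaSketch cfa) {
        for (CodeException handler : code.getExceptionTable()) {
            cfa.push(handler.getHandlerPC()); // each handler is an alternate entry
        }
    }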
There is a complication to this complication. The Data Flow Analysis cannot allow the CFA to proceed through one of these alternate entry points until the corresponding try block has been encountered, so that the BranchTarget can be initialized. In most cases, this occurs naturally through the normal action of the CFA. However, in the case of try-catch blocks that are nested inside of other catch blocks, the initialization will not have occurred when the CFA gets to the corresponding entry point. Consequently, the CFA has a mechanism to detect uninitialized BranchTargets and move them to the bottom of the BranchStack for later processing. This strategy allows the outer catch block to be processed so that the BranchTarget that points to the inner catch block can be initialized before it is processed.
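The following sketch illustrates the detect-and-defer behaviour, assuming a BranchTarget that holds an offset plus a Frame that remains null until the corresponding try block has been seen (the Frame class itself is sketched in the DFA section below):

    import java.util.Deque;

    // Assumed shape of a branch target: its offset plus the Frame captured at
    // the branch source (null until the corresponding try block is reached).
    class BranchTarget {
        int offset;
        Frame frame;
    }

    // Sketch: pop the next processable target, "squeezing" uninitialized ones
    // to the bottom of the stack so they are retried after other paths run.
    BranchTarget popInitialized(Deque<BranchTarget> stack) {
        int deferrals = stack.size();
        while (deferrals-- > 0) {
            BranchTarget target = stack.pop();
            if (target.frame != null) {
                return target;          // ready to process now
            }
            stack.addLast(target);      // defer: move to the bottom
        }
        return null;                    // everything still uninitialized
    }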
Since Data Flow Analysis is concerned with tracking how data types in the frame change during program execution, it should come as no surprise that it is intimately entwined with the Control Flow Analysis. The main part of the DFA is a call to the updateFrame method just before the CFA does its check to see whether the current instruction is a branch or not. The updateFrame method looks up the current instruction in a large switch statement to determine how that instruction affects the frame.
The current frame is stored in an instance of the Frame class, which contains instances of the OperandStack and LocalVariables classes. The current Frame is passed to the updateFrame method, which works directly with the OperandStack and LocalVariables contained inside the Frame to update its state according to what the current instruction is intended to do.
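The following is a rough, simplified sketch of these classes and the updateFrame dispatch. It uses plain strings for types where the real code uses StackMapTypes, and shows only a handful of representative opcodes (the opcode constants are from BCEL's Constants interface):

    import java.util.ArrayDeque;
    import java.util.Deque;

    import org.apache.bcel.Constants;

    // Simplified stand-ins for the project's OperandStack, LocalVariables
    // and Frame classes, using strings instead of StackMapTypes.
    class OperandStack {
        private final Deque<String> types = new ArrayDeque<>();
        void push(String t) { types.push(t); }
        String pop() { return types.pop(); }
    }

    class LocalVariables {
        private final String[] slots;
        LocalVariables(int maxLocals) { slots = new String[maxLocals]; }
        String get(int i) { return slots[i]; }
        void set(int i, String t) { slots[i] = t; }
        LocalVariables copy() {            // copy of the slot array
            LocalVariables c = new LocalVariables(slots.length);
            System.arraycopy(slots, 0, c.slots, 0, slots.length);
            return c;
        }
    }

    class Frame {
        final OperandStack stack = new OperandStack();
        LocalVariables locals;             // null for uninitialized handler frames
        Frame(int maxLocals) { locals = new LocalVariables(maxLocals); }
    }

    // Sketch of the updateFrame dispatch: the real switch covers the entire
    // instruction set; only a few representative opcodes are shown here.
    void updateFrame(int opcode, Frame frame) {
        switch (opcode) {
            case Constants.ICONST_0:
                frame.stack.push("int");               // push an int constant
                break;
            case Constants.ILOAD_1:
                frame.stack.push(frame.locals.get(1)); // load local slot 1
                break;
            case Constants.ISTORE_1:
                frame.locals.set(1, frame.stack.pop()); // store into local slot 1
                break;
            case Constants.POP:
                frame.stack.pop();                     // discard top of stack
                break;
            default:
                throw new IllegalStateException("unhandled opcode " + opcode);
        }
    }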
When the CFA reaches a branching instruction, a copy of the Frame is pushed onto the BranchStack as a part of the BranchTarget object. This allows the DFA, when the BranchTarget is popped off of the BranchStack, to reset the new "current" Frame to be identical to what it was when the branching instruction was first encountered. In this way, the state of the Frame is correctly set for each target that is investigated.
Whenever a branch instruction is detected and the current target and Frame are pushed onto the BranchStack, the Frame is also put into a FrameHash with the target as a key. When the CFA is completed, the FrameHash is sorted by offset and used to create the org.apache.bcel.classfile.StackMap, which is the final product of the StackMap generator.
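A sketch of that final step is shown below. The conversion of an individual Frame into a BCEL StackMapEntry depends on the ConstantPool and is left here as an assumed makeEntry helper:

    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.bcel.classfile.StackMapEntry;

    abstract class StackMapBuilder {
        // Converting one Frame into a StackMapEntry is elided (it depends on
        // the ConstantPool); treated here as an assumed helper.
        abstract StackMapEntry makeEntry(int offset, Frame frame);

        // Sketch: sort the recorded frames by bytecode offset and convert
        // each one into an entry for the final StackMap attribute.
        StackMapEntry[] buildEntries(Map<Integer, Frame> frameHash) {
            TreeMap<Integer, Frame> sorted = new TreeMap<>(frameHash); // by offset
            StackMapEntry[] entries = new StackMapEntry[sorted.size()];
            int i = 0;
            for (Map.Entry<Integer, Frame> entry : sorted.entrySet()) {
                entries[i++] = makeEntry(entry.getKey(), entry.getValue());
            }
            return entries;
        }
    }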
Before the start of the CFA, the current Frame must be initialized. This is handled by the Frame class constructor, which delegates the job of putting the "this" object and the method arguments onto the LocalVariables array to the LocalVariables constructor.
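A sketch of that initialization, using BCEL's Type.getArgumentTypes to parse the method signature; note that long and double arguments occupy two slots each:

    import org.apache.bcel.generic.Type;

    // Sketch of initial-frame setup: "this" in slot 0 for instance methods,
    // followed by the declared arguments.
    void initLocals(LocalVariables locals, String className,
                    String methodSignature, boolean isStatic) {
        int slot = 0;
        if (!isStatic) {
            locals.set(slot++, className);      // the "this" reference
        }
        for (Type arg : Type.getArgumentTypes(methodSignature)) {
            locals.set(slot, arg.toString());
            slot += arg.getSize();              // 1, or 2 for long/double
        }
    }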
As mentioned in the CFA section above, initializing the Frames in the case of the exception entry points is slightly more complicated. Since the data on the LocalVariables array is determined by what is in the LocalVariables of the Frame at the start of the corresponding try block, it is not possible to initialize the LocalVariables in the Frames that are pushed onto the BranchStack at the beginning of the CFA. The Frame constructor that is used in these cases leaves the LocalVariables reference set to null so that the CFA mechanism that pops BranchTargets off of the BranchStack can detect an uninitialized Frame and 'squeeze' it back onto the bottom of the BranchStack for later processing.
For the DFA to initialize these uninitialized Frames, a mechanism is added to the CFA to detect the beginning of a try block and, when it is detected, use the LocalVariables of the current Frame to update those in the corresponding uninitialized Frame. This is accomplished by using a FrameStore to hold a reference to the Frame that is stored in the BranchStack for each exception, along with an index that matches the offset, given in the exception table, of the first instruction of the try block. When the current instruction matches that offset, the Frame for the exception handler BranchTarget can be fetched from the FrameStore and updated with the LocalVariables from the current Frame.
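The following sketch illustrates the FrameStore idea, reusing the Frame and LocalVariables sketches from above. It is simplified to one handler per try-block offset, where the real code tracks one entry per exception-table row:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: each handler Frame is registered under the offset of the first
    // instruction of its try block; when the CFA reaches that offset, the
    // handler Frame inherits the current LocalVariables.
    class FrameStore {
        private final Map<Integer, Frame> byTryStart = new HashMap<>();

        void register(int tryStartOffset, Frame handlerFrame) {
            byTryStart.put(tryStartOffset, handlerFrame);
        }

        // Called by the DFA for every instruction offset the CFA visits.
        void onInstruction(int offset, Frame current) {
            Frame handler = byTryStart.get(offset);
            if (handler != null && handler.locals == null) {
                handler.locals = current.locals.copy(); // initialize handler frame
            }
        }
    }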
Since it is possible for several different paths through the instructions to lead to the same target, it is also possible that the Frames for those paths might not be exactly the same. In such cases, there is a conflict that has to be resolved. This is similar to the "merging" process that is discussed in Section 4.9.2 of The Java™ Virtual Machine Specification. However, the Purifier1 uses StackMapTypes at branch points only while the JVM uses JVM types at every instruction. This leads to some implementation differences.
There are four types of conflict situations: compatible subtypes, Null references, junk data and unknown objects.
When one data type is a subtype of the other, then it's perfectly valid to use that subtype in place of the intended type. For example, the Vector.add method takes an argument of type Object. Since String is a subclass of Object, it is legitimate to add a String, an Integer or any other subclass of Object to a Vector.
This can lead to the situation where two Frames going to the same target appear to be different because they are not exactly the same, but are, nonetheless, compatible because both contain valid types where one just happens to be a subtype of the other. So, if the Purifier1 encounters two different Frames for the same target that have data types that are not identical, it must check to see if one is a subtype of the other. If it is, then the more general of the two is the one that is chosen for storage.
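A sketch of this rule is shown below; isSubtypeOf is an assumed helper (in practice, class hierarchy lookups can be made through BCEL's Repository):

    // Sketch of the subtype rule: keep the more general of two related
    // reference types.
    abstract class TypeMerger {
        abstract boolean isSubtypeOf(String sub, String sup);

        String mergeTypes(String a, String b) {
            if (a.equals(b)) return a;
            if (isSubtypeOf(a, b)) return b;   // a extends b: b is more general
            if (isSubtypeOf(b, a)) return a;   // b extends a: a is more general
            return null;                       // unrelated: junk (see the junk data case below)
        }
    }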
These conflicts are detected by the BranchStack.push method. Since the BranchStack.push method is the first place that Frames are stored and it is already checking existing Frames against new contenders for the same target, it is the natural place to also check for conflicts in these Frames. The actual conflict resolution code is in the resolveConflicts method of the LocalVariables class.
The FrameStore, FrameHash and the try block detection code also check for conflicts as a secondary precaution. This is a bit of overkill, but since the project is in the early stages of development, it is acceptable for now.
Common super class merging as described in the JVM Specification has not been attempted in the Purifier1 and might be the source of some of the remaining problems.
This situation is similar to the compatible subtype situation except that a null reference is stored instead of the intended data type. In the runtime environment, it is perfectly acceptable to use a null reference as a placeholder for an object reference. However, when doing the DFA, it is not always possible to determine, at the time a null is pushed onto the stack, what data type it is supposed to represent. The LocalVariables.resolveConflicts method also checks for these cases and, when the challenging type is an object, will treat the Null type as if it were a subtype of that object and override the Null with the object (note: the capitalization is important).
Sometimes it is not possible to resolve Null objects in this way. One solution that is used inside of the updateFrame method is to use the information in the signature of invocation instructions, such as invokevirtual, to determine what the data type of a Null is supposed to be. For this to work correctly, the DFA must be careful to make sure that when Frames are copied for storage, the copy is not a deep copy. This is because the DFA needs to hold onto the actual StackMapType objects that are stored in the OperandStack and LocalVariables, so that changes made to them in the invokeInstruction method affect the same objects that are used to generate the final StackMap. One exception to this is when invokespecial is called on an <init> method of an ITEM_NewObject. In that case, the StackMapType on the top of the stack must be replaced with an ITEM_Object, rather than modified, to avoid changing stored references to the original.
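The following sketch illustrates the signature-based technique at an invoke instruction. TypeEntry and TypedStack are assumed stand-ins for the StackMapType-based operand stack, and two-word argument types are ignored for brevity:

    import org.apache.bcel.generic.Type;

    // Assumed mutable type entry, standing in for BCEL's StackMapType.
    interface TypeEntry {
        boolean isNull();
        void setType(String typeName);
    }

    interface TypedStack {
        TypeEntry peek(int depthFromTop); // 0 = top of stack
    }

    // Sketch: any Null still sitting in an argument position is retyped to
    // the declared parameter type. Because stored Frames share their type
    // objects (shallow copies), retyping here also fixes stored snapshots.
    void resolveNullArguments(TypedStack stack, String methodSignature) {
        Type[] args = Type.getArgumentTypes(methodSignature);
        // The last argument is on top of the stack.
        for (int i = args.length - 1, depth = 0; i >= 0; i--, depth++) {
            TypeEntry entry = stack.peek(depth);
            if (entry.isNull()) {
                entry.setType(args[i].toString()); // retype the shared object in place
            }
        }
    }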
Although this last solution solves most cases, there are cases where it does not work. At the time the Purifier1 project was halted, this problem was still unsolved. However, it should be noted that the Purifier actually gets Null references correct in many places where the JustIce Verifier does not, which is a good indication of the success of the above methodology.
When a path through the instructions uses an entry in the LocalVariables array to temporarily store transient data, the values are not cleared out when the code is done with that data. Although the stack must be treated more rigorously to avoid overflow, there is no requirement to clean up the LocalVariables. If the downstream code makes no use of this extra data, then it is junk that does not need to be on the StackMap because the checker doesn't have to check against it. It is, however, a problem for the Purifier1 because it looks like a conflict when one frame has different data than another. Since BranchStack.push is checking for conflicts, the LocalVariables.resolveConflicts method must also be able to deal with junk data. It does so by assuming that any time there is a null (not to be confused with a Null data type entry) in a LocalVariables entry in either of two contending frames, any other data in that entry must be junk. In this case, the stored Frame has that entry set to null. Also, if case #1 above finds that two objects are not subtypes of each other, then they both must be junk and, again, the stored Frame gets a null in the corresponding LocalVariables entry. Note that a J2SE verifier would attempt to merge these to a common super class instead.
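Sketched for a single LocalVariables slot, using the TypeMerger sketch from case #1 above, the junk rule looks like this:

    // Sketch of the junk rule for one slot: a null slot (no value at all,
    // distinct from the Null *type*) in either contending frame marks the
    // slot as junk, and unrelated reference types are likewise junked.
    String resolveSlot(TypeMerger merger, String stored, String incoming) {
        if (stored == null || incoming == null) {
            return null;                            // junk: omitted from the StackMap
        }
        return merger.mergeTypes(stored, incoming); // null result also means junk
    }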
Sometimes transient data types are used that do not have corresponding entries in the ConstantPool. There is significant evidence that this is because such items are simply added to the runtime ConstantPool and aren't required for the ConstantPool that is saved with the class file because of their transient nature. This situation poses a problem for the Purifier1 because all OperandStack and LocalVariables entries are stored as StackMapType instances, which require a constant pool reference for all StackMapTypes that are of type ITEM_Object. These types can only be stored in StackMapType as unknown by using the index value of -1, which can cause problems for several of the StackMapType methods.
Although the unknown objects can be problematic, they never end up in the final StackMap, because if they did, the compiler would have been required to include a ConstantPool entry for them. The problems caused by unknown objects are handled in the Purifier1 in several ways. The primary technique is coding the DFA to ensure that the problematic methods are never called on StackMapType instances that may represent unknown objects. Another is to code the LocalVariables.resolveConflicts method to favour other objects over unknown objects. The invokeInstruction method has a similar bias.
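A sketch of this bias, using BCEL's StackMapType (where, as described above, a constant-pool index of -1 marks an unknown object); resolve() stands in for the normal conflict-resolution path:

    import org.apache.bcel.classfile.StackMapType;

    // Sketch: conflict resolution favours known objects over unknowns, so
    // unknown objects never survive into the final StackMap.
    abstract class UnknownBias {
        abstract StackMapType resolve(StackMapType a, StackMapType b);

        StackMapType preferKnown(StackMapType a, StackMapType b) {
            if (a.getIndex() == -1) return b;   // a is unknown: prefer b
            if (b.getIndex() == -1) return a;   // b is unknown: prefer a
            return resolve(a, b);               // both known: normal resolution
        }
    }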
Determining the scope in which a particular variable is active is important to getting the DFA right. LocalVariables slots can be used to store different variables at different points in the Control Flow of the code. Consequently, it is helpful to discover when a particular variable is in scope and when it is not in order to determine when to include a variable in the StackMap and when to attempt to resolve types to a super class.
Many of the issues in the previous section are related to this issue. For example, when the compatible subtype conflict resolution detects two incompatible types, it is clear that both are outside their scope of applicability, so they are discarded. However, other cases are much more subtle.
Unfortunately, the Purifier1 must emulate the behaviour of Sun's preverifier, which seems to follow the JVM Verifier. These verifiers don't seem to rigorously track variable scoping. Instead, they use a method of instruction by instruction type merging to deal with the issue. This produces results that are highly dependent upon implementation details rather than consistent with a rigorous scoping specification. This is part of the reason that the Purifier1 algorithm could not always produce the same results as Sun's preverifier. This is not so much an issue of one or the other algorithm being wrong, but merely an issue of the idiosyncrasies of the corresponding conflict resolution mechanisms.
Since this project is incomplete, there are other cases that have not been fully solved yet. A list of these cases, some with detailed technical explanations, is maintained with the rest of the project documentation.
While investigating some of these issues, some potential problems were also found with the JustIce Verifier, which is built into BCEL. See JustIce Exception Handling Issue and JustIce Interface Merge Inconsistency with Sun's Preverifier for details.
The Purifier1 uses a number of third party libraries and tools. Most are included in the development distribution (BCEL, JUnit and Log4j). One tool that is not included is the Apache Ant build tool. It is assumed that most developers will already have this tool installed and configured on their system anyway, so it is not included with the distribution. Besides, Ant is not required for the Purifier1 to run, while the other libraries are.
One Ant task that is relevant to the discussion of tools is the "showCommandLines" task. This task will print out the fully qualified command lines required to run various tools. On some operating systems, the resulting command lines are too long to paste into a shell window, but they do, at least, show the complete paths to everything so you have a starting point for creating platform specific shell scripts.
Another development tool that is required for some compilation and most testing is the set of J2ME Configuration and Profile dependent libraries. In the most common case, this requirement means that the Purifier1 developer will have to download the MIDP API library. This is available from Sun as a part of the J2ME Wireless Toolkit, but cannot be included in the Purifier1 distribution due to licensing restrictions. The developer will need to copy or symbolically link the midpapi.zip and emptyapi.zip files into the Purifier/lib/midp directory. A symbolic link is preferred because the Ant "distribution" task is designed to ignore symbolic links to prevent these proprietary libraries from getting into the Purifier1 distribution.
The Purifier1 is tested against a huge number of sample Midlets (the current count is 661 test cases). Some of these are included in the developer's distribution of the Purifier1. Unfortunately, most of the sample Midlets are third party programs that are not freely redistributable. They are stored in the proprietary subdirectories of the Purifier/src/TestCases, Purifier/TestCases/preverified and Purifier/TestCases/unpreverified directories. The Ant "distribution" task automatically excludes any proprietary directories. The current Purifier1 proprietary test cases come from the Wireless Toolkit, Data Representations' Simplicity™ for Mobile Devices tutorials and some free Midlets from midlet.org.
To be able to create the preverified example test cases that are distributed with the Purifier1, a platform dependent version of Sun's preverifier is required. This should be placed in the Purifier/bin directory. Also, the midpapi.zip library must be placed in the Purifier/lib/midp directory. Although the emptyapi.zip is used when running the Purifier1 itself, compiling the test cases requires midpapi.zip. Again, symbolic links are preferable.
The Purifier1 itself is built with Apache Ant using the build.xml and default.build.properties files in the Purifier directory. If started from the Purifier directory, you should not have to change any of the settings in the default.build.properties file to get everything to work. However, you will need to have the emptyapi.zip file installed as described above. Use the command ant -projecthelp to see the different compilation options.
The freely distributable example midlets that are used as test cases have their own Ant build scripts because they require different compilation paths than the Purifier1 code itself does. These classes also require preverification with Sun's preverification tool so that they can be used to compare with the results produced by the Purifier1 itself.
Under construction...
From the command line, you will need to set ANT_OPTS to -mx256m or Ant will run out of memory.