Charles Forgy, the father of Rete, the algorithm that enabled the birth and success of the BRMS industry, described the performance issues he faces (and gets frustrated with) when developing in Java. Charles focuses on performance: through algorithms, through efficient coding, through clever usage of the platform.
Java (and the .NET world and other VM-based systems) generates bytecode that is handled by a virtual machine implementation. The VM is a high-level abstraction and, as such, frequently does not leverage the capabilities offered by the processors and their ecosystem. VMs do offer modern run-time optimizations (the previously infamous and now famous JITs and the rest of their family), but there are still a number of areas where modern processors offer powerful performance capabilities that the VM abstractions have difficulty taking advantage of:
- out-of-order execution (more attention may have been put into compiler reorder optimizations)
- pipelined memory access with the issue of full cache line fetch on cache miss
- stalls, which occur when there are not enough instructions available for the processor to work on, or not enough free resources
When out-of-order execution is possible, stall cycles can be filled with actual execution, resulting in better performance.
Specific issues with Java
– Pointer chasing problem
Pointer chasing arises when a lot of object navigation is required during processing. The processing itself tends to be quite limited, but there is a lot of navigation, which leads to cache thrashing. In-memory database specialists have conducted studies showing that a stall rate ranging from 73% to 82% is to be expected in typical operations. Charles did a similar study on the Rete II implementation and identified a rate substantially higher than 50%.
Getting rid of this problem is not easy with a VM. One approach is to pre-fetch the data ahead of time: that will still provoke a cache miss, but the miss will overlap with useful work. Pre-fetching the data results in “strange” Java code (essentially getting references that are not actually used), but the result is better performance. It may introduce issues with in-order retirement of instructions.
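A minimal sketch of this “strange code” idiom, with hypothetical names and a made-up workload (the talk did not show code): while processing element i, touch an element several slots ahead so its cache-line fetch can overlap with the useful work. Note that a JIT may eliminate a truly dead load, so the early read is folded into a value that is kept alive without changing the result.

```java
// Hypothetical sketch of manual "pre-fetching" in Java; names are assumptions.
public class PrefetchSketch {
    static final class Node {
        final int value;
        Node(int value) { this.value = value; }
    }

    // Sum a function of each node's value, reading a node several slots
    // ahead so its cache-line fetch overlaps the work below.
    static int sum(Node[] nodes) {
        final int DIST = 8;        // prefetch distance, tuned empirically
        int sum = 0;
        int keepAlive = 0;         // discourages the JIT from removing the dead load
        for (int i = 0; i < nodes.length; i++) {
            if (i + DIST < nodes.length) {
                keepAlive |= nodes[i + DIST].value;  // the "strange" early reference
            }
            sum += nodes[i].value * nodes[i].value;  // useful work during the fetch
        }
        return sum + (keepAlive & 0);  // always adds 0; keeps keepAlive live
    }

    public static void main(String[] args) {
        Node[] nodes = new Node[100];
        for (int i = 0; i < nodes.length; i++) nodes[i] = new Node(i % 10);
        System.out.println(sum(nodes)); // prints 2850
    }
}
```

Whether this actually wins depends on the JIT, the heap layout, and the access pattern, which is exactly why it has to be measured rather than assumed.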
The x86 family has a PREFETCH instruction group that could be used to achieve the expected result. These are not accessible through the VM.
– Making structures cache friendly
A data fetch pulls a whole block of data into the cache, so grouping together the data that is processed together opens up interesting performance opportunities. In well-controlled cases, pointer arithmetic could be used instead of higher-level reference navigation.
Charles illustrated the possibilities with a BTree implementation.
But Java does not have support for this level of control.
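Java offers no direct control over object layout, but part of the effect can be approximated by flattening objects into parallel primitive arrays, a structure-of-arrays sketch (Charles's BTree is not reproduced here; all names below are assumptions):

```java
// Structure-of-arrays sketch: instead of an array of Point objects scattered
// across the heap, keep each field in its own primitive array so that values
// processed together share cache lines. Names are hypothetical.
public class SoASketch {
    // Object-per-element layout: each element is a separate heap allocation,
    // and iterating over one field chases a pointer per element.
    static final class Point {
        double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    // Flattened layout: all x values contiguous, all y values contiguous.
    static final class Points {
        final double[] xs;
        final double[] ys;
        Points(int n) { xs = new double[n]; ys = new double[n]; }
    }

    // Summing only x sweeps one dense array instead of n scattered objects.
    static double sumX(Points p) {
        double sum = 0;
        for (double x : p.xs) sum += x;
        return sum;
    }

    public static void main(String[] args) {
        Points p = new Points(4);
        for (int i = 0; i < 4; i++) { p.xs[i] = i + 1; p.ys[i] = -i; }
        System.out.println(sumX(p)); // prints 10.0
    }
}
```

This gives predictable, contiguous memory access for the hot field, which is as close as plain Java gets to the layout control the post is asking for.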
– Using special instruction set features
Charles used the SIMD instructions. These can manipulate significant numbers of bits at a time – currently 128, soon 256. Used directly, these instructions can yield substantial performance gains.
Again, Java does not support this level of control.