For those unfamiliar with it, Box2D is a great 2D physics library written by Erin Catto, which is at the core of a large number of casual games on consoles and mobile devices. Angry Birds is one you might have heard of, but there are many, many others.
It’s also not a simple library by any means. When porting Angry Birds to HTML5, we found that on the more complex levels Box2D performance could be the limiting factor in the game’s frame rate. It turns out this little library is doing a lot of work under the hood, and that work isn’t limited to any one tight loop or hotspot. Rather, it is spread all over the place: matrix and vector math, creation of lots of small objects, and general object-oriented logic distributed over a complex code base.
The goal of this little experiment is not to add fuel to the flames of the Internet’s already-tiresome “Compiler and VM Wars” – rather, my intention is to get some hard data on what behavior real-world performance-sensitive code can actually expect to see in practice on various platforms. Measuring the performance of virtual machines in isolation is particularly tricky, but this benchmark has the nice property that, if a VM or compiler improves it, then real-world problems are actually solved in the wild, and everyone wins.
- Native : This is the standard Box2D code, compiled via gcc or clang/llvm (the latter on my test machine, as described below).
- NaCl : The same code, compiled via the NaCl SDK’s custom gcc build, and running within Chrome.
- Java : The JRE (1.6), as currently shipped by Apple on Mac OS X 10.6.
Picking the right world structure for this kind of benchmark is a bit tricky, because the simulation needs to have the following properties:
- A high per-frame running time.
- Slow to settle: simulations that settle eventually stop doing work, as the physics engine skips calculations for settled objects.
- Stable: subtle variations in the behavior of floating-point math can cause the simulation on different VMs to diverge badly, invalidating comparisons.
I eventually settled on a simple pyramid of boxes with 40 boxes at the base, for a total of about 800 boxes. I manually verified that this simulation is stable on all the systems tested, and that it doesn’t settle within the number of frames simulated for each test. It takes at least 3-4ms to simulate in native code, and only gets slower from there, so the running time is sufficient to avoid problems with timer resolution.
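To make the "about 800 boxes" figure concrete, here is a minimal sketch of how such a pyramid could be laid out. It assumes each row has one fewer box than the row below and is offset by half a box width; the function name `pyramid_positions` and the unit box size are made up for illustration and are not taken from the benchmark's source.

```python
def pyramid_positions(base_count=40, box_size=1.0):
    """Return (x, y) centres for a 2D pyramid of square boxes.

    Assumption: row 0 sits on the ground with base_count boxes, and each
    row above has one fewer box, shifted right by half a box width.
    """
    positions = []
    for row in range(base_count):
        boxes_in_row = base_count - row
        y = box_size * (row + 0.5)
        # Each row starts half a box further right than the row below it.
        x_start = box_size * (row * 0.5 + 0.5)
        for col in range(boxes_in_row):
            positions.append((x_start + col * box_size, y))
    return positions


boxes = pyramid_positions(40)
print(len(boxes))  # 40 + 39 + ... + 1 = 820
```

Under that assumption the total is 40 × 41 / 2 = 820 boxes, which is consistent with the rough count above. In a real test each position would be handed to the physics engine as a dynamic body.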
I used my MacBook Pro (2.53 GHz Intel Core i5) as the test machine. It seems a fairly middle-of-the-road machine (maybe a bit on the fast side for a laptop). As always, your mileage may vary.
The raw data I collected is in the following spreadsheet. I ran each test several times, to mitigate spikes and hiccups that might be caused by other activity, and to give each system a fair chance. Each run warms up over 64 frames, and then runs for 256 frames – I’ve confirmed that the simulation is stable over this duration on all the platforms under test.
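The warm-up-then-measure procedure described above can be sketched roughly as follows. This is not the harness used for these tests: `run_benchmark`, `summarize`, and `fake_step` are hypothetical names, and `fake_step` is only a stand-in for a single physics step.

```python
import statistics
import time

WARMUP_FRAMES = 64   # untimed frames, to let a JIT or cache warm up
TIMED_FRAMES = 256   # frames whose durations are recorded


def run_benchmark(step, warmup=WARMUP_FRAMES, frames=TIMED_FRAMES):
    """Run `warmup` untimed frames, then return per-frame times in ms."""
    for _ in range(warmup):
        step()
    times_ms = []
    for _ in range(frames):
        start = time.perf_counter()
        step()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return times_ms


def summarize(times_ms):
    """Reduce per-frame times to mean, population variance, and worst frame."""
    return {
        "mean_ms": statistics.mean(times_ms),
        "variance": statistics.pvariance(times_ms),
        "worst_ms": max(times_ms),
    }


def fake_step():
    # Hypothetical stand-in for one physics step; replace with the real call.
    total = 0.0
    for i in range(10_000):
        total += i * 0.5
    return total


print(summarize(run_benchmark(fake_step)))
```

Keeping the individual frame times, rather than only a total, is what makes spikes and frame-to-frame variance visible afterwards.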
First, let’s look at all the results together, on a log-scale graph:
Now let’s compare the best of the best of each of these three groups.
This also demonstrates that native code compiled via NaCl stays within 20-30% of the performance of un-sandboxed native code, which is in line with what I’ve been told to expect.
First off, all the VMs tested are well within 3x of each other, which is wonderful, because wildly variant performance across browsers would make it exceedingly difficult to depend upon them for any heavy lifting. V8 and JSCore are quite close to one another, but JSCore has an edge in variance. It’s not immediately obvious what’s causing this, but GC pauses are a likely culprit given the periodic spikes we see in this graph.
Note: See the update below about Emscripten performance.
As with all benchmarks, especially ones as fuzzy as this, there are a lot of caveats.
The code’s not identical
These are all ports of the same C++ source code, so by their nature they must vary from one another. It may be the case that a particular port is unfairly negatively biased, because of particular idioms used in the code. If you suspect this is the case, please say so and preferably offer a patch to the maintainer. These aren’t being used in the same way as, e.g., the V8 or Kraken benchmarks, so it’s entirely fair to optimize the code to get better numbers.
I may have made mistakes
I suppose this goes without saying, but there are a lot of knobs to be tweaked here, and there could easily be something sub-optimal in my configuration, makefiles, or what have you. If you notice anything amiss, please say so and I’ll try to address it.
This is just one machine
As described above, I ran these tests on my MacBook Pro. The relative performance of these tests might not be the same on different machines.
Based on feedback from comments and elsewhere, I’ve updated a few of the numbers in the linked spreadsheet, along with the graphs.