You mean the algorithm for running GoL? I'm using Life128, which is a bespoke algorithm I wrote (and attains a twofold speedup over Golly's QuickLife).simsim314 wrote:May I ask what iterator did you used?
It partitions the universe into 28-by-28 squares. They tile the universe as below, so each square has six neighbours (rather than the eight concomitant with an ordinary grid-like tiling):
Each of these is surrounded by a two-cell border, giving a covering of the universe by overlapping 32-by-32 tiles.
The algorithm iterates over all tiles that have changed since the previous iteration, and advances each of them by 2 generations at once (hence why a two-cell border is required). If cells near the edge have changed, the neighbouring tiles are alerted. Note that advancing by two generations at a time means that typical ash (which is period-2 due to the presence of blinkers) can be ignored.
Advancing a tile by two generations is accomplished by a 882-instruction straight-line routine manually written in inline x86 assembly! It uses bitwise operations (similar to your LifeAPI) and Streaming SIMD Extensions (allowing four 32-bit rows to be processed simultaneously), so there is a 128-fold parallelisation (compared with 64 for LifeAPI) -- hence the name Life128.
The assembly code is designed to mesh well with things like branch prediction (in particular, there are no branches!) and pipelining (the code has a shallowish Gantt chart, so the CPU's multiple ALUs can be used simultaneously). Consequently, this code runs in about 320 clock cycles (200 nanoseconds) on my machine, despite there being 882 instructions.
And indeed this would be necessary for any noticeable improvement, since Life128 already uses pretty much all of the tricks available to the CPU.simsim314 wrote:I would suggest to port the soup search to gpu as well, this should give another order of magnitude jump.
Unfortunately, my laptop lacks a GPU (by which I mean it doesn't support OpenCL -- it probably has a very basic inbuilt graphics chip), so I have no means of programming a GPU version of the search. Does anyone have a suitable machine into which I can ssh?
Of course, there's no rush -- I'm currently testing v2.0 with the help of Andrew, Dave, Tom and Nathaniel.