Ian07 wrote:
I think there's been a bit of a slowdown since v4.9 in b3s23/D8_4 (and of course, possibly other rules/symmetries); I remember getting well over 7k soups/sec/core in v4.9 but have only been getting about 6.5k in v5.03.
On a slightly different but related note, I received a bug report saying that HighLife search speed had dropped to basically zero. That's now been fixed in commit a679e82f, so HighLife can again be searched at a comfortable 20 soups/second. Other rules with erratic-growth patterns should see a concomitant return to the former glory of early v4.x versions.
wildmyron wrote:
I was curious about using a smaller universe count on the GTX 960M. The naive code change I made to try this out didn't work out - how complex is actually changing the allowable values of this parameter? I'm guessing it involves modifications to copyhashes() in gpupattern.h that aren't at all obvious to me.
Essentially, this part of the code:
Code: Select all
// Level-1 exclusive scans over the 64-bit tile masks, sized to the universe count:
if (unicount == 8192) {
    logf = 7;
    exclusive_scan_uint64_128<<<64, 128>>>(tilemasks, l1sums, l1totals);
} else if (unicount == 4096) {
    logf = 6;
    exclusive_scan_uint64<<<64, 64>>>(tilemasks, l1sums, l1totals);
}
// Level-2 scan over the per-block totals, then compact the surviving tiles:
exclusive_scan_uint64<<<1, 64>>>(l1totals, l2sums, l2sums + 64);
compact_tiles<<<unicount, 128>>>((uint32_cu*) multiverse, tilemasks, l1sums, l2sums, compacts, logf);
// Per-universe sums, followed by a final scan whose width depends on unicount:
psc_universes<<<(unicount / 32), 32>>>(to_restore, psums);
if (unicount == 8192) {
    exclusive_scan_uint32_256<<<1, 256>>>(psums, psums2, l2sums + 65);
} else if (unicount == 4096) {
    exclusive_scan_uint32<<<1, 128>>>(psums, psums2, l2sums + 65);
}
needs to be changed if you want a smaller number of universes. In particular, to add 2048 as an option, you'll need to implement a 32-element uint64 exclusive scan and a 64-element uint32 exclusive scan. (Is this necessary because of memory requirements? Setting -u 4096 should already use less than a gigabyte of GPU memory in total. I'll add the 2048 option myself if there are use-cases.)
wildmyron wrote:
Hmm, I meant to include this URL:
https://wpdev.uservoice.com/forums/2669 ... pu-support - I haven't seen any recent news from the WSL team on progress towards this feature.
From the reading I've done, it seems it would be possible to link the CPU and GPU parts built with gcc (cross compiled) and nvcc + cl.exe, respectively.
Thanks for the suggestions. Is the idea to build the CPU part using MinGW (in the same way that the precompiled Windows binary is built), but output a dynamic library instead of an executable, and then link to that?
I'm not sure whether differing name-decoration (mangling) conventions between the compilers would cause problems, but building the CPU searcher as a DLL would hopefully avoid that issue.
Usually you can avoid linkage problems by using extern "C", at the expense of a lower-level interface between the parts of the program. (That's how python-lifelib works: the lifelib dynamic library exposes several C-like interfaces, and the Python side calls into them.)
Among other objects missing from the G1 census are soups which die out completely, because they obviously pass the p6 stability test within a short duration and are therefore considered uninteresting. Is it easy (and worthwhile) to pick these up and pass them to the CPU searcher?
It should be possible, but would cause a slight slowdown: I'd need to check whether all of the tiles are already empty in the copyhashes() routine -- the one in which I clear the existing universe -- and if so, mark the soup as interesting. The reason for the slowdown is that each write would need to be replaced with the combination of a read and a write.
Of course, another possibility is to move the universe-clearing code out of copyhashes(), and incorporate it into a new function which can actually analyse the ash products of uninteresting soups (and mark 'interesting' if there's anything it can't identify). That would allow genuine C1 searching using the GPU: admittedly marginally slower than searching G1, owing to the extra overhead of ash analysis, but with the huge upside of being compatible with the existing main census.