testitemqlstudop wrote:
whoops, last-minute updating

calcyman wrote:
I've merged your changes now anyway: your profiling approach improves it to 4320 on my AVX2 machine!
Because the profiling is pretty intrusive, and only suitable for rules where 100000 soups finish relatively quickly, I've made it optional. So to use your profiling acceleration, the invocation is:
./recompile.sh --profile
I've also changed the extension from .O to .op for the profiled objects, because some people (mainly Windows users) have case-insensitive filesystems that would object to having .o and .O coexisting.

Ah, thanks! That's pretty useful for non-Life rules. The extension change is good, too. However, there's one thing: the profiling might enable some rule-specific optimizations behind the scenes. Maybe make an option to lower "-n" from 100K to, say, 1K for "slow" non-Life rules? (Well, literally every non-b3s23 rule is slow in my testing, even very similar non-explosive ones like HighLife (b36s23), but I don't know what makes them slow.)
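To illustrate the .o versus .O clash mentioned above, here is a minimal Python sketch that groups filenames which would collide on a case-insensitive filesystem. The function name and the example filenames are made up for illustration; they are not taken from the actual build scripts.

```python
from collections import defaultdict


def case_collisions(filenames):
    """Return groups of names that differ only by letter case.

    On a case-insensitive filesystem every name in a group would refer
    to the same file, so later writes would clobber earlier ones.
    """
    groups = defaultdict(list)
    for name in filenames:
        # casefold() gives a case-insensitive comparison key.
        groups[name.casefold()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical object names: the old scheme (.o / .O) versus the new one (.o / .op).
old_scheme = ["main.o", "main.O"]
new_scheme = ["main.o", "main.op"]

print(case_collisions(old_scheme))  # [['main.O', 'main.o']]
print(case_collisions(new_scheme))  # []
```

With the .op extension the two kinds of object file no longer share a case-folded name, which is the whole point of the rename.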
calcyman wrote:
Thank you very much for these updates! How would you like to be credited in the README?

Sure, but after all, without you there's no apgmera...
(I might as well give away my full name: Darren Li, if you want to put it in. Where did the "credits" section even go? It was there the last time I checked the README...)
calcyman wrote:
EDIT: On my AVX-512 machine, it's even better still: soups per second have increased from 6140 to 6584, and the total number of instructions per soup has plummeted from 1080K to 966K:

$ perf stat -B ./apgluxe -n 1000000 -t 1 -v 0 -s test
Greetings, this is apgluxe v4.86-ll2.1.11, configured for b3s23/C1.
Lifelib version: ll2.1.11
Compiler version: 7.3.0
Python version: '2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]'
Using seed test
Running 1000000 soups per haul:
Instruction set AVX-512 detected
b3s23/C1: 64401 soups completed (6440.060 soups/second current, 6440.060 overall).
b3s23/C1: 130164 soups completed (6576.199 soups/second current, 6508.121 overall).
b3s23/C1: 196853 soups completed (6667.964 soups/second current, 6561.404 overall).
Linear-growth pattern detected: yl144_1_16_afb5f3db909e60548f086e22ee3353ac
Linear-growth pattern detected: yl144_1_16_afb5f3db909e60548f086e22ee3353ac
Soup test216747 lasts an estimated 740 generations; rerunning...
Soup test216747 actually lasts 637 generations.
b3s23/C1: 261511 soups completed (6465.779 soups/second current, 6537.497 overall).
Linear-growth pattern detected: yl144_1_16_afb5f3db909e60548f086e22ee3353ac
b3s23/C1: 327370 soups completed (6585.898 soups/second current, 6547.175 overall).
b3s23/C1: 393115 soups completed (6573.912 soups/second current, 6551.630 overall).
Linear-growth pattern detected: yl144_1_16_afb5f3db909e60548f086e22ee3353ac
Linear-growth pattern detected: yl144_1_16_afb5f3db909e60548f086e22ee3353ac
b3s23/C1: 459748 soups completed (6663.297 soups/second current, 6567.580 overall).
b3s23/C1: 525636 soups completed (6588.743 soups/second current, 6570.224 overall).
Linear-growth pattern detected: yl144_1_16_afb5f3db909e60548f086e22ee3353ac
b3s23/C1: 591645 soups completed (6600.703 soups/second current, 6573.610 overall).
Rare oscillator detected: xp8_gk2gb3z11
b3s23/C1: 656918 soups completed (6527.051 soups/second current, 6568.953 overall).
b3s23/C1: 723084 soups completed (6616.541 soups/second current, 6573.278 overall).
Linear-growth pattern detected: yl144_1_16_afb5f3db909e60548f086e22ee3353ac
b3s23/C1: 788364 soups completed (6527.883 soups/second current, 6569.495 overall).
Linear-growth pattern detected: yl144_1_16_afb5f3db909e60548f086e22ee3353ac
b3s23/C1: 855046 soups completed (6667.639 soups/second current, 6577.044 overall).
b3s23/C1: 921159 soups completed (6610.982 soups/second current, 6579.468 overall).
b3s23/C1: 987972 soups completed (6681.267 soups/second current, 6586.253 overall).
b3s23/C1: 1000000 soups completed (6409.840 soups/second current, 6584.073 overall).
----------------------------------------------------------------------
1000000 soups completed.
Attempting to contact payosha256.
testing mode testing
Connection was successful; starting new search...
----------------------------------------------------------------------
New seed: l_qXruggaVhGbN; iterations = 1; quitByUser = 0
Terminating...

 Performance counter stats for './apgluxe -n 1000000 -t 1 -v 0 -s test':

     151884.435307      task-clock (msec)         #    0.999 CPUs utilized
               137      context-switches          #    0.001 K/sec
                 0      cpu-migrations            #    0.000 K/sec
           121,755      page-faults               #    0.802 K/sec
   525,813,323,844      cycles                    #    3.462 GHz
   966,647,102,509      instructions              #    1.84  insn per cycle
    92,968,439,128      branches                  #  612.100 M/sec
     6,130,671,887      branch-misses             #    6.59% of all branches

     151.971671331 seconds time elapsed

(Without the overhead of perf stat, it's 6592 soups per second.)

Great to see that! I wanted this tested on an AVX-512, too, to see if it had a greater increase. The optimizations work!
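The headline figures quoted above can be sanity-checked from the raw perf counters. This is just a small Python sketch: the counter values are copied from the perf stat output, the before/after numbers (6140, 6584, 1080K, 966K) are the ones calcyman quoted, and the variable names are my own.

```python
# Raw counters copied from the perf stat output for the 1,000,000-soup run.
instructions = 966_647_102_509
cycles = 525_813_323_844
branches = 92_968_439_128
branch_misses = 6_130_671_887
soups = 1_000_000

# Derived metrics (perf prints some of these itself in the right-hand column).
insn_per_soup = instructions // soups          # whole instructions per soup
insn_per_cycle = instructions / cycles         # IPC
branch_miss_pct = 100 * branch_misses / branches

# Before/after figures as quoted in the post.
speedup_pct = 100 * (6584 / 6140 - 1)          # gain in soups per second
insn_saving_pct = 100 * (1 - 966 / 1080)       # fewer instructions per soup

print(f"instructions per soup: {insn_per_soup}")
print(f"instructions per cycle: {insn_per_cycle:.2f}")
print(f"branch miss rate: {branch_miss_pct:.2f}%")
print(f"throughput gain: {speedup_pct:.1f}%")
print(f"instruction saving: {insn_saving_pct:.1f}%")
```

So the instruction count per soup drops by roughly a tenth while throughput rises by roughly 7%; the two need not match exactly, because throughput also depends on how many instructions retire per cycle.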
calcyman wrote:
The perf report is interesting: the two most expensive functions are runkgens (where it spends 65% of the time) and censusSoup (where it spends 11% of the time). Digging deeper, the next easy performance target seems to be the loop in upattern::totalPopulation() which gets inlined into censusSoup. If I restrict that to only iterate over tiles that have changed, it will be far more cache-friendly and we'll pick up another 3% performance improvement (i.e. 6800 soups/second after the change).

Interesting. Let me try to install and run perf on my system...
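For anyone following along, the "only iterate over tiles that have changed" idea can be sketched roughly like this. To be clear, this is a toy Python model and not lifelib's actual upattern code: the class, its fields and the bitmask tile representation are all assumptions made for illustration, and it shows only the bookkeeping, not the cache behaviour.

```python
class TiledPattern:
    """Toy tiled pattern; NOT lifelib's upattern, just an illustration."""

    def __init__(self):
        self.tiles = {}      # (tx, ty) -> int bitmask of live cells in the tile
        self.tile_pop = {}   # cached population of each tile at the last count
        self.dirty = set()   # tiles modified since the last count
        self.total = 0       # cached total population

    def set_tile(self, key, bits):
        """Overwrite a tile and remember that it changed."""
        self.tiles[key] = bits
        self.dirty.add(key)

    def total_population_full(self):
        """Baseline: visit every tile on every call."""
        return sum(bin(bits).count("1") for bits in self.tiles.values())

    def total_population_incremental(self):
        """Visit only changed tiles and adjust the cached total."""
        for key in self.dirty:
            new_pop = bin(self.tiles[key]).count("1")
            self.total += new_pop - self.tile_pop.get(key, 0)
            self.tile_pop[key] = new_pop
        self.dirty.clear()
        return self.total


p = TiledPattern()
p.set_tile((0, 0), 0b0111)   # 3 live cells
p.set_tile((1, 0), 0b1011)   # 3 live cells
print(p.total_population_incremental())  # 6

p.set_tile((0, 0), 0b0001)   # only this tile changes: 3 -> 1 live cells
print(p.total_population_incremental())  # 4
print(p.total_population_full())         # 4
```

The second incremental call touches one tile instead of two; in a real soup, where most tiles settle down quickly, that is where the saving would come from.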
Finally, I saw you removed "-Ofast" - it made many of my own programs faster. What was the reason for removing -Ofast? It might have some platform-specific improvements.