Using tensorflow to run CGOL

For scripts to aid with computation or simulation in cellular automata.
simsim314
Posts: 1823
Joined: February 10th, 2014, 1:27 pm

Using tensorflow to run CGOL

Post by simsim314 » April 11th, 2023, 5:39 am

I've finally managed to write a CGOL iterator using TensorFlow; it's an adaptation of the LifeAPI iterator.

You can run it on whatever hardware supports TF: GPU, multi-GPU, TPU, cluster computing, etc. They all have acceleration for TF, and bitwise operations are part of its specification. A lot of dedicated hardware for this purpose (mainly for neural nets) will appear soon, so it's a good idea to move to higher-level, hardware-agnostic libraries; most new hardware is actually adapted to them.

On my laptop with a GTX 1060, it runs twice as fast as on my CPU, at ~500K CGOL iterations/sec on a 64x64 grid.

EDIT: I've run it on a Tesla T4 on Google Colab, and it reached 1.5M iter/sec. I guess with something like an RTX 3090 you would get to ~5M.

Code: Select all

import tensorflow as tf 
import numpy as np 

def Add(b1, b0, val):
    # Bit-sliced add of val into the 2-bit counter (b1, b0): each uint64
    # holds one grid row, so 64 cell counters are updated at once.
    t_b1 = tf.bitwise.bitwise_and(b0, val)
    b1.assign(tf.bitwise.bitwise_or(b1, t_b1))
    b0.assign(tf.bitwise.bitwise_xor(b0, val))

def Add_Init(b1, b0, val):
    # As Add, but overwrites b1 (initialises the carry bit).
    b1.assign(tf.bitwise.bitwise_and(b0, val))
    b0.assign(tf.bitwise.bitwise_xor(b0, val))

def Add1(b2, b1, b0, val):
    # Bit-sliced add of val into the 3-bit counter (b2, b1, b0).
    t_b2 = tf.bitwise.bitwise_and(b0, val)
    b2.assign(tf.bitwise.bitwise_or(b2, tf.bitwise.bitwise_and(t_b2, b1)))
    b1.assign(tf.bitwise.bitwise_xor(b1, t_b2))
    b0.assign(tf.bitwise.bitwise_xor(b0, val))

def Add_Init1(b2, b1, b0, val):
    # As Add1, but overwrites b2 (initialises the top carry bit).
    t_b2 = tf.bitwise.bitwise_and(b0, val)
    b2.assign(tf.bitwise.bitwise_and(t_b2, b1))
    b1.assign(tf.bitwise.bitwise_xor(b1, t_b2))
    b0.assign(tf.bitwise.bitwise_xor(b0, val))

class Life64Layer(tf.keras.layers.Layer):
    # One CGOL generation on a batch of 64x64 grids: K grids of N rows,
    # each row packed into one uint64 (K and N are module-level globals).
    def __init__(self, **kwargs):
        super(Life64Layer, self).__init__(**kwargs)
        self.sum0 = tf.Variable(tf.zeros((1, K, N), dtype=tf.uint64), trainable=False)
        self.sum1 = tf.Variable(tf.zeros((1, K, N), dtype=tf.uint64), trainable=False)
        self.sum2 = tf.Variable(tf.zeros((1, K, N), dtype=tf.uint64), trainable=False)


    def Evolve(self, temp, bU0, bU1, bB0, bB1):
        # Accumulate the 8-neighbour count into 3 bits (sum2 sum1 sum0):
        # the two horizontal neighbours of each cell, plus the 3-cell
        # row sums of the row above (bU1, bU0) and below (bB1, bB0).
        self.sum0.assign(tf.bitwise.left_shift(temp, 1))
        Add_Init(self.sum1, self.sum0, tf.bitwise.right_shift(temp, 1))

        Add(self.sum1, self.sum0, bU0)
        Add_Init(self.sum2, self.sum1, bU1)
        Add1(self.sum2, self.sum1, self.sum0, bB0)
        Add(self.sum2, self.sum1, bB1)

        # Alive next generation iff count == 3, or count == 2 and alive:
        # ~sum2 & sum1 selects counts 2-3, and (temp | sum0) then keeps
        # count 3 always and count 2 only for currently-live cells.
        return tf.bitwise.bitwise_and(tf.bitwise.bitwise_and(tf.bitwise.invert(self.sum2), self.sum1), tf.bitwise.bitwise_or(temp, self.sum0))

        
    def call(self, inputs):
        state = inputs
        # bit0/bit1: 2-bit sum of each cell and its two horizontal
        # neighbours, computed per row.
        l = tf.bitwise.left_shift(state, 1)
        r = tf.bitwise.right_shift(state, 1)
        bit0 = tf.bitwise.bitwise_xor(tf.bitwise.bitwise_xor(l, r), state)

        l_or_r = tf.bitwise.bitwise_or(l, r)
        l_and_r = tf.bitwise.bitwise_and(l, r)
        state_and_l_or_r = tf.bitwise.bitwise_and(state, l_or_r)
        bit1 = tf.bitwise.bitwise_or(state_and_l_or_r, l_and_r)

        # Row sums of adjacent rows (axis 2 indexes the 64 rows;
        # tf.roll wraps, so the grid is vertically toroidal).
        bU0 = tf.roll(bit0, shift=-1, axis=2)
        bU1 = tf.roll(bit1, shift=-1, axis=2)
        bB0 = tf.roll(bit0, shift=1, axis=2)
        bB1 = tf.roll(bit1, shift=1, axis=2)

        return self.Evolve(state, bU0, bU1, bB0, bB1)
        
def print_bits(arr):
    # Render one grid: each uint64 row as a line of 64 characters.
    for i in range(arr.shape[0]):
        s = "{0:b}".format(arr[i]).zfill(64).replace('0', '_').replace('1', 'O')
        print(s)

K = 10000  # number of 64x64 grids evolved in parallel
N = 64     # rows per grid; each row is one uint64

x = np.random.randint(0, 2**64, size=(1, K, N), dtype=np.uint64)
print_bits(x[0, 0])

# Convert the numpy array to a tensor with uint64 dtype.
x_tensor = tf.convert_to_tensor(x, dtype=tf.uint64)

# Stack 11 layers, so each predict() call advances 11 generations.
model = tf.keras.Sequential([Life64Layer(input_shape=(K, N), dtype=tf.uint64)] + 
    [Life64Layer(dtype=tf.uint64) for _ in range(10)])
import time

for k in range(100):
    start_time = time.time()
    y_pred = model.predict(x_tensor, batch_size=32, verbose=False)
    end_time = time.time()

    elapsed_time = end_time - start_time
    print("Elapsed time: {:.2f} seconds".format(elapsed_time))
    
    print_bits(y_pred[0, 1])
    x_tensor = y_pred

carsoncheng
Posts: 475
Joined: June 11th, 2022, 11:24 pm

Re: Using tensorflow to run CGOL

Post by carsoncheng » April 11th, 2023, 7:05 am

simsim314 wrote:
April 11th, 2023, 5:39 am
I've finally managed to write a CGOL iterator using TensorFlow; it's an adaptation of the LifeAPI iterator.

You can run it on whatever hardware supports TF: GPU, multi-GPU, TPU, cluster computing, etc. They all have acceleration for TF, and bitwise operations are part of its specification. A lot of dedicated hardware for this purpose (mainly for neural nets) will appear soon, so it's a good idea to move to higher-level, hardware-agnostic libraries; most new hardware is actually adapted to them.

On my laptop with a GTX 1060, it runs twice as fast as on my CPU, at ~500K CGOL iterations/sec on a 64x64 grid.
Great! Does it run faster than LifeAPI or lifelib? If it does, it might be worth adapting search utilities like catforce to the TensorFlow utilities (of course, make a C++ version of this iterator first, but that shouldn't be too hard)

simsim314
Posts: 1823
Joined: February 10th, 2014, 1:27 pm

Re: Using tensorflow to run CGOL

Post by simsim314 » April 11th, 2023, 8:22 am

Since it's parallel computing, I couldn't use the (min, max) bounding box and skip the empty space around the pattern. So very tiny patterns might still run faster in LifeAPI, but generally speaking, if you have hardware like a cluster or even 4 GPUs, LifeAPI has no chance. One could also define a smaller grid, like 40x64, for example. As for lifelib, I'm not sure about that library's GPU utilization; I remember Adam added some GPU support there.

Generally speaking, robust libraries like TensorFlow, which run out of the box on any hardware you like (something like 10 TPUs, say), are the way to go in the long run for intensive computing. A TPU is 4 times more capable than the strongest GPU, which is 20 times more capable than a CPU. So imagine renting something like 10 TPUs: you could run 800 times faster than any previous search.

It's possible both to make a C++ version of this iterator and to rewrite catforce in Python. The main problem is that catforce can't run out of the box with this iterator, because it isn't built for parallel computing of this extent. I think it's better to rewrite LifeAPI around this iterator in Python and then port catforce. But this is something that could be done automatically with GPT-4 or GPT-5; the conversion is a pretty tricky part...

P.S. The idea of LifeAPI was to run faster with the comfort of Golly. I was writing lots of scripts in Golly and waiting days for them to finish. Now that Python is catching up with other languages in performance (assuming you use the right libraries), there is no need to use C++ anymore, not as much as in 2014. Anyway, translating between languages is also almost trivial today with ChatGPT.

calcyman
Moderator
Posts: 2938
Joined: June 1st, 2009, 4:32 pm

Re: Using tensorflow to run CGOL

Post by calcyman » April 12th, 2023, 12:04 pm

simsim314 wrote:
April 11th, 2023, 5:39 am
On my laptop with a GTX 1060, it runs twice as fast as on my CPU, at ~500K CGOL iterations/sec on a 64x64 grid.
This is massively underutilising the hardware. My laptop's GPU (a pretty rubbish Quadro P600) can manage 300 million 64x64 grid iterations per second using lifelib's GPU support, which powers the QuFince search program. A 1080 Ti should be able to manage > 1 billion 64x64 grid iterations per second, so your GPU is being underutilised by a factor of at least 2000.

There are a number of reasons for this, but the biggest one is that this TensorFlow implementation is memory-bound rather than compute-bound. The results of the TensorFlow operations are being saved into global GPU memory and then loaded back from memory in the next operation, whereas with lifelib the entire calculation (many successive generations in a loop, each of which is a handful of boolean operations, shifts, and shuffles) is performed solely in GPU registers.
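As a rough back-of-envelope check (a sketch with assumed numbers, not a measurement: ~25 elementwise kernels per generation as in the TF code above, and ~300 GB/s of global memory bandwidth for a T4-class card), the memory traffic alone caps the unfused version far below compute-bound speeds:

Code: Select all

# Assumed numbers: ~25 elementwise kernels per generation, ~300 GB/s of
# global memory bandwidth; each op reads and writes every grid once.
grid_bytes  = 64 * 8            # one 64x64 grid = 64 uint64 rows
ops_per_gen = 25                # approximate kernel launches per generation
bytes_per_gen = 2 * grid_bytes * ops_per_gen
print(300e9 / bytes_per_gen)    # ~1.2e7 grids/sec: bandwidth-imposed ceiling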

The largest of these issues can be solved by something called kernel fusion, where you take your computation graph (at the moment, you're launching a separate kernel for every single bitwise operation) and combine it into fewer kernels (ideally one):

https://stackoverflow.com/q/53305830/5130486

There is a way to automatically do kernel fusion in TensorFlow using something called XLA, but it's not enabled by default in either TensorFlow or Keras:

https://www.tensorflow.org/xla

Enabling XLA should give you a huge speed boost (with the possible exception of the first timing test, because there's a brief pause at the beginning when it compiles your kernel for the first time). You'll want to also ensure that data is arranged in memory correctly (so that each 64x64 grid occupies a contiguous 512-byte segment of memory, i.e. you want your shape to be (N, K, 1) instead of (1, K, N)) for this to be most effective. There's also a Python library called JAX (also by Google, I believe) which gives you XLA but provides a more numpy-like interface than TensorFlow.
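For instance, a minimal sketch of enabling XLA in TensorFlow (life_step here is a placeholder for one generation of the bitwise iterator, not a function from the code above):

Code: Select all

import tensorflow as tf

# Placeholder for one generation of the bitwise iterator.
def life_step(state):
    return state  # ...the chain of tf.bitwise ops would go here

# jit_compile=True asks TensorFlow to compile the traced graph with XLA,
# fusing the unrolled loop of bitwise ops into as few kernels as it can.
@tf.function(jit_compile=True)
def evolve_fused(state):
    for _ in range(10):   # unrolled at trace time, then fused by XLA
        state = life_step(state)
    return state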

XLA together with memory contiguity should give you a big speedup, but I don't know how close this will be to the ~2000x speedup you'll get from using lifelib's CUDA code. The lifelib code isn't something that I could easily imagine a compiler being able to produce, especially one such as XLA which is designed for linear algebra:

Code: Select all

/**
 * Advance a 64x64 torus by one generation using a single warp.
 *
 * Each thread holds a 64x2 rectangle of the torus in a uint4
 * datatype (representing a 2x2 array of 32x1 rectangles).
 */
_DI_ uint4 advance_torus(uint4 old) {

    // Precompute the lane indices of the threads representing
    // the 64x2 rectangles immediately above and below this one.
    int upperthread = (threadIdx.x + 31) & 31;
    int lowerthread = (threadIdx.x +  1) & 31;

    // 8 boolean ops + 8 funnel shifts:
    uint4 xor2; uint4 sh;
    {
        uint4 al; uint4 ar;
        al.x = (old.x << 1) | (old.y >> 31);
        al.y = (old.y << 1) | (old.x >> 31);
        al.z = (old.z << 1) | (old.w >> 31);
        al.w = (old.w << 1) | (old.z >> 31);
        ar.x = (old.x >> 1) | (old.y << 31);
        ar.y = (old.y >> 1) | (old.x << 31);
        ar.z = (old.z >> 1) | (old.w << 31);
        ar.w = (old.w >> 1) | (old.z << 31);

        xor2.x = al.x ^ ar.x;
        xor2.y = al.y ^ ar.y;
        xor2.z = al.z ^ ar.z;
        xor2.w = al.w ^ ar.w;
        sh.x = al.x & ar.x;
        sh.y = al.y & ar.y;
        sh.z = al.z & ar.z;
        sh.w = al.w & ar.w;
    }

    // 24 boolean ops + 4 shuffles:
    uint4 xor3; uint4 xor3prime; uint4 sv; uint4 pt8;
    xor3.x = xor2.x ^ old.x;
    xor3.y = xor2.y ^ old.y;
    xor3.z = xor2.z ^ old.z;
    xor3.w = xor2.w ^ old.w;
    xor3prime.x = hh::shuffle_32(xor3.x, lowerthread);
    xor3prime.y = hh::shuffle_32(xor3.y, lowerthread);
    xor3prime.z = hh::shuffle_32(xor3.z, upperthread);
    xor3prime.w = hh::shuffle_32(xor3.w, upperthread);
    {
        uint4 uda; uint4 udx;
        udx.x = xor3.z ^ xor3prime.z;
        udx.y = xor3.w ^ xor3prime.w;
        udx.z = xor3.x ^ xor3prime.x;
        udx.w = xor3.y ^ xor3prime.y;
        uda.x = xor3.z & xor3prime.z;
        uda.y = xor3.w & xor3prime.w;
        uda.z = xor3.x & xor3prime.x;
        uda.w = xor3.y & xor3prime.y;

        pt8.x = xor2.x ^ udx.x;
        pt8.y = xor2.y ^ udx.y;
        pt8.z = xor2.z ^ udx.z;
        pt8.w = xor2.w ^ udx.w;
        sv.x = (xor2.x & udx.x) | uda.x;
        sv.y = (xor2.y & udx.y) | uda.y;
        sv.z = (xor2.z & udx.z) | uda.z;
        sv.w = (xor2.w & udx.w) | uda.w;
    }

    // 32 boolean ops + 4 shuffles
    uint4 maj3; uint4 maj3prime; uint4 pt4; uint4 xc3;
    maj3.x = sh.x | (old.x & xor2.x);
    maj3.y = sh.y | (old.y & xor2.y);
    maj3.z = sh.z | (old.z & xor2.z);
    maj3.w = sh.w | (old.w & xor2.w);
    maj3prime.x = hh::shuffle_32(maj3.x, lowerthread);
    maj3prime.y = hh::shuffle_32(maj3.y, lowerthread);
    maj3prime.z = hh::shuffle_32(maj3.z, upperthread);
    maj3prime.w = hh::shuffle_32(maj3.w, upperthread);

    pt4.x = (maj3prime.z ^ maj3.z) ^ (sh.x ^ sv.x);
    pt4.y = (maj3prime.w ^ maj3.w) ^ (sh.y ^ sv.y);
    pt4.z = (maj3prime.x ^ maj3.x) ^ (sh.z ^ sv.z);
    pt4.w = (maj3prime.y ^ maj3.y) ^ (sh.w ^ sv.w);
    xc3.x = (maj3prime.z | maj3.z) ^ (sh.x | sv.x);
    xc3.y = (maj3prime.w | maj3.w) ^ (sh.y | sv.y);
    xc3.z = (maj3prime.x | maj3.x) ^ (sh.z | sv.z);
    xc3.w = (maj3prime.y | maj3.y) ^ (sh.w | sv.w);

    // 12 boolean ops
    uint4 newstate;
    newstate.x = pt4.x & xc3.x & (pt8.x | old.x);
    newstate.y = pt4.y & xc3.y & (pt8.y | old.y);
    newstate.z = pt4.z & xc3.z & (pt8.z | old.z);
    newstate.w = pt4.w & xc3.w & (pt8.w | old.w);

    return newstate;

}
Note that in addition to the boolean operations, we have funnel-shifts (for communicating between cells horizontally) and shuffles (for communicating between cells vertically), so that the entire 64x64 grid is evolved in a warp of 32 threads without any memory accesses. The use of warp shuffle instructions is a niche technique and depends intricately on how the cells are arranged in registers and distributed between the threads of a warp; it's not something that I'd expect an optimising compiler such as XLA to be able to produce automatically.
simsim314 wrote:
April 11th, 2023, 8:22 am
Generally speaking, robust libraries like TensorFlow, which run out of the box on any hardware you like (something like 10 TPUs, say), are the way to go in the long run for intensive computing. A TPU is 4 times more capable than the strongest GPU, which is 20 times more capable than a CPU. So imagine renting something like 10 TPUs: you could run 800 times faster than any previous search.
TPUs are only faster than GPUs for computations that can make use of their systolic array. So a TPU will outperform a GPU for matrix multiplication, but it won't outperform a GPU for running CGoL, because CGoL can't be efficiently expressed in terms of matrix multiplication. (Yes, you could inefficiently implement CGoL using matrix multiplication, because any arithmetic circuit can be implemented using matrix multiplication, but that overhead would outweigh any advantage from using the TPU's superior hardware.)
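For concreteness, here is a sketch (mine, not from the post) of the dense formulation in question: neighbour counting as a 3x3 convolution with one float per cell, which is the shape of computation XLA can map onto a TPU's matrix unit, but at roughly 64x the memory traffic of the bit-packed representation:

Code: Select all

import jax
import jax.numpy as jnp
from jax.scipy.signal import convolve2d

# One CGoL generation, dense formulation: one float per cell.
KERNEL = jnp.array([[1., 1., 1.],
                    [1., 0., 1.],
                    [1., 1., 1.]])

@jax.jit
def life_step_dense(grid):  # grid: (H, W) array of 0.0 / 1.0
    n = convolve2d(grid, KERNEL, mode='same')          # neighbour counts
    return jnp.where((n == 3) | ((n == 2) & (grid == 1)), 1.0, 0.0)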

The other property of matrix multiplication is that the I/O of a square matrix multiplication grows like O(n^2), but the amount of compute grows like O(n^3). This means that kernel fusion isn't so important in those cases, because the overhead from reading the inputs and writing the outputs is negligible compared with the computation itself. That's why a lot of numerical/scientific/deep-learning work can be done using Python with the help of the appropriate libraries (BLAS, LAPACK, cuDNN, etc.).

But individual bitwise operations applied elementwise across an array are massively I/O-bound, and even a single generation of CGoL is still I/O-bound, so you need to kernel-fuse many successive CGoL generations to get something that utilises the GPU's compute resources.
What do you do with ill crystallographers? Take them to the mono-clinic!

simsim314
Posts: 1823
Joined: February 10th, 2014, 1:27 pm

Re: Using tensorflow to run CGOL

Post by simsim314 » April 12th, 2023, 2:49 pm

calcyman wrote:
April 12th, 2023, 12:04 pm
underutilised by a factor of at least 2000
This factor is huge...

Reimplemented with JAX:

Code: Select all

import jax
import jax.numpy as jnp
import time

# uint64 needs 64-bit mode; without this, jnp downcasts to uint32.
jax.config.update("jax_enable_x64", True)

# Bit-sliced adders as in the TF version, but returning new arrays
# instead of assigning to variables (JAX arrays are immutable).
@jax.jit
def Add(b1, b0, val):
    t_b1 = jnp.bitwise_and(b0, val)
    b1 = jnp.bitwise_or(b1, t_b1)
    b0 = jnp.bitwise_xor(b0, val)
    return b1, b0

@jax.jit
def Add_Init(b1, b0, val):
    b1 = jnp.bitwise_and(b0, val)
    b0 = jnp.bitwise_xor(b0, val)
    return b1, b0

@jax.jit
def Add1(b2, b1, b0, val):
    t_b2 = jnp.bitwise_and(b0, val)
    b2 = jnp.bitwise_or(b2, jnp.bitwise_and(t_b2, b1))
    b1 = jnp.bitwise_xor(b1, t_b2)
    b0 = jnp.bitwise_xor(b0, val)
    return b2, b1, b0

@jax.jit
def Add_Init1(b2, b1, b0, val):
    t_b2 = jnp.bitwise_and(b0, val)
    b2 = jnp.bitwise_and(t_b2, b1)
    b1 = jnp.bitwise_xor(b1, t_b2)
    b0 = jnp.bitwise_xor(b0, val)
    return b2, b1, b0

@jax.jit
def evolve_iter(temp, bU0, bU1, bB0, bB1):
    # 3-bit neighbour count (sum2 sum1 sum0): the two horizontal
    # neighbours plus the row sums above (bU1, bU0) and below (bB1, bB0).
    sum0 = jnp.left_shift(temp, 1)
    sum1 = jnp.zeros_like(temp, dtype=jnp.uint64)
    sum2 = jnp.zeros_like(temp, dtype=jnp.uint64)

    sum1, sum0 = Add_Init(sum1, sum0, jnp.right_shift(temp, 1))
    sum1, sum0 = Add(sum1, sum0, bU0)
    sum2, sum1 = Add_Init(sum2, sum1, bU1)
    sum2, sum1, sum0 = Add1(sum2, sum1, sum0, bB0)
    sum2, sum1 = Add(sum2, sum1, bB1)
    # Alive next generation iff count == 3, or count == 2 and alive now.
    res = jnp.bitwise_and(jnp.bitwise_and(~sum2, sum1), jnp.bitwise_or(temp, sum0))
    return res

@jax.jit
def Evolve(state):
    # bit0/bit1: 2-bit sum of each cell and its two horizontal
    # neighbours, computed per row.
    l = jnp.left_shift(state, 1)
    r = jnp.right_shift(state, 1)
    bit0 = jnp.bitwise_xor(jnp.bitwise_xor(l, r), state)

    l_or_r = jnp.bitwise_or(l, r)
    l_and_r = jnp.bitwise_and(l, r)
    state_and_l_or_r = jnp.bitwise_and(state, l_or_r)
    bit1 = jnp.bitwise_or(state_and_l_or_r, l_and_r)

    # Row sums of adjacent rows (axis 2 indexes the rows; roll wraps).
    bU0 = jnp.roll(bit0, shift=-1, axis=2)
    bU1 = jnp.roll(bit1, shift=-1, axis=2)
    bB0 = jnp.roll(bit0, shift=1, axis=2)
    bB1 = jnp.roll(bit1, shift=1, axis=2)

    return evolve_iter(state, bU0, bU1, bB0, bB1)

@jax.jit
def evolve_multiple_times(x):
    # The Python loop unrolls at trace time, so XLA fuses all 10
    # generations into a single compiled computation.
    for i in range(10):
        x = Evolve(x)
    return x

def print_bits(arr):
    # Render one grid: each uint64 row as a line of 64 characters.
    for i in range(arr.shape[0]):
        s = "{0:b}".format(int(arr[i])).zfill(64).replace('0', '_').replace('1', 'O')
        print(s)
       

K = 10000
N = 64

# Random initial state (maxval caps at 2**31 - 1, so only the low 31
# bits of each row start populated; it's just an arbitrary seed pattern).
x = jax.random.randint(jax.random.PRNGKey(0), shape=(1, K, N), minval=0, maxval=2**31 - 1, dtype=jnp.uint64)
#print_bits(x[0, 0])

for k in range(100):
    start_time = time.time()
    x = evolve_multiple_times(x)
    x.block_until_ready()  # wait out JAX's async dispatch so timings are real
    end_time = time.time()

    elapsed_time = end_time - start_time
    print("Elapsed time: {:.2f} seconds".format(elapsed_time))
    #print_bits(x[0, 0])
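(A side note, a sketch rather than part of the code above: for much larger generation counts, the unrolled Python loop in evolve_multiple_times can be replaced with jax.lax.fori_loop, which keeps the compiled XLA program small while still fusing the whole loop into one computation.)

Code: Select all

from functools import partial

# Rolled-loop variant of evolve_multiple_times; n is static, so jit
# compiles one program per generation count. Usage: x = evolve_n(x, 10).
@partial(jax.jit, static_argnums=1)
def evolve_n(x, n):
    return jax.lax.fori_loop(0, n, lambda i, s: Evolve(s), x)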
Yes, it seems that on Colab with a Tesla T4, TF is 1.8M iter/sec and JAX is 80M iter/sec. I've posted both versions of the code in a Colab.

Thanks for the advice! That's closer to your numbers, but still an order of magnitude behind...

A 1080 Ti is 10.6 TFLOPS while a Tesla T4 is 7.6; scaling by that ratio (80M x 10.6 / 7.6), we should get about 110M iter/sec on a 1080 Ti with JAX.

NOTE: The speedup factor I measured on the Tesla T4 was ~40x. I would still expect more from TF; then again, for a library aimed at neural nets and floats, I guess it's not that bad.
NOTE2: It took me about 1h to convert to JAX with ChatGPT.

calcyman
Moderator
Posts: 2938
Joined: June 1st, 2009, 4:32 pm

Re: Using tensorflow to run CGOL

Post by calcyman » April 12th, 2023, 5:34 pm

simsim314 wrote:
April 12th, 2023, 2:49 pm
Yes, it seems that on Colab with a Tesla T4, TF is 1.8M iter/sec and JAX is 80M iter/sec. I've posted both versions of the code in a Colab.
That's a huge improvement!

I've tried lifelib on Colab (Tesla T4) and it's 1800M iter/sec, so about 20x faster than JAX and 1000x faster than TF. Here's the colab which downloads, compiles, and runs the QuFince example code and reports the speed.
simsim314 wrote:
April 12th, 2023, 2:49 pm
NOTE2: It took me about 1h to convert to JAX with ChatGPT.
This is illustrative of the tradeoff between human time and computer time: it took many hours to design and write lifelib's CUDA code, and it relied on practices that I've learned from about 1000 hours of GPU programming experience (600 of which were consulting for a client back in 2019).

Of course, if the same rate of progress that we've observed moving from GPT-2 to GPT-3 to GPT-4 continues, then it won't be long before some future version of chatGPT (maybe GPT-6 or 7) would be able to generate CUDA code of the same quality as someone with 1000 hours of GPU programming experience, but in a fraction of the time. Then I'm pretty sure that I'd be out of a job!
What do you do with ill crystallographers? Take them to the mono-clinic!

simsim314
Posts: 1823
Joined: February 10th, 2014, 1:27 pm

Re: Using tensorflow to run CGOL

Post by simsim314 » April 13th, 2023, 2:37 am

calcyman wrote:
April 12th, 2023, 5:34 pm
That's a huge improvement!
I also think so... The only advantage this has over your example (apart from small cluster setups, where JAX works out of the box while lifelib would need some distribution library) is that it's Python; for this functionality lifelib is still C++, since the kernel is compiled with C++ and all that.

I guess Python's lifelib would need something like this line, which I guess is what makes it a 64x64 torus?

Code: Select all

apg::lifetree<uint32_t, 1> lt(1000);
I think I can run this one from Python using the CuPy library (a sketch of that route follows the snippet below):

Code: Select all

run_collision_kernel(and_mask_b, target_b, bvec, avec, 200, 20);
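For what it's worth, a minimal self-contained sketch of the CuPy route (my illustration: the toy kernel below is a stand-in, not lifelib's collision_kernel, which would need extern "C" linkage to be loaded by name like this):

Code: Select all

import cupy as cp

# Compile a toy CUDA kernel from source at runtime and launch it on
# device arrays; wrapping lifelib's kernels would follow the same shape.
step = cp.RawKernel(r'''
extern "C" __global__ void or_shift(const unsigned long long* in,
                                    unsigned long long* out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = in[i] | (in[i] << 1) | (in[i] >> 1);
}
''', 'or_shift')

rows = cp.arange(64, dtype=cp.uint64)         # 64 packed rows, dummy data
out = cp.empty_like(rows)
step((1,), (64,), (rows, out, cp.int32(64)))  # (grid, block, kernel args)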
LifeAPI had iterators that were well integrated into the CGOL search flow, over many entities put together in different stages. You would define a while/do loop over an iterator ((-7,7), (-7,7), (0,4), ...) and just loop over it, or let the user define 2-3 CGOL entities and build the iterator from them. But yes, it's just a way of filling the array with data, as in the snippet below (a Python sketch of the same iterator follows it).

Code: Select all

    std::vector<apg::pattern> g1a;
    std::vector<apg::pattern> g2a;
    std::vector<apg::pattern> g2b;

    for (int y = -7; y <= 7; y++) {
        for (int x = -7; x <= 7; x++) {
            for (int z = 0; z < 4; z++) {
                g1a.push_back(nglider[z](x, y));
                g2a.push_back(sglider[z](x, y));
            }
        }
        g2b.push_back(sglider(0, y));
    }

    std::cout << "Preparing..." << std::endl;

    std::vector<apg::bpattern> avec;

    for (uint32_t i = 0; i < g1a.size(); i++) {
        for (uint32_t j = 0; j < i; j++) {
            auto a = g1a[i];
            auto b = g1a[j];
            auto c = a + b;
            if ((c.totalPopulation() == 10) && (c[4] == a[4] + b[4])) {
                avec.push_back(c.flatlayer(0).to_bpattern());
            }
        }
    }

    std::cout << avec.size() << std::endl;

    std::vector<apg::bpattern> bvec;

    for (auto a : g2a) {
        for (auto b : g2b) {
            auto c = a + b;
            if ((c.totalPopulation() == 10) && (c[4] == a[4] + b[4])) {
                bvec.push_back((c + starting_still_life).flatlayer(0).to_bpattern());
            }
        }
    }
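In Python terms, the placement iterator described above is essentially just a product over offsets and phases; a sketch (mine, not LifeAPI code):

Code: Select all

from itertools import product

# The (-7..7, -7..7, 0..3) placement iterator from the C++ loops above:
# every (x, y) offset and glider phase z.
placements = list(product(range(-7, 8), range(-7, 8), range(4)))
print(len(placements))  # 15 * 15 * 4 = 900 candidate placements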
I guess I might want to rewrite some kind of catforce in Python using lifelib. Or maybe a stable reflector search with many randomly placed SLs, not worrying about covering the space well: just simple code and lots of search time. If the search space is large enough, there is no real need for iterators of any kind; random sampling will do. But I don't think I have the motivation to do it in C++. Maybe with ChatGPT...
calcyman wrote:
April 12th, 2023, 5:34 pm
it took many hours to design and write lifelib's CUDA code, and it relied on practices that I've learned from about 1000 hours of GPU programming experience (600 of which were consulting for a client back in 2019).
Yes, I saw you had some other CUDA libraries... very impressive, I must say. This difference in performance gives no reason to try anything else; one should just use this CUDA kernel of yours. Even though I probably thought for a good several hours about how to make a CGOL iterator, I wasn't optimizing it for CUDA, and generally speaking I somehow believed TF or JAX or whatever would be 20%-30% less efficient, not 20x less. I don't think I can compete with that, but maybe I can improve the CUDA kernel; yet that would be 5% at best, not a factor of 20... (I managed to do that to siemeks once.)
calcyman wrote:
April 12th, 2023, 5:34 pm
Of course, if the same rate of progress that we've observed moving from GPT-2 to GPT-3 to GPT-4 continues, then it won't be long before some future version of chatGPT (maybe GPT-6 or 7) would be able to generate CUDA code of the same quality as someone with 1000 hours of GPU programming experience, but in a fraction of the time. Then I'm pretty sure that I'd be out of a job!
I really hope people will understand that there is no need for jobs in a world where all problems are solved. A job is not something people choose... it's something most of us have to do in order to survive. If you like your job, then in a world where all problems are solved you will do it as a hobby. If you like seeing cool CGOL patterns, GPT-7 will write a better search utility than you can, unless you really like writing inefficient search utilities, the way chess players like playing chess without making the strongest moves.

Let me quote from an opinion article I wrote on the topic:
Chapter 9. Job Security wrote:It’s amusing to imagine someone saying, "I really miss being a switchboard operator. There was something so satisfying about plugging in all those cables and connecting people’s calls. Now that everything’s automated, I just don’t get the same sense of accomplishment." This highlights the contrast between old-fashioned manual labor and modern automation.
P.S. A little off topic, but regarding AI safety: I had an idea for how to make "safe robots" based on LLMs. The idea is to manage a setup that an LLM cannot distinguish from the real world: it would think it has a body, and would give commands to that body. We could test it first in a virtual setup, maybe millions of times, and then transfer the same exact setups to the physical world. The body would also be an LLM, trained to interpret high-level commands into some sort of "g-code", i.e. telling the various motors what to do; it would also need a converter from sensory input to text. In the end it would communicate with its brain in readable human text, and we would be able to check up on what it does. One could even run it in "operator mode", like with Teslas: you check what it's doing and approve it, and only then does the body commit the command. That way we would have some amount of safety regarding its actions in the real world... I've built several scenarios, and a prompt that builds this "prefix state machine", and tested it with GPT-3 turbo.

EDIT: With ChatGPT, I've managed to improve the performance of the CUDA kernel by 10%; it now runs at 2000M/sec. Check out this colab. I hope it still works...

lifelib/cuda2/qufince_kernel.cu

Code: Select all

#include "qufince.h"
#include "cudagol.h"
#include <chrono>

namespace apg {
__global__ void collision_kernel(const bpattern *things, uint32_t a_start, uint32_t a_end, uint32_t b_start, uint32_t b_end, int gens1, int gens2, uint32_t *sols) {

    uint32_t idx = ((threadIdx.x + blockIdx.x * blockDim.x) >> 5) + a_start;
    if (idx >= a_end) { return; }

    uint4 cx = load_bpattern(things);
    uint4 dx = load_bpattern(things + 1);

    uint4 ax = load_bpattern(things + idx);

    uint32_t bx_x = ax.x, bx_y = ax.y, bx_z = ax.z, bx_w = ax.w;

    for (uint32_t jdx = b_start; jdx < b_end; jdx++) {

        uint4 bx = load_bpattern(things + jdx);
        bx_x |= bx.x;
        bx_y |= bx.y;
        bx_z |= bx.z;
        bx_w |= bx.w;

        for (int i = 0; i < gens1; i++) {
            bx = advance_torus(bx);
        }

        uint32_t res = ((bx.x & cx.x) ^ dx.x) | ((bx.y & cx.y) ^ dx.y) | ((bx.z & cx.z) ^ dx.z) | ((bx.w & cx.w) ^ dx.w);

        if (hh::ballot_32(res != 0) == 0) {
            for (int i = 0; i < gens2; i++) {
                bx = advance_torus(bx);
            }
            res = ((bx.x & cx.x) ^ dx.x) | ((bx.y & cx.y) ^ dx.y) | ((bx.z & cx.z) ^ dx.z) | ((bx.w & cx.w) ^ dx.w);
        }

        if (hh::ballot_32(res != 0) == 0) {
            // solution found!
            if ((threadIdx.x & 31) == 0) {
                uint32_t k = hh::atomic_add(sols + 131072, 1) & 65535;
                sols[k*2] = idx;
                sols[k*2+1] = jdx;
            }
        }

        bx_x &= ~bx.x;
        bx_y &= ~bx.y;
        bx_z &= ~bx.z;
        bx_w &= ~bx.w;
    }
}

void run_collision_kernel(const bpattern& and_mask, const bpattern& target, const std::vector<bpattern> &a, const std::vector<bpattern> &b, int gens1, int gens2) {

    bpattern* patterns_device;
    bpattern* patterns_host;

    size_t n_patterns = 2 + a.size() + b.size();

    cudaMalloc(&patterns_device,   n_patterns * sizeof(bpattern));
    cudaMallocHost(&patterns_host, n_patterns * sizeof(bpattern));

    patterns_host[0] = and_mask;
    patterns_host[1] = target;

    size_t i = 2;

    for (auto x : a) { patterns_host[i++] = x; }
    for (auto x : b) { patterns_host[i++] = x; }

    cudaMemcpy(patterns_device, patterns_host, n_patterns * sizeof(bpattern), cudaMemcpyHostToDevice);

    uint32_t* sols_device;
    uint32_t* sols_host;
    cudaMalloc(&sols_device, 532480);
    cudaMallocHost(&sols_host, 532480);
    memset(sols_host, 0, 532480);
    cudaMemcpy(sols_device, sols_host, 532480, cudaMemcpyHostToDevice);

    uint32_t a_start = 2;
    uint32_t a_end = a_start + a.size();

    uint32_t chunksize = 128;
    uint32_t blocksize = 256;

    for (uint32_t bo = 0; bo < b.size(); bo += chunksize) {

        uint32_t b_start = a_end + bo;
        uint32_t b_end = a_end + hh::min(bo + chunksize, ((uint32_t) b.size()));

        uint32_t n_blocks = a.size() / (blocksize >> 5) + 1;
        uint64_t workload = gens1 * ((uint64_t) (a_end - a_start)) * ((uint64_t) (b_end - b_start));

        auto before = std::chrono::high_resolution_clock::now();
        collision_kernel<<<n_blocks, blocksize>>>(patterns_device, a_start, a_end, b_start, b_end, gens1, gens2, sols_device);
        cudaDeviceSynchronize();
        uint32_t sols_before = sols_host[131072];
        cudaMemcpy(sols_host, sols_device, 532480, cudaMemcpyDeviceToHost);
        uint32_t n_sols = sols_host[131072] - sols_before;
        auto after = std::chrono::high_resolution_clock::now();

        uint64_t microseconds = std::chrono::duration_cast<std::chrono::microseconds>(after - before).count();

        std::cerr << "# progress: " << (b_end - a_end) << "/" << b.size() << "; speed = " << (workload / microseconds) << "M iters/sec" << std::endl;

        if (n_sols > 0) {
            std::cerr << "#" << n_sols << " solutions." << std::endl;
            if (n_sols > 65536) { n_sols = 65536; }
            for (uint32_t i = 0; i < n_sols; i++) {
                uint32_t so = (i + sols_before) & 65535;
                uint32_t idx = sols_host[so*2];
                uint32_t jdx = sols_host[so*2+1];

                for (int k = 0; k < 64; k++) {
                    uint64_t row = patterns_host[idx].x[k] | patterns_host[jdx].x[k];
                    char s[66] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n";
                    for (int l = 0; l < 64; l++) {
                        s[l] = ((row >> l) & 1) ? 'o' : '.';
                    }
                    std::cout << s;
                }
                std::cout << std::endl;
            }
        }
    }

    cudaFreeHost(sols_host);
    cudaFree(sols_device);
    cudaFreeHost(patterns_host);
    cudaFree(patterns_device);

}
}
P.S. I couldn't reproduce the success; for some reason it writes "Please specify the input file.", and I lost the original file that did work... but I think the functions themselves were rewritten fine.

Con man
Posts: 3
Joined: April 27th, 2023, 10:39 am

Re: Using tensorflow to run CGOL

Post by Con man » April 27th, 2023, 12:10 pm

carsoncheng wrote:
April 11th, 2023, 7:05 am
Great! Does it run faster than LifeAPI or lifelib? If it does, it might be worth adapting search utilities like catforce to the TensorFlow utilities (of course, make a C++ version of this iterator first, but that shouldn't be too hard)
...... I am new here, what is LifeAPI?

carsoncheng
Posts: 475
Joined: June 11th, 2022, 11:24 pm

Re: Using tensorflow to run CGOL

Post by carsoncheng » April 27th, 2023, 12:28 pm

Con man wrote:
April 27th, 2023, 12:10 pm
...... I am new here, what is LifeAPI?
LifeAPI is a fast library that is used for simulating, modifying and comparing patterns in Life. Here is its repository:
https://github.com/simsim314/LifeAPI

And here is its wiki page:
https://conwaylife.com/wiki/LifeAPI
