calcyman wrote: ↑April 12th, 2023, 5:34 pm
That's a huge improvement!
I also think so... the only advantage this has over your example (apart from on small cluster hardware, where JAX works out of the box while lifelib would need some distribution library) is that it's Python, and for this functionality lifelib is still C++, because the kernel is compiled with C++ and all that.
I guess if Python's lifelib could have something like
this line, which I guess is what makes it a 64x64 torus?
Code: Select all
apg::lifetree<uint32_t, 1> lt(1000);
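For reference, the 64x64 torus semantics are easy to model in plain Python/NumPy (this is only an illustration of the topology, not lifelib's API): wrapping both axes with np.roll is exactly what makes the universe toroidal.

```python
import numpy as np

def life_step_torus(grid):
    """One Life generation on a torus: np.roll wraps both axes,
    so anything leaving one edge re-enters at the opposite edge."""
    neighbours = sum(np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                     if (dy, dx) != (0, 0))
    # B3/S23: born with 3 neighbours, survives with 2 or 3
    return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(grid.dtype)
```

On a 64x64 torus a glider translates by (1, 1) every 4 generations, so it returns to its starting cells after 256 generations instead of escaping to infinity.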
I think I can run
this one from Python using the cupy library:
Code: Select all
run_collision_kernel(and_mask_b, target_b, bvec, avec, 200, 20);
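To feed run_collision_kernel from Python (via cupy or ctypes), the patterns would first have to be marshalled into the bpattern layout; judging from the printing loop in run_collision_kernel below, that is 64 uint64 rows, with bit l of row k being cell (l, k). A hypothetical packer (pack_bpattern is my name for it, not lifelib's):

```python
import numpy as np

def pack_bpattern(cells):
    """Pack live cells into the assumed 64x64 bpattern layout:
    bit l of row k is the cell at (x=l, y=k)."""
    rows = np.zeros(64, dtype=np.uint64)
    for x, y in cells:
        assert 0 <= x < 64 and 0 <= y < 64
        rows[y] |= np.uint64(1) << np.uint64(x)
    return rows
```

Each pattern is then 512 bytes (rows.tobytes()), so the avec/bvec arrays could be concatenated into one contiguous buffer and copied to the device.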
LifeAPI had iterators that were well integrated into the CGoL search flow, iterating over many entities combined at different stages. You would define a while loop over an iterator ((-7, 7), (-7, 7), (0, 4), ...) and then just loop over it, or let the user define 2-3 CGoL entities and build the iterator from them. But yes, it's just filling the array with data.
Code: Select all
std::vector<apg::pattern> g1a;
std::vector<apg::pattern> g2a;
std::vector<apg::pattern> g2b;
for (int y = -7; y <= 7; y++) {
    for (int x = -7; x <= 7; x++) {
        for (int z = 0; z < 4; z++) {
            g1a.push_back(nglider[z](x, y));
            g2a.push_back(sglider[z](x, y));
        }
    }
    g2b.push_back(sglider(0, y));
}
std::cout << "Preparing..." << std::endl;
std::vector<apg::bpattern> avec;
for (uint32_t i = 0; i < g1a.size(); i++) {
    for (uint32_t j = 0; j < i; j++) {
        auto a = g1a[i];
        auto b = g1a[j];
        auto c = a + b;
        if ((c.totalPopulation() == 10) && (c[4] == a[4] + b[4])) {
            avec.push_back(c.flatlayer(0).to_bpattern());
        }
    }
}
std::cout << avec.size() << std::endl;
std::vector<apg::bpattern> bvec;
for (auto a : g2a) {
    for (auto b : g2b) {
        auto c = a + b;
        if ((c.totalPopulation() == 10) && (c[4] == a[4] + b[4])) {
            bvec.push_back((c + starting_still_life).flatlayer(0).to_bpattern());
        }
    }
}
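In Python the iterator idea costs nothing: itertools.product replaces the hand-written nested loops above directly (a sketch, with the offsets and phase count taken from the loops above):

```python
from itertools import product

# The ((-7, 7), (-7, 7), (0, 4)) iterator from the nested loops above:
# 15 x-offsets * 15 y-offsets * 4 glider phases = 900 placements.
placements = list(product(range(-7, 8), range(-7, 8), range(4)))
```

Filling g1a/g2a then becomes a single comprehension over placements instead of three nested loops.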
I guess I might want to rewrite some kind of catforce in Python using lifelib. Or maybe a stable reflector search with many randomly placed SLs, not worrying about good coverage of the space: just simple code and lots of search time. If the search space is large enough, there is no real need for iterators of any kind; random sampling is enough. But I don't think I have the motivation to do it in C++. Maybe with chatGPT...
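The random-sampling version of such a search could look something like this (a sketch: the is_interesting predicate and the offset ranges stand in for the real lifelib place-collide-and-inspect step):

```python
import random

def random_search(is_interesting, trials, seed=0):
    """Randomly sample placements instead of enumerating them.
    'is_interesting' stands in for the real collide-and-inspect step
    (e.g. place an SL at (x, y), run the pattern, test the outcome)."""
    rng = random.Random(seed)          # seeded, so runs are reproducible
    hits = []
    for _ in range(trials):
        x = rng.randrange(-30, 31)     # hypothetical offset range for the SL
        y = rng.randrange(-30, 31)
        phase = rng.randrange(4)       # glider phase
        if is_interesting(x, y, phase):
            hits.append((x, y, phase))
    return hits
```

With a large enough search space, the coverage lost to duplicate samples is negligible, which is the point above: simplicity of code traded for search time.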
calcyman wrote: ↑April 12th, 2023, 5:34 pm
it took many hours to design and write lifelib's CUDA code, and it relied on practices that I've learned from about 1000 hours of GPU programming experience (600 of which were consulting for a client back in 2019).
Yes, I saw you had some other CUDA libraries... very impressive, I must say. This difference in performance leaves no reason to try anything else; one should just use this CUDA kernel of yours. Even though I probably thought for a good several hours about how to make the CGoL iterator, I wasn't optimizing it for CUDA, and I somehow believed TF or JAX or whatever would be 20%-30% less efficient, not 20x. I don't think I can compete with that, but maybe I can improve the CUDA kernel; yet that would be at best 5%, not a factor of 20... (I managed to do that to siemeks once.)
calcyman wrote: ↑April 12th, 2023, 5:34 pm
Of course, if the same rate of progress that we've observed moving from GPT-2 to GPT-3 to GPT-4 continues, then it won't be long before some future version of chatGPT (maybe GPT-6 or 7) would be able to generate CUDA code of the same quality as someone with 1000 hours of GPU programming experience, but in a fraction of the time. Then I'm pretty sure that I'd be out of a job!
I really hope people will understand that there is no need for jobs in a world where all problems are solved. A job is not something people choose... it's something most of us have to do in order to survive. If you like your job, then in a world where all problems are solved you will do it as a hobby. If you liked to see cool CGoL patterns, GPT-7 will write a search utility better than you can, unless you really like writing inefficient search utilities, the way chess players like to play chess without making the strongest moves.
Let me quote from an
opinion article I wrote on the topic:
Chapter 9. Job Security wrote:It’s amusing to imagine someone saying, "I really miss being a switchboard operator. There was something so satisfying about plugging in all those cables and connecting people’s calls. Now that everything’s automated, I just don’t get the same sense of accomplishment." This highlights the contrast between old-fashioned manual labor and modern automation.
P.S. A little off topic, but regarding AI safety: I had an idea for how to make "safe robots" based on LLMs. The idea is to manage a setup that an LLM cannot distinguish from the real world: it would think it has a body, and would give commands to that body. We could test it first in a virtual setup, maybe millions of times, and then transfer exactly the same setups to the physical world. The body would also be an LLM, but one trained to interpret "high-level commands" into a sort of "g-code", i.e. telling the different motors what to do; it would also need a converter from sensory input to text. In the end it would communicate with its brain in readable human text, so we would be able to check up on what it does. One could even run it in "operator mode", as with Teslas: you check what it's doing and approve it, and only then does the body commit the command. That would give us some amount of safety regarding its actions in the real world...
I've built several scenarios and a prompt that builds this "prefix state machine", and tested it with gpt3 turbo.
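The "operator mode" part can be sketched as a plain control loop (everything here is hypothetical: brain, body and approve are stand-ins for the LLM, the low-level controller and the human checkpoint):

```python
def operator_loop(brain, body, approve, steps):
    """Operator-mode safety loop: the brain emits a human-readable
    command; only approved commands are committed by the body."""
    transcript = []
    observation = body.sense()
    for _ in range(steps):
        command = brain(observation)                 # high-level command, plain text
        if approve(command):                         # human checks and approves
            observation = body.execute(command)      # body turns text into actions
            transcript.append(("committed", command))
        else:
            transcript.append(("rejected", command))
    return transcript
```

Because every brain-to-body message is plain text and passes through approve before execute, the transcript is a complete, human-readable record of what the robot tried to do.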
EDIT: With chatGPT, I've managed to improve the performance of the CUDA kernel by 10%; it now runs at 2000M/sec. Check out this
colab. I hope it still works...
lifelib/cuda2/qufince_kernel.cu
Code: Select all
#include "qufince.h"
#include "cudagol.h"
#include <chrono>
namespace apg {
__global__ void collision_kernel(const bpattern *things, uint32_t a_start, uint32_t a_end, uint32_t b_start, uint32_t b_end, int gens1, int gens2, uint32_t *sols) {

    // One warp (32 threads) handles one pattern from the 'a' range;
    // each lane holds a 16-byte slice of the 64x64 bit-pattern.
    uint32_t idx = ((threadIdx.x + blockIdx.x * blockDim.x) >> 5) + a_start;
    if (idx >= a_end) { return; }

    uint4 cx = load_bpattern(things);      // AND mask
    uint4 dx = load_bpattern(things + 1);  // target
    uint4 ax = load_bpattern(things + idx);
    uint32_t ax_x = ax.x, ax_y = ax.y, ax_z = ax.z, ax_w = ax.w;

    for (uint32_t jdx = b_start; jdx < b_end; jdx++) {
        uint4 bx = load_bpattern(things + jdx);
        // superimpose pattern a (kept in registers) onto pattern b,
        // so the combined collision is what gets advanced:
        bx.x |= ax_x;
        bx.y |= ax_y;
        bx.z |= ax_z;
        bx.w |= ax_w;
        for (int i = 0; i < gens1; i++) {
            bx = advance_torus(bx);
        }
        // nonzero iff the masked result differs from the target:
        uint32_t res = ((bx.x & cx.x) ^ dx.x) | ((bx.y & cx.y) ^ dx.y) | ((bx.z & cx.z) ^ dx.z) | ((bx.w & cx.w) ^ dx.w);
        if (hh::ballot_32(res != 0) == 0) {
            for (int i = 0; i < gens2; i++) {
                bx = advance_torus(bx);
            }
            res = ((bx.x & cx.x) ^ dx.x) | ((bx.y & cx.y) ^ dx.y) | ((bx.z & cx.z) ^ dx.z) | ((bx.w & cx.w) ^ dx.w);
        }
        if (hh::ballot_32(res != 0) == 0) {
            // solution found! lane 0 of the warp records the (a, b) pair:
            if ((threadIdx.x & 31) == 0) {
                uint32_t k = hh::atomic_add(sols + 131072, 1) & 65535;
                sols[k*2] = idx;
                sols[k*2+1] = jdx;
            }
        }
    }
}
void run_collision_kernel(const bpattern& and_mask, const bpattern& target, const std::vector<bpattern> &a, const std::vector<bpattern> &b, int gens1, int gens2) {

    // Pack [and_mask, target, a..., b...] into a single buffer of bpatterns:
    bpattern* patterns_device;
    bpattern* patterns_host;
    size_t n_patterns = 2 + a.size() + b.size();
    cudaMalloc(&patterns_device, n_patterns * sizeof(bpattern));
    cudaMallocHost(&patterns_host, n_patterns * sizeof(bpattern));
    patterns_host[0] = and_mask;
    patterns_host[1] = target;
    size_t i = 2;
    for (auto x : a) { patterns_host[i++] = x; }
    for (auto x : b) { patterns_host[i++] = x; }
    cudaMemcpy(patterns_device, patterns_host, n_patterns * sizeof(bpattern), cudaMemcpyHostToDevice);

    // Solution buffer: room for 65536 (idx, jdx) pairs, plus a counter
    // at word offset 131072:
    uint32_t* sols_device;
    uint32_t* sols_host;
    cudaMalloc(&sols_device, 532480);
    cudaMallocHost(&sols_host, 532480);
    memset(sols_host, 0, 532480);
    cudaMemcpy(sols_device, sols_host, 532480, cudaMemcpyHostToDevice);

    uint32_t a_start = 2;
    uint32_t a_end = a_start + a.size();
    uint32_t chunksize = 128;
    uint32_t blocksize = 256;

    // Launch the kernel on chunks of b so progress can be reported:
    for (uint32_t bo = 0; bo < b.size(); bo += chunksize) {
        uint32_t b_start = a_end + bo;
        uint32_t b_end = a_end + hh::min(bo + chunksize, ((uint32_t) b.size()));
        uint32_t n_blocks = a.size() / (blocksize >> 5) + 1;
        uint64_t workload = gens1 * ((uint64_t) (a_end - a_start)) * ((uint64_t) (b_end - b_start));
        auto before = std::chrono::high_resolution_clock::now();
        collision_kernel<<<n_blocks, blocksize>>>(patterns_device, a_start, a_end, b_start, b_end, gens1, gens2, sols_device);
        cudaDeviceSynchronize();
        uint32_t sols_before = sols_host[131072];
        cudaMemcpy(sols_host, sols_device, 532480, cudaMemcpyDeviceToHost);
        uint32_t n_sols = sols_host[131072] - sols_before;
        auto after = std::chrono::high_resolution_clock::now();
        uint64_t microseconds = std::chrono::duration_cast<std::chrono::microseconds>(after - before).count();
        std::cerr << "# progress: " << (b_end - a_end) << "/" << b.size() << "; speed = " << (workload / microseconds) << "M iters/sec" << std::endl;
        if (n_sols > 0) {
            std::cerr << "# " << n_sols << " solutions." << std::endl;
            if (n_sols > 65536) { n_sols = 65536; }
            for (uint32_t i = 0; i < n_sols; i++) {
                uint32_t so = (i + sols_before) & 65535;
                uint32_t idx = sols_host[so*2];
                uint32_t jdx = sols_host[so*2+1];
                // Print the superimposed pair as a 64x64 grid of '.'/'o':
                for (int k = 0; k < 64; k++) {
                    uint64_t row = patterns_host[idx].x[k] | patterns_host[jdx].x[k];
                    char s[66] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n";
                    for (int l = 0; l < 64; l++) {
                        s[l] = ((row >> l) & 1) ? 'o' : '.';
                    }
                    std::cout << s;
                }
                std::cout << std::endl;
            }
        }
    }

    cudaFreeHost(sols_host);
    cudaFree(sols_device);
    cudaFreeHost(patterns_host);
    cudaFree(patterns_device);
}
}
P.S. I couldn't reproduce the success: for some reason it writes "Please specify the input file.", and I lost the original file that did work... but I think the functions are rewritten fine.