dvgrn wrote: It's only mostly true that [apgcodes] can grow without bounds. A pattern over 40 cells wide would need a definite official ruling on how to encode longer strings of 0s. Quoting [...] from the apgcode article:
(Theoretically, runs of more than 39 zeroes should be replaced by 'yz' followed by the coding for the remaining zeroes. At the moment apgsearch labels everything larger than 40x40 as 'oversized' and refuses to process it.)
I'd agree that one or more yz, followed by one more y-something to get to the required length, is the easiest and most obvious way to extend the apgsearch format to encode patterns bigger than 40x40. But it might be worth thinking about a little bit, to see if it might be worth adding a marker that allows arbitrarily large (base 40?) counts to be encoded. At the moment, sufficiently large sparse patterns can be encoded much more efficiently in RLE format, because RLE can encode long distances of emptiness in a logarithmic number of characters, where apgcode is currently stuck with a linear encoding... a much more efficient encoding, but still linear.
I think most current Golly scripts that deal with apgcodes are lifted fairly directly from the old Python apgsearch, so they would similarly fail to handle patterns that need more than thirty-nine 0s, so it would be good to publish "official" encoder and decoder scripts somewhere -- maybe in Lua as well as Python.
And I replied:
Apple Bottom wrote: I think this is an issue of decoding vs. encoding. Given an apgcode of the form ...yzy1..., it's obvious that this corresponds to a run of 44 zeros, and even absent a formal specification or test suite I think decoders should treat it as such. It's the encoding part that's problematic, since there's no Word of God on which apgcode should be canonical.
Indeed, although a run of 44 zeros can be encoded as yzy1, it can also be encoded as y1yz, which retains the overall code's length but precedes it lexicographically, so a crafty encoder could emit this instead when faced with such a pattern. In fact it could also emit wxyz. (0y0yz would yield a longer code and should thus be rejected.)
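To make the ambiguity concrete, here's a minimal decoding sketch; the token values ('0' = 1 dead cell, 'w' = 2, 'x' = 3, 'y0'..'yz' = 4..39) are as given in the apgcode article, and the function names are my own, not part of any official tooling:

```python
# Minimal sketch of decoding zero-run tokens in extended Wechsler format.
# Only dead-cell tokens are handled; live-cell characters are not.

def zero_run_length(s, i):
    """Return (run_length, next_index) for the dead-cell token at s[i],
    or (None, i) if s[i] is not a dead-cell token."""
    c = s[i]
    if c == '0':
        return 1, i + 1
    if c == 'w':
        return 2, i + 1
    if c == 'x':
        return 3, i + 1
    if c == 'y':
        # 'y' is followed by one base-36 digit: y0 = 4 zeros ... yz = 39 zeros
        return 4 + int(s[i + 1], 36), i + 2
    return None, i

def total_zeros(fragment):
    """Sum the dead cells encoded by a fragment of zero-run tokens."""
    total, i = 0, 0
    while i < len(fragment):
        run, i = zero_run_length(fragment, i)
        if run is None:
            raise ValueError("not a zero-run token: " + fragment[i])
        total += run
    return total
```

All four spellings above decode to the same run: total_zeros returns 44 for each of "yzy1", "y1yz", "wxyz" and "0y0yz", which is exactly why a canonical choice among them is needed.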
I'm tempted to suggest encoding long runs of zeros using a base-36 notation, where y# encodes # + 4 consecutive zeros, with # given in base 36; but since arbitrary base-36 digits that aren't part of the count may follow, this would require a stop marker. No existing code contains such a marker, so this would break them all; it's not feasible.
What is feasible is repurposing unused letters to indicate how many base-36 digits the number # has. For instance, if the z digit were used, with n z's indicating n + 1 digits, code fragments such as yzzz#### would be unambiguous, and the codes y0 ... yy would remain unchanged. A run of 44 zeros would be encoded as yz14 with this scheme.
This could still be done: a quick check of my files reveals that there are not currently any objects on Catagolue containing either yz or yy (though there are two containing yx: an xp412 in B358/S23, and one variant of the ship-pulling xq190 in B38/S23).
So there are two ways out, as I see it:
- clarify how runs of more than 39 zeroes are encoded (greedy, e.g. yzy1; based on length/lexicographic order, e.g. wxyz; or other, to be specified), or
- repurpose yz... to encode longer runs (e.g. yz14).
I personally prefer the first, since it requires no changes to existing decoders, and since I'm not convinced people will want to encode extremely large patterns as apgcodes anyway (as you point out, file formats such as RLE are better suited to that task). For that matter, I also prefer the greedy algorithm, since it would be easiest to implement.
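The greedy encoder really is a one-liner's worth of work; a sketch (again, the function name is mine, and the token values follow the apgcode article):

```python
DIGITS36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def greedy_zero_run(n):
    """Greedily encode a run of n dead cells: emit as many 'yz' tokens
    (39 zeros each) as fit, then one token for the remainder."""
    out = []
    while n >= 39:
        out.append("yz")
        n -= 39
    if n == 1:
        out.append("0")          # a single dead cell
    elif n == 2:
        out.append("w")
    elif n == 3:
        out.append("x")
    elif n >= 4:
        out.append("y" + DIGITS36[n - 4])
    return "".join(out)
```

This produces yzy1 for 44 zeros, matching the "obvious" extension dvgrn describes, and any decoder that simply sums zero-run tokens already handles its output.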
What do others think? Although it's not urgent, I think this is a matter that should be dealt with at some point; one of the nice features of apgcodes is that although there are many ways to encode the same object, only one such code is THE canonical apgcode.
Since this is what allows us to meaningfully use apgcodes to discuss patterns in the first place, I'd like to see this feature retained for larger patterns: out of the various possible encodings in the presence of long runs of zeros, there needs to be one canonical one.
Thoughts?