Emergence of attention mechanisms during compression
It just dawned on me. When we want to compress, we have to do it incrementally in one way or another, passing through descriptions of intermediate length along the way. There will also be different paths we can pursue during compression, leading to different competing programs that aspire to explain our data. This ambiguity stems from the fact that in practice we have no immediate guarantee that the next greedy step is the correct one. Therefore, one keeps several half-built versions of programs in memory in parallel. Further, in order to do proper induction, not only the shortest program is of interest but several other short programs as well. This leads to the problem of having to keep track of an array of programs and of deciding, at each step, which part of which program is going to be compressed further.
This decision has to be based on an internal estimate of how likely it is that a particular program can be compressed further with reasonable effort. In other words, optimal compression will choose the program that, on the one hand, still offers much uncompressed data and, on the other hand, is sufficiently easy to compress further. A program with little left to compress is the analogue of a task that is too easy, and one that resists further compression is the analogue of a task that is too complex. This is exactly the search for “intermediate complexity” tasks that cognitive forms of attention employ! These attention mechanisms focus on tasks that are neither too easy nor too complex. We see that this property emerges automatically from the compression requirement.
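
To make the selection rule concrete, here is a toy sketch in Python. Everything in it is my own assumption for illustration: the names `Candidate`, `residual_bits` and `p_compress` are hypothetical, and multiplying the two quantities is just one simple way to combine "much uncompressed data" with "easy enough to compress further", not a derived optimum.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A half-built program competing to explain the data (hypothetical structure)."""
    name: str
    residual_bits: float  # assumed: bits of data this program has not yet compressed
    p_compress: float     # assumed: estimated chance that the next step shortens it


def priority(c: Candidate) -> float:
    # Assumed scoring rule: expected payoff of spending the next step here.
    # Much uncompressed data AND a good chance of progress are both needed.
    return c.residual_bits * c.p_compress


def pick_next(candidates: List[Candidate]) -> Candidate:
    # Attend to the candidate with the highest expected payoff.
    return max(candidates, key=priority)


candidates = [
    Candidate("nearly_done", residual_bits=10.0, p_compress=0.9),     # too easy: little left
    Candidate("intermediate", residual_bits=400.0, p_compress=0.5),   # worth attending to
    Candidate("intractable", residual_bits=1000.0, p_compress=0.01),  # too hard: low chance
]

print(pick_next(candidates).name)  # prints "intermediate"
```

Under this assumed rule the nearly finished program loses because there is little left to gain, and the intractable one loses because progress is unlikely, so attention lands on the candidate of intermediate complexity.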