# Solved – Probability of n-bit sequence appearing at least twice in m-bit sequence

Lets assume that we have a pattern \$alpha\$ of bits of length \$n\$. Then I wish to know what the probability is of \$alpha\$ appearing on a string of bits of length \$m\$ at least twice (where \$m > n\$), assuming that bit 1 and bit 0 are equiprobable (bin distribution).

So for example, lets assume \$n=4\$ and the sequence \$alpha\$ is `0111`, and \$m=10\$. So a valid ocurrence would be `011101010111`, where we would have an \$alpha\$ at the beginning and one at the end. Of course there are many more, but how can I calculate this probability depending on \$n\$ and \$m\$ ?

I reckon this also depends on how \$alpha\$ looks like, since if for example \$alpha=\$ `00010000`, then nothing speaks against starting another ocurrence of \$alpha\$ in the last 3 bits of the "old" \$alpha\$, as the sequence some coinciding bits at the beginning and end.

An upper bound analysis for the error is also a valid answer. Intuition with simulations in R, Python or Matlab to arrive to that bound are also valid. I am not looking at ridiculous cases where \$n\$ is very small, say \$alpha\$=`01`, but instead assume that \$n\$ is in the ballpark 10-20 and \$m\$ is a handful of hundreds (300-500).

EDIT2 changed twice for at least twice above, which was desired from the beginning.

Contents

As @NeilG stated in the comments, the desired probability can still be computed exactly by defining a (2n+1)-state markov chain and computing the probability of having seen 2 copies of \$alpha\$ after m iterations. The states in the markov chain will be of the form \$(a, b)\$, where \$a\$ is the number of times we have already seen \$alpha\$ (0, 1, or 2), and \$b\$ is our progress in seeing \$alpha\$ at the current position. For the small example provided with pattern `0111`, the states are:

begin{align*} (0,&~-) \ (0,&~0) \ (0,&~01) \ (0,&~011) \ (1,&~-) \ (1,&~0) \ (1,&~01) \ (1,&~011) \ (2,&~-) \ end{align*}

The transition probabilities are defined quite naturally. For pattern `0111`, if we see a 0 from state \$(0,~-)\$ then we transition to \$(0,~0)\$ and otherwise we stay in \$(0,~-)\$. If we see a 1 from state \$(0,~0)\$ then we transition to \$(0,~01)\$ and otherwise we stay in \$(0,~0)\$, as we still have the first 0 in the pattern despite not getting a 1. If we see a 1 from state \$(0,~01)\$ then we transition to \$(0,~011)\$ and otherwise we go back to state \$(0,~0)\$. Finally, if we see a 1 from state \$(0,~011)\$ then we transition to \$(1,~-)\$, and otherwise we go back to \$(0,~0)\$. State \$(2,~-)\$ is an absorbing state.

This cleanly handles overlapping patterns. If we were searching for pattern `00010000`, then upon getting a 0 from \$(0,~0001000)\$ we would transition to \$(1,~000)\$.

Computing the transition probabilities and iterating the markov chain from the initial state of \$(0,~-)\$ can be implemented without too much trouble in your favorite programming language (I'll use R here):

``library(expm) best.match <- function(pattern, partial) {   to.match <- sapply(seq_len(nchar(pattern)+1), function(s) substr(pattern, s, nchar(pattern)))   matches <- match(to.match, partial)   matches <- matches[!is.na(matches)]   partial[matches] } get.prob <- function(alpha, m) {   n <- nchar(alpha)   partial <- sapply(0:(n-1), function(k) substr(alpha, 1, k))   state.match <- rep(0:2, c(n, n, 1))   state.pattern <- c(partial, partial, "")   mat <- sapply(seq_along(state.match), function(i) {     this.match <- state.match[i]     this.pattern <- state.pattern[i]     if (this.match == 2) {       as.numeric(state.match == 2)  # Absorbing state     } else {       rowMeans(sapply(paste0(this.pattern, c("0", "1")), function(new.pattern) {         if (new.pattern == alpha && this.match == 0) {           as.numeric(state.match == 1 &                      state.pattern == best.match(substr(new.pattern, 2, nchar(new.pattern)),                                                  partial))         } else if (new.pattern == alpha && this.match == 1) {           as.numeric(state.match == 2)         } else {           as.numeric(state.match == this.match &                      state.pattern == best.match(new.pattern, partial))         }       }))     }   })   tail((mat %^% m)[,1], 1) } ``

You can invoke the function by passing the pattern (as a string) and the number of iterations m:

``get.prob("0111", 10) #  0.0234375 get.prob("00010000", 200) #  0.177094 ``

Though this is not a closed form solution, it does give the exact desired probabilities, so it can be used to evaluate the quality of any other bounds that are derived.

Rate this post