Solved – Probability of n-bit sequence appearing at least twice in m-bit sequence

Lets assume that we have a pattern $alpha$ of bits of length $n$. Then I wish to know what the probability is of $alpha$ appearing on a string of bits of length $m$ at least twice (where $m > n$), assuming that bit 1 and bit 0 are equiprobable (bin distribution).

So for example, lets assume $n=4$ and the sequence $alpha$ is 0111, and $m=10$. So a valid ocurrence would be 011101010111, where we would have an $alpha$ at the beginning and one at the end. Of course there are many more, but how can I calculate this probability depending on $n$ and $m$ ?

I reckon this also depends on how $alpha$ looks like, since if for example $alpha=$ 00010000, then nothing speaks against starting another ocurrence of $alpha$ in the last 3 bits of the "old" $alpha$, as the sequence some coinciding bits at the beginning and end.

EDIT (to add comment as part of the question requirements)

An upper bound analysis for the error is also a valid answer. Intuition with simulations in R, Python or Matlab to arrive to that bound are also valid. I am not looking at ridiculous cases where $n$ is very small, say $alpha$=01, but instead assume that $n$ is in the ballpark 10-20 and $m$ is a handful of hundreds (300-500).

EDIT2 changed twice for at least twice above, which was desired from the beginning.

As @NeilG stated in the comments, the desired probability can still be computed exactly by defining a (2n+1)-state markov chain and computing the probability of having seen 2 copies of $alpha$ after m iterations. The states in the markov chain will be of the form $(a, b)$, where $a$ is the number of times we have already seen $alpha$ (0, 1, or 2), and $b$ is our progress in seeing $alpha$ at the current position. For the small example provided with pattern 0111, the states are:

begin{align*} (0,&~-) \ (0,&~0) \ (0,&~01) \ (0,&~011) \ (1,&~-) \ (1,&~0) \ (1,&~01) \ (1,&~011) \ (2,&~-) \ end{align*}

The transition probabilities are defined quite naturally. For pattern 0111, if we see a 0 from state $(0,~-)$ then we transition to $(0,~0)$ and otherwise we stay in $(0,~-)$. If we see a 1 from state $(0,~0)$ then we transition to $(0,~01)$ and otherwise we stay in $(0,~0)$, as we still have the first 0 in the pattern despite not getting a 1. If we see a 1 from state $(0,~01)$ then we transition to $(0,~011)$ and otherwise we go back to state $(0,~0)$. Finally, if we see a 1 from state $(0,~011)$ then we transition to $(1,~-)$, and otherwise we go back to $(0,~0)$. State $(2,~-)$ is an absorbing state.

This cleanly handles overlapping patterns. If we were searching for pattern 00010000, then upon getting a 0 from $(0,~0001000)$ we would transition to $(1,~000)$.

Computing the transition probabilities and iterating the markov chain from the initial state of $(0,~-)$ can be implemented without too much trouble in your favorite programming language (I'll use R here):

library(expm) best.match <- function(pattern, partial) {   to.match <- sapply(seq_len(nchar(pattern)+1), function(s) substr(pattern, s, nchar(pattern)))   matches <- match(to.match, partial)   matches <- matches[!is.na(matches)]   partial[matches[1]] } get.prob <- function(alpha, m) {   n <- nchar(alpha)   partial <- sapply(0:(n-1), function(k) substr(alpha, 1, k))   state.match <- rep(0:2, c(n, n, 1))   state.pattern <- c(partial, partial, "")   mat <- sapply(seq_along(state.match), function(i) {     this.match <- state.match[i]     this.pattern <- state.pattern[i]     if (this.match == 2) {       as.numeric(state.match == 2)  # Absorbing state     } else {       rowMeans(sapply(paste0(this.pattern, c("0", "1")), function(new.pattern) {         if (new.pattern == alpha && this.match == 0) {           as.numeric(state.match == 1 &                      state.pattern == best.match(substr(new.pattern, 2, nchar(new.pattern)),                                                  partial))         } else if (new.pattern == alpha && this.match == 1) {           as.numeric(state.match == 2)         } else {           as.numeric(state.match == this.match &                      state.pattern == best.match(new.pattern, partial))         }       }))     }   })   tail((mat %^% m)[,1], 1) } 

You can invoke the function by passing the pattern (as a string) and the number of iterations m:

get.prob("0111", 10) # [1] 0.0234375 get.prob("00010000", 200) # [1] 0.177094 

Though this is not a closed form solution, it does give the exact desired probabilities, so it can be used to evaluate the quality of any other bounds that are derived.

Similar Posts:

Rate this post

Leave a Comment