Let's assume that we have a pattern $\alpha$ of bits of length $n$. I wish to know the probability of $\alpha$ appearing in a string of bits of length $m$ **at least twice** (where $m > n$), assuming that bit 1 and bit 0 are equiprobable (i.e., each bit is an independent fair coin flip).

For example, let's assume $n=4$, the sequence $\alpha$ is `0111`, and $m=10$. A valid occurrence would then be `0111010111`, where we have one $\alpha$ at the beginning and one at the end. Of course there are many more such strings, but how can I calculate this probability depending on $n$ and $m$?
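For small $m$, the probability can be checked by brute force, counting (possibly overlapping) occurrences over all $2^m$ strings. A sketch in Python, with a hypothetical function name; this is only feasible for small $m$:

```python
from itertools import product

def exact_prob(alpha, m):
    """Exact probability of at least two (possibly overlapping)
    occurrences of alpha in a uniform random m-bit string,
    by enumerating all 2^m strings."""
    n = len(alpha)
    hits = 0
    for bits in product("01", repeat=m):
        s = "".join(bits)
        # count occurrences at every start position, overlaps allowed
        count = sum(s[i:i + n] == alpha for i in range(m - n + 1))
        if count >= 2:
            hits += 1
    return hits / 2 ** m
```

For $\alpha=$ `0111` and $m=10$ this enumerates all 1024 strings, so it can serve as a ground truth for any approximation.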

I reckon this also depends on what $\alpha$ looks like: if, for example, $\alpha=$ `00010000`, then nothing prevents another occurrence of $\alpha$ from starting within the last 3 bits of the "old" $\alpha$, since the sequence has coinciding bits at its beginning and end.

**EDIT** (adding a comment as part of the question)

An upper-bound analysis for the error is also a valid answer. Intuition supported by simulations in R, Python, or Matlab to arrive at that bound is also valid. I am not looking at degenerate cases where $n$ is very small, say $\alpha=$ `01`; instead assume that $n$ is in the ballpark of 10-20 and $m$ is a few hundred (300-500).
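In that regime a Monte Carlo estimate is cheap to sketch. The helper names below are my own, and the result is a noisy estimate, not an exact answer:

```python
import random

def count_occurrences(s, alpha):
    # count (possibly overlapping) occurrences of alpha in s
    n = len(alpha)
    return sum(s[i:i + n] == alpha for i in range(len(s) - n + 1))

def estimate_prob(alpha, m, trials=100_000, seed=0):
    """Monte Carlo estimate of P(alpha appears at least twice
    in a uniform random m-bit string)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = "".join(rng.choice("01") for _ in range(m))
        if count_occurrences(s, alpha) >= 2:
            hits += 1
    return hits / trials
```

The standard error is roughly $\sqrt{p(1-p)/\text{trials}}$, so 100k trials pin the probability down to about three decimal places.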

**EDIT2** Changed "twice" to "at least twice" above, which was desired from the beginning.


#### Best Answer

As @NeilG stated in the comments, the desired probability can be computed exactly by defining a $(2n+1)$-state Markov chain and computing the probability of having seen 2 copies of $\alpha$ after $m$ iterations. The states in the Markov chain are of the form $(a, b)$, where $a$ is the number of times we have already seen $\alpha$ (0, 1, or 2), and $b$ is our progress toward seeing $\alpha$ at the current position. For the small example with pattern `0111`, the states are:

$$
\begin{align*}
(0,&~-) \\
(0,&~0) \\
(0,&~01) \\
(0,&~011) \\
(1,&~-) \\
(1,&~0) \\
(1,&~01) \\
(1,&~011) \\
(2,&~-)
\end{align*}
$$

The transition probabilities are defined quite naturally. For pattern `0111`, if we see a 0 from state $(0,~-)$ then we transition to $(0,~0)$, and otherwise we stay in $(0,~-)$. If we see a 1 from state $(0,~0)$ then we transition to $(0,~01)$, and otherwise we stay in $(0,~0)$, since the 0 we just saw can still serve as the first bit of the pattern. If we see a 1 from state $(0,~01)$ then we transition to $(0,~011)$, and otherwise we go back to state $(0,~0)$. Finally, if we see a 1 from state $(0,~011)$ then we transition to $(1,~-)$, and otherwise we go back to $(0,~0)$. State $(2,~-)$ is an absorbing state.

This cleanly handles overlapping patterns. If we were searching for pattern `00010000`, then upon seeing a 0 from $(0,~0001000)$ we would transition to $(1,~000)$, since the trailing `000` can already begin the next occurrence.
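These "go back" targets are always the longest suffix of the bits seen so far that is still a proper prefix of $\alpha$, exactly as in the KMP failure function. A small sketch (the function name is mine):

```python
def fallback(seen, alpha):
    """Longest proper prefix of alpha that is a suffix of `seen` --
    the progress state to fall back to after reading `seen`."""
    for k in range(min(len(seen), len(alpha) - 1), -1, -1):
        if seen.endswith(alpha[:k]):  # alpha[:0] == "" always matches
            return alpha[:k]
```

For example, after completing `00010000`, the next occurrence's progress is the fallback of the pattern with its first bit dropped, which is `000`.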

Computing the transition probabilities and iterating the Markov chain from the initial state $(0,~-)$ can be implemented without too much trouble in your favorite programming language (I'll use R here):

```r
library(expm)

best.match <- function(pattern, partial) {
  to.match <- sapply(seq_len(nchar(pattern) + 1),
                     function(s) substr(pattern, s, nchar(pattern)))
  matches <- match(to.match, partial)
  matches <- matches[!is.na(matches)]
  partial[matches[1]]
}

get.prob <- function(alpha, m) {
  n <- nchar(alpha)
  partial <- sapply(0:(n - 1), function(k) substr(alpha, 1, k))
  state.match <- rep(0:2, c(n, n, 1))
  state.pattern <- c(partial, partial, "")
  mat <- sapply(seq_along(state.match), function(i) {
    this.match <- state.match[i]
    this.pattern <- state.pattern[i]
    if (this.match == 2) {
      as.numeric(state.match == 2)  # Absorbing state
    } else {
      rowMeans(sapply(paste0(this.pattern, c("0", "1")), function(new.pattern) {
        if (new.pattern == alpha && this.match == 0) {
          as.numeric(state.match == 1 &
                     state.pattern == best.match(substr(new.pattern, 2, nchar(new.pattern)), partial))
        } else if (new.pattern == alpha && this.match == 1) {
          as.numeric(state.match == 2)
        } else {
          as.numeric(state.match == this.match &
                     state.pattern == best.match(new.pattern, partial))
        }
      }))
    }
  })
  tail((mat %^% m)[, 1], 1)
}
```

You can invoke the function by passing the pattern (as a string) and the number of iterations $m$:

```r
get.prob("0111", 10)
# [1] 0.0234375
get.prob("00010000", 200)
# [1] 0.177094
```

Though this is not a closed form solution, it does give the exact desired probabilities, so it can be used to evaluate the quality of any other bounds that are derived.
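For readers who prefer Python, the same $(2n+1)$-state chain can be sketched with NumPy. This is my own translation of the idea, not the author's code; the helper and state indexing are assumptions:

```python
import numpy as np

def lps(seen, alpha):
    # longest k < len(alpha) with alpha[:k] a suffix of `seen` (KMP-style fallback)
    for k in range(min(len(seen), len(alpha) - 1), -1, -1):
        if seen.endswith(alpha[:k]):
            return k
    return 0

def prob_at_least_twice(alpha, m):
    """P(alpha occurs >= 2 times in a uniform random m-bit string),
    via an absorbing Markov chain over (copies seen, match progress)."""
    n = len(alpha)
    N = 2 * n + 1                      # states (a, j), a in {0,1}, j in 0..n-1, plus absorbing
    idx = lambda a, j: a * n + j
    absorbing = 2 * n
    T = np.zeros((N, N))
    T[absorbing, absorbing] = 1.0
    for a in (0, 1):
        for j in range(n):
            for b in "01":             # each bit with probability 1/2
                s = alpha[:j] + b
                if s == alpha:
                    if a == 0:         # first copy done; keep overlap progress
                        T[idx(0, j), idx(1, lps(s[1:], alpha))] += 0.5
                    else:              # second copy done: absorb
                        T[idx(1, j), absorbing] += 0.5
                else:                  # fall back to longest matching prefix
                    T[idx(a, j), idx(a, lps(s, alpha))] += 0.5
    v = np.zeros(N)
    v[idx(0, 0)] = 1.0                 # start with zero copies, zero progress
    return (v @ np.linalg.matrix_power(T, m))[absorbing]
```

It reproduces the R outputs above, e.g. `prob_at_least_twice("0111", 10)` gives 0.0234375.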
