I have been doing supervised learning and classification with multilayer perceptron for some time. But now I need to use unsupervised learning to infer the presence of a pattern and I would need some help. It is not a hard problem by itself and I can solve it by supervised learning (by showing "examples" of the possible patterns), but I want my algorithm to be able work without giving it pattern samples to learn from.

Imagine that I have nine tiles presented as a 3×3 square, numbered from 1 to 9. They light out one at a time in a certain pattern unknown to the user. The pattern could be a sequence of anything from 2 tiles (say 3-4-3-4-3-4…) to a much longer sequence (say **2**-5-3-2-6-4-6-1-**2**-5-3-2-6-4-6-1 etc).

I need an algorithm able to infer the pattern (or "regularity") and then use this knowledge to do an educated guess of what should be the next tile to light up.

It doesn't need to learn forever though. It could be: "If you think you have a pattern and it repeats itself at least 10 times, presume you've found it".

I was reading scholarly articles on the subject but I think I need some ideas to narrow my researches first.

Thanks!

**Contents**hide

#### Best Answer

I don't think we need complicated machinery. HMMs? Not for this, I think, unless the real problem is a lot harder than your examples suggest.

If a pattern must match *exactly* to count as a repeat (as your examples suggest), this is pretty easy (in the simpler cases, it's not even a statistical problem, but an algorithmic one).

If you're speaking of cycles happening where that may not be the case, then I see a number of possible cases, of increasing complexity. Let there be $n$ values in total:

1) The sequence has a cycle that repeats throughout (as in your examples) – if the period is $p$, then $y_{t+p}=y_t$ for any integer $t,p$ such that $1leq t<t+pleq n$.

2) The sequence begins arbitrarily for some initial period $t=1,2,ldots,m$, but then begins a cycle that repeats after that point.

3) The sequence begins and ends arbitrarily, but has a substantial sequence of cycles in the middle

4) There are a number of relatively long periods of cycling, with arbitrary sequences between.

Cases 1 and 2 might be amenable to an adapted version of a cycle-detection algorithm (the usual algorithms would assume that a particular value must always be followed by the same value which may not apply here).

Case 4 might be amenable to algorithms that find repeated substrings (not necessarily just the longest repeated substring, but that might be a good place to start).

Let's look at a relatively basic approach that might work reasonably well in any of these cases and take a statistical approach anyway (given where you asked)

You can consider differences of lagged values; if that gets you a long string of zeros (what counts as long partly depends on what lag you look at), then you have a repeating sequence.

You can even work out the period from the number of zeros and the smallest lag you get it at (it will also occur at double that lag for example).

This data has an initial arbitrary sequence of length 15, a repeating cycle of length 12 (which repeats three full times) and then a final arbitrary sequence of length 17 (i.e. case 3 above):

` 1 7 8 2 5 3 7 9 8 3 3 1 3 6 9 3 9 1 6 5 2 5 6 1 2 2 2 3 9 1 6 5 2 5 6 1 2 2 2 3 9 1 6 5 2 5 6 1 2 2 2 3 9 1 6 5 2 5 6 1 2 2 2 4 4 1 4 9 `

Here's a short piece of R code (which I haven't tried to make even slightly efficient) that finds the cycle in the middle:

` maxlen=0 for (lag in 1:20) { m=max(rle(diff(y,lag=lag))$length) if (m>maxlen) { maxlen=m maxlag=lag } } > maxlag [1] 12 > maxlen [1] 36 `

That is, it tells us that the cycle is of length 12 and that after the first time through the cycle it repeats for a further 36 units of time. Note that the functions that do the hard work are `diff`

(which takes differences) and `rle`

(rle="run-length encoding") which transforms a sequence to a count of repeats of *single* values.

So what does this cycle consist of?

` r= rle(diff(y,lag=maxlag))$lengths loc= which(r==maxlen) init= sum(r[1:(loc-1)]) # this is the initial length of sequence before the cycle y[(init+1):(init+maxlag)] # this is the first time through the cycle [1] 3 9 1 6 5 2 5 6 1 2 2 2 `

… which is indeed correct.

This algorithm should also work for case 4.

To potentially – but not certainly – find such sequences even in case 4 you might also consider the shortuct of looking at the acf and/or the pacf, both of which find the cycle length immediately for our example:

The peaks at 12 suggest we try looking at `max(rle(diff(y,lag=12))$length)`

to see if that's a large number.

If you only ever have to deal with case 1, a number of simplifications to these approaches are possible.

You might also find some value in the posts here, here and here.