I am using this website to learn Machine Learning, and I am working on a simple project to practice. I want to predict an employee's attendance. This is the data in the file:
1,0,1,0,1,0,John
The index of the columns is the day number. John is the name of the employee, 1 means that he came to work, and 0 means he didn't. So according to this data:
Day 1: 1, Day 2: 0, Day 3: 1, Day 4: 0, Day 5: 1, Day 6: 0
I want to predict day number 7 based on the previous days. I expect a 100% chance that John will come on day 7, because there seems to be a pattern: he comes on the odd days.
I want to be able to tell the probability that he will come. The above example is easy for the sake of learning, but the data can be more complex:
Day 1: 1, Day 2: 0, Day 3: 1, Day 4: 1, Day 5: 1, Day 6: 0, ... Day 49: 1, Day 50: 1
Back to the first example, I added the information to a dataset:
import pandas
url = r'C:\Users\myUser\Desktop\employee.txt'
names = ['1', '2', '3', '4', '5', '6', 'class']
dataset = pandas.read_csv(url, names=names)
dataset
This is the result:
But I don't know how to get the prediction. I want it to tell me "the chance that John will come on day 7 based on his previous attendance".
Best Answer
The example I'm about to give avoids the use of proper sequence learning and instead goes for a simple, and hopefully more intuitive, attribute-value approach based on the Naive Bayes estimator. There are more advanced ways of solving this, but from reading your question I think the approach I'm about to present is more appropriate.
So given what we know about the previous sequence, we want to learn as much as possible about what is about to follow. Let's take the following sequence: [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1].
Based on the last value, 1, we can say that in 4 out of 6 cases a 0 will follow. The same can be done for the previous 2 values, [0, 1]; these are followed by a 1 in 2 out of 3 cases. The same holds for the previous 3 values, [1, 0, 1], which are followed by a 1 in 2 out of 3 cases. Looking at the previous 4 values, [1, 1, 0, 1], we can finally state that these are always followed by a 0, and the same goes for the previous 5 and 6 values. We don't know anything about the previous 7 values, as they have not been seen before. A natural conclusion is thus to follow the most certain predictions, those based on the previous 4, 5 and 6 values, and state that a 0 will be the next value.
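To make these counts concrete, here is a small sanity check (my addition, not part of the original answer) in plain Python; it tallies what follows each earlier occurrence of the last value:

from collections import Counter

seq = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1]

# tally the value that follows each earlier occurrence of the last value (1)
followers = Counter(seq[i + 1] for i in range(len(seq) - 1) if seq[i] == seq[-1])
print(followers)  # Counter({0: 4, 1: 2}) -> a 0 follows in 4 out of 6 cases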
One can thus parse all patterns of length n from the sequence:
def parse(history, n):
    for i in range(len(history) - n):
        yield (history[i:i + n], history[i + n])
For n=3 and the example sequence above, this will yield:
[([1, 0, 1], 1), ([0, 1, 1], 0), ([1, 1, 0], 1), ([1, 0, 1], 0), ([0, 1, 0], 1), ([1, 0, 1], 1), ([0, 1, 1], 0), ([1, 1, 0], 1)]
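Since parse is a generator, the output above can be reproduced by materialising it into a list (this call is my addition, not shown in the original answer):

sequence = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
print(list(parse(sequence, 3)))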
The next step is to learn from these patterns how many times a certain pattern is followed by a 1 or a 0:
def pattern_hash(pattern):
    # interpret the binary pattern as an integer, e.g. [1, 0, 1] -> 0b101 = 5,
    # so it can be used as an index into the count lists below
    return int(''.join(str(bit) for bit in pattern), 2)

def learn(example, n):
    occurrences = parse(example, n)
    # counts[followed_by][pattern]
    counts = [[0] * pow(2, n), [0] * pow(2, n)]
    # loop over all occurrences
    for pattern, followed_by in occurrences:
        counts[followed_by][pattern_hash(pattern)] += 1
    return counts
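As a quick illustration (my addition, not in the original answer), learning the length-2 patterns of the example sequence gives the following counts:

sequence = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
counts = learn(sequence, 2)
# rows are counts[followed_by], columns are indexed by pattern_hash:
# [0, 0] -> 0, [0, 1] -> 1, [1, 0] -> 2, [1, 1] -> 3
print(counts)  # [[0, 1, 0, 2], [0, 2, 4, 0]]
# e.g. [1, 0] (column 2) was followed by a 1 in 4 cases and by a 0 in none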
In the above we hash each pattern into its binary value for convenience, so it can be used as an index into the count lists. We can now determine the probability that the previous n values (i.e. the last n values of our example) are followed by a 1:
def probability(example, n):
    counts = learn(example, n)
    # determine the probability that a 1 will follow after `previous`
    # (the last n values), using what we know about past values
    previous = example[-n:]
    previous_hash = pattern_hash(previous)
    # determine the probability of `followed_by` = 1;
    # if the pattern was never seen (`total_count` == 0), return None
    total_count = counts[1][previous_hash] + counts[0][previous_hash]
    return counts[1][previous_hash] / total_count if total_count else None
The last step is to make a prediction based on the most certain prediction among the different pattern lengths n from the previous step. For example, probability([1, 0, 1, 1, 0, 1, 0], 3) will return None, as [0, 1, 0] has never been observed, while probability([1, 0, 1, 1, 0, 1, 0], 2) will return 1.0, since [1, 0] has so far always been followed by a 1.
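These two calls can be checked directly (a small verification of the claims above, not part of the original answer):

example = [1, 0, 1, 1, 0, 1, 0]
print(probability(example, 3))  # None -- [0, 1, 0] has never been observed
print(probability(example, 2))  # 1.0  -- [1, 0] was followed by a 1 in both of its earlier occurrences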
def prediction(example):
    max_n = len(example) - 1
    # Determine probability of `followed_by`=1 over different pattern lengths
    probabilities = [(probability(example, n), n) for n in range(1, max_n)
                     if probability(example, n) is not None]
    # If the minimum equals 0, some pattern is inconsistent with the next value being 1,
    # so predict 0; otherwise take the maximum (most certain) probability
    p = min(probabilities)[0] and max(probabilities)[0]
    # shortest pattern length that produced this probability
    n = [n for pp, n in probabilities if pp == p][0]
    most_informative = example[-n:] + [int(p)]
    return (p, most_informative)
When one of the probabilities equals zero, which means that a 1 would be fully inconsistent with a certain pattern, the predicted probability is zero. When one of the probabilities equals one, this means that a 1 is in perfect agreement with a certain pattern, and the predicted probability for a 1 is thus one.
As an extra feature, the last step will also return the most informative pattern; in the case of equal importance, the shortest one is returned. The following tests show it in action:
tests = [([0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0], 1),
         ([1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0], 1),
         ([1, 1, 0, 1, 0, 0, 1], 0),
         ([1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0], 1)]

# run tests
for test in tests:
    pattern, expectation = test
    outcome, mi = prediction(pattern)
    test_outcome = "CORRECTLY" if outcome == expectation else "INCORRECTLY"
    print("Predicted", int(outcome), test_outcome, "for", pattern, ", pattern:", mi)

>>> Predicted 1 CORRECTLY for [0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0] , pattern: [0, 1]
>>> Predicted 1 CORRECTLY for [1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0] , pattern: [0, 1]
>>> Predicted 0 CORRECTLY for [1, 1, 0, 1, 0, 0, 1] , pattern: [0, 1, 0]
>>> Predicted 1 CORRECTLY for [1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0] , pattern: [0, 1, 0, 1]
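Tying this back to the question (my addition, not part of the original answer): feeding John's attendance for days 1 to 6 into prediction gives exactly the 100% chance the asker expected:

# John's attendance for days 1-6, taken from the question's data row
john = [1, 0, 1, 0, 1, 0]
p, most_informative = prediction(john)
print(p, most_informative)  # 1.0 [0, 1] -> an absent day (0) has always been followed by a present day (1)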
The full code can be found in this gist.