Solved – Are neural nets viable for extracting date patterns from text?

I was wondering whether it is possible to train a neural network (maybe an LSTM/RNN) on different date formats, with multiple examples of each format to broaden what it learns, and then ask the network to extract the closest match to a date from raw text.

All I need is a machine learning algorithm that does the pattern matching instead of a regex. Can this be implemented?

If so, please give me an idea of how to implement it; an existing working solution (Python/R) is also fine.

Update: Added my problem statement for better understanding

My input text file (input.txt) contains dates in different formats (say I have a few thousand examples of each format), as shown below. Assume these are the only formats I expect in the raw file:

13/08/1993
23/09/2016
24/12/1992
...
13-08-1993
23-09-2016
24-12-1992
...
13-Sep-1993
23-Sep-2016
24-Dec-1992
...
Some other formats

An example raw text file is given below (it is just OCR-extracted information from a receipt):

ROCKET MEALS
23/09/2015
RECEIPT ID: #294055
Shop: #1
ITEM QTY PRICE
French Fries 33.26
Coca Cola 22.4
SUB TOTAL: 95.66
Tax: + 6.45
TOTAL: 102. 1
THANK YOU - VISIT AGAIN

Expected Output: 23/09/2015

PS: I have already used regex for this, but I am curious to know how to train such a network and how it learns the patterns.

This is technically possible, but there would be several issues you would run into, for example:

  1. What would be your output? You could use a softmax and do classification for the days and months. However, classification on the year number would limit you to a specific time range and might lead to thousands of unused or sparsely used classes.
    You could instead use regression: today is the 292nd day of the 366 (leap year!) in 2016, so the target value for the current date would be 2016 + 292/366 ≈ 2016.798. However, the network would then have to be accurate to about 7 significant figures to give the right date.
  2. What would be your input? You could just input the ASCII values of the characters and then perform some normalization. However, that would mean the input could have variable length (especially if you allow words like 'Thursday' and 'March' in your dates). You could handle that with a recurrent neural network, but training such a network to learn words like 'Thursday' is going to be an incredible amount of work for such a simple problem.
    Otherwise you will need to do pre-processing: assign numerical values to the names of the months and, while you're at it, combine consecutive digit characters into the numbers they represent. You could feed that into a neural network, but by then you'd be a long way into parsing the dates with regex. So why not finish the job using regex?
  3. A more general problem: how do you deal with ambiguity? Is 10-11-2016 a day in October or November?
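
Items 1 and 3 can be checked with a few lines of standard-library Python. This is only a sketch: `year_fraction_target` is a hypothetical helper name, and 18 October 2016 is used because it is the 292nd day of that year.

```python
from datetime import date, datetime

def year_fraction_target(d):
    """Encode a date as year + day_of_year / days_in_year (hypothetical helper)."""
    day_of_year = d.timetuple().tm_yday
    days_in_year = date(d.year, 12, 31).timetuple().tm_yday  # 365 or 366
    return d.year + day_of_year / days_in_year

d = date(2016, 10, 18)             # the 292nd day of 2016
target = year_fraction_target(d)   # 2016 + 292/366
step = 1 / 366                     # gap between two consecutive days

print(round(target, 3))  # 2016.798
print(round(step, 4))    # 0.0027

# Item 3: the same string is a valid date under both conventions.
day_first = datetime.strptime("10-11-2016", "%d-%m-%Y").date()
month_first = datetime.strptime("10-11-2016", "%m-%d-%Y").date()
print(day_first, month_first)  # 2016-11-10 2016-10-11
```

Two neighbouring days differ by roughly 0.0027 on a value near 2016, which is why the regression output needs about seven significant figures. The second part shows that nothing in the string itself tells the two readings apart.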

In conclusion, neural networks are the wrong tools for the job. However if you really wanted to, you could give yourself a whole lot of extra work and do it with the wrong tools anyway.
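
For comparison, the regex route could look roughly like the sketch below. The pattern is an assumption that covers only the three formats listed in the question (dd/mm/yyyy, dd-mm-yyyy and dd-Mon-yyyy); `DATE_RE` and `find_dates` are illustrative names, and no check is made that the day or month is in range.

```python
import re

# Assumption: English three-letter month abbreviations, as in the question.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Two digits, then "/mm/", "-mm-" or "-Mon-", then four digits.
DATE_RE = re.compile(
    rf"\b\d{{2}}(?:/\d{{2}}/|-\d{{2}}-|-(?:{MONTHS})-)\d{{4}}\b"
)

def find_dates(text):
    """Return every substring that looks like one of the three formats."""
    return DATE_RE.findall(text)

receipt = "ROCKET MEALS 23/09/2015 RECEIPT ID: #294055 Shop: #1"
print(find_dates(receipt))  # ['23/09/2015']
```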

Edit: OK, responding to OP's edit. If you wanted to do it anyway, my answer depends on how much pre-processing you allow. Without any pre-processing and assuming ASCII input, I would represent each of the 256 possible character codes (extended ASCII) as a vector of 255 zeros and a single 1 at the index corresponding to its code. So 'A' would have a 1 at index 65, 'h' would have a 1 at index 104, and all other entries would be zero.
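
A minimal sketch of this one-hot encoding, using plain lists so it runs without extra libraries. The function names are made up for illustration.

```python
def one_hot(ch, size=256):
    """Return a list of `size` zeros with a single 1 at the character's code."""
    code = ord(ch)
    if code >= size:
        raise ValueError(f"{ch!r} is outside the {size}-value range")
    vec = [0] * size
    vec[code] = 1
    return vec

def encode_text(text):
    """Encode every character of the text as its own one-hot vector."""
    return [one_hot(ch) for ch in text]

a = one_hot("A")
h = one_hot("h")
print(a.index(1), h.index(1), sum(a))  # 65 104 1
```
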
First, we want translation invariance, which simply means that the same string should be recognized as a date (or not) no matter where it appears in the text. We can achieve this by feeding n characters (each represented as a vector) into the network at a time: first characters 1 to n, then characters 2 to n+1, then 3 to n+2, and so on until we reach the end of the text.
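
The sliding window itself is easy to sketch. Note that the code uses 0-based start positions, whereas the description above counts characters from 1; `sliding_windows` is a hypothetical helper, and in a real setup each window would be encoded and passed to the classifier.

```python
def sliding_windows(text, n):
    """Yield (start, window) for every run of n consecutive characters."""
    for start in range(len(text) - n + 1):
        yield start, text[start:start + n]

text = "ROCKET MEALS 23/09/2015 RECEIPT"
windows = list(sliding_windows(text, 10))

# Exactly one of the windows lines up with the date.
for start, window in windows:
    if window == "23/09/2015":
        print(start, window)  # 13 23/09/2015
```
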
Now the obvious question is what is a good value for n. We would like n to be the same as the length of a date-string. This brings us to the issue that dates might have different lengths (compare: 10-8-16, 10-08-16, 10-08-2016 and 10-October-2016). The only solution I can come up with is to train a classifier for each length that dates might have in your training set.
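
To see how many classifiers this would take, the training dates can be grouped by length. A small sketch with the four examples above (the helper name is an assumption):

```python
from collections import defaultdict

def group_by_length(date_strings):
    """Map each string length to the date strings that have that length."""
    groups = defaultdict(list)
    for s in date_strings:
        groups[len(s)].append(s)
    return dict(groups)

examples = ["10-8-16", "10-08-16", "10-08-2016", "10-October-2016"]
by_length = group_by_length(examples)
print(sorted(by_length))  # [7, 8, 10, 15]
```

Each key would correspond to one window size n and therefore one classifier.
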
If you are willing to do pre-processing, you can make it a lot easier for the algorithm. First of all it is a bit silly to pretend '0' has as much in common with 'a' as with '1'. So instead of representing the digits as 10 different places to put a 1 in a vector, you could make the vector 9 indices shorter and represent the digits 0-9 as the numbers 1-10 in a single spot (remember, 0 means no character). That would save on the number of inputs and save the network the trouble of learning what digits are.
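
One possible layout for this shorter vector is sketched below. The choice of index 48 for the shared digit slot (where '0' used to be) and the shifting of later characters down by 9 are assumptions; any consistent layout would do.

```python
DIGIT_SLOT = 48          # assumption: reuse the position that '0' had
VECTOR_SIZE = 256 - 9    # ten digit positions collapse into one slot

def compact_encode(ch):
    """One-hot for non-digits; digits 0-9 become values 1-10 in one slot."""
    code = ord(ch)
    if code >= 256:
        raise ValueError(f"{ch!r} is outside the 256-value range")
    vec = [0] * VECTOR_SIZE
    if ch in "0123456789":
        vec[DIGIT_SLOT] = int(ch) + 1   # 0 in the slot still means "no digit"
    elif code < DIGIT_SLOT:
        vec[code] = 1                   # characters before '0' keep their index
    else:
        vec[code - 9] = 1               # characters after '9' shift down by 9
    return vec

print(compact_encode("0")[DIGIT_SLOT], compact_encode("9")[DIGIT_SLOT])  # 1 10
```
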
Next, in this specific use case, it would probably help a lot to combine 2 consecutive digits into a single vector with the numerical value in the digit slot. First, this would represent '8' and '08' as the same thing and therefore cut back on the number of different lengths dates can have (in this representation 8/8/16, 08/8/16, 8/08/16 and 08/08/16 would all have the same length). Second, this would make it very easy to learn that 24 could be a day of the month, but 42 can't. And by combining at most 2 consecutive digits, it is relatively easy to learn that 2016 and 16 are similar in the date context.
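
The digit-pairing step could be sketched as a small tokenizer. Grouping greedily from left to right is an assumption made here for simplicity, and `tokenize` is an illustrative name; the resulting tokens would still need to be turned into vectors.

```python
DIGITS = "0123456789"

def tokenize(text):
    """Turn digits into integers, combining at most two consecutive digits."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i] in DIGITS:
            j = i + 1
            if j < len(text) and text[j] in DIGITS:
                j += 1                      # take a second digit if present
            tokens.append(int(text[i:j]))   # '08' and '8' both become 8
            i = j
        else:
            tokens.append(text[i])          # other characters stay as they are
            i += 1
    return tokens

for s in ["8/8/16", "08/8/16", "8/08/16", "08/08/16"]:
    print(tokenize(s))  # [8, '/', 8, '/', 16] each time
print(tokenize("08/08/2016"))  # [8, '/', 8, '/', 20, 16]
```

With this representation the four short variants become identical, and 2016 ends with the same token, 16, as the two-digit year.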
