I have a dataset of fraudulent orders from some business. Each order has a bunch of features such as order_amount, address, state, city, phone_number, and name. Obviously a criminal would not be using his/her real name when making a fraudulent order. So I was wondering if there was any sort of machine learning strategy to identify fake names. I assume there must be some sort of underlying structure to how fake names are selected – so understanding this structure could allow me to identify them. Unless the fake names are completely randomly selected. Any thoughts on how to do this?
I know this is probably far too late for you (only 2.5 years late, I'm quite fast to answer!), but I've been looking into this problem as well, and found a paper by David Mandell Freeman (LinkedIn) that might help other people looking into this.
I haven't tested it yet since my dataset isn't labeled 'fake' or 'valid' (greatest problem ever for the learning phase), but I will soon.
Until then, here is the aforementioned paper: http://theory.stanford.edu/~dfreeman/papers/namespam.pdf
The idea is to look at the frequency not of entire names, but of substrings of the names.
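To make the substring idea a bit more concrete, here is a rough sketch of how it could look, not the paper's actual implementation. It assumes you do have some labeled names, uses character trigrams as the substrings, and compares smoothed trigram frequencies between the two classes in a Naive Bayes style with equal class priors. The names `char_ngrams` and `NgramNameScorer`, the boundary markers, the smoothing constant, and the toy names are all made up for illustration.

```python
import math
from collections import Counter


def char_ngrams(name, n=3):
    """Split a name into overlapping character n-grams.

    '^' and '$' mark the start and end so that prefixes and suffixes
    are treated as distinct substrings (an assumption of this sketch).
    """
    padded = "^" + name.lower().strip() + "$"
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNameScorer:
    """Scores names by substring frequency in 'fake' vs 'valid' examples."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha  # additive (Laplace) smoothing constant
        self.counts = {"fake": Counter(), "valid": Counter()}
        self.totals = {"fake": 0, "valid": 0}
        self.vocab = set()

    def fit(self, names, labels):
        # Count how often each n-gram occurs in each class.
        for name, label in zip(names, labels):
            grams = char_ngrams(name, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.vocab.update(grams)
        return self

    def _log_likelihood(self, grams, label):
        # Smoothed log-probability of the n-grams under one class.
        # One extra vocabulary slot is reserved for unseen n-grams.
        vocab_size = len(self.vocab) + 1
        denom = self.totals[label] + self.alpha * vocab_size
        return sum(
            math.log((self.counts[label][g] + self.alpha) / denom)
            for g in grams
        )

    def score(self, name):
        """Log-likelihood ratio: positive leans 'fake', negative leans 'valid'."""
        grams = char_ngrams(name, self.n)
        return self._log_likelihood(grams, "fake") - self._log_likelihood(grams, "valid")


# Toy, invented training data purely to show the call pattern.
valid_names = ["john smith", "mary jones", "anna brown"]
fake_names = ["asdf qwer", "xxzz qqq", "zzxq asdf"]

scorer = NgramNameScorer(n=3).fit(
    valid_names + fake_names,
    ["valid"] * len(valid_names) + ["fake"] * len(fake_names),
)
print(scorer.score("qwer asdf"))
print(scorer.score("mary smith"))
```

The point of working on substrings is that a name never seen in training can still share many trigrams with names that were, so the score is still meaningful. With real data you would want far more examples per class, and probably a held-out set to pick the n-gram length and a decision threshold.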