I have about 7,000,000 patents whose document similarity I would like to compute. Obviously, with a set that big it will take a long time to run, so I am taking a small sample of about 5,600 patent documents and preparing to use Doc2vec to find the similarity between them. In many of the examples, and in the Mikolov paper, Doc2vec is used on 100,000 documents that are all short reviews. My documents are much longer than reviews, 3,000+ words each, but I have far fewer of them. Should I still use Doc2vec on this limited sample? Or should I use something like Word Mover's Distance with Word2vec, since I have perhaps almost as many words as Mikolov's paper but fewer documents? Gensim has pre-trained Word2vec. I don't understand Doc2vec/Word2vec very well, but can I use that corpus to train Doc2vec? Does anyone have any suggestions?
Note: I have already implemented LDA/LSI and cosine similarity on TF-IDF. I'm looking to see which method gives the most accurate similarity measure so I can test similarity measures over time.
Yes, I would try Doc2Vec with that. In gensim, build_vocab() works much as it does for word2vec, and the Distributed Memory (DM) algorithm trains word vectors alongside the document vectors; plain DBOW does not make word vectors. After training with DM you can test the word similarities and see how they compare, and then test the document similarities as well.
Another word embedding method that is supposed to be good is GloVe. There are some good tutorials in the blogosphere for doc2vec, as well as the gensim IPython notebook, which you could follow to get going. My intuition is that doc2vec works better with short texts like tweets than with longer documents, but you can try it in any case.