# Solved – Why word2vec maximizes the cosine similarity between semantically similar words

I understand the technical details of word2vec. What I don't understand is:

Why semantically similar words should have high cosine similarity. From what I know, the quality of an embedding is usually evaluated on shallow tasks such as word analogy. I can't see the relationship between maximizing cosine similarity and producing good word embeddings.

## Why semantically similar words should have high cosine similarity

From Wikipedia on distributional semantics:

> The distributional hypothesis in linguistics is derived from the semantic theory of language usage, i.e. words that are used and occur in the same contexts tend to purport similar meanings. The underlying idea that "a word is characterized by the company it keeps" was popularized by Firth.
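To make the hypothesis concrete, here is a minimal sketch (the four-sentence corpus is made up) of the count-based view of distributional semantics: words used in the same contexts end up with similar context-count vectors.

```python
from collections import Counter

# Toy corpus (made up for illustration). "cat" and "dog" occur in the
# same kinds of contexts, so their context counts come out similar.
corpus = [
    "the cat chased the mouse",
    "the dog chased the ball",
    "the cat ate fish",
    "the dog ate meat",
]

def context_counts(target, sentences, window=1):
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[tokens[j]] += 1
    return counts

print(context_counts("cat", corpus))  # {'the': 2, 'chased': 1, 'ate': 1}
print(context_counts("dog", corpus))  # same counts as "cat"
```

In this toy corpus the two count vectors are identical; in a real corpus they would merely be close, and word2vec learns dense vectors that preserve exactly this kind of closeness.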

Why exactly cosine similarity? Because apart from being a similarity measure, which is in itself useful, it is related to euclidean distance: if $\|x\| = \|y\| = 1$ then $\|x-y\|^2 = 2 - 2\langle x, y\rangle$, because

$$\|x-y\|^2 = \langle x-y, x-y \rangle = \|x\|^2 + \|y\|^2 - 2\langle x, y\rangle$$
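The identity is easy to check numerically; a minimal sketch with two arbitrary (made-up) vectors:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale v to unit length so that dot(x, y) is the cosine similarity."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

x = normalize([3.0, 1.0, 2.0])
y = normalize([1.0, 4.0, 1.0])

diff = [a - b for a, b in zip(x, y)]
sq_dist = dot(diff, diff)   # ||x - y||^2
cos = dot(x, y)             # <x, y> = cosine similarity for unit vectors

# For unit vectors, squared euclidean distance is 2 - 2 * cosine similarity.
assert math.isclose(sq_dist, 2 - 2 * cos)
```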

To sum up: word2vec and other word embedding schemes tend to assign high cosine similarity to words that occur in similar contexts; that is, they map semantically similar words to vectors that are geometrically close in euclidean space, which is really useful, since many machine learning algorithms exploit such structure.
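One consequence of the identity above, sketched with made-up 2-d "embeddings": on unit vectors, ranking neighbours by cosine similarity and by euclidean distance gives the same ordering, so geometry-based methods such as nearest-neighbour lookup work directly on normalized word vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def euclidean(u, v):
    diff = [a - b for a, b in zip(u, v)]
    return math.sqrt(dot(diff, diff))

# Made-up embeddings, normalized to unit length.
vocab = {
    "cat": normalize([0.9, 0.4]),
    "dog": normalize([0.8, 0.5]),
    "car": normalize([0.1, 0.95]),
}
query = normalize([0.85, 0.45])

# Higher cosine = more similar; lower distance = more similar.
by_cosine = sorted(vocab, key=lambda w: -dot(vocab[w], query))
by_distance = sorted(vocab, key=lambda w: euclidean(vocab[w], query))

print(by_cosine)    # ['cat', 'dog', 'car']
print(by_distance)  # same ordering as by_cosine
```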
