Word embeddings have been an exciting topic in natural language processing in recent years. Mikolov et al. (2013) showed that word embeddings (mappings from words to Euclidean space) constructed by training neural networks surprisingly exhibit linear structure: analogous pairs of words have similar differences between their embeddings. Subsequently, Levy and Goldberg (2014) and Pennington et al. (2014) tried to explain why such linear structure should arise in embeddings derived from nonlinear methods.
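The linear structure can be seen in the familiar analogy arithmetic, e.g. vec(woman) - vec(man) ≈ vec(queen) - vec(king). A minimal sketch with made-up 3-dimensional vectors (not trained embeddings):

```python
import numpy as np

# Hypothetical toy embeddings chosen so the analogy holds exactly.
emb = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([0.2, 0.0, 1.0]),
    "queen": np.array([0.2, 1.0, 1.0]),
}

def analogy(a, b, c, emb):
    """Return the word whose vector is closest (by cosine) to vec(b) - vec(a) + vec(c)."""
    target = emb[b] - emb[a] + emb[c]
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    # Exclude the three query words, as is standard in analogy evaluation.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(emb[w], target))

print(analogy("man", "woman", "king", emb))  # -> queen
```

Real embeddings satisfy such relations only approximately, which is part of what the generative model below seeks to explain.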
We provide a new generative model "explanation" for various word embedding methods as well as the above-mentioned linear structure. It also gives a generative explanation of older vector space methods such as the pointwise mutual information (PMI) method of Church and Hanks (1990). The model makes surprising predictions (e.g., the spatial isotropy of word vectors), which are empirically verified.
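For concreteness, the PMI method scores a word–context pair as PMI(w, c) = log(p(w, c) / (p(w) p(c))), estimated from co-occurrence counts in a sliding window; a word's vector is its row of PMI scores. A minimal sketch (the corpus and window size are illustrative):

```python
import math
from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()
window = 2  # count co-occurrences within 2 tokens on either side

pair_counts = Counter()
word_counts = Counter(corpus)
total_pairs = 0
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            pair_counts[(w, corpus[j])] += 1
            total_pairs += 1

def pmi(w, c):
    """PMI(w, c) = log( p(w, c) / (p(w) p(c)) ); -inf if the pair never co-occurs."""
    p_wc = pair_counts[(w, c)] / total_pairs
    if p_wc == 0:
        return float("-inf")
    p_w = word_counts[w] / len(corpus)
    p_c = word_counts[c] / len(corpus)
    return math.log(p_wc / (p_w * p_c))

# "cat" co-occurs with "sat" more often than chance predicts, so PMI > 0.
print(pmi("cat", "sat"))
```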
The theoretical analysis also leads us to a linear algebraic understanding of how a word embedding behaves when the word is polysemous (has multiple meanings), and to a new technique that recovers the different meanings from the embedding. This methodology and generative model may be useful for other NLP tasks and neural models.
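One way to picture the linear algebraic view of polysemy: the embedding of a word like "tie" sits approximately on a linear combination of vectors for its individual senses, and recovering the meanings amounts to recovering the coefficients. The sketch below is hypothetical (the sense "atoms" and coefficients are made up, and the talk's actual recovery technique may differ, e.g. learned sparse coding over a large dictionary):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 10

# Hypothetical sense vectors ("atoms"), one per meaning.
atoms = {
    "clothing": rng.normal(size=dim),
    "music":    rng.normal(size=dim),
    "sports":   rng.normal(size=dim),
}
A = np.column_stack(list(atoms.values()))

# A polysemous word like "tie" mixes the clothing and sports senses.
true_coeffs = np.array([0.7, 0.0, 0.3])
v_tie = A @ true_coeffs

# Recover the mixture by least squares; large coefficients flag the active senses.
coeffs, *_ = np.linalg.lstsq(A, v_tie, rcond=None)
for name, c in zip(atoms, coeffs):
    print(f"{name}: {c:.2f}")
```

With random atoms in 10 dimensions the system is well conditioned, so least squares recovers the mixture; with thousands of atoms one would instead need a sparsity-promoting solver.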
The talk is based on joint work with Sanjeev Arora, Yuanzhi Li, Yingyu Liang and Andrej Risteski. No prior knowledge will be assumed.
Bio: Tengyu Ma is currently a fourth-year graduate student at Princeton University, advised by Prof. Sanjeev Arora. He received the Simons Award for Graduate Students in Theoretical Computer Science in 2014 and the IBM Ph.D. Fellowship in 2015. Ma's work seeks to develop efficient algorithms with provable guarantees for machine learning problems. His research interests include non-convex optimization, deep learning, natural language processing, distributed optimization, and convex relaxations (e.g., the sum-of-squares hierarchy) for machine learning problems.