ICML-98 Submission #199
Learning a Language-Independent Representation for Terms from a
Partially Aligned Corpus
Michael L. Littman
Dept. of Computer Science
Duke University
Durham, NC 27708-0129
mlittman@cs.duke.edu
Fan Jiang
Dept. of Computer Science
Duke University
Durham, NC 27708-0129
fan@cs.duke.edu
Greg A. Keim
Dept. of Computer Science
Duke University
Durham, NC 27708-0129
keim@cs.duke.edu
Cross-language latent semantic indexing is a method that learns useful
vector-based language-independent representations of terms through a
statistical analysis of a document-aligned corpus. This is accomplished
by taking a collection of, say, English paragraphs and their
translations in Spanish and processing them by singular value
decomposition to yield a high-dimensional vector representation for
each term in the collection. These term vectors have the property
that semantically similar terms tend to have vectors with high cosine
measure, regardless of their source language. In the present work, we
show how to extend this approach to the case in which English-Spanish
translations are not available but the documents in both languages
have translations in a third "bridge" language, say, French.
Thus, although no aligned English-Spanish
documents are used, our method creates a representation in which
English and Spanish terms can be compared. The resulting vector
representation of terms can be useful in natural language applications
such as cross-language information retrieval and machine translation.
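
The bridge-language idea above can be illustrated with a minimal sketch.
It assumes NumPy and a hypothetical four-document toy corpus, and it
simply treats the missing-language block of each document as zeros before
taking a truncated singular value decomposition. That is a simplification
chosen for illustration and is not necessarily the procedure developed in
this work. The names build_matrix, term_vectors, and cosine are
illustrative only.

```python
import numpy as np

# Hypothetical toy corpus: English-French and Spanish-French aligned pairs.
# No English-Spanish pair exists; French serves as the bridge language.
EN_FR = [("dog", "chien"), ("house", "maison")]
ES_FR = [("perro", "chien"), ("casa", "maison")]


def build_matrix(en_fr, es_fr):
    """Build a term-by-document count matrix over all three languages.

    Each aligned pair becomes one document (one column). Terms are tagged
    with their language, and the language a document lacks contributes zeros.
    """
    docs = []
    for en, fr in en_fr:
        docs.append([("en", w) for w in en.split()] + [("fr", w) for w in fr.split()])
    for es, fr in es_fr:
        docs.append([("es", w) for w in es.split()] + [("fr", w) for w in fr.split()])
    terms = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(terms)}
    A = np.zeros((len(terms), len(docs)))
    for j, d in enumerate(docs):
        for t in d:
            A[index[t], j] += 1.0
    return A, index


def term_vectors(A, k):
    """Return k-dimensional term vectors: left singular vectors scaled by singular values."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k]


def cosine(x, y):
    """Cosine of the angle between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


A, index = build_matrix(EN_FR, ES_FR)
T = term_vectors(A, k=2)

dog = index[("en", "dog")]
perro = index[("es", "perro")]
casa = index[("es", "casa")]

raw_sim = cosine(A[dog], A[perro])          # raw rows never co-occur
sim_translation = cosine(T[dog], T[perro])  # English term vs. its Spanish translation
sim_unrelated = cosine(T[dog], T[casa])     # English term vs. an unrelated Spanish term

print(f"raw dog/perro:    {raw_sim:.3f}")
print(f"latent dog/perro: {sim_translation:.3f}")
print(f"latent dog/casa:  {sim_unrelated:.3f}")
```

In this toy setting the English and Spanish terms never share a document,
so their raw count vectors are orthogonal, yet in the two-dimensional
latent space the translation pair should have a cosine near 1 and the
unrelated pair a cosine near 0, because both terms co-occur with the same
French term.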
KEYWORDS: natural language, information retrieval, machine
translation, singular value decomposition, missing values.
EMAIL ADDRESS OF CONTACT AUTHOR: mlittman@cs.duke.edu
PHONE NUMBER OF CONTACT AUTHOR: 919-660-6537