ICML-98 Submission #142
Improving Text Classification by Shrinkage in a Hierarchy of Classes
Andrew McCallum
Just Research
4616 Henry Street
Pittsburgh, PA 15213
Ronald Rosenfeld and Tom Mitchell
Department of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
Andrew Ng
MIT AI Lab
545 Tech Square
Cambridge, MA 02139
Abstract
When documents are organized into a large number of topic categories,
the categories are often arranged in a hierarchy. The U.S. patent
database and Yahoo are two examples.
This paper shows that the accuracy of a learned text classifier can be
improved by taking advantage of a hierarchy of classes. We adopt an
established statistical technique called shrinkage, which smooths the
parameter estimates of a data-sparse child class toward those of its
parent to make them more robust. The approach is also related
to deleted interpolation, used with n-grams in language modeling.
Our method scales well to large data sets, with numerous categories in
large hierarchies. Experimental results on three real-world data sets
from UseNet, Yahoo, and corporate web pages show improved performance,
with a reduction in error of up to 29% over a traditional flat
classifier.
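As a rough illustration of the shrinkage idea described in the abstract, the
sketch below mixes a sparse child class's word-probability estimates with
those of its parent. It is a minimal sketch under stated assumptions, not the
paper's implementation: the function names (mle, shrink), the toy vocabulary
and counts, the two-level hierarchy, and the fixed mixing weight lam are all
hypothetical, and how the weights are actually chosen is not specified here.

```python
from collections import Counter


def mle(counts, vocab):
    """Maximum-likelihood word probabilities from raw counts."""
    total = sum(counts[w] for w in vocab)
    return {w: (counts[w] / total if total else 0.0) for w in vocab}


def shrink(child_counts, parent_counts, vocab, lam=0.5):
    """Linearly interpolate child and parent estimates.

    lam is a hand-picked mixing weight (an assumption for this sketch);
    lam=1.0 keeps the child's estimate unchanged, lam=0.0 uses only the parent.
    """
    child = mle(child_counts, vocab)
    parent = mle(parent_counts, vocab)
    return {w: lam * child[w] + (1.0 - lam) * parent[w] for w in vocab}


# Toy data (hypothetical): a sparse child class and its data-rich parent.
vocab = ["ball", "goal", "puck", "team"]
hockey = Counter({"puck": 2, "team": 1})
sports = Counter({"ball": 5, "goal": 4, "puck": 3, "team": 8})

smoothed = shrink(hockey, sports, vocab, lam=0.5)
for word in vocab:
    print(f"{word}: {smoothed[word]:.3f}")
```

Words the child class never saw ("ball", "goal") receive nonzero probability
from the parent, while the result remains a proper distribution because both
inputs sum to one and the weights sum to one.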
Keywords:
Text Classification
Information Retrieval
Hierarchical Modeling
Statistical Language Modeling
Bayesian Learning
Email address of contact author: mccallum@jprc.com
Phone number of contact author: 412-683-9132