ICML-98 Submission #117

Employing EM in Pool-Based Active Learning for Text Classification

Andrew McCallum and Kamal Nigam
Just Research
4616 Henry Street
Pittsburgh, PA 15213

Abstract (250 word maximum):

This paper shows how a text classifier's need for labeled training
data can be reduced with a combination of active learning and
Expectation-Maximization (EM) on a pool of unlabeled data.
Query-by-Committee is used to actively select documents for labeling,
then EM with a naive Bayes model further improves classification
accuracy by concurrently estimating probabilistic labels for the
remaining unlabeled documents and using them to improve the model.  We
also present a metric for better measuring disagreement among
committee members; it accounts for the strength of their disagreement
and for the distribution of the documents.  Our method of combining EM
and active learning requires only half as many labeled training
examples to achieve the same accuracy as either EM or active learning
alone.
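
As a rough illustration of the selection step described above, the
following sketch scores each pool document by how strongly committee
members disagree about its class, weighted by a per-document density
value. The abstract does not give the metric's formula, so the specific
choices here (mean KL divergence from the committee's average
distribution, a precomputed density weight, and every function name)
are assumptions made for illustration and not necessarily the metric
used in the paper.

```python
import math

def mean_distribution(committee_posteriors):
    """Average the class distributions predicted by the committee members."""
    k = len(committee_posteriors)
    n_classes = len(committee_posteriors[0])
    return [sum(p[c] for p in committee_posteriors) / k for c in range(n_classes)]

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p[c] == 0 contribute zero."""
    return sum(pc * math.log(pc / qc) for pc, qc in zip(p, q) if pc > 0.0)

def disagreement(committee_posteriors):
    """Assumed measure: mean KL divergence of each member from the committee mean."""
    mean = mean_distribution(committee_posteriors)
    total = sum(kl_divergence(p, mean) for p in committee_posteriors)
    return total / len(committee_posteriors)

def select_query(pool_posteriors, density):
    """Index of the pool document with the highest density-weighted disagreement.

    pool_posteriors[i] holds one class distribution per committee member for
    document i; density[i] is a hypothetical weight reflecting how typical
    document i is of the pool.
    """
    scores = [density[i] * disagreement(post) for i, post in enumerate(pool_posteriors)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Under these assumptions, a document on which members split sharply but
which lies in a sparse region of the pool can score lower than a
moderately contested document in a dense region, which is the intended
effect of weighting by the document distribution.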

Keywords:

text classification, active learning, unsupervised learning,
information retrieval

Email address of contact author:

mccallum@jprc.com

Phone number of contact author:

412-683-9132

Multiple submission statement (if applicable):

None.