ICML-98 Submission #117
Employing EM in Pool-Based Active Learning for Text Classification
Andrew McCallum and Kamal Nigam
Just Research
4616 Henry Street
Pittsburgh, PA 15213
Abstract (250 word maximum):
This paper shows how a text classifier's need for labeled training
data can be reduced with a combination of active learning and
Expectation Maximization (EM) on a pool of unlabeled data.
Query-by-Committee is used to actively select documents for labeling,
then EM with a naive Bayes model further improves classification
accuracy by concurrently estimating probabilistic labels for the
remaining unlabeled documents and using them to improve the model. We
also present a metric for better measuring disagreement among
committee members; it accounts for the strength of their disagreement
and for the distribution of the documents. Combining EM with active
learning reaches the same accuracy as either EM or active learning
alone while requiring only half as many labeled training examples.
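
As an illustration of what a strength-sensitive disagreement measure
for Query-by-Committee might look like, the sketch below scores a
document by the mean KL divergence of each committee member's class
distribution from the committee average. This is an assumption made for
illustration: the abstract does not specify the metric, the sketch
omits the document-distribution weighting the abstract mentions, and
the names committee_disagreement and select_query are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    # Terms with p_i == 0 contribute nothing; q_i > 0 whenever p_i > 0
    # because q is an average that includes p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def committee_disagreement(member_dists):
    """Mean KL divergence of each member's class posterior from the mean.

    member_dists: one class-probability list per committee member, all
    for the same document. Returns 0.0 when all members agree exactly.
    """
    k = len(member_dists)
    n_classes = len(member_dists[0])
    mean = [sum(d[c] for d in member_dists) / k for c in range(n_classes)]
    return sum(kl_divergence(d, mean) for d in member_dists) / k


def select_query(pool):
    """Pick the unlabeled document the committee disagrees on most.

    pool: dict mapping a document id to its list of member distributions.
    """
    return max(pool, key=lambda doc_id: committee_disagreement(pool[doc_id]))


if __name__ == "__main__":
    # Hypothetical two-member committee over two classes.
    pool = {
        "doc_a": [[0.5, 0.5], [0.5, 0.5]],  # members agree
        "doc_b": [[0.9, 0.1], [0.1, 0.9]],  # members disagree strongly
    }
    print(select_query(pool))
```

Unlike a simple vote count, this score distinguishes members who
disagree narrowly from members who disagree with high confidence.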
Keywords:
text classification, active learning, unsupervised learning,
information retrieval
Email address of contact author:
mccallum@jprc.com
Phone number of contact author:
412-683-9132
Multiple submission statement (if applicable):
None.