ICML-98 Submission #152

Refining the Initial Points in K-Means Clustering

               Paul Bradley and Usama Fayyad
               Microsoft Research
               Redmond, WA 98052, USA

Abstract

As an optimization problem, clustering data (unsupervised learning) is
known to be a difficult problem. Most practical approaches use a
heuristic, typically gradient-descent, algorithm to search for a
solution in the huge space of possible solutions.  Such methods are by
definition sensitive to starting points. It has been well-known that
clustering algorithms are extremely sensitive to initial
conditions. Most methods for guessing an initial solution simply make
random guesses. In this paper we present a method that takes an
initial condition and efficiently produces a refined starting
condition. The method is applicable to a wide class of clustering
algorithms for discrete and continuous data. In this paper we
demonstrate how this method is applied to the popular K-means
clustering algorithm and show that refined initial starting points
indeed lead to improved solutions. The method is based on an efficient
technique for estimating the modes of a distribution and runs in time
guaranteed to be less than overall clustering time for large data
sets. The method is also scalable and hence can be efficiently used on
huge databases to refine starting points for scalable clustering
algorithms in data mining applications.

Keywords:  Clustering, data mining, Initialization of Clustering, K-means
algorithm.

Contact Address:
Usama Fayyad (fayyad@microsoft.com, http://research.microsoft.com/~fayyad)
Microsoft Research
Redmond, WA 98052-6399, USA
        	Tel. +1-425-703-1528,  Fax  +1-425-936-7329