Impact of Similarity Measures on Web-page Clustering
Clustering of web documents enables (semi-)automated categorization and
facilitates certain types of search. Any clustering method has to embed
the documents in a suitable similarity space. While several clustering methods
and the associated similarity measures have been proposed in the past, there is
no systematic comparative study of the impact of similarity metrics on cluster
quality, possibly because the popular cost criteria do not readily translate
across qualitatively different metrics. We observe that in domains such as YAHOO
that provide a categorization by human experts, a useful criterion for
comparisons across similarity metrics is indeed available. We then compare
four popular similarity measures (Euclidean, cosine, Pearson correlation and
extended Jaccard) in conjunction with several clustering techniques (random,
self-organizing feature map, hyper-graph partitioning, generalized k-means,
weighted graph partitioning) on high-dimensional sparse data representing web
documents. Performance is measured against a human-imposed classification
into news categories and industry categories. We conduct a number of
experiments and use t-tests to assess the statistical significance of the results.
Cosine and extended Jaccard similarities emerge as the best measures to capture
human categorization behavior, while Euclidean performs worst. Also, weighted
graph partitioning approaches are clearly superior to all others.
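
For concreteness, the four similarity measures compared above can be sketched as follows. This is a minimal illustration using common textbook definitions, not necessarily the exact formulations used in the experiments: the exp(-d^2) mapping from Euclidean distance to a similarity, the unscaled Pearson correlation, and all function names are assumptions made for this example. Documents are assumed to be dense, non-zero term-frequency vectors of equal length.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def euclidean_similarity(x, y):
    # Assumption: squared Euclidean distance is mapped to a similarity
    # in (0, 1] via exp(-d^2); other monotone mappings are possible.
    diff = [a - b for a, b in zip(x, y)]
    return math.exp(-dot(diff, diff))


def cosine_similarity(x, y):
    # Cosine of the angle between x and y; undefined for zero vectors.
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def pearson_similarity(x, y):
    # Pearson correlation is the cosine of the mean-centered vectors;
    # undefined when either vector is constant.
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    x_centered = [a - mean_x for a in x]
    y_centered = [b - mean_y for b in y]
    return cosine_similarity(x_centered, y_centered)


def extended_jaccard_similarity(x, y):
    # Extended (Tanimoto) Jaccard: x.y / (|x|^2 + |y|^2 - x.y).
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


if __name__ == "__main__":
    # Two hypothetical term-frequency vectors; doc_b is doc_a scaled by 2.
    doc_a = [1.0, 0.0, 2.0, 0.0]
    doc_b = [2.0, 0.0, 4.0, 0.0]
    for measure in (euclidean_similarity, cosine_similarity,
                    pearson_similarity, extended_jaccard_similarity):
        print(measure.__name__, measure(doc_a, doc_b))
```

In this toy case one document is a scaled copy of the other, so cosine and Pearson treat the pair as maximally similar, whereas the Euclidean-based and extended Jaccard measures are sensitive to the difference in vector length.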