Computer Science Seminar Series
A Robust Clustering Algorithm Using Statistical Depth Functions
February 7, 3:00pm
Weir Hall, Room 235
Presenter: Yuanyuan Ding
Ph.D. student in Computer Science at UM
In gene expression studies, the number of samples in most data sets is in the tens or low hundreds, while the total number of genes assayed is easily ten or twenty thousand. Such high dimension and low sample size (HDLSS) data present a substantial challenge to many methods of classical analysis, including cluster analysis.
Clustering algorithms are considered robust if they are not upset by small perturbations of the data or by the inclusion of unrelated variables, as in the case of HDLSS data. Clustering algorithms based on the mean are highly sensitive to outliers. Algorithms based on the componentwise median are less sensitive, but the componentwise median can be a very poor centroid: it is calculated separately on each component and so disregards the information carried by the interdependence among the components. As a result, the calculated median may not be representative of the data.
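A small illustration of both points, as a sketch rather than anything taken from the talk: the toy points and the helper name `componentwise` are made up for this example. The first part shows the mean being dragged away by a single outlier; the second shows a componentwise median that ignores a relationship shared by every data point.

```python
from statistics import mean, median

def componentwise(stat, points):
    """Apply a univariate statistic to each coordinate independently."""
    return tuple(stat(coords) for coords in zip(*points))

# Toy 2-D cluster (hypothetical data) and the same cluster plus one outlier.
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
with_outlier = cluster + [(100.0, 100.0)]

mean_clean = componentwise(mean, cluster)         # centre of the square
mean_dirty = componentwise(mean, with_outlier)    # pulled far outside the square
median_dirty = componentwise(median, with_outlier)  # stays within the cluster's range

# Three points that all satisfy x + y + z = 1.
simplex = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
cw_median = componentwise(median, simplex)

print("mean without outlier:", mean_clean)
print("mean with outlier:   ", mean_dirty)
print("componentwise median with outlier:", median_dirty)
print("componentwise median of simplex corners:", cw_median,
      "coordinate sum =", sum(cw_median))
```

In the second data set every point has coordinates summing to 1, yet the componentwise median is the origin, which does not share that property. This is the sense in which treating each component separately discards the dependence among components.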
We propose a robust divisive clustering algorithm, Bisecting k-Spatial Median, based on statistical depth functions. Statistical depth functions are a relatively new tool for finding the "center" of a multivariate data set. The spatial median is the median defined by one such depth function, the spatial depth. It has been proven to be robust to outliers and to be a better representative of the data.
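For readers unfamiliar with the spatial median, the following minimal sketch may help. It relies on the standard characterization of the spatial median as the point minimizing the sum of Euclidean distances to the data, and computes it with Weiszfeld's iteration. This is an assumption made for illustration: the announcement does not say how the presenter computes the spatial median, and the function name `spatial_median`, its parameters, and the toy data are hypothetical.

```python
import math

def spatial_median(points, tol=1e-9, max_iter=1000):
    """Approximate the spatial median with Weiszfeld's iteration.

    Illustrative sketch only; not the implementation used in the talk.
    """
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    current = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(max_iter):
        weight_sum = 0.0
        weighted = [0.0] * dim
        for p in points:
            d = math.dist(p, current)
            if d < 1e-12:
                # Simplification: skip a data point that coincides with the
                # current iterate to avoid dividing by zero.
                continue
            w = 1.0 / d
            weight_sum += w
            for i in range(dim):
                weighted[i] += w * p[i]
        if weight_sum == 0.0:
            break  # every point coincides with the iterate
        updated = [c / weight_sum for c in weighted]
        converged = math.dist(updated, current) < tol
        current = updated
        if converged:
            break
    return tuple(current)

# Corners of the simplex x + y + z = 1 (hypothetical data).
simplex = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
center = spatial_median(simplex)

# A unit-square cluster plus one distant outlier (hypothetical data).
with_outlier = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
robust_center = spatial_median(with_outlier)

print("spatial median of simplex corners:", center)
print("spatial median with outlier:", robust_center)
```

On the simplex corners, the spatial median stays on the plane that contains the data, because it is computed from whole points rather than one coordinate at a time. On the second data set it remains inside the unit square despite the outlier.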
We demonstrate that the proposed clustering algorithm outperforms the less robust componentwise-median-based bisecting k-median algorithm on high dimension and low sample size data by applying both algorithms to two real HDLSS gene data sets. When further applied to noisy real data sets, the proposed algorithm compares favorably in terms of robustness with the componentwise-median-based bisecting k-median algorithm.