Bayesian Clustering of Data and Dimensions
Clustering, or unsupervised learning, is one of the most frequently used exploratory techniques for uncovering patterns in data. Including a large number of unnecessary dimensions (or attributes) degrades most pattern recognition algorithms. As a remedy, informative dimensions are often selected, or the data are projected onto a smaller number of dimensions. Our approach is different: we suggest grouping subjects and dimensions at the same time, which is called biclustering. Hierarchical clustering is one of the most popular pattern recognition techniques because it produces a dendrogram, a visual guide to data clusters with different numbers of groups. We generalize the common hierarchical clustering algorithms to hierarchical biclustering in order to increase the precision of the estimated groupings in the presence of correlated or noise dimensions. A model-based approach gives a better statistical understanding of the biclusters, so a model-based discrepancy measure such as the Ward linkage appears more appropriate. We build a bridge between the Ward linkage and a Bayesian model to produce scalable hierarchical biclustering algorithms that can handle large data sets.
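
To make the Ward linkage concrete, the sketch below computes the standard Ward merge cost (the increase in within-cluster sum of squares when two clusters are joined) and uses it for a small agglomerative clustering of rows and, separately, of columns. This is only a naive baseline under stated assumptions, not the Bayesian hierarchical biclustering algorithm described above: rows and columns are clustered independently, the numbers of groups are fixed in advance, and the names `ward_cost`, `ward_clusters`, and `naive_bicluster` as well as the toy data are hypothetical.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge.

    a and b are lists of points; each point is a list of floats.
    """
    na, nb = len(a), len(b)
    dim = len(a[0])
    mean_a = [sum(p[j] for p in a) / na for j in range(dim)]
    mean_b = [sum(p[j] for p in b) / nb for j in range(dim)]
    sq_dist = sum((x - y) ** 2 for x, y in zip(mean_a, mean_b))
    return na * nb / (na + nb) * sq_dist


def ward_clusters(points, k):
    """Greedy agglomerative clustering with Ward linkage, stopped at k groups.

    Returns the groups as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        # Pick the pair of clusters whose merge is cheapest under Ward's cost.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(
                [points[t] for t in clusters[ij[0]]],
                [points[t] for t in clusters[ij[1]]],
            ),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return sorted(sorted(c) for c in clusters)


def naive_bicluster(data, k_rows, k_cols):
    """Cluster rows and columns independently (an assumption of this sketch)."""
    row_groups = ward_clusters(data, k_rows)
    columns = [list(col) for col in zip(*data)]  # transpose: columns as points
    col_groups = ward_clusters(columns, k_cols)
    return row_groups, col_groups


# Hypothetical toy data: two groups of subjects and two groups of dimensions.
data = [
    [0.0, 0.1, 5.0, 5.1],
    [0.1, 0.0, 5.1, 5.0],
    [4.0, 4.1, 9.0, 9.1],
    [4.1, 4.0, 9.1, 9.0],
]
row_groups, col_groups = naive_bicluster(data, 2, 2)
print(row_groups, col_groups)
```

Clustering the two margins independently ignores the interaction between subject groups and dimension groups, which is the gap a joint, model-based biclustering is meant to address.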