Construction of Ensemble by Exploiting Richness of Feature Variables in High-Dimensional Data with Applications in Protein Homology
High-dimensional data often contain a large number of observations and feature variables. In this work, we develop a model that uses the richness of information present in the large number of feature variables in high-dimensional data to predict a response. The proposed model, an aggregated collection of logistic regression models (LRMs), is called an ensemble, where each constituent LRM is fitted to a subset of the feature variables. An algorithm is developed to cluster the feature variables into subsets such that the variables within a subset work well together in a single LRM, while the variables in different subsets work better in separate LRMs. The strength of the ensemble depends on the algorithm's ability to identify strong and diverse subsets of useful feature variables in high-dimensional data. We call each subset of variables a phalanx, and the resulting ensemble an ensemble of phalanxes. Homologous proteins are considered to have a common evolutionary origin. To produce an evolutionary sequence of proteins, a scientist needs to predict their biological homogeneity. The proposed model is applied to predict the biological homogeneity of proteins using feature variables obtained from a similarity search between a candidate protein and a target protein. The underlying assumption is that the structural similarity of proteins relates to their biological homogeneity. Because homologous proteins are scarce, the prediction performance of a model is evaluated by its ability to rank rare homologous proteins ahead of non-homologous proteins. The protein homology data are obtained from the 2004 KDD Cup website.
While the prediction performance of an ensemble of phalanxes is competitive with contemporary state-of-the-art ensembles and the winning procedures of the 2004 KDD Cup competition, a further improvement in prediction performance is achieved by aggregating two diverse ensembles of phalanxes obtained by optimizing two complementary evaluation metrics. Through parallel computation, the proposed ensemble is also shown to be computationally efficient.
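
As an illustration of the ranking-based evaluation described above, the following minimal Python sketch averages the scores of two models and measures how well the resulting ranking places rare positives ahead of negatives. It is a sketch under stated assumptions, not the procedure used in this work: the plain mean as aggregation rule, the use of average precision as the ranking metric, the helper names `average_scores` and `average_precision`, and the toy scores are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def average_scores(score_lists: Sequence[Sequence[float]]) -> List[float]:
    """Aggregate per-model scores with a plain mean (an assumed aggregation rule)."""
    n_models = len(score_lists)
    return [sum(column) / n_models for column in zip(*score_lists)]


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision of the ranking induced by `scores` (higher score ranks earlier)."""
    # Rank candidates from highest to lowest score; ties keep their input order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at the rank of each positive
    return precision_sum / hits if hits else 0.0


# Toy data (hypothetical): 1 marks a homologous candidate, 0 a non-homologous one.
labels = [0, 1, 0, 0, 1, 0]
scores_a = [0.10, 0.90, 0.80, 0.20, 0.30, 0.10]  # scores from one hypothetical model
scores_b = [0.50, 0.40, 0.10, 0.30, 0.90, 0.20]  # scores from a second, diverse model

ensemble_scores = average_scores([scores_a, scores_b])
ap_a = average_precision(labels, scores_a)
ap_b = average_precision(labels, scores_b)
ap_ensemble = average_precision(labels, ensemble_scores)
print(ap_a, ap_b, ap_ensemble)
```

In this toy example each individual score list ranks one positive first but places a negative ahead of the other positive (average precision of about 0.83), whereas the averaged scores rank both positives at the top (average precision of 1.0). This mirrors, in miniature, why aggregating strong but diverse models can improve a ranking.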