ParaKMeans ParaKMeans is a high performance parallel processing implementation of the K Means Clustering algorithm. We designed the software so it can be deployed on most Windows operating systems. The applications are written for the .NET Framework v1.1 using the C# programming language. The parallel nature of the application comes from the use of a web service to perform the distance calculations and cluster assignments. Because we use a web service, it is essential that at least one computer has Internet Information Services (IIS v.5 or better) installed and running. The parallel K Means algorithm used in this application is based on the work of Ben Zhang, Meichun Hsu and George Forman. If you make use of the program presented here, please cite the following article:
Kraj P, Sharma A, Garge N, Podolsky R, McIndoe RA: ParaKMeans: Implementation of a parallelized K-means algorithm suitable for general laboratory use. BMC Bioinformatics 2008;9:200. |
HPCluster Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high I/O costs involved and large distance matrices calculated, most of the clustering algorithms fail on large datasets (30,000+ genes/200+ arrays). We propose a new two-stage algorithm which partitions the high dimensional space associated with microarray data using hyper planes. The first stage is based on the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm with the second stage being a conventional k-Means clustering technique. Because the first stage traverses the data in a single scan, the performance and speed increases substantially. The data reduction accomplished in the first stage of the algorithm reduces the memory requirements allowing us to cluster 44,460 genes without failure and significantly decreases the time to complete when compared to popular k-Means programs. The software was written in C# (.NET 1.1). This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. If you make use of the program presented here, please cite the following article:
Sharma A, Podolsky R, Zhao J, McIndoe RA: A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets. Bioinformatics 2009;25:1152-1157. |
ParaSAM Significance Analysis of Microarrays (SAM) is a permutation-based method that relies on estimating the FDR for determining significance. SAM is freely available as an Excel plug-in and as an R-package module. However, for large datasets the memory requirements are high and the algorithm fails. To overcome the memory limitations, we have developed a parallelized version of the SAM algorithm called ParaSAM. This high performance multithreaded application does not require programming experience to run and is designed to provide the general scientific community with an easy and manageable client-server Windows application. The parallel nature of the application comes from the use of web services to perform the permutations. The software is written in C# (.NET 1.1) and is designed in a modular fashion to provide both deployment flexibility as well as flexibility in the user interface. Our results indicate ParaSAM is not only faster than the serial versions, but can analyze extremely large datasets that cannot be performed using a single PC. If you make use of the program presented here, please cite the following article:
Sharma A, Zhao J, Podolsky R, McIndoe RA: ParaSAM: A parallelized version of the significance analysis of microarrays algorithm. Bioinformatics 2010. |