The Coordinating and Bioinformatics Unit creates the software and informatics infrastructure for the consortium and facilitates the efforts of the mouse engineering centers. This page describes that infrastructure, along with the software we have released to the scientific community.

Lab Personnel

Mike Aufiero, Systems Analyst
Jiaqi Li, PhD Student
Brianna Perez, Curator
Lili Liang, Senior Research Assistant



Infrastructure Information
DiaComp IT Infrastructure
We develop our software on an n-tier architecture in which the presentation layer, business logic and data layer are built as separate software systems. These systems are designed to minimize maintenance while providing a robust, scalable model for future growth and for interaction at the national level with other organism databases. They were designed using the Unified Modeling Language (UML), and the designs are available to the general public. The two UML modeling tools we use are Rational Rose and PowerDesigner.

DiaComp Data Model
The core relational data model for the DiaComp was created using SQL Server 2000 and was based on a number of existing schemas covering our key subject areas: animal models, genotypes (including array experiment data), histopathology, and phenotype assays. The Mouse Models of Human Cancer Consortium (MMHCC) and the Jackson Labs were particularly helpful and shared several successful models. The DiaComp data model has since been migrated to SQL Server 2005 and modified to include the MMPC (National Mouse Metabolic Phenotyping Centers) data schema. The current version of the database addresses several domains, including DiaComp - MMPC administration, models, strains, publications, external database references, experiments, phenotype assays, microarray data, histology, images and dataset persistence. The current data model has 250 tables, 55 functions, 994 stored procedures, 141 data views and a total of 9344 lines of code.

DiaComp Administration Data Model

DiaComp Science Data Model

* Note: The links above require Internet Explorer version 5.0 or later to view the data model with zoom capability. Please accept the ActiveX warning to start the viewer. The viewer's navigation drop-down box links to the different data schemas; click Go next to a link to load that schema.

DiaComp Object Model
The DiaComp Object Model (DiaComp-OM) created for the consortium fully describes the activities of the DiaComp and provides an object-oriented API to access the data generated by the consortium. The DiaComp-OM was designed using PowerDesigner and UML, written in C# and compiled as a .NET DLL. The object model contains both administrative and domain-specific classes; however, only the data-centric classes are available to the public. The domain classes include object-specific classes (e.g. Model, Strain, Experiment, Protocol) as well as the DataManager and SearchCriteria classes used to retrieve data from the system. Each DataManager class is specific to one of the data types maintained by DiaComp. For example, the StrainMgr class provides methods to retrieve strain-specific data. The SearchCriteria classes are also data type specific and are used by the DataManager classes to query the database with type-specific parameters. For example, the StrainSearchCriteria class provides queryable properties specific to the Strain data in the system.
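
The pairing of a DataManager class with a SearchCriteria class can be sketched in a few lines. The example below is a minimal illustration in Python, not the DiaComp-OM itself (which is a C# .NET library). Only the names StrainMgr and StrainSearchCriteria come from the description above; the fields, the search method and the in-memory list of strains are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Strain:
    """Plain data object standing in for an object-specific domain class."""
    strain_id: int
    name: str
    background: str


@dataclass
class StrainSearchCriteria:
    """Type-specific query parameters; None means 'do not filter on this'."""
    name_contains: Optional[str] = None   # hypothetical property
    background: Optional[str] = None      # hypothetical property


class StrainMgr:
    """Data manager that answers queries for a single data type (strains)."""

    def __init__(self, rows: List[Strain]):
        # Assumption: an in-memory list replaces the real database backend.
        self._rows = rows

    def search(self, criteria: StrainSearchCriteria) -> List[Strain]:
        """Return every strain that satisfies all criteria that are set."""
        results = []
        for strain in self._rows:
            if (criteria.name_contains is not None
                    and criteria.name_contains.lower() not in strain.name.lower()):
                continue
            if (criteria.background is not None
                    and strain.background != criteria.background):
                continue
            results.append(strain)
        return results


if __name__ == "__main__":
    # Illustrative records only; these are not DiaComp data.
    mgr = StrainMgr([
        Strain(1, "Alpha knockout", "C57BL/6J"),
        Strain(2, "Beta transgenic", "DBA/2J"),
        Strain(3, "Alpha transgenic", "DBA/2J"),
    ])
    for hit in mgr.search(StrainSearchCriteria(name_contains="alpha")):
        print(hit.strain_id, hit.name, hit.background)
```

Keeping the query parameters in their own class means a caller can add or drop filters without changing the manager's method signature, which is one plausible reason for the split described above.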

The DiaComp object model base was modified to add the MMPC (National Mouse Metabolic Phenotyping Centers) schema. The common object model now contains classes that serve both the DiaComp and MMPC consortium web portals.

To provide the broadest access to the data, we are also creating a web service that exposes specific portions of the DiaComp-OM to the public. Specifically, the web service will provide access to all the object-specific classes as well as the DataManager and SearchCriteria classes. This gives programmers a mechanism for creating local DiaComp-OM objects in other languages. The current version of the DiaComp-OM has 185 object classes.

DiaComp-Web Services
The DiaComp Web Services layer exposes classes and methods of the DiaComp object model so that users can work with it from custom-built web applications, or with no user interface at all. The interfaces are described in an XML document called a Web Services Description Language (WSDL) document. Several tools can read a WSDL file and generate the code required to communicate with an XML web service, including the "Add Web Reference" tool in Microsoft Visual Studio. The DiaComp web services layer makes available public data search and retrieval methods for animals, strains, experiments, histology images, investigators, phenotype assays and publications. The exposed methods can be consumed from custom ASP.NET client applications using SOAP calls, or through plain HTTP GET/POST requests without the use of an API. The framework has been designed to be independent of any particular programming model and of other implementation-specific semantics. Complete documentation is available for each web service method, covering the return data type, input parameters and exceptions thrown. In addition, users may download a zipped Visual Studio 2008 solution containing a sample ASP.NET client application and a C# class library project.
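
To make the two calling styles concrete, the sketch below builds the same request once as an HTTP GET URL and once as a SOAP 1.1 envelope, using only the Python standard library. Everything specific in it is a placeholder: the service URL, the XML namespace, the method name SearchStrains and its parameter are hypothetical, and the URL layout assumes an ASP.NET-style endpoint. The real values should be taken from the WSDL document. No request is sent; the code only constructs the messages.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of the SOAP 1.1 envelope.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

# Hypothetical values for illustration; read the real ones from the WSDL.
SERVICE_URL = "https://example.org/DiaCompService.asmx"
SERVICE_NS = "http://example.org/diacomp/"


def build_get_url(method: str, params: dict) -> str:
    """Encode a method call as a plain HTTP GET request URL."""
    return f"{SERVICE_URL}/{method}?{urlencode(params)}"


def build_soap_envelope(method: str, params: dict) -> bytes:
    """Encode the same call as a SOAP 1.1 envelope (UTF-8 XML bytes)."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{method}")
    for name, value in params.items():
        # Each input parameter becomes a child element of the method element.
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    args = {"name": "alpha"}  # hypothetical parameter
    print(build_get_url("SearchStrains", args))
    print(build_soap_envelope("SearchStrains", args).decode("utf-8"))
```

In practice a generated client (for example from "Add Web Reference") hides this message construction; the sketch only shows what travels over the wire in each style.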

Software Applications
ParaKMeans
ParaKMeans is a high performance parallel processing implementation of the K Means Clustering algorithm. We designed the software so it can be deployed on most Windows operating systems. The applications are written for the .NET Framework v1.1 using the C# programming language. The parallel nature of the application comes from the use of a web service to perform the distance calculations and cluster assignments. Because we use a web service, it is essential that at least one computer has Internet Information Services (IIS v.5 or better) installed and running. The parallel K Means algorithm used in this application is based on the work of Ben Zhang, Meichun Hsu and George Forman.
If you make use of the program presented here, please cite the following article:

Kraj P, Sharma A, Garge N, Podolsky R, McIndoe RA: ParaKMeans: Implementation of a parallelized K-means algorithm suitable for general laboratory use. BMC Bioinformatics 2008;9:200.
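
The division of labor described for ParaKMeans, where workers perform the distance calculations and cluster assignments and a coordinator updates the centroids, can be sketched as follows. This is an illustration under stated assumptions, not the ParaKMeans code: Python threads stand in for the web service workers, each worker is assumed to return per-cluster sums and counts for its share of the data, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def partial_stats(chunk, centroids):
    """Worker step: assign each point in the chunk to its nearest centroid
    and return the per-cluster coordinate sums and point counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in chunk:
        j = min(range(k), key=lambda c: sq_dist(point, centroids[c]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    return sums, counts


def parallel_kmeans(data, centroids, n_workers=2, max_iter=50):
    """Coordinator: split the data, merge worker results, update centroids."""
    centroids = [list(c) for c in centroids]
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(max_iter):
            results = list(pool.map(partial_stats, chunks,
                                    [centroids] * n_workers))
            new_centroids = []
            for j, old in enumerate(centroids):
                total = sum(counts[j] for _, counts in results)
                if total == 0:
                    new_centroids.append(old)  # empty cluster: keep centroid
                    continue
                new_centroids.append(
                    [sum(sums[j][d] for sums, _ in results) / total
                     for d in range(len(old))])
            if new_centroids == centroids:
                break  # centroids unchanged, so assignments have converged
            centroids = new_centroids
    return centroids


if __name__ == "__main__":
    points = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(parallel_kmeans(points, [(0, 0), (10, 10)]))
```

Only the small per-cluster summaries travel back to the coordinator, so the cost of communication does not grow with the number of data points a worker holds. In Python, threads do not give true CPU parallelism; they are used here only to show the structure.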
HPCluster
Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high I/O costs involved and the large distance matrices calculated, most clustering algorithms fail on large datasets (30,000+ genes/200+ arrays). We propose a new two-stage algorithm that partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm, and the second stage is a conventional k-Means clustering technique. Because the first stage traverses the data in a single scan, performance and speed increase substantially. The data reduction accomplished in the first stage lowers the memory requirements, allowing us to cluster 44,460 genes without failure, and significantly decreases the time to completion compared with popular k-Means programs. The algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. The software was written in C# (.NET 1.1).
If you make use of the program presented here, please cite the following article:

Sharma A, Podolsky R, Zhao J, McIndoe RA: A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets. Bioinformatics 2009;25:1152-1157.
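
The two-stage idea behind HPCluster, a single-scan data reduction followed by k-Means on the reduced data, can be illustrated with a deliberately simplified sketch. It is not the published algorithm: the first stage below replaces the BIRCH-based hyperplane partitioning with a plain distance threshold, keeping only a count and a linear sum per micro-cluster, and the radius parameter and all function names are assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def single_scan_reduce(data, radius):
    """Stage 1: one pass over the data. Each point joins the nearest
    micro-cluster whose center lies within `radius`, or starts a new one.
    A micro-cluster is stored as [count, linear_sum]."""
    micro = []
    for point in data:
        best, best_d = None, radius ** 2
        for m in micro:
            center = [s / m[0] for s in m[1]]
            d = sq_dist(point, center)
            if d <= best_d:
                best, best_d = m, d
        if best is None:
            micro.append([1, [float(x) for x in point]])
        else:
            best[0] += 1
            best[1] = [s + x for s, x in zip(best[1], point)]
    return micro


def weighted_kmeans(micro, centroids, max_iter=50):
    """Stage 2: k-means over micro-cluster centers, weighted by their counts."""
    centroids = [[float(x) for x in c] for c in centroids]
    k, dim = len(centroids), len(centroids[0])
    for _ in range(max_iter):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for n, linear_sum in micro:
            center = [s / n for s in linear_sum]
            j = min(range(k), key=lambda c: sq_dist(center, centroids[c]))
            counts[j] += n
            for d in range(dim):
                sums[j][d] += linear_sum[d]
        new = [[s / counts[j] for s in sums[j]] if counts[j] else centroids[j]
               for j in range(k)]
        if new == centroids:
            break
        centroids = new
    return centroids


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    summaries = single_scan_reduce(points, radius=2.0)
    print(len(summaries), "micro-clusters")
    print(weighted_kmeans(summaries, [(0, 0), (10, 10)]))
```

The second stage works on the micro-cluster summaries rather than on the original points, which is where the memory saving comes from: no distance matrix over the full dataset is ever built.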
ParaSAM
Significance Analysis of Microarrays (SAM) is a permutation-based method that determines significance by estimating the false discovery rate (FDR). SAM is freely available as an Excel plug-in and as an R package. However, for large datasets the memory requirements are high and the algorithm fails. To overcome the memory limitations, we have developed a parallelized version of the SAM algorithm called ParaSAM. This high performance multithreaded application does not require programming experience to run and is designed to provide the general scientific community with an easy and manageable client-server Windows application. The parallel nature of the application comes from the use of web services to perform the permutations. The software is written in C# (.NET 1.1) and is designed in a modular fashion to provide flexibility in both deployment and the user interface. Our results indicate that ParaSAM is not only faster than the serial versions, but can also analyze extremely large datasets that cannot be analyzed on a single PC.
If you make use of the program presented here, please cite the following article:

Sharma A, Zhao J, Podolsky R, McIndoe RA: ParaSAM: A parallelized version of the significance analysis of microarrays algorithm. Bioinformatics 2010.
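
The reason permutations parallelize well is that each one is independent of the others. The sketch below shows that structure for a two-class comparison: it is a simplified illustration, not ParaSAM or the full SAM procedure. It uses a fixed cutoff on the absolute statistic instead of SAM's ordered statistics and delta threshold, it omits the estimate of the proportion of true nulls, and Python threads stand in for the web service workers. The function names and the default value of the s0 constant are assumptions.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, median


def d_stat(row, labels, s0):
    """SAM-style relative difference for one gene (two-class, unpaired).
    `labels` holds 1 for the first group and 0 for the second."""
    g1 = [x for x, lab in zip(row, labels) if lab == 1]
    g2 = [x for x, lab in zip(row, labels) if lab == 0]
    n1, n2 = len(g1), len(g2)
    m1, m2 = mean(g1), mean(g2)
    ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
    s = ((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / (s + s0)


def count_called(data, labels, s0, cutoff):
    """Number of genes whose absolute statistic reaches the cutoff."""
    return sum(1 for row in data if abs(d_stat(row, labels, s0)) >= cutoff)


def permuted_count(job):
    """Worker step: shuffle the labels with a given seed and count calls."""
    data, labels, s0, cutoff, seed = job
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return count_called(data, shuffled, s0, cutoff)


def estimate_fdr(data, labels, cutoff, s0=0.1, n_perm=100, n_workers=4):
    """Return (genes called, estimated FDR) at a fixed cutoff."""
    called = count_called(data, labels, s0, cutoff)
    jobs = [(data, labels, s0, cutoff, seed) for seed in range(n_perm)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        false_counts = list(pool.map(permuted_count, jobs))
    expected_false = median(false_counts)
    fdr = min(1.0, expected_false / called) if called else 0.0
    return called, fdr


if __name__ == "__main__":
    # Illustrative values only: one gene with a clear shift, one without.
    expression = [
        [5.0, 5.1, 4.9, 0.0, 0.1, -0.1],
        [1.0, 1.1, 0.9, 1.0, 0.9, 1.1],
    ]
    groups = [1, 1, 1, 0, 0, 0]
    print(estimate_fdr(expression, groups, cutoff=5.0))
```

Each job carries its own seed, so a permutation gives the same result no matter which worker runs it, and only a single count per permutation has to be returned to the coordinator.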