# Informatics problems in the biological, chemical, and medical sciences

## Informatics Problems in the Biological, Chemical, and Medical Sciences

With the recent accumulation of vast amounts of chemical, biological, and clinical data, many scientific fields are becoming increasingly data-driven as opposed to model-driven. This paradigm shift has brought about many challenging computational data mining problems. Even though these problems originate from very disparate fields, they have very similar mathematical structures. In particular, they involve the use of a merit function to evaluate alternatives from very large, often combinatorial, search spaces. We develop novel mathematical models and algorithms for a variety of challenging

informatics problems that are described below.

## Inverse imaging problems in X-ray crystallography

Since the mid nineteen hundreds, analysis of X-ray diffraction data of crystals has been used extensively for the determination of molecular structure and properties. While the method is employed almost on a routine basis worldwide, it is often a major challenge to identify the 3D structure that best fits the diffraction data. A key obstacle, in particular, is the identification of the phases of the diffracted rays from measurements of intensities alone. This problem is known as the “phase problem” in crystallography. We have developed a novel methodology, the first to theoretically guarantee a global optimum of a minimal principle formulation of the phase problem for centric compounds. In collaboration with researchers from the Hauptman-Woodward Medical Research Institute, we have recently added to our models the capability to deal with symmetry translations and avoid false minima. Our algorithms lead to considerably more accurate structures in comparison to state-of-the-art crystallographic software. We are currently working to develop similar algorithms for non-centric compounds, including proteins. This work promises to lay the foundations of a new generation of crystallographic computing systems that will enable the determination of structures important in the understanding of life, materials science, and drug design.

## Design and reconstruction of dynamic biochemical networks

The genomic and proteomic data that have been collected over the past two decades have increased our understanding of how single genes and single proteins function. The challenge now in systems biology and bioinformatics is to build on this understanding in order to study and influence networks involving many genes, many proteins, and all the reactions they modulate. Powerful algorithms and software from the area of optimization have been used extensively and successfully over the past two decades to analyze, interpret, predict, and design metabolic and signaling networks. Yet, current metabolic engineering approaches are prone to failure when applied to biochemical systems with multiple local minima and require excessive human intervention and computing times to optimize dynamic pathways. Fundamentally new optimization models and algorithms are needed in order to increase the size and complexity of biochemical networks that can be understood and rationally designed.We are currently working to:(a) develop novel optimization models for designing biochemical networks under temporal performance requirements such as stability; (b) develop novel optimization models for estimating the structure and parameters of biochemical networks from time series of concentrations; (c) devise mathematical algorithms to solve the above models in a reliable and efficient way, thus increasing the size of tractable networks; and (d) validate our models and algorithms by applying them to experimental metabolic data.

## Design of novel, environmentally benign chemicals

Concerns about the depletion of the ozone layer by chlorofluorocarbon (CFC) refrigerants have led to extensive research efforts to find environmentally benign CFC replacements. In the 1980s, it was proposed to search computationally for CFC replacements by using group contribution techniques to mine the property databases of the petroleum industry in order to predict properties of potentially new compounds. The main question is then how to efficiently search the astronomically large space of all potential group combinations in order to identify the most promising refrigerants. In work funded by an NSF/Lucent Technologies Industrial Ecology Fellowship, the Sahinidis group developed an optimization model and a systematic methodology for the Freon replacement problem. CFCs used as refrigerants in the past turned out to be some of the solutions of this model in addition to several novel potential alternative refrigerants (compounds that either do not appear in chemical databases or appear but have never been used as refrigerants.) This was the first theoretical approach to propose novel potential refrigerants and find the complete set of solutions to a Freon replacement problem that was open for over 20 years. Recently, we extended our approach to the problem of identifying replacements for refrigerants currently used in retail food refrigeration. With funding from the U.S. Environmental Protection Agency and the National Science Foundation, we assembled a team with expertise in synthetic chemistry, thermodynamics, computer science, and refrigeration in order to solve this problem. In a recent breakthrough, our methodology has yielded over 3000 potential new refrigerants, twelve of which appear to be environmentally benign and synthesizable. The above developments hold promise to replace current ozone-depleting and greenhouse gases by environmentally benign chemicals. We plan to develop similar computational techniques for the design of fire suppressants, solvents, fuel additives, polymers, and drugs with an emphasis on minimizing the environmental impact over the entire life cycle of the new compounds.

Coupling our molecular design algorithms with derivative-free optimization in the space of molecular properties enables a much greater molecular diversity to be considered in the search space as compared to traditional methods. Additionally, by using COSMO-RS/-SAC as alternatives to UNIFAC as the method used to capture mixture thermodynamics, we are able to model a wide variety of practical mixture design problems. To fully incorporate COSMO-RS/-SAC into mixture design, we developed group contribution methods for estimating a few necessary parameters for COSMO-based methods. This development was key for cases in which UNIFAC-like methods are insufficient and allowed for modeling of many relevant species in chemical reactions (transition states, charges, etc.) directly at the quantum level in conjunction with our molecular and mixture design methodology.

## The protein side-chain conformation problem and other structural bioinformatics problems

The protein side-chain conformation problem is central in protein bioinformatics with applications in protein structure prediction and design. Computational complexity results show that the problem is hard to solve. Yet, instances from realistic applications are large and demand fast and reliable algorithms. We have recently proposed a new algorithm which, for the first time, integrates residue reduction and rotamer reduction techniques for solving this problem. Our approach dramatically simplifies the topology of the underlining residue graph and solves problems using only 1 to 10% of the time required by the integer programming approach available in the literature. In addition, on a set of hard side-chain conformation problems, our algorithm runs 2 to 78 times faster than SCWRL 3.0, which is widely used for solving these problems. We are currently developing combinatorial optimization algorithms for several additional structural bioinformatics problems, including DNA sequencing by hybridization, and protein structural (3D) alignment.

## Bioinformatics software developed by the group:

- CMOS: software for the contact map overlap problem for protein structural (3D) alignment.
- R3: software for the protein side-chain conformation prediction problem.
- SBH: software for finding all near-optimal solutions of the combinatorial problem in DNA sequencing by hybridization—forthcoming.

## Medical diagnosis and prognosis

Over the past two decades, diagnostic techniques have grown out of the desire to replace surgical biopsy by breast cancer diagnosis that is based solely on the use of samples obtained through fine needle aspirates (FNA). Once FNA samples are taken from the breast mass, the material is examined for a number of characteristics of each nucleus, including size, shape, and texture. These attributes are then used to classify the sample as benign or malignant. The fundamental question is how to perform this last step of differentiating between benign and malignant samples.One approach is to use FNA data from hundreds of patients that were surgically diagnosed, and learn how to diagnose future patients based on FNA samples alone. The underlying mathematical problem calls for the development of a discriminant function that separates two sets of points in a high dimensional space.This is a problem that arises in data analysis and pattern recognition problems in many domains, including financial modeling and investment planning, behavioral modeling, and data-driven managerial decision making. Except for certain special cases that are easily solvable, this problem is known to be challenging. We explore the hypothesis that it suffices to develop the best possible pair of hyperplanes for separating the experimental data. In a recent Ph.D. thesis, we have devised a systematic computational methodology for solving this problem. On a set of approximately 600 clinical FNA samples from the University of Wisconsin Hospital, our methodology yielded 99% correct breast cancer diagnosis. Compared to the 95% accuracy of the best previous techniques, our developments would imply 24 fewer misdiagnosed patients for this small sample alone. This is an important improvement given that millions of patients undergo breast cancer diagnosis every year. We plan to: (a) apply this methodology to a variety of cancer data sets, (b) extend our methodology to diagnosis of other types of medical conditions, and (c) develop a similar methodology for prognosis of the long-term disease behavior.

## Selected Publications:

- Sahinidis, N. V. and M. Tawarmalani, Applications of global optimization to process and molecular design,
__Computers & Chemical Engineering__*,*24(9-10), 2157-2169, 2000. - Sahinidis, N. V., M. Tawarmalani, and M. Yu, Design of alternative refrigerants via global optimization,
__AIChE J.__, 49(7), 1761-1775, 2003. - Vaia, A. and N. V. Sahinidis, An integer programming approach to the phase problem for centrosymmetric structures,
__Acta Crystallographica A__, 59(5), 452-458, 2003. - Vaia, A. and N. V. Sahinidis, Polynomial-time algorithms for the integer minimal principle for centrosymmetric structures,
__Acta Crystallographica A__, 61, 445-452, 2005. - Chang, Y. and N. V. Sahinidis, Optimization of metabolic pathways under stability considerations,
__Computers & Chemical Engineering__, Special Issue on Systems Engineering Challenges and Opportunities in Systems Biology, 29(3), 467-479, 2005. - Xie, W. and N. V. Sahinidis, A Branch-and-reduce algorithm for the contact map overlap problem, RECOMB 2006,
__Lecture Notes in Bioinformatics__, Vol. 3909, 516-529, Springer, 2006 (the acceptance rate at RECOMB 2006 was 18.5%). - Xie, W. and N. V. Sahinidis, Residue-rotamer-reduction algorithm for the protein side-chain conformation problem,
__Bioinformatics__, 22(2), 188-194, 2006. - Smith, A. B., H. Xu and N. V. Sahinidis, An integer minimal principle and triplet sieve method for phasing centrosymmetric structures,
__Acta Crystallographica A__, 63(2), 164-171, 2007. - Xie, W. and N. V. Sahinidis, A reduction-based exact algorithm for the contact map overlap problem,
__Journal of Computational Biology__, 14(5), 637-654, 2007. - Xu, H., A. B. Smith, N. V. Sahinidis, and C. M. Weeks, SnB version 2.3: Triplet sieve phasing for centrosymmetric structures,
__Journal of Applied Crystallography__, 41, 644-646, 2008. - Samudra, A. and N. V. Sahinidis, Design of secondary refrigerants: A combined optimization-enumeration approach, in M. M. El-Halwagi and A. A. Linninger (eds.): Proceedings of the 7th International Conference on the Foundations of Computer-Aided Process Design, CRC Press, pp. 879-886, 2009
- Shah, S. B. and N. V. Sahinidis, SAS-Pro: Simultaneous residue assignment and structure superposition for protein structure alignment,
__PLoS ONE/u>, 7:e37493, 2012.__ - Samudra, A. and N. V. Sahinidis, Optimization-based framework for computer-aided molecular design,
__AIChE Journal/u>, 59, 3686-3701, 2013.__ - Samudra, A. and N. V. Sahinidis, Design of heat transfer media components for retail food refrigeration,
__Industrial & Engineering Chemistry Research/u>, 52, 8518-8526, 2013.__ - Austin, N., A. Samudra, N. V. Sahinidis and D. W. Trahan, Mixture design using derivative-free optimization in the space of individual component properties,
__AIChE Journal/u>, 62, 1514-1530, 2016.__ - Austin, N., N. V. Sahinidis and D. W. Trahan, Computer-aided molecular design: An introduction and review of tools, applications, and solution techniques,
__Chemical Engineering Research and Design____, 116, 2-26, 2016.__ - Austin, N., N. V. Sahinidis and D. W. Trahan, A COSMO-based approach to computer-aided mixture design,
__Chemical Engineering Science____, 159, 93-105, 2017.__ - Austin, N., N. V. Sahinidis, I. Konstantinov and D. W. Trahan, COSMO-based computer-aided molecular/mixture design: A focus on reaction solvents,
__AIChE Journal____, 64, 104-122, 2018.__