Regression Analysis to Predict the Secondary Structure of Proteins

A method is presented for protein secondary structure prediction based on multidimensional regression. Two hundred proteins are chosen from the RCSB Protein Data Bank, and their secondary structures, obtained through X-ray crystallography analyses, are downloaded from the same source. The primary and secondary structures of the proteins are concatenated separately to create sequences of 169 026 residues. The first 150 000 amino acid residues and the corresponding secondary structures are used to build a regression model; the remaining 19 026 residues are used for testing. Since we expect three outputs, α-helices "H", β-sheets "S", and coils "C", our regression model consists of three sets of 20 × 13 + 1 parameters. These parameters are tuned, and a correct classification rate of 62.50% is achieved on the test data. Furthermore, the performance of the regression model is compared with that of online secondary structure estimation algorithms on 14 unused proteins and is found to be comparable with the online estimation tools.


INTRODUCTION
Large-scale sequencing projects have produced a large number of protein sequences. In 1993 the number was 26,000 sequences (Bairoch & Boeckmann, 1963; Ewbank & Creighton, 1992), but before the end of the century it had easily passed the 500,000 mark. Today, at the end of 2014, the number has reached 546,790.
Compared with the number of known protein sequences, the number of proteins whose structure is known is still very limited: in 1993 it was about 1000 (Bernstein et al., 1977). Today it has reached 105,025, as increased efforts have focused on narrowing the widening gap. The most reliable prediction of the structure of a new protein is made by detecting significant similarities to proteins of known structure (Taylor & Orengo, 1989; Sander & Schneider, 1991; Vriend & Sander, 1991). However, as of 1993 only about one-seventh of new sequences had similarities to known structures (Bork et al., 1992).

Figure 1. Number of proteins whose structures are known
Attempts to predict structure from sequence by physical simulation techniques, such as molecular dynamics (Momany et al., 1975; Karplus & Petsko, 1990), have fallen far short of solving the task of finding the "hidden" relation between the primary and tertiary structure. Although the folding process may require catalysts such as chaperonins (Hubbard & Sander, 1991), the basic hypothesis that the three-dimensional (tertiary) structure of a protein is uniquely determined by its sequence of amino acids (primary structure) appears to remain valid (Anfinsen et al., 1963; Ewbank & Creighton, 1992). A simple reduction of the prediction problem is the projection of the three-dimensional structure onto one dimension, i.e. onto a string of secondary structure assignments for each residue.
One of the problems of these prediction methods is that the formation of secondary structure elements is only to a certain degree due to sequentially local interactions of amino acids (Nagano & Hasegawa, 1975; Taylor, 1988; Zhong et al., 1992). However, most methods known to date do rely on local information. Throughout the 1980s these methods hovered around 60 to 64% in overall three-state accuracy. Some methods predicted β-strands, for example, only 12 percentage points better than the chance value of 33% (Biou et al., 1988). In the 1990s, a reported overall accuracy of 66.5% (Zhang et al., 1992) and single examples of predictions for proteins of unknown structure generated enthusiasm in the field (Barton et al., 1991; Benner et al., 1992; Rost & Sander, 1992; Russell et al., 1992). At that time it was claimed that predictions could not be better than 65 (± 2)% (Garnier, 1992).
In 1993, B. Rost and C. Sander (Rost & Sander, 1993) presented the results of an in-depth analysis of the performance of multi-layered (neural) networks. By appropriately processing the information about structure contained in a multiple sequence alignment, it proved possible to increase the accuracy of secondary structure prediction above 70%.
The following decades brought new ideas. In his comprehensive review, B. Rost (Rost, 2001) summarized the state of the art at the beginning of the 2000s. His report lists at least five methods that pass the 75% correct classification limit. He concludes by asking: 88% is a limit, but shall we ever come close to it?
In this paper we check the validity of the basic hypothesis that the secondary and the three-dimensional (tertiary) structure of a protein are uniquely determined by its sequence of amino acids, that is, by its primary structure.
The amount of variability in the secondary structure conformation of proteins at each residue suggests its relative importance and possible functions. Variability of outcomes in identical environments has also been a central concern in statistics. It would seem natural, then, to apply statistical methods to study structural variability in protein structures. In this paper, we undertake such an approach. We use regression, the most classical field of statistical analysis, to analyze the secondary structures of a family of protein structures (Zar, 2010; Ho, 2013). We assume that variations in protein structure can be represented by a statistical formulation. Our formulation can be solved using techniques from regression analysis to obtain a model with high generalization power.

FORMULATION OF THE PROBLEM
To estimate the conformation of the protein at a given residue, we consider 6 right and 6 left neighbors of this residue. Our hypothesis is that the conformation at the central residue is determined by these neighbors and by itself.
Primary structure:   DETTAL CDNGSG
Secondary structure: CCCCCC SSSSSS

(a) Database
Primary structures of 200 proteins are obtained from the PDB website. Secondary structures of these proteins are obtained from the X-ray crystallography analyses in three conformations: helix "h", sheet "s", and others ".". Others are interpreted as coils "c".

No.  Amino acid      Three-letter  One-letter
1    Alanine         Ala           A
2    Arginine        Arg           R
3    Asparagine      Asn           N
4    Aspartic acid   Asp           D
5    Cysteine        Cys           C
6    Glutamine       Gln           Q
7    Glutamic acid   Glu           E
8    Glycine         Gly           G
9    Histidine       His           H
10   Isoleucine      Ile           I
11   Leucine         Leu           L
12   Lysine          Lys           K
13   Methionine      Met           M
14   Phenylalanine   Phe           F
15   Proline         Pro           P
16   Serine          Ser           S
17   Threonine       Thr           T
18   Tryptophan      Trp           W
19   Tyrosine        Tyr           Y
20   Valine          Val           V
Table 1 Names and symbols of 20 amino acids
Given a protein chain, it is easy to create the corresponding sequence of amino acids by replacing each amino acid in the chain with its one-letter code in the Latin alphabet. The result is a word over the alphabet of amino acids. This word can be called the primary structure of the protein, provided that its letters are in the same order as the amino acids in the protein chain.
A secondary structure of a protein is a subsequence of amino acids of that protein. These subchains form regular structures in three-dimensional space that have the same shape in different proteins. In the analysis, the secondary structures are represented in the same way as the primary ones. A secondary structure is represented by a word over the alphabet of secondary structures, in which each kind of secondary structure has its own unique letter: α-helix, H; β-sheet, S; and coil, C. An alphabet consisting of these three secondary structures has been considered in the analysis.
(c) Coding the Data
In this paper, the data corresponding to an amino acid consist of this amino acid together with its 6 right and 6 left neighboring amino acids in the primary chain of the protein, as in Table 2. In the second row, the secondary structure conformations of these amino acids are given.

Amino acids:          A E E K E A V L G L W G K
Secondary structure:  H H H H H E E E E C C C E

Table 3 Codes for secondary structure letters H, E, and C.

The data corresponding to an amino acid is coded by a 20 × 13 matrix, with one row for each of the 20 amino acids of Table 1 and one column for each of the 13 positions in the window.
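The coding step can be illustrated with a short sketch. It assumes, since the matrix itself is not reproduced here, that the coding is one-hot: each of the 13 columns holds a single 1 in the row of the amino acid found at that position, with rows ordered by the alphabetical numbering of Table 1. The names `AMINO_ACIDS`, `ROW_OF` and `encode_window` are illustrative and do not come from the paper.

```python
import numpy as np

# One-letter codes in the alphabetical order of Table 1 (Ala = 1, ..., Val = 20).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
ROW_OF = {aa: row for row, aa in enumerate(AMINO_ACIDS)}


def encode_window(window: str) -> np.ndarray:
    """Return a 20 x 13 matrix with exactly one 1 per column.

    Assumption: one-hot coding; column j marks the amino acid at position j.
    """
    if len(window) != 13:
        raise ValueError("expected a window of 13 residues")
    x = np.zeros((20, 13))
    for col, aa in enumerate(window):
        x[ROW_OF[aa], col] = 1.0
    return x


# The 13-residue example of Table 2.
x = encode_window("AEEKEAVLGLWGK")
```

Flattening `x` with `x.ravel()` gives the 20 × 13 = 260 independent variables of one data point.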

THE MULTIPLE-REGRESSION EQUATION
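A simple linear regression for a population of paired variables is the relationship
$$Y_i = \alpha + \beta X_i. \tag{1}$$
In this relationship, $Y$ and $X$ represent the dependent and independent variables, respectively; $\beta$ is the regression coefficient in the sampled population; and $\alpha$, the $Y$ intercept, is the predicted value of $Y$ in the population when $X$ is zero. The subscript $i$ in this equation indicates the $i$th pair of $X$ and $Y$ data in the sample.
In some situations, however, $Y$ may be considered dependent upon more than one variable. Thus,
$$Y = \alpha + \beta_{11}X_{11} + \beta_{12}X_{12} + \cdots + \beta_{mn}X_{mn} \tag{2}$$
or, more succinctly,
$$Y = \alpha + \sum_{i=1}^{m}\sum_{j=1}^{n}\beta_{ij}X_{ij} \tag{3}$$
when there are $m \times n$ independent variables.
In the particular multiple-regression model of this article, we have three sets of one dependent variable and $20 \times 13$ independent variables ($m = 20$, $n = 13$).
The population parameters $\beta_{11}, \beta_{12}, \ldots, \beta_{mn}$ are termed partial regression coefficients because each expresses only part of the dependence relationship; $\beta_{ij}$ expresses how much $Y$ would change for a unit change in $X_{ij}$ if all other independent variables were held constant. It is sometimes said that $\beta_{ij}$ is a measure of the relationship of $Y$ to $X_{ij}$ after controlling for the other independent variables; that is, it is a measure of the extent to which $Y$ is related to $X_{ij}$ after removing the effects of the other independent variables. The $Y$ intercept, $\alpha$, is the value of $Y$ when all of $X_{11}, X_{12}, \ldots, X_{mn}$ are zero.
A regression with $m \times n$ independent variables defines an $m \times n$ dimensional surface, sometimes referred to as a "response surface" or "hyperplane." The population data whose relationship is described by Equation (2) will probably not all lie exactly on a plane, so this equation may be expressed as
$$Y = \alpha + \beta_{11}X_{11} + \beta_{12}X_{12} + \cdots + \beta_{mn}X_{mn} + \varepsilon, \tag{4}$$
where $\varepsilon$, the "residual" or "error," is the amount by which $Y$ differs from what is predicted by $\alpha + \beta_{11}X_{11} + \beta_{12}X_{12} + \cdots + \beta_{mn}X_{mn}$. The sum of all $\varepsilon$'s is zero, and the $\varepsilon$'s are assumed to be normally distributed.
If we sample the population containing the $m \times n + 1$ variables $Y, X_{11}, X_{12}, \ldots, X_{mn}$ in Equation (3), we can compute sample statistics to estimate the population parameters in the model.
The multiple-regression function derived from a sample of data would be
$$\hat{Y} = a + b_{11}X_{11} + b_{12}X_{12} + \cdots + b_{mn}X_{mn}.$$
The sample statistics $a, b_{11}, \ldots, b_{mn}$ are estimates of the population parameters $\alpha, \beta_{11}, \beta_{12}, \ldots, \beta_{mn}$, respectively, where each partial regression coefficient $b_{ij}$ is the expected change in $Y$ in the population for a change of one unit in $X_{ij}$ if all of the other $m \times n - 1$ independent variables are held constant, and $a$ is the expected population value of $Y$ when each $X_{ij}$ is zero.
Theoretically, in multiple-regression analyses there is no limit to $m \times n$, the number of independent variables ($X_{ij}$) that can be proposed as influencing the dependent variable ($Y$), as long as the size of the data satisfies $N \ge m \times n + 2$. That is, at least $m \times n + 2$ data points are required to perform a multiple-regression analysis with $m \times n$ independent variables.
The criterion for defining the "best fit" multiple-regression equation is most commonly that of least squares, which selects the regression equation with the minimum residual sum of squares
$$\sum_{k=1}^{N}\left(Y_k - \hat{Y}_k\right)^2 = \sum_{k=1}^{N}\Bigl(Y_k - a - \sum_{p=1}^{m}\sum_{q=1}^{n} b_{pq}X_{pqk}\Bigr)^2,$$
where $Y_k$ and $X_{pqk}$ denote the values of the variables at the $k$th of the $N$ data points. Setting the partial derivatives with respect to $a$ and each $b_{ij}$ to zero leads to
$$\begin{aligned} aN + \sum_{p=1}^{m}\sum_{q=1}^{n} b_{pq}\sum_{k=1}^{N} X_{pqk} &= \sum_{k=1}^{N} Y_k, \\ a\sum_{k=1}^{N} X_{ijk} + \sum_{p=1}^{m}\sum_{q=1}^{n} b_{pq}\sum_{k=1}^{N} X_{ijk}X_{pqk} &= \sum_{k=1}^{N} X_{ijk}Y_k, \end{aligned} \tag{9}$$
$i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$.
For the $m \times n + 1$ unknowns $a, b_{11}, \ldots, b_{mn}$, we have $m \times n + 1$ linear equations in (9). After flattening the matrix of unknowns to a vector with $m \times n + 1$ components, $a, b_{11}, b_{12}, \ldots, b_{mn}$, the coefficient matrix becomes
$$A = \begin{pmatrix} N & \sum_k X_{11k} & \sum_k X_{12k} & \cdots & \sum_k X_{mnk} \\ \sum_k X_{11k} & \sum_k X_{11k}X_{11k} & \sum_k X_{11k}X_{12k} & \cdots & \sum_k X_{11k}X_{mnk} \\ \sum_k X_{12k} & \sum_k X_{12k}X_{11k} & \sum_k X_{12k}X_{12k} & \cdots & \sum_k X_{12k}X_{mnk} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \sum_k X_{mnk} & \sum_k X_{mnk}X_{11k} & \sum_k X_{mnk}X_{12k} & \cdots & \sum_k X_{mnk}X_{mnk} \end{pmatrix} \tag{10}$$
and the right-hand side vector of the linear system of equations, which has $m \times n + 1$ components, is
$$c = \Bigl(\sum_k Y_k,\ \sum_k X_{11k}Y_k,\ \sum_k X_{12k}Y_k,\ \ldots,\ \sum_k X_{mnk}Y_k\Bigr)^{T}. \tag{11}$$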
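As a minimal sketch of how the system defined by (10) and (11) might be assembled and solved numerically, the snippet below prepends a column of ones to the flattened predictors so that the coefficient matrix and right-hand side are plain matrix products. The function name `fit_normal_equations` and the synthetic data are illustrative assumptions, not part of the paper. `numpy.linalg.lstsq` is used instead of a direct inverse because the coefficient matrix can be singular: under a one-hot coding of the windows, for example, the 20 indicator columns of any one position sum to the intercept column.

```python
import numpy as np


def fit_normal_equations(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Solve the normal equations for (a, b_11, ..., b_mn).

    X has shape (N, m*n): one flattened data point per row.
    y has shape (N,): the dependent variable.
    """
    n_points = X.shape[0]
    Z = np.hstack([np.ones((n_points, 1)), X])  # leading 1 carries the intercept a
    A = Z.T @ Z  # coefficient matrix, cf. (10)
    c = Z.T @ y  # right-hand side vector, cf. (11)
    # Minimum-norm least-squares solution; also works when A is rank deficient.
    theta, *_ = np.linalg.lstsq(A, c, rcond=None)
    return theta


# Synthetic, noise-free check with 6 predictors (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true = np.array([0.5, 1.0, -2.0, 0.0, 3.0, 0.25, -1.0])  # a followed by the b's
y = true[0] + X @ true[1:]
theta = fit_normal_equations(X, y)
```

On this well-conditioned toy problem the recovered `theta` agrees with `true` up to rounding; on rank-deficient data the individual coefficients are not unique, although the fitted values are.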

IMPLEMENTATION OF THE MULTIPLE-REGRESSION MODEL
In our data we have three types of conformations: H, S, and C. Therefore we have three different dependent variables, and accordingly we fit three multivariate regression models, one for each of them. The first dependent variable has the value 1 for H and zero for S and C; the second has the value 1 for S and zero for H and C; and the third has the value 1 for C and zero for H and S.
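The three dependent variables described above amount to 0/1 indicator columns, which can be sketched as follows. The helper name `indicator_targets` and the five-letter input string are illustrative only.

```python
import numpy as np

CLASSES = "HSC"  # helix, sheet, coil, in the order of the three models


def indicator_targets(ss: str) -> np.ndarray:
    """Return an (N, 3) matrix; column k is the dependent variable of model k."""
    labels = np.array(list(ss))
    return np.stack([(labels == c).astype(float) for c in CLASSES], axis=1)


Y = indicator_targets("HHSCC")
```

Each row of `Y` contains a single 1, so the same matrix of independent variables can be paired with each column in turn to fit the three models.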

Training Data
The first 150 000 amino acid residues and the corresponding secondary structures, from around 170 proteins, are concatenated to form a long string of amino acids. From this string 13-tuples are formed, and the amino acids occurring in the right and left neighborhoods of the central amino acid, together with the central amino acid itself, are coded as shown in Table 4. These 20 × 13 matrices are the values of the independent variables. The value of the dependent variable depends on the conformation of the central amino acid. For this data, the matrix A in (10), which is the same for all three models, is computed. The right-hand side vector c in (11) depends on the dependent variable and is therefore computed separately for each model.
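Forming the 13-tuples can be sketched as a sliding window over the concatenated strings. The treatment of the first and last six residues, which lack a full neighborhood, is not specified here; the sketch simply skips them, which is an assumption. The function name `make_windows` and the 15-residue toy input (the Table 2 example followed by two made-up residues) are illustrative.

```python
def make_windows(seq: str, ss: str, half: int = 6) -> list[tuple[str, str]]:
    """Pair every residue that has `half` neighbors on each side with its label.

    Returns (window, label) tuples, where the window holds 2 * half + 1 residues
    and the label is the conformation of the central residue.
    """
    if len(seq) != len(ss):
        raise ValueError("sequence and structure strings must have equal length")
    pairs = []
    for t in range(half, len(seq) - half):
        pairs.append((seq[t - half : t + half + 1], ss[t]))
    return pairs


# Toy input: the last two residues and labels are hypothetical.
pairs = make_windows("AEEKEAVLGLWGKDE", "HHHHHEEEECCCECC")
```

A string of length L yields L − 12 data points under this convention, so the toy input gives three windows.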

Testing Data
Using the remaining 19 026 residues of the concatenated proteins, the testing data is coded and prepared in the same way as the training data. Each test sample is passed to the three models and the value of each dependent variable is computed. The model that produces the largest output determines the conformation of the central amino acid of the sample considered. The predictions of the regression model are then compared with the true conformations to obtain the confusion matrix and the success rate in classifying the testing data as helix, sheet, and coil. Correct classification rates on the training and testing data are given in Table 5.
                      Training %   Testing %
Regression Analysis   58.84        62.50
Table 5 Correct classification rates on the training and testing data
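The decision rule and the evaluation described in this section can be sketched as below. The model outputs and true labels are made-up numbers for illustration, and the helper names `predict` and `confusion_matrix` are hypothetical.

```python
import numpy as np

CLASSES = "HSC"  # index 0 = helix, 1 = sheet, 2 = coil


def predict(outputs: np.ndarray) -> np.ndarray:
    """outputs has shape (N, 3); the model with the largest output wins."""
    return np.argmax(outputs, axis=1)


def confusion_matrix(true_idx: np.ndarray, pred_idx: np.ndarray, k: int = 3) -> np.ndarray:
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((k, k), dtype=int)
    np.add.at(cm, (true_idx, pred_idx), 1)
    return cm


# Hypothetical outputs of the three models for four test samples.
outputs = np.array([
    [0.70, 0.10, 0.20],
    [0.20, 0.30, 0.50],
    [0.10, 0.60, 0.30],
    [0.40, 0.35, 0.25],
])
true_idx = np.array([0, 2, 1, 2])
pred_idx = predict(outputs)
cm = confusion_matrix(true_idx, pred_idx)
accuracy = np.trace(cm) / cm.sum()  # correct classification rate
```

In this toy case three of the four samples are classified correctly; the fourth, a coil, is assigned to helix and appears off the diagonal of `cm`.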

RESULTS AND DISCUSSION
To compare the robustness of the system with that of the freely accessible tools on the web, we have chosen 14 additional proteins from the NCBI Protein database together with their secondary structures determined through X-ray analysis. The secondary structure estimates for these proteins are obtained through the tools given on the Chou-Fasman website 1. The experiment is made using Chou-Fasman (C-F) and Neural Network (ANN) estimates. A comparison of the regression results of this paper with the results from these experiments is given in Table 6.
Table 6 Correctness of the estimates for the secondary structure in three experiments using the Chou-Fasman, Neural Network, and regression models.
These results show that regression analysis relying on a database of 200 proteins has an estimation power comparable with that of well-known online estimation tools.