Protein Secondary Structure Prediction Based on Physicochemical Features and PSSM by KNN



INTRODUCTION
Although it is probably true that protein properties can be inferred from a given protein primary structure, current state-of-the-art approaches are not able to do this in practice. Many different approaches and algorithms have been designed to predict the secondary structure of a protein from its known primary sequence, but no algorithm predicts it with the desired accuracy. In this paper, protein secondary structure is investigated on the basis of protein primary structure and its physicochemical properties.

Southeast Europe Journal of Soft Computing
Available online: www.scjournal.ius.edu.ba
Vol. 4, No. 1, March 2015, ISSN 2233-1859
Faculty of Engineering and Natural Sciences, Hrasnicka Cesta 15, Ilidža 71210 Sarajevo, Bosnia and Herzegovina

Abstract
In this paper, we propose a protein secondary structure prediction method based on the k-nearest neighbor (KNN) technique with position-specific scoring matrix (PSSM) profiles, a propensity matrix of amino acids in three conformations (H, E, and C), and three physicochemical features: hydrophobicity, net charge, and side chain mass. First, the KNN with the optimal k-value is found. Then, the Euclidean distances from the 26-dimensional data vector of each amino acid of a protein to the data vectors of all other proteins are computed. The conformations of the nearest seven amino acids are pooled, and the majority of the pooled votes is assigned to the amino acid of the query protein as its conformation: H, E, or C. Finally, we use a filter to refine the predicted results from KNN. After filtering, the accuracy of the prediction goes up to the level of 90% for some proteins. This suggests that considering PSSM, the propensity matrix, and physicochemical features may yield better performance.
A protein primary sequence is composed of 20 different kinds of amino acids, each denoted by a different letter of the Latin alphabet, as shown below.

No.  Amino acid   Three-letter code  One-letter code
...
17   Threonine    Thr                T
18   Tryptophan   Trp                W
19   Tyrosine     Tyr                Y
20   Valine       Val                V

Owing to differences in their side chain sizes, shapes, reactivity, and ability to form hydrogen bonds, amino acids fold differently, and the secondary structure of a protein sequence arises from this folding into helices, sheets, and coils (Chou and Fasman, 1978; Garnier et al., 1978). Furthermore, owing to differences in side chain size, number of electric charges, and affinity for water, the tertiary structures of protein sequences are not all the same, even when their primary sequences are similar. Thus, the exploration of molecular structures on protein sequences is divided into primary, secondary, tertiary, and even quaternary structures (Huang and Chen, 2013).

Through X-ray analysis, the secondary structure corresponding to a given protein primary sequence may be obtained.

FEATURE EXTRACTION
Five relevant kinds of features are extracted from protein sequences to predict protein secondary structure: 1) conformation parameters, 2) position-specific scoring matrix (PSSM) profiles, 3) net charge, 4) hydrophobicity, and 5) side chain mass.

Extracting Primary and Secondary Sequences:
Amino acid primary and secondary structures were extracted from the PDB website (http://www.rcsb.org/pdb/home/home.do) using the PDB codes of 25PDB. Five different features are then extracted from the amino acid sequences as follows.

Propensity Matrix
Intrinsic properties of amino acids indicate their tendency to adopt a certain conformation. The main idea of using a propensity table is to exploit these properties and identify statistically significant contributions to prediction capacity. In general, protein secondary structure is divided into three types: α-helix (H), β-sheet (E), and coil (C), so there are three values for each amino acid. In the feature extraction, all the conformation parameters are calculated from a data set. The conformation parameter $S_{ij}$ for each amino acid is defined as follows:

$$S_{ij} = \frac{a_{ij}}{a_i}, \quad i = 1, 2, \ldots, 20; \quad j = 1, 2, 3. \tag{1}$$

In this formula, $i$ indexes the 20 amino acids and $j$ indexes the 3 types of secondary structure: H, E, and C. Here, $a_i$ is the number of occurrences of the $i$th amino acid in the data set, whereas $a_{ij}$ is the number of occurrences of the $i$th amino acid with the $j$th secondary structure. The conformation parameters for each amino acid in a data set of 20347 proteins are shown in Table 2. Conformation parameters are used as features because the folding of each residue is related to the formation of a specific structure.
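As a rough illustration of how the conformation parameters could be computed, the sketch below reads each parameter as the ratio of a_ij to a_i, using the counts defined above. The function name `conformation_parameters`, the input format (parallel lists of sequence strings and H/E/C strings), and the toy data are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 amino acids
STATES = "HEC"                        # helix, sheet, coil


def conformation_parameters(sequences, structures):
    """Return {amino_acid: [S_iH, S_iE, S_iC]}, assuming S_ij = a_ij / a_i."""
    a_i = Counter()   # occurrences of each amino acid
    a_ij = Counter()  # occurrences of each (amino acid, state) pair
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure lengths differ")
        for aa, state in zip(seq, ss):
            if aa in AMINO_ACIDS and state in STATES:
                a_i[aa] += 1
                a_ij[aa, state] += 1
    # Amino acids that never occur get 0.0 for all three states.
    return {
        aa: [a_ij[aa, s] / a_i[aa] if a_i[aa] else 0.0 for s in STATES]
        for aa in AMINO_ACIDS
    }


# Toy data, made up for illustration only.
params = conformation_parameters(["AAGA", "GA"], ["HHCE", "CH"])
print(params["A"], params["G"])
```

Each amino acid then contributes three values (one per conformation) to its feature vector.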

PSSM Profiles
PSSM profiles are generated by the PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool) program (Altschul et al., 1997). Since PSSM profiles reflect biological evolution, we consider them as features in our work. A PSSM profile has L×20 elements, where L is the length of a query sequence. These profiles are then used as input features for the classifier, employing a sliding window method. The position weight matrix was introduced by the American geneticist Gary Stormo and colleagues in 1982 (Stormo et al., 1982) and has proved a good alternative to the consensus sequence. Consensus sequences had previously been used to represent patterns in biological sequences, but had difficulty predicting new occurrences of these patterns. To generate the profiles, a database containing all known sequences (the non-redundant, or nr, database) is first selected. Then, low-complexity regions are removed from the nr database. Finally, the PSI-BLAST program is used to query each sequence in 25PDB, generating PSSM profiles after three iterations. Multiple sequence alignment (MSA) and the BLOSUM62 matrix (Henikoff and Henikoff, 1992) are used in this process.
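To make the L×20 shape of a profile concrete, here is a deliberately simplified position weight matrix built from a small ungapped alignment. It is not PSI-BLAST's procedure (no iteration, sequence weighting, or BLOSUM62-based pseudocounts); the uniform background frequency, the flat pseudocount, the function name `toy_pssm`, and the example alignment are all assumptions for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_pssm(aligned, pseudocount=1.0):
    """Return log-odds scores: one row of 20 values per alignment column."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for col in range(length):
        column = [s[col] for s in aligned]
        row = []
        for aa in AMINO_ACIDS:
            # Smoothed frequency of this amino acid in this column.
            freq = (column.count(aa) + pseudocount) / total
            row.append(math.log2(freq / background))
        matrix.append(row)
    return matrix


# Made-up alignment of three sequences of length 3.
pssm = toy_pssm(["ACD", "ACE", "AGD"])
print(len(pssm), len(pssm[0]))  # rows = sequence length, columns = 20
```

Residues that are conserved in a column receive positive scores and rarely seen residues receive negative ones, which is the evolutionary signal the profile contributes as 20 features per position.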

Net Charges
One of the physical properties of amino acids is their charge. Five of the amino acids are charged: R, D, E, H, and K. Residues with similar electric charges repel each other, which disrupts the hydrogen bonds in the main chain and prevents the formation of an α-helix. In addition, continuous β-sheet formation is not possible when the residues have similar charges. This physical property of amino acids therefore helps to predict the secondary structure of proteins. The net charge of each amino acid can be obtained from the Amino Acid index database (Kawashima et al., 1999; Kawashima and Kanehisa, 2000; Kawashima et al., 2008; Nakai et al., 1998; Tomii and Kanehisa, 1996), as shown in Table 3.

Hydrophobicity
Some amino acids do not like to reside in an aqueous environment and are called hydrophobic amino acids. They are generally found buried within the hydrophobic core of a protein, since during protein folding polar residues tend to stay on the outside of the protein and keep nonpolar residues from being exposed to the polar solvent. Hydrophobicity can therefore be used as one of the parameters for predicting the secondary structure of proteins. In an α-helix, hydrophobic segments are generally followed by hydrophilic segments. This is not the case in β-sheets, whose structure is affected by the environment because of its structural characteristics. The hydrophobicity values of amino acids can also be obtained from the Amino Acid index database (AAindex), as shown in Table 4. Positive values indicate greater hydrophobicity.

Side Chain Mass
Although the basic structure shown in Fig. 3 is the same for all 20 amino acids, the size of the side chain R group still influences structure folding. Side chains are the structural elements that make amino acids different from one another. These unique R groups influence the conformation of protein secondary structure, and their presence at certain positions can give a clue to the secondary structural element. The side chain R groups lie on the outside of the main chain of an α-helix, but when large R groups are distributed continuously, they can make the α-helix unstable. For instance, proline contains a ring of 5 atoms, which makes it difficult to form hydrogen bonds. In addition, it is generally observed that the R groups in β-sheet structures are smaller than those in other structures. Side chain mass is therefore considered one of the important features that can contribute to predicting the secondary structure of proteins.

KNN Classifier
The KNN used in the experiments is a classifier for predicting the secondary structures H, E, and C. Three-fold cross-validation is employed on the 25PDB data set to find the optimal neighbor number k. Here, the distances between the data vectors are first measured by Euclidean distance. Other distance measures are then also used for comparison.
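The voting step and the search for k can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `knn_predict` and `choose_k`, the candidate values of k, the residue-level fold split (the paper does not say how the three folds are formed), and the 2-dimensional toy vectors standing in for the 26-dimensional feature vectors are all assumptions.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(query, train_vectors, train_labels, k=7):
    """Predict H, E, or C by majority vote among the k nearest neighbors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(query, pair[0]),
    )
    votes = Counter(label for _, label in ranked[:k])
    # On a tie, the label encountered first (nearest neighbor) wins.
    return votes.most_common(1)[0][0]


def choose_k(vectors, labels, candidates=(1, 3, 5, 7, 9), folds=3):
    """Pick the k with the best accuracy over a simple fold-wise split."""
    best_k, best_acc = None, -1.0
    indexed = list(enumerate(zip(vectors, labels)))
    for k in candidates:
        correct = 0
        for fold in range(folds):
            train = [(v, l) for i, (v, l) in indexed if i % folds != fold]
            held_out = [(v, l) for i, (v, l) in indexed if i % folds == fold]
            train_v = [v for v, _ in train]
            train_l = [l for _, l in train]
            correct += sum(
                knn_predict(v, train_v, train_l, k) == l for v, l in held_out
            )
        acc = correct / len(vectors)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k


# Toy 2-D data, made up for illustration; real vectors would be 26-D.
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["H", "H", "H", "E", "E", "E"]
print(knn_predict((0.2, 0.1), X, y, k=3), choose_k(X, y, candidates=(1, 3, 5)))
```

Swapping `euclidean` for another distance function is enough to repeat the comparison with other distance measures.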

Filter
A single amino acid cannot form an α-helix or a β-sheet on its own. If an isolated conformation appears in the predicted sequence, it is therefore treated as an incorrect prediction and replaced with a more reasonable conformation. For the current scanning window (i-1, i, i+1) in the predicted secondary structure, two possible structures could occur at position i. Case H: if str(i-1) and str(i+1) are H, then str(i) is not changed; otherwise, the examined segment is extended to (i-3, i-2, i-1, i, i+1, i+2, i+3) and str(i) is replaced with the majority structure in the examined segment.
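One possible reading of this filter is sketched below. Only Case H is spelled out in the text, so several choices here are assumptions: residues predicted as E are handled in the same way as H, coil residues are left untouched, the seven-residue segment is truncated at the ends of the sequence, ties in the majority vote keep the original prediction, and all decisions are read from the unfiltered prediction. The name `filter_prediction` and the parameter `half_window` are hypothetical.

```python
from collections import Counter


def filter_prediction(pred, half_window=3):
    """Smooth a predicted H/E/C string using the neighbor and majority rule."""
    out = list(pred)
    n = len(pred)
    for i, state in enumerate(pred):
        if state not in "HE":
            continue  # assumption: coil predictions are not modified
        left = pred[i - 1] if i > 0 else None
        right = pred[i + 1] if i < n - 1 else None
        if left == state and right == state:
            continue  # both neighbors agree: keep the prediction
        # Otherwise examine (i-3 .. i+3), truncated at the sequence ends.
        segment = pred[max(0, i - half_window): i + half_window + 1]
        ranked = Counter(segment).most_common()
        top_count = ranked[0][1]
        winners = [s for s, c in ranked if c == top_count]
        if len(winners) == 1:
            out[i] = winners[0]
        # On a tie the original prediction is kept (assumption).
    return "".join(out)


print(filter_prediction("CCCHCCC"))
print(filter_prediction("HHHHEHHHH"))
```

Under these assumptions an isolated H or E is absorbed into its surroundings, while residues at the ends of a longer helix or strand are kept as long as their state still holds the majority in the extended segment.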

A. Data Set
Many different data sets are used for predicting the secondary structure of proteins, such as RS126 (Rost and Sander, 1993), CB513 (Cuff and Barton, 1999), CASP (Moult et al., 1995), and EVA (Eyrich et al., 2001). The 25PDB data set was selected for our studies because the similarity between its sequences is less than 25%. 25PDB was designed for predicting protein classes, but it is also useful for predicting the secondary structure of proteins: because the similarity is very small, it lets us predict the secondary structure of proteins more accurately. 25PDB contains 1674 amino acid sequences and can be downloaded from http://biomine.ece.ualberta.ca/SCPRED/SCPRED.htm

B. Performance Measures
Among the performance measures frequently used in protein secondary structure prediction, Q3, the three-state overall per-residue accuracy, is used here. Q3 is a residue-based measure of the overall percentage of correctly predicted residues across the three structures, which can be represented as Formula (2):

$$Q_3 = \frac{N_H + N_E + N_C}{N} \times 100\% \tag{2}$$

where $N$ is the total number of predicted residues, and $N_H$, $N_E$, and $N_C$ are the numbers of correctly predicted residues for helix, sheet, and coil, respectively.
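The Q3 measure described above, the number of correctly predicted helix, sheet, and coil residues divided by the total number of residues, can be computed with a few lines of code. The function name `q3` and the example strings are assumptions for illustration.

```python
def q3(predicted, observed):
    """Three-state per-residue accuracy, in percent."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = {"H": 0, "E": 0, "C": 0}  # N_H, N_E, N_C
    for p, o in zip(predicted, observed):
        if p == o and o in correct:
            correct[o] += 1
    return 100.0 * sum(correct.values()) / len(observed)


# Made-up example: three of the four residues are predicted correctly.
print(q3("HHEC", "HHCC"))
```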

C. Experimental Results
In this section, we first present the accuracy of secondary structure prediction using charts that show the frequencies of proteins at each accuracy level. In the all-α and all-β protein classes, up to 90% accuracy is achieved (Figure 3). In the all-β protein class, up to 86% secondary structure prediction accuracy is achieved (Figure 5). In the α+β protein class, up to 80% secondary structure prediction accuracy is achieved (Figure 6). In the α/β protein class, up to 77% secondary structure prediction accuracy is achieved.

D. Filtering Effect
For the all-α protein class, filtering of the outputs improved the mean accuracy from 65.73% to 67.09%; the frequencies of the percentage increases in accuracy are shown in Figure 7. For the all-β protein class, filtering improved the mean accuracy from 59.60% to 61.60% (Figure 8). For the α+β protein class, filtering improved the mean accuracy from 49.97% to 52.80% (Figure 9). For the α/β protein class, filtering improved the mean accuracy from 52.07% to 54.36% (Figure 10). Average accuracies without and with the filter are given in Table 6.

CONCLUSIONS
In this paper, we propose a protein secondary structure prediction method using PSSM profiles and four further kinds of features: conformation parameters, net charge, hydrophobicity, and side chain mass. In the experiments, the KNN with the optimal neighbor size k is found first. Then, the majority conformation among the k neighbors of a given amino acid in a certain class is assigned to this amino acid as its secondary structure.
Finally, we use the filter to refine the predicted results from the KNN. Although KNN is among the simplest of methods, we achieved secondary structure prediction accuracy of up to 90% for some proteins in the 25PDB data set. In summary, considering these physicochemical features together with PSSM profiles results in better performance.