Methodology
The tool works in two steps:
(a) Extraction of sequence- and structure-based features from the query protein (input processing), and
(b) Hierarchical identification of the structural class, the fold within that structural class, and the closest template (see the Flowchart for a schematic representation).
Input processing: The input to the HPFP tool is the amino acid sequence of the query protein whose structural fold is to be predicted. Several fold-discriminatory features are calculated from this sequence:
i. Secondary structure state frequencies of amino acids and amino acid pairs
ii. Solvent accessibility state frequencies of amino acids and amino acid pairs
The two structural properties, namely the secondary structure and the solvent accessibility of the amino acid residues, are predicted from the query protein sequence. We use PSIPRED (McGuffin et al., 2000) and ACCpro (Cheng et al., 2005) to predict secondary structures and solvent accessibilities, respectively.
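The feature calculation can be illustrated with a minimal sketch. It is not the HPFP implementation: the exact feature definitions are not given here, so the sketch assumes that a single-residue feature is the fraction of positions occupied by amino acid a in predicted state s, and that a pair feature counts adjacent residues that share the same predicted state. The function name state_frequencies and the state alphabets ("HEC" for secondary structure, "BE" for buried/exposed) are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def state_frequencies(sequence, states, alphabet):
    """Frequencies of (residue, state) and (residue pair, state) combinations.

    sequence: amino acid string; states: per-residue predicted state string
    of the same length; alphabet: the allowed state symbols.
    """
    if not sequence or len(sequence) != len(states):
        raise ValueError("sequence and states must be non-empty and equal in length")
    n = len(sequence)
    single = Counter(zip(sequence, states))
    # Assumption: an adjacent pair is counted only when both residues
    # are predicted to be in the same state.
    pairs = Counter(
        (sequence[i] + sequence[i + 1], states[i])
        for i in range(n - 1)
        if states[i] == states[i + 1]
    )
    features = {}
    for s in alphabet:
        for a in AMINO_ACIDS:
            features[(a, s)] = single[(a, s)] / n
            for b in AMINO_ACIDS:
                features[(a + b, s)] = pairs[(a + b, s)] / (n - 1) if n > 1 else 0.0
    return features


# Toy input: a four-residue sequence with a made-up secondary structure string.
ss_features = state_frequencies("ACDA", "HHEC", alphabet="HEC")
# The same function applies to accessibility states, e.g. alphabet="BE".
acc_features = state_frequencies("ACDA", "BBEE", alphabet="BE")
```

Under these assumptions each state contributes 20 single-residue and 400 pair features, so a three-state secondary structure alphabet yields 1260 values and a two-state accessibility alphabet yields 840.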
Hierarchical identification:
a) Structural class identification: The structural class of the query protein is predicted using our SVM-based structural class prediction method.
b) Fold identification: Once the structural class has been identified, the query is subjected to fold identification within that structural class. Fold recognition essentially means identifying the most appropriate fold among the known protein folds. This identification is carried out with a Support Vector Machine (SVM), a supervised machine-learning method first developed by Vapnik (1995). We use three multi-class methods: the all-together method (referred to as the Crammer and Singer method) and two methods based on binary classification, one-versus-all and one-versus-one. All SVM computations are carried out using LIBSVM (Chang and Lin, 2001) with an RBF kernel, with the values of the cost parameter C and the kernel parameter g optimized by us. Before the actual predictions are carried out, SVM models are created for every fold. This is referred to as training the SVM.
c) Identification of the closest template: Once the fold of the query has been predicted, pairwise comparisons are made between the query and the proteins of the predicted fold to identify close structural homologues, which form an essential input for structure modeling with homology modeling tools. HPFP reports the 10 closest homologues.
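Two of the steps above, one-versus-one fold voting and the selection of the closest templates, can be sketched in a few lines. This is an illustration under stated assumptions, not the HPFP code: in the tool each pairwise decision would come from a trained binary SVM, whereas here a toy decision function stands in for it. The names one_vs_one_vote, closest_templates, the fold labels, the template identifiers and the similarity scores are all hypothetical, and the tie-breaking rule is an assumption.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_vote(folds, pairwise_winner):
    """Return the fold that wins the most pairwise contests.

    pairwise_winner(a, b) must return either a or b; in a real system it
    would wrap a binary classifier trained on folds a and b.
    """
    votes = Counter()
    for a, b in combinations(folds, 2):
        votes[pairwise_winner(a, b)] += 1
    # Assumption: ties go to the fold listed first in `folds`.
    return max(folds, key=lambda fold: votes[fold])


def closest_templates(scores, k=10):
    """Return the k template ids with the highest similarity to the query."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [template for template, _ in ranked[:k]]


# Toy stand-in for the binary classifiers: the fold with the larger
# made-up score wins each contest.
toy_strength = {"fold_A": 0.2, "fold_B": 0.9, "fold_C": 0.5}


def toy_winner(a, b):
    return a if toy_strength[a] >= toy_strength[b] else b


predicted_fold = one_vs_one_vote(["fold_A", "fold_B", "fold_C"], toy_winner)

# Made-up similarity scores between the query and members of the predicted fold.
toy_scores = {"t1": 0.3, "t2": 0.9, "t3": 0.5}
top_two = closest_templates(toy_scores, k=2)
```

With k left at its default of 10, closest_templates mirrors the number of homologues that HPFP reports; the similarity measure itself is not specified here and is left to the caller.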