Title: | Features Extracted from Position Specific Scoring Matrix (PSSM) |
---|---|
Description: | Returns almost all features that has been extracted from Position Specific Scoring Matrix (PSSM) so far, which is a matrix of L rows (L is protein length) and 20 columns produced by 'PSI-BLAST' which is a program to produce PSSM Matrix from multiple sequence alignment of proteins see <https://www.ncbi.nlm.nih.gov/books/NBK2590/> for mor details. some of these features are described in Zahiri, J., et al.(2013) <DOI:10.1016/j.ygeno.2013.05.006>, Saini, H., et al.(2016) <DOI:10.17706/jsw.11.8.756-767>, Ding, S., et al.(2014) <DOI:10.1016/j.biochi.2013.09.013>, Cheng, C.W., et al.(2008) <DOI:10.1186/1471-2105-9-S12-S6>, Juan, E.Y., et al.(2009) <DOI:10.1109/CISIS.2009.194>. |
Authors: | Javad zahiri [aut], Alireza mohammadi [aut, cre] |
Maintainer: | Alireza mohammadi <[email protected]> |
License: | GPL-3 |
Version: | 0.2.2 |
Built: | 2025-03-11 04:17:28 UTC |
Source: | https://github.com/biocool-lab/pssmcool |
This package is a Comprehensive toolkit for generating various numerical features from PSSM Matrix correspond to each protein sequence and covers all features introduced in POSSUM Package and some other new features. see PSSMCOOL_Rpubs for more information.
Maintainer: Alireza Mohammadi [email protected]
Authors:
Alireza Mohammadi [email protected]
Javad Zahiri [email protected]
All functions in this package:
Useful links:
Report bugs at https://github.com/Alireza9651501005/PSSMCOOL/issues
For getting this feature which was used to protein structural class prediction,
at first mean of every column in PSSM Matrix is computed to achieve a 20-dimensional vector
called AAC.then by fusing it with other vector of length 400 called TPC, which is similar to DPC_PSSM
AATP feature vector of length 420 is obtained.
AATP_TPCC(pssm_name)
AATP_TPCC(pssm_name)
pssm_name |
is name of PSSM Matrix file |
a feature vector of length 420
Zhang, S., Ye, F. and Yuan, X. (2012) Using principal component analysis and support vector machine to predict protein structural class for low-similarity sequences via PSSM, Journal of Biomolecular Structure & Dynamics, 29, 634-642.
as<-AATP_TPCC(paste0(system.file("extdata",package="PSSMCOOL"),"/C7GQS7.txt.pssm"))
as<-AATP_TPCC(paste0(system.file("extdata",package="PSSMCOOL"),"/C7GQS7.txt.pssm"))
This feature consists of two types of feature vectors. at first, each protein sequence is divided into 20 equal parts, each of which is called a block, and in each block the row vectors of the PSSM matrix related to that block are added together and The resulting final vector is divided by the length of that block, which is 5 Finally, by placing these 20 vectors side by side, the first feature vector of length 400 is obtained. The second feature for each amino acid in each column is the average of the positive numbers in that column and for each block, and these 20 numbers, corresponding to 20 blocks, are placed next to each other, and therefore For each of the 20 types of amino acids, a vector of length 20 is obtained, and by placing these together, the vector of the second feature, length 400, is obtained.
AB_PSSM(pssm_name)
AB_PSSM(pssm_name)
pssm_name |
name of PSSM Matrix file |
two feature vectors AB-PSSM and RPM-PSSM each of length 400
Jeong, J.C., Lin, X. and Chen, X.W. (2011) On position-specific scoring matrix for protein function prediction , IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM, 8, 308-315.
zz<- AB_PSSM(system.file("extdata","C7GRQ3.txt.pssm",package="PSSMCOOL"))
zz<- AB_PSSM(system.file("extdata","C7GRQ3.txt.pssm",package="PSSMCOOL"))
In this feature at first PSSM Matrix is divided to 20 Blocks. Then for each
Block mean of columns is computed to get 20-dimensional vector, eventually by appending these vectors to each other final feature
vector is obtained which would be of length 400. this feature vector is similar to PSSMBLOCK
for N=20.
Averag_Block(pssm_name)
Averag_Block(pssm_name)
pssm_name |
name of PSSM Matrix file |
feature vector of length 400
L. Nanni, A. Lumini, and S. J. T. S. W. J. Brahnam, "An empirical study of different approaches for protein classification," vol. 2014, 2014.
v<-Averag_Block(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
v<-Averag_Block(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
This feature vector is constructed from PSSM Matrix as:
where i varies between 1 and L and j between 1 and 20, L indicates protein length
and "arg" represents the argument of the maximum the ith base of the consensus sequence (CS) is then set to be the
th amino acid in the amino acid alphabet and a consensus sequence is constructed.
consunsus_sequence(pssm_name)
consunsus_sequence(pssm_name)
pssm_name |
is the name of PSSM Matrix file |
consunsus sequence wich extracted from PSSM
Y. Liang, S. Liu, S. J. C. Zhang, and m. m. i. medicine, "Prediction of protein structural classes for low-similarity sequences based on consensus sequence and segmented PSSM," vol. 2015, 2015.
w<-consunsus_sequence(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
w<-consunsus_sequence(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
This feature vector is constructed by fusing consensus sequence (CS), segmented PsePSSM, and segmented auto-covariance transformation (ACT) based on PSSM. by consensus sequence a 40-dimensional feature vector is obtained, in segmented PsePSSM group, by dividing PSSM Matrix to 2 and 3 segments a 380-dimensional feature vector is obtained and in ACT group, similar to the previous group at first PSSM Matrix is divided to 2 and 3 segments then a feature vector of length 280 is obtained.eventually by fusing these features a 700-dimensional feature vector is obtained.
CS_PSe_PSSM(pssm_name, vec_name)
CS_PSe_PSSM(pssm_name, vec_name)
pssm_name |
name of PSSM Matrix file |
vec_name |
a character that user imports to specify kind of feature vector which it can be varied between four values |
If vec_name equals to "segmented_psepssm" then a feature vector of length 380 is obtained. if vec_name equals to "segmented_acpssm" then a feature vector of length 280 is obtained, and if vec_name equals to "cspssm" the obtained feature vector would be of length 40 eventually if vec_name equals to "total" then feature vector would be of length 700.
feature vector that its length depends on the vec_name which user imports
Y. Liang, S. Liu, S. J. C. Zhang, and m. m. i. medicine, "Prediction of protein structural classes for low-similarity sequences based on consensus sequence and segmented PSSM," vol. 2015, 2015.
A<-CS_PSe_PSSM(system.file("extdata", "C7GSI6.txt.pssm", package="PSSMCOOL"),"total")
A<-CS_PSe_PSSM(system.file("extdata", "C7GSI6.txt.pssm", package="PSSMCOOL"),"total")
In this feature each column of PSSM Matrix, can be regarded as a time series. Each PSSM contains 20 columns Hence, each PSSM can be considered as 20 time series.The detrended moving-average cross-correlation analysis (DMCA) is developed to measure the level of cross-correlation between two non-stationary time series by fusing the detrended cross-correlation analysis (DCCA) and the detrended moving average(DMA).this function utilises this algorithm for each column and each pair of columns to produce a feature vector of length 290.
DFMCA_PSSM(pssm_name, n)
DFMCA_PSSM(pssm_name, n)
pssm_name |
name of PSSM Matrix file |
n |
A parameter called the window size that must be smaller than the length of the sequence |
feature vector of length 210
parameter n must be equal or greater than 3 and equal or less then L which L is length of protein
Y. Liang, S. Zhang, S. J. S. Ding, and Q. i. E. Research, "Accurate prediction of Gram-negative bacterial secreted protein types by fusing multiple statistical features from PSI-BLAST profile," vol. 29, no. 6, pp. 469-481, 2018.
Y. Liang and S. J. A. b. Zhang, "Prediction of apoptosis protein’s subcellular localization by fusing two different descriptors based on evolutionary information," vol. 66, no. 1, pp. 61-78, 2018.
as<-DFMCA_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),7)
as<-DFMCA_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),7)
To construct this feature vector, Two-Dimensional DCT algorithm has been used by applying
dct
function from dtt package which DCT stands for Discrete Cosin Transform.
Discrete_Cosine_Transform(pssm_name)
Discrete_Cosine_Transform(pssm_name)
pssm_name |
name of PSSM Matrix file |
feature vector of length 400
Wang, L., et al., Advancing the prediction accuracy of protein-protein interactions by utilizing evolutionary information from position-specific scoring matrix and ensemble classifier. 2017. 418: p. 105-110.
Y. Wang, Y. Ding, F. Guo, L. Wei, and J. J. P. o. Tang, "Improved detection of DNA-binding proteins via compression technology on PSSM information," vol. 12, no. 9, 2017.
as<-Discrete_Cosine_Transform(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
as<-Discrete_Cosine_Transform(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
This feature is used to predict the disulfide bond within a protein.
disulfid(pssm_name)
disulfid(pssm_name)
pssm_name |
name of PSSM Matrix file |
For the purpose of predicting disulfide bond in protein at first, the total number of cysteine amino acids in the protein sequence is counted and their position in the protein sequence is identified. Then, using a sliding window with length of 13, moved on the PSSM matrix from top to bottom so that the middle of the window is on the amino acid cysteine, then the rows below the matrix obtained from the PSSM matrix with dimension of 13 x 20 are placed next to each other to get a feature vector with a length of 260 = 20 * 13 per cysteine, and if the position of the first and last cysteine in the protein sequence is such that the middle of sliding window is not on cysteine residue when moving on PSSM Matrix, then the required number of zero rows from top and bottom is added to the PSSM matrix to achieve this goal.Thus, for every cystine amino acid presented in protein sequence, a feature vector with a length of 260 is formed.Then all the pairwise combinations of these cysteines is wrote in the first column of a table, and in front of each of these pairwise combinations, the corresponding feature vectors are sticked together to get a feature vector of length 520 for each of these compounds.Finally, the table obtained in this way will have the number of rows equal to the number of all pairwise combinations of these cysteines and the number of columns will be equal to 521 (the first column includes the name of these pair combinations). And it is easy to divide this table into training and testing data and predict the desired disulfide bonds between cysteines.
a table with number of all cyctein pairs in rows and 521 columns correspond to feature vector length.
D.-J. Yu et al., "Disulfide connectivity prediction based on modelled protein 3D structural information and random forest regression," vol. 12, no. 3, pp. 611-621, 2014.
N. J. Mapes Jr, C. Rodriguez, P. Chowriappa, S. J. C. Dua, and s. b. journal, "Residue adjacency matrix based feature engineering for predicting cysteine reactivity in proteins," vol. 17, pp. 90-100, 2019.
aq<-disulfid(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
aq<-disulfid(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
This feature results from the connection of two vectors. The vector is the first feature of a
vector with a length of 40, which calculates the average of positive and negative values for each column
separately and puts them together. in the second feature vector, correspond to each column the difference
between the numbers in the rows that have distance of k is calculated, and then the square average for the
differences that are positive is calculated, and the same action for the differences that are negative is
performed. since k varies between 1 and , and because the value of
in this function is equal to 2,
the length of the second feature vector will be 80, which by merging with the first feature vector, the total
feature vector of length 120 will be obtained.
DP_PSSM(pssm_name, a = 2)
DP_PSSM(pssm_name, a = 2)
pssm_name |
name of PSSM matrix file |
a |
fixed parameter that user chooses which usually equals to 2 |
feature vector of length 120
Juan, E.Y., et al. (2009) Predicting Protein Subcellular Localizations for Gram-Negative Bacteria using DP-PSSM and Support Vector Machines. Complex, Intelligent and Software Intensive Systems, 2009. CISIS'09. International Conference on. IEEE, pp. 836-841.
ss<-DP_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
ss<-DP_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
This feature is combination of amino asid composition and dipeptide composition feature vectors. DPC feature stands for dipeptide composition, which multiplies the values in two consecutive rows and two different columns, calculating this for each of the different columns and obtaining the sum of these and for each. And for both columns, the product divides their sum by L-1, and because the result depends on two different columns, length of this feature vector would be 400. AAC-PSSM is actually mean of PSSM Matrix columns which its length is 20. eventually AADP-PSSM is combination of these vectors and with length 420.
DPC_PSSM(pssm_name)
DPC_PSSM(pssm_name)
pssm_name |
name of PSSM Matrix file |
feature vector of length 420
Liu, T., Zheng, X. and Wang, J. (2010) Prediction of protein structural class for low-similarity sequences using support vector machine and PSI-BLAST profile, Biochimie, 92, 1330-1334.
ss<-DPC_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
ss<-DPC_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
In construction of this feature vector, the dwt.nondyadic
function is used from
"waveslim", package
to calculate the discrete wavelet transform for each column of the PSSM matrix, which considers it as a
discrete signal. At last, 4 levels DWT is used to analysis of these discrete signals of PSSM (each column) and
extracted the PSSM-DWT feature from PSSM of protein.
dwt_PSSM(pssm_name)
dwt_PSSM(pssm_name)
pssm_name |
name of pssm Matrix file |
feature vector of length 80
Y. Wang, Y. Ding, F. Guo, L. Wei, and J. J. P. o. Tang, "Improved detection of DNA-binding proteins via compression technology on PSSM information," vol. 12, no. 9, 2017.
Y. Wang, Y. Ding, J. Tang, Y. Dai, F. J. I. A. t. o. c. b. Guo, and bioinformatics, "CrystalM: a multi-view fusion approach for protein crystallization prediction," 2019.
as<-dwt_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
as<-dwt_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
This is a feature vector of length 420 which is used for prediction of protein structural class for low-similarity sequences.at first ED-PSSM Matrix with 20*20 dimensions is constructed from PSSM Matrix then by using this Matrix, EDP and EEDP vectors are obtained eventually MEDP feature vector is obtained by fusing these vectors.
EDP_MEDP(pssm_name)
EDP_MEDP(pssm_name)
pssm_name |
is name of PSSM Matrix file |
a feature vectors of length 420
Zhang, L., Zhao, X. and Kong, L. (2014) Predict protein structural class for low-similarity sequences by evolutionary difference information into the general form of Chou's pseudo amino acid composition, Journal of Theoretical Biology, 355, 105-110.
as<-EDP_MEDP(paste0(system.file("extdata",package="PSSMCOOL"),"/C7GS61.txt.pssm"))
as<-EDP_MEDP(paste0(system.file("extdata",package="PSSMCOOL"),"/C7GS61.txt.pssm"))
This function produces list of two feature vectors named D-FPSSM and S-FPSSM which then used by FPSSM2 function to construct feature vector of length 100 for each pair of proteins which then used for protein-protein interaction prediction in each dataset.
FPSSM(pssm_name, hk)
FPSSM(pssm_name, hk)
pssm_name |
name of PSSM Matrix file |
hk |
a parameter that indicates which amino acid alphabet must be used |
two feature vectors of different length which is used in later steps.
Zahiri, J., et al. (2013) PPIevo: protein-protein interaction prediction from PSSM based evolutionary information, Genomics, 102, 237-242.
q<-FPSSM(system.file("extdata","C7GQS7.txt.pssm",package="PSSMCOOL"),8)
q<-FPSSM(system.file("extdata","C7GQS7.txt.pssm",package="PSSMCOOL"),8)
This function takes two PSSM files as argument and uses FPSSM function for making feature vector of length 100 correspond to this pair of proteins.
FPSSM2(pssm_name1, pssm_name2, hk)
FPSSM2(pssm_name1, pssm_name2, hk)
pssm_name1 |
The name of first PSSM Matrix file |
pssm_name2 |
The name of second PSSM Matrix file |
hk |
a parameter that indicates which amino acid alphabet must be used |
Feature vector of length 100
Zahiri, J., et al. (2013) PPIevo: protein-protein interaction prediction from PSSM based evolutionary information, Genomics, 102, 237-242.
s1<-system.file("extdata","C7GQS7.txt.pssm",package="PSSMCOOL") s2<-system.file("extdata","C7GRQ3.txt.pssm",package="PSSMCOOL") s<-FPSSM2(s1,s2,8)
s1<-system.file("extdata","C7GQS7.txt.pssm",package="PSSMCOOL") s2<-system.file("extdata","C7GRQ3.txt.pssm",package="PSSMCOOL") s<-FPSSM2(s1,s2,8)
This function produces a feature vector of length 100 which the first 20 components of this vector is the normalized occurrence frequency of the native amino acids in the protein. the next 20 components are mean of 20 PSSM columns and grey system model approach as elaborated in (Min et al. 2013) is used to define the next 60 components.
grey_pssm_pseAAC(pssm_name)
grey_pssm_pseAAC(pssm_name)
pssm_name |
name of PSSM matrix file |
feature vector of length 100
J.-L. Min, X. Xiao, and K.-C. J. B. r. i. Chou, "iEzy-Drug: A web server for identifying the interaction between enzymes and drugs in cellular networking," vol. 2013, 2013.
X. Xiao, M. Hui, and Z. J. T. J. o. m. b. Liu, "IAFP-Ense: an ensemble classifier for identifying antifreeze protein by incorporating grey model and PSSM into PseAAC," vol. 249, no. 6, pp. 845-854, 2016.
M. Kabir et al., "Improving prediction of extracellular matrix proteins using evolutionary information via a grey system model and asymmetric under-sampling technique," vol. 174, pp. 22-32, 2018.
as<-grey_pssm_pseAAC(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
as<-grey_pssm_pseAAC(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
This function produces all possible 2-mers or 3-mers by counting paths of length 2 or 3 in a dataframe which is thought as a graph
k_mers(s, h)
k_mers(s, h)
s |
a dataframe with 2 columns |
h |
is length of k-mer |
all k-mers by counting paths of length h in dataframe which is considered as a graph
s1<-LETTERS[1:4] s2<-LETTERS[3:6] s<-data.frame(s1,s2) dc<-k_mers(s,3)
s1<-LETTERS[1:4] s2<-LETTERS[3:6] s<-data.frame(s1,s2) dc<-k_mers(s,3)
This feature is almost identical to the DPC_PSSM
feature, and in fact the DPC feature is part of
this feature (for k=1) and for two different columns, considers rows that differ by the size of the unit k.
k_seperated_bigrame(pssm_name, k)
k_seperated_bigrame(pssm_name, k)
pssm_name |
is name of PSSM Matrix file |
k |
a parameter that specifies separated length between amino acids |
a feature vector of length 400
Saini, H., et al.(2016) Protein Fold Recognition Using Genetic Algorithm Optimized Voting Scheme and Profile Bigram.
w<-k_seperated_bigrame(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),5)
w<-k_seperated_bigrame(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),5)
For product of this feature vector similar to smoothed_PSSM feature, firstly PSSM Matrix is smoothed by appending zero vectors to its head and tail and utilizing sliding window of size odd, then this smoothed PSSM Matrix is condensed by the Kidera factors to produce feature vector for each residue.
kiderafactor(pssm_name, v = NULL)
kiderafactor(pssm_name, v = NULL)
pssm_name |
name of PSSM Matrix file |
v |
vector of amino acids positions which we want to produce feature vector for them. |
matrix of feature vectors
C. Fang, T. Noguchi, H. J. I. j. o. d. m. Yamana, and bioinformatics, "Condensing position-specific scoring matrixs by the Kidera factors for ligand-binding site prediction," vol. 12, no. 1, pp. 70-84, 2015.
w<-kiderafactor(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),c(2,3,8,9))
w<-kiderafactor(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),c(2,3,8,9))
This function uses Linear predictive coding algorithm for each column of PSSM Matrix
. so in this script lpc
function is used which produces a 14-dimensional
vector for each column, since PSSM has 20 column eventually it will be obtained a 20*14=280 dimensional
feature vector for each PSSM Matrix by this function.
LPC_PSSM(pssm_name)
LPC_PSSM(pssm_name)
pssm_name |
name of PSSM Matrix file |
feature vector of length 280
L. Li et al., "PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations," vol. 9, no. 3, 2014.
w<-LPC_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
w<-LPC_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
In this function three different autocorrelation descriptors based on PSSM are adopted, which include: normalized Moreau-Broto autocorrelation, Moran autocorrelation and Geary autocorrelation descriptors.Autocorrelation descriptor is a powerful statistical tool and defined based on the distribution of amino acid properties along the sequence, which measures the correlation between two residues separated by a distance of d in terms of their evolution scores.
MBMGACPSSM(pssm_name)
MBMGACPSSM(pssm_name)
pssm_name |
name of PSSM Matrix file |
feature vector of length 560
Y. Liang, S. Liu, S. J. M. C. i. M. Zhang, and i. C. Chemistry, "Prediction of protein structural class based on different autocorrelation descriptors of position-specific scoring matrix," vol. 73, no. 3, pp. 765-784, 2015.
w<-MBMGACPSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
w<-MBMGACPSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
This feature vector is combination of feature vector and vector of
correlation factors correspond to 20 columns in PSSM Matrix.
actually is mean of PSSM Matrix
columns of length 20.
pse_pssm(pssm_name, g = 15)
pse_pssm(pssm_name, g = 15)
pssm_name |
is the name of PSSM matrix file |
g |
a parameter Which its size corresponds to the database used. |
feature vector of length 20+20*g
D.-J. Yu et al., "Learning protein multi-view features in complex space," vol. 44, no. 5, pp. 1365-1379, 2013.
Chou, K.C. and Shen, H.B. (2007) MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM, Biochemical and Biophysical Research Communications, 360, 339-345.
v<-pse_pssm(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
v<-pse_pssm(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
The AC variable measures the correlation of the same property between two residues separated by a distance of lg along the sequence
pssm_ac(pssm_name, lg = 18)
pssm_ac(pssm_name, lg = 18)
pssm_name |
name of the PSSM Matrix file |
lg |
a parameter which indicates distance between two residues |
feature vector which its length depends on parameter lg
in use of this function The lg parameter must be less than the length of the smallest sequence in the database.
Dong, Q., Zhou, S. and Guan, J. (2009) A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation, Bioinformatics, 25, 2655-2662.
q<-pssm_ac(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),17)
q<-pssm_ac(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),17)
The CC variable measures the correlation of two different properties between two residues separated by a distance of t along the sequence.
pssm_cc(pssm_name, t)
pssm_cc(pssm_name, t)
pssm_name |
name of PSSM Matrix file |
t |
shortest protein length in dataset minus one |
feature vector of length 380
Dong, Q., Zhou, S. and Guan, J. (2009) A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation, Bioinformatics, 25, 2655-2662.
aa<-pssm_cc(system.file("extdata","C7GQS7.txt.pssm",package="PSSMCOOL"),18)
aa<-pssm_cc(system.file("extdata","C7GQS7.txt.pssm",package="PSSMCOOL"),18)
In this feature, by considering a specific column, at first sum of all components in this column is denoted by "L", then starting from the first row in this column, the components are added together to reaching a value less than or equal to 25 is calculated and stored . In the next step, the same work is done starting from the first row to reaching a value less than or equal to 50 is done started from the last row, To reaching 25 By appending these saved numbers together for each column, a vector of length 4 is obtained. If this is done for all the columns and the obtained vectors are connected to each other, for each protein, a feature vector of length 80 is obtained which its name is PSSM-SD.
PSSM_SD(pssm_name)
PSSM_SD(pssm_name)
pssm_name |
name of PSSM Matrix file |
feature vector of length 80
A. Dehzangi, K. Paliwal, J. Lyons, A. Sharma, A. J. I. A. T. o. C. B. Sattar, and Bioinformatics, "A segmentation-based method to extract structural and evolutionary features for protein fold recognition," vol. 11, no. 3, pp. 510-519, 2014.
ww<-PSSM_SD(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
ww<-PSSM_SD(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
This feature vector uses PSSM-SD to produce Segmented Auto Covariance Features.
pssm_seg(pssm_name, m = 4)
pssm_seg(pssm_name, m = 4)
pssm_name |
name of PSSM Matrix file |
m |
a parameter between 1 and 11 |
feature vector of length 100
A. Dehzangi, K. Paliwal, J. Lyons, A. Sharma, A. J. I. A. T. o. C. B. Sattar, and Bioinformatics, "A segmentation-based method to extract structural and evolutionary features for protein fold recognition," vol. 11, no. 3, pp. 510-519, 2014.
q<-pssm_seg(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),3)
q<-pssm_seg(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),3)
This function firstly normalizes PSSM Matrix by formula:
then for any standard amino acid specifies its
position in protein sequence whereby a sub-matrix from PSSM corresponding
to these positions will be extracted, then for this sub-matrix computes
colSums
of its columns to create a vector of length 20, eventually a feature vector of
length 400 will be obtained.
pssm400(pssm_name)
pssm400(pssm_name)
pssm_name |
name of PSSM Matrix file |
feature vector of length 400
if a specific amino acid did not exist in protein then colSums
of whole
PSSM is computed.
q<-pssm400(system.file("extdata","C7GQS7.txt.pssm",package="PSSMCOOL"))
q<-pssm400(system.file("extdata","C7GQS7.txt.pssm",package="PSSMCOOL"))
This feature, which stands for auto covariance transformation, for jth column calculates the average of this column, and then subtracts the resulting number from the elements on the i and (i + g)th rows of this column, and finally multiplies them. by changing the variable i from 1 to L-g, it calculates the sum of these, since the variable j changes between 1 and 20, and the variable g between 1 and 10 eventually a feature vector of length 200 will be obtained.
PSSMAC(pssm_name)
PSSMAC(pssm_name)
pssm_name |
name of PSSM Matrix files |
feature vector of length 200
L. Zou, C. Nan, and F. J. B. Hu, "Accurate prediction of bacterial type IV secreted effectors using amino acid composition and PSSM profiles," vol. 29, no. 24, pp. 3135-3142, 2013.
w<-PSSMAC(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
w<-PSSMAC(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
In this feature at first PSSM Matrix is divided to Blocks based on Number N which user imports. Then for each Block mean of columns is computed to get 20-dimensional vector, eventually by appending these vectors to each other final feature vector is obtained.
PSSMBLOCK(pssm_name, N)
PSSMBLOCK(pssm_name, N)
pssm_name |
neme of PSSM Matrix file |
N |
number of blocks |
feature vector that it's length depends on parameter N
J.-Y. An, L. Zhang, Y. Zhou, Y.-J. Zhao, and D.-F. J. J. o. c. Wang, "Computational methods using weighed-extreme learning machine to predict protein self-interactions with protein evolutionary information," vol. 9, no. 1, p. 47, 2017.
as<-PSSMBLOCK(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),5)
as<-PSSMBLOCK(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),5)
To obtain this feature, first the columns of the PSSM matrix are merged to obtain an L*10 matrix. Then, with a relationship similar to the auto covariance transformation feature, this feature with a length of 110 is obtained from this matrix.
rpssm(pssm_name)
rpssm(pssm_name)
pssm_name |
name of PSSM Matrix file |
feature vector of length 110
Ding, S., et al. (2014) A protein structural classes prediction method based on predicted secondary structure and PSI-BLAST profile, Biochimie, 97, 60-65.
w<-rpssm(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
w<-rpssm(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
This function gets a PSSM Matrix as input and extracts corresponding protein and consunsus sequence from it. By placing these two vectors next to each other, a dataframe is created. In each row of this dataframe, each component is connected to the next row components by an arrow. so a directed graph is produced. next by use of previous functions a feature vector of length 400 or 8000 is created.
scsh2(pssm_name, k)
scsh2(pssm_name, k)
pssm_name |
is name of PSSM Matrix file |
k |
a parameter indicates length of k-mer |
feature vector of length 400 or 8000
Zahiri, J., et al., LocFuse: human protein–protein interaction prediction via classifier fusion using protein localization information. Genomics, 2014. 104(6): p. 496-503.
zz<- scsh2(system.file("extdata","C7GRQ3.txt.pssm",package="PSSMCOOL"),2)
zz<- scsh2(system.file("extdata","C7GRQ3.txt.pssm",package="PSSMCOOL"),2)
This descriptor is a variant of the Average Block descriptor and is designed to group together rows related to the same amino acid, thus considering domains of a sequence with similar conservation rates.
single_Average(pssm_name)
single_Average(pssm_name)
pssm_name |
name of PSSM Matrix file |
feature vector of length 400
L. Nanni, A. Lumini, and S. J. T. S. W. J. Brahnam, "An empirical study of different approaches for protein classification," vol. 2014, 2014.
w<-single_Average(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
w<-single_Average(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
In this function at first a Matrix called smoothed-PSSM is constructed from PSSM Matrix by applying "ws" parameter which called sliding window size and taken from user and usually is equals to 7. Then using other window size parameter "w" which usually equals to 11 at each position smoothed feature vector is constructed.
smoothed_PSSM(pssm_name, ws, w, v = NULL)
smoothed_PSSM(pssm_name, ws, w, v = NULL)
pssm_name |
name of PSSM Matrix file |
ws |
window size for smoothing PSSM Matrix |
w |
window size for extracting feature vector |
v |
vector of desired positions to extract their features |
In the construction of a smoothed PSSM, each row vector of a residue is represented and
smoothed by the summation of ws surrounding row vectors
For the N-terminal and C-terminal of a protein, (w-1)/2 ZERO vectors, are appended to the
head or tail of a smoothed PSSM profile. Using the smoothed PSSM encoding scheme the feature vector of a residue
is represented by
The feature values in each vector are normalized to a range between -1 and 1.
a matrix of feature vectors
Cheng, C.W., et al. (2008) Predicting RNA-binding sites of proteins using support vector machines and evolutionary information, BMC Bioinformatics, 9 Suppl 12, S6.
w<-smoothed_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),7,11,c(2,3,8,9))
w<-smoothed_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"),7,11,c(2,3,8,9))
In this function each column can be viewed as a stochastic time series, and each PSSM contains 20 columns, in other words, each PSSM contains 20 stochastic time series and Second-order moving average (SOMA) algorithm is applied to these columns to extract SOMA PSSM feature vector.
SOMA_PSSM(pssm_name)
SOMA_PSSM(pssm_name)
pssm_name |
name of PSSM file |
feature vector of length 160
Y. Liang, S. J. J. o. M. G. Zhang, and Modelling, "Predict protein structural class by incorporating two different modes of evolutionary information into Chou's general pseudo amino acid composition," vol. 78, pp. 110-117, 2017.
w<-SOMA_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
w<-SOMA_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
Singular value decomposition is a general purpose matrix factorization approach
that has many useful applications in signal processing and statistics. In this function SVD is
applied to a matrix representation of a protein with the aim of reducing its dimensionality
Given an input matrix Mat with dimensions N*M SVD is used to calculate its factorization
of the form: where
is a diagonal matrix whose diagonal
entries are known as the singular values of Mat. The resulting descriptor is the ordered
set of singular values:
where L=min(M,N).
and here
svd
function is used for this purpose.
SVD_PSSM(pssm_name)
SVD_PSSM(pssm_name)
pssm_name |
name of PSSM Matrix file |
feature vector of length 20
L. Nanni, A. Lumini, and S. J. T. S. W. J. Brahnam, "An empirical study of different approaches for protein classification," vol. 2014, 2014.
w<-SVD_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
w<-SVD_PSSM(system.file("extdata", "C7GQS7.txt.pssm", package="PSSMCOOL"))
This function produces all possible k-mers from 20 amino acids for use in other functions.
three_mer(k)
three_mer(k)
k |
is length of k-mer which user imports |
a matrix which its first row includes all k-mers
ax<-three_mer(3)
ax<-three_mer(3)
This feature vector is 8000-dimentional feature vector wich is computed from tri-gram probability matrix T obtained from PSSM Matrix.to achieve this purpose elements in three successive rows and arbitrary columns are multiplied together then these results are added together by changing variable i from 1 to L-1, which i is counter of row and L indicates protein length. since there are 20 columns thus final feature vector would be of length 8000.
trigrame_pssm(pssm_name)
trigrame_pssm(pssm_name)
pssm_name |
name of PSSM Matrix file |
feature vector of lenght 8000
Paliwal, K.K., et al. (2014) A tri-gram based feature extraction technique using linear probabilities of position specific scoring matrix for protein fold recognition, IEEE transactions on nanobioscience, 13, 44-50
as<-trigrame_pssm(paste0(system.file("extdata",package="PSSMCOOL"),"/C7GSI6.txt.pssm"))
as<-trigrame_pssm(paste0(system.file("extdata",package="PSSMCOOL"),"/C7GSI6.txt.pssm"))