1 Introduction
Discovering potential protein biomarkers with clinical values is regularly a long, costly, and challenging process [Rifai et al., 2006]. Each phase of the study (i.e., discovery, qualification, verification, and validation) costs substantial amount of time and resources. In this paper we develop a statistical method to expedite the process. We aim for accurately identifying the truly relevant markers at the discovery phase, which can be directly entered for verification or even validation phrase. Figure 1 illustrates our objective.
Here we focus on proteomics studies. Because following a large number of genomics studies, scientists realized that there is not necessarily high correlation between genes and proteins profiles, which implies that proteomics provides important complementary information to genomics [Pandey et al., 2000]. Moreover, since most of the drugs are directed to interact with proteins, the protein markers are as informative as gene markers.
In proteomics, mass spectrometry is considered the most productive and reliable technique for protein quantification. In particular, dataindependent acquisition (DIA) [Carvalho et al., 2010, Gillet et al., 2012] has been proven to produce high quality and reproducible throughput. Hence DIA is a very effective technique for protein quantification in discovery studies. Currently there are a number of established univariate and multivariate methods being applied to identify potential biomarkers in mass spectrometry proteomics.
LIMMA [Smyth, 2004], SAM [Tusher et al., 2001], and Rank Products [Breitling et al., 2004]
are popular univariate methods. LIMMA fits linear models to each marker, then applies empirical Bayes method to form moderated tstatistics. The moderated tstatistics are used to rank and select the markers. SAM assigns a score to each marker based its “relative difference”. Markers with scores larger than a threshold value is selected. Rank Products is a nonparametric approach, based on ranks of fold changes. Markers with the smallest values are selected. Meanwhile Principal Component Analysis (PCA), Penalized Logistic Regression (PLR), Partial Least SquareDiscriminant Analysis (PLSDA), and Support Vector Machine (SVM) are wellknown multivariate methods regularly used in this field as well.
In discovery studies the sample size is small while the number of markers explored is large, owing to the high throughput of DIA. Nowadays there are typically thousands of protein markers quantified by DIA experiments, out of which usually only 1050 markers are truly relevant. Discovery studies encounter sparsity and high dimensionality issues. The above univariate methods are not effective since they often identify too many false positives due to sparsity issue, leading to an unacceptable high empirical false discovery rate (FDR). For example this is observed in multiple testing procedure. On the other hand the multivariate methods mentioned above are not effective either because of high dimensionality issue. They often identify too few of the truly relevant markers, leading to an unacceptable low sensitivity. The multivariate methods mentioned above have powerful model structures. Because they only need a small number of markers to achieve high classification accuracy, these methods stop identifying more relevant markers once they reach sufficiently accurate classification results. High FDR results in considerable waste of time and resource at the subsequent phases, while low sensitivity leads to many unidentified useful biomarkers. Therefore, there is a genuine need of a novel statistical method capable of accurately identifying the correct set of truly relevant protein markers at the discovery phase.
Here we develop a likelihood ratio based approach. In this paper, we consider the scenario where are two conditions, such as disease vs. normal. Since experiments at the initial discovery phase are mostly controlled, our method focuses on data from balanced design given two conditions. Two discriminant statistics are developed based on the likelihood ratio idea. One is for the sample size greater or equal to five per condition. The other one is for very small sample size, less than 5 per condition. The main idea behind our likelihood ratio approach is that the expression profiles for the irrelevant plain markers follow unimodal distributions, where those for the truly relevant markers follow bimodal distributions. The discriminant statistics are the likelihood ratios of a bimodal distribution versus a unimodal distribution, calculated for every protein. For larger sample size, the bimodal distribution is approximated by a mixture of normals, and the unimodal distribution is approximated by a univariate normal. For very small sample size, we use kernel method to estimate the distribution of the expression profiles, then compute the ratio of the kernel estimate versus a univariate normal. Our likelihood ratio statistics have large values for the truly relevant markers, and small values for the plain markers. Our proposed method is robust to deviations from normality as long as the null expression profiles are unimodal, even though the likelihood of normal distribution is used. Notice with the multiple ttests, the null and alternative hypotheses compare the means of two conditions. If the null hypothesis is rejected, the two conditions have different mean values. Essentially this indicates that marker does not follow a unimodal distribution. Our approach goes beyond simply considering the mean values. We explicitly estimate or approximate the likelihood functions under the null and alternative hypotheses. Hence our approach achieves a much more accurate result.
We compare our method with the aforementioned univariate and multivariate methods for mass spectrometrybased proteomics studies. Evaluations are performed with experimental datasets where truly relevant markers are known from controlled mixtures or previous literature, as well as simulated datasets. The paper is organized as follows. Section 2 describe our likelihood ratio based approach for precise protein biomarker discovery. Section 3 shows the experimental results for our approach compared with other methods. Section 4 concludes the paper.
2 A Likelihood Ratio Approach
In this paper we develop two discriminant statistics which accurately differentiate the true markers from the plain markers. We first examine the distributions of expression profiles of plain markers, as shown in Figure 2. These are proteins from HRM data, discussed in Section 3. Plain protein markers from other experimental datasets generally share this property, i.e., their expression profiles follow unimodal distributions. We also examine the distributions of expression profiles of the truly relevant markers, as shown in Figure 3. These are also proteins from HRM data. True protein markers from other experimental datasets generally follow a bimodal distribution too, given there are two conditions.
We discover that the distributions of expression profiles are usually unimodal for plain markers, and bimodal for truly relevant markers. In addition, we observe that the distributions of expression profile for truly relevant markers can be approximated by a mixtures of normals, whereas for plain markers it can be approximated by a univariate normal, given enough sample size to estimate the normal parameters. Without enough sample size, we apply kernel method to estimate the bimodal density.
2.1 A Discriminant Statistic for
Let and be the sample size for each condition, and be the total sample size. The mixture proportions of the two conditions are the sample proportions, . We develop a likelihood ratio based discriminant statistic for larger sample size, where , . It is the ratio of a mixture of normals’ density versus a univariate normal density. With relatively larger sample sizes, we approximate the bimodal distribution using a mixture of normals, and the unimodal distribution using a univariate normal. The likelihood ratio clearly differentiates the true markers from the plain markers.
Protein 1  …  Protein p  

Sample 1  …  
Sample  … 
Table 1 shows a sample data layout, with . Let and be the means of each normal distribution composing the mixture while and
are their respective variances. The mixing coefficients are
and . Let and be the mean and variance of the pooled univariate normal. The discriminant statistic is calculated for every protein. Assume there are proteins. The discriminant statistic for the th protein, , is calculated as follows.(1) 
Below is the likelihood of a mixture of normals.
(2) 
Below is the likelihood of the pooled normal.
(3) 
The maximum of the likelihood function is computed by plugging in the maximum likelihood estimates (MLE) of parameters to the likelihood function. Let , , , and be the MLEs. MLEs are obtained by maximizing the corresponding likelihood functions. The MLEs have closed form [Bilmes, 1998] as follows. We have the MLEs for the pooled normal as follows.
Let . . Let
The MLEs for the mixture of normals given the mixing coefficients and are the following, .
The discriminant statistic, , for the th protein is the following.
(4) 
Consequently, from (4) we can tell that is expected to have a large value if protein j is a truly relevant marker. Otherwise, is expected to be small.
2.2 Asymptotic Property of
Here we derive the asymptotic property of the statistic under the null hypothesis, which enables us to quantify the statistical significance on each protein provided sample size is sufficient. The discriminant statistic can be thought of as the statistic for testing whether the th protein truly has expression profiles correlate with conditions or not, with the following null and alternative hypotheses.
Hence we write the hypotheses in terms of notations as follows.
(5) 
The statistic for the th protein is specified in Equation 4. We note that is the likelihood ratio statistic for testing between full and reduced models. Therefore, the quantity is the deviance of the goodnessoffit test. As the asymptotic property of the deviance had been known, we know that the asymptotic null distribution of
is chisquare. Note that the degrees of freedom is the difference in dimension of parameter spaces between
and . In particular, the statistic has 2 degrees of freedom because its corresponding hypotheses in Equation (5) has 4 free parameters in the distribution under and only 2 in the distribution under . We have(6) 
Although the property of is asymptotic in theory, in practice it takes around 10 to 30 samples per condition to converge to chisquare, assuming a balanced design () as it is the most common case in discovery studies. Table 2
shows the empirical tail probability
, with , using simulated data discussed in Section 3.2. It demonstrates that the convergence of to is valid and efficient. In general, this convergence is satisfactory when there are at least 10 samples per condition (n10).3  0.22  0.30  0.33 

5  0.10  0.15  0.21 
8  0.04  0.09  0.16 
10  0.03  0.08  0.13 
15  0.02  0.07  0.11 
20  0.02  0.05  0.10 
30  0.01  0.05  0.10 
2.3 Alternative Statistic for Very Small Sample Size
When there are at least five samples per condition, the sample means and the sample variances are sufficiently accurate estimates. Consequently we obtain a reasonable likelihood function of a mixture normal to be used in the discriminant statistic in Section 2.1. However discovery studies usually face small sample sizes. Sometimes there are less than five sample per condition, which leads to inaccurate sample means and sample variances. When the sample size is too small, the statistic in Section 2.1 does not perform well. We notice that plain markers have unimodal expression file distributions is a strong identifying feature. Here we develop an alternative statistic which performs well for very small sample size, with
. The idea for very small sample size is to compare the estimated empirical distribution of a protein’s expression profile with a pooled univariate normal, where the samples from two conditions are combined. The null hypothesis is that the particular protein in question is plain. We use kernel density estimation method for approximating the empirical distribution of protein expression profiles, which provide results quite sensitive to nonunimodal patterns. Notice for larger sample size, the estimated empirical distribution can be simplified to an approximate mixture of normals.
For the th protein, we estimate its expression profile distribution’s density function using the normal kernel as follows.
(7) 
where that is the sample size of each condition; is the expression intensity as displayed in Table 1; K() is the density of standard normal; and is the optimal bandwidth for the th protein derived according to minimum mean integrated squared error as follows.
(8) 
where is the kernel density estimation from (7) with bandwidth of whilst is the (unknown) true density of expression profile.
As opposed to simply testing the parameters when the sample size is at least five per condition, as shown in (5), here we compare two distributions. For the th protein density function , let
be the corresponding cumulative distribution function (CDF). Under the alternative hypothesis
does not follow a unimodal distribution. Let be the CDF of the pooled univariate normal under the null hypothesis that the observed expressions are drawn from a single univariate normal distribution. The null and the alternative hypothesis are written as follows.We then use the KolmogorovSmirnov (KS) method for testing the hypotheses. KS test uses the maximum difference between the CDF of kernel density and that of univariate normal, both evaluated at each observed expression, as the test statistic.
(9) 
Massey [Massey Jr, 1951] claimed that KS test is superior to the chisquare test especially in cases of small sample size. Since the normal CDF in hypotheses (2.3) is not specified in advance (i.e., the mean and variance are estimated from the data), the significance level and pvalue of KS test should be derived carefully. Hence, we apply the method developed in Lilliefors [Lilliefors, 1967] to obtain the pvalue as mean and variance of univariate normal are not specified in advance but estimated from the sample.
3 Experimental Results
3.1 Experimental Datasets
The following two experimental datasets are used to evaluate the performance of our approach in Section 3. In this section, we use two criteria to compare different approaches. One is sensitivity, which is the ratio of the number of truly relevant markers among the selected markers vs the actual number of truly relevant markers. The other is FDR, which is the ratio of false positives among the selected markers vs the number of selected markers. The truly relevant protein markers are known in the two datasets. Hence we are able to compare the sensitivity and FDR of different approaches including ours.
Hyper Reaction Monitoring (HRM) Data:
Hyper Reaction Monitoring (HRM) is a novel DIA technique proposed by [Bruderer et al., 2015] where spectral libraries were based on normalized retention time for data extraction. The main contribution of [Bruderer et al., 2015] is to show the capability of their DIA technique. Compared with DDA technique, they demonstrated the increased capability that DIA technique brings to protein biomarker discovery. In order to show that HRM is superior to the established oldgeneration DDA technique, they conducted experiment where 3785 proteins were quantified and 12 of them were known to be correlated with conditions. This is a typical setup for discovery study in that there are 8 conditions with 3 samples in each condition. Later in our experiments, we merge the samples from similar conditions so that there are 2 conditions with 12 samples each.
Yeast Data:
SWATHMS [Selevsek et al., 2015] was invented as an innovative DIA experiment, which combined the strength of DIA and target data analysis for more throughput with high accuracy. In the interest of proving the competence of SWATHMS they quantified protein markers on Saccharomyces cerevisiae with osmotic stress imposed over time. There are 6 time points, regarded as conditions, and 3 samples in each condition. In total, 2602 proteins were quantified in which 50 were known to be regulated in response to osmotic stress. In addition, proteins responding to osmatic stress are mainly associated with arbohydrate and amino acid metabolism. Therefore, the ability SWATHMS provides to recover known biomarkers shows that DIA data has promising potential to assist in discovering relevant protein markers.
This experimental data is another typical discovery study, where many markers are explored while sample size is limited. In our experiments, we again merge the samples from the first 3 time points, and those from the last 3 time points. We then have 2 condition where there are 9 samples each. Merging the samples is reasonable here since the change of protein expressions, if any, happens gradually as the experiment progressed.
3.2 Simulated Datasets
We also simulate a large number of datasets with different configurations to evaluate our approach. In general, in each simulation run 1000 protein markers are generated, and the expression profiles of each protein are simulated according to the following steps. Note that the parameters we use are the estimated values of those from the data sets acquired and processed by OpenSWATH [Röst et al., 2014], where proteins were under the matrices of water and human. Water is the simplest matrix among them, with clean signals, whereas human is the most complex under considerable interference. Simulated data points are denoted as , which is the expression intensity for the th protein of the th subject within the th condition. An example of simulated data layout is shown in Table 1.

Step 1. Generate the expression intensity according to the following equation ([Röst et al., 2014]).

Step 2. In the equation above stands for the main effect of a condition, the effect of each subject within a condition, and the main effect of each protein. We then generate the source of variations for the true markers and plain markers as follows. We simulate 2 conditions, .
For true markers . For plain markers . Then the other items in the above equation are simulated for normal distributions as follows.

Step 3. Set the parameter values according to [Röst et al., 2014]. We have , , and . The sample size for each condition and the number of proteins vary in different simulation runs. The error term ’s variance depends on the matrices. We have for human background, the most complex matrix; for water background, the matrix with the smallest level of interference.
3.3 Robust Performance Against Deviations from Normality
Although the two statistics in Section 2.1 are constructed as a ratio of a mixture distribution or the empirical distribution versus a pooled univariate normal, they show robust performance with nonnormal distributions too. Figure 2 shows the empirical expression profile distributions of plain markers in HRM data. The most important feature of plain markers is that their expression profile distributions are unimodal. Hence the fundamental idea behind our approach is to separate bimodal distributions from unimodal distributions, which lead to accurate identification of true markers. The underlying actual distributions do not need to strictly follow normality assumptions for our approach to perform well. Here we demonstrate the robust performance of the two statistics against strong deviations from normality.
In one experiment, we have 10 samples per condition. We simulate 1000 proteins as specified in Section 3.2
but with different error term distributions, where there are 30 true markers and 970 plain markers. The error terms are simulated according to Cauchy distribution with location parameter 15 and scale parameter 2 (Figure
4) and chisquare distribution with two degrees of freedom (Figure
5). Cauchy distribution has much heavier tails than normal and chisquare with two degrees of freedom is very skewed distribution. Plain markers have
whereis randomly sampled following a uniform distribution over
. The effect size is 3 fold for true markers. The fold change is defined on log base 2 scale. True markers have and where is randomly sampled from a uniform distribution over . The resulting expression profile distributions for both plain markers and true markers are strongly nonnormal.The proteins are ordered by their values. The largest 100 proteins are shown in Figures 4 and 5. The ordered values of the statistic , which works well for nottoosmall sample size, show a sudden drop after the largest values. We choose the cutoff for at where the drop occurs. This pattern is also observed from the two experimental datasets in Section 3.4. We select the markers with values larger than the cutoff. Based on this selection method, we are able to accurately identify 1) the number of truly relevant markers; 2) the relevant markers themselves.
In another experiment we simulate 1000 proteins as in the first experiment, only there are 3 samples per conditions. The statistic for very small sample size is used. Figures 6 and 7 show the order values of , the 100 largest ones, which exhibit the same pattern. With very small sample size, we are still able to accurately identify the relevant markers.
3.4 High Sensitivity and Low FDR
HRM data  Yeast data  
Sensitivity  Empirical FDR  Sensitivity  Empirical FDR  
LIMMA  
RP  
SAM  
ttest  
PCA  
PLR  
PLSDA  
SVM  
Our 
We apply the statistic to the two experimental datasets, Yeast data and HRM data as discussed earlier. Figures 8 and 9 show the ordered values of . Again we observe a sudden drop of values. We set the cutoff of to be 2.5. We then select the proteins that have values larger than cutoff. In Table 3 we compare the commonly used univariate and multivariate methods with our statistic based on sensitivity and FDR. For univariate methods FDR cutoff is set at 0.05, while for multivariate methods the algorithms stop based on their default stopping criteria. The univariate methods select too many markers, and suffer from high empirical FDR. The multivariate methods have strong model structures. They do not need all the relevant markers to achieve high classification accuracy. Hence they select too few markers, leading to very low sensitivity. Our approach, to cut off at where values suddenly drop, returns superior sensitivity with very low FDR.
Next we compare our approach with the commonly used methods in simulation studies. In a single run we simulate 1000 proteins as specified in Section 3.2 with the error terms normally distributed, where there are 30 true markers and 970 plain markers. Again plain markers have where is randomly sampled following a uniform distribution over . The effect size is 3 fold for true markers. True markers have and where is randomly sampled following a uniform distribution over . In one experiment we have 10 samples per condition and conduct 50 runs, with the average sensitivity and the average FDR shown in Table 4. In another experiment we have 3 samples per condition and conduct 50 runs, with the average sensitivity and the average FDR shown in Table 5.
Water  Human  
Sensitivity  Empirical FDR  Sensitivity  Empirical FDR  
LIMMA  
RP  
SAM  
ttest  
PCA  
PLR  
PLSDA  
SVM  
Our 
Water  Human  
Sensitivity  Empirical FDR  Sensitivity  Empirical FDR  
LIMMA  
RP  
SAM  
ttest  
PCA  
PLR  
PLSDA  
SVM  
Our 
4 Conclusion and Discussion
Our approach is superior to the popularly applied methods univariate and multivariate methods given reasonable sample size. Our approach shows good performance for very small sample size, especially under water background. In addition, we notice that nowadays most of the experimental designs have allowed us to merge samples from adjacent conditions (groups). At the end of the day we may not have to rely on the alternative procedure for very small sample size cases.
Meanwhile, as the primary goal of this paper is to accurately identify relevant biomarkers in discovery study, which is mostly balanced design with moderate or small sample size, the method we propose is mainly for the balanced design given two conditions. Thus how to handle the data acquired from highly unbalanced design under multiple conditions (more than two) with potentially very small sample size for certain conditions is part of our future work.
The nature of discovery study combined with the advancement of DIA experiment result in sparse and high dimensional data. It is known that popular univariate methods struggle under sparsity and multivariate methods stumble under high dimensionality.
Nonetheless, our proposed likelihoodbased statistic resolves the problem from a critical perspective – accurately estimates the number of truly relevant markers. This feature can effectively resolves the aforesaid drawbacks from current methods.
From multiple experimental and simulated data sets we demonstrate that our approach is able to accurately identify the set of truly relevant protein markers. The implication is that our proposed method can simplify the unnecessarily lengthy and costly process of biomarker discovery without compromising the findings of markers correlating with conditions.
References
 [Rifai et al., 2006] Rifai, N., Gillette, M.A. and Carr, S.A., 2006. Protein biomarker discovery and validation: the long and uncertain path to clinical utility, Nature biotechnology, 24(8), 971983.
 [Pandey et al., 2000] Pandey, A. and Mann, M., 2000. Proteomics to study genes and genomes. Nature, 405(6788), 837846.
 [Carvalho et al., 2010] Carvalho, P.C., Han, X., Xu, T., Cociorva, D., Carvalho, M.D.G., Barbosa, V.C. and Yates III, J.R., 2010. XDIA: improving on the labelfree dataindependent analysis. Bioinformatics, 26(6), 847848.
 [Gillet et al., 2012] Gillet, L.C., Navarro, P., Tate, S., Röst, H., Selevsek, N., Reiter, L., Bonner, R. and Aebersold, R., 2012. Targeted data extraction of the MS/MS spectra generated by dataindependent acquisition: a new concept for consistent and accurate proteome analysis. Molecular & Cellular Proteomics, 11(6), O111016717.
 [Smyth, 2004] Smyth, G. K., 2004. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol, 3(1), 3.
 [Tusher et al., 2001] Tusher, V.G., Tibshirani, R. and Chu, G., 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98(9), 51165121.
 [Breitling et al., 2004] Breitling, R., Armengaud, P., Amtmann, A. and Herzyk, P., 2004. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS letters, 573(13), 8392
 [Bruderer et al., 2015] Bruderer, R., Bernhardt, O.M., Gandhi, T., Miladinović, S.M., Cheng, L.Y., Messner, S., Ehrenberger, T., Zanotelli, V., Butscheid, Y., Escher, C. and Vitek, O., 2015. Extending the limits of quantitative proteome profiling with dataindependent acquisition and application to acetaminophentreated threedimensional liver microtissues. Molecular & Cellular Proteomics, 14(5), 14001410.
 [Selevsek et al., 2015] Selevsek, N., Chang, C.Y., Gillet, L.C., Navarro, P., Bernhardt, O.M., Reiter, L., Cheng, L.Y., Vitek, O. and Aebersold, R., 2015. Reproducible and consistent quantification of the Saccharomyces cerevisiae proteome by SWATHmass spectrometry. Molecular & Cellular Proteomics, 14(3), 739749.
 [Röst et al., 2014] Röst, H.L., Rosenberger, G., Navarro, P., Gillet, L., Miladinović, S.M., Schubert, O.T., Wolski, W., Collins, B.C., Malmström, J., Malmström, L. and Aebersold, R., 2014. OpenSWATH enables automated, targeted analysis of dataindependent acquisition MS data. Nature biotechnology, 32(3), 219223.

[Bilmes, 1998]
Bilmes, J.A., 1998. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models.
International Computer Science Institute, 4(510), 126.  [Massey Jr, 1951] Massey Jr, F. J., 1951. The KolmogorovSmirnov test for goodness of fit. Journal of the American statistical Association, 46(253), 6878.
 [Lilliefors, 1967] Lilliefors, H. W., 1967. On the KolmogorovSmirnov test for normality with mean and variance unknown. Journal of the American statistical Association, 62(318), 399402.
Comments
There are no comments yet.