If you don't remember your password, you can reset it by entering your email address and clicking the Reset Password button. You will then receive an email that contains a secure link for resetting your password
If the address matches a valid account an email will be sent to __email__ with instructions for resetting your password
Department of Health Technology, Section for Bioinformatics, Technical University of Denmark, DTU, 2800 Kgs. Lyngby, DenmarkInstituto de Investigaciones Biotecnológicas, Universidad Nacional de San Martín, CP1650 San Martín, Argentina
Post-translational modifications of proteins play a crucial part in carcinogenesis. Phosphorylated peptides have shown to be presented by MHC class I molecules and recognised by cytotoxic T cells, making them a promising target for immunotherapy. Identification of phosphorylated MHC class I ligands has so far predominantly been done using bioinformatic tools trained on unmodified peptides. Only one tool, PhosMHCpred, has been developed specifically for the prediction of phosphorylated MHC class I ligands so far and this tool has been trained only on a limited number of alleles and provides a limited peptide length coverage (only including 9-mers).
Here we propose a method, termed NetMHCphosPan, for the prediction of MHC presented phosphopeptides. The method is trained using the NNAlign_MA framework, which allows incorporating mixed data types and information leverage between data sets resulting in a greatly improved MHC and peptide length coverage and an overall increased predictive power compared to PhosMHCpred. Motif deconvolution suggested a strong preference for phosphosites to be located in position 4 of the binding motif, and enrichment of proline at P5 and arginine at P1. The improved performance, driven by the extended length and allelic coverage, of NetMHCphosPan over current state-of-the-art methods, was further validated on a large benchmark data set independent from the model development.
In conclusion, we have confirmed the high power of NNAlign_MA for motif deconvolution of complex immuno-peptidomics data and have developed a novel method for prediction of MHC presented phosphopeptides with improved predictive power and a broader peptide length and MHC coverage compared to current state-of-the-art methods. The developed method is available at http://www.cbs.dtu.dk/services/NetMHCphosPan-1.0.
Major histocompatibility complex (MHC) class I molecules are the cells' gateway to present peptides derived from endogenous protein degradation to cytotoxic T cells (CTLs). Proteins in the cytosol are degraded to peptides by the proteasome after which they are transported into the endoplasmic reticulum by the transporter associated with antigen processing (TAP) complex. Here, after optional further processing by Endoplasmic Reticulum Aminopeptidases 1 and 2 (ERAP1 and 2), the peptides are loaded onto a MHC class I molecule and transported to the cell surface for antigen presentation to CTLs [
]. Each of these alleles is associated with a set of peptides matching the binding motif of that particular MHC class I molecule. Previously, the focus has been on determining the binding motif of unmodified peptides to their MHC class I molecule. In fact, multiple algorithms have been developed to predict this binding event (reviewed in [
Post-translational modifications (PTMs), particularly phosphorylations, have been shown to play a crucial part in carcinogenesis. Studies have shown that phosphorylated peptides can go through the endogenous pathway, bind to MHC class I molecules [
]. Predictions of phosphorylated peptides binding motif to MHC class I molecules have so far predominantly been done with tools developed for prediction of unmodified peptides, where the phosphosite has been exchanged with the unmodified amino acid or an "X" for "unknown" amino acid. These studies have found three tendencies in the binding motifs; the phosphosite was in most cases placed at position 4 (P4) and in several cases followed by proline at P5. In addition, arginine was observed enriched in multiple motifs at P1 [
]. These early findings were based on predictions without using the explicit information from the phosphosite. Solleder et al. developed a new prediction tool, which integrates MixMHCp, a deconvolution tool with an extended amino acid dictionary to allow for phosphosites. Using this tool, they confirmed the three tendencies earlier described [
]. Furthermore, they proposed the proline enrichment at P5 and arginine at P1 to be the result of kinase phosphorylation motifs and established an enrichment of phosphorylated peptides among the HLA-C ligands, and serine to be the most abundant phosphorylated amino acid.
In this study, we confirmed and extended these earlier results. Using the motif deconvolution method, NNAlign_MA [
] on the data from Solleder et al. allowed us to expand HLA allele and length coverage, and develop an improved model for prediction of HLA class I presentation of phospho-ligands. The developed model was tested and compared to current state-of-the-art methods on an independent data set. This dataset was constructed using both previously published phospho-peptide datasets and novel data generated specifically for this study.
Materials and methods
Training data sets
The phosphopeptide data sets used for training were extracted from the paper by Solleder et al. (2020)  where 61 phosphopeptide data sets had been collected from 9 previously published articles, and 2 additional datasets were extracted from Alpizar et al. [
]. All phosphopeptides data sets were multi-allelic, i.e. obtained from cell lines expressing two or more HLA alleles. A great amount of redundancy was present in the phosphopeptides. Removing this redundancy and excluding phosphopeptides with a length above 13 amino acids [
] resulted in a total of 3,653 phosphopeptides each unique for their respective cell lines.
Random natural phosphopeptides were added as negatives for each phosphopeptide length in each of the 63 cell line data sets with phosphorylated ligands. These peptides were sampled from the phosphorylated version of the human proteome downloaded from PhosphositeSitePlus, a database of mass spectrometry (MS) experimentally found phosphosites [
]. More than 97% of MHC presented phosphopeptides contained only a single phospho amino acid per peptide ligand. Therefore, to further mimic the distribution of phospho amino acids in the natural MHC presented phosphopeptides, this phosphosite distribution was reproduced for the random negatives. Negatives for each length [
]. This ensured a flat distribution of random negatives allowing the model to learn the true length distribution of phosphopeptides. In total, the phosphopeptide training set consisted of 3,653 positives and 58,229 negatives.
Next, the phosphopeptide data set was combined with single allele (SA) data from the NetMHCpan-4.1 publication [
]. This consisted of binding affinity data (BA) (177,400 data points covering 153 different MHC molecules), single-allele eluted ligand (SA-EL) data (45,276 positive data points covering 55 different MHC molecules). Also here, the data were enriched with 5 times random natural negatives as described above for the phosphopeptide data sets.
After the merge of positive and negative data, the phosphosites were translated to a regular STY amino acid encoding to cluster together phosphopeptides and regular peptides with the same amino acid sequence. A common motif Hobohm 1 clustering with length 8 amino acids was used to create five partitions for cross-validation from the combined data including both BA and EL data as described earlier [
] using five-fold cross-validation and early stopping with up to 200 minimization cycles, an optimal motif length of 9, an SA burn-in period of 20 iterations, and 200,000 samples per training cycle. Models were trained using 56 or 66 hidden neurons and five random network initializations for each architecture resulting in 10 ensemble models for each cross-validation step. Peptides were represented using sparse sequence encoding with an extended amino acid alphabet: ARNDCQEGHILKMFPSTWYVsty, where “sty” represents the phosphorylated amino acids pS, pT, and pY as described above.
Benchmark data set
An independent benchmark data set was extracted from previous publications including C1R-A1 (HLA-A*01:01), C1R-A2 (HLA-A*02:01), C1R-A24 (HLA-A*24:02) and GR (HLA-A*01:01, HLA-A*03:01, HLA-B*07:02, HLA-B*27:05, HLA-C*02:02, and HLA-C*07:02) [
], combined with an in-house data set (see below) obtained from two multiallelic cell lines DoTc2 (HLA-A*03:01, HLA-B*55:01, and HLA-C*03:03) and SiHa (HLA-A*24:02, HLA-B*40:02, and HLA-C*03:04). For all data sets, peptides sharing an overlap with 8 or more residues to a peptide in the training data were removed resulting in a benchmark with 225 phosphopeptides (available at https://github.com/mnielLab/NetMHCphosPan-1.0). Supplementary table 1 includes an overview of the data.
The in-house data set was generated by immunoprecipitation of HLA-class I using the pan-HLA class I antibody W6/332 as described [
]. In brief, HLA complex components were purified from cleared cell lysates of 5 × 108 cells per sample by incubation with 2 mg W63/32 antibody cross-linked to 1 ml Protein A Sepharose beads (Abcam) overnight at 4 degrees Celsius. Beads were washed, and HLA complex components eluted from the beads using 10% acetic acid. peptides were separated from larger complex components by preparative HPLC, and peptide fractions were pooled and analysed by nano-liquid chromatography–tandem mass spectrometry (nano-LC-MS/MS) using an Ultimate 3000 RSLCnano System (PepMap C18 column, 2 µm particle size, 75 µm × 50 cm; Thermo Scientific) coupled with an Orbitrap Fusion Lumos Tribrid mass spectrometer (Thermo Scientific). A 60 min linear gradient from 3% to 25% ACN in 5% DMSO/0.1% formic acid at a flow rate of 250 nl/min was used for peptides elution. Mass spectrometry (MS) detection was performed with a resolution of 120,000 for full MS (300-1500m/z scan range). Precursors were selected using TopSpeed mode at a 2 s cycle time, an isolation width of 1.2 amu, a resolution of 30,000 and high-energy collisional dissociation (HCD) setting of 28 for peptides with 2-4 charges and 32 for singly-charged precursors. Data were analysed using Peaks v10 (Bioinformatics solutions) with the following settings: precursor mass tolerance: 5 ppm; fragment mass tolerance: 0.03 Da; digestion: none; fixed modifications: none; variable modifications: phosphorylation on (STY). The false discovery rate at a score of -10lgP=15 was determined as 1.88% for DoTc2 cells, and 0.90% for SiHa cells, respectively. This data set and its HLA-typing are available at https://github.com/mnielLab/NetMHCphosPan-1.0/tree/main/Benchmark.
For evaluation on multiple-allele data sets, a percentile rank score transformation was applied. This transformation was made from the score distribution of 100,000 natural random peptides. Using this transformation, a prediction score of a peptide was defined as the score to the HLA molecule with the lowest percentile rank value across the HLAs of a given cell line. The performance on the cross-validation data was evaluated by concatenating all test set predictions and calculating the area under the ROC curve (AUC), AUC with a maximum false-positive rate of 10% (AUC0.1), and positive predictive value (PPV) based on the prediction rank scores. To calculate the PPV, the predictions were ordered by their prediction rank score, then if the particular cell line had N positives, the ratio of positives among the top N predictions was reported as the PPV. Performance on the independent benchmark data was evaluated in terms of F-rank values. F-rank values were calculated from the rank R of each positive peptide within a related set of negative peptides in the evaluation data set, i.e. as (R-1)/total_peptides, so an F-rank of zero would be a perfect performance, while 0.5 would be a random prediction. For that purpose, an evaluation set of negative peptides was constructed for each positive peptide including all peptides of the length of the positive peptide overlapping known phospho sites [
] within the positive source protein. In the cases of peptides containing two or more phosphosites, only combinations of single phosphorylated peptides were included in the evaluation dataset. This F-rank calculation was done for each positive cell line combination, meaning that proteins with multiple phospho ligands were benchmarked individually for each phospho ligand. The independent benchmark performance was further evaluated in terms of AUC, AUC0.1, and PPV by pooling the predictions of the evaluation sets.
Sequence logos were made using the R package "ggseqlogoMOD" a further developed version of the package “ggseqlogo” by Wagih et al. [
All plots in this study were made in R using the tidyverse and patchwork packages.
Here, a novel predictor for HLA presentation of phosphorylated peptides is presented. The model is trained on a comprehensive data set of 3,653 HLA eluted phospho-ligands (pEL) from 63 different cell lines (for details refer to materials and methods). The method was developed using the pan-specific NNAlign_MA machine learning framework [
], allowing concurrent motif deconvolution and model training, enabling leveraging of information between data sets, potentially resulting in improved predictive power and boosted allelic coverage.
The pEL peptides varied in length from 8 to 13 amino acids, with the majority being 9-mers (Fig. 1A). The distribution of the three phosphorylation types in the training data set were pS: 73%, pT: 23%, and pY: 4% (Fig. 1B), in agreement with Solleder et al. [
]). Although a higher proportion of pTyr-sites was reported in PhosphoSitePlus, the observed lower frequency of this post-translational modification by Sharma et al. is due to pTyr being transient and related to specific activation functions on the cell. The peptides were distributed unevenly among the cell lines, with some cell lines having as few as three phosphorylated peptides whereas the highest occurrence in a single cell line was 283 phosphopeptides (Fig. 1C). The HLA-typing of the cell lines mostly all consisted of an even distribution among the HLA-A, HLA-B, and HLA-C alleles, the exception being the C1R-* cell lines which only expressed two HLA-C and one HLA-B allele.
A model, termed NetMHCphosPan, was trained and evaluated using cross-validation applying the neural network-based tool NNAlign_MA [
] integrating pEL data with single-allelic (SA) EL and binding affinity (BA) data of unmodified peptides. For details on the data, data partitioning, and model training refer to materials and methods. In short, NNAlign_MA allows integration of mixed data types (BA and EL) in a pan-specific machine learning framework based on “pseudo-labeling” which enables on-the-fly motif deconvolution of EL data from cell lines expressing multiple HLA alleles (so-called MA data). This is achieved using a short burnin pre-training on SA data resulting in a model with pan-specific prediction potential. Subsequently, this pre-trained model is applied to annotate the MA data and assign each ligand to the most likely of the multiple possible HLA molecules. This MA annotation is repeated during the model training. This results in an expansion of the training set size, and an overall improved predictive power compared to motif deconvolution conducted and methods trained in a per-dataset manner [
Fig. 2 displays the cross-validation performance values measured in terms of AUC, AUC0.1, and the positive predictive value (PPV) for NetMHCphosPan. Here prediction values for ligands in the phosphopeptides data sets were defined as the prediction score to the HLA with the lowest predicted percentile rank value across the different HLA molecules expressed in the given cell line. Further, the NetMHCpan-4.1 method was included to assess the performance of a baseline method trained solely on unmodified peptide data. This was done in three different manners; replacing all modified amino acids with “X”, with corresponding unmodified amino acid, or with an amino acid of similar charge (pS -> D, pT -> E, pY -> Y) before performing the predictions.
NetMHCphosPan significantly outperformed all three NetMHCpan-4.1 implementations (p<0.012 in all comparisons, binomial test excluding ties), exemplified by a more than 5% increased median PPV performance. Moreover, the proportion of phospho-ligands with predicted rank values > 20% in the NetMHCphosPan evaluation, revealed a very low proportion of contaminants (5%) confirming a high quality of the data. Note, that this value corresponds nicely to the 5% FDR threshold used to annotate the MS data [
]. An example of such contaminants is sERHVAQF which was identified across 14 out of 15 C1R-HLA-C cell line data sets despite consistently being predicted as a poor binder.
Next, the predicted position of the phosphosite in the phosphorylated ligand was investigated. The result of this analysis is shown in Fig. 3 and Supp. Fig. S1. Here, the phosphopeptides were divided based on their predicted HLA prediction rank score, to investigate if patterns could differentiate binding from non-binding, i.e MS coprecipitated contaminant, peptides. In line with earlier results [
], this analysis revealed a clear and highly significant (p<0.05, comparison of ratio) enrichment of phosphosite located at P4 for ligands for all peptide lengths (Fig. 3 and Supp. Fig. S1). In contrast, the phosphopeptides categorized as HLA non-binders/contaminants had a close to even distribution of the phosphosite in all positions, with exception of 8mers where the non-binding peptides tended to have the phosphosite located at P1. This P1 enrichment is however driven by the single 8mer sERHVAQF contaminant mentioned earlier (data not shown). Refer to Supplementary Fig. S1 for an illustration also including the intermediate binding category. The plot further demonstrates that phosphosites only very rarely are located at the common HLA anchor positions (P1, P2, and PΩ). That is, for each of these positions, phosphosites are found in less than 2.5% of the binding peptides.
To further visualize this bias in the location of the phosphosite in the pEL data, we plotted the sequence logos of the 9-mer predicted cores of the phosphopeptides for all 35 HLA molecules covered with 20 or more phospho-ligands in the motif deconvolution. These sequence logos were compared to logos generated from unmodified peptides predicted by NetMHCpan-4.1 for the same HLA alleles (see Fig. 4, motifs predicted by NetMHCphosPan are included in Supplementary Fig. S2). This Fig. additionally highlights the overall preferred phosphosite location at P4. One clear exception from this was the preferred phosphosite position at P8 observed for HLA-B08:01 in agreement with [
]. Further comparing the motif for the unmodified peptides and phosphorylated peptides, and in line with earlier works, an enrichment of proline at P5 was observed. Several HLA molecules also had a preference for arginine at P1. These findings align with and extend the earlier findings by Solleder et al. [
], where motifs for phosphorylated ligands to 22 HLA molecules characterized with 20 or more ligands were described (Supplementary Fig. S3). Accurate motifs were identified for 35 HLA molecules compared to 22 described by Solleder et al. [
]. The larger HLA coverage described here is a direct result of the enhanced motif deconvolution potential of NNAlign_MA single data set approaches imposed by its ability to leverage information between data sets, reconcile peptides of variable length to a common core of 9 amino acids, and integrate prior motif knowledge from un-modified SA data.
This higher motif deconvolution power is further revealed when comparing the total number of ligands with deconvoluted HLA restriction between NNAlign_MA and the MixMHCp method applied by Solleder et al.. Here, NNAlign_MA assigns significantly more (38%, p-value 0.02 binomial test excluding ties) phospho-ligands to individual HLA molecules compared to that obtained by the MixMHCp deconvolution [
] when including all peptides lengths and HLA molecules characterized by 20 or more ligands for at least one of the two methods, and 45% more (p<0.005, Binomial test excluding ties) when focusing only on 9-mer peptides (Supp. Table 2). Also in this analysis were peptides with percentile rank scores over 20 excluded from the NNAlign_MA analysis as those were regarded as contaminants. This analysis further underlines the importance of including peptides other than 9mers for motif characterization, as they constitute 47% of the pEL collected for the training data, and the increased deconvolution potential of the NNAlign_MA framework over other strategies. This increased coverage of overall 50% is most pronounced for HLA-B where NNAlign_MA allows for accurate motif deconvolution for 15 HLA molecules compared to 7 covered by 20 or more 9-mer ligands reported by Solleder et al. [
Further, investigating the relative proportion of phosphorylated ligands assigned to HLA-A, HLA-B, and HLA-C loci alleles after the motif deconvolution for cell lines with HLA expression at all three loci/complete HLA allele information demonstrated the expected enrichment for HLA-A and HLA-B presentation (proportion assigned phosphorylated ligands for HLA-A and HLA-B of 46% and 35%) over HLA-C (19%). This proportion of HLA-C presented ligands is however substantially higher compared to that observed for unmodified peptides where the proportion of HLA-C presented ligands most often is close to 10% [
]. As expected, this pattern was changed when investigating the C1R-C cell lines, only expressing HLA-B and HLA-C allele, where the observed proportions were 22% and 78% for HLA-B and HLA-C respectively. This result is in line with that reported in Solleder et al. where a clear enrichment of phosphorylated peptides among HLA-C ligands was also observed.
Finally, NetMHCphosPan was validated on an independent benchmark data set consisting of 225 unique phosphopeptides not sharing overlap with training data extracted from 6 cell lines. The performance was evaluated in terms of F-rank, PPV, AUC, and AUC 0.1. The F-rank metric (false positive rank), is calculated on a per protein-ligand basis from the proportion of source protein peptides with a prediction score greater than the observed ligand, and hence has a direct intuitive interpretation. This metric hence has an optimal value of 0, and 0.5 for random predictions. PPV, AUC, and AUC 0.1 values were calculated on the combined data set covering all 225 unique phospho ligands. For details on the data and performance evaluation refer to materials and methods. The performance was compared to PhosMHCpred developed by the Gfeller lab and to NetMHCpan-4.1 with all phosphosites replaced with “X”. Since PhosMHCpred is only able to predict 9-mers and a limited set of HLA alleles, we first conducted the benchmark on a subset of 101 (48%) phosphopeptides leaving out all non-9mers and the complete C1R-A24 cell line data (Fig. 5). Further, the list of possible alleles in the MA data was reduced to the set covered by PhosMHCpred, i.e excluding HLA-B27:05, HLA-C02:02 from GR; HLA-A24:02 from SiHa; and HLA-C03:03 from DoTc2. Additionally, 7 peptides were considered non-HLA presented peptides (i.e. MS co-immunoprecipitated/ contaminant) and therefore excluded from the benchmark, as they were predicted with percentile rank scores above 20% in at least one of the three models. This F-rank benchmark demonstrates an overall comparable performance between NetMHCphosPan and PhosMHCpred and a significantly improved performance of NetMHCphosPan compared to NetMHCpan-4.1 (p<0.001, binomial test excluding ties, Fig. 5 right panel). Looking at the individual data sets, this observation is maintained, with NetMHCphosPan significantly outperforming the two other methods for the larger DoTc2 data set (p<0.05 in both cases, binomial test excluding ties). Similar conclusions were obtained using the PPV, AUC, and AUC 0.1 performance metrics. Here the performance values for the 9mer data set in Fig. 5 were AUC = 0.955/0.936/0.942, AUC 0.1 = 0.773/0.743/0.761, PPV = 0.296/0.268/0.241 for NetMHCphosPan, phosMHCpred and NetMHCpan respectively.
Further, Fig. 6 shows the predictive power for all peptides (208, excluding as in Fig. 5 likely non-HLA presented data points) for all six data sets including for the MA data the complete set of expressed HLA alleles. This plot demonstrates that the high performance of NetMHCphosPan is maintained also outside the restrictions of PhosMHCpred and confirms the overall improved (though not statistically significant due to the small data set sizes) performance over NetMHCpan across all peptide lengths. Additionally, AUC, AUC 0.1 and PPV performance metrics showed an improved performance of NetMHCphosPan compared to NetMHCpan. Here AUC = 0.911/0.894, AUC 0.1 = 0.711/0.694, and PVV = 0.267/0.171 performance values for NetMHCphosPan and NetMHCpan were obtained, respectively. These results thus confirm the overall superior predictive power of NetMHCphosPan over the two other methods.
In summary, these two independent benchmark calculations demonstrate not only that NetMHCphosPan achieves comparable performance to that of PhosMHCpred for the limited set of peptide length and alleles covered by PhosMHCpred, but also that this high performance for NetMHCphosPan extends to a much increased allelic and peptide length coverage. The NetMHCphosPan method is available as a web-server at http://www.cbs.dtu.dk/services/NetMHCphosPan-1.0.
In this study, we have applied the NNAlign_MA machine learning framework to develop an improved predictor for HLA antigen presentation of phosphorylated peptides. In contrast to the MixMHCp approach used for motif deconvolution in the earlier work by Solleder et al. [
], the NNAlign_MA approach allows for the integration of information from mixed data types including both phosphorylated and unmodified peptides along with the MHC pseudo-sequence in an HLA pan-specific prediction model, enabling motif deconvolution by information shared between datasets. This has the potential to result in a motif deconvolution with broader allelic coverage and associated improved predictive power [
Using this NNAlign_MA framework, a model termed NetMHCphosPan, was trained on an extensive data set of 3,653 phosphorylated HLA ligands obtained from 63 cell lines expressing a total of 69 distinct HLA molecules, combined with a large data set of un-modified binding affinity and single allele HLA eluted ligand data (obtained from the training data of NetMHCpan-4.1 [
]). Performance estimated using cross-validation confirmed a significantly improved predictive power of NetMHCphosPan compared to NetMHCpan (run using "X" in the phosphosite) but also revealed an at first surprising high performance of the latter. Investigating the location of the phosphosite across the 3,653 modified ligands demonstrated a clear and highly statistically significant enrichment at the P4 position independent of peptide length. This observation is in agreement with earlier findings including those of Solleder et al. [
Analyzing the relative proportion of phosphorylated ligands assigned to HLA-A, HLA-B, and HLA-C loci alleles after the motif deconvolution revealed the expected enrichment for HLA-A and HLA-B presentation over HLA-C. However, the proportion of HLA-C presented ligands was found to be higher in phosphorylated ligands compared to that most often observed for unmodified peptides, in line with the earlier results of Solleder et al. [
Investigating binding motifs from the data set deconvolution, the P4 phosphosite enrichment was confirmed for the vast majority of individual HLA molecules, with one clear outlier, HLA-B*08:01 having a phosphosite enrichment at P8 in line with earlier work [
]. Additionally, analyses of the binding motifs suggested an enrichment of proline at P5 and arginine at P1 in phosphorylated peptides in line with kinase-domain-specific binding motifs such as CDC2 and CAMK2A [
]. Finally, we observed very few examples of ligands with phosphosites located at the conventional HLA anchor binding positions (P1, P2, and PΩ).
Our analyses further confirmed the superior power of the pan-data NNAlign_MA method over single data set approaches allowing accurate motif deconvolution for 35 HLA molecules compared to the 22 obtained from the same data using the method of Solleder et al. [
], these 35 molecules extrapolate to 6,861 distinct HLA molecules providing close to complete coverage of the worldwide prevalent HLA molecules.
Finally, the NetMHCphosPan method was benchmarked against the PhosMHCpred method developed by Solleder et al. and NetMHCpan-4.1 on a large independent dataset covering 208 phosphorylated HLA ligand either obtained from literature studies or generated specifically for this study. The subset of data covered by PhosMHCpred (this method only allows for prediction of 9-mer peptides, and only covers 22 HLA molecules) demonstrated a comparable performance between NetMHCphosPan and PhosMHCpred, and a statistically improved performance of NetMHCphosPan compared to NetMHCpan. This high performance of NetMHCphosPan was next confirmed when evaluating on the complete data set including all HLAs and peptides of length 8-13.
In conclusion, we have developed a method for prediction of HLA antigen presentation of phosphorylated peptides with improved predictive power as well as greatly expanded HLA and peptide length coverage compared to current state-of-the-art methods. We expect this method will have high impact for rational identification for novel immune targets and development of novel immunotherapies [
]. Further, the NNAlign_MA modeling framework is readily extendable to cover peptides with other post-translational modifications such as citrullination, deamidation, and glycosylation, providing an ideal and ready to apply platform to analyse and train prediction models for HLA presentation of such modified peptides once the relevant EL data becomes available.
This work was supported in part by the Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services , under Contract No. HHSN272201200010C , and by the Science and Technology Council of Investigation (CONICET-Argentina). The authors declare that they have no competing interests.
MN and CB conceived the main conceptual project idea, CTR and CB performed most of the data handling and computational experiments. NT and XP generated the experimental data. The manuscript was written by CTR, CB and MN with contributions from all authors. All authors have read and approved the final version of the manuscript.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.