Fig 1

Fig. 1

Descriptive plots of the phosphopeptide training set. A) Overview of the peptide length distribution. B) Distribution of the phosphorylation types pS, pT and pY in pEL, phosphoSitePlus and Sharma et al. [18x[18]Hornbeck, PV, Zhang, B, Murray, B, Kornhauser, JM, Latham, V, and Skrzypek, E. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. januar. 2015; 43: D512–D520

Crossref | PubMed | Scopus (1609)
| Google ScholarSee all References
,24x[24]Sharma, K, D'Souza, RCJ, Tyanova, S, Schaab, C, Wiśniewski, JR, and Cox, J. Ultradeep Human Phosphoproteome Reveals a Distinct Regulatory Nature of Tyr and Ser/Thr-Based Signaling. Cell Rep. 2014; 8: 1583–1594 (september)

Abstract | Full Text | Full Text PDF | PubMed | Scopus (623)
| Google ScholarSee all References
]. C) Distribution of the 3653 phosphopeptides across the 63 cell lines. D) Overview of the HLA-typing of the individual cell lines.

Fig 2

Fig. 2

Cross-validation performance of NetMHCphosPan and NetMHCpan-4.1 on the training data. Performance for NetMHCphosPan was evaluated using 5-fold cross-validation. For NetMHCpan-4.1, all phosphosites were replaced with “X” (NetMHCpan-4.1_X), corresponding unmodified amino acids (NetMHCpan-4.1_STY), or with amino acids of similar charge (NetMHCpan-4.1_DEY). Performance metrics included are A) AUC, B) AUC0.1 (AUC integrated up to a false-positive rate of 10%), and C) PPV (predicted positive rate, calculated as described in material and methods). P-values stated are from Wilcoxon signed-rank test. *P<0.05, **P<0.01, ***P<0.001, ****P<0.0001.

Fig 3

Fig. 3

Distribution of phosphosite locations in phosphorylated ligands of different lengths. Binders are peptides with predicted percentile rank below 2%, and non-binders peptides with a predicted percentile rank above 20% from NetMHCphosPan cross-validation. In total 2916, 563 and 174 ligands were assigned to each of the three categories binding (<2%), intermediate (2-20%), and Non-binding (>20%).  Number of ligands assigned to each Binding/Non-binding category is displayed in the Fig. as (#binding:#non-binding) next to the peptide length. The bars represent the true data, while error bars are estimated using bootstrapping.

Fig 4

Fig. 4

Sequence logo representations of binding motifs for HLA molecules characterized with 20 or more phosphorylated peptides. Binding motifs of NetMHCphosPan for phosphorylated peptides compared to NetMHCpan-4.1 binding motifs of unmodified peptides. Phosphosites are shown in purple. Noted in parenthesis is the number of peptides included in the logo generation. For NetMHCphosPan, ligands with predicted percentile rank scores greater than 20% were excluded. For NetMHCpan-4.1, logos were generated from the top 0.1% scoring predictions from a set of 125,000 random natural 8-11mer peptides.

Fig 5

Fig. 5

Boxplot of the F-rank performance values on 9-mer peptides in the benchmark over the different cell lines, and all cell line data collected under the label All. Methods included are NetMHCpan-4.1 with all phosphosites replaced with “X”, NetMHCphosPan, and PhosMHCpred, the method proposed by Solleder et al.. In parenthesis above each box-plot is noted the number of peptides associated with the respective cell lines. Only peptides with a predicted percentile rank score of 20 or less for at least one of the three methods are included. F-rank values equal to 0 are displayed as 0.0001.

Fig 6

Fig. 6

Boxplot of the F-rank performance values for the benchmark over the different lengths and cell lines. The number of peptides associated with each length is noted in parenthesis. Methods included are NetMHCpan-4.1 with all phosphosites replaced with “X”, and NetMHCphosPan. Only peptides with a predicted percentile rank score of 20 or less for at least one of the two methods are included. F-rank values equal to 0 are displayed as 0.0001.

Expand allCollapse all

Abstract

Post-translational modifications of proteins play a crucial part in carcinogenesis. Phosphorylated peptides have shown to be presented by MHC class I molecules and recognised by cytotoxic T cells, making them a promising target for immunotherapy. Identification of phosphorylated MHC class I ligands has so far predominantly been done using bioinformatic tools trained on unmodified peptides. Only one tool, PhosMHCpred, has been developed specifically for the prediction of phosphorylated MHC class I ligands so far and this tool has been trained only on a limited number of alleles and provides a limited peptide length coverage (only including 9-mers).

Here we propose a method, termed NetMHCphosPan, for the prediction of MHC presented phosphopeptides. The method is trained using the NNAlign_MA framework, which allows incorporating mixed data types and information leverage between data sets resulting in a greatly improved MHC and peptide length coverage and an overall increased predictive power compared to PhosMHCpred. Motif deconvolution suggested a strong preference for phosphosites to be located in position 4 of the binding motif, and enrichment of proline at P5 and arginine at P1. The improved performance, driven by the extended length and allelic coverage, of NetMHCphosPan over current state-of-the-art methods, was further validated on a large benchmark data set independent from the model development.

In conclusion, we have confirmed the high power of NNAlign_MA for motif deconvolution of complex immuno-peptidomics data and have developed a novel method for prediction of MHC presented phosphopeptides with improved predictive power and a broader peptide length and MHC coverage compared to current state-of-the-art methods. The developed method is available at http://www.cbs.dtu.dk/services/NetMHCphosPan-1.0.

Graphical abstract

Figure thumbnail ga1

Introduction

Major histocompatibility complex (MHC) class I molecules are the cells' gateway to present peptides derived from endogenous protein degradation to cytotoxic T cells (CTLs). Proteins in the cytosol are degraded to peptides by the proteasome after which they are transported into the endoplasmic reticulum by the transporter associated with antigen processing (TAP) complex. Here, after optional further processing by Endoplasmic Reticulum Aminopeptidases 1 and 2 (ERAP1 and 2), the peptides are loaded onto a MHC class I molecule and transported to the cell surface for antigen presentation to CTLs [1x[1]Neefjes, J, Jongsma, MLM, Paul, P, and Bakke, O. Towards a systems understanding of MHC class I and MHC class II antigen presentation. Nat Rev Immunol. 2011; 11: 823–836 (december)

Crossref | PubMed | Scopus (1079)
| Google ScholarSee all References
4x[4]Zarling, AL, Polefrone, JM, Evans, AM, Mikesh, LM, Shabanowitz, J, and Lewis, ST. Identification of class I MHC-associated phosphopeptides as targets for cancer immunotherapy. Proc Natl Acad Sci. 3. oktober. 2006; 103: 14889–14894

Crossref | PubMed | Scopus (133)
| Google ScholarSee all References
].

In humans, MHC class I molecules are encoded by three main classical genes, HLA-A, HLA-B, and HLA-C, all of which are highly polymorphic resulting in 19,586 different alleles known to date [5x[5]Robinson, J, Barker, DJ, Georgiou, X, Cooper, MA, Flicek, P, and Marsh, SGE. IPD-IMGT/HLA database. Nucleic Acids Res. 2019; 31: gkz950 (oktober)

Crossref | Scopus (1071)
| Google ScholarSee all References
]. Each of these alleles is associated with a set of peptides matching the binding motif of that particular MHC class I molecule. Previously, the focus has been on determining the binding motif of unmodified peptides to their MHC class I molecule. In fact, multiple algorithms have been developed to predict this binding event (reviewed in [6x[6]Nielsen, M, Andreatta, M, Peters, B, and Buus, S. Immunoinformatics: predicting peptide–MHC binding. Annu Rev Biomed Data Sci. 20. juli. 2020; 3: 191–215

Crossref
| Google ScholarSee all References
]).

Post-translational modifications (PTMs), particularly phosphorylations, have been shown to play a crucial part in carcinogenesis. Studies have shown that phosphorylated peptides can go through the endogenous pathway, bind to MHC class I molecules [7x[7]Alpízar, A, Marino, F, Ramos-Fernández, A, Lombardía, M, Jeko, A, and Pazos, F. A molecular basis for the presentation of phosphorylated peptides by HLA-B antigens. Mol Cell Proteomics MCP. februar. 2017; 16: 181–193

Abstract | Full Text | Full Text PDF | PubMed | Scopus (34)
| Google ScholarSee all References
], and in addition, get recognised by specific CTLs [2x[2]Andersen, MH, Bonfill, JE, Neisig, A, Arsequell, G, Sondergaard, I, and Valencia, G. Phosphorylated peptides can be transported by TAP molecules, presented by class I MHC molecules, and recognized by phosphopeptide-specific CTL. J Immunol Baltim Md 1950. 1. oktober. 1999; 163: 3812–3818

Google ScholarSee all References
4x[4]Zarling, AL, Polefrone, JM, Evans, AM, Mikesh, LM, Shabanowitz, J, and Lewis, ST. Identification of class I MHC-associated phosphopeptides as targets for cancer immunotherapy. Proc Natl Acad Sci. 3. oktober. 2006; 103: 14889–14894

Crossref | PubMed | Scopus (133)
| Google ScholarSee all References
]. This makes them promising targets for immunotherapy in cancer patients [4x[4]Zarling, AL, Polefrone, JM, Evans, AM, Mikesh, LM, Shabanowitz, J, and Lewis, ST. Identification of class I MHC-associated phosphopeptides as targets for cancer immunotherapy. Proc Natl Acad Sci. 3. oktober. 2006; 103: 14889–14894

Crossref | PubMed | Scopus (133)
| Google ScholarSee all References
]. Predictions of phosphorylated peptides binding motif to MHC class I molecules have so far predominantly been done with tools developed for prediction of unmodified peptides, where the phosphosite has been exchanged with the unmodified amino acid or an "X" for "unknown" amino acid. These studies have found three tendencies in the binding motifs; the phosphosite was in most cases placed at position 4 (P4) and in several cases followed by proline at P5. In addition, arginine was observed enriched in multiple motifs at P1 [3x[3]Zarling, AL, Ficarro, SB, White, FM, Shabanowitz, J, Hunt, DF, and Engelhard, VH. Phosphorylated peptides are naturally processed and presented by major histocompatibility complex class I molecules in vivo. J Exp Med. 2000; 192: 1755–1762 (18. december)

Crossref | PubMed | Scopus (167)
| Google ScholarSee all References
, 8x[8]Meyer, VS, Drews, O, Günder, M, Hennenlotter, J, Rammensee, H-G, and Stevanovic, S. Identification of natural MHC class II presented phosphopeptides and tumor-derived MHC class I phospholigands. J Proteome Res. 6. juli. 2009; 8: 3666–3674

Crossref | PubMed | Scopus (37)
| Google ScholarSee all References
, 9x[9]Cobbold, M, De La Pena, H, Norris, A, Polefrone, JM, Qian, J, and English, AM. MHC class I-associated phosphopeptides are the targets of memory-like immunity in leukemia. Sci Transl Med. 2013; 5 (18. september203ra125-203ra125)

Crossref | PubMed | Scopus (143)
| Google ScholarSee all References
, 10x[10]Marcilla, M, Alpízar, A, Lombardía, M, Ramos-Fernandez, A, Ramos, M, and Albar, JP. Increased diversity of the HLA-B40 ligandome by the presentation of peptides phosphorylated at their main anchor residue*. Mol Cell Proteomics. februar. 2014; 13: 462–474

Abstract | Full Text | Full Text PDF | PubMed | Scopus (25)
| Google ScholarSee all References
, 11x[11]Bassani-Sternberg, M, Bräunlein, E, Klar, R, Engleitner, T, Sinitcyn, P, and Audehm, S. Direct identification of clinically relevant neoepitopes presented on native human melanoma tissue by mass spectrometry. Nat Commun. 2016; 7: 13404 (december)

Crossref | PubMed | Scopus (386)
| Google ScholarSee all References
, 12x[12]Abelin, JG, Keskin, DB, Sarkizova, S, Hartigan, CR, Zhang, W, and Sidney, J. Mass spectrometry profiling of HLA-associated peptidomes in mono-allelic cells enables more accurate epitope prediction. Immunity. februar. 2017; 46: 315–326

Abstract | Full Text | Full Text PDF | PubMed | Scopus (327)
| Google ScholarSee all References
, 13x[13]Mohammed, F, Cobbold, M, Zarling, AL, Salim, M, Barrett-Wilt, GA, and Shabanowitz, J. Phosphorylation-dependent interaction between antigenic peptides and MHC class I: a molecular basis for the presentation of transformed self. Nat Immunol. 2008; 9: 1236–1243 (november)

Crossref | PubMed | Scopus (107)
| Google ScholarSee all References
, 14x[14]Mohammed, F, Stones, DH, Zarling, AL, Willcox, CR, Shabanowitz, J, and Cummings, KL. The antigenic identity of human class I MHC phosphopeptides is critically dependent upon phosphorylation status. Oncotarget. 2017; 8: 54160–54172 (15. august)

Crossref | PubMed | Scopus (31)
| Google ScholarSee all References
, 15x[15]Mommen, GPM, Frese, CK, Meiring, HD, van Gaans-van den Brink, J, de Jong, APJM, and van Els, CACM. Expanding the detectable HLA peptide repertoire using electron-transfer/higher-energy collision dissociation (EThcD). Proc Natl Acad Sci. 25. marts. 2014; 111: 4507–4512

Crossref | PubMed | Scopus (139)
| Google ScholarSee all References
]. These early findings were based on predictions without using the explicit information from the phosphosite. Solleder et al. developed a new prediction tool, which integrates MixMHCp, a deconvolution tool with an extended amino acid dictionary to allow for phosphosites. Using this tool, they confirmed the three tendencies earlier described [16x[16]Solleder, M, Guillaume, P, Racle, J, Michaux, J, Pak, H-S, and Müller, M. Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA Class I Ligands. Mol Cell Proteomics MCP. februar. 2020; 19: 390–404

Abstract | Full Text | Full Text PDF | PubMed | Scopus (21)
| Google ScholarSee all References
]. Furthermore, they proposed the proline enrichment at P5 and arginine at P1 to be the result of kinase phosphorylation motifs and established an enrichment of phosphorylated peptides among the HLA-C ligands, and serine to be the most abundant phosphorylated amino acid.

In this study, we confirmed and extended these earlier results. Using the motif deconvolution method, NNAlign_MA [17x[17]Alvarez, B, Reynisson, B, Barra, C, Buus, S, Ternette, N, and Connelley, T. NNAlign_MA; MHC peptidome deconvolution for accurate MHC binding motif characterization and improved T-cell epitope predictions. Mol Cell Proteomics. 2019; 18: 2459–2477 (december)

Abstract | Full Text | Full Text PDF | PubMed | Scopus (34)
| Google ScholarSee all References
] on the data from Solleder et al. allowed us to expand HLA allele and length coverage, and develop an improved model for prediction of HLA class I presentation of phospho-ligands. The developed model was tested and compared to current state-of-the-art methods on an independent data set. This dataset was constructed using both previously published phospho-peptide datasets and novel data generated specifically for this study.

Materials and methods

Training data sets

The phosphopeptide data sets used for training were extracted from the paper by Solleder et al. (2020) [16] where 61 phosphopeptide data sets had been collected from 9 previously published articles, and 2 additional datasets were extracted from Alpizar et al. [7x[7]Alpízar, A, Marino, F, Ramos-Fernández, A, Lombardía, M, Jeko, A, and Pazos, F. A molecular basis for the presentation of phosphorylated peptides by HLA-B antigens. Mol Cell Proteomics MCP. februar. 2017; 16: 181–193

Abstract | Full Text | Full Text PDF | PubMed | Scopus (34)
| Google ScholarSee all References
]. All phosphopeptides data sets were multi-allelic, i.e. obtained from cell lines expressing two or more HLA alleles. A great amount of redundancy was present in the phosphopeptides. Removing this redundancy and excluding phosphopeptides with a length above 13 amino acids [17x[17]Alvarez, B, Reynisson, B, Barra, C, Buus, S, Ternette, N, and Connelley, T. NNAlign_MA; MHC peptidome deconvolution for accurate MHC binding motif characterization and improved T-cell epitope predictions. Mol Cell Proteomics. 2019; 18: 2459–2477 (december)

Abstract | Full Text | Full Text PDF | PubMed | Scopus (34)
| Google ScholarSee all References
] resulted in a total of 3,653 phosphopeptides each unique for their respective cell lines.

Random natural phosphopeptides were added as negatives for each phosphopeptide length in each of the 63 cell line data sets with phosphorylated ligands. These peptides were sampled from the phosphorylated version of the human proteome downloaded from PhosphositeSitePlus, a database of mass spectrometry (MS) experimentally found phosphosites [18x[18]Hornbeck, PV, Zhang, B, Murray, B, Kornhauser, JM, Latham, V, and Skrzypek, E. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. januar. 2015; 43: D512–D520

Crossref | PubMed | Scopus (1609)
| Google ScholarSee all References
]. More than 97% of MHC presented phosphopeptides contained only a single phospho amino acid per peptide ligand. Therefore, to further mimic the distribution of phospho amino acids in the natural MHC presented phosphopeptides, this phosphosite distribution was reproduced for the random negatives. Negatives for each length [8x[8]Meyer, VS, Drews, O, Günder, M, Hennenlotter, J, Rammensee, H-G, and Stevanovic, S. Identification of natural MHC class II presented phosphopeptides and tumor-derived MHC class I phospholigands. J Proteome Res. 6. juli. 2009; 8: 3666–3674

Crossref | PubMed | Scopus (37)
| Google ScholarSee all References
, 9x[9]Cobbold, M, De La Pena, H, Norris, A, Polefrone, JM, Qian, J, and English, AM. MHC class I-associated phosphopeptides are the targets of memory-like immunity in leukemia. Sci Transl Med. 2013; 5 (18. september203ra125-203ra125)

Crossref | PubMed | Scopus (143)
| Google ScholarSee all References
, 10x[10]Marcilla, M, Alpízar, A, Lombardía, M, Ramos-Fernandez, A, Ramos, M, and Albar, JP. Increased diversity of the HLA-B40 ligandome by the presentation of peptides phosphorylated at their main anchor residue*. Mol Cell Proteomics. februar. 2014; 13: 462–474

Abstract | Full Text | Full Text PDF | PubMed | Scopus (25)
| Google ScholarSee all References
, 11x[11]Bassani-Sternberg, M, Bräunlein, E, Klar, R, Engleitner, T, Sinitcyn, P, and Audehm, S. Direct identification of clinically relevant neoepitopes presented on native human melanoma tissue by mass spectrometry. Nat Commun. 2016; 7: 13404 (december)

Crossref | PubMed | Scopus (386)
| Google ScholarSee all References
, 12x[12]Abelin, JG, Keskin, DB, Sarkizova, S, Hartigan, CR, Zhang, W, and Sidney, J. Mass spectrometry profiling of HLA-associated peptidomes in mono-allelic cells enables more accurate epitope prediction. Immunity. februar. 2017; 46: 315–326

Abstract | Full Text | Full Text PDF | PubMed | Scopus (327)
| Google ScholarSee all References
, 13x[13]Mohammed, F, Cobbold, M, Zarling, AL, Salim, M, Barrett-Wilt, GA, and Shabanowitz, J. Phosphorylation-dependent interaction between antigenic peptides and MHC class I: a molecular basis for the presentation of transformed self. Nat Immunol. 2008; 9: 1236–1243 (november)

Crossref | PubMed | Scopus (107)
| Google ScholarSee all References
] were generated as 5 times the quantity of the most prevalent ligand length of that cell line as described earlier [17x[17]Alvarez, B, Reynisson, B, Barra, C, Buus, S, Ternette, N, and Connelley, T. NNAlign_MA; MHC peptidome deconvolution for accurate MHC binding motif characterization and improved T-cell epitope predictions. Mol Cell Proteomics. 2019; 18: 2459–2477 (december)

Abstract | Full Text | Full Text PDF | PubMed | Scopus (34)
| Google ScholarSee all References
]. This ensured a flat distribution of random negatives allowing the model to learn the true length distribution of phosphopeptides. In total, the phosphopeptide training set consisted of 3,653 positives and 58,229 negatives.

Next, the phosphopeptide data set was combined with single allele (SA) data from the NetMHCpan-4.1 publication [19x[19]Reynisson, B, Alvarez, B, Paul, S, Peters, B, and Nielsen, M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2. juli. 2020; 48: W449–W454

Crossref | PubMed | Scopus (0)
| Google ScholarSee all References
]. This consisted of binding affinity data (BA) (177,400 data points covering 153 different MHC molecules), single-allele eluted ligand (SA-EL) data (45,276 positive data points covering 55 different MHC molecules). Also here, the data were enriched with 5 times random natural negatives as described above for the phosphopeptide data sets.

Data partitioning

After the merge of positive and negative data, the phosphosites were translated to a regular STY amino acid encoding to cluster together phosphopeptides and regular peptides with the same amino acid sequence. A common motif Hobohm 1 clustering with length 8 amino acids was used to create five partitions for cross-validation from the combined data including both BA and EL data as described earlier [20x[20]Nielsen, M, Lundegaard, C, and Lund, O. Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method. BMC Bioinformatics. 4. juli. 2007; 8: 238

Crossref | PubMed | Scopus (366)
| Google ScholarSee all References
]. After partitioning, the phosphosites were back-translated to “sty”. This final training data set is available at https://github.com/mnielLab/NetMHCphosPan-1.0.

Model training

The model was trained with NNAlign_MA [17x[17]Alvarez, B, Reynisson, B, Barra, C, Buus, S, Ternette, N, and Connelley, T. NNAlign_MA; MHC peptidome deconvolution for accurate MHC binding motif characterization and improved T-cell epitope predictions. Mol Cell Proteomics. 2019; 18: 2459–2477 (december)

Abstract | Full Text | Full Text PDF | PubMed | Scopus (34)
| Google ScholarSee all References
] using five-fold cross-validation and early stopping with up to 200 minimization cycles, an optimal motif length of 9, an SA burn-in period of 20 iterations, and 200,000 samples per training cycle. Models were trained using 56 or 66 hidden neurons and five random network initializations for each architecture resulting in 10 ensemble models for each cross-validation step. Peptides were represented using sparse sequence encoding with an extended amino acid alphabet: ARNDCQEGHILKMFPSTWYVsty, where “sty” represents the phosphorylated amino acids pS, pT, and pY as described above.

Benchmark data set

An independent benchmark data set was extracted from previous publications including C1R-A1 (HLA-A*01:01), C1R-A2 (HLA-A*02:01), C1R-A24 (HLA-A*24:02) and GR (HLA-A*01:01, HLA-A*03:01, HLA-B*07:02, HLA-B*27:05, HLA-C*02:02, and HLA-C*07:02) [7x[7]Alpízar, A, Marino, F, Ramos-Fernández, A, Lombardía, M, Jeko, A, and Pazos, F. A molecular basis for the presentation of phosphorylated peptides by HLA-B antigens. Mol Cell Proteomics MCP. februar. 2017; 16: 181–193

Abstract | Full Text | Full Text PDF | PubMed | Scopus (34)
| Google ScholarSee all References
,21x[21]Mei, S, Ayala, R, Ramarathinam, SH, Illing, PT, Faridi, P, and Song, J. Immunopeptidomic analysis reveals that deamidated HLA-bound peptides arise predominantly from deglycosylated precursors. Mol Cell Proteomics. juli. 2020; 19: 1236–1247

Abstract | Full Text | Full Text PDF | PubMed | Scopus (11)
| Google ScholarSee all References
], combined with an in-house data set (see below) obtained from two multiallelic cell lines DoTc2 (HLA-A*03:01, HLA-B*55:01, and HLA-C*03:03) and SiHa (HLA-A*24:02, HLA-B*40:02, and HLA-C*03:04). For all data sets, peptides sharing an overlap with 8 or more residues to a peptide in the training data were removed resulting in a benchmark with 225 phosphopeptides (available at https://github.com/mnielLab/NetMHCphosPan-1.0). Supplementary table 1 includes an overview of the data.

The in-house data set was generated by immunoprecipitation of HLA-class I using the pan-HLA class I antibody W6/332 as described [22x[22]Purcell, AW, Ramarathinam, SH, and Ternette, N. Mass spectrometry-based identification of MHC-bound peptides for immunopeptidomics. Nat Protoc. juni. 2019; 14: 1687–1707

Crossref | PubMed | Scopus (113)
| Google ScholarSee all References
]. In brief, HLA complex components were purified from cleared cell lysates of 5 × 108 cells per sample by incubation with 2 mg W63/32 antibody cross-linked to 1 ml Protein A Sepharose beads (Abcam) overnight at 4 degrees Celsius. Beads were washed, and HLA complex components eluted from the beads using 10% acetic acid. peptides were separated from larger complex components by preparative HPLC, and peptide fractions were pooled and analysed by nano-liquid chromatography–tandem mass spectrometry (nano-LC-MS/MS) using an Ultimate 3000 RSLCnano System (PepMap C18 column, 2 µm particle size, 75 µm × 50 cm; Thermo Scientific) coupled with an Orbitrap Fusion Lumos Tribrid mass spectrometer (Thermo Scientific). A 60 min linear gradient from 3% to 25% ACN in 5% DMSO/0.1% formic acid at a flow rate of 250 nl/min was used for peptides elution. Mass spectrometry (MS) detection was performed with a resolution of 120,000 for full MS (300-1500m/z scan range). Precursors were selected using TopSpeed mode at a 2 s cycle time, an isolation width of 1.2 amu, a resolution of 30,000 and high-energy collisional dissociation (HCD) setting of 28 for peptides with 2-4 charges and 32 for singly-charged precursors. Data were analysed using Peaks v10 (Bioinformatics solutions) with the following settings: precursor mass tolerance: 5 ppm; fragment mass tolerance: 0.03 Da; digestion: none; fixed modifications: none; variable modifications: phosphorylation on (STY). The false discovery rate at a score of -10lgP=15 was determined as 1.88% for DoTc2 cells, and 0.90% for SiHa cells, respectively. This data set and its HLA-typing are available at https://github.com/mnielLab/NetMHCphosPan-1.0/tree/main/Benchmark.

Performance evaluation

For evaluation on multiple-allele data sets, a percentile rank score transformation was applied. This transformation was made from the score distribution of 100,000 natural random peptides. Using this transformation, a prediction score of a peptide was defined as the score to the HLA molecule with the lowest percentile rank value across the HLAs of a given cell line. The performance on the cross-validation data was evaluated by concatenating all test set predictions and calculating the area under the ROC curve (AUC), AUC with a maximum false-positive rate of 10% (AUC0.1), and positive predictive value (PPV) based on the prediction rank scores. To calculate the PPV, the predictions were ordered by their prediction rank score, then if the particular cell line had N positives, the ratio of positives among the top N predictions was reported as the PPV. Performance on the independent benchmark data was evaluated in terms of F-rank values. F-rank values were calculated from the rank R of each positive peptide within a related set of negative peptides in the evaluation data set, i.e. as (R-1)/total_peptides, so an F-rank of zero would be a perfect performance, while 0.5 would be a random prediction. For that purpose, an evaluation set of negative peptides was constructed for each positive peptide including all peptides of the length of the positive peptide overlapping known phospho sites [18x[18]Hornbeck, PV, Zhang, B, Murray, B, Kornhauser, JM, Latham, V, and Skrzypek, E. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. januar. 2015; 43: D512–D520

Crossref | PubMed | Scopus (1609)
| Google ScholarSee all References
] within the positive source protein. In the cases of peptides containing two or more phosphosites, only combinations of single phosphorylated peptides were included in the evaluation dataset. This F-rank calculation was done for each positive cell line combination, meaning that proteins with multiple phospho ligands were benchmarked individually for each phospho ligand. The independent benchmark performance was further evaluated in terms of AUC, AUC0.1, and PPV by pooling the predictions of the evaluation sets.

Sequence logos

Sequence logos were made using the R package "ggseqlogoMOD" a further developed version of the package “ggseqlogo” by Wagih et al. [23x[23]Wagih, O. ggseqlogo: a versatile R package for drawing sequence logos. Hancock J, redaktør. Bioinformatics. 2017; 33: 3645–3647 (15. november)

Crossref | PubMed | Scopus (257)
| Google ScholarSee all References
].  This package extension was introduced in Solleder et al. [16x[16]Solleder, M, Guillaume, P, Racle, J, Michaux, J, Pak, H-S, and Müller, M. Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA Class I Ligands. Mol Cell Proteomics MCP. februar. 2020; 19: 390–404

Abstract | Full Text | Full Text PDF | PubMed | Scopus (21)
| Google ScholarSee all References
] and can be found at https://github.com/GfellerLab/ggseqlogo.

Plots

All plots in this study were made in R using the tidyverse and patchwork packages.

Results

Here, a novel predictor for HLA presentation of phosphorylated peptides is presented. The model is trained on a comprehensive data set of 3,653 HLA eluted phospho-ligands (pEL) from 63 different cell lines (for details refer to materials and methods). The method was developed using the pan-specific NNAlign_MA machine learning framework [17x[17]Alvarez, B, Reynisson, B, Barra, C, Buus, S, Ternette, N, and Connelley, T. NNAlign_MA; MHC peptidome deconvolution for accurate MHC binding motif characterization and improved T-cell epitope predictions. Mol Cell Proteomics. 2019; 18: 2459–2477 (december)

Abstract | Full Text | Full Text PDF | PubMed | Scopus (34)
| Google ScholarSee all References
], allowing concurrent motif deconvolution and model training, enabling leveraging of information between data sets, potentially resulting in improved predictive power and boosted allelic coverage.

The pEL peptides varied in length from 8 to 13 amino acids, with the majority being 9-mers (Fig. 1A). The distribution of the three phosphorylation types in the training data set were pS: 73%, pT: 23%, and pY: 4% (Fig. 1B), in agreement with Solleder et al. [16x[16]Solleder, M, Guillaume, P, Racle, J, Michaux, J, Pak, H-S, and Müller, M. Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA Class I Ligands. Mol Cell Proteomics MCP. februar. 2020; 19: 390–404

Abstract | Full Text | Full Text PDF | PubMed | Scopus (21)
| Google ScholarSee all References
]. This distribution was highly different from that observed by Sharma et al. [24x[24]Sharma, K, D'Souza, RCJ, Tyanova, S, Schaab, C, Wiśniewski, JR, and Cox, J. Ultradeep Human Phosphoproteome Reveals a Distinct Regulatory Nature of Tyr and Ser/Thr-Based Signaling. Cell Rep. 2014; 8: 1583–1594 (september)

Abstract | Full Text | Full Text PDF | PubMed | Scopus (623)
| Google ScholarSee all References
] in the phosphorylated version of human proteome (pS: 84.1%, pT: 15.5%, and pY: 0.04%, [16x[16]Solleder, M, Guillaume, P, Racle, J, Michaux, J, Pak, H-S, and Müller, M. Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA Class I Ligands. Mol Cell Proteomics MCP. februar. 2020; 19: 390–404

Abstract | Full Text | Full Text PDF | PubMed | Scopus (21)
| Google ScholarSee all References
,24x[24]Sharma, K, D'Souza, RCJ, Tyanova, S, Schaab, C, Wiśniewski, JR, and Cox, J. Ultradeep Human Phosphoproteome Reveals a Distinct Regulatory Nature of Tyr and Ser/Thr-Based Signaling. Cell Rep. 2014; 8: 1583–1594 (september)

Abstract | Full Text | Full Text PDF | PubMed | Scopus (623)
| Google ScholarSee all References
]), and likewise different to all known phosphorylations reported on PhosphoSitePlus (pS: 58%, pT: 25%, and pY: 17%, [18x[18]Hornbeck, PV, Zhang, B, Murray, B, Kornhauser, JM, Latham, V, and Skrzypek, E. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. januar. 2015; 43: D512–D520

Crossref | PubMed | Scopus (1609)
| Google ScholarSee all References
]). Although a higher proportion of pTyr-sites was reported in PhosphoSitePlus, the observed lower frequency of this post-translational modification by Sharma et al. is due to pTyr being transient and related to specific activation functions on the cell. The peptides were distributed unevenly among the cell lines, with some cell lines having as few as three phosphorylated peptides whereas the highest occurrence in a single cell line was 283 phosphopeptides (Fig. 1C). The HLA-typing of the cell lines mostly all consisted of an even distribution among the HLA-A, HLA-B, and HLA-C alleles, the exception being the C1R-* cell lines which only expressed two HLA-C and one HLA-B allele.

Fig 1 Opens large image

Fig. 1

Descriptive plots of the phosphopeptide training set. A) Overview of the peptide length distribution. B) Distribution of the phosphorylation types pS, pT and pY in pEL, phosphoSitePlus and Sharma et al. [18x[18]Hornbeck, PV, Zhang, B, Murray, B, Kornhauser, JM, Latham, V, and Skrzypek, E. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. januar. 2015; 43: D512–D520

Crossref | PubMed | Scopus (1609)
| Google ScholarSee all References
,24x[24]Sharma, K, D'Souza, RCJ, Tyanova, S, Schaab, C, Wiśniewski, JR, and Cox, J. Ultradeep Human Phosphoproteome Reveals a Distinct Regulatory Nature of Tyr and Ser/Thr-Based Signaling. Cell Rep. 2014; 8: 1583–1594 (september)

Abstract | Full Text | Full Text PDF | PubMed | Scopus (623)
| Google ScholarSee all References
]. C) Distribution of the 3653 phosphopeptides across the 63 cell lines. D) Overview of the HLA-typing of the individual cell lines.

A model, termed NetMHCphosPan, was trained and evaluated using cross-validation applying the neural network-based tool NNAlign_MA [17x[17]Alvarez, B, Reynisson, B, Barra, C, Buus, S, Ternette, N, and Connelley, T. NNAlign_MA; MHC peptidome deconvolution for accurate MHC binding motif characterization and improved T-cell epitope predictions. Mol Cell Proteomics. 2019; 18: 2459–2477 (december)

Abstract | Full Text | Full Text PDF | PubMed | Scopus (34)
| Google ScholarSee all References
] integrating pEL data with single-allelic (SA) EL and binding affinity (BA) data of unmodified peptides. For details on the data, data partitioning, and model training refer to materials and methods. In short, NNAlign_MA allows integration of mixed data types (BA and EL) in a pan-specific machine learning framework based on “pseudo-labeling” which enables on-the-fly motif deconvolution of EL data from cell lines expressing multiple HLA alleles (so-called MA data). This is achieved using a short burnin pre-training on SA data resulting in a model with pan-specific prediction potential. Subsequently, this pre-trained model is applied to annotate the MA data and assign each ligand to the most likely of the multiple possible HLA molecules. This MA annotation is repeated during the model training. This results in an expansion of the training set size, and an overall improved predictive power compared to motif deconvolution conducted and methods trained in a per-dataset manner [17x[17]Alvarez, B, Reynisson, B, Barra, C, Buus, S, Ternette, N, and Connelley, T. NNAlign_MA; MHC peptidome deconvolution for accurate MHC binding motif characterization and improved T-cell epitope predictions. Mol Cell Proteomics. 2019; 18: 2459–2477 (december)

Abstract | Full Text | Full Text PDF | PubMed | Scopus (34)
| Google ScholarSee all References
].

Fig. 2 displays the cross-validation performance values measured in terms of AUC, AUC0.1, and the positive predictive value (PPV) for NetMHCphosPan. Here prediction values for ligands in the phosphopeptides data sets were defined as the prediction score to the HLA with the lowest predicted percentile rank value across the different HLA molecules expressed in the given cell line. Further, the NetMHCpan-4.1 method was included to assess the performance of a baseline method trained solely on unmodified peptide data. This was done in three different manners; replacing all modified amino acids with “X”, with corresponding unmodified amino acid, or with an amino acid of similar charge (pS -> D, pT -> E, pY -> Y) before performing the predictions.

Fig 2 Opens large image

Fig. 2

Cross-validation performance of NetMHCphosPan and NetMHCpan-4.1 on the training data. Performance for NetMHCphosPan was evaluated using 5-fold cross-validation. For NetMHCpan-4.1, all phosphosites were replaced with “X” (NetMHCpan-4.1_X), corresponding unmodified amino acids (NetMHCpan-4.1_STY), or with amino acids of similar charge (NetMHCpan-4.1_DEY). Performance metrics included are A) AUC, B) AUC0.1 (AUC integrated up to a false-positive rate of 10%), and C) PPV (predicted positive rate, calculated as described in material and methods). P-values stated are from Wilcoxon signed-rank test. *P<0.05, **P<0.01, ***P<0.001, ****P<0.0001.

NetMHCphosPan significantly outperformed all three NetMHCpan-4.1 implementations (p<0.012 in all comparisons, binomial test excluding ties), exemplified by a more than 5% increased median PPV performance. Moreover, the proportion of phospho-ligands with predicted rank values > 20% in the NetMHCphosPan evaluation, revealed a very low proportion of contaminants (5%) confirming a high quality of the data. Note, that this value corresponds nicely to the 5% FDR threshold used to annotate the MS data [16x[16]Solleder, M, Guillaume, P, Racle, J, Michaux, J, Pak, H-S, and Müller, M. Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA Class I Ligands. Mol Cell Proteomics MCP. februar. 2020; 19: 390–404

Abstract | Full Text | Full Text PDF | PubMed | Scopus (21)
| Google ScholarSee all References
]. An example of such contaminants is sERHVAQF which was identified across 14 out of 15 C1R-HLA-C cell line data sets despite consistently being predicted as a poor binder.

Next, the predicted position of the phosphosite in the phosphorylated ligand was investigated. The result of this analysis is shown in Fig. 3 and Supp. Fig. S1. Here, the phosphopeptides were divided based on their predicted HLA prediction rank score, to investigate if patterns could differentiate binding from non-binding, i.e MS coprecipitated contaminant, peptides. In line with earlier results [16x[16]Solleder, M, Guillaume, P, Racle, J, Michaux, J, Pak, H-S, and Müller, M. Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA Class I Ligands. Mol Cell Proteomics MCP. februar. 2020; 19: 390–404

Abstract | Full Text | Full Text PDF | PubMed | Scopus (21)
| Google ScholarSee all References
], this analysis revealed a clear and highly significant (p<0.05, comparison of ratio) enrichment of phosphosite located at P4 for ligands for all peptide lengths (Fig. 3 and Supp. Fig. S1). In contrast, the phosphopeptides categorized as HLA non-binders/contaminants had a close to even distribution of the phosphosite in all positions, with exception of 8mers where the non-binding peptides tended to have the phosphosite located at P1. This P1 enrichment is however driven by the single 8mer sERHVAQF contaminant mentioned earlier (data not shown). Refer to Supplementary Fig. S1 for an illustration also including the intermediate binding category. The plot further demonstrates that phosphosites only very rarely are located at the common HLA anchor positions (P1, P2, and PΩ). That is, for each of these positions, phosphosites are found in less than 2.5% of the binding peptides.

Fig 3 Opens large image

Fig. 3

Distribution of phosphosite locations in phosphorylated ligands of different lengths. Binders are peptides with predicted percentile rank below 2%, and non-binders peptides with a predicted percentile rank above 20% from NetMHCphosPan cross-validation. In total 2916, 563 and 174 ligands were assigned to each of the three categories binding (<2%), intermediate (2-20%), and Non-binding (>20%).  Number of ligands assigned to each Binding/Non-binding category is displayed in the Fig. as (#binding:#non-binding) next to the peptide length. The bars represent the true data, while error bars are estimated using bootstrapping.

To further visualize this bias in the location of the phosphosite in the pEL data, we plotted the sequence logos of the 9-mer predicted cores of the phosphopeptides for all 35 HLA molecules covered with 20 or more phospho-ligands in the motif deconvolution. These sequence logos were compared to logos generated from unmodified peptides predicted by NetMHCpan-4.1 for the same HLA alleles (see Fig. 4, motifs predicted by NetMHCphosPan are included in Supplementary Fig. S2). This Fig. additionally highlights the overall preferred phosphosite location at P4. One clear exception from this was the preferred phosphosite position at P8 observed for HLA-B08:01 in agreement with [3x[3]Zarling, AL, Ficarro, SB, White, FM, Shabanowitz, J, Hunt, DF, and Engelhard, VH. Phosphorylated peptides are naturally processed and presented by major histocompatibility complex class I molecules in vivo. J Exp Med. 2000; 192: 1755–1762 (18. december)

Crossref | PubMed | Scopus (167)
| Google ScholarSee all References
]. Further comparing the motif for the unmodified peptides and phosphorylated peptides, and in line with earlier works, an enrichment of proline at P5 was observed. Several HLA molecules also had a preference for arginine at P1. These findings align with and extend the earlier findings by Solleder et al. [16x[16]Solleder, M, Guillaume, P, Racle, J, Michaux, J, Pak, H-S, and Müller, M. Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA Class I Ligands. Mol Cell Proteomics MCP. februar. 2020; 19: 390–404

Abstract | Full Text | Full Text PDF | PubMed | Scopus (21)
| Google ScholarSee all References
], where motifs for phosphorylated ligands to 22 HLA molecules characterized with 20 or more ligands were described (Supplementary Fig. S3). Accurate motifs were identified for 35 HLA molecules compared to 22 described by Solleder et al. [16x[16]Solleder, M, Guillaume, P, Racle, J, Michaux, J, Pak, H-S, and Müller, M. Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA Class I Ligands. Mol Cell Proteomics MCP. februar. 2020; 19: 390–404

Abstract | Full Text | Full Text PDF | PubMed | Scopus (21)
| Google ScholarSee all References
]. The larger HLA coverage described here is a direct result of the enhanced motif deconvolution potential of NNAlign_MA single data set approaches imposed by its ability to leverage information between data sets, reconcile peptides of variable length to a common core of 9 amino acids, and integrate prior motif knowledge from un-modified SA data.

Fig 4 Opens large image

Fig. 4

Sequence logo representations of binding motifs for HLA molecules characterized with 20 or more phosphorylated peptides. Binding motifs of NetMHCphosPan for phosphorylated peptides compared to NetMHCpan-4.1 binding motifs of unmodified peptides. Phosphosites are shown in purple. Noted in parenthesis is the number of peptides included in the logo generation. For NetMHCphosPan, ligands with predicted percentile rank scores greater than 20% were excluded. For NetMHCpan-4.1, logos were generated from the top 0.1% scoring predictions from a set of 125,000 random natural 8-11mer peptides.

This higher motif deconvolution power is further revealed when comparing the total number of ligands with deconvoluted HLA restriction between NNAlign_MA and the MixMHCp method applied by Solleder et al.. Here, NNAlign_MA assigns significantly more (38%, p-value 0.02 binomial test excluding ties) phospho-ligands to individual HLA molecules compared to that obtained by the MixMHCp deconvolution [16x[16]Solleder, M, Guillaume, P, Racle, J, Michaux, J, Pak, H-S, and Müller, M. Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA Class I Ligands. Mol Cell Proteomics MCP. februar. 2020; 19: 390–404

Abstract | Full Text | Full Text PDF | PubMed | Scopus (21)
| Google ScholarSee all References
] when including all peptides lengths and HLA molecules characterized by 20 or more ligands for at least one of the two methods, and 45% more (p<0.005, Binomial test excluding ties) when focusing only on 9-mer peptides (Supp. Table 2). Also in this analysis were peptides with percentile rank scores over 20 excluded from the NNAlign_MA analysis as those were regarded as contaminants. This analysis further underlines the importance of including peptides other than 9mers for motif characterization, as they constitute 47% of the pEL collected for the training data, and the increased deconvolution potential of the NNAlign_MA framework over other strategies. This increased coverage of overall 50% is most pronounced for HLA-B where NNAlign_MA allows for accurate motif deconvolution for 15 HLA molecules compared to 7 covered by 20 or more 9-mer ligands reported by Solleder et al. [16x[16]Solleder, M, Guillaume, P, Racle, J, Michaux, J, Pak, H-S, and Müller, M. Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA Class I Ligands. Mol Cell Proteomics MCP. februar. 2020; 19: 390–404

Abstract | Full Text | Full Text PDF | PubMed | Scopus (21)
| Google ScholarSee all References
].

Further, investigating the relative proportion of phosphorylated ligands assigned to HLA-A, HLA-B, and HLA-C loci alleles after the motif deconvolution for cell lines with HLA expression at all three loci/complete HLA allele information demonstrated the expected enrichment for HLA-A and HLA-B presentation (proportion assigned phosphorylated ligands for HLA-A and HLA-B of 46% and 35%) over HLA-C (19%). This proportion of HLA-C presented ligands is however substantially higher compared to that observed for unmodified peptides where the proportion of HLA-C presented ligands most often is close to 10% [17x[17]Alvarez, B, Reynisson, B, Barra, C, Buus, S, Ternette, N, and Connelley, T. NNAlign_MA; MHC peptidome deconvolution for accurate MHC binding motif characterization and improved T-cell epitope predictions. Mol Cell Proteomics. 2019; 18: 2459–2477 (december)

Abstract | Full Text | Full Text PDF | PubMed | Scopus (34)
| Google ScholarSee all References
]. As expected, this pattern was changed when investigating the C1R-C cell lines, only expressing HLA-B and HLA-C allele, where the observed proportions were 22% and 78% for HLA-B and HLA-C respectively. This result is in line with that reported in Solleder et al. where a clear enrichment of phosphorylated peptides among HLA-C ligands was also observed.

Finally, NetMHCphosPan was validated on an independent benchmark data set consisting of 225 unique phosphopeptides not sharing overlap with training data extracted from 6 cell lines. The performance was evaluated in terms of F-rank, PPV, AUC, and AUC 0.1. The F-rank metric (false positive rank), is calculated on a per protein-ligand basis from the proportion of source protein peptides with a prediction score greater than the observed ligand, and hence has a direct intuitive interpretation. This metric hence has an optimal value of 0, and 0.5 for random predictions. PPV, AUC, and AUC 0.1 values were calculated on the combined data set covering all 225 unique phospho ligands. For details on the data and performance evaluation refer to materials and methods. The performance was compared to PhosMHCpred developed by the Gfeller lab and to NetMHCpan-4.1 with all phosphosites replaced with “X”. Since PhosMHCpred is only able to predict 9-mers and a limited set of HLA alleles, we first conducted the benchmark on a subset of 101 (48%) phosphopeptides leaving out all non-9mers and the complete C1R-A24 cell line data (Fig. 5). Further, the list of possible alleles in the MA data was reduced to the set covered by PhosMHCpred, i.e excluding HLA-B27:05, HLA-C02:02 from GR; HLA-A24:02 from SiHa; and HLA-C03:03 from DoTc2. Additionally, 7 peptides were considered non-HLA presented peptides (i.e. MS co-immunoprecipitated/ contaminant) and therefore excluded from the benchmark, as they were predicted with percentile rank scores above 20% in at least one of the three models. This F-rank benchmark demonstrates an overall comparable performance between NetMHCphosPan and PhosMHCpred and a significantly improved performance of NetMHCphosPan compared to NetMHCpan-4.1 (p<0.001, binomial test excluding ties, Fig. 5 right panel). Looking at the individual data sets, this observation is maintained, with NetMHCphosPan significantly outperforming the two other methods for the larger DoTc2 data set (p<0.05 in both cases, binomial test excluding ties). Similar conclusions were obtained using the PPV, AUC, and AUC 0.1 performance metrics. Here the performance values for the 9mer data set in Fig. 5 were AUC = 0.955/0.936/0.942, AUC 0.1 = 0.773/0.743/0.761, PPV = 0.296/0.268/0.241 for NetMHCphosPan, phosMHCpred and NetMHCpan respectively.

Fig 5 Opens large image

Fig. 5

Boxplot of the F-rank performance values on 9-mer peptides in the benchmark over the different cell lines, and all cell line data collected under the label All. Methods included are NetMHCpan-4.1 with all phosphosites replaced with “X”, NetMHCphosPan, and PhosMHCpred, the method proposed by Solleder et al.. In parenthesis above each box-plot is noted the number of peptides associated with the respective cell lines. Only peptides with a predicted percentile rank score of 20 or less for at least one of the three methods are included. F-rank values equal to 0 are displayed as 0.0001.

Further, Fig. 6 shows the predictive power for all peptides (208, excluding as in Fig. 5 likely non-HLA presented data points) for all six data sets including for the MA data the complete set of expressed HLA alleles. This plot demonstrates that the high performance of NetMHCphosPan is maintained also outside the restrictions of PhosMHCpred and confirms the overall improved (though not statistically significant due to the small data set sizes) performance over NetMHCpan across all peptide lengths. Additionally, AUC, AUC 0.1 and PPV performance metrics showed an improved performance of NetMHCphosPan compared to NetMHCpan. Here AUC = 0.911/0.894, AUC 0.1 = 0.711/0.694, and PVV = 0.267/0.171 performance values for NetMHCphosPan and NetMHCpan were obtained, respectively. These results thus confirm the overall superior predictive power of NetMHCphosPan over the two other methods.

Fig 6 Opens large image

Fig. 6

Boxplot of the F-rank performance values for the benchmark over the different lengths and cell lines. The number of peptides associated with each length is noted in parenthesis. Methods included are NetMHCpan-4.1 with all phosphosites replaced with “X”, and NetMHCphosPan. Only peptides with a predicted percentile rank score of 20 or less for at least one of the two methods are included. F-rank values equal to 0 are displayed as 0.0001.

In summary, these two independent benchmark calculations demonstrate not only that NetMHCphosPan achieves comparable performance to that of PhosMHCpred for the limited set of peptide length and alleles covered by PhosMHCpred, but also that this high performance for NetMHCphosPan extends to a much increased allelic and peptide length coverage. The NetMHCphosPan method is available as a web-server at http://www.cbs.dtu.dk/services/NetMHCphosPan-1.0.

Discussion

In this study, we have applied the NNAlign_MA machine learning framework to develop an improved predictor for HLA antigen presentation of phosphorylated peptides. In contrast to the MixMHCp approach used for motif deconvolution in the earlier work by Solleder et al. [16x[16]Solleder, M, Guillaume, P, Racle, J, Michaux, J, Pak, H-S, and Müller, M. Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA Class I Ligands. Mol Cell Proteomics MCP. februar. 2020; 19: 390–404

Abstract | Full Text | Full Text PDF | PubMed | Scopus (21)
| Google ScholarSee all References
], the NNAlign_MA approach allows for the integration of information from mixed data types including both phosphorylated and unmodified peptides along with the MHC pseudo-sequence in an HLA pan-specific prediction model, enabling motif deconvolution by information shared between datasets. This has the potential to result in a motif deconvolution with broader allelic coverage and associated improved predictive power [17x[17]Alvarez, B, Reynisson, B, Barra, C, Buus, S, Ternette, N, and Connelley, T. NNAlign_MA; MHC peptidome deconvolution for accurate MHC binding motif characterization and improved T-cell epitope predictions. Mol Cell Proteomics. 2019; 18: 2459–2477 (december)

Abstract | Full Text | Full Text PDF | PubMed | Scopus (34)
| Google ScholarSee all References
].

Using this NNAlign_MA framework, a model termed NetMHCphosPan, was trained on an extensive data set of 3,653 phosphorylated HLA ligands obtained from 63 cell lines expressing a total of 69 distinct HLA molecules, combined with a large data set of un-modified binding affinity and single allele HLA eluted ligand data (obtained from the training data of NetMHCpan-4.1 [19x[19]Reynisson, B, Alvarez, B, Paul, S, Peters, B, and Nielsen, M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2. juli. 2020; 48: W449–W454

Crossref | PubMed | Scopus (0)
| Google ScholarSee all References
]). Performance estimated using cross-validation confirmed a significantly improved predictive power of NetMHCphosPan compared to NetMHCpan (run using "X" in the phosphosite) but also revealed an at first surprising high performance of the latter. Investigating the location of the phosphosite across the 3,653 modified ligands demonstrated a clear and highly statistically significant enrichment at the P4 position independent of peptide length. This observation is in agreement with earlier findings including those of Solleder et al. [16x[16]Solleder, M, Guillaume, P, Racle, J, Michaux, J, Pak, H-S, and Müller, M. Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA Class I Ligands. Mol Cell Proteomics MCP. februar. 2020; 19: 390–404

Abstract | Full Text | Full Text PDF | PubMed | Scopus (21)
| Google ScholarSee all References
] and provides an explanation for the relatively high performance of NetMHCpan for predicting the X modified peptides since P4 rarely serves as a key residue for HLA binding [19x[19]Reynisson, B, Alvarez, B, Paul, S, Peters, B, and Nielsen, M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2. juli. 2020; 48: W449–W454

Crossref | PubMed | Scopus (0)
| Google ScholarSee all References
].

Analyzing the relative proportion of phosphorylated ligands assigned to HLA-A, HLA-B, and HLA-C loci alleles after the motif deconvolution revealed the expected enrichment for HLA-A and HLA-B presentation over HLA-C. However, the proportion of HLA-C presented ligands was found to be higher in phosphorylated ligands compared to that most often observed for unmodified peptides, in line with the earlier results of Solleder et al. [16x[16]Solleder, M, Guillaume, P, Racle, J, Michaux, J, Pak, H-S, and Müller, M. Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA Class I Ligands. Mol Cell Proteomics MCP. februar. 2020; 19: 390–404

Abstract | Full Text | Full Text PDF | PubMed | Scopus (21)
| Google ScholarSee all References
].

Investigating binding motifs from the data set deconvolution, the P4 phosphosite enrichment was confirmed for the vast majority of individual HLA molecules, with one clear outlier, HLA-B*08:01 having a phosphosite enrichment at P8 in line with earlier work [3x[3]Zarling, AL, Ficarro, SB, White, FM, Shabanowitz, J, Hunt, DF, and Engelhard, VH. Phosphorylated peptides are naturally processed and presented by major histocompatibility complex class I molecules in vivo. J Exp Med. 2000; 192: 1755–1762 (18. december)

Crossref | PubMed | Scopus (167)
| Google ScholarSee all References
]. Additionally, analyses of the binding motifs suggested an enrichment of proline at P5 and arginine at P1 in phosphorylated peptides in line with kinase-domain-specific binding motifs such as CDC2 and CAMK2A [3x[3]Zarling, AL, Ficarro, SB, White, FM, Shabanowitz, J, Hunt, DF, and Engelhard, VH. Phosphorylated peptides are naturally processed and presented by major histocompatibility complex class I molecules in vivo. J Exp Med. 2000; 192: 1755–1762 (18. december)

Crossref | PubMed | Scopus (167)
| Google ScholarSee all References
,25x[25]Fenoy, E, Izarzugaza, JMG, Jurtz, V, Brunak, S, and Nielsen, M. A generic deep convolutional neural network framework for prediction of receptor-ligand interactions-NetPhosPan: application to kinase phosphorylation prediction. Bioinforma Oxf Engl. 2019; 35: 1098–1107 (1. april)

Crossref | PubMed | Scopus (9)
| Google ScholarSee all References
]. Finally, we observed very few examples of ligands with phosphosites located at the conventional HLA anchor binding positions (P1, P2, and PΩ).

Our analyses further confirmed the superior power of the pan-data NNAlign_MA method over single data set approaches allowing accurate motif deconvolution for 35 HLA molecules compared to the 22 obtained from the same data using the method of Solleder et al. [16x[16]Solleder, M, Guillaume, P, Racle, J, Michaux, J, Pak, H-S, and Müller, M. Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA Class I Ligands. Mol Cell Proteomics MCP. februar. 2020; 19: 390–404

Abstract | Full Text | Full Text PDF | PubMed | Scopus (21)
| Google ScholarSee all References
]. Using the pan-specific potential of the trained NetMHCpanPan model and a highly conservative sequence similarity distance threshold of 0.05 [26x[26]Hoof, I, Peters, B, Sidney, J, Pedersen, LE, Sette, A, and Lund, O. NetMHCpan, a method for MHC class I binding prediction beyond humans. Immunogenetics. januar. 2009; 61: 1–13

Crossref | PubMed | Scopus (502)
| Google ScholarSee all References
], these 35 molecules extrapolate to 6,861 distinct HLA molecules providing close to complete coverage of the worldwide prevalent HLA molecules.

Finally, the NetMHCphosPan method was benchmarked against the PhosMHCpred method developed by Solleder et al. and NetMHCpan-4.1 on a large independent dataset covering 208 phosphorylated HLA ligand either obtained from literature studies or generated specifically for this study. The subset of data covered by PhosMHCpred (this method only allows for prediction of 9-mer peptides, and only covers 22 HLA molecules) demonstrated a comparable performance between NetMHCphosPan and PhosMHCpred, and a statistically improved performance of NetMHCphosPan compared to NetMHCpan. This high performance of NetMHCphosPan was next confirmed when evaluating on the complete data set including all HLAs and peptides of length 8-13.

In conclusion, we have developed a method for prediction of HLA antigen presentation of phosphorylated peptides with improved predictive power as well as greatly expanded HLA and peptide length coverage compared to current state-of-the-art methods. We expect this method will have high impact for rational identification for novel immune targets and development of novel immunotherapies [27x[27]Leko, V and Rosenberg, SA. Identifying and Targeting Human Tumor Antigens for T Cell-Based Immunotherapy of Solid Tumors. Cancer Cell. 12. oktober. 2020; 38: 454–472

Abstract | Full Text | Full Text PDF | PubMed | Scopus (73)
| Google ScholarSee all References
,28x[28]Engelhard, VH, Obeng, RC, Cummings, KL, Petroni, GR, Ambakhutwala, AL, and Chianese-Bullock, KA. MHC-restricted phosphopeptide antigens: preclinical validation and first-in-humans clinical trial in participants with high-risk melanoma. J Immunother Cancer. maj. 2020; 8

Google ScholarSee all References
]. Further, the NNAlign_MA modeling framework is readily extendable to cover peptides with other post-translational modifications such as citrullination, deamidation, and glycosylation, providing an ideal and ready to apply platform to analyse and train prediction models for HLA presentation of such modified peptides once the relevant EL data becomes available.

Funding

This work was supported in part by the Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN272201200010C , and by the Science and Technology Council of Investigation (CONICET-Argentina). The authors declare that they have no competing interests.

Author contributions

MN and CB conceived the main conceptual project idea, CTR and CB performed most of the data handling and computational experiments. NT and XP generated the experimental data. The manuscript was written by CTR, CB and MN with contributions from all authors. All authors have read and approved the final version of the manuscript.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix. Supplementary materials

References

  1. [1]Neefjes, J, Jongsma, MLM, Paul, P, and Bakke, O. Towards a systems understanding of MHC class I and MHC class II antigen presentation. (december)Nat Rev Immunol. 2011; 11: 823–836
  2. [2]Andersen, MH, Bonfill, JE, Neisig, A, Arsequell, G, Sondergaard, I, and Valencia, G. Phosphorylated peptides can be transported by TAP molecules, presented by class I MHC molecules, and recognized by phosphopeptide-specific CTL. J Immunol Baltim Md 1950. 1. oktober. 1999; 163: 3812–3818
  3. [3]Zarling, AL, Ficarro, SB, White, FM, Shabanowitz, J, Hunt, DF, and Engelhard, VH. Phosphorylated peptides are naturally processed and presented by major histocompatibility complex class I molecules in vivo. (18. december)J Exp Med. 2000; 192: 1755–1762
  4. [4]Zarling, AL, Polefrone, JM, Evans, AM, Mikesh, LM, Shabanowitz, J, and Lewis, ST. Identification of class I MHC-associated phosphopeptides as targets for cancer immunotherapy. Proc Natl Acad Sci. 3. oktober. 2006; 103: 14889–14894
  5. [5]Robinson, J, Barker, DJ, Georgiou, X, Cooper, MA, Flicek, P, and Marsh, SGE. IPD-IMGT/HLA database. (oktober)Nucleic Acids Res. 2019; 31: gkz950
  6. [6]Nielsen, M, Andreatta, M, Peters, B, and Buus, S. Immunoinformatics: predicting peptide–MHC binding. Annu Rev Biomed Data Sci. 20. juli. 2020; 3: 191–215
  7. [7]Alpízar, A, Marino, F, Ramos-Fernández, A, Lombardía, M, Jeko, A, and Pazos, F. A molecular basis for the presentation of phosphorylated peptides by HLA-B antigens. Mol Cell Proteomics MCP. februar. 2017; 16: 181–193
  8. [8]Meyer, VS, Drews, O, Günder, M, Hennenlotter, J, Rammensee, H-G, and Stevanovic, S. Identification of natural MHC class II presented phosphopeptides and tumor-derived MHC class I phospholigands. J Proteome Res. 6. juli. 2009; 8: 3666–3674
  9. [9]Cobbold, M, De La Pena, H, Norris, A, Polefrone, JM, Qian, J, and English, AM. MHC class I-associated phosphopeptides are the targets of memory-like immunity in leukemia. (18. september203ra125-203ra125)Sci Transl Med. 2013; 5
  10. [10]Marcilla, M, Alpízar, A, Lombardía, M, Ramos-Fernandez, A, Ramos, M, and Albar, JP. Increased diversity of the HLA-B40 ligandome by the presentation of peptides phosphorylated at their main anchor residue*. Mol Cell Proteomics. februar. 2014; 13: 462–474
  11. [11]Bassani-Sternberg, M, Bräunlein, E, Klar, R, Engleitner, T, Sinitcyn, P, and Audehm, S. Direct identification of clinically relevant neoepitopes presented on native human melanoma tissue by mass spectrometry. (december)Nat Commun. 2016; 7: 13404
  12. [12]Abelin, JG, Keskin, DB, Sarkizova, S, Hartigan, CR, Zhang, W, and Sidney, J. Mass spectrometry profiling of HLA-associated peptidomes in mono-allelic cells enables more accurate epitope prediction. Immunity. februar. 2017; 46: 315–326
  13. [13]Mohammed, F, Cobbold, M, Zarling, AL, Salim, M, Barrett-Wilt, GA, and Shabanowitz, J. Phosphorylation-dependent interaction between antigenic peptides and MHC class I: a molecular basis for the presentation of transformed self. (november)Nat Immunol. 2008; 9: 1236–1243
  14. [14]Mohammed, F, Stones, DH, Zarling, AL, Willcox, CR, Shabanowitz, J, and Cummings, KL. The antigenic identity of human class I MHC phosphopeptides is critically dependent upon phosphorylation status. (15. august)Oncotarget. 2017; 8: 54160–54172
  15. [15]Mommen, GPM, Frese, CK, Meiring, HD, van Gaans-van den Brink, J, de Jong, APJM, and van Els, CACM. Expanding the detectable HLA peptide repertoire using electron-transfer/higher-energy collision dissociation (EThcD). Proc Natl Acad Sci. 25. marts. 2014; 111: 4507–4512
  16. [16]Solleder, M, Guillaume, P, Racle, J, Michaux, J, Pak, H-S, and Müller, M. Mass spectrometry based immunopeptidomics leads to robust predictions of phosphorylated HLA Class I Ligands. Mol Cell Proteomics MCP. februar. 2020; 19: 390–404
  17. [17]Alvarez, B, Reynisson, B, Barra, C, Buus, S, Ternette, N, and Connelley, T. NNAlign_MA; MHC peptidome deconvolution for accurate MHC binding motif characterization and improved T-cell epitope predictions. (december)Mol Cell Proteomics. 2019; 18: 2459–2477
  18. [18]Hornbeck, PV, Zhang, B, Murray, B, Kornhauser, JM, Latham, V, and Skrzypek, E. PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. januar. 2015; 43: D512–D520
  19. [19]Reynisson, B, Alvarez, B, Paul, S, Peters, B, and Nielsen, M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2. juli. 2020; 48: W449–W454
  20. [20]Nielsen, M, Lundegaard, C, and Lund, O. Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method. BMC Bioinformatics. 4. juli. 2007; 8: 238
  21. [21]Mei, S, Ayala, R, Ramarathinam, SH, Illing, PT, Faridi, P, and Song, J. Immunopeptidomic analysis reveals that deamidated HLA-bound peptides arise predominantly from deglycosylated precursors. Mol Cell Proteomics. juli. 2020; 19: 1236–1247
  22. [22]Purcell, AW, Ramarathinam, SH, and Ternette, N. Mass spectrometry-based identification of MHC-bound peptides for immunopeptidomics. Nat Protoc. juni. 2019; 14: 1687–1707
  23. [23]Wagih, O. ggseqlogo: a versatile R package for drawing sequence logos. (15. november)Hancock J, redaktør. Bioinformatics. 2017; 33: 3645–3647
  24. [24]Sharma, K, D'Souza, RCJ, Tyanova, S, Schaab, C, Wiśniewski, JR, and Cox, J. Ultradeep Human Phosphoproteome Reveals a Distinct Regulatory Nature of Tyr and Ser/Thr-Based Signaling. (september)Cell Rep. 2014; 8: 1583–1594
  25. [25]Fenoy, E, Izarzugaza, JMG, Jurtz, V, Brunak, S, and Nielsen, M. A generic deep convolutional neural network framework for prediction of receptor-ligand interactions-NetPhosPan: application to kinase phosphorylation prediction. (1. april)Bioinforma Oxf Engl. 2019; 35: 1098–1107
  26. [26]Hoof, I, Peters, B, Sidney, J, Pedersen, LE, Sette, A, and Lund, O. NetMHCpan, a method for MHC class I binding prediction beyond humans. Immunogenetics. januar. 2009; 61: 1–13
  27. [27]Leko, V and Rosenberg, SA. Identifying and Targeting Human Tumor Antigens for T Cell-Based Immunotherapy of Solid Tumors. Cancer Cell. 12. oktober. 2020; 38: 454–472
  28. [28]Engelhard, VH, Obeng, RC, Cummings, KL, Petroni, GR, Ambakhutwala, AL, and Chianese-Bullock, KA. MHC-restricted phosphopeptide antigens: preclinical validation and first-in-humans clinical trial in participants with high-risk melanoma. J Immunother Cancer. maj. 2020; 8
1These authors contributed equally to this work and share co-first authorship.

 

Linked Articles

Unknown widget #d2170c4d-a9cf-482f-ac17-ef77d57a1866

of type linkedContentList

Related Articles

Unknown widget #c2ffda61-8426-42f7-926b-03d7330eede2

of type relatedArticleListWidget