Research Article | Volume 10, 100025, June 2023

AIRR community curation and standardised representation for immunoglobulin and T cell receptor germline sets

Open Access | Published: February 16, 2023 | DOI: https://doi.org/10.1016/j.immuno.2023.100025

      Abstract

      Analysis of an individual's immunoglobulin or T cell receptor gene repertoire can provide important insights into immune function. High-quality analysis of adaptive immune receptor repertoire sequencing data depends upon accurate and relatively complete germline sets, but current sets are known to be incomplete. Established processes for the review and systematic naming of receptor germline genes and alleles require specific evidence and data types, but the discovery landscape is rapidly changing. To exploit the potential of emerging data, and to provide the field with improved state-of-the-art germline sets, an intermediate approach is needed that will allow the rapid publication of consolidated sets derived from these emerging sources. These sets must use a consistent naming scheme and allow refinement and consolidation into genes as new information emerges. Name changes should be minimised, but, where changes occur, the naming history of a sequence must be traceable. Here we outline the current issues and opportunities for the curation of germline IG/TR genes and present a forward-looking data model for building out more robust germline sets that can dovetail with current established processes. We describe interoperability standards for germline sets, and an approach to transparency based on principles of findability, accessibility, interoperability, and reusability.

      1. Introduction

      Immunoglobulin and T cell receptor (IG and TR) germline gene sets are compilations of curated sequences of known germline genes and their allelic variants. These include constant (C), joining (J), diversity (D), and variable (V) genes found within the IG and TR loci of many species. Historically, germline sets have focused on sequences sourced from genomic data that meet specific criteria [
      • Lefranc M.P.
      • Giudicelli V.
      • Duroux P.
      • Jabado-Michaloud J.
      • Folch G.
      • Aouinti S.
      • Carillon E.
      • Duvergey H.
      • Houles A.
      • Paysan-Lafosse T.
      • Hadi-Saljoqi S.
      • Sasorith S.
      • Lefranc G.
      • Kossida S.
      IMGT®, the international ImMunoGeneTics information system® 25 years on.
      ,
      • Retter I.
      • Althaus H.H.
      • Münch R.
      • Müller W.
      VBASE2, an integrative V gene database.
      ]. More recently, approaches have been developed that allow for the characterization of germline sequences from Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) and higher throughput genomic sequencing datasets. These next-generation data sources for novel IG and TR gene and allele discovery have the potential to significantly expand existing germline sets, an important point considering that for many applications, it has been demonstrated that the use of more comprehensive germline sets can substantially improve the accuracy of data analysis, even in supposedly simple measures such as somatic hypermutation rate [
      • Kaduk M.
      • Corcoran M.
      • Karlsson Hedestam G.B.
      Addressing IGHV gene structural diversity enhances immunoglobulin repertoire analysis: lessons from rhesus Macaque.
      ,
      • Collins A.M.
      • Peres A.
      • Corcoran M.M.
      • Watson C.T.
      • Yaari G.
      • Lees W.D.
      • Ohlin M.
      Commentary on Population matched (pm) germline allelic variants of immunoglobulin (IG) loci: relevance in infectious diseases and vaccination studies in human populations.
      ,
      • Jackson K.J.
      • Kos J.T.
      • Lees W.
      • Gibson W.S.
      • Smith M.L.
      • Peres A.
      • et al.
      A BALB/c IGHV reference set, defined by haplotype analysis of long-read VDJ-C sequences from F1 (BALB/c /C57BL/6) mice.
      ,
      • Scheepers C.
      • Shrestha R.K.
      • Lambson B.E.
      • Jackson K.J.L.
      • Wright I.A.
      • Naicker D.
      • Goosen M.
      • Berrie L.
      • Ismail A.
      • Garrett N.
      • Abdool Karim Q.
      • Abdool Karim S.S.
      • Moore P.L.
      • Travers S.A.
      • Morris L.
      Ability to develop broadly neutralizing HIV-1 antibodies is not restricted by the germline Ig gene repertoire.
      ]. In humans, the number of IG alleles reported by the IMmunoGeneTics Information System (IMGT) [
      • Lefranc M.P.
      • Giudicelli V.
      • Duroux P.
      • Jabado-Michaloud J.
      • Folch G.
      • Aouinti S.
      • Carillon E.
      • Duvergey H.
      • Houles A.
      • Paysan-Lafosse T.
      • Hadi-Saljoqi S.
      • Sasorith S.
      • Lefranc G.
      • Kossida S.
      IMGT®, the international ImMunoGeneTics information system® 25 years on.
      ] continues to grow each year (Fig. 1), suggesting that substantial allelic diversity remains undiscovered. Despite the advantages of combining the strengths of genomic and AIRR-seq data when deriving germline sets, the seamless integration of these data types for the construction of more comprehensive germline sets is not straightforward. This largely stems from constraints caused by current nomenclatures and curation processes, which require germline sequences to have both gene names and allele identifiers: in other words, for the sequences of alleles to be ‘mapped’ to identified genes at specific genomic locations. Germline sequences inferred from AIRR-seq, for example, cannot always be unequivocally mapped to a single location: in other words, there may be confidence that a germline sequence is observed, but not which gene it should be assigned to (Fig. 2). While fully mapped sets remain the long-term goal, there is an urgent need for mechanisms that will allow the growing body of germline sequence data discovered from next-generation sources to be published in interim form, in a codified manner that can easily be used by researchers. Such mechanisms should also allow for germline sequences and the germline sets they are part of to evolve through time transparently, as more information becomes available.
Fig. 1. Cumulative number of human IG alleles in IMGT databases (LIGM-DB to 2001 and IMGT/GENE-DB subsequently).
Fig. 2. Alleles inferred from AIRR-seq reads are derived from recombined VDJ sequences, meaning that the exact genes from which they arise (determined by location in the immune receptor locus) cannot be determined. In some cases, the location can be inferred by assessing sequence similarity to genes in the reference genome. However, the presence of multiple highly similar genes may make it impossible to determine the correct mapping unambiguously.
      The complexity of the IG/TR loci is manifested as single-nucleotide variation in genes (allelic variation) and in non-coding regions, and, in addition, structural variation, in which whole genes or segments of the locus are deleted or duplicated [
      • Pramanik S.
      • Cui X.
      • Wang H.Y.
      • Chimge N.O.
      • Hu G.
      • Shen L.
      • Gao R.
      • Li H.
      Segmental duplication as one of the driving forces underlying the diversity of the human immunoglobulin heavy chain variable gene region.
      ]. The challenges are such that significant obstacles remain before complete sets can be published even for those species that have received the most attention (Fig. 3). In humans, significant structural variation is seen in the IG heavy chain (IGH) and TR beta (TRB) loci [
      • Luo S.
      • Yu J.A.
      • Li H.
      • Song Y.S.
      Worldwide genetic variation of the IGHV and TRBV immune receptor gene families in humans.
      ,
      • Zhang J.Y.
      • Roberts H.
      • Flores D.S.C.
      • Cutler A.J.
      • Brown A.C.
      • Whalley J.P.
      • Mielczarek O.
      • Buck D.
      • Lockstone H.
      • Xella B.
      • Oliver K.
      • Corton C.
      • Betteridge E.
      • Bashford-Rogers R.
      • Knight J.C.
      • Todd J.A.
      • Band G.
      Using de novo assembly to identify structural variation of eight complex immune system gene regions.
      ], and many structural variants have been characterised. Others are still being discovered as additional haplotypes are resolved at nucleotide resolution. Because a single reference sequence cannot represent variation in the population as a whole, the current human reference genome, GRCh38, is missing genes that are present within some IG haplotypes and hence not all recognised genes have coordinates in GRCh38. These genes are, however, represented in alternate contigs that can be properly placed relative to GRCh38 [
      • Watson C.T.
      • Steinberg K.M.
      • Huddleston J.
      • Warren R.L.
      • Malig M.
      • Schein J.
      • Willsey A.J.
      • Joy J.B.
      • Scott J.K.
      • Graves T.A.
      • Wilson R.K.
      • Holt R.A.
      • Eichler E.E.
      • Breden F.
      Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation.
      ,
      • Milner E.C.
      • Hufnagle W.O.
      • Glas A.M.
      • Suzuki I.
      • Alexander C.
      Polymorphism and utilization of human VH Genes.
      ].
Fig. 3. Challenges in the characterization of IG/TR genes, and the current status in three key species.
      Compared to other species, our knowledge of the functional genes and common structural variants in the human IG loci is more complete, and thus many newly identified but unmapped sequences can be mapped following detailed review. The general principle dictates that when such a sequence aligns with known alleles of a single gene, G, with high sequence identity, and with substantially lower identity to the alleles of other genes, the sequence can reasonably be mapped/assigned to G. Such unmapped sequences can arise from traditional sequencing methods (e.g., sequencing from targeted PCR amplicons or large-insert clones), or by inference from the transcriptome. Several tools have been developed that facilitate the inference of germline alleles from AIRR-seq data [
      • Zhang W.
      • Wang I.-M.
      • Wang C.
      • Lin L.
      • Chai X.
      • Wu J.
      • Bett A.J.
      • Dhanasekaran G.
      • Casimiro D.R.
      • Liu X.
      IMPre: an accurate and efficient software for prediction of T- and B-cell receptor germline genes and alleles from rearranged repertoire data.
      ,
      • Corcoran M.M.
      • Vázquez Bernat N.
      • Phad Ganesh E.
      • Stahl-Hennig Christiane
      • Sumida N.
      • Persson M.A.A.
      • Martin M.
      • Karlsson Hedestam G.B.
      Production of individualized V gene databases reveals high levels of immunoglobulin genetic diversity.
      ,
      • Ralph D.K.
      • Matsen F.A.
      Per-sample immunoglobulin germline inference from B cell receptor deep sequencing data.
      ,
      • Yu Y.
      • Ceredig R.
      • Seoighe C.
      LymAnalyzer: a tool for comprehensive analysis of next generation sequencing data of T cell receptors and immunoglobulins.
      ,
      • Gadala-Maria D.
      • Yaari G.
      • Uduman M.
      • Kleinstein S.H.
      Automated analysis of high-throughput B-cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles.
      ]. Since their first description in 2015, these tools and the inferences produced by them have received substantial scrutiny [
      • Ohlin M.
      • Scheepers C.
      • Corcoran M.
      • Lees W.D.
      • Busse C.E.
      • Bagnara D.
      • Thörnqvist L.
      • Bürckert J.-P.
      • Jackson K.J.L.
      • Ralph D.
      • Schramm C.A.
      • Marthandan N.
      • Breden F.
      • Scott J.
      • Matsen IV F.A.
      • Greiff V.
      • Yaari G.
      • Kleinstein S.H.
      • Christley S.
      • Sherkow J.S.
      • Kossida S.
      • Lefranc M.-P.
      • van Zelm M.C.
      • Watson C.T.
      • Collins A.M.
      Inferred allelic variants of immunoglobulin receptor genes: a system for their evaluation, documentation, and naming.
      ,
      • Yang X.
      • Zhu Y.
      • Chen S.
      • Zeng H.
      • Guan J.
      • Wang Q.
      • Lan C.
      • Sun D.
      • Yu X.
      • Zhang Z.
      Novel allele detection tool benchmark and application with antibody repertoire sequencing dataset.
      ,
      • Thörnqvist L.
      • Ohlin M.
      Critical steps for computational inference of the 3’-end of novel alleles of immunoglobulin heavy chain variable genes - illustrated by an allele of IGHV3-7.
      ,
      • Kirik U.
      • Greiff L.
      • Levander F.
      • Ohlin M.
      Parallel antibody germline gene and haplotype analyses support the validity of immunoglobulin germline gene inference and discovery.
      ,
      • Vázquez Bernat N.
      • Corcoran M.
      • Hardt U.
      • Kaduk M.
      • Phad G.E.
      • Martin M.
      • Karlsson Hedestam G.B.
      High-quality library preparation for NGS-Based immunoglobulin germline gene inference and repertoire expression analysis.
      ] and there is now a consolidated understanding of the technical capabilities and limitations of the approach. Collaboration between the Inferred Allele Review Committee (IARC), of the AIRR Community (AIRR-C), the International Union of Immunological Societies (IUIS), and IMGT has allowed the incorporation of sequences inferred from AIRR-seq into IMGT's human IG germline sets. To date, 37 inferred human IG alleles have been affirmed by the IARC, 32 of which have been included in IMGT sets.
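The mapping principle described earlier in this paragraph (assign an unmapped sequence to gene G when it matches alleles of G with high identity and matches alleles of every other gene with substantially lower identity) can be sketched in a few lines. This is an illustrative sketch only, not the review procedure used by IARC or IMGT: the function names, the thresholds `min_identity` and `min_margin`, and the assumption that sequences are already aligned to equal length are all hypothetical.

```python
from typing import Dict, List, Optional


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def assign_gene(
    query: str,
    known: Dict[str, List[str]],
    min_identity: float = 0.95,  # hypothetical threshold for "high identity"
    min_margin: float = 0.03,    # hypothetical gap required to the next-best gene
) -> Optional[str]:
    """Return the gene the query can reasonably be assigned to, or None if ambiguous."""
    # Best identity of the query against any known allele of each gene.
    best = {
        gene: max(identity(query, allele) for allele in alleles)
        for gene, alleles in known.items()
        if alleles
    }
    if not best:
        return None
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    top_gene, top_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_score >= min_identity and top_score - runner_up >= min_margin:
        return top_gene
    return None  # too dissimilar, or too close to more than one gene
```

When two genes carry identical or near-identical alleles, the margin test fails and the function returns `None`, which corresponds to the unmappable case illustrated in Fig. 2.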
      AIRR-seq inference studies, building on earlier, foundational work based on genomic assemblies [
      • Ramesh A.
      • Darko S.
      • Hua A.
      • Overman G.
      • Ransier A.
      • Francica J.R.
      • Trama A.
      • Tomaras G.D.
      • Haynes B.F.
      • Douek D.C.
      • Kepler T.B.
      Structure and diversity of the rhesus macaque immunoglobulin loci through multiple de novo genome assemblies.
      ,
      • Retter I.
      • Chevillard C.
      • Scharfe M.
      • Conrad A.
      • Hafner M.
      • Im T.-H.
      • Ludewig M.
      • Nordsiek G.
      • Severitt S.
      • Thies S.
      • Mauhar A.
      • Blöcker H.
      • Müller W.
      • Riblet R.
      Sequence and characterization of the Ig heavy chain constant and partial variable region of the mouse strain 129S1.
      ,
      • Cirelli K.M.
      • Carnathan D.G.
      • Nogal B.
      • Martin J.T.
      • Rodriguez O.L.
      • Upadhyay A.A.
      • Enemuo C.A.
      • Gebru E.H.
      • Choe Y.
      • Viviano F.
      • Nakao C.
      • Pauthner M.G.
      • Reiss S.
      • Cottrell C.A.
      • Smith M.L.
      • Bastidas R.
      • Gibson W.
      • Wolabaugh A.N.
      • Melo M.B.
      • Cossette B.
      • Kumar V.
      • Patel N.B.
      • Tokatlian T.
      • Menis S.
      • Kulp D.W.
      • Burton D.R.
      • Murrell B.
      • Schief W.R.
      • Bosinger S.E.
      • Ward A.B.
      • Watson C.T.
      • Silvestri G.
      • Irvine D.J.
      • Crotty S.
      Slow delivery immunization enhances HIV neutralizing antibody and germinal center responses via modulation of immunodominance.
      ], have also enabled the discovery of large numbers of previously unknown germline allele sequences in the mouse and macaque. However, in contrast to human, our understanding of haplotype diversity in these species is limited. This knowledge gap presents challenges for curating and compiling germline sets using historical standards. We use these species here to illustrate current issues.
      Commonly used inbred laboratory strains, initially derived from diverse subspecies of wild-type mice, are now understood to exhibit substantial inter-strain variation in the IG loci [
      • Jackson K.J.
      • Kos J.T.
      • Lees W.
      • Gibson W.S.
      • Smith M.L.
      • Peres A.
      • et al.
      A BALB/c IGHV reference set, defined by haplotype analysis of long-read VDJ-C sequences from F1 (BALB/c /C57BL/6) mice.
      ,
      • Watson C.T.
      • Kos J.T.
      • Gibson W.S.
      • Newman L.
      • Deikus G.
      • Busse C.E.
      • Smith M.L.
      • Jackson K.J.
      • Collins A.M.
      A comparison of immunoglobulin IGHV, IGHD and IGHJ genes in wild-derived and classical inbred mouse strains.
      ,
      • Collins A.M.
      • Watson C.T.
      Immunoglobulin light chain gene rearrangements, receptor editing and the development of a self-tolerant antibody repertoire.
      ,

      Kos J.T., Safonova Y., Shields K.M., Silver C.A., Lees W.D., Collins A.M., et al. Characterization of extensive diversity in immunoglobulin light chain variable germline genes across biomedically important mouse strains. bioRxiv 2022:489089. doi:10.1101/2022.05.01.489089.

      ]. For example, AIRR-seq analysis has revealed that fewer than 5% of IGHV sequences curated in C57BL/6 are found in the germline repertoire of BALB/c. In the BALB/c strain, reference assemblies from the Sanger Mouse Genomes Project (https://www.sanger.ac.uk/data/mouse-genomes-project/) were found to contain only 44% of the IGH alleles inferred by AIRR-seq [
      • Jackson K.J.
      • Kos J.T.
      • Lees W.
      • Gibson W.S.
      • Smith M.L.
      • Peres A.
      • et al.
      A BALB/c IGHV reference set, defined by haplotype analysis of long-read VDJ-C sequences from F1 (BALB/c /C57BL/6) mice.
      ]. Critically, these reference assemblies were compiled using short-read next-generation sequencing (NGS) [
      • Lilue J.
      • Doran A.G.
      • Fiddes I.T.
      • Abrudan M.
      • Armstrong J.
      • Bennett R.
      • Chow W.
      • Collins J.
      • Collins S.
      • Czechanski A.
      • Danecek P.
      • Diekhans M.
      • Dolle D.-D.
      • Dunn M.
      • Durbin R.
      • Earl D.
      • Ferguson-Smith A.
      • Flicek P.
      • Flint J.
      • Frankish A.
      • Fu B.
      • Gerstein M.
      • Gilbert J.
      • Goodstadt L.
      • Harrow J.
      • Howe K.
      • Ibarra-Soria X.
      • Kolmogorov M.
      • Lelliott C.J.
      • Logan D.W.
      • Loveland J.
      • Mathews C.E.
      • Mott R.
      • Muir P.
      • Nachtweide S.
      • Navarro F.C.P.
      • Odom D.T.
      • Park N.
      • Pelan S.
      • Pham S.K.
      • Quail M.
      • Reinholdt L.
      • Romoth L.
      • Shirley L.
      • Sisu C.
      • Sjoberg-Herrera M.
      • Stanke M.
      • Steward C.
      • Thomas M.
      • Threadgold G.
      • Thybert D.
      • Torrance J.
      • Wong K.
      • Wood J.
      • Yalcin B.
      • Yang F.
      • Adams D.J.
      • Paten B.
      • Keane T.M.
      Sixteen diverse laboratory mouse reference genomes define strain-specific haplotypes and novel functional loci.
      ]: the mapping and assembly of short-read data in the IG loci is problematic. This result emphasises that, even in inbred strains, the close gene spacing and repetitive nature of the IGH locus require advanced genome sequencing and assembly techniques to identify IGH alleles reliably. While broader surveys in additional strains have identified germline alleles with the same sequence or high sequence identity in multiple strains, in the absence of genomic assemblies for these strains, it has not been possible to determine whether these alleles map to the same gene. Genomic sequencing will be required to establish gene-to-allele mappings and to determine whether a mapping to an overall structure that spans strains is viable. At the time of writing, IG germline sets listing alleles inferred in 20 commonly used strains are available on the AIRR-C Open Germline Receptor Database (OGRDB) [
      • Lees W.
      • Busse C.E.
      • Corcoran M.
      • Ohlin M.
      • Scheepers C.
      • Matsen F.A.
      • Yaari G.
      • Watson C.T.
      • Community AIRR
      • Collins A.
      • Shepherd A.J.
      OGRDB: a reference database of inferred immune receptor genes.
      ], using the new schema described in this document (Table 1).
Table 1. Community-curated mouse germline sets currently listed on OGRDB (sets for other species will be added as available). IGHV sequences are taken from (11) and IGKV/IGLV from (23). These sets may be accessed at https://ogrdb.airr-community.org/germline_sets/Mouse.

Strain         Type    Sequences
129S1/SvImJ    IGKV    91
129S1/SvImJ    IGLV    3
A/J            IGKV    102
A/J            IGLV    3
AKR/J          IGKV    85
AKR/J          IGLV    3
BALB/c         IGHV    164
BALB/c/ByJ     IGLV    3
BALB/c/ByJ     IGKV    98
C3H/HeJ        IGKV    96
C3H/HeJ        IGLV    3
C57BL/6        IGHV    102
C57BL/6J       IGKV    91
C57BL/6J       IGLV    3
CAST/EiJ       IGKV    88
CAST/EiJ       IGLV    9
CBA/J          IGKV    82
CBA/J          IGLV    3
DBA/1J         IGKV    104
DBA/1J         IGLV    3
DBA/2J         IGKV    100
DBA/2J         IGLV    3
LEWES/EiJ      IGKV    87
LEWES/EiJ      IGLV    4
MRL/MpJ        IGKV    72
MRL/MpJ        IGLV    3
MSM/MsJ        IGKV    83
MSM/MsJ        IGLV    5
NOD/ShiLtJ     IGKV    62
NOD/ShiLtJ     IGLV    3
NOR/LtJ        IGKV    80
NOR/LtJ        IGLV    3
NZB/BlNJ       IGKV    105
NZB/BlNJ       IGLV    3
PWD/PhJ        IGKV    89
PWD/PhJ        IGLV    3
SJL/J          IGKV    67
SJL/J          IGLV    3
      In the macaque, Vázquez Bernat et al. [
      • Vázquez Bernat N.
      • Corcoran M.
      • Nowak I.
      • Kaduk M.
      • Dopico X.C.
      • Narang S.
      • Maisonasse P.
      • Dereuddre-Bosquet N.
      • Murrell B.
      • Karlsson Hedestam G.B.
      Rhesus and cynomolgus macaque immunoglobulin heavy-chain genotyping yields comprehensive databases of germline VDJ alleles.
      ] recently used AIRR-seq repertoires from 45 rhesus and cynomolgus macaques to uncover several hundred previously unknown allele sequences, many of which were additionally confirmed via PCR amplification of unrearranged genomic material. The results were compiled into a database, KIMDB (http://kimdb.gkhlab.se/). To investigate the potential impacts of germline databases on AIRR-seq analysis, Kaduk et al. [
      • Kaduk M.
      • Corcoran M.
      • Karlsson Hedestam G.B.
      Addressing IGHV gene structural diversity enhances immunoglobulin repertoire analysis: lessons from rhesus Macaque.
      ] compared annotations of a single AIRR-seq dataset produced using the KIMDB rhesus macaque germline database with those produced using the corresponding IMGT germline set, which is largely derived from the rheMac10 reference assembly [
      • Warren W.C.
      • Harris R.A.
      • Haukness M.
      • Fiddes I.T.
      • Murali S.C.
      • Fernandes J.
      • Dishuck P.C.
      • Storer J.M.
      • Raveendran M.
      • Hillier L.W.
      • Porubsky D.
      • Mao Y.
      • Gordon D.
      • Vollger M.R.
      • Lewis A.P.
      • Munson K.M.
      • DeVogelaere E.
      • Armstrong J.
      • Diekhans M.
      • Walker J.A.
      • Tomlinson C.
      • Graves-Lindsay T.A.
      • Kremitzki M.
      • Salama S.R.
      • Audano P.A.
      • Escalona M.
      • Maurer N.W.
      • Antonacci F.
      • Mercuri L.
      • Maggiolini F.A.M.
      • Catacchio C.R.
      • Underwood J.G.
      • O'Connor D.H.
      • Sanders A.D.
      • Korbel J.O.
      • Ferguson B.
      • Kubisch H.M.
      • Picker L.
      • Kalin N.H.
      • Rosene D.
      • Levine J.
      • Abbott D.H.
      • Gray S.B.
      • Sanchez M.M.
      • Kovacs-Balint Z.A.
      • Kemnitz J.W.
      • Thomasy S.M.
      • Roberts J.A.
      • Kinnally E.L.
      • Capitanio J.P.
      • Skene J.H.P.
      • Platt M.
      • Cole S.A.
      • Green R.E.
      • Ventura M.
      • Wiseman R.W.
      • Paten B.
      • Batzer M.A.
      • Rogers J.
      • Eichler E.E.
      Sequence diversity analyses of an improved rhesus macaque genome enhance its biomedical utility.
      ,
      • Nguefack Ngoune V.
      • Bertignac M.
      • Georga M.
      • Papadaki A.
      • Albani A.
      • Folch G.
      • Jabado-Michaloud J.
      • Giudicelli V.
      • Duroux P.
      • Lefranc M.-P.
      • Kossida S.
      IMGT® biocuration and analysis of the rhesus monkey IG Loci.
      ]. Analysis with the IMGT set overestimated somatic hypermutation levels, as a result of missing genes and alleles. Further examination of the two databases demonstrated that substantial structural variation between animals was reflected in KIMDB but missing from the IMGT germline set.
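The effect described above, where an incomplete germline set inflates apparent somatic hypermutation, can be illustrated with a toy calculation. The sketch below is not the method used by Kaduk et al.; the allele names `TOYV1*01` and `TOYV1*02` and all sequences are invented for illustration, and sequences are assumed to be pre-aligned and of equal length.

```python
from typing import Dict, Tuple


def mismatches(a: str, b: str) -> int:
    """Count differing positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be of equal length")
    return sum(x != y for x, y in zip(a, b))


def closest_germline(read: str, germline_set: Dict[str, str]) -> Tuple[str, int]:
    """Return (allele name, apparent mutation count) for the best-matching germline allele."""
    return min(
        ((name, mismatches(read, seq)) for name, seq in germline_set.items()),
        key=lambda pair: pair[1],
    )


# Invented toy alleles: *02 differs from *01 at two positions.
complete_set = {
    "TOYV1*01": "ACGTACGTACGTACGTACGT",
    "TOYV1*02": "ACGAACGTACGTACCTACGT",
}
incomplete_set = {"TOYV1*01": complete_set["TOYV1*01"]}  # *02 is missing

# A read derived from *02 carrying one genuine somatic mutation.
read = "ACGAACGTACTTACCTACGT"

print(closest_germline(read, complete_set))    # one apparent mutation against *02
print(closest_germline(read, incomplete_set))  # three apparent mutations against *01
```

With the allele of origin missing, the two germline differences between the alleles are miscounted as somatic mutations, tripling the apparent mutation count in this toy case.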
      In the mouse and macaque, and in other non-human species, the lack of a detailed genomic understanding means that alleles determined from transcriptomic and/or genomic information cannot as yet be mapped to genes. Their sequences serve to emphasise the current lack of understanding of the genomic structure of these two species. Hence, outside human, outlets for the efficient sharing of many identified but unlocalised germline alleles are not available, leading to confusion among researchers and impeding scientific progress and reproducibility.
      While germline sets have been published for many species in addition to those covered above, they are largely based on the annotation of published reference assemblies. Experience from mouse and macaque as just summarised suggests that relying solely on a single reference assembly, especially if constructed from short-read NGS, is likely to provide poor coverage of genes and alleles, particularly in the important and complex IGH locus. Augmentation with sequences from other sources, specifically inference from AIRR-seq repertoires, can improve coverage, and also identify sequences derived from the assembly that are not observed in expressed repertoires, and may therefore be erroneous. Germline sets can therefore be improved through a hierarchy of annotation, initially starting with annotation of a reference assembly and AIRR-seq repertoires; subsequently adding further repertoires and confirming inferred allele sequences with targeted PCR amplification, cloning and Sanger sequencing, and ultimately utilising high-fidelity genomic sequencing at scale. However, this approach requires a schema that allows allele sequences to be incorporated in a germline set regardless of whether they have been mapped to a gene, in contrast to current practice.
For IG/TR loci, the close spacing of many highly similar genes and pseudogenes requires meticulous assembly and the use of longer reads than those generally employed for whole-genome sequencing (typically >8 kilobase reads, compared to the 64 or 128 base paired reads that have historically been used in short-read whole-genome sequencing). Methods for high-volume, high-fidelity genomic sequencing of receptor loci are reaching maturity [
      • Rodriguez O.L.
      • Gibson W.S.
      • Parks T.
      • Emery M.
      • Powell J.
      • Strahl M.
      • Deikus G.
      • Auckland K.
      • Eichler E.E.
      • Marasco W.A.
      • Sebra R.
      • Sharp A.J.
      • Smith M.L.
      • Bashir A.
      • Watson C.T.
      A novel framework for characterizing genomic haplotype diversity in the human immunoglobulin heavy chain locus.
      ,
      • Lin M.J.
      • Lin Y.C.
      • Chen N.C.
      • Luo A.C.
      • Lai S.K.
      • Hsu C.L.
      • et al.
      Profiling genes encoding the adaptive immune receptor repertoire with gAIRR Suite.
      ]. From our own work and that of others, we expect high-quality genomic sequencing of IG/TR loci from many hundreds of individuals and species to emerge over the next 2-3 years [
      • Lin M.J.
      • Lin Y.C.
      • Chen N.C.
      • Luo A.C.
      • Lai S.K.
      • Hsu C.L.
      • et al.
      Profiling genes encoding the adaptive immune receptor repertoire with gAIRR Suite.
      ,
      • Gibson W.S.
      • Rodriguez O.L.
      • Shields K.
      • Silver C.A.
      • Dorgham A.
      • Emery M.
      • et al.
      Characterization of the immunoglobulin lambda chain locus from diverse populations reveals extensive genetic variation.
      ,
      • Rodriguez O.L.
      • Silver C.A.
      • Shields K.
      • Smith M.L.
      • Watson C.T.
      Targeted long-read sequencing facilitates phased diploid assembly and genotyping of the human T cell receptor alpha, delta and beta loci.
      ]. While alleles identified from genomic sequencing may by their nature be localised within the assembly, substantial work is still needed to integrate the results from multiple assemblies, identify structural variants, and resolve technical errors in sequencing and assembly. Combining genomic sequencing with AIRR-seq inference of the same subjects is helpful in this respect. The combined use of advanced genomic sequencing and AIRR-seq inference currently offers the best path forward for capturing both the high population diversity in the IG/TR loci and the chromosomal location of the genes.
      The schema and process outlined in this work allow unlocalised alleles to be named using a consistent naming scheme, incorporated into curated germline sets without detailed understanding of the genomic structure, and rapidly published. Allele sequences can be refined and mapped into genes as new information emerges, while preserving traceability and minimising name changes. Today, in the absence of such a framework, researchers are left to search the literature for germline sets published by particular groups, and to reconcile differences in naming and in levels of support for themselves, often discovering that the same allele has been identified in more than one study and in some cases given different names. As an example, a rhesus macaque allele identified as IGHV2-ABG*01 in [
      • Ramesh A.
      • Darko S.
      • Hua A.
      • Overman G.
      • Ransier A.
      • Francica J.R.
      • Trama A.
      • Tomaras G.D.
      • Haynes B.F.
      • Douek D.C.
      • Kepler T.B.
      Structure and diversity of the rhesus macaque immunoglobulin loci through multiple de novo genome assemblies.
      ] is listed as IGHV2-118*01 in [
      • Vázquez Bernat N.
      • Corcoran M.
      • Nowak I.
      • Kaduk M.
      • Dopico X.C.
      • Narang S.
      • Maisonasse P.
      • Dereuddre-Bosquet N.
      • Murrell B.
      • Karlsson Hedestam G.B.
      Rhesus and cynomolgus macaque immunoglobulin heavy-chain genotyping yields comprehensive databases of germline VDJ alleles.
      ]. The same sequence, without the final 3’ nucleotide, was listed in IMGT as IGHV2-1*01 prior to September 2020 and is currently listed there (including the final nucleotide) as IGHV2-174*02.
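The reconciliation burden described above can be made concrete with a small sketch that groups alleles from several published sets by exact sequence and reports those carrying more than one name. The source labels and the placeholder sequence are invented for illustration (they are not the real allele sequences), and the function is a hypothetical helper rather than part of any AIRR-C tool.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_aliases(
    sets: Dict[str, Dict[str, str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Map each sequence to the (source, name) pairs that use more than one name for it."""
    by_sequence: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for source, alleles in sets.items():
        for name, sequence in alleles.items():
            # Normalise case so that formatting differences do not hide a match.
            by_sequence[sequence.upper()].append((source, name))
    return {
        sequence: entries
        for sequence, entries in by_sequence.items()
        if len({name for _, name in entries}) > 1
    }


# Placeholder data: names are taken from the text, the sequence is invented.
published_sets = {
    "study_A": {"IGHV2-ABG*01": "ACGTACGTACGT"},
    "study_B": {"IGHV2-118*01": "acgtacgtacgt"},
}

for sequence, entries in find_aliases(published_sets).items():
    print(sequence, entries)
```

Exact matching of this kind would not detect the truncated listing mentioned above, which lacked the final 3’ nucleotide; catching such cases needs prefix or alignment-based comparison, which is one reason recorded naming history is more reliable than after-the-fact reconciliation.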
      In this work, we describe a schema that allows for the effective integration of genomic and inferred data without the constraint of gene mapping. It is designed to be flexible and responsive to new data types and diverse nomenclatures. By storing rich metadata alongside each sequence in the germline set, it enables traceability, cooperation between teams in the development of germline sets, and more effective integration with other genomic and immunological data types. While it is likely to require significant further development in the light of experience, it provides a foundation from which to address the large increase in data volumes and types that can be anticipated in the next few years. Finally, we define an interoperable standard format in which germline sets can be loaded into tools for gene annotation, gene usage statistics, somatic hypermutation profiling, lineage reconstruction, etc., making it easier for both users and developers to keep their tools up to date. The overall approach has been adopted by the Germline Database Working Group of the AIRR Community (AIRR-C) and is supported by AIRR-C standards and repositories as described below.
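As a rough illustration of the properties listed above (an allele can enter a set before it is mapped to a gene, and every rename remains traceable), a record might be modelled as follows. This is a minimal sketch under stated assumptions: the class and field names are hypothetical and are not the field names of the AIRR-C germline schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AlleleRecord:
    """Hypothetical allele entry; field names are illustrative, not the AIRR-C schema."""

    label: str                       # current name of the allele
    sequence: str                    # nucleotide sequence
    gene: Optional[str] = None       # None while the allele is unlocalised
    previous_labels: List[str] = field(default_factory=list)  # naming history
    evidence: List[str] = field(default_factory=list)         # e.g. accession identifiers

    @property
    def is_mapped(self) -> bool:
        """True once the allele has been assigned to a gene."""
        return self.gene is not None

    def rename(self, new_label: str) -> None:
        """Change the label while keeping the old one in the naming history."""
        if new_label != self.label:
            self.previous_labels.append(self.label)
            self.label = new_label

    def map_to_gene(self, gene: str) -> None:
        """Record a gene assignment once genomic evidence supports it."""
        self.gene = gene


# Usage with invented labels: publish first, map and rename later.
record = AlleleRecord(label="TEMP-0001", sequence="ACGTACGT")
record.map_to_gene("TOYV1")
record.rename("TOYV1*01")
print(record.label, record.previous_labels, record.is_mapped)
```

Keeping the gene assignment optional is what allows unlocalised alleles to be published immediately, while the history list preserves a path from any earlier name to the current one.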
      To promote transparency and re-use, we adopt the FAIR guiding principles for scientific data management [
      • Wilkinson M.D.
      • Dumontier M.
      • Aalbersberg Ij.J.
      • Appleton G.
      • Axton M.
      • Baak A.
      • Blomberg N.
      • Boiten J.W.
      • da Silva Santos L.B.
      • Bourne P.E.
      • Bouwman J.
      • Brookes A.J.
      • Clark T.
      • Crosas M.
      • Dillo I.
      • Dumon O.
      • Edmunds S.
      • Evelo C.T.
      • Finkers R.
      • Gonzalez-Beltran A.
      • Gray A.J.G.
      • Groth P.
      • Goble C.
      • Grethe J.S.
      • Heringa J.
      • ’t Hoen P.A.C.
      • Hooft R.
      • Kuhn T.
      • Kok R.
      • Kok J.
      • Lusher S.J.
      • Martone M.E.
      • Mons A.
      • Packer A.L.
      • Persson B.
      • Rocca-Serra P.
      • Roos M.
      • van Schaik R.
      • Sansone S.-A.
      • Schultes E.
      • Sengstag T.
      • Slater T.
      • Strawn G.
      • Swertz M.A.
      • Thompson M.
      • van der Lei J.
      • van Mulligen E.
      • Velterop J.
      • Waagmeester A.
      • Wittenburg P.
      • Wolstencroft K.
      • Zhao J.
      • Mons B.
      The FAIR guiding principles for scientific data management and stewardship.
      ]. To ensure that germline sets and the sequences they contain are findable, we define a schema that contains rich metadata, and assign globally unique identifiers to schema objects such as alleles and genes. For accessibility, we utilise the existing AIRR Standards-compliant infrastructure such as the AIRR Data Commons [
      • Christley S.
      • Aguiar A.
      • Blanck G.
      • Breden F.
      • Bukhari S.A.C.
      • Busse C.E.
      • Jaglale J.
      • Harikrishnan S.L.
      • Laserson U.
      • Peters B.
      • Rocha A.
      • Schramm C.A.
      • Taylor S.
      • Vander Heiden J.A.
      • Zimonja B.
      • Watson C.T.
      • Corrie B.
      • Cowell L.G.
      The ADC API: a web API for the programmatic query of the AIRR data commons.
      ] and OGRDB, providing open platforms through which the germline sets can be accessed. Interoperability is currently a key problem: many tools that make use of germline sets are provided with pre-installed sets that are difficult to update. To address this, we describe a standardised format that can be used by such tools. For reusability, we include rich metadata, including fields that describe provenance, assist with traceability, and encourage open licensing. We are publishing the germline sets with an unrestricted licence and making the source code freely available.

      2. Results

      The results are organised into three sections (Fig. 4):
      • A schema that enables germline sets to contain rich information, and, importantly, can ensure that an identified germline sequence can be tracked through time, even if its name changes in the light of new information (Section 2.1).
      • Tooling that supports germline allele review, and the publication and use of germline sets that follow the schema (Section 2.2).
      • A community approach that allows researchers to co-operate in the development of germline sets in their species of interest, utilising the functionality provided by the schema and tooling (Section 2.3).

      2.1 A schema and terminology for gene and allele curation

      As a first step towards developing the present Germline extension of the AIRR Schema, we identify common types of germline information, the names of which are used throughout the manuscript (Fig. 5a):
      • Sequence: A sequence of nucleic acids that was observed in or inferred from a single individual.
      • Allele: A known – but potentially unmapped – region within the genome of at least one individual.
      • Label: The name by which an allele is referred to in a germline set.
      • Gene: A defined and mapped region within the genome of a species that groups alleles based on a shared, single ancestry.
      • Genotype: The collection of all alleles of a locus from a single individual, which may contain partial or complete phasing information allowing the alleles to be mapped to chromosomes.
      • Germline set: A curated collection of genes and alleles of a single locus of a species, which may be restricted to genes and alleles of certain populations within the species.
      • Locus: Used to distinguish the chromosomal regions in which IG/TR genes are located (e.g., IGH, TRB).
      Fig. 5. (A) The relationship of objects defined in the Schema. (B) Evolution of labels and aliases through phases of discovery. The diagram depicts four stages in the activity of a community group curating sequences from a particular species. These events are likely to be separated in time and may be triggered by the availability of additional evidence.
      Note that in these definitions, an individual is a single vertebrate organism. Throughout the results, we italicise the terms above where the explicit definition is intended.
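      As an illustration only, the relationships between these terms might be sketched as plain Python data classes. The class and attribute names below are assumptions chosen for readability and are not the field names of the AIRR schema; the sequences are made-up placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Allele:
    """A known, but potentially unmapped, region of the genome."""
    label: str                   # name used for the allele in a germline set
    sequence: str                # placeholder nucleotide sequence
    gene: Optional[str] = None   # None while the allele is unmapped


@dataclass
class GermlineSet:
    """Curated alleles of a single locus of a species."""
    species: str
    locus: str                   # e.g. "IGH" or "TRB"
    alleles: list[Allele] = field(default_factory=list)

    def unmapped(self) -> list[Allele]:
        """Alleles that have not yet been assigned to a gene."""
        return [a for a in self.alleles if a.gene is None]


@dataclass
class Genotype:
    """Alleles of one locus found in a single individual."""
    subject_id: str
    locus: str
    allele_labels: list[str] = field(default_factory=list)


# Hypothetical example data (sequences are not real germline sequences).
example_set = GermlineSet(
    species="Homo sapiens",
    locus="IGH",
    alleles=[
        Allele(label="IGHV0-A5B2*00", sequence="CAGGTGCAGCTG"),
        Allele(label="IGHV1-2*02", sequence="CAGGTGCAGCTGGTG", gene="IGHV1-2"),
    ],
)
```

      The point of the sketch is that an allele can exist in a germline set, and be referenced from a genotype, before it has been mapped to a gene.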
      As a next step, we define the potential usage scenarios:
      • Researchers can obtain current germline sets that support the best possible analysis of their data at that point in time.
      • Researchers can refer to genes and alleles in publications using a defined nomenclature that minimises the possibility of ambiguity and supports traceability over time.
      • Researchers can use germline sets to annotate AIRR-seq repertoires with the likely V, D and J alleles underlying each read in the repertoire, and can examine germline sets to understand the supporting evidence underlying each listed allele.
      • Researchers can load germline sets into tools used to annotate AIRR-seq repertoires and other software at a keystroke, without manipulation.
      • Software tools can produce haplotypes (i.e., fully phased genotypes) and personalised germline sets (sets containing just those alleles discovered in a single individual) in the same standard format.
      • Repositories can publish the germline sets that were used alongside annotated repertoires, enhancing transparency and reproducibility.

      2.1.1 The AIRR germline set schema

      The AIRR Data Schema [
      • Vander Heiden J.A.
      • Marquez S.
      • Marthandan N.
      • Bukhari S.A.C.
      • Busse C.E.
      • Corrie B.
      • Hershberg U.
      • Kleinstein S.H.
      • Matsen IV F.A.
      • Ralph D.K.
      • Rosenfeld A.M.
      • Schramm C.A.
      AIRR Community Standardized Representations for Annotated Immune Repertoires.
      ] is maintained by open-to-all community participation via the AIRR Community Standards Working Group. The schema includes key data items associated with the processing of AIRR-seq data, defined as minimal information by the MiAIRR data standard [
      • Rubelt F.
      • Busse C.E.
      • Bukhari S.A.C.
      • Bürckert J.-P.
      • Mariotti-Ferrandiz E.
      • Cowell L.G.
      • Watson C.T.
      • Marthandan N.
      • Faison W.J.
      • Hershberg U.
      • Laserson U.
      • Corrie B.D.
      • Davis M.M.
      • Peters B.
      • Lefranc M.-P.
      • Scott J.K.
      • Breden F.
      • Luning Prak E.T.
      • Kleinstein S.H.
      Adaptive Immune Receptor Repertoire Community recommendations for sharing immune-repertoire sequencing data.
      ]. To meet the usage scenarios, we chose to implement the above-mentioned concepts in a simplified manner that takes several specifics of gene inference workflows into account. To accomplish this, the AIRR Data Schema was expanded with the addition of two new high-level objects, GermlineSet and GenotypeSet (Fig. 6, Supplementary Information).
      Fig. 6. The AIRR Germline Set Schema. See Supplementary Data for a detailed description and itemisation of fields.
      GermlineSet lists the alleles associated with a single locus of a species or species subgroup. To indicate the exact nature of genetically distinct populations within the species of interest, we included the subgroup field, which uses a controlled vocabulary (locational, breed, inbred or outbred strain) - a set of descriptors that can be extended, if necessary, to allow the curation of subgroups in all species of interest. Within the GermlineSet, an AlleleDescription is provided for each identified allele. Each AlleleDescription provides details of a single V, D, or J allele, describing its core sequence and, in the case of V alleles, IMGT alignment [
      • Lefranc M.-P.
      • Pommié C.
      • Ruiz M.
      • Giudicelli V.
      • Foulquier E.
      • Truong L.
      • Thouvenin-Contet V.
      • Lefranc G.
      IMGT unique numbering for immunoglobulin and T cell receptor variable domains and Ig superfamily V-like domains.
      ]. Fields are provided for additional coding and non-coding elements where known. The AlleleDescription also includes information needed for the accurate annotation of observed rearranged sequences, specifically the delineation of complementarity-determining and framework regions in V sequences using diverse schemes (e.g., Kabat, Chothia, IMGT), and the frame orientation and location of the donor splice site in J alleles. Additional schema objects can be associated in order to enumerate the supporting evidence for the sequence. Both GermlineSet and AlleleDescription provide fields that allow for naming, versioning and attribution.
      Genotype describes, with reference to one or more GermlineSets, the specific alleles inferred (from repertoire analysis or genomic methods) in a single locus of a specific subject. Provision is also made for the identification of ‘previously undocumented’ alleles that cannot be found in the referenced GermlineSets, and for the identification of genes that are not detected in the repertoire, which, depending on the analysis method and data available, may explicitly confirm that the gene is deleted in this individual. A ‘phasing’ field allows information to be partially or fully haplotyped, where the analysis permits.
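      A minimal sketch of how a tool might populate such a genotype is shown below. The function name, the dictionary layout and the sequences are hypothetical; the real schema objects carry many more fields, and a gene reported here as not detected is not necessarily deleted in the individual.

```python
def summarise_genotype(germline, inferred):
    """Summarise inferred sequences against a reference germline set.

    germline maps allele label -> (gene, sequence); inferred is a list of
    sequences found in one subject. All names and data are illustrative.
    """
    by_sequence = {seq: label for label, (_gene, seq) in germline.items()}
    documented = sorted(by_sequence[s] for s in inferred if s in by_sequence)
    undocumented = sorted(s for s in inferred if s not in by_sequence)
    detected_genes = {germline[label][0] for label in documented}
    all_genes = {gene for gene, _seq in germline.values()}
    return {
        "documented": documented,
        "undocumented": undocumented,
        # Only documented alleles count towards detection in this sketch.
        "genes_not_detected": sorted(all_genes - detected_genes),
    }


# Placeholder reference set and subject data.
germline = {
    "IGHV1-2*02": ("IGHV1-2", "CAGGTGCAGCTGGTG"),
    "IGHV1-2*04": ("IGHV1-2", "CAGGTGCAGCTGGTA"),
    "IGHV1-3*01": ("IGHV1-3", "CAGGTCCAGCTTGTG"),
}
summary = summarise_genotype(germline, ["CAGGTGCAGCTGGTG", "CAGGTGCAGCTGGTC"])
```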

      2.1.2 Temporary nomenclature

      In advance of the formal recognition of a gene or allele and the approval of a name by the IUIS Reports Committee, we outline here a mechanism for assignment of standardised temporary names. We describe a concrete implementation supported by a minimal toolset, which can be evolved over time in the light of experience. In this system, when an allele is first identified, it is issued a temporary label. Temporary labels follow the IUIS style of <locus><type><subgroup>-<identifier>*<allele>, for example IGHV0-A5B2*00.
      <subgroup> and <allele> take ‘null’ values of 0 and 00 respectively, until such time as specific values are determined and assigned. <identifier> is a random 20-bit value, encoded as 4-character base32 according to RFC 4648 section 6 [

      IETF: RFC 4648 (Base-N Encodings), (n.d.). https://www.ietf.org/rfc/rfc4648.txt (accessed May 1, 2022).

      ]. In summary, the value is encoded as 4 characters, where each character may be an upper-case letter or a digit, but the digits 0, 1, 8 and 9 are omitted. This provides reasonably memorable strings, with approximately 1 million combinations (2^20 = 1,048,576). <identifier> is guaranteed to be unique within a naming domain and may not be re-used. The naming domain may extend to the locus of an entire species or may be restricted to a subgroup such as a strain or breed, at the decision of the curators. We permit the same <identifier> to be used in multiple naming domains with no overlap in meaning. For example, the same <identifier> could be assigned to an allele in humans, and also to an allele in macaques, without any implication that the two are related or share the same nucleotide sequence. Within a locus of a particular strain or species, two sequences might be identified, where one is a sub-sequence of the other (i.e., one sequence contains an identical copy of the other). In this case, the curating group can decide either to associate the same allele with the two sequences, or to associate different alleles. This might depend, for example, on whether haplotyping or usage evidence exists to support the case for the identified sequences arising from the same or different alleles. The structure of the temporary label, as well as being familiar to researchers, provides a level of compatibility with existing toolsets while being easily distinguishable from IUIS names.
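      The label format described above can be sketched as follows. This is an illustration under stated assumptions rather than the code of any existing tool: the function names are hypothetical, and the naming domain is represented simply as a set of identifiers already issued.

```python
import secrets

# RFC 4648 section 6 base32 alphabet: A-Z followed by 2-7.
BASE32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"


def encode_identifier(value: int) -> str:
    """Encode a 20-bit integer as four base32 characters (5 bits each)."""
    if not 0 <= value < 2 ** 20:
        raise ValueError("identifier must fit in 20 bits")
    chars = []
    for shift in (15, 10, 5, 0):  # most significant 5-bit group first
        chars.append(BASE32_ALPHABET[(value >> shift) & 0b11111])
    return "".join(chars)


def new_temporary_label(locus: str, seq_type: str, used: set[str]) -> str:
    """Draw random identifiers until one is unused in this naming domain.

    Subgroup and allele take the 'null' values 0 and 00.
    """
    while True:
        identifier = encode_identifier(secrets.randbits(20))
        if identifier not in used:
            used.add(identifier)  # identifiers may not be re-used
            return f"{locus}{seq_type}0-{identifier}*00"
```

      For example, `encode_identifier(29754)` yields `A5B2`, the identifier used in the example label above, and `new_temporary_label("IGH", "V", used)` returns a label of the form IGHV0-XXXX*00.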
      The schema provides for alleles to have multiple synonyms (aliases). These are used to capture legacy designations or alternative nomenclatures of an allele. They are also used to store previous labels, where a previously published temporary label has been changed. Aliases therefore provide traceability across time (Fig. 5b) and may be used to integrate records across multiple germline sequence databases (e.g., OGRDB, IMGT, etc.) provided that all researchers issuing labels coordinate with each other, to ensure that labels remain unique within the naming domain. We do not propose specific rules for the process of renaming (for example, why A5B2 was preferred over C89D in step 2 of the figure), believing that it is best left to the discretion of curators.
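      One way to picture the role of aliases is the sketch below, in which renaming an allele retains the previous label as an alias so that older publications can still be resolved. The class, its attributes and the renamed label are hypothetical and do not reproduce the schema's field definitions.

```python
from dataclasses import dataclass, field


@dataclass
class AlleleRecord:
    label: str
    aliases: list[str] = field(default_factory=list)

    def rename(self, new_label: str) -> None:
        """Adopt a new label and keep the old one as an alias."""
        if new_label == self.label:
            return
        if self.label not in self.aliases:
            self.aliases.append(self.label)
        if new_label in self.aliases:
            self.aliases.remove(new_label)  # current label is not its own alias
        self.label = new_label


def find_allele(records, name):
    """Return the record whose current label or any alias matches name."""
    for record in records:
        if name == record.label or name in record.aliases:
            return record
    return None


record = AlleleRecord("IGHV0-A5B2*00")
record.rename("IGHV1-A5B2*01")  # hypothetical later assignment of subgroup and allele
```

      A lookup by the original temporary label, `find_allele([record], "IGHV0-A5B2*00")`, still returns the renamed record.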

      2.2 Supporting tools

      2.2.1 IgLabel - A tool for managing the allocation of temporary labels

      To support the allocation of temporary labels to sequences, we have developed a command-line tool, IgLabel. IgLabel takes sequences as input and either creates a new label or suggests an existing label for the corresponding allele, maintaining a CSV-based database for the naming domain. While originally developed to allocate labels to IG alleles, it can be used to label alleles in any IG or TR locus. New sequences are allocated labels in a two-step process. In the first step, a file listing the new sequences is submitted. IgLabel returns a file containing a proposed action for each sequence, identifying those that duplicate already-submitted sequences, or are sub- or super-sequences of already-submitted sequences. In the second step, the user reviews the actions and can optionally change them, for example to allocate a new label for a sequence, even if it is a sub-sequence of an existing sequence. The file of actions is then submitted and IgLabel updates the database, allocating new labels as needed. IgLabel is available at https://github.com/williamdlees/IgLabel under an open-source licence.
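      The first step of this process can be illustrated with the sketch below. It is not IgLabel's code or command-line interface: the function name, the action names and the in-memory dictionary standing in for the database are all assumptions made for illustration.

```python
def propose_action(new_seq: str, database: dict[str, str]) -> tuple[str, list[str]]:
    """Propose an action for one submitted sequence.

    database maps existing label -> sequence. Returns the proposed action
    and the labels of any related existing sequences.
    """
    exact = [lbl for lbl, seq in database.items() if seq == new_seq]
    if exact:
        return "duplicate", exact
    containing = [lbl for lbl, seq in database.items() if new_seq in seq]
    if containing:
        return "sub_sequence", containing    # new sequence lies within an existing one
    contained = [lbl for lbl, seq in database.items() if seq in new_seq]
    if contained:
        return "super_sequence", contained   # new sequence extends an existing one
    return "new_label", []


# Placeholder database with a made-up sequence.
database = {"IGHV0-A5B2*00": "ACGTACGTAC"}
```

      In the second step a curator would review each proposed action and could override it, for example by requesting a new label for a sequence flagged as a sub-sequence.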
      It was noted previously that labels can be used to integrate records across multiple databases, provided that the issuers of labels within a naming domain coordinate with one another. This can be achieved with IgLabel if the database is published on a version control system such as GitHub (https://www.github.com), allowing changes from multiple sources to be merged and potential clashes handled.

      2.2.2 OGRDB - A system for managing and publishing germline sets

      The OGRDB website was initially developed as a system to support the work of IARC in reviewing and affirming human alleles inferred from AIRR-seq repertoires. It has been enhanced to support the management of sequences identified through the review process described in this work, and their publication as alleles. Multiple independent review groups, such as IARC, can be supported, each with assigned responsibility for specific species and loci. Each group can manage tables of sequences, alleles, and genes and publish them in germline sets. A number of sets are already published (Table 1). The AIRR-C schema for GermlineSet is supported, and germline sets are downloadable in JSON format compliant with the schema, or in FASTA format. They are also queryable via a REST API. OGRDB manages versioning and change control, such that both users and curators can identify the addition, removal, or modification of sequences in a germline set, and drill down to individual records for each sequence in order to see more detail. Data published on OGRDB is provided under a minimally restrictive Creative Commons CC0 1.0 licence. OGRDB data is periodically archived at Zenodo (https://zenodo.org) for long-term storage, and each version of a germline set is also deposited at Zenodo and allocated a Digital Object Identifier (DOI; https://www.iso.org/standard/81599.html). Users may therefore cite a persistent identifier that uniquely references the germline set used in their work. OGRDB source code is published under an open-source licence.
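      The kind of change report described above can be sketched as a comparison of two versions of a germline set. This is not OGRDB's implementation; the function name is hypothetical and each version is reduced to a mapping from label to sequence, with placeholder sequences.

```python
def diff_germline_sets(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two versions of a germline set given as label -> sequence maps."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(
            label for label in old.keys() & new.keys() if old[label] != new[label]
        ),
    }


# Placeholder versions of a germline set.
version_1 = {"IGHV1-2*02": "ACGT", "IGHV1-3*01": "GGCC", "IGHV0-A5B2*00": "TTAA"}
version_2 = {"IGHV1-2*02": "ACGT", "IGHV1-3*01": "GGCCA", "IGHV0-C4D2*00": "CCGG"}
changes = diff_germline_sets(version_1, version_2)
```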

      2.2.3 AIRR Data Commons

      The AIRR Data Commons is a geographically distributed set of data repositories for storing and sharing AIRR-seq data that conforms to the AIRR Standards. Users can directly query and download data using the AIRR Data Commons API or using a graphical user interface such as iReceptor Gateway [
      • Corrie B.D.
      • Marthandan N.
      • Zimonja B.
      • Jaglale J.
      • Zhou Y.
      • Barr E.
      • Knoetze N.
      • Breden F.M.W.
      • Christley S.
      • Scott J.K.
      • Cowell L.G.
      • Breden F.
      iReceptor: a platform for querying and analyzing antibody/B-cell and T-cell receptor repertoire data across federated repositories.
      ] or VDJServer Community Data Portal [
      • Christley S.
      • Scarborough W.
      • Salinas E.
      • Rounds W.H.
      • Toby I.T.
      • Fonner J.M.
      • Levin M.K.
      • Kim M.
      • Mock S.A.
      • Jordan C.
      • Ostmeyer J.
      • Buntzman A.
      • Rubelt F.
      • Davila M.L.
      • Monson N.L.
      • Scheuermann R.H.
      • Cowell L.G.
      VDJServer: a cloud-based analysis portal and data commons for immune repertoire sequences and rearrangements.
      ]. The AIRR schema for Subject was enhanced to allow a GenotypeSet to be specified for the subject, which can provide the Genotype for one or more loci. This enables repertoire queries to the AIRR Data Commons to also query any of the Genotype fields such as the locus, the alleles, the germline sets employed for annotation, and the inference process. GenotypeSets for one or more subjects along with their associated Genotypes can be downloaded in JSON format through the AIRR Data Commons API repertoire query end point.
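      As a hedged illustration, a request body for a repertoire query filtered on a genotype field might be assembled as below. The filter structure follows the general op/content/field/value form of ADC API queries, but the field path shown is an assumption and should be checked against the current AIRR schema before use; no request is sent.

```python
import json


def build_repertoire_query(field: str, value: str) -> str:
    """Return a JSON request body containing a single equality filter."""
    body = {"filters": {"op": "=", "content": {"field": field, "value": value}}}
    return json.dumps(body)


# Hypothetical field path for the locus of a subject's genotype.
GENOTYPE_LOCUS_FIELD = (
    "subject.genotype.receptor_genotype_set.genotype_class_list.locus"
)
query = build_repertoire_query(GENOTYPE_LOCUS_FIELD, "IGH")
```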

      2.3 A community approach to curation

      With the growing interest in and usage of AIRR-seq data, germline sets for humans and other species are becoming widely used, but, for any species and locus, the number of researchers actively engaged in germline set curation or discovery is low. We recognise that most researchers who work with germline sets are invested in one or in a small number of species and wish to see those move forward, with less interest in a general approach. The schema and nomenclature outlined above provide a common and consistent framework through which researchers working with a particular species or locus can share and publish data in advance of formal recognition. In the absence of a group, the schema and approach can be used by an individual researcher to provide results that can be extended later.
      The principles we envisage for a community approach are:
      • Groups should be open to all researchers working with a particular species or locus. The AIRR-C Working Groups (which are open to non-members) provide a non-exclusive solution.
      • Overlap should be discouraged, i.e., where possible, there should be just one community group working on each locus in a given species. In general, the small number of interested researchers should make it easy to avoid overlap: however, the schema provides approaches to nomenclature that can be used to coordinate parallel efforts where necessary, for example by storing the list of allocated labels in a commonly accessible and versioned repository.
      • Groups should be free to determine the evidence and approach to review that best suits the overall aim of creating the best available germline set from the resources available, bearing in mind that the resources will vary considerably between species, and that the approach may vary considerably between inbred and outbred species or strains.
      • Decision-making, supporting evidence, and review criteria should be documented and transparent. The schema provides versioning and links to records in primary repositories to support this.

      3. Discussion

      The AIRR-C is committed to the promotion of improved tools and techniques for next-generation sequencing, curation, and sharing of AIRRs [
      • Rubelt F.
      • Busse C.E.
      • Bukhari S.A.C.
      • Bürckert J.-P.
      • Mariotti-Ferrandiz E.
      • Cowell L.G.
      • Watson C.T.
      • Marthandan N.
      • Faison W.J.
      • Hershberg U.
      • Laserson U.
      • Corrie B.D.
      • Davis M.M.
      • Peters B.
      • Lefranc M.-P.
      • Scott J.K.
      • Breden F.
      • Luning Prak E.T.
      • Kleinstein S.H.
      Adaptive Immune Receptor Repertoire Community recommendations for sharing immune-repertoire sequencing data.
      ,
      • Scott J.K.
      • Breden F.
      The adaptive immune receptor repertoire community as a model for FAIR stewardship of big immunology data.
      ]. The development and improvement of receptor germline sets is a long-standing aim in support of understanding the development of these repertoires. Eventually, the germline receptor alleles of the receptor gene loci of all species of interest may be sufficiently well characterised to the point that intermediate sets and processes such as those described here are no longer needed: this is an important long-term goal. Until it is reached, intermediate sets will provide the soundest available basis for AIRR-seq analysis, and the schema will provide transparency and traceability in results. Today, for many species, reliance on a single reference assembly for the identification of receptor alleles has yielded germline sets that do not reflect species diversity. Researchers can obtain substantially higher quality analyses by employing germline sets that more fully reflect species diversity, even if alleles are not mapped to genes. The quality of AIRR-seq analyses can further be improved using personalised germline sets provided by AIRR-seq inference tools [
      • Corcoran M.M.
      • Vázquez Bernat N.
      • Phad G.E.
      • Stahl-Hennig C.
      • Sumida N.
      • Persson M.A.A.
      • Martin M.
      • Karlsson Hedestam G.B.
      Production of individualized V gene databases reveals high levels of immunoglobulin genetic diversity.
      ,
      • Ralph D.K.
      • Matsen F.A.
      Per-sample immunoglobulin germline inference from B cell receptor deep sequencing data.
      ,
      • Gadala-Maria D.
      • Yaari G.
      • Uduman M.
      • Kleinstein S.H.
      Automated analysis of high-throughput B-cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles.
      ], particularly when analysing highly mutated repertoires. We recommend the routine use of such tools in analysis pipelines. Some such tools, for example those that focus specifically on identifying which alleles of each gene are present in a repertoire, may need modification or extension to handle cases where the allele-to-gene mapping is not available.
      Analysis tools (annotation tools, repertoire analysis tools, etc.) have traditionally relied on the structure of the name to extract information such as the allele identifier, gene identifier and family designation. Indeed, the name itself has been the only source of that information in germline sets. Here we outline a schema for germline IG and TR sets which has specific fields to carry these attributes. We encourage developers to adopt the new schema, and in doing so: (1) move away from parsing information out of the name; (2) handle cases where gene identifiers and subgroup designations are not available; and (3) build tools that can easily and conveniently update germline sets by taking advantage of the new format. Germline sets are frequently updated; hence, ease of update is a factor that deserves specific attention. Likewise, strong version management is required, both in the publication of germline sets and in the attribution of results. Another factor for developers to consider is that germline sequences can be incomplete at the 5’ or 3’ end. This is a feature of current sets, and one that may persist, as AIRR-seq inferences are often inconclusive for the final nucleotides at the 3’ end. It is important, therefore, that calls are not unduly biased by sequence length [
      • Thörnqvist L.
      • Ohlin M.
      Critical steps for computational inference of the 3’-end of novel alleles of immunoglobulin heavy chain variable genes - illustrated by an allele of IGHV3-7.
      ].
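      For developers, the first two recommendations might look like the following sketch, in which the gene is read from an explicit field and unmapped alleles are handled as a normal case. The key names `label` and `gene_designation` stand in for the schema's explicit fields and should be treated as assumptions, as should the example values.

```python
from typing import Optional


def gene_of(allele: dict) -> Optional[str]:
    """Return the gene identifier from an explicit field, or None if unmapped."""
    value = allele.get("gene_designation")
    return value if value else None


def group_by_gene(alleles: list[dict]) -> dict[Optional[str], list[str]]:
    """Group allele labels by gene without parsing anything out of the label."""
    groups: dict[Optional[str], list[str]] = {}
    for allele in alleles:
        groups.setdefault(gene_of(allele), []).append(allele["label"])
    return groups


# Placeholder records: one mapped allele and one unmapped allele.
alleles = [
    {"label": "IGHV1-2*02", "gene_designation": "1-2"},
    {"label": "IGHV0-A5B2*00", "gene_designation": None},
]
groups = group_by_gene(alleles)
```

      Unmapped alleles end up under the `None` key rather than causing a parsing failure, which is the behaviour a tool needs when the allele-to-gene mapping is not available.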
      Provision is made in the IMGT naming scheme for unlocalised genes via the ‘S’ gene number prefix [

      M.P. Lefranc, From IMGT-ontology classification axiom to IMGT standardized gene and allele nomenclature: for immunoglobulins (IG) and T cell receptors (TR), Cold Spring Harb Protoc. 2011 (2011) 627–632. 10.1101/pdb.ip84.

      ], but here we describe a community process that can recognise unmapped alleles (rather than unlocalised genes) on the basis of evidence that is not currently acceptable for formal ratification. We have opted to use a naming scheme that can be easily differentiated from existing schemes. In our view, the community approach is likely to be more manageable and to lead to better results at this early stage than a more formal structure. This should not be seen to detract from the value of formal ratification: it is important that community efforts are able to support eventual ratification, and that traceability is maintained.
      A consideration for curators is how comprehensive or otherwise the coverage of a germline set should be before it is published. Users can compensate for lack of completeness in gene or allele coverage by including an inference step in their annotation pipeline. Curators should provide guidance to users both on the coverage of a germline set, and on any specific considerations concerning the use of inference tools. The focus in this work has been on the documentation of previously undocumented genes and alleles, but current sets also include erroneous and incomplete allelic sequences [
      • Wang Y.
      • Jackson K.J.L.
      • Sewell W.A.
      • Collins A.M.
      Many human immunoglobulin heavy-chain IGHV gene polymorphisms have been reported in error.
      ]. Consideration of evidence based on AIRR-seq inference and on long-read genomic sequencing offers an opportunity for existing sets to be improved. The evidence for alleles not seen in expressed repertoires should be carefully reviewed, and partial sequences extended where evidence permits. Allele-to-gene mappings may need to be reviewed.
      While the curational processes outlined here represent an important and significant step, they are only a first step. We expect the processes to change and improve over time, to be extended to C genes, which are not currently covered by the specification, and to be supported by increasingly sophisticated and interconnected systems and tools. The schema may also require modification over time to deal with complexities found in some species. The Atlantic salmon, for example, carries functional IGH loci on two chromosomes [
      • Magadan S.
      • Sunyer O.J.
      • Boudinot P.
      Unique features of fish immune repertoires: particularities of adaptive immunity within the largest group of vertebrates.
      ]. As this is thought to be a consequence of a whole genome duplication event [
      • Glasauer S.M.K.
      • Neuhauss S.C.F.
      Whole-genome duplication in teleost fishes and its evolutionary consequences.
      ], additional fields may be needed to capture the complexity of IG/TR genes in such cases.
      Over time, the number of species of interest, and the volume of data available, will continue to increase. This will require novel approaches for the discovery, review, and curation of alleles from multiple data sources to build the most robust germline databases. It will need to be combined with the development of more adaptable nomenclature and data standards, as well as an accompanying schema for a distributed infrastructure that can accommodate collaborative updates from multiple sources while preserving full history and provenance.
      It is important for the reproducibility and interpretation of results that germline sets used within the field are freely and openly available to all practitioners, so that common standards can be established for medical and other critical applications. The underlying studies that support the development of germline sets are overwhelmingly drawn from academic research. The AIRR-C is committed to FAIR principles for data management and stewardship. Sustainability requires secure funding, however, and we call on for-profit organisations that benefit from the use of such work to develop models for funding the continued development and curation of germline sets.

      Author contributions

      Authors are members of AIRR-C and have participated in the development of the policies and procedures through AIRR-C Working Groups and Subcommittees. The work was led within the Germline Database Working Group: AC and CW functioned as co-chairs. Substantial contributions were also made by the Software Working Group and Standards Working Group. WL, SC, CB, AC and CW drafted the manuscript and all authors contributed to editing and review.

      Funding

      WL, GY, CB, FB, BC, LC and SC received support from the European Union's Horizon 2020 research and innovation program under grant agreement No 825821. FB and BC were supported in part by grant number 01866 from Canadian Institutes of Health Research. CTW was supported in part by relevant awards from the National Institutes of Health (grant numbers: R24AI138963, R24AI162317, and R21AI142590). MO was supported in part by the Swedish Research Council (grant number 2019-01042).

      Data availability

      • Data is publicly available at the links for software provided in the manuscript.

      Declaration of Competing Interest

      The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

      Acknowledgments

      We would like to thank Jamie Scott, Katherine Jackson and Cathrine Scheepers for valuable comments during the preparation of the manuscript, and Erick Matsen for his contribution to the initiation and early development of this work.

      Appendix. Supplementary materials

      References

        • Lefranc M.P.
        • Giudicelli V.
        • Duroux P.
        • Jabado-Michaloud J.
        • Folch G.
        • Aouinti S.
        • Carillon E.
        • Duvergey H.
        • Houles A.
        • Paysan-Lafosse T.
        • Hadi-Saljoqi S.
        • Sasorith S.
        • Lefranc G.
        • Kossida S.
        IMGT®, the international ImMunoGeneTics information system® 25 years on.
        Nucl Acids Res. 2015; 43: D413-D422 https://doi.org/10.1093/nar/gku1056
        • Retter I.
        • Althaus H.H.
        • Münch R.
        • Müller W.
        VBASE2, an integrative V gene database.
        Nucl Acids Res. 2005; 33: D671-D674 https://doi.org/10.1093/nar/gki088
        • Kaduk M.
        • Corcoran M.
        • Karlsson Hedestam G.B.
        Addressing IGHV gene structural diversity enhances immunoglobulin repertoire analysis: lessons from rhesus Macaque.
        Front Immunol. 2022; 13 (accessed May 1, 2022)
        • Collins A.M.
        • Peres A.
        • Corcoran M.M.
        • Watson C.T.
        • Yaari G.
        • Lees W.D.
        • Ohlin M.
        Commentary on Population matched (pm) germline allelic variants of immunoglobulin (IG) loci: relevance in infectious diseases and vaccination studies in human populations.
        Genes Immun. 2021; 22: 335-338 https://doi.org/10.1038/s41435-021-00152-6
        • Jackson K.J.
        • Kos J.T.
        • Lees W.
        • Gibson W.S.
        • Smith M.L.
        • Peres A.
        • et al.
        A BALB/c IGHV reference set, defined by haplotype analysis of long-read VDJ-C sequences from F1 (BALB/c /C57BL/6) mice.
        Front Immunol. 2022; 13
        • Scheepers C.
        • Shrestha R.K.
        • Lambson B.E.
        • Jackson K.J.L.
        • Wright I.A.
        • Naicker D.
        • Goosen M.
        • Berrie L.
        • Ismail A.
        • Garrett N.
        • Abdool Karim Q.
        • Abdool Karim S.S.
        • Moore P.L.
        • Travers S.A.
        • Morris L.
        Ability to develop broadly neutralizing HIV-1 antibodies is not restricted by the germline Ig gene repertoire.
        J Immunol. 2015; 194: 4371-4378https://doi.org/10.4049/jimmunol.1500118
        • Pramanik S.
        • Cui X.
        • Wang H.Y.
        • Chimge N.O.
        • Hu G.
        • Shen L.
        • Gao R.
        • Li H.
        Segmental duplication as one of the driving forces underlying the diversity of the human immunoglobulin heavy chain variable gene region.
        BMC Genom. 2011; 12: 78https://doi.org/10.1186/1471-2164-12-78
        • Luo S.
        • Yu J.A.
        • Li H.
        • Song Y.S.
        Worldwide genetic variation of the IGHV and TRBV immune receptor gene families in humans.
        Life Sci Alliance. 2019; 2e201800221https://doi.org/10.26508/lsa.201800221
        • Zhang J.Y.
        • Roberts H.
        • Flores D.S.C.
        • Cutler A.J.
        • Brown A.C.
        • Whalley J.P.
        • Mielczarek O.
        • Buck D.
        • Lockstone H.
        • Xella B.
        • Oliver K.
        • Corton C.
        • Betteridge E.
        • Bashford-Rogers R.
        • Knight J.C.
        • Todd J.A.
        • Band G.
        Using de novo assembly to identify structural variation of eight complex immune system gene regions.
        PLoS Comput Biol. 2021; 17e1009254https://doi.org/10.1371/journal.pcbi.1009254
        • Watson C.T.
        • Steinberg K.M.
        • Huddleston J.
        • Warren R.L.
        • Malig M.
        • Schein J.
        • Willsey A.J.
        • Joy J.B.
        • Scott J.K.
        • Graves T.A.
        • Wilson R.K.
        • Holt R.A.
        • Eichler E.E.
        • Breden F.
        Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation.
        Am J Hum Genet. 2013; 92: 530-546https://doi.org/10.1016/j.ajhg.2013.03.004
        • Milner E.C.
        • Hufnagle W.O.
        • Glas A.M.
        • Suzuki I.
        • Alexander C.
        Polymorphism and utilization of human VH Genes.
        Ann N Y Acad Sci. 1995; 764: 50-61https://doi.org/10.1111/j.1749-6632.1995.tb55806.x
        • Zhang W.
        • Wang I.-M.
        • Wang C.
        • Lin L.
        • Chai X.
        • Wu J.
        • Bett A.J.
        • Dhanasekaran G.
        • Casimiro D.R.
        • Liu X.
        IMPre: an accurate and efficient software for prediction of T- and B-cell receptor germline genes and alleles from rearranged repertoire data.
        Front Immunol. 2016; 7https://doi.org/10.3389/fimmu.2016.00457
        • Corcoran M.M.
        • Vázquez Bernat N.
        • Phad Ganesh E.
        • Stahl-Hennig Christiane
        • Sumida N.
        • Persson M.A.A.
        • Martin M.
        • Karlsson Hedestam G.B.
        Production of individualized V gene databases reveals high levels of immunoglobulin genetic diversity.
        Nat Commun. 2016; 7: 13642https://doi.org/10.1038/ncomms13642
        • Ralph D.K.
        • Matsen F.A.
        Per-sample immunoglobulin germline inference from B cell receptor deep sequencing data.
        PLoS Comput Biol. 2019; 15e1007133https://doi.org/10.1371/journal.pcbi.1007133
        • Yu Y.
        • Ceredig R.
        • Seoighe C.
        LymAnalyzer: a tool for comprehensive analysis of next generation sequencing data of T cell receptors and immunoglobulins.
        Nucl Acids Res. 2016; 44: e31https://doi.org/10.1093/nar/gkv1016
        • Gadala-Maria D.
        • Yaari G.
        • Uduman M.
        • Kleinstein S.H.
        Automated analysis of high-throughput B-cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles.
        Proc Natl Acad Sci U S A. 2015; 112: E862-E870https://doi.org/10.1073/pnas.1417683112
        • Ohlin M.
        • Scheepers C.
        • Corcoran M.
        • Lees W.D.
        • Busse C.E.
        • Bagnara D.
        • Thörnqvist L.
        • Bürckert J.-P.
        • Jackson K.J.L.
        • Ralph D.
        • Schramm C.A.
        • Marthandan N.
        • Breden F.
        • Scott J.
        • Matsen IV F.A.
        • Greiff V.
        • Yaari G.
        • Kleinstein S.H.
        • Christley S.
        • Sherkow J.S.
        • Kossida S.
        • Lefranc M.-P.
        • van Zelm M.C.
        • Watson C.T.
        • Collins A.M.
        Inferred allelic variants of immunoglobulin receptor genes: a system for their evaluation, documentation, and naming.
        Front Immunol. 2019; : 10https://doi.org/10.3389/fimmu.2019.00435
        • Yang X.
        • Zhu Y.
        • Chen S.
        • Zeng H.
        • Guan J.
        • Wang Q.
        • Lan C.
        • Sun D.
        • Yu X.
        • Zhang Z.
        Novel allele detection tool benchmark and application with antibody repertoire sequencing dataset.
        Front Immunol. 2021; 12739179https://doi.org/10.3389/fimmu.2021.739179
        • Thörnqvist L.
        • Ohlin M.
        Critical steps for computational inference of the 3’-end of novel alleles of immunoglobulin heavy chain variable genes - illustrated by an allele of IGHV3-7.
        Mol Immunol. 2018; 103: 1-6https://doi.org/10.1016/j.molimm.2018.08.018
        • Kirik U.
        • Greiff L.
        • Levander F.
        • Ohlin M.
        Parallel antibody germline gene and haplotype analyses support the validity of immunoglobulin germline gene inference and discovery.
        Mol Immunol. 2017; 87: 12-22https://doi.org/10.1016/j.molimm.2017.03.012
        • Vázquez Bernat N.
        • Corcoran M.
        • Hardt U.
        • Kaduk M.
        • Phad G.E.
        • Martin M.
        • Karlsson Hedestam G.B.
        High-quality library preparation for NGS-Based immunoglobulin germline gene inference and repertoire expression analysis.
        Front Immunol. 2019; 10: 660https://doi.org/10.3389/fimmu.2019.00660
        • Ramesh A.
        • Darko S.
        • Hua A.
        • Overman G.
        • Ransier A.
        • Francica J.R.
        • Trama A.
        • Tomaras G.D.
        • Haynes B.F.
        • Douek D.C.
        • Kepler T.B.
        Structure and diversity of the rhesus macaque immunoglobulin loci through multiple de novo genome assemblies.
        Front Immunol. 2017; 8: 1407https://doi.org/10.3389/fimmu.2017.01407
        • Retter I.
        • Chevillard C.
        • Scharfe M.
        • Conrad A.
        • Hafner M.
        • Im T.-H.
        • Ludewig M.
        • Nordsiek G.
        • Severitt S.
        • Thies S.
        • Mauhar A.
        • Blöcker H.
        • Müller W.
        • Riblet R.
        Sequence and characterization of the Ig heavy chain constant and partial variable region of the mouse strain 129S1.
        J Immunol. 2007; 179: 2419-2427https://doi.org/10.4049/jimmunol.179.4.2419
        • Cirelli K.M.
        • Carnathan D.G.
        • Nogal B.
        • Martin J.T.
        • Rodriguez O.L.
        • Upadhyay A.A.
        • Enemuo C.A.
        • Gebru E.H.
        • Choe Y.
        • Viviano F.
        • Nakao C.
        • Pauthner M.G.
        • Reiss S.
        • Cottrell C.A.
        • Smith M.L.
        • Bastidas R.
        • Gibson W.
        • Wolabaugh A.N.
        • Melo M.B.
        • Cossette B.
        • Kumar V.
        • Patel N.B.
        • Tokatlian T.
        • Menis S.
        • Kulp D.W.
        • Burton D.R.
        • Murrell B.
        • Schief W.R.
        • Bosinger S.E.
        • Ward A.B.
        • Watson C.T.
        • Silvestri G.
        • Irvine D.J.
        • Crotty S.
        Slow delivery immunization enhances HIV neutralizing antibody and germinal center responses via modulation of immunodominance.
        Cell. 2019; 177 (e28): 1153-1171https://doi.org/10.1016/j.cell.2019.04.012
        • Watson C.T.
        • Kos J.T.
        • Gibson W.S.
        • Newman L.
        • Deikus G.
        • Busse C.E.
        • Smith M.L.
        • Jackson K.J.
        • Collins A.M.
        A comparison of immunoglobulin IGHV, IGHD and IGHJ genes in wild-derived and classical inbred mouse strains.
        Immunol Cell Biol. 2019; 97: 888-901https://doi.org/10.1111/imcb.12288
        • Collins A.M.
        • Watson C.T.
        Immunoglobulin light chain gene rearrangements, receptor editing and the development of a self-tolerant antibody repertoire.
        Front Immunol. 2018; 9: 2249https://doi.org/10.3389/fimmu.2018.02249
      1. Kos J.T., Safonova Y., Shields K.M., Silver C.A., Lees W.D., Collins A.M., et al. Characterization of extensive diversity in immunoglobulin light chain variable germline genes across biomedically important mouse strains. bioRxiv 2022:489089. doi:10.1101/2022.05.01.489089.

        • Lilue J.
        • Doran A.G.
        • Fiddes I.T.
        • Abrudan M.
        • Armstrong J.
        • Bennett R.
        • Chow W.
        • Collins J.
        • Collins S.
        • Czechanski A.
        • Danecek P.
        • Diekhans M.
        • Dolle D.-D.
        • Dunn M.
        • Durbin R.
        • Earl D.
        • Ferguson-Smith A.
        • Flicek P.
        • Flint J.
        • Frankish A.
        • Fu B.
        • Gerstein M.
        • Gilbert J.
        • Goodstadt L.
        • Harrow J.
        • Howe K.
        • Ibarra-Soria X.
        • Kolmogorov M.
        • Lelliott C.J.
        • Logan D.W.
        • Loveland J.
        • Mathews C.E.
        • Mott R.
        • Muir P.
        • Nachtweide S.
        • Navarro F.C.P.
        • Odom D.T.
        • Park N.
        • Pelan S.
        • Pham S.K.
        • Quail M.
        • Reinholdt L.
        • Romoth L.
        • Shirley L.
        • Sisu C.
        • Sjoberg-Herrera M.
        • Stanke M.
        • Steward C.
        • Thomas M.
        • Threadgold G.
        • Thybert D.
        • Torrance J.
        • Wong K.
        • Wood J.
        • Yalcin B.
        • Yang F.
        • Adams D.J.
        • Paten B.
        • Keane T.M.
        Sixteen diverse laboratory mouse reference genomes define strain-specific haplotypes and novel functional loci.
        Nat Genet. 2018; 50: 1574-1583https://doi.org/10.1038/s41588-018-0223-8
        • Lees W.
        • Busse C.E.
        • Corcoran M.
        • Ohlin M.
        • Scheepers C.
        • Matsen F.A.
        • Yaari G.
        • Watson C.T.
        • Community AIRR
        • Collins A.
        • Shepherd A.J.
        OGRDB: a reference database of inferred immune receptor genes.
        Nucl Acids Res. 2019; https://doi.org/10.1093/nar/gkz822
        • Vázquez Bernat N.
        • Corcoran M.
        • Nowak I.
        • Kaduk M.
        • Dopico X.C.
        • Narang S.
        • Maisonasse P.
        • Dereuddre-Bosquet N.
        • Murrell B.
        • Karlsson Hedestam G.B.
        Rhesus and cynomolgus macaque immunoglobulin heavy-chain genotyping yields comprehensive databases of germline VDJ alleles.
        Immunity. 2021; : 0https://doi.org/10.1016/j.immuni.2020.12.018
        • Warren W.C.
        • Harris R.A.
        • Haukness M.
        • Fiddes I.T.
        • Murali S.C.
        • Fernandes J.
        • Dishuck P.C.
        • Storer J.M.
        • Raveendran M.
        • Hillier L.W.
        • Porubsky D.
        • Mao Y.
        • Gordon D.
        • Vollger M.R.
        • Lewis A.P.
        • Munson K.M.
        • DeVogelaere E.
        • Armstrong J.
        • Diekhans M.
        • Walker J.A.
        • Tomlinson C.
        • Graves-Lindsay T.A.
        • Kremitzki M.
        • Salama S.R.
        • Audano P.A.
        • Escalona M.
        • Maurer N.W.
        • Antonacci F.
        • Mercuri L.
        • Maggiolini F.A.M.
        • Catacchio C.R.
        • Underwood J.G.
        • O'Connor D.H.
        • Sanders A.D.
        • Korbel J.O.
        • Ferguson B.
        • Kubisch H.M.
        • Picker L.
        • Kalin N.H.
        • Rosene D.
        • Levine J.
        • Abbott D.H.
        • Gray S.B.
        • Sanchez M.M.
        • Kovacs-Balint Z.A.
        • Kemnitz J.W.
        • Thomasy S.M.
        • Roberts J.A.
        • Kinnally E.L.
        • Capitanio J.P.
        • Skene J.H.P.
        • Platt M.
        • Cole S.A.
        • Green R.E.
        • Ventura M.
        • Wiseman R.W.
        • Paten B.
        • Batzer M.A.
        • Rogers J.
        • Eichler E.E.
        Sequence diversity analyses of an improved rhesus macaque genome enhance its biomedical utility.
        Science. 2020; 370: eabc6617https://doi.org/10.1126/science.abc6617
        • Nguefack Ngoune V.
        • Bertignac M.
        • Georga M.
        • Papadaki A.
        • Albani A.
        • Folch G.
        • Jabado-Michaloud J.
        • Giudicelli V.
        • Duroux P.
        • Lefranc M.-P.
        • Kossida S.
        IMGT® biocuration and analysis of the rhesus monkey IG Loci.
        Vaccines. 2022; 10: 394https://doi.org/10.3390/vaccines10030394
        • Rodriguez O.L.
        • Gibson W.S.
        • Parks T.
        • Emery M.
        • Powell J.
        • Strahl M.
        • Deikus G.
        • Auckland K.
        • Eichler E.E.
        • Marasco W.A.
        • Sebra R.
        • Sharp A.J.
        • Smith M.L.
        • Bashir A.
        • Watson C.T.
        A novel framework for characterizing genomic haplotype diversity in the human immunoglobulin heavy chain locus.
        Front Immunol. 2020; 11: 2136https://doi.org/10.3389/fimmu.2020.02136
        • Lin M.J.
        • Lin Y.C.
        • Chen N.C.
        • Luo A.C.
        • Lai S.K.
        • Hsu C.L.
        • et al.
        Profiling genes encoding the adaptive immune receptor repertoire with gAIRR Suite.
        Front Immunol. 2022; 13https://doi.org/10.3389/fimmu.2022.922513
        • Gibson W.S.
        • Rodriguez O.L.
        • Shields K.
        • Silver C.A.
        • Dorgham A.
        • Emery M.
        • et al.
        Characterization of the immunoglobulin lambda chain locus from diverse populations reveals extensive genetic variation.
        Genes Immun. 2023; 24: 21-31https://doi.org/10.1038/s41435-022-00188-2
        • Rodriguez O.L.
        • Silver C.A.
        • Shields K.
        • Smith M.L.
        • Watson C.T.
        Targeted long-read sequencing facilitates phased diploid assembly and genotyping of the human T cell receptor alpha, delta and beta loci.
        Cell Genom. 2022; 2100228https://doi.org/10.1016/j.xgen.2022.100228
        • Wilkinson M.D.
        • Dumontier M.
        • Aalbersberg Ij.J.
        • Appleton G.
        • Axton M.
        • Baak A.
        • Blomberg N.
        • Boiten J.W.
        • da Silva Santos L.B.
        • Bourne P.E.
        • Bouwman J.
        • Brookes A.J.
        • Clark T.
        • Crosas M.
        • Dillo I.
        • Dumon O.
        • Edmunds S.
        • Evelo C.T.
        • Finkers R.
        • Gonzalez-Beltran A.
        • Gray A.J.G.
        • Groth P.
        • Goble C.
        • Grethe J.S.
        • Heringa J.
        • ’t Hoen P.A.C.
        • Hooft R.
        • Kuhn T.
        • Kok R.
        • Kok J.
        • Lusher S.J.
        • Martone M.E.
        • Mons A.
        • Packer A.L.
        • Persson B.
        • Rocca-Serra P.
        • Roos M.
        • van Schaik R.
        • Sansone S.-A.
        • Schultes E.
        • Sengstag T.
        • Slater T.
        • Strawn G.
        • Swertz M.A.
        • Thompson M.
        • van der Lei J.
        • van Mulligen E.
        • Velterop J.
        • Waagmeester A.
        • Wittenburg P.
        • Wolstencroft K.
        • Zhao J.
        • Mons B.
        The FAIR guiding principles for scientific data management and stewardship.
        Sci Data. 2016; 3160018https://doi.org/10.1038/sdata.2016.18
        • Christley S.
        • Aguiar A.
        • Blanck G.
        • Breden F.
        • Bukhari S.A.C.
        • Busse C.E.
        • Jaglale J.
        • Harikrishnan S.L.
        • Laserson U.
        • Peters B.
        • Rocha A.
        • Schramm C.A.
        • Taylor S.
        • Vander Heiden J.A.
        • Zimonja B.
        • Watson C.T.
        • Corrie B.
        • Cowell L.G.
        The ADC API: a web API for the programmatic query of the AIRR data commons.
        Front Big Data. 2020; 3 (accessed May 1, 2022)
        • Vander Heiden J.A.
        • Marquez S.
        • Marthandan N.
        • Bukhari S.A.C.
        • Busse C.E.
        • Corrie B.
        • Hershberg U.
        • Kleinstein S.H.
        • Matsen IV F.A.
        • Ralph D.K.
        • Rosenfeld A.M.
        • Schramm C.A.
        AIRR Community Standardized Representations for Annotated Immune Repertoires.
        Front Immunol. 2018; 9 (accessed May 1, 2022)
        • Rubelt F.
        • Busse C.E.
        • Bukhari S.A.C.
        • Bürckert J.-P.
        • Mariotti-Ferrandiz E.
        • Cowell L.G.
        • Watson C.T.
        • Marthandan N.
        • Faison W.J.
        • Hershberg U.
        • Laserson U.
        • Corrie B.D.
        • Davis M.M.
        • Peters B.
        • Lefranc M.-P.
        • Scott J.K.
        • Breden F.
        • Luning Prak E.T.
        • Kleinstein S.H.
        Adaptive Immune Receptor Repertoire Community recommendations for sharing immune-repertoire sequencing data.
        Nat Immunol. 2017; 18: 1274-1278https://doi.org/10.1038/ni.3873
        • Lefranc M.-P.
        • Pommié C.
        • Ruiz M.
        • Giudicelli V.
        • Foulquier E.
        • Truong L.
        • Thouvenin-Contet V.
        • Lefranc G.
        IMGT unique numbering for immunoglobulin and T cell receptor variable domains and Ig superfamily V-like domains.
        Dev Comp Immunol. 2003; 27: 55-77
      2. IETF: RFC 4648 (Base-N Encodings), (n.d.). https://www.ietf.org/rfc/rfc4648.txt (accessed May 1, 2022).

        • Corrie B.D.
        • Marthandan N.
        • Zimonja B.
        • Jaglale J.
        • Zhou Y.
        • Barr E.
        • Knoetze N.
        • Breden F.M.W.
        • Christley S.
        • Scott J.K.
        • Cowell L.G.
        • Breden F.
        iReceptor: a platform for querying and analyzing antibody/B-cell and T-cell receptor repertoire data across federated repositories.
        Immunol Rev. 2018; 284: 24-41https://doi.org/10.1111/imr.12666
        • Christley S.
        • Scarborough W.
        • Salinas E.
        • Rounds W.H.
        • Toby I.T.
        • Fonner J.M.
        • Levin M.K.
        • Kim M.
        • Mock S.A.
        • Jordan C.
        • Ostmeyer J.
        • Buntzman A.
        • Rubelt F.
        • Davila M.L.
        • Monson N.L.
        • Scheuermann R.H.
        • Cowell L.G.
        VDJServer: a cloud-based analysis portal and data commons for immune repertoire sequences and rearrangements.
        Front Immunol. 2018; 9: 976https://doi.org/10.3389/fimmu.2018.00976
        • Scott J.K.
        • Breden F.
        The adaptive immune receptor repertoire community as a model for FAIR stewardship of big immunology data.
        Curr Opin Syst Biol. 2020; 24: 71-77https://doi.org/10.1016/j.coisb.2020.10.001
      3. M.P. Lefranc, From IMGT-ontology classification axiom to IMGT standardized gene and allele nomenclature: for immunoglobulins (IG) and T cell receptors (TR), Cold Spring Harb Protoc. 2011 (2011) 627–632. 10.1101/pdb.ip84.

        • Wang Y.
        • Jackson K.J.L.
        • Sewell W.A.
        • Collins A.M.
        Many human immunoglobulin heavy-chain IGHV gene polymorphisms have been reported in error.
        Immunol Cell Biol. 2008; 86: 111-115https://doi.org/10.1038/sj.icb.7100144
        • Magadan S.
        • Sunyer O.J.
        • Boudinot P.
        Unique features of fish immune repertoires: particularities of adaptive immunity within the largest group of vertebrates.
        Results Probl Cell Differ. 2015; 57: 235-264https://doi.org/10.1007/978-3-319-20819-0_10
        • Glasauer S.M.K.
        • Neuhauss S.C.F.
        Whole-genome duplication in teleost fishes and its evolutionary consequences.
        Mol Genet Genom. 2014; 289: 1045-1060https://doi.org/10.1007/s00438-014-0889-2