Tools and Resources Working Group

Members: Ramy Arnaout, Davide Bagnara, Christian E. Busse, Martin Corcoran, Rion Dooley, Danny Douek, Simon Frost, Victor Greiff, William Lees, Mats Ohlin, Sai Reddy, Adrian Shepherd, Corey Watson, Erick A. Matsen IV (co-chair), and Chaim A. Schramm (co-chair). 

The goal of this working group is to organize and evaluate B cell receptor (BCR) and T cell receptor (TCR) sequence analysis tools.

Draft summary - Recommendations from the Tools and Resources Working Group (PDF)

Please note: not all members of the Tools and Resources Working Group have approved this document. It will be updated.

The Tools and Resources Working Group initiated subsidiary working groups to handle the tasks required by its broad mandate.

Biological Standards Working Group (led by Sarah Taylor)

This working group will be responsible for coordinating the development of reference samples that can be used as controls. The working group will reach out to established organizations such as NIST and Genome in a Bottle, as well as companies like ATCC and Novartis to help encourage ease of use and broad adoption.

The goal of our working group is to be able to recommend a set of biological standards that can be used for normalization of data sets. This will allow more direct comparison of data generated by different library prep methods. Ultimately we would like to have DNA and RNA spike-ins, in addition to cell mixes that can be used for this purpose.

Sai Reddy, one of the group members, has developed mouse VH RNA spike-ins. Because these have already been validated by Sai’s lab, we are using them as our starting point. We have had several discussions with NIST and are following their recommendations on how best to perform further testing. The initial test they recommend is a multi-lab study that uses the spike-ins to validate their performance.

Before the next AIRR meeting, we plan to:

  1. Send out mouse VH spike-ins (RNA) to multiple labs for testing
  2. Collate data from the test labs for comparison
  3. Have the data available for presentation at the meeting

Biological Standards members: Sarah Taylor, Sai Reddy, Melissa Smith, Victor Greiff, Yariv Win, Christian Busse, Davide Bagnara, Danny Douek, Chaim Schramm

Software Standards Working Group (led by Erick Matsen)

This working group will be responsible for developing a list of standard datasets with which software tools can be tested and compared. This will include both real and simulated data with a variety of characteristics matched to potential applications.

The goal of our working group is to encourage practices that enable software tools to work, and to work with one another. To that end, we have been assembling data sets that people can use to test the functionality of various programs. We have also discussed containerization using Docker, a technology that allows a program to be self-contained and mostly immune to differing versions of software dependencies. In addition, we have discussed a Common Workflow Language (CWL) schema that would allow programs to be run with standardized input fields.

However, after feedback from various community members it appears that the most acute need is standardized simulated data sets with known properties. In the next few months, we will define summary statistics that can be used to characterize simulated data sets and compare them to real data sets. After that we can “benchmark the benchmarks” to decide how realistic the various simulations are.
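
As a rough illustration of what comparing a simulated data set to a real one with a summary statistic could look like, the sketch below computes the CDR3 length distribution of two toy sequence lists and measures their Jensen-Shannon divergence. The choice of statistic, the function names, and the sequences are assumptions made for this example only; the working group has not yet defined its summary statistics.

```python
import math
from collections import Counter


def length_distribution(sequences):
    """Return a dict mapping sequence length to its relative frequency."""
    counts = Counter(len(s) for s in sequences)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions.

    0 means the distributions are identical; 1 means they do not overlap.
    """
    divergence = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution at this key
        if pk > 0:
            divergence += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            divergence += 0.5 * qk * math.log2(qk / mk)
    return divergence


# Toy, made-up CDR3 amino acid sequences (illustration only, not real data).
real_cdr3 = ["CARDYW", "CARGGYW", "CARDGSYW", "CARDYW"]
simulated_cdr3 = ["CARDYW", "CARGGYW", "CARGGYW", "CARDGSYW"]

real_dist = length_distribution(real_cdr3)
sim_dist = length_distribution(simulated_cdr3)
score = jensen_shannon_divergence(real_dist, sim_dist)
print(f"JSD of CDR3 length distributions: {score:.3f}")
```

A single statistic such as this one captures only one property of a repertoire; a realistic comparison would presumably combine several statistics.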

In preparation for the next AIRR meeting, we would like to:

  1. Standardize criteria for a realistic simulation
  2. Finalize list of standard data sets; make them available for automated download

Software Standards members: Branden Olson, Duncan Ralph, Erick Matsen, Chaim Schramm, Christian Busse, Corey Watson, Inimary Toby, Jason Vander Heiden, Mikhail Shugay, Simon Frost, Uri Laserson, Victor Greiff, William Lees, Jian Ye, Enkelejda Miho

Germline Database Working Group (led by Corey Watson)

This working group will be responsible for developing appropriate metadata fields for documenting novel germline alleles and for establishing standards for versioned, inclusive databases. This will involve interfacing with established germline repositories, such as IMGT, IgPdb, and VBase, to get the widest possible buy-in.

The AIRR Germline Database Working Group (GLDB-WG) seeks to address outstanding issues with existing germline IG/TCR gene and allele databases. Our work and discussions to date have been broad, spanning governance, database structure, and gene/allele inference criteria, among other topics. We have recently narrowed our immediate focus to three primary initiatives, which we hope to complete in the next 3-5 months. These initiatives will be carried out in collaboration with IMGT and primarily aim to facilitate needed changes and additions to the current IMGT database. Although initial interactions between the GLDB-WG and IMGT were not successful, a February 2017 meeting between members of the working group and IMGT in Montpellier allowed us to find some common ground.

Briefly, these initiatives seek to:

-Initiative 1-

  1. Develop criteria (and a “scoring” system) for inferring IG/TCR genes/alleles from expressed repertoire data.
  2. Establish a process for depositing “inferred” gene/allele data into a public database, consistent with requirements set by IMGT.
  3. Assist IMGT in creating a mechanism that will allow these deposited data to be extracted and curated to source a secondary IMGT IG/TCR database. 

-Initiative 2-

  1. Assess recently generated IG genomic assembly data for the NOD mouse strain in comparison to data available for C57BL/6.
  2. Use these data to guide a set of recommendations to IMGT for more effectively and accurately annotating mouse IG/TCR genes for various strains.

-Initiative 3-

  1. (Related to Initiative 1) Using data from the rhesus macaque, establish a process for creating IG/TCR germline databases within IMGT (sourced from repertoire-inferred data, genomic data, or both) for species that do not have completed genome reference assemblies.
  2. This will include criteria (if possible) for how to classify and annotate sequences as genes vs. alleles when these data cannot be explicitly placed in the context of a genomic locus.

Germline Database members: Scott Christley, Andrew Collins, Bruno Gaeta, Felix Breden, Brian Fritz, Chaim Schramm, Christian Busse, Corey Watson, Daniel Gadala-Maria, Davide Bagnara, Deanna Church, Florian Rubelt, Duncan Ralph, Erick Matsen, Gur Yaari, Jamie Faison, Jean-Philippe Buerckert, Jian Ye, Jamie Scott, Justin Kos, Katherine Jackson, Kevin Wu, Martin Corcoran, Mats Ohlin, Melissa Smith, Nishanth Marthandan, Paul Rivkin, William Rounds, Steven Kleinstein, Victor Greiff, Werner Müller, William Gibson

File Formats Working Group (led by Uri Laserson)

This working group will be responsible for developing standardized names for data fields which can be understood and interpreted by all software tools, allowing interoperability between pipelines from different developers. Close collaboration with the Minimal Standards Working Group is expected.

The File Formats Working Group is focused on developing standard file formats and schemas to represent annotated antibody and T cell receptor sequences and any downstream data representations. The proliferation of tools for processing raw AIRR data is making it more difficult to compare results between tools and to build modular data pipelines. We have been developing a CSV-like file format for representing annotated reads and clones, with the goal of having it implemented in multiple common AIRR pipelines (e.g., Immcantation).

In preparation for the next AIRR meeting, we would like to:

  1. Finalize a schema/format for annotated AIRR data; we would like the AIRR community to endorse it for all tools that annotate reads
  2. Provide example data sets in our format, along with a tool for validating correctly formatted data
  3. Ensure our schema is compatible with the Common Repo WG's specifications for repositories.
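
To make the idea of a validation tool for a CSV-like annotation format more concrete, here is a minimal sketch in Python. The column names (`sequence_id`, `sequence`, `v_call`, `j_call`), the comma delimiter, and the function name are hypothetical and chosen only for illustration; they are not the working group's schema, which is still being finalized.

```python
import csv
import io

# Hypothetical required columns; the working group's actual schema may differ.
REQUIRED_FIELDS = ("sequence_id", "sequence", "v_call", "j_call")


def validate_annotations(handle, delimiter=","):
    """Return a list of human-readable problems found in a delimited file."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [f for f in REQUIRED_FIELDS if f not in header]
    if missing:
        # Without the required columns, row-level checks are not meaningful.
        return [f"missing required column(s): {', '.join(missing)}"]
    problems = []
    # Assuming one record per line, data rows start on line 2 (line 1 is the header).
    for line_number, row in enumerate(reader, start=2):
        for field in REQUIRED_FIELDS:
            # Short rows yield None for absent fields, so treat None as empty.
            if not (row[field] or "").strip():
                problems.append(f"line {line_number}: empty value for '{field}'")
    return problems


# Toy input with one incomplete record (second data row lacks a V call).
example = io.StringIO(
    "sequence_id,sequence,v_call,j_call\n"
    "read1,ACGT,IGHV1-2*01,IGHJ4*02\n"
    "read2,ACGT,,IGHJ4*02\n"
)
issues = validate_annotations(example)
for issue in issues:
    print(issue)
```

A checker like this only confirms that required fields are present and non-empty; validating field contents (for example, that gene calls match a germline database) would need additional rules.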

File Formats members: Aaron Rosenfeld, Anna Fowler, Ahmad Chan, Brian Corrie, Bojan Zimonja, Chaim Schramm, Corey Watson, Daniel Gadala-Maria, Duncan Ralph, Felix Breden, Jason Vander Heiden, Jerome Jaglale, Jessica Finn, Nishanth Marthandan, Richard Bruskiewich, Scott Christley, Steve Kleinstein, Susanna Marquez

 

 Working Group Resources