Pipeline Information

The lentivirus-based deep mutational scanning platform relies on obtaining relative frequencies of different receptor binding protein variants that enter cells after applying selection to the libraries. By comparing mutation frequencies before and after selections, we can determine the effects of mutations on different phenotypes.

Calculating the relative frequencies of thousands of variants is not trivial. We rely on two different sequencing technologies to obtain the necessary data.

PacBio long-read sequencing to link barcodes to specific mutations in the receptor binding protein.
Illumina short-read sequencing to obtain the relative frequencies of barcodes in each selection experiment.

Schematic of lentivirus vector used in deep mutational scanning experiments (top), along with sequencing strategy (bottom).

Because PacBio sequencing is expensive and lower throughput, we only sequenced the variant libraries with this technology once. Full-length consensus sequences of the receptor binding protein and associated barcodes are assembled, while discarding low-quality reads. From these assembled consensus sequences we build a codon-variant lookup table, enabling us to match barcodes to specific mutations in the receptor binding protein. All subsequent Illumina sequencing of selection experiments use this lookup table to estimate mutational effects from barcode sequencing data alone. Our generated pseudovirus libraries consist of 60,000 to 80,000 unique variants. Each unique variant is sequenced hundreds of times with Illumina to get accurate frequency measurements.

Most of these computationally intensive steps were analyzed with dms-vep-pipeline-3. This pipeline utilizes the alignparse package. Each step, along with the associated jupyter notebooks are listed below.

Build Pacbio Sequences

PacBio consensus sequences notebook

This notebook builds the Pacbio consensus sequences used to link specific mutations in the receptor binding protein with a unique 16 bp barcode. Parameters used are:

max_minor_sub_frac=0.2
max_minor_indel_frac=0.2
min_support=3

These parameters filter consensus sequences generated from Pacbio CCS sequencing and assembly. If an assembled RBP sequence has a mutation or indel in more than 20% of the reads, it will be discarded. Consensus sequences must have at least three reads to be included as variants.

Using alignparse, reads were mapped to a reference sequence, and clipped based on parameters in this config file.

Analyze PacBio CCS Reads

Analyze Pacbio CCS reads notebook

Reports information about CCS read filtering.

Build Codon Variants Notebook

Build codon variants notebook

Builds the codon-variant table from PacBio consensus sequences that links barcodes and RBP mutations. Displays information about the number of mutations and variants present in each library.

Link to codon-variant table .csv file

Illumina Variant Counts

Once the barcodes are linked to mutations in the codon-variant table, all sequencing data is generated with Illumina on a small sequence fragment to obtain the relative frequencies of barcodes in each selection experiment. The config file linked below specifies the parameters used for converting barcode counts to functional scores, which are used to estimate cell entry.

Analysis of variant counts notebook

Link to raw barcode count .csv files

Link to functional selection config file

Filtering Selection Data

Filtering notebook

Once the effects of mutations on different phenotypes have been calculated, we perform a data filtering step to remove low confidence measurements. The filtering parameters are contained within the nipah_config.yaml file. More information about these parameters are listed in the notebook. In brief, we require mutations to be present in at least two barcodes, and have low variance between selection replicates.

Filtered Data

These data have been filtered and are the best choice for anyone interested in analyzing the data themselves. For unfiltered raw .csv files of mutational effects on different phenotypes, go to individual pages to view and download.

Miscellaneous Notebooks

Notebook for finding correlations between libraries and making histogram of variants

Notebook for making a Nipah phylogeny

Notebook for making specific file formats (.defattr), to map site-averaged scores onto protein structures in Chimera

Notebook for calculating atomic distances between residues from a PDB file

Notebook for finding variable sites in Nipah or Henipavirus alignments

Pipeline Information ​

Build Pacbio Sequences ​

Analyze PacBio CCS Reads ​

Build Codon Variants Notebook ​

Illumina Variant Counts ​

Filtering Selection Data ​

Filtered Data ​

Cell Entry ​

Receptor Binding ​

Antibody Escape ​

Miscellaneous Notebooks ​