Counts of variants

This notebook analyzes the counts of the different variants.

Import Python modules:

[1]:
import os

import Bio.SeqIO

import altair as alt

import dms_variants.codonvarianttable

import pandas as pd

import yaml
[2]:
# allow more rows for Altair
_ = alt.data_transformers.disable_max_rows()

Get configuration information:

[3]:
# If you are running the notebook interactively, rather than in the pipeline that
# handles working directories, you may first have to `os.chdir` to the appropriate directory.

with open("config.yaml") as f:
    config = yaml.safe_load(f)

Read information on the barcode runs:

[4]:
barcode_runs = pd.read_csv(config["processed_barcode_runs"])

assert len(barcode_runs) == barcode_runs["library_sample"].nunique()

Barcode sequencing alignment stats

Read the “fates” of barcode reads (e.g., did they align?):

[5]:
fates = (
    pd.concat(
        [
            pd.read_csv(
                os.path.join(config["barcode_fates_dir"], f"{library_sample}.csv")
            )
            for library_sample in barcode_runs["library_sample"]
        ]
    )
    .merge(barcode_runs, on=["library", "sample"], validate="many_to_one")
    .drop(columns=["fastq_R1", "notes"])
    .assign(
        valid=lambda x: x["fate"] == "valid barcode",
        not_valid=lambda x: ~x["valid"],
    )
)
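The `validate="many_to_one"` argument in merges like the one above guards against accidental row duplication: it raises a `MergeError` if the right frame has duplicate merge keys. A minimal sketch with made-up frames:

```python
import pandas as pd

# toy stand-ins (hypothetical data) for the fates and barcode_runs frames
fates_toy = pd.DataFrame(
    {"library": ["Lib-1", "Lib-1"], "fate": ["valid barcode", "unparseable"]}
)
runs_toy = pd.DataFrame({"library": ["Lib-1"], "sample_type": ["antibody"]})

# "many_to_one" asserts the right frame has unique merge keys
merged = fates_toy.merge(runs_toy, on="library", validate="many_to_one")
assert len(merged) == 2  # each left row matched the single right row

# duplicate keys on the "one" side are rejected with a MergeError
runs_dup = pd.concat([runs_toy, runs_toy])
try:
    fates_toy.merge(runs_dup, on="library", validate="many_to_one")
    raise AssertionError("expected MergeError")
except pd.errors.MergeError:
    pass
```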

Plot the barcode fates in an interactive plot:

[6]:
selection_cols = [
    "exclude_after_counts",
    "antibody",
    "virus_batch",
    "sample_type",
    "date",
    "library",
]

selections = [
    alt.selection_point(
        fields=[col],
        bind=alt.binding_select(
            options=[None] + fates[col].dropna().unique().tolist(),
            labels=["all"] + [str(x) for x in fates[col].dropna().unique()],
            name=col,
        ),
    )
    for col in selection_cols
]

sample_types = barcode_runs["sample_type"].unique().tolist()

fate_chart = (
    alt.Chart(fates)
    .encode(
        x=alt.X("count", axis=alt.Axis(format=".2g"), scale=alt.Scale(nice=False)),
        y=alt.Y("library_sample", title=None),
        color=alt.Color(
            "fate",
            scale=alt.Scale(domain=sorted(fates["fate"].unique(), reverse=True)),
        ),
        order="not_valid",
        tooltip=[
            alt.Tooltip(c, format=".3g") if c == "count" else c
            for c in fates.columns
            if c not in ["valid", "not_valid", "library_sample", "sample"]
        ],
    )
    .mark_bar()
    .properties(width=300, height=alt.Step(13))
    .configure_axis(labelLimit=500)
)

for selection in selections:
    fate_chart = fate_chart.add_parameter(selection).transform_filter(selection)

display(fate_chart)
/fh/fast/bloom_j/computational_notebooks/jbloom/2022/SARS-CoV-2_Omicron_BA.1_spike_DMS/.snakemake/conda/c7f3c86c00e32a482c2803aeeccbef5b/lib/python3.9/site-packages/altair/utils/core.py:317: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.

Barcode counts

Read the counts of all valid and invalid barcodes:

[7]:
counts = pd.concat(
    [
        pd.read_csv(os.path.join(subdir, f"{library_sample}.csv")).assign(valid=valid)
        for library_sample in barcode_runs["library_sample"]
        for (subdir, valid) in [
            (config["barcode_counts_dir"], True),
            (config["barcode_counts_invalid_dir"], False),
        ]
    ]
)
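The nested comprehension above reads each library/sample CSV twice, once from each directory, tagging the rows as valid or invalid via `assign`; the same tagging pattern on small in-memory frames (hypothetical barcodes):

```python
import pandas as pd

# hypothetical stand-ins for one valid-counts CSV and one invalid-counts CSV
valid_df = pd.DataFrame({"barcode": ["AAAA"], "count": [10]})
invalid_df = pd.DataFrame({"barcode": ["CCCC"], "count": [3]})

# concatenate, flagging the source of each frame
counts_toy = pd.concat(
    [df.assign(valid=valid) for (df, valid) in [(valid_df, True), (invalid_df, False)]]
)
assert counts_toy["valid"].tolist() == [True, False]
```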

Get the average number of counts per barcode, separating valid and invalid ones:

[8]:
avg_counts = (
    counts.groupby(["library", "sample", "valid"], as_index=False)
    .aggregate(
        mean_counts=pd.NamedAgg("count", "mean"),
        n_barcodes=pd.NamedAgg("barcode", "count"),
    )
    .merge(
        barcode_runs,
        on=["library", "sample"],
        validate="many_to_one",
    )
    .assign(valid=lambda x: x["valid"].astype(str))  # to make interactive chart work
    .drop(columns=["fastq_R1", "notes"])
)

Plot the average counts per barcode for both valid and invalid ones. The plot below is interactive: you can use the drop-downs at the bottom to select specific subsets, mouse over points for details, and click the legend to show only valid or invalid counts:

[9]:
validities = avg_counts["valid"].unique()
valid_selection = alt.selection_multi(
    fields=["valid"],
    bind="legend",
)

avg_counts_chart = (
    alt.Chart(avg_counts)
    .encode(
        x=alt.X("mean_counts", title="average counts per barcode"),
        y=alt.Y("library_sample", title=None),
        color=alt.Color(
            "valid", title="valid barcode", scale=alt.Scale(domain=validities)
        ),
        shape=alt.Shape("valid", scale=alt.Scale(domain=validities)),
        opacity=alt.condition(valid_selection, alt.value(0.8), alt.value(0)),
        tooltip=[
            alt.Tooltip(c, format=".3g") if c in {"mean_counts", "n_barcodes"} else c
            for c in avg_counts.columns
            if c != "library_sample"
        ],
    )
    .mark_point(filled=True, size=50)
    .properties(width=200, height=alt.Step(15))
    .configure_axis(labelLimit=500)
    .add_parameter(*selections, valid_selection)
)
for selection in selections:
    avg_counts_chart = avg_counts_chart.transform_filter(selection)

avg_counts_chart
[9]:

Unique barcodes and other estimates of diversity

Calculate the inverse Simpson index for each sample, as well as the total number of unique barcodes per sample:

[10]:
def inverse_simpson_index(arr):
    """
    Calculate inverse simpson index of sample.
    ISI = 1/D
    D = sum((ni/N)^2)

    """
    return 1 / sum((arr / sum(arr)) ** 2)


unique_bcs_and_ISI = (
    counts.query("valid")
    .query("count > 0")
    .groupby(["library", "sample"], as_index=False)
    .aggregate(
        inv_simpson_index=pd.NamedAgg("count", inverse_simpson_index),
        n_unique_barcodes=pd.NamedAgg("barcode", "nunique"),
    )
    .merge(barcode_runs, on=["library", "sample"], validate="one_to_one")
    .astype({"n_unique_barcodes": float})
    .drop(columns=["fastq_R1", "notes"])
)
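As a quick sanity check on this metric (toy counts, with a hypothetical `isi` helper mirroring the function above): with perfectly even counts the ISI equals the number of barcodes, and with one dominant barcode it approaches 1.

```python
import numpy as np

def isi(counts):
    """Toy inverse Simpson index: 1 / sum((n_i / N)^2)."""
    counts = np.asarray(counts, dtype=float)
    return 1 / np.sum((counts / counts.sum()) ** 2)

# four perfectly even barcodes: ISI equals the number of barcodes
assert np.isclose(isi([5, 5, 5, 5]), 4)

# one overwhelmingly dominant barcode: ISI is close to 1
assert isi([1000, 1, 1, 1]) < 1.01
```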

Plot the inverse Simpson index for each sample as an interactive plot. The number of unique barcodes is shown when mousing over a sample:

[11]:
unique_barcodes_chart = (
    alt.Chart(unique_bcs_and_ISI)
    .encode(
        y=alt.Y("library_sample", title=None),
        x=alt.X("inv_simpson_index", title="inverse simpson index"),
        color=alt.Color(
            "sample_type",
            scale=alt.Scale(domain=unique_bcs_and_ISI["sample_type"].unique()),
        ),
        tooltip=[
            alt.Tooltip(c, format=".3g")
            if c in {"inv_simpson_index", "n_unique_barcodes"}
            else c
            for c in unique_bcs_and_ISI.columns
        ],
    )
    .mark_bar()
    .properties(width=275, height=alt.Step(15))
    .configure_axis(labelLimit=500)
    .add_parameter(*selections, valid_selection)
)

for selection in selections:
    unique_barcodes_chart = unique_barcodes_chart.add_parameter(
        selection
    ).transform_filter(selection)

display(unique_barcodes_chart)

Invalid barcodes and possible library-to-library contamination

Look at the top invalid barcodes.

First get all invalid barcodes along with where their counts rank among all (valid and invalid) barcodes for that library / sample:

[12]:
ranked_invalid = (
    counts.assign(
        overall_rank=lambda x: (
            x.groupby(["library", "sample"])["count"]
            .transform("rank", ascending=False, method="first")
            .astype(int)
        )
    )
    .query("not valid")
    .merge(barcode_runs)
    .sort_values(["library", "overall_rank"])
)
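The `groupby` / `transform("rank", ...)` idiom used above assigns each barcode a unique integer rank within its library/sample group, highest count first; a minimal sketch on made-up counts:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sample": ["s1", "s1", "s1", "s2", "s2"],
        "count": [10, 30, 20, 7, 7],
    }
)

# method="first" breaks ties by order of appearance, so ranks are unique integers
df["overall_rank"] = (
    df.groupby("sample")["count"]
    .transform("rank", ascending=False, method="first")
    .astype(int)
)
assert df["overall_rank"].tolist() == [3, 1, 2, 1, 2]
```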

How many invalid barcodes are among the top 10 most abundant? Plot this, and then list the top invalid barcode for any sample with an invalid barcode in the top 10:

[13]:
# get top invalid barcode and number invalid in top 10
invalid_topn = (
    ranked_invalid.groupby(["library_sample", "library", "sample"])
    .aggregate(
        invalid_barcodes_in_top_10=pd.NamedAgg(
            "overall_rank", lambda s: len(s[s <= 10])
        ),
        top_invalid_rank=pd.NamedAgg("overall_rank", "first"),
        top_invalid_barcode=pd.NamedAgg("barcode", "first"),
    )
    .reset_index()
    .merge(barcode_runs.drop(columns=["fastq_R1", "notes"]), how="left")
)

# plot number of invalid barcodes in top 10
topn_chart = (
    alt.Chart(invalid_topn)
    .encode(
        x=alt.X(
            "invalid_barcodes_in_top_10",
            title="invalid barcodes in top 10",
            scale=alt.Scale(domain=(0, 10)),
        ),
        y=alt.Y("library_sample", title=None),
        tooltip=[
            alt.Tooltip(c, format=".3g") if c in {"mean_counts", "n_barcodes"} else c
            for c in invalid_topn.columns
            if c != "library_sample"
        ],
    )
    .mark_bar()
    .properties(width=200, height=alt.Step(15))
    .configure_axis(labelLimit=500)
    .add_parameter(*selections, valid_selection)
)
for selection in selections:
    topn_chart = topn_chart.transform_filter(selection)
display(topn_chart)

# top barcode for any samples with an invalid in top 10
print("\nTop barcode for samples with an invalid barcode in top 10:")
display(
    invalid_topn.query("invalid_barcodes_in_top_10 > 0").reset_index(drop=True)[
        [
            "library",
            "sample",
            "invalid_barcodes_in_top_10",
            "top_invalid_rank",
            "top_invalid_barcode",
        ]
    ]
)

Top barcode for samples with an invalid barcode in top 10:
library sample invalid_barcodes_in_top_10 top_invalid_rank top_invalid_barcode
0 Lib-1 2022-03-25_thaw-1_VSVG_control_1 2 5 AACAGCACTACCAGCA
1 Lib-1 2022-03-25_thaw-1_VSVG_control_2 2 5 AACAGCACTACCAGCA
2 Lib-1 2022-04-13_thaw-1_antibody_CC25.103_184.5_1 2 5 TCGACAGTGGGGCTCA
3 Lib-1 2022-04-13_thaw-1_antibody_CC25.103_184.5_2 1 8 TCGACAGTGGGGCTCA
4 Lib-1 2022-04-13_thaw-1_antibody_CC25.103_369.0_1 2 1 TCGACAGTGGGGCTCA
5 Lib-1 2022-04-13_thaw-1_antibody_CC25.103_369.0_2 1 9 TCGACAGTGGGGCTCA
6 Lib-1 2022-04-13_thaw-1_antibody_CC25.103_61.5_1 2 4 TCGACAGTGGGGCTCA
7 Lib-1 2022-04-13_thaw-1_antibody_CC25.103_61.5_2 1 9 TCGACAGTGGGGCTCA
8 Lib-1 2022-04-13_thaw-1_antibody_CC95.102_111.0_1 6 1 CATGATACTGAAAGTG
9 Lib-1 2022-04-13_thaw-1_antibody_CC95.102_111.0_2 1 7 TCGACAGTGGGGCTCA
10 Lib-1 2022-04-13_thaw-1_antibody_CC95.102_222.0_1 2 1 TCGACAGTGGGGCTCA
11 Lib-1 2022-04-13_thaw-1_antibody_CC95.102_222.0_2 2 1 AACGAACATGTAACTC
12 Lib-1 2022-04-13_thaw-1_antibody_CC95.102_37.0_1 4 1 TCGACAGTGGGGCTCA
13 Lib-1 2022-04-13_thaw-1_antibody_CC95.102_37.0_2 2 7 ATATTTTGTAGTCCAA
14 Lib-1 2022-04-13_thaw-1_antibody_LyCoV-1404_0.654_1 2 1 GCATTACGCATATCGA
15 Lib-1 2022-04-13_thaw-1_antibody_LyCoV-1404_0.654_2 1 7 TCGACAGTGGGGCTCA
16 Lib-1 2022-04-13_thaw-1_antibody_LyCoV-1404_10.464_1 2 5 TTGGTCGACTCAGGTC
17 Lib-1 2022-04-13_thaw-1_antibody_LyCoV-1404_10.464_2 1 9 TCGACAGTGGGGCTCA
18 Lib-1 2022-04-13_thaw-1_antibody_LyCoV-1404_2.616_1 1 10 TCGACAGTGGGGCTCA
19 Lib-1 2022-04-13_thaw-1_antibody_LyCoV-1404_2.616_2 1 9 TCGACAGTGGGGCTCA
20 Lib-1 2022-04-13_thaw-1_no-antibody_control_1 1 3 CACTGAATAACGAATT
21 Lib-1 2022-04-13_thaw-1_no-antibody_control_2 2 3 CACTGAATAACGAATT
22 Lib-2 2022-06-22_thaw-1_VSVG_control_1 2 3 CTCATGTTATTATAGA
23 Lib-2 2022-06-22_thaw-1_antibody_CC67.105_210.0_1 1 10 GATTAAACTGAGATTA
24 Lib-2 2022-06-22_thaw-1_antibody_CC67.105_420.0_1 1 9 GATTAAACTGAGATTA
25 Lib-2 2022-06-22_thaw-1_antibody_CC9.104_272.0_1 2 9 CATGATACTGAAAGTG
26 Lib-2 2022-06-22_thaw-1_antibody_CC95.102_222.0_1 1 9 GATTAAACTGAGATTA
27 Lib-2 2022-06-22_thaw-1_antibody_LyCoV-1404_2.616_1 1 9 TATCAGATATCCCGAA
28 Lib-2 2022-06-22_thaw-1_antibody_NTD_5-7_300.0_1 1 10 CTAGCAGAAACATACA
29 Lib-2 2022-06-22_thaw-1_antibody_SD1_12-19_12.0_1 1 10 CTATTGCCCCGACAGC
30 Lib-2 2022-06-22_thaw-1_antibody_SD1_12-19_192.0_1 1 10 GGAAGCATCACTTGTT
31 Lib-2 2022-06-22_thaw-1_antibody_SD1_12-19_48.0_1 2 9 TCGACAGTGGGGCTCA
32 Lib-3 2022-06-22_thaw-1_antibody_268C.59A_323.2_1 1 10 ATAGGAGTGATAGTTG
33 Lib-3 2022-06-22_thaw-1_antibody_CC25.103_184.5_1 2 9 CATGATACTGAAAGTG
34 Lib-3 2022-06-22_thaw-1_antibody_CC25.43_48.0_1 2 9 TATAAAACAGGCACAT
35 Lib-3 2022-06-22_thaw-1_antibody_CC25.43_96.0_1 2 9 TCGACAGTGGGGCTCA
36 Lib-3 2022-06-22_thaw-1_antibody_CC9.104_68.0_1 1 9 TCGACAGTGGGGCTCA
37 Lib-3 2022-06-22_thaw-1_antibody_NTD_5-7_300.0_1 1 10 ATTCGTGCTTAAAATT
38 Lib-3 2022-06-22_thaw-1_antibody_SD1_12-19_12.0_1 5 1 CATGATACTGAAAGTG
39 Lib-3 2022-06-22_thaw-1_antibody_SD1_12-19_48.0_1 2 9 TCGACAGTGGGGCTCA

Get which libraries each barcode maps to:

[14]:
barcodes_by_library = (
    pd.read_csv(config["codon_variants"])
    .groupby(["barcode", "target"], as_index=False)
    .aggregate(
        libraries_w_barcode=pd.NamedAgg("library", lambda s: ", ".join(s.unique())),
        n_libraries_w_barcode=pd.NamedAgg("library", "nunique"),
    )
)

display(
    barcodes_by_library.groupby(["target", "libraries_w_barcode"]).aggregate(
        n_barcodes=pd.NamedAgg("barcode", "count")
    )
)
n_barcodes
target libraries_w_barcode
gene Lib-1 88547
Lib-1, Lib-2 5786
Lib-1, Lib-3 14
Lib-2 134841
Lib-2, Lib-3 16
Lib-3 125097
neut_standard Lib-1, Lib-2, Lib-3 8

Now look at the overall barcode counts for each sample and see how many map to the expected library versus some other library. Having many barcodes that map to a different library can indicate contamination, unless there is substantial expected overlap between the two libraries (which would be indicated in the table above):

[15]:
counts_by_library = (
    counts.merge(barcodes_by_library, on="barcode", validate="many_to_one")
    .groupby(
        ["library", "sample", "libraries_w_barcode", "target", "n_libraries_w_barcode"],
        as_index=False,
    )
    .aggregate(n_counts=pd.NamedAgg("count", "sum"))
    .assign(
        frac_counts=lambda x: x["n_counts"]
        / x.groupby(["library", "sample"])["n_counts"].transform("sum"),
    )
    .merge(barcode_runs)
    .assign(
        category=lambda x: x["libraries_w_barcode"].where(
            x["target"] == "gene", x["target"]
        )
    )
    .drop(
        columns=[
            "fastq_R1",
            "notes",
            "antibody_concentration",
            "target",
            "libraries_w_barcode",
        ]
    )
)
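The `frac_counts` column above normalizes counts within each library/sample via a grouped `transform("sum")`; the same pattern on made-up numbers:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sample": ["s1", "s1", "s2", "s2"],
        "n_counts": [30, 10, 5, 15],
    }
)

# divide each row's count by the total for its group
df["frac_counts"] = df["n_counts"] / df.groupby("sample")["n_counts"].transform("sum")
assert df["frac_counts"].tolist() == [0.75, 0.25, 0.25, 0.75]
```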

Plot which libraries overall barcode counts map to for each sample:

[16]:
ordered_cats = (
    counts_by_library.sort_values(["n_libraries_w_barcode", "category"])["category"]
    .unique()
    .tolist()
)

category_selection = alt.selection_point(fields=["category"], bind="legend")

counts_by_library_chart = (
    alt.Chart(
        counts_by_library.assign(
            order=lambda x: x["category"].map(lambda s: ordered_cats.index(s))
        )
    )
    .encode(
        x=alt.X("frac_counts", scale=alt.Scale(domain=[0, 1])),
        y=alt.Y("library_sample", title=None),
        color=alt.Color("category", scale=alt.Scale(domain=ordered_cats)),
        order="order",
        tooltip=[
            alt.Tooltip(c, format=".2g") if c in {"n_counts", "frac_counts"} else c
            for c in counts_by_library.columns
            if c not in {"library_sample"}
        ],
    )
    .mark_bar()
    .properties(width=250, height=alt.Step(15))
    .configure_axis(labelLimit=500)
    .add_parameter(*selections, category_selection)
    .transform_filter(category_selection)
)
for selection in selections:
    counts_by_library_chart = counts_by_library_chart.transform_filter(selection)

counts_by_library_chart
[16]:

Get CodonVariantTable for valid variant counts

Get the variant counts for samples with sufficient valid barcode counts, and determine if there are any samples with insufficient counts that don't have exclude_after_counts specified:

[17]:
print(f"Requiring {config['min_avg_counts']=} average counts per variant\n")

valid_counts = counts.query("valid").drop(columns="valid")

avg_counts = (
    valid_counts.groupby(["library", "sample"], as_index=False)
    .aggregate(avg_counts=pd.NamedAgg("count", "mean"))
    .assign(adequate_counts=lambda x: x["avg_counts"] >= config["min_avg_counts"])
    .merge(barcode_runs, how="left", validate="one_to_one")
    .drop(columns=["fastq_R1", "notes"])
)

adequate_counts_selection = alt.selection_point(
    fields=["adequate_counts"],
    bind=alt.binding_select(
        options=[None] + avg_counts["adequate_counts"].unique().tolist(),
        labels=["all"] + [str(x) for x in avg_counts["adequate_counts"].unique()],
        name="adequate counts",
    ),
)

avg_counts_chart = (
    alt.Chart(avg_counts)
    .encode(
        x=alt.X("avg_counts"),
        y=alt.Y("library_sample", title=None),
        color="exclude_after_counts",
        tooltip=[
            alt.Tooltip(c, format=".2g") if c in {"avg_counts"} else c
            for c in avg_counts.columns
            if c not in {"library_sample"}
        ],
    )
    .mark_bar()
    .properties(width=250, height=alt.Step(15))
    .configure_axis(labelLimit=500)
    .add_parameter(adequate_counts_selection, *selections)
    .transform_filter(adequate_counts_selection)
)
for selection in selections:
    avg_counts_chart = avg_counts_chart.transform_filter(selection)
display(avg_counts_chart)
print(f"Saving this chart to {config['variant_avg_counts_plot']}")
avg_counts_chart.save(config["variant_avg_counts_plot"])

print(f"Writing average counts per variant to {config['variant_avg_counts_csv']}")
(
    avg_counts.assign(min_avg_counts=config["min_avg_counts"])[
        [
            "library",
            "sample",
            "avg_counts",
            "min_avg_counts",
            "adequate_counts",
            "exclude_after_counts",
        ]
    ].to_csv(config["variant_avg_counts_csv"], index=False, float_format="%.1f")
)

insufficient_counts = avg_counts.query(
    "(not adequate_counts) & (exclude_after_counts != 'yes')"
)
if len(insufficient_counts):
    print(
        f"\n{len(insufficient_counts)} have insufficient counts despite not "
        "having `exclude_after_counts` specified in `barcode_runs`"
    )
    display(insufficient_counts[["library", "sample", "avg_counts"]])
Requiring config['min_avg_counts']=20 average counts per variant

Saving this chart to results/variant_counts/avg_counts_per_variant.html
Writing average counts per variant to results/variant_counts/avg_counts_per_variant.csv

Now create a CodonVariantTable for the samples not specified for exclusion:

[18]:
geneseq = str(Bio.SeqIO.read(config["gene_sequence_codon"], "fasta").seq)

variants = dms_variants.codonvarianttable.CodonVariantTable(
    barcode_variant_file=config["codon_variants"],
    geneseq=geneseq,
    allowgaps=True,
    substitutions_are_codon=True,
    primary_target="gene",
    substitutions_col="codon_substitutions",
)

variants.add_sample_counts_df(
    valid_counts.merge(
        barcode_runs[["library", "sample", "exclude_after_counts"]],
        validate="many_to_one",
        how="left",
        on=["library", "sample"],
    ).query("exclude_after_counts == 'no'")
)

Average mutations per variant

Interactive plot of average codon mutations per variant:

[19]:
avg_muts = (
    variants.numCodonMutsByType(
        variant_type="all",
        libraries=variants.libraries,
    )
    .merge(barcode_runs, validate="many_to_one")
    .drop(
        columns=[
            "num_muts_count",
            "count",
            "antibody_concentration",
            "fastq_R1",
            "notes",
            "exclude_after_counts",
        ]
    )
)
[20]:
mut_types_sort = avg_muts["mutation_type"].unique().tolist()

mut_type_selection = alt.selection_point(fields=["mutation_type"], bind="legend")

# remove the exclude_after_counts selection, since that column was dropped above
selections = [
    s for (s, c) in zip(selections, selection_cols) if c != "exclude_after_counts"
]

avg_muts_chart = (
    alt.Chart(
        avg_muts.assign(
            order=lambda x: x["mutation_type"].map(lambda m: mut_types_sort.index(m))
        ),
    )
    .encode(
        x=alt.X(
            "number",
            title="average mutations per variant",
            axis=alt.Axis(grid=False),
        ),
        y=alt.Y(
            "library_sample",
            title=None,
        ),
        color=alt.Color(
            "mutation_type",
            scale=alt.Scale(domain=mut_types_sort),
        ),
        order=alt.Order("order", sort="descending"),
        tooltip=[
            alt.Tooltip(c, format=".3g") if c in ["count", "number"] else c
            for c in avg_muts.columns
            if c != "library_sample"
        ],
    )
    .mark_bar()
    .properties(height=alt.Step(13), width=300)
    .configure_axis(labelLimit=500)
    .add_parameter(mut_type_selection, *selections)
    .transform_filter(mut_type_selection)
)

for selection in selections:
    avg_muts_chart = avg_muts_chart.transform_filter(selection)

avg_muts_chart
[20]:

Mutation frequency across the gene

Plot frequencies of mutations across the gene.

First, get the frequencies of mutations at each site:

[21]:
# read reference site numbering
site_numbering_map = pd.read_csv(config["site_numbering_map"])

# frequencies of mutations at each site
site_freqs = (
    variants.mutCounts(
        variant_type="all",
        mut_type="aa",
        libraries=variants.libraries,
    )
    .query("count > 0")
    .rename(columns={"site": "sequential_site"})
    .merge(
        site_numbering_map,
        how="left",
        on="sequential_site",
        validate="many_to_one",
    )
    .assign(
        wildtype=lambda x: x["mutation"].str[0],
        reference_site=lambda x: (
            x["reference_site"]
            if all(x["reference_site"] == x["reference_site"].astype(str))
            else x["reference_site"].astype("Int64")
        ),
    )
    .groupby(
        ["library", "sample", "sequential_site", "reference_site", "wildtype"],
        observed=True,
        as_index=False,
    )
    .aggregate(count=pd.NamedAgg("count", "sum"))
    .merge(
        variants.n_variants_df(
            libraries=variants.libraries, primary_target_only=True
        ).rename(columns={"count": "n_variants"})
    )
    .assign(percent=lambda x: 100 * x["count"] / x["n_variants"])
    .merge(barcode_runs[["library_sample", "library", "sample"]])
    .drop(columns=["n_variants", "count", "library", "sample"])
)

Now, to make these plots smaller, we pivot the data to wide form and move most of the sample metadata to a separate lookup table:

[22]:
# encode samples as integers
library_sample_code = {
    y: str(x) for x, y in enumerate(site_freqs["library_sample"].unique())
}

# wide version of site freqs
site_freqs_wide = (
    site_freqs.assign(
        library_sample_code=lambda x: x["library_sample"].map(library_sample_code)
    )
    .pivot_table(
        index=["sequential_site", "reference_site", "wildtype"],
        values="percent",
        columns="library_sample_code",
        fill_value=0,
    )
    .reset_index()
)

# lookup data frame that maps sample codes to other information
library_sample_lookup = barcode_runs.assign(
    library_sample_code=lambda x: x["library_sample"].map(library_sample_code)
).query("library_sample_code.notnull()")
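The wide pivot above yields one row per site and one column per encoded sample, with `fill_value=0` for site/sample combinations that have no mutations; a toy version:

```python
import pandas as pd

# hypothetical long-form site frequencies
long_df = pd.DataFrame(
    {
        "site": [1, 1, 2],
        "sample_code": ["0", "1", "0"],
        "percent": [5.0, 2.5, 1.0],
    }
)

wide = long_df.pivot_table(
    index="site", columns="sample_code", values="percent", fill_value=0
).reset_index()

# site 2 has no entry for sample "1", so it is filled with 0
assert wide.loc[wide["site"] == 2, "1"].item() == 0
```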

Now plot the frequencies:

[23]:
zoom_brush = alt.selection_interval(
    encodings=["x"],
    mark=alt.BrushConfig(stroke="black", strokeWidth=2),
)

zoom_bar = (
    alt.Chart(site_freqs[["sequential_site"]].drop_duplicates())
    .mark_rect(color="lightgrey")
    .encode(
        x=alt.X("sequential_site", title=None, scale=alt.Scale(nice=False, zero=False))
    )
    .add_parameter(zoom_brush)
    .properties(width=550, height=15, title="site zoom bar")
)

site_freqs_base = (
    alt.Chart(site_freqs_wide)
    .encode(
        x=alt.X("sequential_site", scale=alt.Scale(nice=False, zero=False)),
        y=alt.Y("percent:Q", title="% variants with mutation"),
        color=alt.Color("sample_type:N", legend=alt.Legend(orient="top")),
        tooltip=[
            alt.Tooltip(c, format=".3g") if c in {"percent:Q"} else c
            for c in ["sequential_site", "reference_site", "wildtype", "percent:Q"]
        ],
    )
    .properties(height=100, width=275)
)

site_freqs_chart = (
    alt.layer(
        site_freqs_base.mark_point(filled=True),
        site_freqs_base.mark_line(size=0.5),
        data=site_freqs_wide,
    )
    .transform_fold(
        list(library_sample_code.values()), ["library_sample_code", "percent"]
    )
    .transform_lookup(
        lookup="library_sample_code",
        from_=alt.LookupData(
            data=library_sample_lookup,
            key="library_sample_code",
            fields=[
                "date",
                "virus_batch",
                "library",
                "sample_type",
                "antibody",
                "sample",
                "library_sample",
            ],
        ),
    )
    .facet(facet=alt.Facet("library_sample:N", title=None), columns=3)
    .add_parameter(zoom_brush)
    .transform_filter(zoom_brush)
    .resolve_scale(y="independent")
)

site_freqs_zoom_chart = (zoom_bar & site_freqs_chart).configure_axis(grid=False)

site_freqs_zoom_chart
[23]:

How much do individual variants dominate counts?

We look to see which individual variants dominate the counts, doing this separately for each target:

[24]:
variant_counts = (
    variants.variant_count_df[
        ["library", "sample", "target", "barcode", "count", "aa_substitutions"]
    ]
    .merge(
        barcode_runs.drop(
            columns=[
                "fastq_R1",
                "notes",
                "antibody_concentration",
                "exclude_after_counts",
            ]
        )
    )
    .assign(
        percent=lambda x: 100
        * x["count"]
        / x.groupby(["library_sample", "target"])["count"].transform("sum")
    )
    .sort_values("percent", ascending=False)
)

Get the top 25 variants, and the 10th and 90th percentiles:

[25]:
top_n = 25

variant_counts_top_n = (
    variant_counts.groupby(["library_sample", "target"])
    .head(n=top_n)
    .merge(
        (
            variant_counts.groupby(
                ["library_sample", "target"], as_index=False
            ).aggregate(
                percentile_10=pd.NamedAgg("percent", lambda s: s.quantile(0.1)),
                percentile_90=pd.NamedAgg("percent", lambda s: s.quantile(0.9)),
                min_percent=pd.NamedAgg("percent", "min"),
                max_percent=pd.NamedAgg("percent", "max"),
            )
        ),
        validate="many_to_one",
    )
)

variant_counts_top_n.head()
[25]:
library sample target barcode count aa_substitutions date virus_batch sample_type antibody replicate library_sample percent percentile_10 percentile_90 min_percent max_percent
0 Lib-1 2022-03-25_thaw-1_VSVG_control_1 neut_standard GCGATTCACGCGTTGG 3 neut_standard 2022-03-25 thaw-1 VSVG_control NaN 1 Lib-1_2022-03-25_thaw-1_VSVG_control_1 42.857143 0.0 42.857143 0.0 42.857143
1 Lib-1 2022-03-25_thaw-1_VSVG_control_1 neut_standard ATATAGACACGTGACC 3 neut_standard 2022-03-25 thaw-1 VSVG_control NaN 1 Lib-1_2022-03-25_thaw-1_VSVG_control_1 42.857143 0.0 42.857143 0.0 42.857143
2 Lib-1 2022-03-25_thaw-1_VSVG_control_1 neut_standard CCATCTAGTGGCTAGG 1 neut_standard 2022-03-25 thaw-1 VSVG_control NaN 1 Lib-1_2022-03-25_thaw-1_VSVG_control_1 14.285714 0.0 42.857143 0.0 42.857143
3 Lib-1 2022-03-25_thaw-1_VSVG_control_1 neut_standard GTGCAGTAGTAAAGTA 0 neut_standard 2022-03-25 thaw-1 VSVG_control NaN 1 Lib-1_2022-03-25_thaw-1_VSVG_control_1 0.000000 0.0 42.857143 0.0 42.857143
4 Lib-1 2022-03-25_thaw-1_VSVG_control_1 neut_standard TACCCATGGATACGAT 0 neut_standard 2022-03-25 thaw-1 VSVG_control NaN 1 Lib-1_2022-03-25_thaw-1_VSVG_control_1 0.000000 0.0 42.857143 0.0 42.857143
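The 10th and 90th percentiles above come from grouped `quantile` calls with pandas' default linear interpolation between order statistics; for example, on eleven evenly spaced made-up values:

```python
import pandas as pd

# eleven evenly spaced "percent" values, 0 through 10 (made-up numbers)
percents = pd.Series(range(11), dtype=float)

# pandas interpolates linearly between order statistics by default
p10 = percents.quantile(0.1)
p90 = percents.quantile(0.9)
assert abs(p10 - 1.0) < 1e-9 and abs(p90 - 9.0) < 1e-9
```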

Now plot these data. The points show the top variants and can be moused over for details. The bars show the 10th to 90th percentiles, and the lines show the full span. Note that these are the percent frequencies of variants, not the fractional frequencies. Also note that the x-axis uses a symlog scale:

[26]:
variant_selector = alt.selection_point(
    on="mouseover",
    empty="none",
    fields=["barcode", "library", "target"],
)

target_selection = alt.selection_point(
    fields=["target"],
    bind=alt.binding_select(
        options=variant_counts_top_n["target"].unique(),
        name="target",
    ),
    value=[{"target": "gene"}],
)

variant_counts_top_n_base = (
    alt.Chart(variant_counts_top_n)
    .encode(
        y=alt.Y("library_sample", title=None),
        color=alt.Color(
            "sample_type",
            scale=alt.Scale(domain=variant_counts_top_n["sample_type"].unique()),
        ),
    )
    .mark_point(filled=True)
)

variant_counts_top_n_points = variant_counts_top_n_base.encode(
    x=alt.X("percent", title="percent of library", scale=alt.Scale(type="symlog")),
    tooltip=[
        alt.Tooltip(c, format=".3g") if c.startswith("percent") else c
        for c in variant_counts_top_n.columns
        if c not in ["library_sample", "min_percent", "max_percent"]
    ],
    opacity=alt.condition(variant_selector, alt.value(1), alt.value(0.5)),
    strokeWidth=alt.condition(variant_selector, alt.value(2), alt.value(0)),
    size=alt.condition(variant_selector, alt.value(40), alt.value(25)),
).mark_point(filled=True, stroke="black")

variant_counts_top_n_bars = variant_counts_top_n_base.encode(
    alt.X("percentile_10"),
    alt.X2("percentile_90"),
).mark_bar(size=11)

variant_counts_top_n_rule = variant_counts_top_n_base.encode(
    alt.X("min_percent"),
    alt.X2("max_percent"),
).mark_rule()

variant_counts_top_n_chart = (
    (
        variant_counts_top_n_points
        + variant_counts_top_n_bars
        + variant_counts_top_n_rule
    )
    .configure_axis(labelLimit=500)
    .add_parameter(variant_selector, target_selection, *selections)
    .transform_filter(target_selection)
    .properties(height=alt.Step(15), width=275)
)
for selection in selections:
    variant_counts_top_n_chart = variant_counts_top_n_chart.transform_filter(selection)

variant_counts_top_n_chart
[26]:

Check counts are consistent with variant count files

Verify that the variant counts analyzed here match the per-library/sample CSV files written by the pipeline:
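The check in the next cell relies on `pd.testing.assert_frame_equal`, which raises an `AssertionError` on any mismatch and is sensitive to row order; that is why the in-memory counts are sorted the same way the pipeline sorts the CSVs (descending count, then barcode) before comparing. A minimal sketch with hypothetical toy frames:

```python
import pandas as pd

# Hypothetical toy frames: the "on-disk" frame is sorted by descending count
# then barcode, so the in-memory frame must be sorted the same way first.
expected = pd.DataFrame({"barcode": ["AAA", "CCC"], "count": [10, 3]})
observed = pd.DataFrame({"barcode": ["CCC", "AAA"], "count": [3, 10]})

sorted_obs = (
    observed.sort_values(["count", "barcode"], ascending=[False, True])
    .reset_index(drop=True)
)

# raises AssertionError if values, dtypes, or row order differ
pd.testing.assert_frame_equal(expected, sorted_obs)
```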

[27]:
for library_sample, df in variants.variant_count_df.merge(
    barcode_runs,
    on=["library", "sample"],
    how="left",
    validate="many_to_one",
).groupby("library_sample"):
    f = os.path.join(config["variant_counts_dir"], f"{library_sample}.csv")
    print(f"Checking counts in {f}")
    assert os.path.isfile(f), f"cannot find {f}"
    pd.testing.assert_frame_equal(
        pd.read_csv(f, na_filter=False),  # keep empty substitution strings as "" rather than NaN
        df[
            [
                "barcode",
                "count",
                "codon_substitutions",
                "aa_substitutions",
                "variant_call_support",
            ]
        ]
        .sort_values(["count", "barcode"], ascending=[False, True])
        .reset_index(drop=True),
    )
Checking counts in results/variant_counts/Lib-1_2022-03-25_thaw-1_VSVG_control_1.csv
Checking counts in results/variant_counts/Lib-1_2022-03-25_thaw-1_VSVG_control_2.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_CC25.103_184.5_1.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_CC25.103_184.5_2.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_CC25.103_369.0_1.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_CC25.103_369.0_2.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_CC25.103_61.5_1.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_CC25.103_61.5_2.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_CC95.102_111.0_1.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_CC95.102_111.0_2.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_CC95.102_222.0_1.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_CC95.102_222.0_2.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_CC95.102_37.0_1.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_CC95.102_37.0_2.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_LyCoV-1404_0.654_1.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_LyCoV-1404_0.654_2.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_LyCoV-1404_10.464_1.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_LyCoV-1404_10.464_2.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_LyCoV-1404_2.616_1.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_antibody_LyCoV-1404_2.616_2.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_no-antibody_control_1.csv
Checking counts in results/variant_counts/Lib-1_2022-04-13_thaw-1_no-antibody_control_2.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_VSVG_control_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_268C.59A_323.2_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_268C.59A_80.8_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_CC25.103_184.5_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_CC25.103_369.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_CC25.103_61.5_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_CC25.43_12.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_CC25.43_48.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_CC25.43_96.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_CC67.105_210.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_CC67.105_420.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_CC67.105_52.5_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_CC9.104_272.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_CC9.104_544.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_CC9.104_68.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_CC95.102_111.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_CC95.102_222.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_CC95.102_37.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_LyCoV-1404_0.654_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_LyCoV-1404_10.464_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_LyCoV-1404_2.616_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_NTD_5-7_150.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_NTD_5-7_300.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_SD1_12-19_12.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_SD1_12-19_192.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_antibody_SD1_12-19_48.0_1.csv
Checking counts in results/variant_counts/Lib-2_2022-06-22_thaw-1_no-antibody_control_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_VSVG_control_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_268C.59A_20.2_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_268C.59A_323.2_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_268C.59A_80.8_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_CC25.103_184.5_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_CC25.103_369.0_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_CC25.103_61.5_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_CC25.43_12.0_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_CC25.43_48.0_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_CC25.43_96.0_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_CC67.105_210.0_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_CC67.105_420.0_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_CC67.105_52.5_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_CC9.104_272.0_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_CC9.104_544.0_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_CC9.104_68.0_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_LyCoV-1404_0.654_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_LyCoV-1404_10.464_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_LyCoV-1404_2.616_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_NTD_5-7_150.0_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_NTD_5-7_300.0_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_SD1_12-19_12.0_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_SD1_12-19_192.0_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_antibody_SD1_12-19_48.0_1.csv
Checking counts in results/variant_counts/Lib-3_2022-06-22_thaw-1_no-antibody_control_1.csv