Plot Cell Entry Effects of RSV F mutations observed in natural sequences relative to the DMS strain "RSV Long"¶
This notebook analyzes the cell entry effects of naturally occurring RSV F protein sequence variation identified from sequence alignments.
Analysis Overview¶
This analysis identifies mutations in natural RSV F protein sequences by comparing them to the RSV_Long_F strain used in deep mutational scanning (DMS) experiments, then reports the DMS-measured cell entry effects of these mutations.
Effect calculation: For each mutation observed in natural sequences, we report the DMS-measured cell entry effect relative to the DMS strain's amino acid at that position.
For example, if the DMS strain has N at position 100 and a natural sequence has K, we report the DMS effect of the N→K mutation.
In [1]:
# Parameters - will be overridden by papermill
strain = "RSV-A"
variations_with_effects_file = "results/sequence_variation/RSV-A_sequence_variations_with_effects.csv"
output_html = "results/sequence_variation/RSV-A_sequence_variation_effects.html"
analysis_description = "Differences from DMS Long Strain"
mutation_identified_relative_to = "RSV_Long_F"
effects_calculated_relative_to = "RSV_Long_F"
In [2]:
# Parameters
strain = "RSV-A"
variations_with_effects_file = (
"results/sequence_variation/RSV-A_dms_ref_sequence_variations_with_effects.csv"
)
output_html = "results/sequence_variation/RSV-A_dms_ref_sequence_variation_effects.html"
analysis_description = "Differences from Strain Used in Deep Mutational Scanning"
mutation_identified_relative_to = "RSV_Long_F"
effects_calculated_relative_to = "RSV_Long_F"
In [3]:
import pandas as pd
import altair as alt
from Bio import SeqIO
_ = alt.data_transformers.disable_max_rows()
Configuration¶
In [4]:
print(f"Analyzing strain: {strain}")
print(f"Analysis type: {analysis_description}")
print(f"Variations with effects: {variations_with_effects_file}")
print(f"Differences identified relative to: {mutation_identified_relative_to}")
print(f"Effects calculated relative to: {effects_calculated_relative_to}")
print(f"Output HTML: {output_html}")
Analyzing strain: RSV-A Analysis type: Differences from Strain Used in Deep Mutational Scanning Variations with effects: results/sequence_variation/RSV-A_dms_ref_sequence_variations_with_effects.csv Differences identified relative to: RSV_Long_F Effects calculated relative to: RSV_Long_F Output HTML: results/sequence_variation/RSV-A_dms_ref_sequence_variation_effects.html
Load data¶
In [5]:
# Load sequence variations with effects
with_effects = pd.read_csv(variations_with_effects_file)
print(f"Loaded {len(with_effects)} {strain} sequence variations")
print(f"At {with_effects['site'].nunique()} sites")
# Separate those with and without DMS data
with_dms = with_effects.dropna(subset=['cell entry'])
without_dms = with_effects[with_effects['cell entry'].isna()]
print(f"\nVariations with DMS data: {len(with_dms)}")
print(f"Variations without DMS data: {len(without_dms)}")
print(f"Coverage: {len(with_dms) / len(with_effects) * 100:.1f}% of variations have DMS data")
display(with_effects.head())
# Also load all possible DMS mutations for comparison in plots
# Extract cell_entry_file path from the pipeline
import os
cell_entry_file = "results/summaries/cell_entry.csv"
cell_entry = pd.read_csv(cell_entry_file)
print(f"\nLoaded {len(cell_entry)} total possible mutations from DMS data for comparison")
Loaded 234 RSV-A sequence variations At 147 sites Variations with DMS data: 148 Variations without DMS data: 86 Coverage: 63.2% of variations have DMS data
| site | wildtype | mutant | mutation_type | mutation_count | cell entry | sequential_site | region | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | E | D | E→D | 219 | NaN | NaN | NaN |
| 1 | 3 | L | S | L→S | 21 | NaN | NaN | NaN |
| 2 | 3 | L | F | L→F | 4 | NaN | NaN | NaN |
| 3 | 4 | P | L | P→L | 34 | NaN | NaN | NaN |
| 4 | 4 | P | S | P→S | 5 | NaN | NaN | NaN |
Loaded 9800 total possible mutations from DMS data for comparison
Summary statistics¶
In [6]:
print(f"Cell Entry Effect Statistics for {strain} Sequence Variations")
print(f"Analysis: {analysis_description}")
print("=" * 60)
# Filter to only variations with DMS data for statistics
with_dms = with_effects.dropna(subset=['cell entry'])
print(f"Statistics based on {len(with_dms)} variations with DMS data:")
print(f"Mean effect: {with_dms['cell entry'].mean():.3f}")
print(f"Median effect: {with_dms['cell entry'].median():.3f}")
print(f"Std deviation: {with_dms['cell entry'].std():.3f}")
print(f"\nEffect range: {with_dms['cell entry'].min():.3f} to {with_dms['cell entry'].max():.3f}")
# Count by effect category
print("\nEffect Categories:")
print(f" Beneficial (> 0.5): {(with_dms['cell entry'] > 0.5).sum()} ({(with_dms['cell entry'] > 0.5).sum() / len(with_dms) * 100:.1f}%)")
print(f" Neutral (-0.5 to 0.5): {((with_dms['cell entry'] >= -0.5) & (with_dms['cell entry'] <= 0.5)).sum()} ({((with_dms['cell entry'] >= -0.5) & (with_dms['cell entry'] <= 0.5)).sum() / len(with_dms) * 100:.1f}%)")
print(f" Deleterious (< -0.5): {(with_dms['cell entry'] < -0.5).sum()} ({(with_dms['cell entry'] < -0.5).sum() / len(with_dms) * 100:.1f}%)")
# Most tolerated variations
print("\nMost Tolerated Sequence Variations (highest cell entry):")
display(with_dms.nlargest(10, 'cell entry')[['site', 'mutation_type', 'cell entry', 'mutation_count', 'region']])
# Most deleterious variations
print("\nMost Deleterious Sequence Variations (lowest cell entry):")
display(with_dms.nsmallest(10, 'cell entry')[['site', 'mutation_type', 'cell entry', 'mutation_count', 'region']])
Cell Entry Effect Statistics for RSV-A Sequence Variations Analysis: Differences from Strain Used in Deep Mutational Scanning ============================================================ Statistics based on 148 variations with DMS data: Mean effect: -0.204 Median effect: -0.104 Std deviation: 0.652 Effect range: -3.351 to 1.399 Effect Categories: Beneficial (> 0.5): 13 (8.8%) Neutral (-0.5 to 0.5): 95 (64.2%) Deleterious (< -0.5): 40 (27.0%) Most Tolerated Sequence Variations (highest cell entry):
| site | mutation_type | cell entry | mutation_count | region | |
|---|---|---|---|---|---|
| 197 | 516 | V→I | 1.3990 | 4 | none |
| 70 | 88 | N→T | 1.1530 | 7 | region_0 |
| 201 | 519 | G→V | 0.9563 | 8 | none |
| 72 | 94 | Q→E | 0.9339 | 19 | region_0 |
| 149 | 228 | N→S | 0.7754 | 5 | none |
| 150 | 228 | N→T | 0.7464 | 5 | none |
| 107 | 120 | N→Y | 0.6935 | 4 | none |
| 155 | 276 | N→S | 0.6924 | 3917 | region_II |
| 109 | 121 | N→S | 0.5973 | 7 | none |
| 74 | 100 | T→S | 0.5721 | 7 | none |
Most Deleterious Sequence Variations (lowest cell entry):
| site | mutation_type | cell entry | mutation_count | region | |
|---|---|---|---|---|---|
| 73 | 99 | S→N | -3.351 | 5 | none |
| 142 | 171 | L→P | -2.826 | 6 | region_V |
| 77 | 101 | T→P | -2.730 | 5036 | none |
| 181 | 462 | Q→R | -1.461 | 3 | region_IV |
| 176 | 385 | D→N | -1.214 | 3 | region_I |
| 80 | 102 | A→V | -1.188 | 9 | none |
| 79 | 101 | T→S | -1.182 | 3 | none |
| 194 | 510 | D→N | -1.124 | 3 | none |
| 148 | 213 | R→S | -1.091 | 5027 | region_0 |
| 203 | 524 | N→S | -1.062 | 8 | none |
Create interactive plots with Altair¶
In [7]:
# Plot 1: Distribution histogram with all possible mutations and observed variations on separate y-axes
# Filter to only variations with DMS data for plotting
with_dms = with_effects.dropna(subset=['cell entry'])
# Define shared bin parameters with explicit step size to ensure both histograms have same bin widths
# Using a fixed step size guarantees identical bin boundaries for both datasets
bin_params = alt.Bin(step=0.2)
# Background histogram: All possible DMS mutations (blue) - right y-axis
hist_all_mutations = alt.Chart(cell_entry).mark_bar(color='blue', opacity=0.6).encode(
alt.X('cell entry:Q', bin=bin_params, title='Cell Entry Effect', axis=alt.Axis(values=[-6, -5, -4, -3, -2, -1, 0, 1, 2, 3])),
alt.Y('count()', title='All possible mutations', axis=alt.Axis(titleColor='blue')),
tooltip=['count()']
)
# Foreground histogram: Observed sequence variations (green) - left y-axis
hist_observed = alt.Chart(with_dms).mark_bar(color='green', opacity=0.7).encode(
alt.X('cell entry:Q', bin=bin_params, title='Cell Entry Effect', axis=alt.Axis(values=[-6, -5, -4, -3, -2, -1, 0, 1, 2, 3])),
alt.Y('count()', title='Mutations in natural sequences', axis=alt.Axis(titleColor='green')),
tooltip=['count()']
)
# Add zero reference line
rule_zero = alt.Chart(pd.DataFrame({'x': [0]})).mark_rule(color='red', strokeDash=[5, 5], strokeWidth=0.5).encode(x='x:Q')
# Layer histograms with independent y-axes and resolve x scale to shared
chart1 = alt.layer(
hist_all_mutations,
hist_observed,
rule_zero
).resolve_scale(
y='independent',
x='shared'
).properties(
width=600,
height=400,
title={
"text": f'{strain} {analysis_description}',
"subtitle": f'Distribution of Cell Entry Effects | Blue: All Possible Mutations (n={len(cell_entry)}) | Green: Mutations in Natural Sequences (n={len(with_dms)})'
}
)
chart1
Out[7]:
In [8]:
# Plot 2: Scatter plot by position
# Use only variations with DMS data
# Calculate total number of sequences for percentage calculation
# Hard-code values based on strain (from FASTA file sequence counts)
total_sequences = 5045 if strain == "RSV-A" else 5156 # RSV-A: 5045, RSV-B: 5156
with_dms['mutation_percent'] = (with_dms['mutation_count'] / total_sequences) * 100
chart2 = alt.Chart(with_dms).mark_circle(opacity=0.7).encode(
x=alt.X('site:Q', title='Site', axis=alt.Axis(grid=True, gridWidth=0.25, gridColor='gray', values=[1, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500])),
y=alt.Y('cell entry:Q', title='Cell Entry Effect', axis=alt.Axis(grid=True, gridWidth=0.25, gridColor='gray')),
size=alt.Size(
'mutation_percent:Q',
title=['Percent of', 'natural sequences', 'with mutation'],
scale=alt.Scale(range=[20, 400]),
legend=alt.Legend(
orient='right',
titleLimit=150,
titleOrient='top',
offset=10
)
),
tooltip=['site', 'mutation_type', 'wildtype', 'mutant', 'mutation_count', 'mutation_percent', 'cell entry']
).properties(
width=700,
height=400,
title={
"text": f'{strain}',
"subtitle": f'{analysis_description} - Cell Entry Effects by Position'
}
)
# Add zero line
zero_line = alt.Chart(pd.DataFrame({'y': [0]})).mark_rule(color='red', strokeDash=[5, 5], strokeWidth=0.5).encode(y='y:Q')
chart2 = chart2 + zero_line
chart2
/tmp/ipykernel_25925/2123015846.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy with_dms['mutation_percent'] = (with_dms['mutation_count'] / total_sequences) * 100
Out[8]:
Save combined chart to HTML¶
In [9]:
# Combine all charts vertically
combined_chart = alt.vconcat(
chart1,
chart2,
spacing=40
).configure_view(
strokeWidth=0
).configure_axis(
labelFontSize=22,
titleFontSize=28,
domainWidth=0.5,
tickWidth=0.5,
gridWidth=0.5,
tickSize=4,
tickColor='black'
).configure_title(
fontSize=30,
anchor='middle',
subtitleFontSize=20
).configure_legend(
labelFontSize=18,
titleFontSize=20,
gradientStrokeWidth=0.5
)
# Save to HTML
combined_chart.save(output_html)
print(f"Saved interactive plots to: {output_html}")
combined_chart
Saved interactive plots to: results/sequence_variation/RSV-A_dms_ref_sequence_variation_effects.html
Out[9]: