Analyze PacBio CCSs¶
This notebook analyzes the PacBio CCSs that have been previously aligned and parsed with alignparse.
Import Python modules:
In [1]:
import os
import alignparse.targets
import altair as alt
import pandas as pd
import yaml
Get configuration information:
In [2]:
# If you are running notebook interactively rather than in pipeline that handles
# working directories, you may have to first `os.chdir` to appropriate directory.
with open("config.yaml") as f:
config = yaml.safe_load(f)
Read in the PacBio runs:
In [3]:
pacbio_runs = (
pd.read_csv(config["pacbio_runs"])
.assign(subdir=lambda x: "results/process_ccs/" + x["run"])
.rename(columns={"run": "pacbioRun"})
)
pacbio_runs
Out[3]:
library | pacbioRun | fastq | subdir | |
---|---|---|---|---|
0 | Lib1 | Lib1-230612 | /fh/fast/bloom_j/computational_notebooks/bdado... | results/process_ccs/Lib1-230612 |
1 | Lib1 | Lib1-230613 | /fh/fast/bloom_j/computational_notebooks/bdado... | results/process_ccs/Lib1-230613 |
2 | Lib2 | Lib2-230612 | /fh/fast/bloom_j/computational_notebooks/bdado... | results/process_ccs/Lib2-230612 |
3 | Lib2 | Lib2-230613 | /fh/fast/bloom_j/computational_notebooks/bdado... | results/process_ccs/Lib2-230613 |
Stats on CCS alignments¶
Read and plot the alignment stats from running alignparse
on the PacBio CCSs:
In [4]:
readstats = pd.concat(
[
(
pd.read_csv(os.path.join(row.subdir, "readstats.csv")).assign(
pacbioRun=row.pacbioRun, library=row.library
)
)
for row in pacbio_runs.itertuples()
],
ignore_index=True,
)
readstats_chart = (
alt.Chart(readstats)
.encode(
x="count:Q",
y=alt.Y(
"category:N",
axis=alt.Axis(title=None),
),
tooltip=readstats.columns.tolist(),
facet=alt.Facet("pacbioRun:N", columns=2, title=None),
)
.mark_bar()
.properties(width=250, height=50)
.resolve_scale(x="independent", y="independent")
)
readstats_chart
Out[4]:
Visualize target to which reads are being aligned¶
Draw images of the target we're parsing:
In [5]:
targets = alignparse.targets.Targets(
seqsfile=config["pacbio_amplicon"],
feature_parse_specs=config["pacbio_amplicon_specs"],
)
fig = targets.plot(
ax_width=7,
plots_indexing="genbank",
ax_height=2,
hspace=1.2,
)
/fh/fast/bloom_j/computational_notebooks/jbloom/2024/Flu_H5_American-Wigeon_South-Carolina_2021-H5N1_DMS/.snakemake/conda/9ee479f57d8e6ac889f8bb26f260852e_/lib/python3.11/site-packages/Bio/SeqFeature.py:230: BiopythonDeprecationWarning: Please use .location.strand rather than .strand warnings.warn(
Why were some CCSs filtered?¶
Plot the number of CCSs filtered for each reason:
In [6]:
# CSVs holding filtered reads
filtered_csvs = pd.concat(
[
(
pd.read_csv(os.path.join(row.subdir, "filtered.csv")).assign(
pacbioRun=row.pacbioRun, library=row.library
)
)
for row in pacbio_runs.itertuples()
],
ignore_index=True,
)
# details for all filtered reads
filtered = pd.concat(
[
pd.read_csv(row.csv_file).assign(
target=row.target, pacbioRun=row.pacbioRun, library=row.library
)
for row in filtered_csvs.itertuples()
],
ignore_index=True,
)
# count reasons for filtering, then add number of non-filtered
filtered_reasons = pd.concat(
[
filtered.groupby(["pacbioRun", "filter_reason"], as_index=False).aggregate(
count=pd.NamedAgg("query_name", "count")
),
readstats.query('category.str.startswith("aligned")', engine="python")
.groupby("pacbioRun", as_index=False)
.aggregate({"count": "sum"})
.assign(filter_reason="aligned"),
]
).assign(
total_counts=lambda x: x.groupby("pacbioRun")["count"].transform("sum"),
frac_counts=lambda x: x["count"] / x["total_counts"],
)
# make chart
filtered_chart = (
alt.Chart(filtered_reasons)
.encode(
x="count:Q",
y=alt.Y(
"filter_reason:N",
axis=alt.Axis(title=None),
),
color="is_aligned:N",
tooltip=filtered_reasons.columns.tolist(),
facet=alt.Facet("pacbioRun:N", columns=2, title=None),
)
.mark_bar()
.properties(width=250, height=75)
.resolve_scale(x="independent", y="independent")
.transform_filter(alt.datum.frac_counts > 0.01)
.transform_calculate(is_aligned=alt.datum.filter_reason == "aligned")
)
filtered_chart
Out[6]:
Get CCSs that align to the amplicon¶
In [7]:
# CSVs holding aligned reads
aligned_csvs = pd.concat(
[
(
pd.read_csv(os.path.join(row.subdir, "aligned.csv")).assign(
pacbioRun=row.pacbioRun, library=row.library
)
)
for row in pacbio_runs.itertuples()
],
ignore_index=True,
)
assert aligned_csvs["target"].nunique() == 1
aligned = pd.concat(
[
(
pd.read_csv(row.csv_file)
.assign(pacbioRun=row.pacbioRun, library=row.library)
.drop(columns=["query_clip5", "query_clip3"])
.rename(columns={"barcode_sequence": "barcode"})
)
for row in aligned_csvs.itertuples()
],
ignore_index=True,
)
print(f"\nRead {len(aligned):.4g} alignable CCSs:")
display(
aligned.groupby("pacbioRun").aggregate(n_CCSs=pd.NamedAgg("query_name", "count"))
)
output_csv = "results/process_ccs/CCSs_aligned_to_amplicon.csv"
print(f"Writing to {output_csv}")
aligned.to_csv(output_csv, index=False)
Read 1.293e+06 alignable CCSs:
n_CCSs | |
---|---|
pacbioRun | |
Lib1-230612 | 461313 |
Lib1-230613 | 294625 |
Lib2-230612 | 327872 |
Lib2-230613 | 209088 |
Writing to results/process_ccs/CCSs_aligned_to_amplicon.csv
In [ ]: