Prophage analysis in Pectobacterium pangenome
Detect prophages in genome
genomad prophage detection
# run genomad and checkv pipeline
nohup \
cat data/reference_data/assembly_ids.txt | \
parallel --jobs 6 --workdir $PWD --halt now,fail=1 \
--keep-order --results logs/genomad/{} \
--joblog logs/genomad/parallel.log \
scripts/preprocessing/genomad_prophage_annotation.sh {} \
>>logs/genomad/nohup.out 2>&1 &
# count of prophage detected in each genome
for i in `cat data/reference_data/assembly_ids.txt`
do
printf "$i\t"
tail -n +2 data/prophage_genomad/$i/${i}_summary/${i}_virus_summary.tsv | wc -l
done | sort -nr -k2
Check the checkV contamination.tsv file to see if there is any other contamination than the expected host,viral
, viral,host
and host,viral,host
ones.
Post processing of prophages
Process genomad output and separate prophages into independent FASTA files. Each prophage is assigned a unique identifier with format g_{genome}.vir_{#count}
. Prophages are saved in independent FASTA files for downstream analysis, for e.g. ANI clustering of prophages.
# post processing of prophage data
Rscript scripts/preprocessing/genomad_checkv_merge.R
# save prophages independently
bash scripts/preprocessing/prophage_genomes_extract.sh
Output: - yaml: confs$data$prophages$files$data
- file: data/prophage_genomad/prophage_summary.tab
Extract homology groups for prophages from the pangenome org.db object.
Output: - yaml: confs$analysis$prophages$preprocessing$files$raw_prophage_hg
- file: analysis/pangenome_v2/prophages/preprocessing/prophage_hgs.tab
Run ANI and MASH on prophage genomes.
conda activate pantools_v4_3
# list prophage FASTA files
ls data/prophage_genomad/phage_seqs/*.fna > data/prophage_genomad/prophage_fasta.list
# run fastANI
fastANI --ql data/prophage_genomad/prophage_fasta.list \
--rl data/prophage_genomad/prophage_fasta.list \
--threads 30 --matrix --visualize --fragLen 500 \
-o data/prophage_genomad/prophage_ANI
# use mash for distance estimation
mkdir data/prophage_genomad/mash
mash sketch -p 30 -s 15000 -k 17 -o data/prophage_genomad/mash/prophages_ref \
-l data/prophage_genomad/prophage_fasta.list >logs/prophage/mash.log 2>&1
mash info data/prophage_genomad/mash/prophages_ref.msh > \
data/prophage_genomad/mash/prophage_sketch_info.txt
mash dist -p 30 data/prophage_genomad/mash/prophages_ref.msh \
data/prophage_genomad/mash/prophages_ref.msh > \
data/prophage_genomad/mash/prophage_distance.tab
QC
Identify the fragmented prophages resulting because of the genome assembly that can be merged into a complete prophage. Prepare a consolidated list of prophages which will be used for the subsequent analysis.
Output: - yaml: confs$analysis$prophages$preprocessing$files$consolidated
- file: analysis/prophages/preprocessing/consolidated_phages.tab
Summarize the consolidated prophage list to generate various statistics and generate summary plots. Prophages that are fragmented or smaller than the length cutoff are removed and the filtered list of prophages is saved for downstream analysis.
Output: - yaml: confs$analysis$prophages$preprocessing$files$prophage_pool
- File: analysis/pangenome_v2/prophages/preprocessing/prophages_pool.tab
Process prophage MASH and ANI results and visualize MASH similarity matrix for the filtered prophages.
Cluster prophages
Use shared homology groups between two prophages to establish the child-parent relationship between the prophages in pangenome. First, all vs all syntenic Jaccard similarity is calculated between the prophages. This information is later used to cluster the prophages using two different approaches.
Approach 1
A heuristic DAG construcion is extracted from the output of the script scripts/analysis/prophage_merge.R
. This output includes a best matching parent prophage for each prophage. This is further processed to build a DAG and visualize the clusters.
Post processing of the child-parent relationships to add additional metadata and save a DAG in tabular format:
Visualize the clusters of prophages as directed acyclic graphs:
Approach 2
We first tried to use the Affinity propagation (AP) clustering to group the prophages into clusters and have a set of cluster representatives. However, we encountered some clusters generated by the AP clustering where the cluster members did not even share a homology group. Therefore, we decided not to use AP clustering.
Cluster the all-vs-all syntenic Jaccard distance matrix using “complete” linkage hierarchical clustering and cut the cluster using carotovoricin cluster as a reference.
Output: - yaml: confs$analysis$prophages$files$clusters
- file: analysis/pangenome_v2/prophages/prophages_clusters.tab
Summarize prophage clusters
Summarize the representative prophages in the pangenome.
Visualize the representative prophages across pangenome. Use MASH distance to cluster prophages and ANI distance to cluster the genomes in the pangenome.
Visualize homology group sets for selected prophages as PAV heatmap across the pangenome. Replace the code block in scripts/analysis/homology_group_viz.R
script with the following.
analysisName <- "vir_lineage_prophages"
outDir <- file.path(confs$analysis$prophages$path)
outPrefix <- file.path(outDir, analysisName)
setIds <- c(
"g_189.vir_2", "g_400.vir_2", "g_399.vir_2", "g_448.vir_2", "g_189.vir_1",
"g_3.vir_4", "g_438.vir_3", "g_350.vir_1",
"g_172.vir_2", "g_393.vir_1"
)
Finally, run the script to generate the plot.
Individual prophage cluster analysis
Script scripts/analysis/clustersmap_data_prophages.R
is used to combine the prophages, homology groups and phylogeny to generate input for clustermap.js
visualization tool and generate a report. Below, some configuration is provided for individual prophage clusters.
phage_grp_46: highly conserved in all but 5 genomes from n23 clade
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22429500,hg_22429499 \
--inner_region --genomes "g_63,g_190,g_193,g_411,g_415,g_406,g_194,g_263"
cluster_title <- "phage_grp_46"
outDir <- paste(confs$analysis$prophages$path, "/cluster_viz/", cluster_title, sep = "")
hg_color_categories <- confs$analysis$prophages$files$hg_broad_functions
# a vector of prophage identifiers that will be included in clustermap plot
region_cluster <- "phage_grp_46"
other_regions <- character(0)
subSample <- TRUE
cutHeight <- 1.5
addFlankingRegions <- TRUE
flankingRegion <- 5000
# ordering factor for prophages: "host" phylogeny, "hg_pav" for prophage HG PAV,
# "cluster_mash" for prophage MASH and "default" to use the provided
clusterOrder <- "host" # host, hg_pav, cluster_mash, default
# whether to keep custom regions at the bottom or consider during phylogeny
# based ordering
regions_phy_ordered <- TRUE
# regions to append as list of list with following structure
# list(r1 = list(chr, start, end, genomeId), r2 = list(chr, start, end, genomeId))
customRegions <- list(
g_190_reg = list(chr = "NZ_CP059960.1", start = 1997124, end = 2000186, genomeId = "g_190"),
g_193_reg = list(chr = "NZ_CP059957.1", start = 5005351, end = 5008413, genomeId = "g_193"),
g_411_reg = list(chr = "NAK682_contig_2_consensus", start = 461100, end = 464162, genomeId = "g_411"),
g_415_reg = list(chr = "NAK701_contig_5_consensus", start = 415799, end = 418861, genomeId = "g_415"),
g_63_reg = list(chr = "NZ_CP092039.1", start = 4288654, end = 4291716, genomeId = "g_63"),
g_406_reg = list(chr = "NAK641_contig_10_consensus", start = 469664, end = 474521, genomeId = "g_406"),
g_194_reg = list(chr = "NZ_CP059956.1", start = 4163454, end = 4167352, genomeId = "g_194")
# g_263_reg = list(chr = "NZ_CP047495.1", start = 956237, end = 956907, genomeId = "g_263")
)
phage_grp_45: prophage present in all BL-causing isolates
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22429319,hg_22429327 \
--genomes "g_406,g_194"
cluster_title <- "phage_grp_45"
outDir <- paste(confs$analysis$prophages$path, "/cluster_viz/", cluster_title, sep = "")
hg_color_categories <- confs$analysis$prophages$files$hg_broad_functions
# a vector of prophage identifiers that will be included in clustermap plot
region_cluster <- "phage_grp_45"
other_regions <- character(0)
subSample <- FALSE
cutHeight <- 1.5
addFlankingRegions <- TRUE
flankingRegion <- 5000
# ordering factor for prophages: "host" phylogeny, "hg_pav" for prophage HG PAV,
# "cluster_mash" for prophage MASH and "default" to use the provided
clusterOrder <- "host" # host, hg_pav, cluster_mash, default
# whether to keep custom regions at the bottom or consider during phylogeny
# based ordering
regions_phy_ordered <- TRUE
# regions to append as list of list with following structure
# list(r1 = list(chr, start, end, genomeId), r2 = list(chr, start, end, genomeId))
customRegions <- list(
g_406_reg = list(chr = "NAK641_contig_10_consensus", start = 671040, end = 674984, genomeId = "g_406"),
g_194_reg = list(chr = "NZ_CP059956.1", start = 3940785, end = 3958060, genomeId = "g_194")
)
phage_grp_45 and phage_grp_343 mosaicsm
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22429319,hg_22429327 \
--genomes "g_406,g_194"
cluster_title <- "grp_45_mosaicsm"
outDir <- paste(confs$analysis$prophages$path, "/cluster_viz/", cluster_title, sep = "")
hg_color_categories <- confs$analysis$prophages$files$hg_broad_functions
# a vector of prophage identifiers that will be included in clustermap plot
region_cluster <- NA
other_regions <- c("g_302.vir_1", "g_399.vir_1", "g_400.vir_2")
subSample <- TRUE
cutHeight <- 0.5
addFlankingRegions <- TRUE
flankingRegion <- 7000
# ordering factor for prophages: "host" phylogeny, "hg_pav" for prophage HG PAV,
# "cluster_mash" for prophage MASH and "default" to use the provided
clusterOrder <- "host" # host, hg_pav, cluster_mash, default
# whether to keep custom regions at the bottom or consider during phylogeny
# based ordering
regions_phy_ordered <- FALSE
# regions to append as list of list with following structure
# list(r1 = list(chr, start, end, genomeId), r2 = list(chr, start, end, genomeId))
customRegions <- list(
g_406_reg = list(chr = "NAK641_contig_10_consensus", start = 671040, end = 674984, genomeId = "g_406"),
g_194_reg = list(chr = "NZ_CP059956.1", start = 3940785, end = 3958060, genomeId = "g_194")
)
phage_grp_71
cluster_title <- "phage_grp_71"
outDir <- paste(confs$analysis$prophages$path, "/cluster_viz/", cluster_title, sep = "")
hg_color_categories <- confs$analysis$prophages$files$hg_broad_functions
# a vector of prophage identifiers that will be included in clustermap plot
region_cluster <- "phage_grp_71"
other_regions <- c()
subSample <- FALSE
cutHeight <- 0.5
addFlankingRegions <- TRUE
flankingRegion <- 5000
# ordering factor for prophages: "host" phylogeny, "hg_pav" for prophage HG PAV,
# "cluster_mash" for prophage MASH and "default" to use the provided
clusterOrder <- "hg_pav" # host, hg_pav, cluster_mash, default
# whether to keep custom regions at the bottom or consider during phylogeny
# based ordering
regions_phy_ordered <- FALSE
# regions to append as list of list with following structure
# list(r1 = list(chr, start, end, genomeId), r2 = list(chr, start, end, genomeId))
customRegions <- list()
phage_grp_107
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427004,hg_22426993 \
--inner_region --genomes "g_63,g_194"
cluster_title <- "phage_grp_107"
outDir <- paste(confs$analysis$prophages$path, "/cluster_viz", sep = "")
hg_color_categories <- confs$analysis$prophages$files$hg_broad_functions
# a vector of prophage identifiers that will be included in clustermap plot
region_cluster <- "phage_grp_107"
other_regions <- character(0)
subSample <- FALSE
cutHeight <- 1.5
addFlankingRegions <- TRUE
flankingRegion <- 5000
# ordering factor for prophages: "host" phylogeny, "hg_pav" for prophage HG PAV,
# "cluster_mash" for prophage MASH and "default" to use the provided
clusterOrder <- "host" # host, hg_pav, cluster_mash, default
# whether to keep custom regions at the bottom or consider during phylogeny
# based ordering
regions_phy_ordered <- TRUE
# regions to append as list of list with following structure
# list(r1 = list(chr, start, end, genomeId), r2 = list(chr, start, end, genomeId))
customRegions <- list(
g_63_reg = list(chr = "NZ_CP092039.1", start = 2572138, end = 2573569, genomeId = "g_63"),
g_194_reg = list(chr = "NZ_CP059956.1", start = 2620331, end = 2621761, genomeId = "g_194")
)
BL-causing clade specific signatures
grpToView <- "Pbr_FN_specific"
subSample <- FALSE
cutHeight <- 0.5
addFlankingRegions <- FALSE
flankingRegion <- 5000
# ordering factor for prophages: host phylogeny, prophage HG PAV, prophage MASH,
# completeness score
clusterOrder <- "default" # completeness, host, hg_pav, cluster_mash, default
# optionally, a custom region list can be provided to generate the plot
grp <- list(
phage_grp = grpToView,
members = c("g_302.vir_1", "g_400.vir_2", "g_409.vir_1", "g_302.vir_2")
)
phage_grp_36
phage_grp_50
phage_grp_21
T3SS missing from 4 P. brasiliense genomes
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22430875,hg_22430876 --genomes g_187,g_368,g_182,g_185,g_236,g_177
grpToView <- "t3ss_regions"
subSample <- FALSE
cutHeight <- 1.5
addFlankingRegions <- TRUE
flankingRegion <- 6000
# ordering factor for prophages: host phylogeny, prophage HG PAV, prophage MASH,
# completeness score
clusterOrder <- "host" # completeness, host, hg_pav, cluster_mash
# a vector of prophage identifiers that will be included in clustermap plot
appendPhages <- c()
# regions to append as list of list with following structure
# list(r1 = list(chr, start, end, genomeId), r2 = list(chr, start, end, genomeId))
customRegions <- list(
g_177_reg = list(genomeId = "g_177", chr = "NZ_JACGEP010000027.1", start = 56850, end = 59337),
g_182_reg = list(genomeId = "g_182", chr = "NZ_JACGZZ010000068.1", start = 116773, end = 119260),
g_185_reg = list(genomeId = "g_185", chr = "NZ_JACGEN010000027.1", start = 117747, end = 120234),
g_236_reg = list(genomeId = "g_236", chr = "NZ_JACDSF010000026.1", start = 2275, end = 4762),
g_187_reg = list(genomeId = "g_187", chr = "NZ_CP059963.1", start = 2603432, end = 2605919),
g_368_reg = list(genomeId = "g_368", chr = "NZ_JUJT01000001.1", start = 566617, end = 569104)
)
# whether to keep custom regions at the bottom or consider during phylogeny
# based ordering
regions_phy_ordered <- TRUE