Carotovoricin (CTV) cluster or phage_grp_1 data analysis
Extract carotovoricin data from prophage clustering output
Prophage cluster phage_grp_1
belongs to the carotovoricin. Hence, prepare a table with prophage identifiers, their genomic coordinates, homology group signatures and other relevant information for all the prophages in the phage_grp_1
cluster.
Manually check the output file analysis/pangenome_v2/carotovoricin/data/ctv_prophages_data.tsv
and verify carotovoricin presence/absence status in the genomes. The manually curated table is saved as analysis/pangenome_v2/carotovoricin/data/ctv_genome_table.tsv
.
Extract all homology groups representing CTV region across the pangenome
Genes and their respective homology groups flanking CTV: - tolC_2: hg_22427641 - ybiB: hg_22427603
Extract genomic coordinates for CTV region together with the haplotypes
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427641,hg_22427603 \
--inner_region --haplotypes \
--out analysis/pangenome_v2/carotovoricin/ctv_region/hg_regions.tab
Process the output file from the previous step to build CTV homology group table and then manually annotate the homology groups with broad functionl categories based on the keywords in GO, COG and PFAM annotations.
CTV present in 428 genomes.
- Intact CTV along with flanking genes tolC_2 and ybiB: 365 genomes
- Present at the end of the contig: 31 genomes
- Fragmented into multiple contigs: 32 genomes
CTV is missing or deleted in the following 26 genomes:
Complete deletion:
- P. brasiliense: g_177, g_182, g_185, g_236
- P. carotovorum: g_15 (now renamed to P. araliae)
- P. cacticida: g_451
Partial deletion:
- P. atrosepticum (all 16 genomes): g_124, g_200, g_27, g_283, g_311, g_322, g_336, g_348, g_382, g_385, g_389, g_392, g_396, g_50, g_51, g_55
- P. betavasculorum: g_383, g_386
- P. versatile: g_313 (deletion of 5 genes including tail sheath, tail tube and tape measure genes). Check g_313.vir_2
Insertion of another prophage:
- P. brasiliense: g_149 CTV cluster has integration of another prophage and hence it is disrupted. It was detected as g_149.vir_1 and was part of phage_grp_89 with 3 members.
Clustermap visualization of CTV clusters
CTV across pangenome
Include P. atrosepticum (g_385), P. betavasculorum (g_386) and P. cacticida (g_451) genomes that lack CTV cluster.
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427603,hg_22427641 \
--inner_region --genomes "g_385,g_386,g_451"
Update the scripts/analysis/clustersmap_data_prophages.R
script as below to visualize the CTV cluster in P. brasiliense.
cluster_title <- "ctv_typeStrains"
outDir <- paste(confs$analysis$ctv$path, "/cluster_viz/", cluster_title, sep = "")
hg_color_categories <- confs$analysis$ctv$files$hg_broad_functions
# a vector of prophage identifiers that will be included in clustermap plot
region_cluster <- NA
other_regions <- c(
"g_345.vir_1", "g_446.vir_4", "g_66.vir_3", "g_222.vir_2", "g_365.vir_3",
"g_8.vir_2", "g_38.vir_2", "g_273.vir_2", "g_259.vir_4",
"g_305.vir_1", "g_378.vir_6", "g_428.vir_1", "g_248.vir_1", "g_449.vir_1",
"g_54.vir_1", "g_116.vir_3", "g_423.vir_3", "g_375.vir_2", "g_381.vir_2"
)
subSample <- FALSE
cutHeight <- 1.5
addFlankingRegions <- TRUE
flankingRegion <- 5000
# ordering factor for prophages: "host" phylogeny, "hg_pav" for prophage HG PAV,
# "cluster_mash" for prophage MASH and "default" to use the provided
clusterOrder <- "host" # host, hg_pav, cluster_mash, default
# whether to keep custom regions at the bottom or consider during phylogeny
# based ordering
regions_phy_ordered <- TRUE
# regions to append as list of list with following structure
# list(r1 = list(chr, start, end, genomeId), r2 = list(chr, start, end, genomeId))
customRegions <- list(
g_385_reg = list(chr = "NZ_JQHK01000003.1", start = 203963, end = 207120, genomeId = "g_385"),
g_386_reg = list(chr = "NZ_JQHM01000001.1", start = 553213, end = 555615, genomeId = "g_386"),
g_451_reg = list(chr = "Contig_2_668.636", start = 191452, end = 191490, genomeId = "g_451")
)
carotovoricin cluster in P. brasiliense isolates
cluster_title <- "ctv_pbr"
outDir <- paste(confs$analysis$ctv$path, "/cluster_viz/", cluster_title, sep = "")
hg_color_categories <- confs$analysis$ctv$files$hg_broad_functions
# a vector of prophage identifiers that will be included in clustermap plot
region_cluster <- "phage_grp_1"
other_regions <- c()
subSample <- TRUE
cutHeight <- 1.5
addFlankingRegions <- TRUE
flankingRegion <- 5000
# ordering factor for prophages: "host" phylogeny, "hg_pav" for prophage HG PAV,
# "cluster_mash" for prophage MASH and "default" to use the provided
clusterOrder <- "host" # host, hg_pav, cluster_mash, default
# whether to keep custom regions at the bottom or consider during phylogeny
# based ordering
regions_phy_ordered <- FALSE
# regions to append as list of list with following structure
# list(r1 = list(chr, start, end, genomeId), r2 = list(chr, start, end, genomeId))
customRegions <- list()
regionClusters <- suppressMessages(
readr::read_tsv(confs$analysis$prophages$files$clusters)
)
# optional filter
regionClusters %<>%
dplyr::filter(
SpeciesName == "P. brasiliense",
!prophage_id %in% c("g_408.vir_3", "g_403.vir_3", "g_399.vir_3")
)
Visualize CTV deletion
Extract genomic coordinates for the species/genomes in which CTV is deleted.
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427603,hg_22427641 \
--inner_region --genomes "g_149,g_177,g_182,g_185,g_236,g_15,g_313,g_385,g_386"
Update the scripts/analysis/clustersmap_data_prophages.R
script as below to visualize the CTV clusters.
cluster_title <- "ctv_deletion"
outDir <- paste(confs$analysis$ctv$path, "/cluster_viz/", cluster_title, sep = "")
hg_color_categories <- confs$analysis$ctv$files$hg_broad_functions
# a vector of prophage identifiers that will be included in clustermap plot
region_cluster <- NA
other_regions <- c("g_345.vir_1", "g_66.vir_3", "g_8.vir_2", "g_381.vir_2")
subSample <- FALSE
cutHeight <- 1.5
addFlankingRegions <- TRUE
flankingRegion <- 5000
# ordering factor for prophages: "host" phylogeny, "hg_pav" for prophage HG PAV,
# "cluster_mash" for prophage MASH and "default" to use the provided
clusterOrder <- "host" # host, hg_pav, cluster_mash, default
# whether to keep custom regions at the bottom or consider during phylogeny
# based ordering
regions_phy_ordered <- TRUE
# regions to append as list of list with following structure
# list(r1 = list(chr, start, end, genomeId), r2 = list(chr, start, end, genomeId))
customRegions <- list(
g_177_reg = list(chr = "NZ_JACGEP010000002.1", start = 102352, end = 103187, genomeId = "g_177"),
g_182_reg = list(chr = "NZ_JACGZZ010000050.1", start = 81626, end = 82461, genomeId = "g_182"),
g_185_reg = list(chr = "NZ_JACGEN010000006.1", start = 81658, end = 82493, genomeId = "g_185"),
g_236_reg = list(chr = "NZ_JACDSF010000027.1", start = 89639, end = 90475, genomeId = "g_236"),
g_149_reg = list(chr = "NZ_JACGFD010000001.1", start = 2861612, end = 2913419, genomeId = "g_149"),
g_385_reg = list(chr = "NZ_JQHK01000003.1", start = 203963, end = 207120, genomeId = "g_385"),
g_386_reg = list(chr = "NZ_JQHM01000001.1", start = 553213, end = 555615, genomeId = "g_386"),
g_451_reg = list(chr = "Contig_2_668.636", start = 191452, end = 191490, genomeId = "g_451"),
g_313_reg = list(chr = "NZ_CP024842.1", start = 2917505, end = 2929851, genomeId = "g_313"),
g_15_reg = list(chr = "BRCR01000012.1", start = 138656, end = 139632, genomeId = "g_15")
)
Prophage clusters found in the ctv-lacking Pbr: phage_grp_30, phage_grp_6, phage_grp_29
A prophage integrated within the CTV in P. brasiliense genome g_149
Genome g_149 CTV cluster has integration of another prophage and hence it is disrupted. It was detected as g_149.vir_1 and was part of phage_grp_89 which has 3 members, r c("g_149.vir_1", "g_116.vir_1", "g_46.vir_2")
cluster_title <- "prophage_in_ctv"
outDir <- paste(confs$analysis$ctv$path, "/cluster_viz/", cluster_title, sep = "")
hg_color_categories <- confs$analysis$ctv$files$hg_broad_functions
# a vector of prophage identifiers that will be included in clustermap plot
region_cluster <- NA
other_regions <- c("g_368.vir_3", "g_149.vir_1", "g_116.vir_1", "g_46.vir_2")
subSample <- FALSE
cutHeight <- 1.5
addFlankingRegions <- TRUE
flankingRegion <- 5000
# ordering factor for prophages: "host" phylogeny, "hg_pav" for prophage HG PAV,
# "cluster_mash" for prophage MASH and "default" to use the provided
clusterOrder <- "default" # host, hg_pav, cluster_mash, default
# whether to keep custom regions at the bottom or consider during phylogeny
# based ordering
regions_phy_ordered <- FALSE
# regions to append as list of list with following structure
# list(r1 = list(chr, start, end, genomeId), r2 = list(chr, start, end, genomeId))
customRegions <- list()
CTV in P. versatile
Update the scripts/analysis/clustersmap_data_prophages.R
script as below to visualize the CTV cluster in P. versatile.
cluster_title <- "ctv_pv"
outDir <- paste(confs$analysis$ctv$path, "/cluster_viz/", cluster_title, sep = "")
hg_color_categories <- confs$analysis$ctv$data$files$hg_broad_functions
# a vector of prophage identifiers that will be included in clustermap plot
region_cluster <- "phage_grp_1"
other_regions <- c()
subSample <- TRUE
cutHeight <- 1.5
addFlankingRegions <- TRUE
flankingRegion <- 5000
# ordering factor for prophages: "host" phylogeny, "hg_pav" for prophage HG PAV,
# "cluster_mash" for prophage MASH and "default" to use the provided
clusterOrder <- "host" # host, hg_pav, cluster_mash, default
# whether to keep custom regions at the bottom or consider during phylogeny
# based ordering
regions_phy_ordered <- TRUE
# regions to append as list of list with following structure
# list(r1 = list(chr, start, end, genomeId), r2 = list(chr, start, end, genomeId))
customRegions <- list()
regionClusters <- suppressMessages(
readr::read_tsv(confs$analysis$prophages$files$clusters)
)
# optional filter
regionClusters %<>%
dplyr::filter(
SpeciesName == "P. versatile"
)
CTV cluster in P. versatile collected from France
Update the scripts/analysis/clustersmap_data_prophages.R
script as below to visualize the CTV clusters.
grpToView <- "ctv_pvs_fr"
subSample <- TRUE
cutHeight <- 1.5
addFlankingRegions <- TRUE
flankingRegion <- 5000
grp <- list(
phage_grp = grpToView,
members = dplyr::filter(
regionClusters,
SpeciesName == "P. versatile", nFragments == 1, phage_grp == "phage_grp_1",
geo_loc_country == "France"
) %>%
dplyr::pull(prophage_id)
)
Variable regions in CTV cluster
Nitric oxide deoxygenase (hg_22427623)
# all genomes that have CTV
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427625,hg_22427622 \
--haplotypes --max_genes 10 \
--genomes_file analysis/pangenome_v2/carotovoricin/data/genomes_ctv.txt \
--out analysis/pangenome_v2/carotovoricin/ctv_nitric_oxide/hg_regions_all_genomes.tab
# only for intact CTV genomes
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427625,hg_22427622 \
--haplotypes --max_genes 10 \
--genomes_file analysis/pangenome_v2/carotovoricin/data/genomes_ctv_intact.txt \
--out analysis/pangenome_v2/carotovoricin/ctv_nitric_oxide/hg_regions.tab
Tape measure protein
Tape measure genes is located between the homology groups hg_22427622 and hg_22427616. This region is also represented by multiple combinations of homology groups.
# all genomes that have CTV
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427622,hg_22427616 \
--haplotypes --inner_region --overlapping --max_genes 10 \
--genomes_file analysis/pangenome_v2/carotovoricin/data/genomes_ctv.txt \
--out analysis/pangenome_v2/carotovoricin/ctv_tape_measure/hg_regions_all_genomes.tab
# only for intact CTV genomes
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427622,hg_22427616 \
--haplotypes --inner_region --overlapping --max_genes 10 \
--genomes_file analysis/pangenome_v2/carotovoricin/data/genomes_ctv_intact.txt \
--out analysis/pangenome_v2/carotovoricin/ctv_tape_measure/hg_regions.tab
Tail fiber region
# all genomes that have CTV
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427604,hg_22427603 \
--inner_region --haplotypes \
--genomes_file analysis/pangenome_v2/carotovoricin/data/genomes_ctv.txt \
--out analysis/pangenome_v2/carotovoricin/ctv_tfl/hg_regions_all_genomes.tab
# only for intact CTV genomes
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427604,hg_22427603 \
--inner_region --haplotypes \
--genomes_file analysis/pangenome_v2/carotovoricin/data/genomes_ctv_intact.txt \
--out analysis/pangenome_v2/carotovoricin/ctv_tfl/hg_regions.tab
CTV conserved loci
Haplotype analysis and visualization for the tail fiber locus
Optionally, tandem homology groups can be mapped to the genomes and visualized on the phylogenetic tree. Need to change the hgSets
variable in the script scripts/analysis/HG_tandem_match_viz.qmd
to include the homology groups of interest as a list()
object.
For example,
Summary of carotovoricin cluster
Inter-species horizontal gene transfer of CTV
Clustermap visualization for some manually selected candidates. Change the code blocks in the script scripts/analysis/clustersmap_data_prophages.R
and run.
grpToView <- "ctv_hgt"
subSample <- FALSE
cutHeight <- 1.5
addFlankingRegions <- TRUE
flankingRegion <- 5000
# ordering factor for prophages: host phylogeny, prophage HG PAV, prophage MASH,
# completeness score
clusterOrder <- "hg_pav" # completeness, host, hg_pav, cluster_mash
# optionally, a custom region list can be provided to generate the plot
grp <- list(
phage_grp = grpToView,
members = c(
"g_145.vir_1", "g_194.vir_1", "g_429.vir_1", "g_442.vir_1", "g_421.vir_1",
"g_150.vir_4", "g_447.vir_1", "g_434.vir_4",
"g_221.vir_3", "g_53.vir_3", "g_106.vir_2", "g_57.vir_1", "g_125.vir_1"
)
)
Run script scripts/utils/HG_range_coordinates.R
to extract genomic coordinates for CTV region in selected genomes.
Complete CTV region along with 3 flanking genes:
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427643,hg_22427599 \
--inner_region --out analysis/pangenome_v2/carotovoricin/ctv_region/hg_regions.tab
# create GFF3 files for CTV regions of interest
Rscript scripts/utils/HGs_gff_from_regions.R \
--regions analysis/pangenome_v2/carotovoricin/ctv_region/hg_regions.tab \
--genomes g_145,g_194,g_429,g_442,g_421,g_150,g_447,g_434,g_53,g_106,g_57,g_221,g_125
# create CTV regions GFF3 files for one species representative each
Rscript scripts/utils/HGs_gff_from_regions.R \
--regions analysis/pangenome_v2/carotovoricin/ctv_region/hg_regions.tab \
--genomes g_183,g_423,g_277,g_375,g_116,g_449,g_446,g_378,g_296,g_428,g_259,g_66,g_337,g_442,g_305,g_381,g_8,g_273,g_38
DNA sequence distance comparison for conserved and TFL regions
Extract DNA sequence for the regions
bash scripts/utils/get_fasta.sh --region_file analysis/pangenome_v2/carotovoricin/ctv_tfl/hg_regions.tab
bash scripts/utils/get_fasta.sh --region_file analysis/pangenome_v2/carotovoricin/ctv_conserved/hg_regions.tab
bash scripts/utils/get_fasta.sh --region_file analysis/pangenome_v2/carotovoricin/ctv_tape_measure/hg_regions.tab
bash scripts/utils/get_fasta.sh --separate --region_file analysis/pangenome_v2/carotovoricin/ctv_tfl/hg_regions.tab
bash scripts/utils/get_fasta.sh --separate --region_file analysis/pangenome_v2/carotovoricin/ctv_conserved/hg_regions.tab
bash scripts/utils/get_fasta.sh --separate --region_file analysis/pangenome_v2/carotovoricin/ctv_tape_measure/hg_regions.tab
Calculate the distance using mash
cd analysis/pangenome_v2/carotovoricin/ctv_tfl
mash dist -p 8 -k 12 -s 2000 -i -S 124 hg_regions.fasta hg_regions.fasta > distance_mash.tab
cd analysis/pangenome_v2/carotovoricin/ctv_conserved
mash dist -p 8 -k 12 -s 2000 -i -S 124 hg_regions.fasta hg_regions.fasta > distance_mash.tab
cd analysis/pangenome_v2/carotovoricin/ctv_tape_measure
mash dist -p 8 -k 12 -s 2000 -i -S 124 hg_regions.fasta hg_regions.fasta > distance_mash.tab
grep -e 'g_(53|106|57|221|125).*g_(53|106|57|221|125)' ctv_*/distance_mash.tab
grep -e 'g_(150|447|434).*g_(150|447|434)' ctv_*/distance_mash.tab
grep -e 'g_(53|106|57).*g_(53|106|57)' ctv_*/distance_mash.tab
Calculate Jaccard distance matrix using sourmash
MASH distance is low for haplotype pairs where one is subset of the other, eg, A-B-C-D and A-B. Therefore, another tool was tried to calculate the Jaccard index between sequences.
mamba activate smash
file_fa="ctv_conserved/hg_regions.fasta"
# file_fa="ctv_tfl/hg_regions.fasta"
file_sig="${file_fa%/*}"/sourmash_signatures.zip
file_dist="${file_fa%/*}/"distance_sourmash.csv
# # sketch: scaled=5
# sourmash sketch dna -f -p k=7,k=9,k=11,scaled=5,abund,seed=124 --singleton \
# -o ${file_sig} ${file_fa}
# sketch: num=2000
sourmash sketch dna -f -p k=7,k=9,k=11,k=13,num=2000,abund,seed=124 --singleton \
-o "${file_sig}" "${file_fa}"
sourmash compare --ignore-abundance --distance-matrix -p 12 -k 13 --dna \
--csv "${file_dist}" "${file_sig}"
Calculate Jaccard distance using dashing
mamba activate smash
cd analysis/pangenome_v2/carotovoricin
# dir_fa="ctv_conserved/region_fasta"
dir_fa="ctv_tfl/region_fasta"
file_dist="${dir_fa%/*}/"distance_dashing.txt
dashing cmp -k 11 --seed 124 --nthreads 12 --full-mash-dist \
-O "${file_dist}" "${dir_fa}"/*.fasta
zip -m -r "${dir_fa}".zip "${dir_fa}"
Process the sequence distance comparison results
Tail fiber HGT demonstration
Extract sequences for some example tail fiber loci haplotypes that are present in multiple Pectobacterium species.
genomeSet="g_279,g_425,g_299,g_377,g_100,g_106,g_331,g_249,g_125,g_221,g_53,g_395,g_108,g_444,g_160,g_352"
setName="genomeset_1"
# tail fiber locus
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427604,hg_22427603 \
--genomes "${genomeSet}" \
--out "analysis/pangenome_v2/carotovoricin/tfl_hgt/${setName}_tfl_flanking.tab"
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427604,hg_22427603 \
--genomes "${genomeSet}" \
--inner_region --out "analysis/pangenome_v2/carotovoricin/tfl_hgt/${setName}_tfl.tab"
# conserved CTV locus
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427640,hg_22427604 \
--genomes "${genomeSet}" \
--out "analysis/pangenome_v2/carotovoricin/tfl_hgt/${setName}_conserved.tab"
# upstream core
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427642,hg_22427641 \
--genomes "${genomeSet}" \
--out "analysis/pangenome_v2/carotovoricin/tfl_hgt/${setName}_upstream_core.tab"
Extract DNA sequence for these regions.
for f in analysis/pangenome_v2/carotovoricin/tfl_hgt/${setName}_*.tab
do
bash scripts/utils/get_fasta.sh --region_file ${f}
done
Perform MSA and generate phylogenetic trees
mamba activate pantools_4_3_3
nohup mafft --reorder --allowshift --unalignlevel 0.2 --leavegappyregion \
--maxiterate 0 --globalpair \
"analysis/pangenome_v2/carotovoricin/tfl_hgt/${setName}_tfl.fasta" \
> "analysis/pangenome_v2/carotovoricin/tfl_hgt/${setName}_tfl.msa.fasta" 2>nohup.out &
nohup mafft --reorder --allowshift --unalignlevel 0.2 --leavegappyregion \
--maxiterate 0 --globalpair \
"analysis/pangenome_v2/carotovoricin/tfl_hgt/${setName}_conserved.fasta" \
> "analysis/pangenome_v2/carotovoricin/tfl_hgt/${setName}_conserved.msa.fasta" 2>nohup.out &
nohup mafft --reorder --allowshift --unalignlevel 0.2 --leavegappyregion \
--maxiterate 0 --globalpair \
"analysis/pangenome_v2/carotovoricin/tfl_hgt/${setName}_upstream_core.fasta" \
> "analysis/pangenome_v2/carotovoricin/tfl_hgt/${setName}_upstream_core.msa.fasta" 2>nohup.out &
Generate a maximum-likelihood phylogenetic tree for MSAs
cd analysis/pangenome_v2/carotovoricin/tfl_hgt
mkdir iqtree
nohup nice iqtree2 -T 40 -s "${setName}_tfl.msa.fasta" -B 1000 \
--prefix "iqtree/${setName}_tfl.msa.fasta" >> iqtree_tree.log 2>&1 &
nohup nice iqtree2 -T 40 -s "${setName}_conserved.msa.fasta" -B 1000 \
--prefix "iqtree/${setName}_conserved.msa.fasta" >> iqtree_tree2.log 2>&1 &
nohup nice iqtree2 -T 40 -s "${setName}_upstream_core.msa.fasta" -B 1000 \
--prefix "iqtree/${setName}_upstream_core.msa.fasta" >> iqtree_tree3.log 2>&1 &
TFL region Mauve analysis
Follow the steps below to perform Mauve analysis for genomic regions of interest.
- Define the genome subset for the analysis.
For example:
subsetName="tfl_pecto"
genomeSet="g_302,g_364,g_337,g_439,g_403,g_399,g_408,g_175,g_138,g_368,g_345,g_366,g_308,g_191,g_173,g_155,g_166,g_299,g_438,g_263,g_43,g_391"
outDir="analysis/pangenome_v2/carotovoricin/mauve/${subsetName}"
- Extract the genomic coordinates for a region based on the homology group range.
The TFL region is flanked by the homology groups hg_22427604
and hg_22427603
. These homology groups will be used to extract the genomic coordinates for the region of interest. We include the region encoding homology groups and therefore do not use the --inner_region
option.
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427604,hg_22427603 \
--genomes "${genomeSet}" --out "${outDir}/hg_regions.tab"
- Create GFF3 files for these regions.
- Extract DNA sequence for these regions.
- Generate GenBank files from the
fasta
sequences andgff3
annotations.
Mauve aligns sequences provided in fasta
and genbank
format. However, the sequence annotation is shown only if the alignment input was in genbank
format. Therefore, we need to combine the homology group annotation in gff3
format with the fasta
sequences to generate a genbank
formatted files. To do this, Emboss tool seqret
was used.
conda activate omics_py37
cd "${outDir}"
mkdir gbk_files
for vir in `cut -f 1 hg_regions.tab | tail -n +2`
do
seqret -sequence region_fasta/${vir}.fasta -feature -fformat1 gff3 \
-fopenfile1 gff3/${vir}.gff3 -osformat2 genbank -osextension2 gbk \
-osname2 ${vir} -osdirectory2 gbk_files -auto
done
- Correct GenBank file as per Mauve requirement
sed -i.bak -r -e 's/(^\s+CDS\s+(complement\()?)[^:]+:([[:digit:]]+\.\.[[:digit:]]+\)?).*/\1\3/' \
-e 's/\/note="\*([^:]+): /\/\1="/' gbk_files/*.gbk
- Run
progressiveMauve
to align the sequences.
mauve_out="${subsetName}"
progressiveMauve --output="${mauve_out}.mauve.xmfa" \
--backbone-output="${mauve_out}.mauve.backbone" \
--output-guide-tree="${mauve_out}.mauve.guide_tree.newick" \
gbk_files/*.gbk > "${mauve_out}".log 2>&1
- Analyze the
xmfa
output files using Mauve GUI.
Specific Mauve analysis
TFL haplotype variation in P. brasiliense strains
Extract the region between the two homology groups, hg_22427604
and hg_22427603
for the carotovoricin cluster. Additionally, generate GFF3
files with the homology group, COG, PFAM and other metadata information to visualize.
subsetName="tfl_pbr"
genomeSet="g_302,g_364,g_337,g_439,g_403,g_399,g_408,g_175,g_138,g_368,g_345,g_366,g_308,g_191,g_173,g_155,g_166,g_299,g_438,g_263,g_43,g_391"
Specific use cases
# genomes where Ein gene present in cluster
mauve_out="${subsetName}_ein"
progressiveMauve --output="${mauve_out}.mauve.xmfa" \
--backbone-output="${mauve_out}.mauve.backbone" \
--output-guide-tree="${mauve_out}.mauve.guide_tree.newick" \
gbk_files/g_345.P_brasiliense.gbk gbk_files/g_366.P_brasiliense.gbk \
gbk_files/g_308.P_brasiliense.gbk gbk_files/g_191.P_brasiliense.gbk \
gbk_files/g_173.P_brasiliense.gbk gbk_files/g_155.P_brasiliense.gbk \
gbk_files/g_166.P_brasiliense.gbk > "${mauve_out}".log 2>&1
# represetative genomes
mauve_out="${subsetName}_rep"
progressiveMauve --output="${mauve_out}.mauve.xmfa" \
--backbone-output="${mauve_out}.mauve.backbone" \
--output-guide-tree="${mauve_out}.mauve.guide_tree.newick" \
gbk_files/g_302.P_brasiliense.gbk gbk_files/g_337.P_brasiliense.gbk \
gbk_files/g_175.P_brasiliense.gbk gbk_files/g_173.P_brasiliense.gbk \
gbk_files/g_299.P_brasiliense.gbk gbk_files/g_438.P_brasiliense.gbk \
gbk_files/g_391.P_brasiliense.gbk > "${mauve_out}".log 2>&1
Variation of TFL haplotype with ein across Pectobacterium species
subsetName="tfl_pecto"
genomeSet="g_14,g_121,g_339,g_367,g_324,g_148,g_357,g_309,g_351,g_375,g_20,g_449,g_265,g_138,g_254,g_278,g_344,g_120,g_333,g_428,g_12,g_110,g_58,g_368,g_259,g_279,g_315,g_280,g_16,g_21,g_160,g_53,g_305,g_206,g_1,g_426,g_186,g_264,g_218,g_439,g_421,g_79,g_371,g_431,g_29,g_420,g_247,g_338,g_403,g_31"
outDir="analysis/pangenome_v2/carotovoricin/mauve/${subsetName}"
# set_a_ein_5kb
mauve_out="set_a_ein_5kb"
progressiveMauve --output="${mauve_out}.mauve.xmfa" \
--backbone-output="${mauve_out}.mauve.backbone" \
--output-guide-tree="${mauve_out}.mauve.guide_tree.newick" \
gbk_files/g_79.P_polaris.gbk gbk_files/g_371.P_brasiliense.gbk \
gbk_files/g_431.P_aroidearum.gbk gbk_files/g_29.P_brasiliense.gbk \
gbk_files/g_247.P_carotovorum.gbk gbk_files/g_338.P_polaris.gbk \
gbk_files/g_31.P_quasiaquaticum.gbk
# set_b_5kb
mauve_out="set_b_5kb"
progressiveMauve --output="${mauve_out}.mauve.xmfa" \
--backbone-output="${mauve_out}.mauve.backbone" \
--output-guide-tree="${mauve_out}.mauve.guide_tree.newick" \
gbk_files/g_1.P_carotovorum.gbk gbk_files/g_426.P_brasiliense.gbk \
gbk_files/g_186.P_versatile.gbk gbk_files/g_264.P_carotovorum.gbk \
gbk_files/g_439.P_brasiliense.gbk gbk_files/g_421.P_aroidearum.gbk \
gbk_files/g_420.P_carotovorum.gbk gbk_files/g_403.P_brasiliense.gbk
# set_c_2kb
mauve_out="set_c_2kb"
progressiveMauve --output="${mauve_out}.mauve.xmfa" \
--backbone-output="${mauve_out}.mauve.backbone" \
--output-guide-tree="${mauve_out}.mauve.guide_tree.newick" \
gbk_files/g_14.P_polonicum.gbk gbk_files/g_339.P_polaris.gbk \
gbk_files/g_367.P_brasiliense.gbk gbk_files/g_324.P_odoriferum.gbk \
gbk_files/g_357.P_brasiliense.gbk gbk_files/g_309.P_parmentieri.gbk \
gbk_files/g_351.P_wasabiae.gbk gbk_files/g_375.P_fontis.gbk \
gbk_files/g_20.P_actinidiae.gbk gbk_files/g_449.P_actinidiae.gbk
# set_d_2_5kb
mauve_out="set_d_2_5kb"
progressiveMauve --output="${mauve_out}.mauve.xmfa" \
--backbone-output="${mauve_out}.mauve.backbone" \
--output-guide-tree="${mauve_out}.mauve.guide_tree.newick" \
gbk_files/g_375.P_fontis.gbk gbk_files/g_20.P_actinidiae.gbk \
gbk_files/g_449.P_actinidiae.gbk gbk_files/g_265.P_polonicum.gbk \
gbk_files/g_138.P_brasiliense.gbk gbk_files/g_254.P_polaris.gbk \
gbk_files/g_278.P_aquaticum.gbk gbk_files/g_344.P_actinidiae.gbk \
gbk_files/g_120.P_versatile.gbk gbk_files/g_333.P_odoriferum.gbk
# set_e_3kb
mauve_out="set_e_3kb"
progressiveMauve --output="${mauve_out}.mauve.xmfa" \
--backbone-output="${mauve_out}.mauve.backbone" \
--output-guide-tree="${mauve_out}.mauve.guide_tree.newick" \
gbk_files/g_12.P_versatile.gbk gbk_files/g_58.P_odoriferum.gbk \
gbk_files/g_368.P_brasiliense.gbk gbk_files/g_259.P_parvum.gbk \
gbk_files/g_315.P_versatile.gbk gbk_files/g_280.P_aquaticum.gbk \
gbk_files/g_16.P_carotovorum.gbk gbk_files/g_21.P_punjabense.gbk
# set_mixed
mauve_out="set_mixed"
progressiveMauve --output="${mauve_out}.mauve.xmfa" \
--backbone-output="${mauve_out}.mauve.backbone" \
--output-guide-tree="${mauve_out}.mauve.guide_tree.newick" \
gbk_files/g_14.P_polonicum.gbk gbk_files/g_351.P_wasabiae.gbk \
gbk_files/g_148.P_brasiliense.gbk gbk_files/g_333.P_odoriferum.gbk \
gbk_files/g_206.P_parmentieri.gbk gbk_files/g_428.P_punjabense.gbk \
gbk_files/g_31.P_quasiaquaticum.gbk gbk_files/g_421.P_aroidearum.gbk \
gbk_files/g_53.P_versatile.gbk gbk_files/g_160.P_polaris.gbk
Variation of TFL haplotype (without ein) across Pectobacterium species
Mauve analysis for TFL HGT example genomes
subsetName="tfl_hgt"
genomeSet="g_279,g_425,g_299,g_377,g_100,g_106,g_331,g_249,g_125,g_221,g_53,g_395,g_108,g_444,g_160,g_352"
outDir="analysis/pangenome_v2/carotovoricin/mauve/${subsetName}"
Here, only internal region between the two homology groups is extracted. Therefore, the --inner_region
option is used.
Rscript scripts/utils/HG_range_coordinates.R --hgs hg_22427604,hg_22427603 \
--inner_region --genomes "${genomeSet}" --out "${outDir}/hg_regions.tab"
Rest of the steps are similar to the ones described above.
Perform MSA using MAFFT
conda activate pantools_v4_3
cd analysis/pangenome_v2/carotovoricin/mauve/tfl_pbr/
mafft --globalpair --quiet --maxiterate 1000 --treeout ctv_pbr.variable_regions.fasta
Smash++ pairwise sequence comparison
conda activate omics_py37
cd analysis/pangenome_v2/carotovoricin/mauve/tfl_pbr/region_fasta
function smashpp_compare
{
smashpp -r $1 -t $2
smashpp viz -o $(basename $1 .fasta)$(basename $2 .fasta)".svg" "${1}.${2}.pos"
}
export -f smashpp_compare
smashpp_compare g_345.vir_1.fasta g_345.vir_1.fasta
smashpp_compare g_345.vir_1.fasta g_366.vir_3.fasta
smashpp_compare g_345.vir_1.fasta g_302.vir_3.fasta
smashpp_compare g_345.vir_1.fasta g_403.vir_3.fasta
smashpp_compare g_345.vir_1.fasta g_173.vir_2.fasta
smashpp_compare g_345.vir_1.fasta g_138.vir_2.fasta
smashpp_compare g_345.vir_1.fasta g_263.vir_2.fasta
smashpp_compare g_345.vir_1.fasta g_155.vir_1.fasta
Left inverted repeat for Ein: CTCCCGCAAACCTCGGTTTTGGGGAC (CTCCCGCAAACCTCGGTTT)
Left inverted repeat for Ein(rev-com): GTCCCCAAAACCGAGGTTTGCGGGAG (AAACCGAGGTTTGCGGGAG, AAACCGAGGTTTGCG)
Right inverted repeat for Ein: TTCTCGCAAACCTCGGTTTTGGAGAA
Right inverted repeat for Ein(rev): AAGAGGTTTTGGCTCCAAACGCTCTT
Right inverted repeat for Ein(rev-com): TTCTCCAAAACCGAGGTTTGCGAGAA (AAACCGAGGTTTGCGAGAA, AAACCGAGGTTTGCG)
Right inverted repeat for Ein(comp): AAGAGCGTTTGGAGCCAAAACCTCTT
CTV gene tree vs species tree comparison
Normalised Robinson-Foulds distance between the gene trees and the species tree.