I’m excited to attend my first ISME conference in lovely Montreal! I’m presenting a poster in the Evolution session on some of my PhD work. It will be up Monday and Tuesday August 22nd-23rd: poster 325A “Gene content and diversification of secondary metabolite biosynthetic gene clusters coincides with divergence of terrestrial Streptomyces populations.”
Here, I want to dig further into the Background and Methods sections. I am working with two recently diverged, species-like populations of Streptomyces. For this poster, I focused on the evolutionary dynamics of secondary metabolite biosynthetic gene clusters (SMGCs). There are a gazillion SMGCs, and together they encode a massive diversity of natural products (e.g., antibiotics). Most of these originate in soil-dwelling Actinobacteria, including Streptomyces. However, we don’t have a solid understanding of the evolutionary and ecological processes that generate this immense diversity of SMGCs. So I used a comparative population approach to evaluate the forces driving diversification of SMGCs.
The web server antiSMASH is a great tool for exploring SMGCs in your genomes. antiSMASH uses a handful of algorithms to identify SMGCs of different types/classes and also reports any homology to clusters in the MIBiG database. I uploaded each of my 24 RAST-annotated genomes to this server, which identified 28-47 SMGCs per genome and a total of 22 different cluster classes. (Many of these clusters were identified as hybrids.) Ouph, stacked bar chart. So many gene clusters!
Next order of business was to identify clusters that are shared between genomes, conserved within members of the same population, conserved across all genomes, etc. Here is where I ran into some problems. Very few of these SMGCs show high homology to the database. For many, only a few of the genes within a cluster show database homology. Remember, these are gene clusters composed of many biosynthetic and regulatory genes, and antiSMASH reports the percentage of genes within a cluster that show homology to the database. Each genome was also run independently, so it’s likely that the same “unknown” cluster turns up in several genomes without being flagged as shared. For this next part, Chuck Pepe-Ranney helped me out significantly. We decided to define SMGCs based on gene content, so we no longer need to depend on database homology.
- First, I made a master multi-fasta file containing all of the genes that make up all of the SMGCs from all of the genomes.
- I called ORFs using Prodigal.
- Then I used parasail to identify pairs of ORFs that are orthologous.
- From here, I generated an “OTU” table where columns are orthologous gene groups, and each row is a single SMGC from a single genome.
- From the OTU table, I made a binary (Jaccard) distance matrix.
- My working cutoff on this presence/absence gene content is a dissimilarity of ≤0.4: pairs of SMGCs at or below it are connected in the network.
- Finally, I used the R package igraph to visualize these clusters and to define cluster membership.
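The table-to-distance steps above can be sketched in base R. This is only a toy illustration, not the code behind the poster: the SMGC and ortholog-group names are made up, and it assumes `dist(method = "binary")`, which gives the Jaccard dissimilarity for presence/absence data.

```r
# Toy presence/absence table: rows are SMGCs, columns are ortholog groups.
# All names and values are invented for illustration.
pa <- matrix(
  c(1, 1, 1, 0, 0,
    1, 1, 1, 1, 0,
    0, 0, 0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("genomeA_c1", "genomeB_c4", "genomeC_c2"),
                  paste0("og", 1:5))
)

# Binary (Jaccard) dissimilarity between every pair of SMGCs
d <- as.matrix(dist(pa, method = "binary"))

# Flatten the upper triangle into a 3-column table: cluster 1, cluster 2, value
idx <- which(upper.tri(d), arr.ind = TRUE)
network <- data.frame(
  cluster1 = rownames(d)[idx[, 1]],
  cluster2 = colnames(d)[idx[, 2]],
  value    = d[idx],
  stringsAsFactors = FALSE
)

# Keep only pairs at or below the working cutoff
snetwork <- network[network$value <= 0.4, ]
print(snetwork)
```

In this toy table the first two SMGCs share three of the four ortholog groups found in either of them, so their dissimilarity is 0.25 and they are the only pair that passes the cutoff.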
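    library(dplyr)

    ## network is a table with 3 columns: cluster 1, cluster 2, dissimilarity (column `value`)
    ## Keep only pairs of SMGCs at or below the 0.4 dissimilarity cutoff
    snetwork = network %>% filter(value <= 0.4)

    ## Build an undirected graph from the thresholded table
    g = igraph::graph_from_data_frame(snetwork, directed = FALSE)

    ## Calculate a layout (Fruchterman-Reingold)
    layout = igraph::layout_with_fr(g)

    ## Extract layout coordinates
    gpoints = data.frame("biocluster" = igraph::V(g)$name,
                         "x" = layout[,1],
                         "y" = layout[,2])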
igraph network clustering SMGCs based on presence/absence gene content. Each circle is a single SMGC from a single genome (colored by population). Circles that are close together have similar genes.
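For the "define cluster membership" part, one plausible approach is to treat each connected component of the thresholded network as a single shared SMGC. I haven't spelled out above which igraph routine does this, so take the sketch below as an assumption; the edge table and its names are hypothetical.

```r
library(igraph)

# Hypothetical thresholded table (pairs with dissimilarity <= 0.4)
snetwork <- data.frame(
  cluster1 = c("genomeA_c1", "genomeB_c4", "genomeC_c2"),
  cluster2 = c("genomeB_c4", "genomeD_c7", "genomeE_c3"),
  value    = c(0.25, 0.10, 0.30),
  stringsAsFactors = FALSE
)

g <- graph_from_data_frame(snetwork, directed = FALSE)

# Assumption: each connected component is one shared SMGC
comp <- components(g)
membership <- data.frame(
  biocluster = V(g)$name,
  group      = as.integer(comp$membership),
  stringsAsFactors = FALSE
)
print(membership)
```

A table like this can be joined to the layout coordinates on the biocluster column, which makes it easy to color or label points by shared cluster as well as by population.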
Now, I have a useful network of shared SMGCs for comparing intra- and inter-population patterns of diversification! If you don’t get a chance to see it in person, you can catch my poster here as well.