PorthoMCL is a parallel orthology prediction tool using MCL for the realm of massive genome availability.
Please see GitHub for the source code, samples, and manuals.
To illustrate the power of PorthoMCL, we have applied it to all the 2,758 sequenced bacterial genomes in GenBank (downloaded: April 2014) using their annotated protein sequences. Here.
These genomes contain a total of 8,661,583 protein sequences with a median length of 270 amino acids. They serve as both the query and the database for all-against-all BLAST searches. We split the query into smaller files, each containing about 10,000 sequences, and ran the BLAST searches (e-value cutoff: 1e-5; database size: 1e8) in parallel using PorthoMCL.
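The query-splitting step can be sketched in a few lines of Python. This is only an illustration of the idea, not PorthoMCL's own script: the function names (`chunk_fasta`, `split_fasta_file`) and the output file naming are hypothetical, and the input is assumed to be a plain protein FASTA file. With about 10,000 sequences per file, 8,661,583 sequences would yield roughly 867 query files.

```python
from pathlib import Path
from typing import Iterable, Iterator, List


def chunk_fasta(lines: Iterable[str], seqs_per_chunk: int = 10_000) -> Iterator[List[str]]:
    """Yield lists of FASTA lines, each holding at most seqs_per_chunk records."""
    chunk: List[str] = []
    n_seqs = 0
    for line in lines:
        if line.startswith(">"):
            # A new record starts; close the current chunk if it is full.
            if n_seqs == seqs_per_chunk:
                yield chunk
                chunk, n_seqs = [], 0
            n_seqs += 1
        chunk.append(line)
    if chunk:
        yield chunk


def split_fasta_file(src: str, out_dir: str, seqs_per_chunk: int = 10_000) -> List[Path]:
    """Write each chunk of src to its own file (hypothetical naming scheme)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    with open(src) as handle:
        for i, chunk in enumerate(chunk_fasta(handle, seqs_per_chunk), start=1):
            path = out / f"query_{i:04d}.fasta"
            path.write_text("".join(chunk))
            written.append(path)
    return written
```

Each resulting file can then be searched against the full database as an independent job, which is what makes the all-against-all search easy to spread over many nodes.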
The combined BLAST output contained 2,957,375,578 hits. The total runtime of the BLAST searches was 11 days on a cluster with 60 computing nodes (each node has 12 cores and 36 GB of RAM); the same searches would need 549 days if run on a single node.
Computing Cluster: COBRA
PorthoMCL identified 763,506,331 ortholog gene pairs and 230,815 ortholog groups.
It also identified 318,186 paralogous gene pairs and 59,683 paralogous groups.
PorthoMCL finished this step in only 7 days on the same computing cluster (total runtime: 1,634 days), whereas OrthoMCL could not finish it after 35 days of running on a database server with 40 cores and 1 TB of RAM.
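As a rough reading of the runtimes quoted above, the speedup from parallelization can be computed directly. This is a back-of-the-envelope sketch: the helper names are illustrative, and it assumes the "total runtime" of the orthology step is the summed time of all parallel processes, which the text does not spell out.

```python
def speedup(serial_days: float, wall_clock_days: float) -> float:
    """Ratio of the serial (summed) runtime to the wall-clock runtime."""
    return serial_days / wall_clock_days


def efficiency(serial_days: float, wall_clock_days: float, workers: int) -> float:
    """Fraction of the ideal linear speedup achieved with the given worker count."""
    return speedup(serial_days, wall_clock_days) / workers


# BLAST step: 549 days on a single node vs. 11 days on 60 nodes.
blast_speedup = speedup(549, 11)             # about 49.9x
blast_efficiency = efficiency(549, 11, 60)   # about 0.83 of ideal

# Orthology step: 1,634 days of total runtime vs. 7 days of wall-clock time.
ortho_speedup = speedup(1634, 7)             # about 233x

print(f"BLAST: {blast_speedup:.1f}x speedup, {blast_efficiency:.0%} efficiency on 60 nodes")
print(f"Orthology step: {ortho_speedup:.1f}x speedup")
```

No efficiency figure is given for the orthology step because the number of parallel processes used is not stated in the text.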
Runtime for each parallel process: Download.
If these files are unavailable, please contact me on GitHub.