Benchmark 3#
All shell command blocks are run from the folder root: benchmark_3
Description#
The benchmark_3 data come from chloroplast genomes, which can possess either inverted or direct repeats joined by single-copy regions.
The data were not selected randomly:
- some instances are easy to solve, which allows verifying the basic method requirements;
- some instances are hard, with a high level of combinatorics.
Input data#
Dominique Lavenier generated the data for the khloraa scaffolding method, cf. Synthetic data generation:
- data/khloraa_results/contigs_links_fasta.tar.xz: FASTA sequences of the contigs and the links

In the directory data/v1:
- contigs_links.tar.xz contains the khloraascaf input data for each instance
- metadata.json describes the instances in the compressed file and gives the starting contig for each
Alternative version of the data#
The data/v2 directory contains some of the v1 input data with minimal modifications.
These modifications highlight what did not work with the original v1 data, and why.
You can find all the modifications of the original input data in the Modified data: v2 page.
Install and configurations#
Python environment#
Be sure you have Python 3.9.
Create the Python virtual environment:
./config/install_venv39.sh
Activate the environment:
source .venv_39/bin/activate
Always make sure the environment is activated!
(.venv_39) # command lines may be prefixed by the environment name
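If in doubt, a minimal sanity check from Python itself (it assumes the environment was created by install_venv39.sh, so its prefix ends with .venv_39):

```python
import sys

# Inside the activated environment, the interpreter should be a 3.9.x
# and its prefix should point inside the virtual environment.
print(sys.version)  # expect 3.9.x
print(sys.prefix)   # expect a path ending with .venv_39
```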
Run the benchmark#
Run the benchmark:
python3.9 scripts/run_benchmark_3.py > stdout.txt
This script creates a uniquely named directory under the main run directory run, e.g. run/2023-03-08_13:17:34_v1.
.
├── config
├── data
├── run
│ └── 2023-03-08_13:17:34_v1
├── scripts
└── README.md
The run_benchmark_3.py script prints the path of this uniquely named run directory to the standard output, like this:
[INFO] benchmark_3 run_dir: run/2023-03-08_13:17:34_v1
You can get it in bash like this, if you have redirected the output to the stdout.txt file:
run_dir=$( cat stdout.txt | sed -E -e 's/\[INFO\] benchmark_3 run_dir: (.*)/\1/g')
or, without this intermediary file:
run_dir=$( python3.9 scripts/run_benchmark_3.py | sed -E -e 's/\[INFO\] benchmark_3 run_dir: (.*)/\1/g')
At the root of the uniquely named directory you can find the run_io.yaml metadata file, which contains the options used for the run and the global run times.
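The exact keys of run_io.yaml are not detailed here; a minimal sketch to dump whatever the file contains, assuming its top level is a mapping and that PyYAML is available:

```python
from pathlib import Path

import yaml  # PyYAML

run_dir = Path("run/2023-03-08_13:17:34_v1")  # replace with your own run directory
with open(run_dir / "run_io.yaml") as yaml_file:
    run_metadata = yaml.safe_load(yaml_file)

# Print the run options and global times, whatever keys are present.
for key, value in run_metadata.items():
    print(f"{key}: {value}")
```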
For each instance in the data directory, a directory named after the instance is created in the uniquely named directory; it contains the khloraascaf results.
In $run_dir you can also find the mdcg_sizes.tsv file, which gives, for each instance, the number of contigs, the number of links, and the sizes of the multiplied doubled contig graph.
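As an illustration, a hedged sketch that prints mdcg_sizes.tsv row by row (the exact column labels are not specified here, so the sketch does not assume any):

```python
import csv
from pathlib import Path

run_dir = Path("run/2023-03-08_13:17:34_v1")  # replace with your own run directory
with open(run_dir / "mdcg_sizes.tsv", newline="") as tsv_file:
    # One row per instance: contig number, link number and graph sizes.
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        print(row)
```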
Options#
By default, the above script runs the benchmark on the v1 data with the cbc solver (the open-source MILP solver distributed with the PuLP Python package).
If you want to change any parameters, please refer to the script help: python3.9 scripts/run_benchmark_3.py --help
Evaluate the benchmark#
The following sections explain how to verify the results for all the data instances.
Evaluate the order of the oriented contigs#
This evaluation is the most stringent one.
Indeed, if, for any path in the assembly graph, at least one oriented contig is misplaced (or not placed), then the test fails.
python3.9 scripts/evaluate_orc_path.py $run_dir
The latter will produce the TSV file $run_dir/orc_path_eval.tsv.
| Column label | Meaning |
| --- | --- |
| instance_name | The name of the instance for which one result is evaluated |
| ilp_combi | The order of ILP problems solved that produced the result |
| total_paths | Number of paths in the assembly graph |
| successful_paths | Number of paths that exactly match the solution path |
There can be several rows with the same instance_name; they are differentiated by the ilp_combi value.
If there is no solution for an instance, then ilp_combi = successful_paths = total_paths = NA.
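For example, a minimal sketch that summarises orc_path_eval.tsv per instance, using the column labels from the table above and the NA convention from the note (the same sketch applies verbatim to orreg_path_eval.tsv from the next section):

```python
import csv
from pathlib import Path

run_dir = Path("run/2023-03-08_13:17:34_v1")  # replace with your own run directory
with open(run_dir / "orc_path_eval.tsv", newline="") as tsv_file:
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        if row["ilp_combi"] == "NA":
            # No solution for this instance (see the note above).
            print(f"{row['instance_name']}: no solution")
        elif row["successful_paths"] == row["total_paths"]:
            print(f"{row['instance_name']} ({row['ilp_combi']}): all paths match")
        else:
            print(f"{row['instance_name']} ({row['ilp_combi']}): "
                  f"{row['successful_paths']}/{row['total_paths']} paths match")
```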
Evaluate the order of the oriented regions#
This evaluation is less stringent than the previous one.
Indeed, a path of oriented regions can exactly match the oriented region solution path even if the oriented contigs in the regions are not the same, or their order differs.
python3.9 scripts/evaluate_orreg_path.py $run_dir
The latter will produce the TSV file $run_dir/orreg_path_eval.tsv.
| Column label | Meaning |
| --- | --- |
| instance_name | The name of the instance for which one result is evaluated |
| ilp_combi | The order of ILP problems solved that produced the result |
| total_paths | Number of paths in the assembly graph |
| successful_paths | Number of paths that exactly match the solution path |
There can be several rows with the same instance_name; they are differentiated by the ilp_combi value.
If there is no solution for an instance, then ilp_combi = successful_paths = total_paths = NA.
Evaluate the produced FASTA sequences with QUAST#
This evaluation has two benefits:
- even if the order of the contigs / regions was not found, perhaps the produced sequences can match the reference;
- even if the order of the contigs / regions was found, if there are misassemblies, then we have to look at the sequences of the contigs themselves or at the sequences of the links.
python3.9 scripts/evaluate_quast.py $run_dir
The latter will produce the TSV files:
- $run_dir/contig_stat.tsv verifies that the contigs were not misassembled during the assembly step (error in the input data)
- $run_dir/solscaf_stats.tsv verifies that the contigs of the given oriented contig solution orders match the references (we can also check if the given solutions are correct)
- $run_dir/wholescaf.tsv gives the number of misassemblies and the genome coverage percentages for each form of each problem succession
| Column label | Meaning |
| --- | --- |
| instance_name | The name of the instance for which one result is evaluated |
| ilp_combi | The order of ILP problems solved that produced the result |
| num_circuit | Corresponds to the Eulerian circuit number |
| %gnm | Genome coverage percentage given by QUAST |
| #mis | Number of misassemblies given by QUAST |
There can be several rows with the same instance_name; they are differentiated by the ilp_combi and num_circuit values.
If there is no solution for an instance, then ilp_combi = successful_paths = total_paths = NA.
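As an illustration, a hedged sketch that keeps only the wholescaf.tsv rows with full genome coverage and no misassembly (the column labels come from the table above; the strict 100% / 0 criterion is a choice, relax it as needed):

```python
import csv
from pathlib import Path

run_dir = Path("run/2023-03-08_13:17:34_v1")  # replace with your own run directory
with open(run_dir / "wholescaf.tsv", newline="") as tsv_file:
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        if row["ilp_combi"] == "NA":
            continue  # instance without solution
        # Strict filter: full genome coverage and no misassembly.
        if float(row["%gnm"]) == 100.0 and int(row["#mis"]) == 0:
            print(row["instance_name"], row["ilp_combi"], row["num_circuit"])
```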
Evaluate the solver performance (Gurobi only)#
What follows produces a TSV file that contains some solver stats.
python3 scripts/evaluate_solver.py $run_dir
Explore the multiplied doubled contig graph with the solution#
The script scripts/data_to_networkx.py generates the initial multiplied doubled contig graph in GraphML format.
The script tries to aggregate all the solutions (if they exist) as edges and vertices attributes.
You can explore the combinatorics by importing the graphs into Gephi, for example.
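Besides Gephi, the GraphML output can also be inspected programmatically; a minimal sketch with networkx (the GraphML file name below is hypothetical, point it at a file produced by data_to_networkx.py):

```python
import networkx as nx

# Hypothetical file name: use a GraphML file produced by
# scripts/data_to_networkx.py.
graph = nx.read_graphml("instance_mdcg.graphml")

print(f"{graph.number_of_nodes()} vertices, {graph.number_of_edges()} edges")
# Solutions, when they exist, are aggregated as vertex and edge attributes.
for vertex, attributes in list(graph.nodes(data=True))[:5]:
    print(vertex, attributes)
```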
Test the hardest data#
The Agathis_dammara
instance is the most difficult one.
Indeed, the combination of direct and inverted repeats produces a lot of results.
Furthermore, the graph possesses a lot of extra edges that connect the single-copy regions to each other and to the repeats.
Is the given solution feasible?#
That is the question the script scripts/test_agathis.py answers.
And the answer is yes, but there are a lot of equivalent results (with equal ILP objective values) that differ in the order of the oriented contigs in the single-copy regions.
You can refer to the standard output messages.
Bijection between the oriented region graphs#
To test it, first run the benchmark on v2
data (as the original data underestimate the contig multiplicities):
python3.9 scripts/run_benchmark_3.py --data-version v2 > stdout.txt
Retrieve the uniquely named directory path:
run_dir=$( cat stdout.txt | sed -E -e 's/\[INFO\] benchmark_3 run_dir: (.*)/\1/g')
or directly:
run_dir=$( python3.9 scripts/run_benchmark_3.py --data-version v2 | sed -E -e 's/\[INFO\] benchmark_3 run_dir: (.*)/\1/g')
scripts/asm_graph_bijection.py aims to verify the existence of a bijection between the assembly graphs produced by different ILP combination orders.
python3 scripts/asm_graph_bijection.py $run_dir
Format the evaluations to LaTeX#
Oriented contig and oriented region orders#
python3.9 scripts/latex_contig_region_orders.py $run_dir
The latter generates a TeX file: $run_dir/benchmark_3_v1_orders.tex
Note that if you have run the benchmark with the default options, you get v1, as it was the first data version to have been processed.
QUAST evaluation#
python3 scripts/latex_quast.py $run_dir
The latter generates a TeX file: $run_dir/benchmark_3_v1_quast.tex
Note that if you have run the benchmark with the default options, you get v1, as it was the first data version to have been processed.
Solver stats#
python3 scripts/latex_ilp_stats.py $run_dir
Miscellaneous#
Remove the quast* directories in run/$run_uid:
find run/$run_uid -maxdepth 2 -type d -name 'quast*' | xargs rm -rf
or, equivalently, with the $run_dir variable defined above:
find $run_dir -maxdepth 2 -type d -name 'quast*' | xargs rm -rf
Results#
Order and QUAST evaluations for v1 data can be found in:
Order and QUAST evaluations for v2 data can be found in: