Benchmark 3#

All shell command blocks are run from the folder root: benchmark_3

Description#

benchmark_3 data come from chloroplast genomes, which can possess either inverted or direct repeats, joined by single-copy regions.

The data were not selected randomly:

  • some instances are easy to solve, which allows verifying the basic method requirements;

  • some instances are hard, with a high level of combinatorics.

Input data#

Dominique Lavenier generated the data for the khloraa scaffolding method, cf. Synthetic data generation:

  • data/khloraa_results/contigs_links_fasta.tar.xz: FASTA sequences of the contigs and the links

  • In the directory data/v1:

    • contigs_links.tar.xz contains the khloraascaf input data for each instance

    • metadata.json describes the instances in the compressed file and gives the starting contig for each

Alternative version of the data#

The data/v2 directory contains some of the v1 input data with minimal modifications.

These modifications highlight what did not work with the original v1 data, and why.

You can find all the modifications of the original input data in the Modified data: v2 page.

Install and configurations#

Python environment#

Make sure you have Python 3.9.

Create the Python virtual environment:

./config/install_venv39.sh

Activate the environment:

source .venv_39/bin/activate

Always make sure the environment is activated!

(.venv_39) # command lines may be prefixed by the environment name

Run the benchmark#

Run the benchmark:

python3.9 scripts/run_benchmark_3.py  > stdout.txt

This script generates a unique directory name under the main run directory run, e.g. run/2023-03-08_13:17:34_v1.

.
├── config
├── data
├── run
│   └── 2023-03-08_13:17:34_v1
├── scripts
└── README.md

The script run_benchmark_3.py prints to standard output the path of this uniquely named run directory, like this:

[INFO] benchmark_3 run_dir: run/2023-03-08_13:17:34_v1

If you have redirected the output to the stdout.txt file, you can retrieve the path in Bash like this:

run_dir=$( cat stdout.txt | sed  -E -e 's/\[INFO\] benchmark_3 run_dir: (.*)/\1/g')

Otherwise, without this intermediary file:

run_dir=$( python3.9 scripts/run_benchmark_3.py  | sed  -E -e 's/\[INFO\] benchmark_3 run_dir: (.*)/\1/g')
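If you prefer to stay in Python, here is a minimal sketch (relying only on the [INFO] log line format shown above) that extracts the same path from the captured output:

```python
import re
from typing import Optional


def extract_run_dir(stdout_text: str) -> Optional[str]:
    """Return the run directory path from the benchmark's standard output.

    Relies on the documented log line:
    [INFO] benchmark_3 run_dir: <path>
    """
    match = re.search(r"\[INFO\] benchmark_3 run_dir: (.*)", stdout_text)
    return match.group(1).strip() if match else None
```

It returns None when no such log line is found, so the caller can fail gracefully.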

At the root of the uniquely named directory you can find the run_io.yaml metadata file, which contains the options used for the run and the overall run times.

For each instance in the data directory, a directory named after the instance is created in the uniquely named directory; it contains the khloraascaf results.

In $run_dir you can also find the file mdcg_sizes.tsv, which gives, for each instance, the number of contigs, the number of links, and the multiplied doubled contig graph sizes.
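As a sketch, mdcg_sizes.tsv can be loaded with the standard csv module; note that the column names used in any downstream code are illustrative assumptions, since the actual header is written by the benchmark script:

```python
import csv
from typing import Dict, List


def read_mdcg_sizes(tsv_path: str) -> List[Dict[str, str]]:
    """Load the tab-separated size table into one dictionary per instance row."""
    with open(tsv_path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```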

Options#

By default, the above script runs the benchmark on the v1 data with the CBC solver (an open-source MILP solver distributed with the PuLP Python package).

If you want to change any parameters, please refer to the script help: python3.9 scripts/run_benchmark_3.py --help

Evaluate the benchmark#

The next sections explain how to verify the results for all the data instances.

Evaluate the order of the oriented contigs#

This evaluation is the most stringent one.

Indeed, if, in any path of the assembly graph, at least one oriented contig is misplaced (or not placed), then the test fails for that path.

python3.9 scripts/evaluate_orc_path.py $run_dir

The latter will produce the TSV file $run_dir/orc_path_eval.tsv.

Column label        Meaning
instance_name       The name of the instance for which one result is evaluated
ilp_combi           The order of the ILP problems solved that produced the result
total_paths         Number of paths in the assembly graph
successful_paths    Number of paths that exactly match the solution path

  • There can be several rows with the same instance_name, they are differentiated by the ilp_combi value.

If there is no solution for an instance, then ilp_combi = successful_paths = total_paths = NA
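Given the columns above, a short Python sketch can compute the success rate per (instance_name, ilp_combi) row, mapping NA rows (instances without a solution) to None:

```python
import csv
from typing import Dict, Optional


def success_rates(tsv_path: str) -> Dict[str, Optional[float]]:
    """Map 'instance_name/ilp_combi' to successful_paths / total_paths.

    Rows where the columns are NA (no solution) are mapped to None.
    """
    rates: Dict[str, Optional[float]] = {}
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            key = f"{row['instance_name']}/{row['ilp_combi']}"
            if row["total_paths"] == "NA":
                rates[key] = None
            else:
                rates[key] = int(row["successful_paths"]) / int(row["total_paths"])
    return rates
```

The same sketch works verbatim for orreg_path_eval.tsv below, since both files share the same columns.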

Evaluate the order of the oriented regions#

This evaluation is less stringent than the previous one.

Indeed, a path of oriented regions can exactly match the oriented region solution path even if the oriented contigs in the regions are not the same, or their order differs.

python3.9 scripts/evaluate_orreg_path.py $run_dir

The latter will produce the TSV file $run_dir/orreg_path_eval.tsv.

Column label        Meaning
instance_name       The name of the instance for which one result is evaluated
ilp_combi           The order of the ILP problems solved that produced the result
total_paths         Number of paths in the assembly graph
successful_paths    Number of paths that exactly match the solution path

  • There can be several rows with the same instance_name, they are differentiated by the ilp_combi value.

If there is no solution for an instance, then ilp_combi = successful_paths = total_paths = NA

Evaluate the produced FASTA sequences with QUAST#

This evaluation has two benefits:

  • even if the order of contigs / regions was not found, perhaps the sequences produced can match the reference

  • even if the order of contigs / regions was found, if there are misassemblies, then we have to look after the sequence of the contigs itself or the sequences of the links

python3.9 scripts/evaluate_quast.py $run_dir

The latter will produce the TSV files:

  • $run_dir/contig_stat.tsv verifies that the contigs were not misassembled during the assembly step (which would reveal an error in the input data)

  • $run_dir/solscaf_stats.tsv verifies that the contigs of the given oriented contig solution orders match the references (this also checks that the given solutions are correct)

  • $run_dir/wholescaf.tsv gives the number of misassemblies and the genome coverage percentages for each form of each problem succession

Column label     Meaning
instance_name    The name of the instance for which one result is evaluated
ilp_combi        The order of the ILP problems solved that produced the result
num_circuit      The Eulerian circuit number
%gnm             Genome coverage percentage given by QUAST
#mis             Number of misassemblies given by QUAST

  • There can be several rows with the same instance_name, they are differentiated by the ilp_combi and num_circuit values.

If there is no solution for an instance, then all these column values are NA.
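Given the columns above, a small sketch can flag the result rows for which QUAST reports at least one misassembly (skipping NA rows):

```python
import csv
from typing import List


def rows_with_misassemblies(tsv_path: str) -> List[str]:
    """Return the instance names of rows whose '#mis' column is positive."""
    flagged: List[str] = []
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if row["#mis"] != "NA" and float(row["#mis"]) > 0:
                flagged.append(row["instance_name"])
    return flagged
```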

Evaluate the solver performance (Gurobi only)#

The following command produces a TSV file containing some solver statistics.

python3 scripts/evaluate_solver.py $run_dir

Explore the multiplied doubled contig graph with the solution#

The script scripts/data_to_networkx.py generates the initial multiplied doubled contig graph in GraphML format.

The script tries to aggregate all the solutions (if they exist) as edge and vertex attributes.

You can explore the combinatorics by importing the graphs into Gephi, for example.
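Before opening a graph in Gephi, a standard-library-only sketch can list which solution attributes a generated GraphML file declares (GraphML stores attribute declarations as <key> elements in its standard namespace):

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# GraphML's standard XML namespace, used to qualify element lookups.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"


def graphml_attribute_keys(path: str) -> List[Tuple[str, str]]:
    """Return (attribute name, domain) pairs declared in a GraphML file.

    The domain ('node' or 'edge') tells whether the attribute is
    attached to vertices or to edges.
    """
    root = ET.parse(path).getroot()
    return [
        (key.get("attr.name", ""), key.get("for", ""))
        for key in root.iter(f"{GRAPHML_NS}key")
    ]
```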

Test the hardest data#

The Agathis_dammara instance is the most difficult one.

Indeed, the combination of direct and inverted repeats produces a lot of results.

Furthermore, the graph possesses many extra edges that connect the single-copy regions to each other and to the repeats.

Is the given solution feasible?#

That is the question the script scripts/test_agathis.py answers.

The answer is yes, but there are many equivalent results (with equal ILP objective values) that differ in the order of the oriented contigs within the single-copy regions.

You can refer to the standard output messages.

Bijection between the oriented region graphs#

To test the bijection, first run the benchmark on the v2 data (as the original data underestimate the contig multiplicities):

python3.9 scripts/run_benchmark_3.py --data-version v2 > stdout.txt

Retrieve the uniquely named directory path:

run_dir=$( cat stdout.txt | sed  -E -e 's/\[INFO\] benchmark_3 run_dir: (.*)/\1/g')

or directly:

run_dir=$( python3.9 scripts/run_benchmark_3.py --data-version v2 | sed  -E -e 's/\[INFO\] benchmark_3 run_dir: (.*)/\1/g')

The script scripts/asm_graph_bijection.py verifies the existence of a bijection between the assembly graphs produced by different ILP combination orders.

python3 scripts/asm_graph_bijection.py $run_dir

Format the evaluations to LaTeX#

Oriented contig and oriented region orders#

python3.9 scripts/latex_contig_region_orders.py $run_dir

The latter generates a TeX file: $run_dir/benchmark_3_v1_orders.tex

Note that if you have run the benchmark with default options, you get v1 in the file name, as it was the first data version to be processed.

QUAST evaluation#

python3 scripts/latex_quast.py $run_dir

The latter generates a TeX file: $run_dir/benchmark_3_v1_quast.tex

Note that if you have run the benchmark with default options, you get v1 in the file name, as it was the first data version to be processed.

Solver stats#

python3 scripts/latex_ilp_stats.py $run_dir

Miscellaneous#

  • Remove the quast* directories, either using the run identifier:

    find run/$run_uid -maxdepth 2 -type d -name 'quast*' | xargs rm -rf

    or directly using the run directory path:

    find $run_dir -maxdepth 2 -type d -name 'quast*' | xargs rm -rf


Results#

Order and QUAST evaluations for v1 data can be found in:

Order and QUAST evaluations for v2 data can be found in: