# Benchmark 3

```{toctree}
---
caption: First version
maxdepth: 1
hidden:
---
benchmark_3/synthetic_data_v1
benchmark_3/v1_order_analysis
benchmark_3/v1_quast_analysis
```

```{toctree}
---
caption: Second version
maxdepth: 1
hidden:
---
benchmark_3/synthetic_data_v2
benchmark_3/v2_order_analysis
benchmark_3/v2_quast_analysis
```

*All shell command blocks begin at the folder root: `benchmark_3`*

<!-- DOCU benchmark_3 README -->


## Description

`benchmark_3` data come from chloroplast genomes that can posses either inverted or direct repeats, joined by single-copy regions.

The data were not selected randomly:
* some instances are easy to solve, this allows to verify the basic method requirements;
* some hard instances with a high level of combinatoric.

### Input data

Dominique Lavenier generated the data for the khloraa scaffolding method, c.f. [Synthetic data generation](benchmark_3/synthetic_data_v1.md):

* `data/khloraa_results/contigs_links_fasta.tar.xz`: FASTA sequences of the contigs and the links
* In the directory `data/v1`:
   * `contigs_links.tar.xz` contains the `khloraascaf` input data for each instance
   * `metadata.json` describes the instance in the compressed file and gives the starting contig for each

#### Alternative version of the data

`data/v2` directory contains some `v1` input data with some minimal modifications.

They allow highlighting what did not work and why with the original `v1` data.

You can find in [Modified data: v2](benchmark_3/synthetic_data_v2.md) page all the modifications of the original input data.


## Install and configurations

### Python environment

Be sure you have Python3.9.

Create python virtual environment:
```sh
./config/install_venv39.sh
```
Activate the environment:
```sh
source .venv_39/bin/activate
```

**Always make sure the environment is activated!**
```sh
(.venv_39) # command lines may be prefix by the environment name
```


## Run the benchmark

Run the benchmark:
```sh
python3.9 scripts/run_benchmark_3.py  > stdout.txt
```

This script generates a unique directory name under the main run directory `run`, e.g. `run/2023-03-08_13:17:34_v1`.

```
.
├── config
├── data
├── run
│   └── 2023-03-08_13:17:34_v1
├── scripts
└── README.md
```

Script `run_benchmark_3.py` prints in the standard output the path of this uniquely named run directory, like this:
```
[INFO] benchmark_3 run_dir: run/2023-03-08_13:17:34_v1
```
You can get it in `bash` like this:
```sh
run_dir=$( cat stdout.txt | sed  -E -e 's/\[INFO\] benchmark_3 run_dir: (.*)/\1/g')
```
if you have redirected in `stdout.txt` file, else without this intermediary file:
```sh
run_dir=$( python3.9 scripts/run_benchmark_3.py  | sed  -E -e 's/\[INFO\] benchmark_3 run_dir: (.*)/\1/g')
```

At the root of the uniquely named directory you can find the `run_io.yaml` metadata file that contains the options used for the run and the global times of the run.

For each instance in the data directory, a directory of the name of the instance is created in the uniquely named directory, and contains the `khloraascaf` results.

In `$run_dir` you can find the file `mdcg_sizes.tsv` that gives for each instance the number of contigs, of links, and the multiplied doubled contig graph sizes.

### Options

By default, the above script run the benchmark for `v1` data with `cbc` solver (open-source MILP solver distributed with PuLP python package).

If you want to change any parameters, please refer to the script help: `python3.9 scripts/run_benchmark_3.py --help`


## Evaluate the benchmark

The two next sections explain how to verify the results for all the data instances.

### Evaluate the order of the oriented contigs

This evaluation is the more stringent one.

Indeed, for any reason, if, for all paths in the assembly graph, at least one oriented contig is misplaced (or not placed), then the test fails.

```sh
python3.9 scripts/evaluate_orc_path.py $run_dir
```

The latter will produce the TSV file `$run_dir/orc_path_eval.tsv`.

| Column label     | Meaning                                                    |
| ---------------- | ---------------------------------------------------------- |
| instance_name    | The name of the instance for which one result is evaluated |
| ilp_combi        | The order of ILP problems solved that produced the result  |
| total_paths      | Number of path in the assembly graph                       |
| successful_paths | Number of paths that exactly match the solution path       |

> * There can be several rows with the same `instance_name`, they are differentiated by the `ilp_combi` value.
>
> If there is no solution for an instance, then
> `ilp_combi` = `successful_paths` = `total_paths` = `NA`

### Evaluate the order of the oriented regions

This evaluation is less stringent than the previous one.

Indeed, a path of oriented regions can exactly match the oriented region solution path even if the oriented contigs in the regions are not the same, or their order differ.

```sh
python3.9 scripts/evaluate_orreg_path.py $run_dir
```

The latter will produce the TSV file `$run_dir/orreg_path_eval.tsv`.

| Column label     | Meaning                                                    |
| ---------------- | ---------------------------------------------------------- |
| instance_name    | The name of the instance for which one result is evaluated |
| ilp_combi        | The order of ILP problems solved that produced the result  |
| total_paths      | Number of path in the assembly graph                       |
| successful_paths | Number of paths that exactly match the solution path       |

> * There can be several rows with the same `instance_name`, they are differentiated by the `ilp_combi` value.
>
> If there is no solution for an instance, then
> `ilp_combi` = `successful_paths` = `total_paths` = `NA`

### Evaluate the produced FASTA sequences with QUAST

This evaluation has two benefits:
* even if the order of contigs / regions was not found, perhaps the sequences produced can match the reference
* even if the order of contigs / regions was found, if there are misassemblies, then we have to look after the sequence of the contigs itself or the sequences of the links

```sh
python3.9 scripts/evaluate_quast.py $run_dir
```

The latter will produce the TSV files:
* `$run_dir/contig_stat.tsv` that verifies if the contig was not misassembled during the assembly step (error in the input data)
* `$run_dir/solscaf_stats.tsv` that verifies if the contigs of the given oriented contig solution orders match the references (we can also check if the given solutions are correct)
* `$run_dir/wholescaf.tsv` that gives the misassemblies number and the percentages of genome coverage for each form of each problem successions

| Column label  | Meaning                                                    |
| ------------- | ---------------------------------------------------------- |
| instance_name | The name of the instance for which one result is evaluated |
| ilp_combi     | The order of ILP problems solved that produced the result  |
| num_circuit   | Corresponds to the eulerian circuit number                 |
| %gnm          | Genome coverage percentage given by QUAST                  |
| #mis          | Number of misassemblies given by QUAST                     |

> * There can be several rows with the same `instance_name`, they are differentiated by the `ilp_combi` and `num_circuit` values.
>
> If there is no solution for an instance, then
> `ilp_combi` = `successful_paths` = `total_paths` = `NA`

### Evaluate the solver performance (Gurobi only)

What follows produces a TSV file that contains some solver stats.
```sh
python3 scripts/evaluate_solver.py $run_dir
```


## Explore the multiplied doubled contig graph with the solution

The script `scripts/data_to_networkx.py` permits to generate the initial multiplied doubled contig graph in GraphML format.

The script tries to aggregate all the solutions (if they exist) as edges and vertices attributes.

You can explore the combinatoric by importing the graphs in Gephi for example.


## Test the hardest data

The `Agathis_dammara` instance is the most difficult one.

Indeed, the combination of direct and inverted repeat produce a lot of results.

Furthermore, the graph possesses a lot of extra-edges that connect the single-copy regions between them and between the repeats.

### Is the solution given is feasible?

That the question the script `scripts/test_agathis.py` answers.

And the answer is yes, but there is a lot of equivalent results (according to the equal ILP objective values), that differ in the order of oriented contigs in the single-copy regions.

You can refer to the standard output messages.

### Bijection between the oriented region graphs

<!-- DOCU biject section come after? -->

To test it, first run the benchmark on `v2` data (as the original data underestimate the contig multiplicities):

<!-- DOCU point to running v2 data? -->

```sh
python3.9 scripts/run_benchmark_3.py --data-version v2 > stdout.txt
```
Retrieve the uniquely named directory path:
```sh
run_dir=$( cat stdout.txt | sed  -E -e 's/\[INFO\] benchmark_3 run_dir: (.*)/\1/g')
```
or directly:
```sh
run_dir=$( python3.9 scripts/run_benchmark_3.py --data-version v2 | sed  -E -e 's/\[INFO\] benchmark_3 run_dir: (.*)/\1/g')
```

`scripts/asm_graph_bijection.py` aims to verify the existence of a bijection between the assembly graph produced by different ILP combination order.

```sh
python3 scripts/asm_graph_bijection.py $run_dir
```


## Format the evaluations to LaTeX

### Oriented contig and oriented region orders

```sh
python3.9 scripts/latex_contig_region_orders.py $run_dir
```
The latter generates a TeX file: `$run_dir/benchmark_3_v1_orders.tex`
> Note that in our case, if you have run the benchmark with default options, you get `v1` as it was the first data version that have been processed.

### QUAST evaluation

```sh
python3 scripts/latex_quast.py $run_dir
```
The latter generates a TeX file: `$run_dir/benchmark_3_v1_quast.tex`
> Note that in our case, if you have run the benchmark with default options, you get `v1` as it was the first data version that have been processed.

### Solver stats

```sh
python3 scripts/latex_ilp_stats.py $run_dir
```


## Miscellaneous

* Remove `quast*` directories in `run/$run_uid`
    ```sh
    find run/$run_uid -maxdepth 2 -type d -name 'quast*' | xargs rm -rf
    ```
    ```sh
    find $run_dir -maxdepth 2 -type d -name 'quast*' | xargs rm -rf
    ```


## Results

Order and Quast evaluations for v1 data can be found in:
* [Order evaluation v1](benchmark_3/v1_order_analysis.md)
* [Quast evaluation v1](benchmark_3/v1_quast_analysis.md)

Order and Quast evaluations for v2 data can be found in:
* [Order evaluation v2](benchmark_3/v2_order_analysis.md)
* [Quast evaluation v2](benchmark_3/v2_quast_analysis.md)