Synthetic data generation#

This readme describes how the files listed in section 7 were obtained from simulated data and database knowledge.

1. Get a list of plant names with an available chloroplast genome from CpGDB (2023-01-04)#

CpGDB: A Comprehensive Database of Chloroplast Genomes http://www.gndu.ac.in/CpGDB/index.aspx

Download Supplementary Table S1: List of plant species present in CpGDB
get SupplementaryTableS1.pdf from http://www.gndu.ac.in/CpGDB/SupplementaryData.aspx

2. Convert PDF file to CSV file#

Use online converter https://document.online-convert.com/fr/convertir/pdf-en-xlsx

get TableS1.xlsx
Open TableS1.xlsx in Excel
- save as .csv file (separator = ';')
get TableS1.csv
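
As a quick sanity check on the exported file, the semicolon-separated table can be read with Python's standard `csv` module. This is a minimal sketch: the column names and values in the sample are placeholders, and the real layout of TableS1.csv may differ.

```python
import csv
import io


def read_semicolon_csv(handle):
    """Return the non-empty rows of a ';'-separated CSV as lists of strings."""
    reader = csv.reader(handle, delimiter=";")
    rows = []
    for row in reader:
        cells = [cell.strip() for cell in row]
        # Spreadsheet exports often contain blank rows; skip them.
        if any(cells):
            rows.append(cells)
    return rows


# Placeholder excerpt (hypothetical columns, not the real TableS1 content).
sample = io.StringIO("Species;Family;Accession\nPlant one;Family A;ACC_0001\n;;\n")
rows = read_semicolon_csv(sample)
```

With a real file, the same function can be called on `open("TableS1.csv", newline="")`.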

3. Randomly select plant names#

Choose 200 chloroplast genomes from different plant families among the 3823 chloroplast genomes available in CpGDB (2023-01-04)

python selectPlantNames.py

output ident_genome.txt
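
The selection step can be sketched as follows. This is not the content of selectPlantNames.py: it is an illustration that assumes "from different families" means at most one genome per family, and the function name `select_genomes` and the `(family, accession)` record layout are hypothetical.

```python
import random


def select_genomes(records, n, seed=None):
    """Pick n accessions, each from a distinct family.

    records: iterable of (family, accession) pairs (assumed layout).
    """
    rng = random.Random(seed)
    by_family = {}
    for family, accession in records:
        by_family.setdefault(family, []).append(accession)
    if n > len(by_family):
        raise ValueError("fewer families than requested genomes")
    # Sample n distinct families, then one accession within each of them.
    families = rng.sample(sorted(by_family), n)
    return [rng.choice(by_family[family]) for family in families]


# Placeholder records (hypothetical identifiers).
records = [("Family A", "ACC_1"), ("Family A", "ACC_2"),
           ("Family B", "ACC_3"), ("Family C", "ACC_4")]
chosen = select_genomes(records, 2, seed=0)
```

Passing a fixed `seed` makes the draw reproducible, which helps when the list of identifiers has to be regenerated later.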

4. Download chloroplast genomes#

Material#

website: NCBI Batch Entrez https://www.ncbi.nlm.nih.gov/sites/batchentrez
Database: nucleotide
File: ident_genome.txt

Method#

click: Retrieve

menu

Received lines: 200
Rejected lines: 0
Removed duplicates: 0
Passed to Entrez: 200
Retrieve records for 200 UID(s)

click: Retrieve records for 200 UID(s)
- Send to
- Choose Destination: File
- Format: FASTA
- Create file
get sequence.fasta (in the Downloads directory)
- rename it to genomes.fasta

5. Split genomes.fasta#

python split_genome.py

get results.csv (to prepare results Excel file)
get 200 FASTA files (one for each chloroplast genome) in the db/ directory
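
Splitting a multi-record FASTA file into one file per genome can be sketched with the standard library alone. This is an illustration, not the content of split_genome.py; the function names and the choice of the first header token as file name are assumptions.

```python
from pathlib import Path


def split_fasta(text):
    """Map each record identifier (first header token) to its FASTA record."""
    records = {}
    current = None
    for line in text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            # Ignore malformed headers that carry no identifier.
            current = header[0] if header else None
            if current is not None:
                records[current] = [line]
        elif current is not None and line.strip():
            records[current].append(line.strip())
    return {ident: "\n".join(lines) + "\n" for ident, lines in records.items()}


def write_records(records, out_dir):
    """Write one <identifier>.fasta file per record into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for ident, record in records.items():
        (out_dir / f"{ident}.fasta").write_text(record)


# Placeholder input (hypothetical identifiers and sequences).
demo = ">ACC_1 plant one chloroplast\nATGC\nGGCC\n\n>ACC_2 plant two\nTTAA\n"
records = split_fasta(demo)
```

Calling `write_records(split_fasta(Path("genomes.fasta").read_text()), "db")` would then produce the per-genome files.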

6. Select interesting genomes#

Run khloraa (manually) on the 200 genome datasets
Select (manually) the genomes that present some difficulties
get khloraa_results.xlsx (selection of 31 chloroplast genomes)
Generate Graph_<plant_name>.txt files

7. Gather useful information#

Copy to the same directory:

Khloraa_instance_id.tar.xz archive:
- Khloraa__<plant_name>.txt khloraa parameters to generate contigs and links (31 files)
Genome_instance_id.tar.xz archive:
- Genome_<plant_name>.fa genome in FASTA format (31 files)
Graph_instance_id.tar.xz archive:
- Graph_<plant_name>.txt contig graph (31 files)
khloraa_results.odt information on selected genomes
readme.md this file
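
The .tar.xz archives can be inspected without unpacking them, using Python's `tarfile` module. The sketch below builds a small in-memory archive with placeholder file names so that it runs on its own; the helper names are hypothetical.

```python
import io
import tarfile


def list_archive_members(fileobj):
    """Return the sorted names of the regular files in a .tar.xz archive."""
    with tarfile.open(fileobj=fileobj, mode="r:xz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


def build_demo_archive(files):
    """Create an in-memory .tar.xz archive from a {name: text} mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:xz") as archive:
        for name, content in files.items():
            data = content.encode()
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


# Placeholder members (hypothetical plant names).
demo = build_demo_archive({"Graph_plant_b.txt": "x", "Graph_plant_a.txt": "y"})
members = list_archive_members(demo)
```

For a file on disk, `tarfile.open(path, mode="r:xz")` accepts the path directly.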

Graph_<plant_name>.txt format#

NODE line#

NODE   node_name   contig_name   multiplicity   X_coord   Y_coord

node_name name of the node; in the graph, a contig is represented by 2 nodes: contig_name_L and contig_name_R
contig_name name of the contig
multiplicity number of occurrences of the contig
- s = starter contig
- 1 = otherwise
X_coord X coordinate used for the graphic representation
Y_coord Y coordinate used for the graphic representation

LINK line#

LINK  node_name_1  node_name_2  type   ident   color   sequence

node_name_* the 2 nodes that are linked
type
- intra: the 2 nodes belong to the same contig
- inter: the 2 nodes come from different contigs
ident identifier used to display graph
- intra = contig_name
- inter = link length
color color of the link
- intra = reflects the coverage
- inter = grey
sequence ATGC string of the link (* means link of size 0)

CONTIG line#

CONTIG  name  coverage  mapping_length  sequence

name name of the contig
coverage average coverage of the contig
mapping_length length of the disjoint union of the mapping alignments of the protein (nucleotide) sequences against the contig sequence
sequence ATGC string of the contig
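
The three line types described above can be parsed with a small reader. This is a sketch under assumptions: fields are taken to be whitespace-separated with no embedded spaces, coordinates and coverage are read as floats, and mapping_length as an integer. The tuple and function names are illustrative.

```python
from collections import namedtuple

Node = namedtuple("Node", "name contig multiplicity x y")
Link = namedtuple("Link", "node_1 node_2 type ident color sequence")
Contig = namedtuple("Contig", "name coverage mapping_length sequence")


def parse_graph(lines):
    """Parse NODE, LINK and CONTIG lines into three lists of tuples."""
    nodes, links, contigs = [], [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        tag, values = fields[0], fields[1:]
        if tag == "NODE":
            name, contig, mult, x, y = values
            # multiplicity is kept as text because it may be 's' (starter).
            nodes.append(Node(name, contig, mult, float(x), float(y)))
        elif tag == "LINK":
            n1, n2, kind, ident, color, seq = values
            # '*' denotes a link of size 0, stored here as an empty string.
            links.append(Link(n1, n2, kind, ident, color, "" if seq == "*" else seq))
        elif tag == "CONTIG":
            name, cov, map_len, seq = values
            contigs.append(Contig(name, float(cov), int(map_len), seq))
    return nodes, links, contigs


# Placeholder lines (hypothetical values, not taken from a real instance).
demo = [
    "NODE C0_L C0 s 0.0 1.5",
    "NODE C0_R C0 s 2.0 1.5",
    "LINK C0_L C0_R intra C0 red ATGC",
    "LINK C0_R C1_L inter 0 grey *",
    "CONTIG C0 35.2 120 ATGC",
]
nodes, links, contigs = parse_graph(demo)
```

Lines with any other leading tag are ignored, so the reader tolerates comments or future record types.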

8. Retrieve the wanted solutions#

The oriented contig orders and the oriented region orders were retrieved by hand. For each instance, the contigs were mapped to the associated reference chloroplast genome.

Each contig order and region order thus represents only one form of a chloroplast genome.

The solutions are in the file data/khloraa_results/solution_orders.tar.xz.

9. Format khloraa output to khloraascaf input#

khloraa outputs were formatted with the script scripts/khloraa_results_to_khloraascaf_inputs.py:

# Set up the Python 3.9 virtual environment:

./config/install_venv39.sh
source .venv_39/bin/activate

# in .venv_39:

python3 scripts/khloraa_results_to_khloraascaf_inputs.py

This creates:

data/khloraa_results/contigs_links_fasta.tar.xz: FASTA sequences of the contigs and the links
A data/v1 directory where:
- contigs_links.tar.xz contains the khloraascaf input data for each instance
- metadata.json describes the instances in the compressed file and gives the starting contig for each

Contig attributes computation#

The multiplicity and the existence probability of each contig were calculated with the following functions of the khloraascaf_utils Python package:

khloraascaf_utils.contigs_attributes.cov_to_mult;
khloraascaf_utils.contigs_attributes.dis_union_align_len_to_presence_score.