Synthetic data generation

This README describes how the files listed in section 7 were obtained from simulated data and database knowledge.

1. Get a list of plant names with an available chloroplast genome from CpGDB (2023-01-04)

CpGDB: A Comprehensive Database of Chloroplast Genomes

2. Convert PDF file to CSV file

Use an online converter:

  • get TableS1.xlsx

  • Open TableS1.xlsx in Excel

    • save it as a .csv file (separator = ';')

  • get TableS1.csv
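Because the file is saved with ';' as separator, it cannot be read with the default comma settings. A minimal sketch of loading it in Python follows; the helper name `read_table` and the `utf-8-sig` encoding are assumptions, and the column names depend on the header row that the converter produced.

```python
import csv
from pathlib import Path


def read_table(csv_path):
    """Read a semicolon-separated CSV file into a list of dict rows.

    Hypothetical helper: the keys of each row are whatever the first
    line of the file contains.
    """
    # "utf-8-sig" is an assumption: it reads plain UTF-8 and also drops
    # the byte-order mark that some spreadsheet exports prepend.
    with Path(csv_path).open(newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle, delimiter=";")
        return list(reader)
```

Calling `read_table("TableS1.csv")` would return one dictionary per table row.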

3. Randomly select plant names

Choose 200 chloroplast genomes from different plant families among the 3823 chloroplast genomes available in CpGDB (2023-01-04)

  • output ident_genome.txt
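This README does not say how the random selection was done. The sketch below shows one possible reading, where at most one genome is drawn per family. The function name `pick_one_per_family` and the `family` and `accession` column names are hypothetical.

```python
import random
from collections import defaultdict


def pick_one_per_family(rows, n, seed=None,
                        family_key="family", id_key="accession"):
    """Draw n identifiers at random, each from a different family.

    rows is a list of dicts; family_key and id_key are hypothetical
    column names, to be adapted to the real table header.
    """
    rng = random.Random(seed)
    by_family = defaultdict(list)
    for row in rows:
        by_family[row[family_key]].append(row[id_key])
    if n > len(by_family):
        raise ValueError(f"only {len(by_family)} families for {n} draws")
    # Sorting the family names makes the draw reproducible for a given seed.
    families = rng.sample(sorted(by_family), n)
    return [rng.choice(by_family[family]) for family in families]
```

Writing the returned identifiers one per line would give a file shaped like ident_genome.txt.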

4. Download chloroplast genomes



  1. click: Retrieve

    • menu

      Received lines: 200
      Rejected lines: 0
      Removed duplicates: 0
      Passed to Entrez: 200
      Retrieve records for 200 UID(s)
  2. click: Retrieve records for 200 UID(s)

    • Send to

    • Choose Destination: File

    • Format: FASTA

    • Create file

  3. get sequence.fasta (in the Downloads directory)

    • rename it to genomes.fasta

5. Split genomes.fasta

  • get results.csv (to prepare results Excel file)

  • get 200 FASTA files (one for each chloroplast genome) in the db/ directory
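The tool used for the split is not named here. As an illustration only, a multi-FASTA file can be split with a few lines of Python. The function `split_fasta` and the output naming scheme (first word of each header line) are assumptions, and this sketch does not produce results.csv.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file.

    Returns the list of written paths. Each file is named after the
    first whitespace-delimited token of the record header (assumption).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    try:
        with Path(fasta_path).open(encoding="utf-8") as handle:
            for line in handle:
                if line.startswith(">"):
                    # A header line starts a new record: switch output file.
                    if out is not None:
                        out.close()
                    tokens = line[1:].split()
                    stem = tokens[0] if tokens else f"record_{len(written) + 1}"
                    path = out_dir / f"{stem}.fasta"
                    out = path.open("w", encoding="utf-8")
                    written.append(path)
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

For example, `split_fasta("genomes.fasta", "db")` would fill db/ with one file per record.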

6. Select interesting genomes

  1. Run (manually) khloraa on the 200 genome datasets

  2. Select (manually) the genomes that present some difficulties

  3. get khloraa_results.xlsx (selection of 31 chloroplast genomes)

  4. Generate Graph_<plant_name>.txt files

7. Gather useful information

Copy to the same directory:

  • Khloraa_instance_id.tar.xz archive:

    • Khloraa__<plant_name>.txt: khloraa parameters to generate contigs and links (31 files)

  • Genome_instance_id.tar.xz archive:

    • Genome_<plant_name>.fa: genome in FASTA format (31 files)

  • Graph_instance_id.tar.xz archive:

    • Graph_<plant_name>.txt: contig graph (31 files)

  • khloraa_results.odt: information on selected genomes

  • this file
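The command used to build the .tar.xz archives is not given here. A minimal sketch with Python's standard `tarfile` module could look as follows; the helper name `pack_xz` is hypothetical.

```python
import tarfile
from pathlib import Path


def pack_xz(archive_path, files):
    """Pack the given files into an xz-compressed tar archive.

    Files are stored flat, under their base name only, and in sorted
    order. Returns the list of member names.
    """
    paths = sorted(Path(p) for p in files)
    with tarfile.open(archive_path, "w:xz") as archive:
        for path in paths:
            archive.add(path, arcname=path.name)
    return [path.name for path in paths]
```

For example, `pack_xz("Graph_instance_id.tar.xz", Path(".").glob("Graph_*.txt"))` would gather the graph files of the current directory.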

Graph_<plant_name>.txt format

NODE line

NODE   node_name   contig_name   multiplicity   X_coord   Y_coord
  • node_name: name of the node; in the graph, a contig is represented by 2 nodes, contig_name_L and contig_name_R

  • contig_name: name of the contig

  • multiplicity: number of occurrences of the contig

    • s = starter contig

    • 1 = otherwise

  • X_coord: X coordinate used for the graphical representation

  • Y_coord: Y coordinate used for the graphical representation
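The NODE fields above can be read with a short parser. This is a sketch that assumes whitespace-separated fields, names without spaces, and numeric coordinates; the `Node` type and `parse_node_line` function are hypothetical names. The multiplicity is kept as text because it can be the letter s.

```python
from typing import NamedTuple


class Node(NamedTuple):
    node_name: str
    contig_name: str
    multiplicity: str  # "s" for the starter contig, "1" otherwise
    x_coord: float
    y_coord: float


def parse_node_line(line):
    """Parse one NODE line into a Node tuple.

    Assumes six whitespace-separated fields, the first being "NODE".
    """
    fields = line.split()
    if len(fields) != 6 or fields[0] != "NODE":
        raise ValueError(f"not a NODE line: {line!r}")
    _, node_name, contig_name, multiplicity, x_coord, y_coord = fields
    return Node(node_name, contig_name, multiplicity,
                float(x_coord), float(y_coord))
```

With a made-up line such as `NODE C1_L C1 s 10.5 20`, the parser returns a `Node` whose `multiplicity` is `"s"`.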

CONTIG line

CONTIG  name  coverage  mapping_length  sequence
  • name: name of the contig

  • coverage: average coverage of the contig

  • mapping_length: length of the disjoint union of the alignments obtained by mapping protein (nucleotide) sequences against the contig's sequence

  • sequence: nucleotide string (A, T, G, C) of the contig
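One possible reading of mapping_length is the number of contig positions covered by at least one alignment, so that overlapping alignments are not counted twice. The sketch below illustrates that reading only: it is not the code used to produce the files, and both the function `union_length` and the half-open interval convention are assumptions.

```python
def union_length(intervals):
    """Number of positions covered by a set of half-open intervals.

    Each interval is a (start, end) pair covering start, ..., end - 1.
    Overlapping or touching intervals are merged before counting.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the current merged block.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap or contact: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total
```

For alignments on positions [0, 10), [5, 15) and [20, 30), the first two merge into [0, 15), which gives 15 + 10 = 25 covered positions.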

8. Retrieve the wanted solutions

The oriented contig orders and the oriented region orders were retrieved by hand. For each instance, the contigs were mapped to the associated reference chloroplast genome.

Each contig order and each region order thus represents only one form of the chloroplast genome.

The solutions are in the file data/khloraa_results/solution_orders.tar.xz.

9. Format khloraa output to khloraascaf input

khloraa outputs were formatted with the script scripts/

# Set the Python3.9 virtual environment:

source .venv_39/bin/activate

# in .venv_39:

python3 scripts/

This creates:

  • data/khloraa_results/contigs_links_fasta.tar.xz: FASTA sequences of the contigs and the links

  • A data/v1 directory where:

    • contigs_links.tar.xz contains the khloraascaf input data for each instance

    • metadata.json describes the instances in the compressed file and gives the starting contig of each

Contig attributes computation

The multiplicity and the existence probability of each contig were calculated with the following functions of the khloraascaf_utils Python package:

  • khloraascaf_utils.contigs_attributes.cov_to_mult;

  • khloraascaf_utils.contigs_attributes.dis_union_align_len_to_presence_score.