Synthetic data generation#
This readme describes how the files in section 7 are obtained from simulated data and database knowledge.
1. Get a list of plant names with an available chloroplast genome from CpGDB (2023-01-04)#
CpGDB: A Comprehensive Database of Chloroplast Genomes http://www.gndu.ac.in/CpGDB/index.aspx
Download Supplementary Table S1 (List of plant species present in CpGDB) from http://www.gndu.ac.in/CpGDB/SupplementaryData.aspx
get SupplementaryTableS1.pdf
2. Convert PDF file to CSV file#
Use the online converter https://document.online-convert.com/fr/convertir/pdf-en-xlsx
get TableS1.xlsx
Open TableS1.xlsx in Excel and save it as a .csv file (separator = ';')
get TableS1.csv
3. Randomly select plant names#
Choose 200 chloroplast genomes from different families of plants stored in CpGDB among the 3823 chloroplast genomes available (2023-01-04)
python selectPlantNames.py
output: ident_genome.txt
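The selection step could look roughly like the sketch below. It is not the content of selectPlantNames.py: the column names `family` and `accession`, the fixed seed, and the "at most one genome per family" reading of "different families" are all assumptions made for illustration, and the example rows are made up.

```python
import csv
import io
import random


def select_accessions(csv_text, n, seed=0):
    """Pick n accessions at random, at most one per plant family."""
    # The table is assumed to use ';' as separator, as in step 2.
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    by_family = {}
    for row in reader:
        # 'family' and 'accession' are hypothetical column names.
        by_family.setdefault(row["family"], []).append(row["accession"])
    if n > len(by_family):
        raise ValueError("fewer families than requested genomes")
    rng = random.Random(seed)
    # Draw n distinct families, then one genome inside each of them.
    families = rng.sample(sorted(by_family), n)
    return [rng.choice(by_family[family]) for family in families]


# Tiny made-up table; the real TableS1.csv columns may differ.
EXAMPLE = """family;species;accession
FamA;sp1;ACC0001
FamA;sp2;ACC0002
FamB;sp3;ACC0003
FamC;sp4;ACC0004
"""
print("\n".join(select_accessions(EXAMPLE, 2)))
```

Writing one identifier per line, as printed here, matches the kind of file that Batch Entrez accepts in the next step.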
4. Download chloroplast genomes#
Material#
website: NCBI Batch Entrez https://www.ncbi.nlm.nih.gov/sites/batchentrez
Database: nucleotide
File: ident_genome.txt
Method#
click: Retrieve
menu: Received lines: 200 Rejected lines: 0 Removed duplicates: 0 Passed to Entrez: 200 Retrieve records for 200 UID(s)
click: Retrieve records for 200 UID(s)
Send to
Choose Destination: File
Format: FASTA
Create file
get sequence.fasta (in the download directory)
move it to genomes.fasta
5. Split genomes.fasta#
python split_genome.py
get results.csv (to prepare the results Excel file)
get 200 FASTA files (one for each chloroplast genome) in the db/ directory
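As an illustration of the splitting step, here is a minimal sketch that writes each record of a multi-FASTA file to its own file. It is not the actual split_genome.py: the function name `split_fasta`, the output naming scheme (first word of the header line plus `.fasta`) and the example identifiers are assumptions, and the sketch does not produce results.csv.

```python
from pathlib import Path
import tempfile


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # A header line starts a new record: close the previous file.
                if handle is not None:
                    handle.close()
                # Assumed naming: first word of the header line.
                name = line[1:].split()[0]
                path = out_dir / f"{name}.fasta"
                handle = open(path, "w")
                written.append(path)
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return written


# Usage on a made-up two-record file.
with tempfile.TemporaryDirectory() as tmp:
    multi = Path(tmp) / "genomes.fasta"
    multi.write_text(">ID_1 first genome\nACGT\nTTGA\n>ID_2 second genome\nGGCC\n")
    files = split_fasta(multi, Path(tmp) / "db")
    print([p.name for p in files])
```

Streaming line by line keeps memory use low, which matters little for 200 chloroplast genomes but costs nothing.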
6. Select interesting genomes#
Run (manually) khloraa on the 200 genome datasets
Select (manually) genomes which present some difficulties
get khloraa_results.xlsx (selection of 31 chloroplast genomes)
Generate Graph_<plant_name>.txt files
7. Gather useful information#
Copy to the same directory:
Khloraa_instance_id.tar.xz archive:
  Khloraa__<plant_name>.txt: khloraa parameters to generate contigs and links (31 files)
Genome_instance_id.tar.xz archive:
  Genome_<plant_name>.fa: genome in FASTA format (31 files)
Graph_instance_id.tar.xz archive:
  Graph_<plant_name>.txt: contig graph (31 files)
khloraa_results.odt: information on selected genomes
readme.md: this file
Graph_<plant_name>.txt format#
NODE line#
NODE node_name contig_name multiplicity X_coord Y_coord
node_name: in the graph, a contig is represented by 2 nodes: contig_name_L and contig_name_R
contig_name: name of the contig
multiplicity: number of occurrences of the contig
  s = starter contig
  1 = else
X_coord: X coordinate used for the graphic representation
Y_coord: Y coordinate used for the graphic representation
LINK line#
LINK node_name_1 node_name_2 type ident color sequence
node_name_*: the 2 nodes that are linked
type:
  intra: the 2 nodes belong to the same contig
  inter: the 2 nodes come from different contigs
ident: identifier used to display the graph
  intra = contig_name
  inter = link length
color: color of the link
  intra = reflects the coverage
  inter = grey
sequence: ATGC string of the link (* means link of size 0)
CONTIG line#
CONTIG name coverage mapping_length sequence
name: name of the contig
coverage: average coverage of the contig
mapping_length: length of the disjoint union of the mapping alignments of (nucleotide) protein sequences against the contig’s sequence
sequence: ATGC string of the contig
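The three line types described above can be read with a small parser such as the sketch below. It assumes that fields are separated by whitespace, that each field is a single token, and that coverage and mapping_length are numeric; none of this is stated in the format description, and the example lines (names, color, values) are made up.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    contig: str
    multiplicity: str  # kept as text, since the value may not be a number
    x: float
    y: float


class Link(NamedTuple):
    node_1: str
    node_2: str
    link_type: str  # 'intra' or 'inter'
    ident: str
    color: str
    sequence: str  # empty string when the file gives '*'


class Contig(NamedTuple):
    name: str
    coverage: float
    mapping_length: int
    sequence: str


def parse_graph(lines):
    """Split Graph_<plant_name>.txt lines into nodes, links and contigs."""
    nodes, links, contigs = [], [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        kind, values = fields[0], fields[1:]
        if kind == "NODE":
            name, contig, mult, x, y = values
            nodes.append(Node(name, contig, mult, float(x), float(y)))
        elif kind == "LINK":
            n1, n2, link_type, ident, color, seq = values
            # '*' marks a link of size 0.
            seq = "" if seq == "*" else seq
            links.append(Link(n1, n2, link_type, ident, color, seq))
        elif kind == "CONTIG":
            name, cov, map_len, seq = values
            contigs.append(Contig(name, float(cov), int(map_len), seq))
    return nodes, links, contigs


# Made-up example lines following the documented field order.
EXAMPLE = """\
NODE c1_L c1 1 0.0 0.0
NODE c1_R c1 1 1.0 0.0
NODE c2_L c2 1 2.0 0.0
LINK c1_L c1_R intra c1 red *
LINK c1_R c2_L inter 3 grey ATG
CONTIG c1 35.2 120 ACGTACGT
"""
nodes, links, contigs = parse_graph(EXAMPLE.splitlines())
print(len(nodes), len(links), len(contigs))
```

Lines that start with any other keyword are ignored, so the sketch tolerates extra line types it does not know about.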
8. Retrieve the wanted solutions#
The oriented contig orders and the oriented region orders were retrieved by hand. For each instance, the contigs were mapped to the associated reference chloroplast genome.
Each contig order and region order thus represents only one form of a chloroplast genome.
The solutions are in the file data/khloraa_results/solution_orders.tar.xz.
9. Format khloraa output to khloraascaf input#
khloraa outputs were formatted with the script scripts/khloraa_results_to_khloraascaf_inputs.py:
# Set the Python3.9 virtual environment:
./config/install_venv39.sh
source .venv_39/bin/activate
# in .venv_39:
python3 scripts/khloraa_results_to_khloraascaf_inputs.py
This creates:
data/khloraa_results/contigs_links_fasta.tar.xz: FASTA sequences of the contigs and the links
A data/v1 directory where:
  contigs_links.tar.xz contains the khloraascaf input data for each instance
  metadata.json describes the instances in the compressed file and gives the starting contig for each
Contig attributes computation#
The multiplicity and the existence probability for each contig were calculated with the functions in the khloraascaf_utils Python package:
khloraascaf_utils.contigs_attributes.cov_to_mult;
khloraascaf_utils.contigs_attributes.dis_union_align_len_to_presence_score.