Synthetic data generation
This readme describes how the files in section 7 are obtained from simulated data and database knowledge.
1. Get a list of plant names with an available chloroplast genome from CpGDB (2023-01-04)
CpGDB: A Comprehensive Database of Chloroplast Genomes http://www.gndu.ac.in/CpGDB/index.aspx
Download Supplementary Table S1: List of plant species present in CpGDB
get SupplementaryTableS1.pdf from http://www.gndu.ac.in/CpGDB/SupplementaryData.aspx
2. Convert PDF file to CSV file
Use the online converter https://document.online-convert.com/fr/convertir/pdf-en-xlsx
get TableS1.xlsx
Open TableS1.xlsx in Excel
save as .csv file (separator = ';')
get TableS1.csv
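As a quick check that the export worked, the semicolon-separated file can be read back with Python's csv module. This is only a minimal sketch: the helper name read_table and the column names in the sample are hypothetical, since the actual columns of TableS1.csv are not described in this readme.

```python
import csv
import io


def read_table(handle, delimiter=";"):
    """Return the rows of a delimited table as a list of dicts keyed by header."""
    return list(csv.DictReader(handle, delimiter=delimiter))


# Illustrative content only: the real TableS1.csv columns may differ.
sample = io.StringIO("Species;Family;Accession\nZea mays;Poaceae;ID_0001\n")
rows = read_table(sample)
print(rows[0]["Family"])
```

For the real file, the same function could be called on a handle opened with open("TableS1.csv", newline="").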
3. Randomly select plant names
Choose 200 chloroplast genomes from different families of plants stored in CpGDB among the 3823 chloroplast genomes available (2023-01-04)
python selectPlantNames.py
output: ident_genome.txt
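selectPlantNames.py itself is not reproduced in this readme. The sketch below shows one possible way to draw genomes from different families, assuming each record is an (identifier, family) pair and that at most one genome is kept per family. The function name, the fixed seed and the one-per-family rule are assumptions made for illustration, not the confirmed behaviour of the script.

```python
import random


def select_genomes(records, n_select, seed=0):
    """Pick n_select identifiers, at most one per plant family.

    records: iterable of (identifier, family) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_family = {}
    for identifier, family in records:
        by_family.setdefault(family, []).append(identifier)
    if n_select > len(by_family):
        raise ValueError("not enough distinct families")
    # Draw the families first, then one genome inside each drawn family.
    families = rng.sample(sorted(by_family), n_select)
    return [rng.choice(by_family[family]) for family in families]


# Toy input with placeholder identifiers.
toy = [
    ("id1", "Poaceae"),
    ("id2", "Poaceae"),
    ("id3", "Fabaceae"),
    ("id4", "Rosaceae"),
]
picked = select_genomes(toy, 2)
print(picked)
```

Writing one identifier per line from the returned list would give a file in the shape Batch Entrez expects.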
4. Download chloroplast genomes
Material
website: NCBI Batch Entrez https://www.ncbi.nlm.nih.gov/sites/batchentrez
Database: nucleotide
File: ident_genome.txt
Method
click: Retrieve menu
Received lines: 200, Rejected lines: 0, Removed duplicates: 0, Passed to Entrez: 200, Retrieve records for 200 UID(s)
click: Retrieve records for 200 UID(s)
click: Send to
Choose Destination: File
Format: FASTA
click: Create file
get sequence.fasta (in the download directory)
move to genomes.fasta
5. Split genomes.fasta
python split_genome.py
get results.csv (to prepare the results Excel file)
get 200 FASTA files (one for each chloroplast genome) in the db/ directory
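split_genome.py is not shown here. As an illustration of the splitting step only, the following sketch writes each record of a multi-FASTA file to its own file. The function name, the output file naming (first token of the header plus ".fasta") and the demonstration records are assumptions; the real script also produces results.csv, which this sketch does not attempt.

```python
import os
import tempfile


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file in out_dir.

    Files are named after the first whitespace-delimited token of the header.
    Returns the list of record identifiers, in file order.
    """
    os.makedirs(out_dir, exist_ok=True)
    identifiers = []
    out = None
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # A new header closes the previous record and opens a new file.
                if out is not None:
                    out.close()
                identifier = line[1:].split()[0]
                identifiers.append(identifier)
                out = open(os.path.join(out_dir, identifier + ".fasta"), "w")
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return identifiers


# Small self-contained demonstration with placeholder records.
workdir = tempfile.mkdtemp()
demo = os.path.join(workdir, "genomes.fasta")
with open(demo, "w") as handle:
    handle.write(">seqA first record\nACGT\nTTGA\n>seqB second record\nGGCC\n")
ids = split_fasta(demo, os.path.join(workdir, "db"))
print(ids)
```

Comparing the number of returned identifiers with the 200 lines of ident_genome.txt is a cheap way to confirm that no record was lost during the download.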
6. Select interesting genomes
Run (manually) khloraa on the 200 genome datasets
Select (manually) the genomes that present some difficulties
get khloraa_results.xlsx (selection of 31 chloroplast genomes)
Generate Graph_<plant_name>.txt files
7. Gather useful information
Copy to the same directory:
Khloraa_instance_id.tar.xz archive: Khloraa__<plant_name>.txt, khloraa parameters to generate contigs and links (31 files)
Genome_instance_id.tar.xz archive: Genome_<plant_name>.fa, genome in FASTA format (31 files)
Graph_instance_id.tar.xz archive: Graph_<plant_name>.txt, contig graph (31 files)
khloraa_results.odt: information on selected genomes
readme.md: this file
Graph_<plant_name>.txt format
NODE line
NODE node_name contig_name multiplicity X_coord Y_coord
node_name: in the graph, a contig is represented by 2 nodes, contig_name_L and contig_name_R
contig_name: name of the contig
multiplicity: number of occurrences of the contig (s = starter contig, 1 = otherwise)
X_coord: X coordinate used for the graphic representation
Y_coord: Y coordinate used for the graphic representation
LINK line
LINK node_name_1 node_name_2 type ident color sequence
node_name_*: the 2 nodes that are linked
type: intra if the 2 nodes belong to the same contig, inter if the 2 nodes come from different contigs
ident: identifier used to display the graph (intra = contig_name, inter = link length)
color: color of the link (intra = reflects the coverage, inter = grey)
sequence: ATGC string of the link (* means a link of size 0)
CONTIG line
CONTIG name coverage mapping_length sequence
name: name of the contig
coverage: average coverage of the contig
mapping_length: length of the disjoint union of the mapping alignments of (nucleotide) protein sequences against the contig's sequence
sequence: ATGC string of the contig
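The three line types described above can be read with a small parser. This is a sketch under assumptions: fields are taken to be separated by whitespace, all values are kept as strings because the numeric types are not specified here, and the function name parse_graph and the example lines (including the color value) are invented for illustration.

```python
def parse_graph(lines):
    """Parse NODE, LINK and CONTIG lines into three lists of dicts."""
    fields = {
        "NODE": ["node_name", "contig_name", "multiplicity", "x_coord", "y_coord"],
        "LINK": ["node_name_1", "node_name_2", "type", "ident", "color", "sequence"],
        "CONTIG": ["name", "coverage", "mapping_length", "sequence"],
    }
    graph = {"NODE": [], "LINK": [], "CONTIG": []}
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] not in fields:
            continue  # skip blank or unrecognised lines
        names = fields[tokens[0]]
        values = tokens[1:]
        if len(values) != len(names):
            raise ValueError(f"unexpected field count: {line!r}")
        graph[tokens[0]].append(dict(zip(names, values)))
    return graph


# Made-up lines that follow the documented field order.
example = [
    "NODE c1_L c1 s 0 0",
    "NODE c1_R c1 s 10 0",
    "LINK c1_L c1_R intra c1 blue ACGT",
    "CONTIG c1 35.2 120 ACGT",
]
graph = parse_graph(example)
print(len(graph["NODE"]), graph["LINK"][0]["type"])
```

Raising on an unexpected field count, rather than silently truncating, makes format drift between khloraa versions visible early.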
8. Retrieve the wanted solutions
The oriented contig orders and the oriented region orders were retrieved by hand. For each instance, the contigs were mapped to the associated reference chloroplast genome.
Each contig order and each region order thus represents only one form of a chloroplast genome.
The solutions are in the file data/khloraa_results/solution_orders.tar.xz.
9. Format khloraa output to khloraascaf input¶
khloraa outputs were formatted with the script scripts/khloraa_results_to_khloraascaf_inputs.py:
# Set the Python3.9 virtual environment:
./config/install_venv39.sh
source .venv_39/bin/activate
# in .venv_39:
python3 scripts/khloraa_results_to_khloraascaf_inputs.py
This creates:
data/khloraa_results/contigs_links_fasta.tar.xz: FASTA sequences of the contigs and the links
A data/v1 directory where:
contigs_links.tar.xz contains the khloraascaf input data for each instance
metadata.json describes the instances in the compressed file and gives the starting contig for each
Contig attributes computation
The multiplicity and the existence probability of each contig were calculated with the following functions of the khloraascaf_utils Python package:
khloraascaf_utils.contigs_attributes.cov_to_mult
khloraascaf_utils.contigs_attributes.dis_union_align_len_to_presence_score