SmartGene BiomeScan™ Report

This webpage provides details, explanations, and definitions of the terms and interpretations displayed in the different sections of the BiomeScan™ report.

In case of questions not answered here, we invite you to contact SmartGene Support Europe or SmartGene Support North America.

Disclaimer

The SmartGene BiomeScan™ Report («the Report») in PDF or in electronic format provides an overview of the composition of a microbial population based on the sequence data submitted and an interpretation of the results for various specimens or sources. The assessment of the relevance of such interpretation for a given specimen always needs to be performed and validated by a trained microbiologist at the laboratory which has executed the analysis and generated the Report, and the result must be considered in the context of the specimen and the source. Results and interpretations displayed on the Report greatly depend on the quality of data submitted for the analysis which has been generated by the laboratory. SmartGene gives no assurance on the validity or relevance of the findings and interpretations displayed on the Report. In addition, SmartGene is not responsible for any interpretations, interpretational rules or conclusions on the Report which are entered by the user or by user's organization.

Important Note: the absence of a given microorganism on the Report can be due to technical or other reasons and does not exclude the possibility of its presence in the original specimen; the presence of potentially harmful microorganisms may be suggested by the Report when ambiguous results are obtained. In both cases, a trained microbiologist of the laboratory generating the Report must validate the result and provide guidance when advisable to the recipient of the Report. Reference populations, values and interpretations are only indicative of possible conditions of wellness or disease.

The BiomeScan™ Report is not intended for diagnostic purposes but for research use only.

Distribution of microbial flora

  • Microbial flora, or microflora: defined in microbiology as the collective microorganisms in an analyzed sample as detected by the chosen analysis method.

  • Taxon: a taxon (back-formation from taxonomy; plural taxa) is a group of one or more populations of an organism or organisms which are considered by taxonomists to form a unit. Although neither is required, a taxon is usually known by a particular name and given a particular ranking, especially if and when it is accepted or becomes established (https://en.wikipedia.org/wiki/Taxon). Taxonomic ranks displayed and consolidated in BiomeScan™ are species, genus, family, order, class, phylum, kingdom. When selecting a higher taxonomic rank in the BiomeScan™ App, all counts belonging to a lower rank are consolidated accordingly.

  • OTU: Operational Taxonomic Unit, i.e. the basic unit used in numerical taxonomy, used to classify groups of closely related biological entities; closely related here refers to sequence similarity. An OTU may refer to an individual species, genus, or class. BiomeScan™ generates its OTUs not only by the level (%) of sequence similarity but also by matching a specific taxon. Consolidated OTUs are the number of OTUs defined at a specified taxonomic rank.
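The consolidation of counts at a higher taxonomic rank, as described above, can be sketched as follows. This is only an illustration, not SmartGene's implementation: the function name `consolidate`, the tuple-based lineage layout, and the read counts are all assumptions made for the example.

```python
from collections import Counter

# Ranks in the order used for the hypothetical lineage tuples below.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def consolidate(read_counts, rank):
    """Sum the read counts of all lineages that share the same taxon at `rank`."""
    idx = RANKS.index(rank)
    totals = Counter()
    for lineage, n_reads in read_counts.items():
        totals[lineage[idx]] += n_reads
    return dict(totals)


# Illustrative lineages and read counts (made up for this example).
counts = {
    ("Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales",
     "Bacteroidaceae", "Bacteroides", "Bacteroides fragilis"): 120,
    ("Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales",
     "Bacteroidaceae", "Bacteroides", "Bacteroides ovatus"): 80,
    ("Bacteria", "Firmicutes", "Clostridia", "Clostridiales",
     "Ruminococcaceae", "Faecalibacterium", "Faecalibacterium prausnitzii"): 200,
}

print(consolidate(counts, "genus"))
print(consolidate(counts, "phylum"))
```

Selecting «genus» merges the two Bacteroides species into one entry; selecting «phylum» merges everything below each phylum.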

Alpha diversity

Alpha diversity, reported here as the Shannon-Wiener Index, is a measure of the diversity of species within a sample. Whereas this index is often computed on the basis of OTUs regardless of their assignment to a certain species, SmartGene recommends using the index based on the species rank, in order to minimize the influence of sequencing, assembly, or matching artifacts.

Alpha diversity = -Σ p_i · ln(p_i), where p_i = proportion of each unit i in the analyzed sample and the sum runs over all units.
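As a worked illustration of the formula, the following minimal sketch computes the index from a list of read counts per unit (e.g. per species). The function name `shannon_index` and the example counts are assumptions for this illustration; units with zero reads are skipped because they contribute nothing to the sum.

```python
import math


def shannon_index(counts):
    """Shannon-Wiener index: -sum(p_i * ln(p_i)) over units with p_i > 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


# Four equally abundant units give the maximum for four units, ln(4).
print(shannon_index([25, 25, 25, 25]))
# A sample dominated by one unit gives a lower value.
print(shannon_index([97, 1, 1, 1]))
```

For a fixed number of units, the index is highest when all units are equally abundant and falls toward 0 as a single unit dominates.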

In many contexts of wellness or disease, a more diverse flora, with more metabolic functions available, is considered advantageous for health (1).

Microbial Composition: Definitions and Explanations for the Table of Results

  • Reads [%] is the measure of relative abundance: relative number of reads assigned to a specific taxon at the selected rank in a sample; the total abundance of all taxa in a sample is 100%.

  • Reads [N]: absolute number of reads assigned to a specific taxon at the selected rank.

  • Read length: median number of base pairs (bp) of the reads or contigs (for merged reads) assigned to a specific taxon at the selected rank.

  • Match length: median length of the matches in bp of the reads or contigs (for merged reads) to a reference sequence for the specific taxon at the selected rank.

  • Mismatches: median number of mismatches for the reads or contigs (for merged reads) versus the Centroid reference sequence of the closest taxon at the selected rank.

  • Taxon level: taxonomic rank to which the reads or contigs could be assigned. Selection of «species» as rank will indicate the most granular taxonomic resolution possible.

  • Attribution: best possible attribution to a taxon, dependent on the rank selected.

  • Species found: display of all possible taxa of Centroids at the lowest rank which match the reads or contigs.

  • Distance: mean relative distance of the reads with regard to the closest Centroid at the taxon rank specified. Distance is a measure of sequence diversity of the reads (measured with a metric including sequence mismatches, gap openings, and alignment length) with the Centroid sequence of the respective species.

  • Match: stretch of a sequence that maps to a sequence of the reference database used.

  • Centroid: A Centroid is the most representative sequence of a group of variant sequences of a taxon group, usually of a species; such Centroids are real sequence entries retrieved from public databases, not artificial consensus sequences. SmartGene updates its reference databases of Centroids periodically in order to reflect recently published entries and up-to-date nomenclature.

  • Consistency, range 0-100: indicates how many times a distinct Centroid has been assigned to the reads or contigs; consistency varies depending on the match to a Centroid.

  • Confidence [%], range 0-100: represents the attribution confidence, i.e. a measure of the probability of correct attribution of a taxon at the specified rank. The measure is directly proportional to the diversity of the set of sequences belonging to the same species which was used to define the Centroid itself: the more sequences used for Centroid inference, the higher the weight. The greater the distance, the lower the confidence.

  • Confidence [%] (Species), range 0-100: SmartGene confidence score for OTU attribution at the lowest possible rank, usually «species». Confidence [%] (Species) is the mean score representing the probability for matching a distinct Centroid, depending on distance and consistency.

  • Lineage(s): complete taxonomic lineage of the closest matching Centroid.

  • Reference Group: a number of microbiome samples that have been assigned to a certain Reference Group which is representative of a certain condition of wellness or disease. Such Reference Groups allow users to compare if a sample is similar to the Reference Group and thus reflects the condition represented.

  • Control population: user-defined reference (from the literature) giving taxon percentages (e.g. phyla distribution).
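To make the first three columns of the table concrete, the sketch below derives Reads [N], Reads [%], and the median Read length from a list of per-read taxon assignments. The function name `summarize_taxa`, the input layout, and the numbers are assumptions for illustration only; they do not describe the BiomeScan™ pipeline itself.

```python
from statistics import median


def summarize_taxa(assignments):
    """assignments: list of (taxon, read_length_bp) tuples, one per read."""
    lengths_by_taxon = {}
    for taxon, length in assignments:
        lengths_by_taxon.setdefault(taxon, []).append(length)
    total = len(assignments)
    return {
        taxon: {
            "reads_n": len(lengths),                    # Reads [N]
            "reads_pct": 100.0 * len(lengths) / total,  # Reads [%]
            "read_length": median(lengths),             # median length in bp
        }
        for taxon, lengths in lengths_by_taxon.items()
    }


# Hypothetical assignments: three reads for one taxon, one for another.
example = [("Taxon A", 250), ("Taxon A", 252), ("Taxon A", 254), ("Taxon B", 300)]
for taxon, row in summarize_taxa(example).items():
    print(taxon, row)
```

By construction, the Reads [%] values of all taxa in a sample add up to 100%.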

Species of interest

The «Species of interest» is a list of taxa preselected by the Customer to represent potentially pathogenic or other interesting taxa in the context of the specimen. These taxa are listed specifically to allow for quick review and easy interpretation. Please contact SmartGene Support to edit or customize your list.

Phyla Distribution

The distribution of phyla across a microbiome changes with environment, age, or disease (2).

The distribution of phyla, e.g. across a stool sample, allows for the calculation of significant ratios such as Firmicutes to Bacteroidetes, which are widely used to assess intestinal homeostasis. An abnormally low or high ratio is considered a dysbiosis. An increased Firmicutes to Bacteroidetes ratio is associated with obesity and weight gain, since Firmicutes contribute to the absorption of fats; it has been suggested that a diet rich in fiber can help restore a more balanced ratio. A decreased Firmicutes to Bacteroidetes ratio, by contrast, is usually observed in inflammatory bowel disease.
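The ratio itself is a simple quotient of the two phylum abundances. The sketch below shows the arithmetic under stated assumptions: the function name and the percentages are hypothetical, and no cut-off values are applied because this page does not define any.

```python
def firmicutes_bacteroidetes_ratio(phylum_percent):
    """Return Firmicutes / Bacteroidetes from a {phylum: Reads [%]} mapping.

    Returns None when no Bacteroidetes reads are present, since the ratio
    is undefined in that case.
    """
    firmicutes = phylum_percent.get("Firmicutes", 0.0)
    bacteroidetes = phylum_percent.get("Bacteroidetes", 0.0)
    if bacteroidetes == 0:
        return None
    return firmicutes / bacteroidetes


# Hypothetical phyla distribution of a stool sample (values in %).
sample = {"Firmicutes": 60.0, "Bacteroidetes": 30.0, "Proteobacteria": 10.0}
print(firmicutes_bacteroidetes_ratio(sample))
```

Whether a given ratio counts as abnormally low or high must be judged against a suitable reference or control population.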

Enterotypes

This feature is specific for bacterial intestinal microbiomes; it provides a classification of the gut microbiome in three profiles, defined by the predominance of the genera Bacteroides, Prevotella, and Ruminococcus (see references 3, 4, 5).

Bacteroides is predominant in the gut microbiota of people eating meat and animal fats.

Prevotella is predominant in people eating fruits and vegetables, and is also associated with a carbohydrate-rich diet.

Ruminococcus is predominant in people consuming vegetable oils and in people drinking alcohol.
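As a deliberately simplified illustration of "predominance", the sketch below picks the most abundant of the three genera in a sample. This is an assumption made for the example only: published enterotyping approaches rely on clustering of many samples, and this page does not state how BiomeScan™ assigns the profile. The function name and the abundance values are hypothetical.

```python
ENTEROTYPE_GENERA = ("Bacteroides", "Prevotella", "Ruminococcus")


def predominant_enterotype_genus(genus_percent):
    """Return the most abundant of the three genera, or None if all are absent.

    Simplification: ties are resolved in the order of ENTEROTYPE_GENERA.
    """
    abundances = {g: genus_percent.get(g, 0.0) for g in ENTEROTYPE_GENERA}
    best = max(ENTEROTYPE_GENERA, key=lambda g: abundances[g])
    return best if abundances[best] > 0 else None


# Hypothetical genus-level relative abundances (values in %).
sample = {"Bacteroides": 30.0, "Prevotella": 5.0, "Ruminococcus": 2.0, "Other": 63.0}
print(predominant_enterotype_genus(sample))
```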

Literature references

(1) Le Chatelier et al., Nature. 2013 Aug 29; 500: 541–546. doi:10.1038/nature12506
(2) Rinninella et al., Microorganisms 2019, 7, 14; doi:10.3390/microorganisms7010014
(3) Arumugam et al., Nature. 2011 May 12; 473: 174–180. doi:10.1038/nature09944
(4) Christensen et al., Am J Clin Nutr 2018; 108: 645–651. doi:10.1093/ajcn/nqy175
(5) Hills et al., Nutrients 2019, 11, 1613; doi:10.3390/nu11071613
 

SmartGene Quality Report

This section provides definitions of sequencing metrics and bioinformatics terms which are displayed in the different sections of the SmartGene Quality Report. In case of questions not answered here, we invite you to contact SmartGene Support Europe or SmartGene Support North America.

The SmartGene Quality Report ("the Report") displays technical information linked to an analyzed file or files. The Report is only available in electronic format and provides an overview of the specifications of the uploaded file and the related analysis. The Report format is standardized across all Apps. It is composed of three sections:

  1. General section, presenting information on the analyzed Sample and the System software (version / pipeline) used to analyze it

  2. Sequencing quality statistics, including a table to report the percentage of sequence reads which exceed specific Q-score (quality) thresholds

  3. Read processing, including a table to report the filtering criteria (directly dependent on the pipeline used for the analysis), and tables for read statistics and read lengths.

Sample

  • Sample ID: unique identifier (user-specified and compulsory) for the given sample; this entry is explicitly attributed to a given uploaded entity at the creation of the Worklist.

  • Source ID: identifier (user-specified and compulsory) for the given source or patient from which the sample was taken; this entry is explicitly or implicitly attributed to a given uploaded entity at the creation of the Worklist. If the Sample ID was automatically assigned (via the "Save with default file ID" button), then the Source ID will be identical to the Sample ID but is editable thereafter; if the Sample ID was manually assigned (with the "+" button), then the Source ID can be entered manually and/or chosen from among existing Source IDs.

  • File: raw input files containing basecalled and unprocessed sequence reads generated on a next-generation sequencer, in an accepted format (SFF, BAM, FASTQ). Files can be compressed or not. The standardized format assures (pre-)processing of the files, regardless of the sequencing protocol.

  • Filename(s): unprocessed name(s) of the uploaded sequencing file(s), as defined by the user before uploading them into the SmartGene System via the UI (User Interface), including the file extension; in case of paired-end file upload, the filenames are separated by a semicolon ";". File size is displayed in MB.

  • MID: molecular identifier for multiplexing. Individual "barcode" (index adapter) sequences are added to each DNA fragment during next-generation sequencing (NGS) library preparation so that each read can be traced back to its original sample and sorted before the final data analysis.

  • Upload date: date on which the entity (one file in case of single-end sequencing or two files in case of paired-end sequencing) was uploaded to the SmartGene System, through the Sequence File Management UI.

  • Uploaded by: username of the user logged in at the moment of entity upload.

  • Analysis date: date on which the uploaded entity was submitted to the Worklist (at "Starting analysis").

  • Analysed by: username of the user logged in at the moment of submission to the Worklist.

System

  • Pipeline name: name of the pipeline used for the analysis.

  • Analysis pipeline version: the version of the selected pipeline which was deployed in the System at the time of the analysis. The versioning of the System’s bioinformatics software is specific to SmartGene and is updated whenever the software code is changed. The version number (e.g. 2.4.4_HIV1_v1.6) contains in fact two distinct versioning systems: the bioinformatics software version (e.g. 2.4.4) and the pipeline version (HIV1_v1.6). Depending on the pipeline type, the version number has a different meaning: it can reflect the profile-based pipeline (version of the profile) or the reference dataset-based pipeline (version of the reference dataset, e.g. centroids for the microbiome pipelines).

  • Software version: version of the SmartGene ASP interface that the user was using when analyzing the sample. The ASP release number is always displayed in the UI (bottom-left in the page footer) and has the following format: v3_X_YrZ, where X is the main version number, Y is the minor version number, and Z is the patch number (if applicable).
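For users who process these version strings in their own scripts, the two formats described above could be split as sketched below. The helper names, the regular expression, and the release string "v3_9_2r1" are assumptions for illustration; in particular, the sketch assumes that X, Y, and Z are plain integers and that the "rZ" suffix is simply absent when no patch applies.

```python
import re


def split_pipeline_version(version):
    """Split e.g. '2.4.4_HIV1_v1.6' into (software version, pipeline version)."""
    software, _, pipeline = version.partition("_")
    return software, pipeline


# Assumed pattern for the ASP release format v3_X_YrZ (patch part optional).
ASP_RELEASE = re.compile(r"^v3_(\d+)_(\d+)(?:r(\d+))?$")


def parse_asp_release(release):
    """Return (main, minor, patch) as integers; patch is None if not present."""
    match = ASP_RELEASE.match(release)
    if match is None:
        raise ValueError(f"unexpected release format: {release!r}")
    main, minor, patch = match.groups()
    return int(main), int(minor), int(patch) if patch is not None else None


print(split_pipeline_version("2.4.4_HIV1_v1.6"))
print(parse_asp_release("v3_9_2r1"))  # hypothetical release number
```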

Sequencing quality statistics

Table Basecalling quality scores (Phred quality scores Q) [%]

Thresholds (Q≥X, where X is a Q-score) are defined in the first column; the percentage of bases at or above the given threshold is reported in the following column(s) (one column in case of single-end files or two in case of paired-end files; the column name corresponds to the filename). The last column displays the percentages for the total bases considering all the files which contribute to the entity (corresponding to the previous column in case of single-end files or the overall percentage in case of paired-end files).

  • Bases Q score: also called Phred score; indicates the quality of base-calling in terms of the estimated probability of an erroneous call. The integer value is defined as Q = -10 · log10(P), or equivalently P = 10^(-Q/10), where P is the estimated probability of an erroneous base call. The average Phred score will vary according to the sequencing technology, as each technology can have different quality thresholds for its reads.
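The relationship between Q-score and error probability, and the percentages reported in the table, can be illustrated with the short sketch below. The function names and the example quality values are assumptions for this illustration only.

```python
import math


def phred_to_error_prob(q):
    """P = 10^(-Q/10): estimated probability that a base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def percent_bases_at_or_above(qualities, threshold):
    """Percentage of bases whose Q-score is >= threshold."""
    if not qualities:
        return 0.0
    passing = sum(1 for q in qualities if q >= threshold)
    return 100.0 * passing / len(qualities)


print(phred_to_error_prob(20))    # about 0.01, i.e. 1 error in 100 calls
print(error_prob_to_phred(0.001))  # about 30
print(percent_bases_at_or_above([10, 20, 30, 40], 20))
```

A Q-score of 20 thus corresponds to an estimated error probability of about 1%, and a Q-score of 30 to about 0.1%.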

Read processing

Table Filtering criteria for read inclusion

The column headers correspond to the filtering criteria which determine inclusion or exclusion of reads for subsequent analysis in the selected SmartGene pipeline. Threshold values are reported in the second row of this table.

  • Minimum read length: value in base pairs; reads whose length is below this value are not included in the analysis.

  • Minimum phred score: value in Q-score; reads whose average Q-score is below this value are not included in the analysis.

  • Sliding window size: the size of the sliding window for quality trimming (value in base pairs) depends on the variability of the gene target sequenced. At each nucleotide position along the read in 5' to 3' direction, an average quality score is determined for the nucleotides within the window. When the average quality score within the window drops below the threshold, all bases from the middle of the window to the 3' end of the read are trimmed off.
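The three criteria above can be sketched as follows. This is an illustration under stated assumptions, not the SmartGene pipeline code: the function names are hypothetical, the "middle of the window" is taken as the integer midpoint (window // 2), and reads shorter than the window are left untrimmed.

```python
def sliding_window_trim(qualities, window, threshold):
    """Return how many bases to keep, counted from the 5' end.

    The window moves one base at a time in 5' to 3' direction. At the first
    window whose mean Q-score is below `threshold`, the read is cut at the
    middle of that window (assumption: integer midpoint).
    """
    n = len(qualities)
    for start in range(0, n - window + 1):
        mean_q = sum(qualities[start:start + window]) / window
        if mean_q < threshold:
            return start + window // 2
    return n  # no window dropped below the threshold


def passes_filters(qualities, min_length, min_mean_q):
    """Apply the minimum read length and minimum average Q-score criteria."""
    if not qualities or len(qualities) < min_length:
        return False
    return sum(qualities) / len(qualities) >= min_mean_q


# Hypothetical read: ten good bases followed by ten poor ones.
quals = [30] * 10 + [10] * 10
keep = sliding_window_trim(quals, window=4, threshold=20)
trimmed = quals[:keep]
print(keep, passes_filters(trimmed, min_length=10, min_mean_q=20))
```

In this example the first window with a mean below 20 starts at position 9 (0-based), so the read is cut after 11 bases and the trimmed read still passes both filters.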

Table Read statistics

This table reports the read statistics; values are the number of reads.

  • Reads: sequences of base pairs, each corresponding to a single DNA fragment. Raw reads are the direct output of sequencing and represent all the sequences created after the imaging process.

  • Total Reads in file(s): total number of reads in the uploaded file(s) contributing to the uploaded entity. In case of single-end files this value corresponds to the total reads in the file; in case of paired-end files this value corresponds to the sum of reads in both files.

  • Paired-end detection: number of reads in the uploaded files contributing to the uploaded entity split by sequencing end. Reads in single-end files are all systematically reported in the subheader "merged". Reads in paired-end files that have only been sequenced in the 5' or 3' direction are reported in the corresponding subheaders "5' " and "3' ", while those composed of sequences from both 5' and 3' directions which have been merged thanks to their partial complementary sequence are reported in the subheader "merged" (concatenation occurring after reverse complementation of the 3' read and subtraction of its complementary part).

  • Quality filter passed: number of reads in the uploaded files contributing to the uploaded entity that pass the filtering criteria reported in the previous table and split by sequencing end. Reads in single-end files that pass the filtering criteria are all systematically reported in the subheader "merged". Reads in paired-end files passing the filtering criteria and that have only been sequenced in the 5' or 3' direction are reported in the corresponding subheaders "5' " and "3' ", while those composed of sequences from both 5' and 3' directions which have been merged thanks to their partial complementary sequence are reported in the subheader "merged".

  • Retained for analysis: total number of reads passing the filtering criteria.

  • Mapped reads: total number of reads retained after quality filtering which could be aligned against one or several profiles or reference sequences (depending on the chosen analysis pipeline).
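The merging of paired-end reads mentioned under "Paired-end detection" (reverse complementation of the 3' read, then concatenation without the overlapping part) can be illustrated with the toy sketch below. It is not the algorithm used by the SmartGene System: it assumes an exact, error-free overlap, and the function names, the `min_overlap` parameter, and the sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence written with A, C, G, T, N."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read_5p, read_3p, min_overlap=3):
    """Merge a read pair via the longest exact overlap, or return None.

    The 3' read is reverse-complemented first; the overlapping part is then
    kept only once in the merged sequence.
    """
    rc = reverse_complement(read_3p)
    for size in range(min(len(read_5p), len(rc)), min_overlap - 1, -1):
        if read_5p[-size:] == rc[:size]:
            return read_5p + rc[size:]
    return None  # no sufficient overlap: the reads stay unmerged


# Hypothetical pair sequenced from a 14 bp fragment "ACGTACGTAAGGCC".
print(merge_pair("ACGTACGTAA", "GGCCTTAC"))
```

Pairs for which `merge_pair` returns None correspond to reads that would remain in the "5' " and "3' " subheaders rather than in "merged".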

Table Read lengths

This table reports statistics for the length of reads uploaded and processed; values in base pairs. The values are reported for first quartile (Q1), Median, third quartile (Q3) and Mean, corresponding to the standard statistics values as follows:

  • the Median is the middle value when a data set is ordered from least to greatest

  • Q1 is the median of the lower half of the data

  • Q3 is the median of the upper half of the data

  • the Mean (average) of a data set is found by adding all numbers in the data set and then dividing by the number of values in the set.
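These four statistics can be reproduced with the sketch below, which follows the definitions above. One point is an assumption: for an odd number of values, the middle value is excluded from both halves when computing Q1 and Q3 (other quartile conventions exist and give slightly different results). The function name and the read lengths are hypothetical.

```python
from statistics import mean, median


def read_length_summary(lengths):
    """Return Q1, Median, Q3 and Mean for a list of read lengths (in bp)."""
    data = sorted(lengths)
    n = len(data)
    if n < 2:
        raise ValueError("at least two read lengths are needed for quartiles")
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + n % 2:]    # values above the median position
    return {
        "Q1": median(lower),
        "Median": median(data),
        "Q3": median(upper),
        "Mean": mean(data),
    }


# Hypothetical read lengths in base pairs.
print(read_length_summary([100, 150, 200, 250, 300, 350, 400]))
```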

In this table:

  • Paired-end detection: lengths of reads in the uploaded files contributing to the uploaded entity split by sequencing end. Lengths of reads in single-end files are all systematically reported in the subheader "merged". Lengths of reads in paired-end files that have only been sequenced in the 5' or 3' direction are reported in the corresponding subheaders "5' " and "3' ", while those composed of sequences from both 5' and 3' directions which have been merged thanks to their partial complementary sequence are reported in the subheader "merged" (concatenation occurring after reverse complementation of the 3' read and subtraction of its complementary part).

  • Quality filter passed: lengths of reads in the uploaded files contributing to the uploaded entity that pass the filtering criteria reported in the previous table and split by sequencing end. Lengths of reads in single-end files that pass the filtering criteria are all systematically reported in the subheader "merged". Lengths of reads in paired-end files passing the filtering criteria and that have only been sequenced in the 5' or 3' direction are reported in the corresponding subheaders "5' " and "3' ", while those composed of sequences from both 5' and 3' directions which have been merged thanks to their partial complementary sequence are reported in the subheader "merged".

  • Retained for analysis: lengths of reads passing the filtering criteria.

  • Mapped reads: lengths of reads retained after quality filtering which could be aligned against one or several profiles or reference sequences (depending on the chosen analysis pipeline).