As various pathogens have been reported in the blood, cerebrospinal fluid, and central nervous system of patients with ALS, a few scientific publications have suggested that infectious agents may play a role in neurodegenerative diseases, but these agents have never been identified.
In a new scientific publication, the authors report that they found an unidentified virus-like signature in a set of 120 whole blood RNA samples, in ALS patients as well as in controls. However, when they turned to other public datasets they were unable to find this viral signature, so little has been learned.
The search for pathogens in sequencing data from the blood of patients with ALS has been attempted before. But because sequencing techniques can only read short fragments of DNA at a time (the "reads"), they rely on reference genomes to reassemble those fragments, and this reconstruction cannot discover genomes that are not included in the reference databases.
To complicate matters further, the vast majority of the reads come from the patient and not from a hypothetical microbe or virus.
So when Melnick, Prudencio and their colleagues at Boulder and Jacksonville set out to design a new sequencing pipeline, they made sure it did not ignore "reads" that cannot be aligned with a known genome.
The authors developed a bioinformatics pipeline that identifies microbial sequences in mammalian RNA-seq data, including sequences with no significant nucleotide similarity to any entry in GenBank.
They opted for a de novo assembly of unmapped reads into contigs, followed by aligning unmapped reads to these contigs for quantification. The code used in this manuscript is available at https://github.com/Senorelegans/MysteryMiner
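To make the two bookkeeping steps around the assembly more concrete, here is a minimal sketch in Python. It is not taken from MysteryMiner: the function names and the toy records are hypothetical, and it assumes alignments are available as plain SAM text, where bit 0x4 of the FLAG field marks a read as unmapped. The assembly step itself is left out.

```python
from collections import Counter

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"

def parse_sam(lines):
    """Yield (read_name, flag, reference_name) for each SAM alignment record."""
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        yield fields[0], int(fields[1]), fields[2]

def unmapped_read_names(host_sam_lines):
    """Names of reads that did not align to the host genome."""
    return [name for name, flag, _ in parse_sam(host_sam_lines)
            if flag & UNMAPPED_FLAG]

def reads_per_contig(contig_sam_lines):
    """Count reads aligned to each contig (hypothetical quantification step)."""
    counts = Counter()
    for _, flag, contig in parse_sam(contig_sam_lines):
        if not flag & UNMAPPED_FLAG:
            counts[contig] += 1
    return counts

# Toy records, invented for illustration only.
host_sam = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII",
]
contig_sam = [
    "read2\t0\tcontig_1\t1\t60\t4M\t*\t0\t0\tTTGA\tIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII",
]

non_host = unmapped_read_names(host_sam)
contig_counts = reads_per_contig(contig_sam)
```

In this toy example, the reads left over after host alignment would be handed to an assembler, and the per-contig counts stand in for the quantification the authors describe.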
A total of 120 whole blood RNA samples were initially used: 30 healthy controls (from the general population, with no blood relatives with ALS), 30 pre-symptomatic C9ORF72 mutation carriers, 30 symptomatic C9ORF72 ALS cases, and 30 symptomatic C9ORF72-negative ALS cases.
The authors first tested the efficiency of this pipeline on public RNA-seq data. They then applied it to a new RNA-seq dataset generated from this cohort of 120 samples from amyotrophic lateral sclerosis (ALS) patients and controls, and identified sequences corresponding to bacteria and known viruses, as well as new virus-like sequences.
The complete dataset contains 8.64 × 10^9 combined reads. About 2.7% (2.34 × 10^8) of the reads did not match the human genome. From these non-host reads, 2,976,988 contigs were assembled, and 17,047 of them were identified by BLASTN (the regular biome). A total of 25,815 contigs had no BLASTN match; after filtering, the authors identified 2,980 dark biome contigs (identified by BLASTX only) and 859 double dark biome contigs (no BLASTX or BLASTN hit).
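The three categories can be summarised as a simple decision rule. The sketch below is only an illustration of that rule as described here, not the authors' code. The function name is hypothetical, and the total is read as 8.64 × 10^9 reads, an assumption that is consistent with the stated 2.7% of non-host reads.

```python
def classify_contig(has_blastn_hit, has_blastx_hit):
    """Assign a contig to a biome category from its BLAST results."""
    if has_blastn_hit:
        return "regular"      # nucleotide-level match to a known sequence
    if has_blastx_hit:
        return "dark"         # protein-level similarity only
    return "double dark"      # no BLASTN and no BLASTX hit

# Sanity check on the read counts quoted in the text.
total_reads = 8.64e9      # assumed reading of the total read count
non_host_reads = 2.34e8   # reads that did not match the human genome
non_host_percent = 100 * non_host_reads / total_reads

categories = [
    classify_contig(True, True),
    classify_contig(False, True),
    classify_contig(False, False),
]
```

Checking BLASTN first reflects the order given in the text: a contig counts as dark only once a nucleotide match has been ruled out.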
In the dark biome contigs, Melnick and his colleagues noted many contigs containing a region of protein sequence similar to the RNA-dependent RNA polymerase (RdRP) of several RNA viruses, with the greatest similarity to velvet tobacco mottle virus. These contigs were present in the controls as well.
RdRP is an essential protein encoded in the genomes of all RNA viruses that replicate without a DNA stage, a group that includes SARS-CoV-2.
To confirm that this virus-like sequence was not a contig assembly artifact or a contaminant introduced during library construction or sequencing, the authors ran RT-PCR on the original patient samples. The sequence was present in the samples identified as positive by the RNA-seq analysis and was not detectable in the negative samples.
The scientists then investigated whether similar results would be obtained from other ALS data sets. To this end, they examined five other publicly available ALS datasets.
However, in none of these other ALS datasets did they find a statistically significant difference between samples from patients with ALS and control samples, whether for viruses or bacteria, at the genus or species level, in the regular or the dark biome.