We assessed several RNA-seq data processing pipelines and selected the most accurate approach based on the performance of predicting fetal sex by measuring the extent of chromosome Y (chrY) gene expression level. This task, however, is rather complicated because 1) there are multiple copies of chrY gene families, such as TSPY which contains ~30 copies, and 2) most chrY genes have close homologues in chromosome X – some chrY genes show ~99% identity with their corresponding X-linked homologues.
The study from Godfrey et al. (1) showed that kallisto (2) performed better than the two most used conventional methods (TopHat2 (3)+ featureCounts (4) and TopHat2 (3) + Cufflinks(5)) in terms of quantifying the expression level of male specific chrY genes precisely. Recently, Srivastava et al. (6) showed improved performance of their new methodology, so called selective alignment (SA) implemented in the salmon software package. This outperformed various existing methods, including kallisto, in both simulated and real RNA-seq datasets. Therefore, we tested the performance of salmon (v1.5.2) (7) with the following two approaches: 1) the mapping-based mode, and 2) the alignment-based mode. For the mapping-based mode, we applied so-called selective alignment option and employed “decoy” sequences – this approach was named as “Salmon (SA mode)”. For the alignment-based mode, we used HISAT2 (v2.2.1) (8), a successor of TopHat2 (3), followed by salmon in the alignment mode – this approach was named as “HiSat2+Salmon”. Firstly, we selected 100 RNA-seq samples randomly and down-sampled the original sequencing reads to 1, 5 and 10 million per sample (Supplementary Text Table 1) using seqtk (v1.2-r101-dirty) (9). We ran the two pipelines for each set of down-sampled reads, then measured the read counts for protein-coding chrY genes – there were 42 such chrY genes based on the Ensembl annotation v88 (10). If there was at least one read quantified in any of the 42 chrY genes, such a sample was simply predicted as male; otherwise, female. Then we tabulated confusion matrices by the numbers of true-positives (i.e. predicted as male and the true fetal sex being male; TP), true-negatives (i.e. predicted as female as the true fetal sex being female; TN), false-positives (i.e. predicted as male but the true fetal sex being female; FP), and false-negative (i.e. predicted as female but the true fetal sex being male; FN). We calculated the precision (also known as, positive predictive value), recall (also known as, true positive rate or sensitivity), and the F1 score as the followings:
Finally, we chose the pipeline based on the higher value of F1 score which is in favour of a smaller number of FP and FN.
For the 42 chrY genes, we found that a smaller number of reads was quantified from the female samples when “Salmon (SA mode)” was used compared to “HiSat2+Salmon” – this pattern was more noticeable from the samples of advanced gestation (Section 7.1). Therefore, the number of FP and FN were smaller for “Salmon (SA mode)” (Supplementary Text Table 2), and the F1 score was higher for “Salmon (SA mode)” than “HiSat2+Salmon” (Section 7.2 and Supplementary Text Table 2). Based on the 10M down-sampled RNA-seq samples, there were 24 and 16 chrY genes quantified as having at least one read from the 100 randomly selected samples, by “HiSat2+Salmon” and “Salmon (SA mode)”, respectively (Section 7.3).
4.1 Randomly select 100 samples and downsample reads
4.2 Import downsampled Salmon result
The code below internally calls the prep_salmon function defined in the _libs folder.
Code to parse downsampled Salmon results
dl.chrY.TPM<-list()all.ds<-c("1M","5M","10M")dl.chrY.TPM<-lapply(all.ds, function(my.ds){message(my.ds) my.RData<-paste0("RData/li.TPM.",my.ds,".RData")if(!file.exists(my.RData)){ # read the local output of Salmon my.slx<-ifelse(my.ds=="1M",paste0("SLX-ILM-Plasma2021-ds",my.ds,".Homo_sapiens.v2"),paste0("SLX-ILM-Plasma2021-ds",my.ds,".Homo_sapiens.v1")) li.TPM<-prep_ds_salmon(my.slx) # set li.TPM. See `_libs/local.R` how `li.TPM` was set upsave(li.TPM, file=paste0("RData/li.TPM.",my.ds,".RData"))message("li.TPM saved") }load(my.RData) %>% system.time dt.chrY.TPM<-rbind(data.table(`Type`="Salmon (SA mode)",`SampleID`=li.TPM[["Salmon"]][["counts"]][this.gene,] %>% colSums %>% names,`Counts`=li.TPM[["Salmon"]][["counts"]][this.gene,] %>% colSums, `TPM`=li.TPM[["Salmon"]][["TPM"]][this.gene,] %>% colSums ),data.table(`Type`="HiSat2+Salmon",`SampleID`=li.TPM[["Salmon_aln"]][["counts"]][this.gene,] %>% colSums %>% names,`Counts`=li.TPM[["Salmon_aln"]][["counts"]][this.gene,] %>% colSums,`TPM`=li.TPM[["Salmon_aln"]][["TPM"]][this.gene,] %>% colSums ) )# add Sex and GAmerge(dt.chrY.TPM, dt.samples[,.(SampleID,GA,Sex,pn_female)])})names(dl.chrY.TPM)<-all.dsdl.stat<-lapply(dl.chrY.TPM, function(i) i[,.(.N,SumCount=sum(Counts),SumTPM=sum(TPM),MeanCount=mean(Counts),MeanTPM=mean(TPM)),.(GA,Sex,Type)][order(GA,Sex)])
4.3 Number of samples by GA and sex among the 100 randomly selected samples
# Selection of RNA-seq quantification method {#sec-si-quant-method}We assessed several RNA-seq data processing pipelines and selected the most accurate approach based on the performance of predicting fetal sex by measuring the extent of chromosome Y (chrY) gene expression level. This task, however, is rather complicated because 1) there are multiple copies of chrY gene families, such as _TSPY_ which contains ~30 copies, and 2) most chrY genes have close homologues in chromosome X – some chrY genes show ~99% identity with their corresponding X-linked homologues.The study from Godfrey et al. (1) showed that kallisto (2) performed better than the two most used conventional methods (TopHat2 (3)+ featureCounts (4) and TopHat2 (3) + Cufflinks(5)) in terms of quantifying the expression level of male specific chrY genes precisely. Recently, Srivastava et al. (6) showed improved performance of their new methodology, so called selective alignment (SA) implemented in the salmon software package. This outperformed various existing methods, including kallisto, in both simulated and real RNA-seq datasets. Therefore, we tested the performance of salmon (v1.5.2) (7) with the following two approaches: 1) the mapping-based mode, and 2) the alignment-based mode. For the mapping-based mode, we applied so-called selective alignment option and employed “decoy” sequences – this approach was named as “Salmon (SA mode)”. For the alignment-based mode, we used HISAT2 (v2.2.1) (8), a successor of TopHat2 (3), followed by salmon in the alignment mode – this approach was named as “HiSat2+Salmon”. Firstly, we selected 100 RNA-seq samples randomly and down-sampled the original sequencing reads to 1, 5 and 10 million per sample (Supplementary Text Table 1) using [seqtk](https://github.com/lh3/seqtk) (v1.2-r101-dirty) (9). We ran the two pipelines for each set of down-sampled reads, then measured the read counts for protein-coding chrY genes – there were 42 such chrY genes based on the Ensembl annotation v88 (10). If there was at least one read quantified in any of the 42 chrY genes, such a sample was simply predicted as male; otherwise, female. Then we tabulated confusion matrices by the numbers of true-positives (i.e. predicted as male and the true fetal sex being male; TP), true-negatives (i.e. predicted as female as the true fetal sex being female; TN), false-positives (i.e. predicted as male but the true fetal sex being female; FP), and false-negative (i.e. predicted as female but the true fetal sex being male; FN). We calculated the precision (also known as, positive predictive value), recall (also known as, true positive rate or sensitivity), and the F1 score as the followings:$$Precision = TP / (TP +FP)$$$$Recall=TP⁄((TP+FN))$$$$F_1 score=2 (Precision×Recall)/(Precision+Recall)$$Finally, we chose the pipeline based on the higher value of F1 score which is in favour of a smaller number of FP and FN.For the 42 chrY genes, we found that a smaller number of reads was quantified from the female samples when “Salmon (SA mode)” was used compared to “HiSat2+Salmon” – this pattern was more noticeable from the samples of advanced gestation (@sec-si-figure-a). Therefore, the number of FP and FN were smaller for “Salmon (SA mode)” (Supplementary Text Table 2), and the F1 score was higher for “Salmon (SA mode)” than “HiSat2+Salmon” (@sec-si-figure-b and Supplementary Text Table 2). Based on the 10M down-sampled RNA-seq samples, there were 24 and 16 chrY genes quantified as having at least one read from the 100 randomly selected samples, by “HiSat2+Salmon” and “Salmon (SA mode)”, respectively (@sec-si-figure-c).## Randomly select 100 samples and downsample reads {#sec-si-rand}```{#lst-si-downsample .bash lst-cap="downsampling command using `bash` and [`seqtk`](https://github.com/lh3/seqtk)"}{{< include static/shell/downsample.sh >}}```## Import downsampled `Salmon` result {#sec-si-quant-salmon}The code below internally calls the `prep_salmon` function defined in the `_libs` folder.```{r load-ds-salmon}#| label: load-ds-salmon#| eval: false#| code-summary: "Code to parse downsampled `Salmon` results"dl.chrY.TPM<-list()all.ds<-c("1M","5M","10M")dl.chrY.TPM<-lapply(all.ds, function(my.ds){ message(my.ds) my.RData<-paste0("RData/li.TPM.",my.ds,".RData") if(!file.exists(my.RData)){ # read the local output of Salmon my.slx<-ifelse(my.ds=="1M",paste0("SLX-ILM-Plasma2021-ds",my.ds,".Homo_sapiens.v2"),paste0("SLX-ILM-Plasma2021-ds",my.ds,".Homo_sapiens.v1")) li.TPM<-prep_ds_salmon(my.slx) # set li.TPM. See `_libs/local.R` how `li.TPM` was set up save(li.TPM, file=paste0("RData/li.TPM.",my.ds,".RData")) message("li.TPM saved") } load(my.RData) %>% system.time dt.chrY.TPM<-rbind( data.table( `Type`="Salmon (SA mode)", `SampleID`=li.TPM[["Salmon"]][["counts"]][this.gene,] %>% colSums %>% names, `Counts`=li.TPM[["Salmon"]][["counts"]][this.gene,] %>% colSums, `TPM`=li.TPM[["Salmon"]][["TPM"]][this.gene,] %>% colSums ), data.table( `Type`="HiSat2+Salmon", `SampleID`=li.TPM[["Salmon_aln"]][["counts"]][this.gene,] %>% colSums %>% names, `Counts`=li.TPM[["Salmon_aln"]][["counts"]][this.gene,] %>% colSums, `TPM`=li.TPM[["Salmon_aln"]][["TPM"]][this.gene,] %>% colSums ) ) # add Sex and GA merge(dt.chrY.TPM, dt.samples[,.(SampleID,GA,Sex,pn_female)])})names(dl.chrY.TPM)<-all.dsdl.stat<-lapply(dl.chrY.TPM, function(i) i[,.(.N,SumCount=sum(Counts),SumTPM=sum(TPM),MeanCount=mean(Counts),MeanTPM=mean(TPM)),.(GA,Sex,Type)][order(GA,Sex)])```## Number of samples by GA and sex among the 100 randomly selected samples```{r ds100}#! label: ds100library(magrittr)dt.samples<-fread("static/R/data/dt.samples.csv") # n=755rand.100<-fread("static/R/data/random.sample.100.txt", header=F) # n=100xtabs(~GA+Sex, data=dt.samples[SampleID %in% rand.100$V1]) %>% addmargins```