4  Selection of RNA-seq quantification method

We assessed several RNA-seq data processing pipelines and selected the most accurate approach based on the performance of predicting fetal sex by measuring the extent of chromosome Y (chrY) gene expression level. This task, however, is rather complicated because 1) there are multiple copies of chrY gene families, such as TSPY which contains ~30 copies, and 2) most chrY genes have close homologues in chromosome X – some chrY genes show ~99% identity with their corresponding X-linked homologues.

The study from Godfrey et al. (1) showed that kallisto (2) performed better than the two most used conventional methods (TopHat2 (3)+ featureCounts (4) and TopHat2 (3) + Cufflinks(5)) in terms of quantifying the expression level of male specific chrY genes precisely. Recently, Srivastava et al. (6) showed improved performance of their new methodology, so called selective alignment (SA) implemented in the salmon software package. This outperformed various existing methods, including kallisto, in both simulated and real RNA-seq datasets. Therefore, we tested the performance of salmon (v1.5.2) (7) with the following two approaches: 1) the mapping-based mode, and 2) the alignment-based mode. For the mapping-based mode, we applied so-called selective alignment option and employed “decoy” sequences – this approach was named as “Salmon (SA mode)”. For the alignment-based mode, we used HISAT2 (v2.2.1) (8), a successor of TopHat2 (3), followed by salmon in the alignment mode – this approach was named as “HiSat2+Salmon”. Firstly, we selected 100 RNA-seq samples randomly and down-sampled the original sequencing reads to 1, 5 and 10 million per sample (Supplementary Text Table 1) using seqtk (v1.2-r101-dirty) (9). We ran the two pipelines for each set of down-sampled reads, then measured the read counts for protein-coding chrY genes – there were 42 such chrY genes based on the Ensembl annotation v88 (10). If there was at least one read quantified in any of the 42 chrY genes, such a sample was simply predicted as male; otherwise, female. Then we tabulated confusion matrices by the numbers of true-positives (i.e. predicted as male and the true fetal sex being male; TP), true-negatives (i.e. predicted as female as the true fetal sex being female; TN), false-positives (i.e. predicted as male but the true fetal sex being female; FP), and false-negative (i.e. predicted as female but the true fetal sex being male; FN). We calculated the precision (also known as, positive predictive value), recall (also known as, true positive rate or sensitivity), and the F1 score as the followings:

\[Precision = TP / (TP +FP)\]

\[Recall=TP⁄((TP+FN))\]

\[F_1 score=2 (Precision×Recall)/(Precision+Recall)\]

Finally, we chose the pipeline based on the higher value of F1 score which is in favour of a smaller number of FP and FN.

For the 42 chrY genes, we found that a smaller number of reads was quantified from the female samples when “Salmon (SA mode)” was used compared to “HiSat2+Salmon” – this pattern was more noticeable from the samples of advanced gestation (Section 7.1). Therefore, the number of FP and FN were smaller for “Salmon (SA mode)” (Supplementary Text Table 2), and the F1 score was higher for “Salmon (SA mode)” than “HiSat2+Salmon” (Section 7.2 and Supplementary Text Table 2). Based on the 10M down-sampled RNA-seq samples, there were 24 and 16 chrY genes quantified as having at least one read from the 100 randomly selected samples, by “HiSat2+Salmon” and “Salmon (SA mode)”, respectively (Section 7.3).

4.1 Randomly select 100 samples and downsample reads

Listing 4.1: downsampling command using bash and seqtk
###########################
# Random sampling for 100 #
###########################
for i in `ls ~/data/fastq/SLX-ILM-Plasma2021/*r_1.fq.gz`; do BARCODE=`echo $i | cut -d/ -f7 | cut -d. -f2`; echo $BARCODE;done | grep -v PPC | sort | uniq -c | awk '$1==4{print $0}' | sort -R | head -n 100 | awk '{print $2}' > ~/results/SLX-ILM-Plasma2021.Homo_sapiens.v1/random.sample.100.txt
join -t "," <(sort ~/results/SLX-ILM-Plasma2021.Homo_sapiens.v1/random.sample.100.txt) <(sort -t "," -k1,1 ~/results/SLX-ILM-Plasma2021.Homo_sapiens.v1/FASTQ.list.r1.txt) > ~/results/SLX-ILM-Plasma2021.Homo_sapiens.v1/random.sample.100.FASTQ.list.r1.txt

######################################
# Downsampling to 1M for 100 samples #
######################################
 for i in `awk -F"," '{print $2}' ~/results/SLX-ILM-Plasma2021.Homo_sapiens.v1/random.sample.100.FASTQ.list.r1.txt`; do FNAME=SLX-ILM-Plasma2021-ds1M.`echo $i | cut -d/ -f7 | cut -d. -f2,3,4,5,6,7`; echo $i; time seqtk sample -s11 $i 250000 | gzip -9 >  ~/data/fastq/SLX-ILM-Plasma2021-ds1M/$FNAME; done
 for i in `awk -F"," '{print $2}' ~/results/SLX-ILM-Plasma2021.Homo_sapiens.v1/random.sample.100.FASTQ.list.r2.txt`; do FNAME=SLX-ILM-Plasma2021-ds1M.`echo $i | cut -d/ -f7 | cut -d. -f2,3,4,5,6,7`; echo $i; time seqtk sample -s11 $i 250000 | gzip -9 >  ~/data/fastq/SLX-ILM-Plasma2021-ds1M/$FNAME; done

######################################
# Downsampling to 5M for 100 samples #
######################################
 for i in `awk -F"," '{print $2}' ~/results/SLX-ILM-Plasma2021.Homo_sapiens.v1/random.sample.100.FASTQ.list.r1.txt`; do FNAME=SLX-ILM-Plasma2021-ds5M.`echo $i | cut -d/ -f7 | cut -d. -f2,3,4,5,6,7`; echo $i; time seqtk sample -s11 $i 1250000 | gzip -9 >  ~/data/fastq/SLX-ILM-Plasma2021-ds5M/$FNAME; done
 for i in `awk -F"," '{print $2}' ~/results/SLX-ILM-Plasma2021.Homo_sapiens.v1/random.sample.100.FASTQ.list.r2.txt`; do FNAME=SLX-ILM-Plasma2021-ds5M.`echo $i | cut -d/ -f7 | cut -d. -f2,3,4,5,6,7`; echo $i; time seqtk sample -s11 $i 1250000 | gzip -9 >  ~/data/fastq/SLX-ILM-Plasma2021-ds5M/$FNAME; done

#######################################
# Downsampling to 10M for 100 samples #
###################]###################
 for i in `awk -F"," '{print $2}' ~/results/SLX-ILM-Plasma2021.Homo_sapiens.v1/random.sample.100.FASTQ.list.r1.txt`; do FNAME=SLX-ILM-Plasma2021-ds10M.`echo $i | cut -d/ -f7 | cut -d. -f2,3,4,5,6,7`; echo $i; time seqtk sample -s11 $i 2500000 | gzip -9 >  ~/data/fastq/SLX-ILM-Plasma2021-ds10M/$FNAME; done
 for i in `awk -F"," '{print $2}' ~/results/SLX-ILM-Plasma2021.Homo_sapiens.v1/random.sample.100.FASTQ.list.r2.txt`; do FNAME=SLX-ILM-Plasma2021-ds10M.`echo $i | cut -d/ -f7 | cut -d. -f2,3,4,5,6,7`; echo $i; time seqtk sample -s11 $i 2500000 | gzip -9 >  ~/data/fastq/SLX-ILM-Plasma2021-ds10M/$FNAME; done

4.2 Import downsampled Salmon result

The code below internally calls the prep_salmon function defined in the _libs folder.

Code to parse downsampled Salmon results
dl.chrY.TPM<-list()
all.ds<-c("1M","5M","10M")
dl.chrY.TPM<-lapply(all.ds, function(my.ds){
  message(my.ds)
  my.RData<-paste0("RData/li.TPM.",my.ds,".RData")
  if(!file.exists(my.RData)){ # read the local output of Salmon 
    my.slx<-ifelse(my.ds=="1M",paste0("SLX-ILM-Plasma2021-ds",my.ds,".Homo_sapiens.v2"),paste0("SLX-ILM-Plasma2021-ds",my.ds,".Homo_sapiens.v1"))
    li.TPM<-prep_ds_salmon(my.slx) # set li.TPM. See `_libs/local.R` how `li.TPM` was set up
    save(li.TPM, file=paste0("RData/li.TPM.",my.ds,".RData"))
    message("li.TPM saved")
  }
  load(my.RData) %>% system.time

  dt.chrY.TPM<-rbind(
    data.table(
      `Type`="Salmon (SA mode)",
      `SampleID`=li.TPM[["Salmon"]][["counts"]][this.gene,] %>% colSums %>% names,
      `Counts`=li.TPM[["Salmon"]][["counts"]][this.gene,] %>% colSums, 
      `TPM`=li.TPM[["Salmon"]][["TPM"]][this.gene,] %>% colSums 
      ),
    data.table(
      `Type`="HiSat2+Salmon",
      `SampleID`=li.TPM[["Salmon_aln"]][["counts"]][this.gene,] %>% colSums %>% names,
      `Counts`=li.TPM[["Salmon_aln"]][["counts"]][this.gene,] %>% colSums,
      `TPM`=li.TPM[["Salmon_aln"]][["TPM"]][this.gene,] %>% colSums 
      )
  )
  # add Sex and GA
  merge(dt.chrY.TPM, dt.samples[,.(SampleID,GA,Sex,pn_female)])
})
names(dl.chrY.TPM)<-all.ds

dl.stat<-lapply(dl.chrY.TPM, function(i) i[,.(.N,SumCount=sum(Counts),SumTPM=sum(TPM),MeanCount=mean(Counts),MeanTPM=mean(TPM)),.(GA,Sex,Type)][order(GA,Sex)])

4.3 Number of samples by GA and sex among the 100 randomly selected samples

Code
#! label: ds100
library(magrittr)
dt.samples<-fread("static/R/data/dt.samples.csv") # n=755
rand.100<-fread("static/R/data/random.sample.100.txt", header=F) # n=100
xtabs(~GA+Sex, data=dt.samples[SampleID %in% rand.100$V1]) %>% addmargins
      Sex
GA     Female Male Sum
  12wk     17    5  22
  20wk     18    7  25
  28wk     19   15  34
  36wk     11    8  19
  Sum      65   35 100