Analyzing L. reuteri dataΒΆ

We have our data! Hooray! Now, we can go through and analyze all of it, from Trimmomatic to HTSeq. We are only going to take it through HTSeq and then do the differential gene expression interactively on our own machines. This is simply because it’s nice to work with everything on our own machines because we will be generating plots and may have to iteratively do the differential gene expression analysis based on the results/plots.

To this point, we have worked with a small subset of example RNA-seq data (100,000 reads per sample). Now, we will be working with full datasets (~15 million reads per sample). Thus, we are going to submit the jobs to the MSU HPCC to run for the next 20 hours or so.

Let’s have a look at the raw data that we received from the MSU RTSF. We sequenced 24 samples but got 48 files back! Why did we get twice the data back? This is due to the limitations of the sequencer itself. The particular flowcell for our data has 2 lanes. Each lane generates around 150 million reads total and we want around 15 million reads per sample. Thus, 24 samples * 15 million reads = 360 million total reads. This is more than twice what a single lane can produce for us. Thus, we pooled all of our samples together and ran them on both lanes to achieve the desired read number. So in order to work with this data, we need to combine the reads from both lanes together into a single file. However, to do this effectively, we need to decompress the files and then merge them.

If you are interested in what I did to get to this point, the commands are as follows in pseudocode.

Decompress files: gunzip -c file_I_want_to_decompress.fastq.gz > decompressed_file.fastq

Merge files: cat sample1_read_data_from_lane2.fastq >> sample1_read_data_from_lane1.fastq

Recompress files: gzip -c file_I_want_to_compress.fastq > compressed_file.fastq.gz

I have already done all of this for you and grouped the merged and recompressed data into folders corresponding to different treatment comparisons:

LB vs Indole treated L. reuteri

LB vs Commensal E. coli conditioned medium treated L. reuteri

LB vs EHEC conditioned medium treated L. reuteri

These folders are located in /mnt/research/mmg434/LreuteriRNAseqData/RawData

Dr. Viswanathan has put together a list of who is doing which comparisons.

LAST_NAME FIRST_NAME Lbcontrol/ Indole/ E.coli conditioned EHEC conditioned
    Sample # Sample # medium/Sample # medium/ Sample #
Asopa Mrinal A1 B1    
Babiker Leena A2 B2    
Biernat Emily Rachel A3 B3    
Cermak Megan Molloy A4 B4    
Collins Sasha Lauren A1 B5    
Freiberger Katharina Marianne A2 B6    
           
Ginzburg Daniel A3   C1  
Griffin Caleigh Ellen A4   C2  
Jones Ashley Sade A1   C3  
Kaminski Theresa Marie A2   C4  
Li Irene A3   C5  
Macko Haley Marie A4   C6  
           
Rahn Christopher Louis Von        
Reske Jake Jordan A1     D1
Stawkey Danielle Marie A2     D2
Struble Nicole Danielle A3     D3
Tencer Joel Ira A4     D4
Vanalst Andrew John A1     D5
Wolfson Abigail Esther A2     D6

Now, since time is of the essence, I have written a script to do the analysis for us! Let’s have a look at it:

#PBS -l walltime=20:00:00,nodes=1:ppn=4,mem=24gb

|----------------------------------------------------------------------------|
|This script was written *specifically* for the MMG434 course to analyze the |
|L. reuteri RNA-seq data generated by the MSU RTSF. This series of processes |
|will generate a folder called 'RNAseq_(currentdate)' in your home (~) direc-|
|tory. Within that folder a series of sub-folders are made to organize the   |
|data and analysis results. The general workflow is Trimmomatic -> FastQC -> |
|Bowtie -> HTSeq. After this has been run, you can download the map.sam files|
|and analyze them in your favorite differential gene expression software pac-|
|age (e.g. edgeR). To use this script, you will need to change the data loca-|
|tion and then save it (e.g. use nano or another text editor to make the cha-|
|nges). To run this script, follow the tutorial at mmg434.readthedocs.org.   |
|----------------------------------------------------------------------------|

#change to home directory on HPCC
cd ~

#check if the folder RNAseq_(current date) exists and if not, make the folder
NEWFOLDNAME=RNAseq
DATE=`date +%d-%m-%y`
NEWFOLD=${NEWFOLDNAME}_${DATE}

#if the folder already exists, make a new one with a _number added
n=1	#start with the number as 1

while [ -d ~/$NEWFOLD ]	#while the folder exists, do the commands below
do

	NEWFOLD=${NEWFOLDNAME}_${DATE}_${n}	#add the number to the folder name
	n=$((n+1))	#increment n by 1

done

mkdir $NEWFOLD
cd ~/$NEWFOLD

#make several folders within this RNAseq_(current date) folder to organize data
mkdir TrimmedData FastQCreports Bowtie HTSeq edgeR


#load the appropriate modules for all of the analysis steps
module load Trimmomatic/0.32
module load FastQC/0.11.2
module load bowtie/1.0.0
module load HTSeq/0.6.1

#Copy the data into the TrimmedData folder
#DEPENDING ON YOUR COMPARISON, YOU NEED TO CHANGE THE LBvsCOMM FOLDER LOCATION TO EITHER
#LBvsEHEC OR LBvsIndole
#IF YOU DO NOT CHANGE IT, YOU WILL RUN THE ANALYSIS FOR THE LBvsComm COMPARISON

cp /mnt/research/mmg434/LreuteriRNAseqData/RawData/LBvsComm/*.fastq.gz ~/$NEWFOLD/TrimmedData

#Do the trimming with Trimmomatic

cd ~/$NEWFOLD/TrimmedData
for untrimmedreads in ~/$NEWFOLD/TrimmedData/*
do
	FILENAME=`basename ${untrimmedreads%.*.*}`
	PREFIX=trimmed
	NEWTRIMFILE=${PREFIX}${FILENAME}
	java -jar $TRIM/trimmomatic SE -threads 4 $untrimmedreads $NEWTRIMFILE.fastq.gz ILLUMINACLIP:$TRIM/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
done

#Generate the FastQC reports

cp ~/$NEWFOLD/TrimmedData/trimmed*.fastq.gz ~/$NEWFOLD/FastQCreports
cd ~/$NEWFOLD/FastQCreports
for trimmedreads in ~/$NEWFOLD/FastQCreports/trimmed*.fastq.gz
do
	fastqc $trimmedreads
done
rm trimmed*.fastq.gz

#Decompress the reads files and move them to the Bowtie directory

cd ~/$NEWFOLD/TrimmedData
cp trimmed*.fastq.gz ~/$NEWFOLD/Bowtie

for compfiles in ~/$NEWFOLD/Bowtie/*.fastq.gz
do
	FILENAME=${compfiles%.*.*}
	gunzip -c $FILENAME.fastq.gz > $FILENAME.fastq
done

cd ~/$NEWFOLD/Bowtie
rm *.fastq.gz	#clean the .gz files

#Align the reads with Bowtie to the reference genome

cd ~/$NEWFOLD/Bowtie

bowtie-build /mnt/research/mmg434/LreuteriRNAseqData/RefGenomeFiles/trimmedlreuterijcm1112.fa trimmedlreuterijcm1112

for renametrim in ~/$NEWFOLD/Bowtie/*.fastq
do
	LESSPREFIX=`basename ${renametrim//trimmed/}`
	mv $renametrim ~/$NEWFOLD/Bowtie/$LESSPREFIX
done

for alignment in ~/$NEWFOLD/Bowtie/*.fastq
do	
	FILENAME=`basename ${alignment%.*}`
	PREFIX=align
	NEWALIGNFILE=${PREFIX}${FILENAME}
	bowtie -S -p 3 trimmedlreuterijcm1112 $FILENAME.fastq > $NEWALIGNFILE.sam
done

rm ~/$NEWFOLD/Bowtie/*.fastq #clean the .fastq files

#Count gene features with HTSeq

cp ~/$NEWFOLD/Bowtie/align*.sam ~/$NEWFOLD/HTSeq
cp /mnt/research/mmg434/LreuteriRNAseqData/RefGenomeFiles/alignlreuterijcm1112.gtf ~/$NEWFOLD/HTSeq
cd ~/$NEWFOLD/HTSeq

for renamealign in ~/$NEWFOLD/HTSeq/*.sam
do
	LESSPREFIX=`basename ${renamealign//align/}`
	mv $renamealign ~/$NEWFOLD/HTSeq/$LESSPREFIX
done

for counts in ~/$NEWFOLD/HTSeq/*.sam
do
	FILENAME=`basename ${counts%.*}`
	PREFIX=map
	NEWMAPFILE=${PREFIX}${FILENAME}
	htseq-count -m intersection-nonempty --stranded=reverse $FILENAME.sam alignlreuterijcm1112.gtf > $NEWMAPFILE.sam
done

rm ~/$NEWFOLD/HTSeq/LR*.sam	#clean the alignment files

This is a lot to take in! All this script is doing is linking all of the steps together and analyzing the data in one fell swoop for us. But! We NEED to change one thing in the script, depending on the comparison we are making.

Let’s grab the script file:

cp /mnt/research/mmg434/LreuteriRNAseqData/LreuteriAnalysisScript.sh ~

Now, let’s open it in the nano text editor:

nano LreuteriAnalysisScript.sh

It will look something like the script above. You can’t use your mouse here and must use your arrow keys to get around. Let’s read through the beginning of the script and see what we need to change.

Now, we will move down through the script and change the name of the folder containing the appropriate comparison (unless you happen to be in the LB vs Commensal E. coli conditioned medium treated group!).

To save the changes, hit Ctrl + o, followed by the Enter/Return key.

To exit nano, hit Ctrl + x

To run the script:

module load powertools
intel14-phi
module load powertools
cd ~
qsub LreuteriAnalysisScript.sh

Hooray! The script has been submitted to the queue to run. Write down your job ID.