Analyzing L. reuteri data¶

We have our data! Hooray! Now, we can go through and analyze all of it, from Trimmomatic to HTSeq. We are only going to take it through HTSeq and then do the differential gene expression interactively on our own machines. This is simply because it’s nice to work with everything on our own machines because we will be generating plots and may have to iteratively do the differential gene expression analysis based on the results/plots.

To this point, we have worked with a small subset of example RNA-seq data (100,000 reads per sample). Now, we will be working with full datasets (~15 million reads per sample). Thus, we are going to submit the jobs to the MSU HPCC to run for the next 20 hours or so.

Let’s have a look at the raw data that we received from the MSU RTSF. We sequenced 24 samples but got 48 files back! Why did we get twice the data back? This is due to the limitations of the sequencer itself. The particular flowcell for our data has 2 lanes. Each lane generates around 150 million reads total and we want around 15 million reads per sample. Thus, 24 samples * 15 million reads = 360 million total reads. This is more than twice what a single lane can produce for us. Thus, we pooled all of our samples together and ran them on both lanes to achieve the desired read number. So in order to work with this data, we need to combine the reads from both lanes together into a single file. However, to do this effectively, we need to decompress the files and then merge them.

If you are interested in what I did to get to this point, the commands are as follows in pseudocode.

Decompress files: gunzip -c file_I_want_to_decompress.fastq.gz > decompressed_file.fastq

Merge files: cat sample1_read_data_from_lane2.fastq >> sample1_read_data_from_lane1.fastq

Recompress files: gzip -c file_I_want_to_compress.fastq > compressed_file.fastq.gz

I have already done all of this for you and grouped the merged and recompressed data into folders corresponding to different treatment comparisons:

LB vs Indole treated L. reuteri

LB vs Commensal E. coli conditioned medium treated L. reuteri

LB vs EHEC conditioned medium treated L. reuteri

These folders are located in /mnt/research/mmg434/LreuteriRNAseqData/RawData

Dr. Viswanathan has put together a list of who is doing which comparisons.

LAST_NAME	FIRST_NAME	Lbcontrol/	Indole/	E.coli conditioned	EHEC conditioned
		Sample #	Sample #	medium/Sample #	medium/ Sample #
Asopa	Mrinal	A1	B1
Babiker	Leena	A2	B2
Biernat	Emily Rachel	A3	B3
Cermak	Megan Molloy	A4	B4
Collins	Sasha Lauren	A1	B5
Freiberger	Katharina Marianne	A2	B6

Ginzburg	Daniel	A3		C1
Griffin	Caleigh Ellen	A4		C2
Jones	Ashley Sade	A1		C3
Kaminski	Theresa Marie	A2		C4
Li	Irene	A3		C5
Macko	Haley Marie	A4		C6

Rahn	Christopher Louis Von
Reske	Jake Jordan	A1			D1
Stawkey	Danielle Marie	A2			D2
Struble	Nicole Danielle	A3			D3
Tencer	Joel Ira	A4			D4
Vanalst	Andrew John	A1			D5
Wolfson	Abigail Esther	A2			D6

Now, since time is of the essence, I have written a script to do the analysis for us! Let’s have a look at it:

#PBS -l walltime=20:00:00,nodes=1:ppn=4,mem=24gb

|----------------------------------------------------------------------------|
|This script was written *specifically* for the MMG434 course to analyze the |
|L. reuteri RNA-seq data generated by the MSU RTSF. This series of processes |
|will generate a folder called 'RNAseq_(currentdate)' in your home (~) direc-|
|tory. Within that folder a series of sub-folders are made to organize the   |
|data and analysis results. The general workflow is Trimmomatic -> FastQC -> |
|Bowtie -> HTSeq. After this has been run, you can download the map.sam files|
|and analyze them in your favorite differential gene expression software pac-|
|age (e.g. edgeR). To use this script, you will need to change the data loca-|
|tion and then save it (e.g. use nano or another text editor to make the cha-|
|nges). To run this script, follow the tutorial at mmg434.readthedocs.org.   |
|----------------------------------------------------------------------------|

#change to home directory on HPCC
cd ~

#check if the folder RNAseq_(current date) exists and if not, make the folder
NEWFOLDNAME=RNAseq
DATE=`date +%d-%m-%y`
NEWFOLD=${NEWFOLDNAME}_${DATE}

#if the folder already exists, make a new one with a _number added
n=1	#start with the number as 1

while [ -d ~/$NEWFOLD ]	#while the folder exists, do the commands below
do

	NEWFOLD=${NEWFOLDNAME}_${DATE}_${n}	#add the number to the folder name
	n=$((n+1))	#increment n by 1

done

mkdir $NEWFOLD
cd ~/$NEWFOLD

#make several folders within this RNAseq_(current date) folder to organize data
mkdir TrimmedData FastQCreports Bowtie HTSeq edgeR


#load the appropriate modules for all of the analysis steps
module load Trimmomatic/0.32
module load FastQC/0.11.2
module load bowtie/1.0.0
module load HTSeq/0.6.1

#Copy the data into the TrimmedData folder
#DEPENDING ON YOUR COMPARISON, YOU NEED TO CHANGE THE LBvsCOMM FOLDER LOCATION TO EITHER
#LBvsEHEC OR LBvsIndole
#IF YOU DO NOT CHANGE IT, YOU WILL RUN THE ANALYSIS FOR THE LBvsComm COMPARISON

cp /mnt/research/mmg434/LreuteriRNAseqData/RawData/LBvsComm/*.fastq.gz ~/$NEWFOLD/TrimmedData

#Do the trimming with Trimmomatic

cd ~/$NEWFOLD/TrimmedData
for untrimmedreads in ~/$NEWFOLD/TrimmedData/*
do
	FILENAME=`basename ${untrimmedreads%.*.*}`
	PREFIX=trimmed
	NEWTRIMFILE=${PREFIX}${FILENAME}
	java -jar $TRIM/trimmomatic SE -threads 4 $untrimmedreads $NEWTRIMFILE.fastq.gz ILLUMINACLIP:$TRIM/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
done

#Generate the FastQC reports

cp ~/$NEWFOLD/TrimmedData/trimmed*.fastq.gz ~/$NEWFOLD/FastQCreports
cd ~/$NEWFOLD/FastQCreports
for trimmedreads in ~/$NEWFOLD/FastQCreports/trimmed*.fastq.gz
do
	fastqc $trimmedreads
done
rm trimmed*.fastq.gz

#Decompress the reads files and move them to the Bowtie directory

cd ~/$NEWFOLD/TrimmedData
cp trimmed*.fastq.gz ~/$NEWFOLD/Bowtie

for compfiles in ~/$NEWFOLD/Bowtie/*.fastq.gz
do
	FILENAME=${compfiles%.*.*}
	gunzip -c $FILENAME.fastq.gz > $FILENAME.fastq
done

cd ~/$NEWFOLD/Bowtie
rm *.fastq.gz	#clean the .gz files

#Align the reads with Bowtie to the reference genome

cd ~/$NEWFOLD/Bowtie

bowtie-build /mnt/research/mmg434/LreuteriRNAseqData/RefGenomeFiles/trimmedlreuterijcm1112.fa trimmedlreuterijcm1112

for renametrim in ~/$NEWFOLD/Bowtie/*.fastq
do
	LESSPREFIX=`basename ${renametrim//trimmed/}`
	mv $renametrim ~/$NEWFOLD/Bowtie/$LESSPREFIX
done

for alignment in ~/$NEWFOLD/Bowtie/*.fastq
do	
	FILENAME=`basename ${alignment%.*}`
	PREFIX=align
	NEWALIGNFILE=${PREFIX}${FILENAME}
	bowtie -S -p 3 trimmedlreuterijcm1112 $FILENAME.fastq > $NEWALIGNFILE.sam
done

rm ~/$NEWFOLD/Bowtie/*.fastq #clean the .fastq files

#Count gene features with HTSeq

cp ~/$NEWFOLD/Bowtie/align*.sam ~/$NEWFOLD/HTSeq
cp /mnt/research/mmg434/LreuteriRNAseqData/RefGenomeFiles/alignlreuterijcm1112.gtf ~/$NEWFOLD/HTSeq
cd ~/$NEWFOLD/HTSeq

for renamealign in ~/$NEWFOLD/HTSeq/*.sam
do
	LESSPREFIX=`basename ${renamealign//align/}`
	mv $renamealign ~/$NEWFOLD/HTSeq/$LESSPREFIX
done

for counts in ~/$NEWFOLD/HTSeq/*.sam
do
	FILENAME=`basename ${counts%.*}`
	PREFIX=map
	NEWMAPFILE=${PREFIX}${FILENAME}
	htseq-count -m intersection-nonempty --stranded=reverse $FILENAME.sam alignlreuterijcm1112.gtf > $NEWMAPFILE.sam
done

rm ~/$NEWFOLD/HTSeq/LR*.sam	#clean the alignment files

This is a lot to take in! All this script is doing is linking all of the steps together and analyzing the data in one fell swoop for us. But! We NEED to change one thing in the script, depending on the comparison we are making.

Let’s grab the script file:

cp /mnt/research/mmg434/LreuteriRNAseqData/LreuteriAnalysisScript.sh ~

Now, let’s open it in the nano text editor:

nano LreuteriAnalysisScript.sh

It will look something like the script above. You can’t use your mouse here and must use your arrow keys to get around. Let’s read through the beginning of the script and see what we need to change.

Now, we will move down through the script and change the name of the folder containing the appropriate comparison (unless you happen to be in the LB vs Commensal E. coli conditioned medium treated group!).

To save the changes, hit Ctrl + o, followed by the Enter/Return key.

To exit nano, hit Ctrl + x

To run the script:

module load powertools
intel14-phi
module load powertools
cd ~
qsub LreuteriAnalysisScript.sh

Hooray! The script has been submitted to the queue to run. Write down your job ID.