Analyzing L. reuteri dataΒΆ
We have our data! Hooray! Now, we can go through and analyze all of it, from Trimmomatic to HTSeq. We are only going to take it through HTSeq and then do the differential gene expression interactively on our own machines. This is simply because it’s nice to work with everything on our own machines because we will be generating plots and may have to iteratively do the differential gene expression analysis based on the results/plots.
To this point, we have worked with a small subset of example RNA-seq data (100,000 reads per sample). Now, we will be working with full datasets (~15 million reads per sample). Thus, we are going to submit the jobs to the MSU HPCC to run for the next 20 hours or so.
Let’s have a look at the raw data that we received from the MSU RTSF. We sequenced 24 samples but got 48 files back! Why did we get twice the data back? This is due to the limitations of the sequencer itself. The particular flowcell for our data has 2 lanes. Each lane generates around 150 million reads total and we want around 15 million reads per sample. Thus, 24 samples * 15 million reads = 360 million total reads. This is more than twice what a single lane can produce for us. Thus, we pooled all of our samples together and ran them on both lanes to achieve the desired read number. So in order to work with this data, we need to combine the reads from both lanes together into a single file. However, to do this effectively, we need to decompress the files and then merge them.
If you are interested in what I did to get to this point, the commands are as follows in pseudocode.
Decompress files: gunzip -c file_I_want_to_decompress.fastq.gz > decompressed_file.fastq
Merge files: cat sample1_read_data_from_lane2.fastq >> sample1_read_data_from_lane1.fastq
Recompress files: gzip -c file_I_want_to_compress.fastq > compressed_file.fastq.gz
I have already done all of this for you and grouped the merged and recompressed data into folders corresponding to different treatment comparisons:
LB vs Indole treated L. reuteri
LB vs Commensal E. coli conditioned medium treated L. reuteri
LB vs EHEC conditioned medium treated L. reuteri
These folders are located in /mnt/research/mmg434/LreuteriRNAseqData/RawData
Dr. Viswanathan has put together a list of who is doing which comparisons.
| LAST_NAME | FIRST_NAME | Lbcontrol/ | Indole/ | E.coli conditioned | EHEC conditioned |
| Sample # | Sample # | medium/Sample # | medium/ Sample # | ||
| Asopa | Mrinal | A1 | B1 | ||
| Babiker | Leena | A2 | B2 | ||
| Biernat | Emily Rachel | A3 | B3 | ||
| Cermak | Megan Molloy | A4 | B4 | ||
| Collins | Sasha Lauren | A1 | B5 | ||
| Freiberger | Katharina Marianne | A2 | B6 | ||
| Ginzburg | Daniel | A3 | C1 | ||
| Griffin | Caleigh Ellen | A4 | C2 | ||
| Jones | Ashley Sade | A1 | C3 | ||
| Kaminski | Theresa Marie | A2 | C4 | ||
| Li | Irene | A3 | C5 | ||
| Macko | Haley Marie | A4 | C6 | ||
| Rahn | Christopher Louis Von | ||||
| Reske | Jake Jordan | A1 | D1 | ||
| Stawkey | Danielle Marie | A2 | D2 | ||
| Struble | Nicole Danielle | A3 | D3 | ||
| Tencer | Joel Ira | A4 | D4 | ||
| Vanalst | Andrew John | A1 | D5 | ||
| Wolfson | Abigail Esther | A2 | D6 |
Now, since time is of the essence, I have written a script to do the analysis for us! Let’s have a look at it:
#PBS -l walltime=20:00:00,nodes=1:ppn=4,mem=24gb
|----------------------------------------------------------------------------|
|This script was written *specifically* for the MMG434 course to analyze the |
|L. reuteri RNA-seq data generated by the MSU RTSF. This series of processes |
|will generate a folder called 'RNAseq_(currentdate)' in your home (~) direc-|
|tory. Within that folder a series of sub-folders are made to organize the |
|data and analysis results. The general workflow is Trimmomatic -> FastQC -> |
|Bowtie -> HTSeq. After this has been run, you can download the map.sam files|
|and analyze them in your favorite differential gene expression software pac-|
|age (e.g. edgeR). To use this script, you will need to change the data loca-|
|tion and then save it (e.g. use nano or another text editor to make the cha-|
|nges). To run this script, follow the tutorial at mmg434.readthedocs.org. |
|----------------------------------------------------------------------------|
#change to home directory on HPCC
cd ~
#check if the folder RNAseq_(current date) exists and if not, make the folder
NEWFOLDNAME=RNAseq
DATE=`date +%d-%m-%y`
NEWFOLD=${NEWFOLDNAME}_${DATE}
#if the folder already exists, make a new one with a _number added
n=1 #start with the number as 1
while [ -d ~/$NEWFOLD ] #while the folder exists, do the commands below
do
NEWFOLD=${NEWFOLDNAME}_${DATE}_${n} #add the number to the folder name
n=$((n+1)) #increment n by 1
done
mkdir $NEWFOLD
cd ~/$NEWFOLD
#make several folders within this RNAseq_(current date) folder to organize data
mkdir TrimmedData FastQCreports Bowtie HTSeq edgeR
#load the appropriate modules for all of the analysis steps
module load Trimmomatic/0.32
module load FastQC/0.11.2
module load bowtie/1.0.0
module load HTSeq/0.6.1
#Copy the data into the TrimmedData folder
#DEPENDING ON YOUR COMPARISON, YOU NEED TO CHANGE THE LBvsCOMM FOLDER LOCATION TO EITHER
#LBvsEHEC OR LBvsIndole
#IF YOU DO NOT CHANGE IT, YOU WILL RUN THE ANALYSIS FOR THE LBvsComm COMPARISON
cp /mnt/research/mmg434/LreuteriRNAseqData/RawData/LBvsComm/*.fastq.gz ~/$NEWFOLD/TrimmedData
#Do the trimming with Trimmomatic
cd ~/$NEWFOLD/TrimmedData
for untrimmedreads in ~/$NEWFOLD/TrimmedData/*
do
FILENAME=`basename ${untrimmedreads%.*.*}`
PREFIX=trimmed
NEWTRIMFILE=${PREFIX}${FILENAME}
java -jar $TRIM/trimmomatic SE -threads 4 $untrimmedreads $NEWTRIMFILE.fastq.gz ILLUMINACLIP:$TRIM/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
done
#Generate the FastQC reports
cp ~/$NEWFOLD/TrimmedData/trimmed*.fastq.gz ~/$NEWFOLD/FastQCreports
cd ~/$NEWFOLD/FastQCreports
for trimmedreads in ~/$NEWFOLD/FastQCreports/trimmed*.fastq.gz
do
fastqc $trimmedreads
done
rm trimmed*.fastq.gz
#Decompress the reads files and move them to the Bowtie directory
cd ~/$NEWFOLD/TrimmedData
cp trimmed*.fastq.gz ~/$NEWFOLD/Bowtie
for compfiles in ~/$NEWFOLD/Bowtie/*.fastq.gz
do
FILENAME=${compfiles%.*.*}
gunzip -c $FILENAME.fastq.gz > $FILENAME.fastq
done
cd ~/$NEWFOLD/Bowtie
rm *.fastq.gz #clean the .gz files
#Align the reads with Bowtie to the reference genome
cd ~/$NEWFOLD/Bowtie
bowtie-build /mnt/research/mmg434/LreuteriRNAseqData/RefGenomeFiles/trimmedlreuterijcm1112.fa trimmedlreuterijcm1112
for renametrim in ~/$NEWFOLD/Bowtie/*.fastq
do
LESSPREFIX=`basename ${renametrim//trimmed/}`
mv $renametrim ~/$NEWFOLD/Bowtie/$LESSPREFIX
done
for alignment in ~/$NEWFOLD/Bowtie/*.fastq
do
FILENAME=`basename ${alignment%.*}`
PREFIX=align
NEWALIGNFILE=${PREFIX}${FILENAME}
bowtie -S -p 3 trimmedlreuterijcm1112 $FILENAME.fastq > $NEWALIGNFILE.sam
done
rm ~/$NEWFOLD/Bowtie/*.fastq #clean the .fastq files
#Count gene features with HTSeq
cp ~/$NEWFOLD/Bowtie/align*.sam ~/$NEWFOLD/HTSeq
cp /mnt/research/mmg434/LreuteriRNAseqData/RefGenomeFiles/alignlreuterijcm1112.gtf ~/$NEWFOLD/HTSeq
cd ~/$NEWFOLD/HTSeq
for renamealign in ~/$NEWFOLD/HTSeq/*.sam
do
LESSPREFIX=`basename ${renamealign//align/}`
mv $renamealign ~/$NEWFOLD/HTSeq/$LESSPREFIX
done
for counts in ~/$NEWFOLD/HTSeq/*.sam
do
FILENAME=`basename ${counts%.*}`
PREFIX=map
NEWMAPFILE=${PREFIX}${FILENAME}
htseq-count -m intersection-nonempty --stranded=reverse $FILENAME.sam alignlreuterijcm1112.gtf > $NEWMAPFILE.sam
done
rm ~/$NEWFOLD/HTSeq/LR*.sam #clean the alignment files
This is a lot to take in! All this script is doing is linking all of the steps together and analyzing the data in one fell swoop for us. But! We NEED to change one thing in the script, depending on the comparison we are making.
Let’s grab the script file:
cp /mnt/research/mmg434/LreuteriRNAseqData/LreuteriAnalysisScript.sh ~
Now, let’s open it in the nano text editor:
nano LreuteriAnalysisScript.sh
It will look something like the script above. You can’t use your mouse here and must use your arrow keys to get around. Let’s read through the beginning of the script and see what we need to change.
Now, we will move down through the script and change the name of the folder containing the appropriate comparison (unless you happen to be in the LB vs Commensal E. coli conditioned medium treated group!).
To save the changes, hit Ctrl + o, followed by the Enter/Return key.
To exit nano, hit Ctrl + x
To run the script:
module load powertools
intel14-phi
module load powertools
cd ~
qsub LreuteriAnalysisScript.sh
Hooray! The script has been submitted to the queue to run. Write down your job ID.