Comparison of the transcriptomes of two tardigrades with different hatching coordination
https://bmcdevbiol.biomedcentral.com/articles/10.1186/s12861-019-0205-9
The authors included data from a previous publication that compared the genomes of these two species:
Comparative genomics of the tardigrades Hypsibius dujardini and Ramazzottius varieornatus
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2002266
In Assignments 9 and 10, we will use these publicly available data to conduct our own gene expression analyses. Each student will analyze part of the RNA-seq dataset to try to answer these questions:
For each species:
- How many genes does it have?
- How many genes are annotated in the genome?
- How many genes do we find orthologous between the two genomes?
- How many genes are expressed in different stages of development?
- Do we identify novel genes that are not yet annotated?
- Are there any genes that are expressed in every developmental stage (e.g. housekeeping)?
- Are there any genes that are not expressed in any of the different stages?
- What is the average number of stages that genes are expressed in?
- Which genes are most highly expressed in each stage and on average?
- Which stage has the most genes expressed, and the highest average expression?
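Several of the questions above reduce to simple tallies over a gene-by-stage expression table. The sketch below is only an illustration of that bookkeeping: the gene IDs, stage names, values, and the cutoff for calling a gene "expressed" are all invented assumptions, not values from the papers or from this course.

```python
# Toy expression table: gene -> normalized expression per stage.
# Gene IDs, stage names, and values are invented for illustration only.
expression = {
    "geneA": {"stage1": 12.0, "stage2": 30.5, "stage3": 8.2},
    "geneB": {"stage1": 0.0, "stage2": 0.4, "stage3": 5.1},
    "geneC": {"stage1": 0.0, "stage2": 0.0, "stage3": 0.0},
    "geneD": {"stage1": 2.0, "stage2": 0.0, "stage3": 0.0},
}
THRESHOLD = 1.0  # assumed cutoff for calling a gene "expressed"


def stages_expressed(values, threshold=THRESHOLD):
    """Return the stages in which one gene meets the expression cutoff."""
    return [stage for stage, value in values.items() if value >= threshold]


# Number of stages each gene is expressed in.
stage_counts = {gene: len(stages_expressed(v)) for gene, v in expression.items()}
n_stages = len(next(iter(expression.values())))

# Genes expressed in every stage (housekeeping candidates) and in no stage.
in_every_stage = sorted(g for g, n in stage_counts.items() if n == n_stages)
in_no_stage = sorted(g for g, n in stage_counts.items() if n == 0)

# Average number of stages a gene is expressed in.
mean_stages = sum(stage_counts.values()) / len(stage_counts)

# Number of expressed genes per stage.
genes_per_stage = {}
for values in expression.values():
    for stage in stages_expressed(values):
        genes_per_stage[stage] = genes_per_stage.get(stage, 0) + 1

print(in_every_stage, in_no_stage, mean_stages, genes_per_stage)
```

The answers depend on the cutoff you choose, so state your threshold when you report these numbers.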
For your own differential expression (DE) comparisons:
- How is your gene of interest expressed compared to the average, and is it what you expected?
- Which stages have the most DE genes between them? Is this what you expected?
- What GO terms are enriched among DE genes (all vs up- vs down-regulated genes)?
- How many DE genes are in common between the species (i.e. orthologous DE genes)?
- How many enriched GO terms are in common between the species?
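Counting what is shared between the species comes down to set operations once you have an ortholog table. A minimal sketch, assuming a one-to-one ortholog mapping; all gene IDs and GO term labels below are hypothetical placeholders rather than real identifiers.

```python
# Hypothetical one-to-one ortholog pairs: species 1 gene ID -> species 2 gene ID.
orthologs = {"sp1_g1": "sp2_g7", "sp1_g2": "sp2_g3", "sp1_g5": "sp2_g9"}

# Hypothetical DE gene sets from each species' own stage comparison.
de_species1 = {"sp1_g1", "sp1_g2", "sp1_g4"}
de_species2 = {"sp2_g7", "sp2_g8", "sp2_g9"}

# An ortholog pair counts as shared only if both members are DE.
shared_de = sorted(
    (g1, g2)
    for g1, g2 in orthologs.items()
    if g1 in de_species1 and g2 in de_species2
)

# Enriched GO terms can be compared directly because the term IDs are
# species-independent (placeholder labels used here).
go_species1 = {"GO_term_A", "GO_term_B", "GO_term_C"}
go_species2 = {"GO_term_B", "GO_term_C", "GO_term_D"}
shared_go = sorted(go_species1 & go_species2)

print(len(shared_de), shared_de)
print(len(shared_go), shared_go)
```

Real ortholog tables often contain one-to-many relationships, in which case a plain dictionary like this one is not sufficient.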
Before starting this assignment, answer the following questions and submit your answers on Blackboard under “Tardigrade (Assignment 10) data outline”:
- Choose which developmental stages to contrast (in both species). Tell me which 2 stages you chose.
- Choose two genes you will be tracking to see how they are expressed among the chosen stages. Tell me which genes these are – provide both gene names and their accession numbers in both species, and tell me your rationale behind choosing these two genes.
- What do you expect you will find in terms of gene expression among these stages and genes and why? Will they differ between species? What GO terms are associated with the genes you chose?
Some resources: https://www.ncbi.nlm.nih.gov/genome and http://ensembl.tardigrades.org/index.html
Assignment 9 – Comparative Transcriptomics part I (5 pts)
Once your dataset is chosen and approved by me, you will upload and process your data in Galaxy. Remember: running these jobs can involve a lot of trial and error and long run times. Put your accession numbers in two text files (one accession number per line): one file for the first species and one for the second. This makes it easy to upload all files in batch mode. It is easier to run your analysis from start to finish for one species and then reapply the exact same workflow to the second species (e.g. https://galaxyproject.org/learn/advanced-workflow/extract/).
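If you prefer to build the accession files with a script rather than by hand, something like the sketch below would work. The accession strings are placeholders, not real run IDs from the study, and the function name and file name are made up for this example.

```python
from pathlib import Path
import tempfile


def write_accession_list(path, accessions):
    """Write one accession per line, dropping blanks and duplicates in order."""
    kept = []
    for acc in accessions:
        acc = acc.strip()
        if acc and acc not in kept:
            kept.append(acc)
    Path(path).write_text("\n".join(kept) + "\n")
    return kept


# Placeholder accessions, NOT real run IDs from the study.
species1_runs = ["SRR0000001", "SRR0000002 ", "", "SRR0000001"]

# Written to a temporary folder here; use a real path in practice.
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "species1_accessions.txt"
    kept = write_accession_list(out_file, species1_runs)
    written = out_file.read_text()

print(written)
```

Repeat with a second list and a second file name for the other species.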
Make your first Galaxy history for the species of your choice and name the history appropriately. Upload the accession text file, the compressed genome file (.fna.gz), and compressed annotation file (.gff.gz) for the corresponding species. Answer the questions below for the first species, then when you are done, redo all the steps and answer all the questions for the second species. Make sure to tell me which species you start with.
1 Upload data. Use “Faster Download and Extract Reads” with your text file. Notice this will make a “Collection” of files, so you can run operations on the entire ‘batch’ all at the same time.
2 Run FastQC on the collection of FASTQ files.
3 Run Trim Galore on the collection of FASTQ files – make sure to select from the advanced settings “Yes” to “Generate a report file”. This report output will be used in the MultiQC step below.
4 Run FastQC on the collection of Trim Galore FASTQ files.
5 Run RNA STAR on the Trim Galore collection. You need to select the genome you uploaded with the gene-model (gff file), and choose “Per gene read counts (GeneCounts)”.
6 Run Samtools stats to gather statistics on your STAR output BAM files.
7 Run featureCounts using the BAM files. Our libraries should be unstranded. featureCounts expects the annotation in GTF format (not GFF) and will give an error if you pass it the annotation file you uploaded (.gff.gz), so first run gffread to convert that file from GFF to GTF. Then use the converted file as the gene annotation file, and select Yes to create a gene-length file.
8 Run MultiQC to summarize the results from four of the steps above: steps 2, 3, 5, and 7.
9 Download the featureCounts Counts and lengths. These will be imported into R in the next assignment.
10 Share your finalized Galaxy histories with me.
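The counts and gene lengths downloaded in step 9 are what make length-aware normalization possible. As one illustration (not necessarily the method we will use in R in the next assignment), the sketch below computes transcripts per million (TPM) for a single sample. It assumes each file is a two-column, tab-separated table with a header row (gene ID plus count, gene ID plus length); check your own downloads, since the exact layout is an assumption here, and the toy values are invented.

```python
import csv
import io

# Toy stand-ins for the two downloaded tables (values are invented).
counts_text = "Geneid\tsample1\ngeneA\t100\ngeneB\t300\ngeneC\t0\n"
lengths_text = "Geneid\tLength\ngeneA\t1000\ngeneB\t3000\ngeneC\t500\n"


def read_two_column_table(text):
    """Parse a tab-separated table with a header row into {gene: value}."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header row
    return {row[0]: float(row[1]) for row in reader if row}


def tpm(counts, lengths):
    """Transcripts per million: length-normalize, then scale to 1e6."""
    # Reads per kilobase of gene length.
    rate = {gene: counts[gene] / (lengths[gene] / 1000.0) for gene in counts}
    total = sum(rate.values())
    return {g: (r / total * 1e6 if total else 0.0) for g, r in rate.items()}


counts = read_two_column_table(counts_text)
lengths = read_two_column_table(lengths_text)
tpm_values = tpm(counts, lengths)
print(tpm_values)
```

In this toy example geneB has three times the reads of geneA but is also three times as long, so both end up with the same TPM.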
For each species:
Provide a screenshot of your entire desktop screen showing the general statistics output from MultiQC in your Galaxy browser. Your submission will not be graded without this screenshot, and it will not be graded unless you also share your Galaxy histories with me.
In a sentence or two, answer these questions for each species:
- What developmental stages are you analyzing?
- Describe the mean quality of the data in lay terms. 0.5 pts
- What is the approximate length of most reads after trimming? 0.5 pts
- After looking at the FastQC metrics, are there any samples in particular that you are worried about (e.g. that you would consider excluding from analysis)? Why or why not? 1 pt
- What is the approximate average % read duplication, and is this surprising given your dataset (why/why not)? 1 pt
- Describe the mapping results and the variance among samples, and say whether you are satisfied with this result (and why or why not). Also provide a screenshot of the STAR Alignment Scores from MultiQC. 1 pt
- Interpret the Assigned Reads vs % Assigned Reads: explain what these numbers represent and how the % was calculated. 1 pt
Bonus: Which do you think is more important in determining the success of your sequencing, Assigned or Aligned reads, and why? 0.5 pts