medinari/nanopore_analysis

Oxford Nanopore Technologies (ONT) sequencing: data analysis

Creating the project directory and linking the ONT fastq.gz data

Create the project folder and links to the raw FASTQ (fastq.gz) files:

Create project folders

# create folders for compressed raw data, and decompressed raw data 
mkdir -p ~/nanopore_analysis/{1_raw_data,2_fastq_decompressed}
# create folders for QC
mkdir -p ~/nanopore_analysis/{3_fastqc,4_fastp,5_minionqc}
mkdir -p ~/nanopore_analysis/{6_nanofilt,7_nanoplot}
# create folders for Assembly
mkdir -p ~/nanopore_analysis/{9_metaflye,10_raven,11_canu}
# create folders for verification tools Assembly 
mkdir -p ~/nanopore_analysis/{12_metaquast,13_bandage}
# show definitive folder structure
tree ~/nanopore_analysis

Create links to FASTQ raw data

IMPORTANT: the Dorado tool is needed to convert pod5 to fastq, but it is unavailable on the server, so the fastq files provided were used instead. After creating the project folder, we created links from the source “fastq_pass” folder, which contains the barcode folders, to the destination folder ~/nanopore_analysis/1_raw_data:

# create links to raw data (all the barcode directories with fastq.gz files stored in fastq_pass)
ln -s /group/lectures/DTAN25data/BIOINF26_Gruppe2/no_sample_id/20250219_2052_MN35031_FBA50370_f12dc3bb/fastq_pass/* ~/nanopore_analysis/1_raw_data
# work inside the link folder so that the clean-up below only touches the links
cd ~/nanopore_analysis/1_raw_data
# remove non-relevant links (barcodes 01-08, barcodes 17-24 and unclassified); keep barcode09-16
ls | grep -vE '^barcode(09|1[0-6])$' | xargs -r rm --
# now we have access to the barcode folders
ls -l
# and to the fastq.gz files inside, as in this summary:
for folder in barcode09 barcode1{0..6}; do
  count=$(ls "$folder"/*.gz 2>/dev/null | wc -l) # count the .gz files in the folder
  echo "Folder: $folder ($count gz files)"
done
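
As an alternative to linking everything and deleting afterwards, the selection can be made while linking. The following is a minimal sketch, not the method used in this project: the function name link_barcodes and its arguments are hypothetical, and it assumes the barcode folders are named barcodeNN with two digits.

```shell
# link_barcodes SRC_DIR DEST_DIR FIRST LAST   (hypothetical helper)
# Creates symbolic links only for barcodeFIRST..barcodeLAST; pass plain numbers (9, not 09).
link_barcodes() {
  lb_src=$1; lb_dest=$2; lb_n=$3; lb_last=$4
  mkdir -p "$lb_dest"
  while [ "$lb_n" -le "$lb_last" ]; do
    lb_name=$(printf 'barcode%02d' "$lb_n")
    # link the directory only if it exists in the source run folder
    if [ -d "$lb_src/$lb_name" ]; then
      ln -sfn "$lb_src/$lb_name" "$lb_dest/$lb_name"
    fi
    lb_n=$((lb_n + 1))
  done
}

# Example call, with FASTQ_PASS set to the fastq_pass folder of the run:
# link_barcodes "$FASTQ_PASS" ~/nanopore_analysis/1_raw_data 9 16
```

With this approach no rm is needed, so nothing outside the link folder can be removed by accident.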

Decompressing fastq.gz Files

Decompress all the fastq.gz files (this takes less than 10 min):

# decompress all fastq.gz files of each barcode into one combined.fastq
cd ~/nanopore_analysis/1_raw_data
for barcode in barcode*; do
  mkdir -p ../2_fastq_decompressed/"$barcode"
  gunzip -c "$barcode"/*.fastq.gz > ../2_fastq_decompressed/"$barcode"/combined.fastq
done
ls -hs ~/nanopore_analysis/2_fastq_decompressed/*/*
# check that the decompression worked (first two reads of barcode09):
# head -n 8 ~/nanopore_analysis/2_fastq_decompressed/barcode09/combined.fastq
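
A quick sanity check on a decompressed file is to count its reads. The sketch below assumes plain four-line FASTQ records (header, sequence, "+", qualities) without wrapped sequence lines; the function name count_fastq_reads is hypothetical.

```shell
# count_fastq_reads FILE   (hypothetical helper)
# Prints the number of reads; fails if the line count is not a multiple of 4.
count_fastq_reads() {
  cfr_lines=$(wc -l < "$1" | tr -d ' ')
  if [ $((cfr_lines % 4)) -ne 0 ]; then
    echo "ERROR: $1 has $cfr_lines lines (not a multiple of 4)" >&2
    return 1
  fi
  echo $((cfr_lines / 4))
}

# Example call:
# count_fastq_reads ~/nanopore_analysis/2_fastq_decompressed/barcode09/combined.fastq
```

The number printed here should agree with the "Total Sequences" value that FastQC reports for the same file.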

Data Exploration (reference pdf)

input: .fastq file or sequencing_summary.txt

output: different plots and/or a summary

In accordance with the results of the initial QC different filter parameters will be defined:

  • trimming of read ends
  • adapter/barcode removal
  • filtering by mean read quality
  • filtering by read length

fastQC

example with only barcode09

# fastqc: - requirements (output directory must exist)
cd ~/nanopore_analysis/2_fastq_decompressed
for barcode in barcode09; do
  # mkdir -p ~/nanopore_analysis/3_fastqc/"$barcode"
  # fastqc "$barcode"/combined.fastq \
    #        --outdir ~/nanopore_analysis/3_fastqc/"$barcode" \
    #        --extract \
    # 	   --threads 4
  tree ~/nanopore_analysis/3_fastqc/"$barcode"
done

  1. Result: all the FastQC results have the same structure:

/home/rmedina/nanopore_analysis/3_fastqc/barcode09
├── combined_fastqc
│   ├── fastqc_data.txt
│   ├── fastqc.fo
│   ├── fastqc_report.html
│   ├── Icons
│   │   ├── error.png
│   │   ├── fastqc_icon.png
│   │   ├── tick.png
│   │   └── warning.png
│   ├── Images
│   │   ├── adapter_content.png
│   │   ├── duplication_levels.png
│   │   ├── per_base_n_content.png
│   │   ├── per_base_quality.png
│   │   ├── per_base_sequence_content.png
│   │   ├── per_sequence_gc_content.png
│   │   ├── per_sequence_quality.png
│   │   └── sequence_length_distribution.png
│   └── summary.txt
├── combined_fastqc.html
└── combined_fastqc.zip

3 directories, 18 files

  2. Basic Statistics (fastqc_data.txt) and module verdicts (summary.txt)

The Basic Statistics module provides key data about the analyzed file:

  • Filename: Name of the analyzed file.
  • File type: Base calls or colorspace.
  • Encoding: ASCII quality format.
  • Total Sequences: Number processed (actual/estimated).
  • Filtered Sequences: Removed sequences (Casava mode).
  • Sequence Length: Shortest to longest range.
  • %GC: Percentage of guanine and cytosine bases.

For barcode09, the PASS/WARN/FAIL verdict of each module is listed in summary.txt:

cat ~/nanopore_analysis/3_fastqc/barcode09/combined_fastqc/summary.txt

RESULTS:

PASS	Basic Statistics	combined.fastq
FAIL	Per base sequence quality	combined.fastq
PASS	Per sequence quality scores	combined.fastq
FAIL	Per base sequence content	combined.fastq
PASS	Per sequence GC content	combined.fastq
PASS	Per base N content	combined.fastq
WARN	Sequence Length Distribution	combined.fastq
PASS	Sequence Duplication Levels	combined.fastq
PASS	Overrepresented sequences	combined.fastq
PASS	Adapter Content	combined.fastq

We will focus on the modules flagged FAIL or WARN:

FAIL	Per base sequence quality	combined.fastq
FAIL	Per base sequence content	combined.fastq
WARN	Sequence Length Distribution	combined.fastq
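
The same selection can be pulled out of summary.txt automatically. This is a small sketch that assumes the file is tab-separated with the columns status, module and file name, as in the listing above; the function name flagged_modules is hypothetical.

```shell
# flagged_modules FILE   (hypothetical helper)
# Prints "STATUS<TAB>module" for every module whose status is not PASS.
flagged_modules() {
  awk -F '\t' '$1 != "PASS" { print $1 "\t" $2 }' "$1"
}

# Example call for every barcode:
# for d in ~/nanopore_analysis/3_fastqc/barcode*; do
#   echo "== $(basename "$d")"
#   flagged_modules "$d/combined_fastqc/summary.txt"
# done
```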

Generate the FastQC reports for the remaining barcodes (barcode10 to barcode16); shown here for barcode10:

cd ~/nanopore_analysis/2_fastq_decompressed
for barcode in barcode10; do
  mkdir -p ~/nanopore_analysis/3_fastqc/"$barcode"
  fastqc "$barcode"/combined.fastq \
         --outdir ~/nanopore_analysis/3_fastqc/"$barcode" \
         --extract \
	 --threads 4
done
printf '\n==> summary.txt:\n'
cat ~/nanopore_analysis/3_fastqc/barcode10/combined_fastqc/summary.txt
printf '\n--> fastqc_data.txt:\n'
head ~/nanopore_analysis/3_fastqc/barcode10/combined_fastqc/fastqc_data.txt

RESULTS:

==> summary.txt:
PASS	Basic Statistics	combined.fastq
FAIL	Per base sequence quality	combined.fastq
PASS	Per sequence quality scores	combined.fastq
FAIL	Per base sequence content	combined.fastq
FAIL	Per sequence GC content	combined.fastq
PASS	Per base N content	combined.fastq
WARN	Sequence Length Distribution	combined.fastq
PASS	Sequence Duplication Levels	combined.fastq
PASS	Overrepresented sequences	combined.fastq
PASS	Adapter Content	combined.fastq

--> fastqc_data.txt:
##FastQC	0.11.9
>>Basic Statistics	pass
#Measure	Value
Filename	combined.fastq
File type	Conventional base calls
Encoding	Sanger / Illumina 1.9
Total Sequences	729889
Sequences flagged as poor quality	0
Sequence length	61-225281
%GC	36

  • Per base sequence quality: FAIL

Cause: sequencing chemistry degrades with increasing read length and over long runs.
Solution: trim the read ends and filter by mean read quality, as listed in the filter parameters above.

ls ~/nanopore_analysis/3_fastqc/barcode09/combined_fastqc/fastqc_report.html
scp -r -P 1722 bioinf02:/home/rmedina/nanopore_analysis/3_fastqc/barcode09/combined_fastqc/fastqc_report.html /home/riccardo

generate fastqc of all barcodes

cd ~/nanopore_analysis/2_fastq_decompressed
for barcode in barcode09 barcode1{0..6}; do
  mkdir -p ~/nanopore_analysis/3_fastqc/"$barcode"
  fastqc "$barcode"/combined.fastq \
         --outdir ~/nanopore_analysis/3_fastqc/"$barcode" \
         --extract \
	 --threads 4
done

check total sequences and the most relevant information from fastqc_data.txt

Total sequences in the FastQC files

Path to the fastqc_data.txt file:

/home/rmedina/nanopore_analysis/3_fastqc/barcode09
├── combined_fastqc
│   ├── fastqc_data.txt

summary table:

cd /home/rmedina/nanopore_analysis/3_fastqc
for barcode in barcode*
do
  # print the barcode name, then the "Total Sequences" line of its report
  printf '%s:   ' "$barcode"
  grep '^Total' "$barcode"/combined_fastqc/fastqc_data.txt
done

RESULTS:

barcode09:   Total Sequences	1065437
barcode10:   Total Sequences	729889
barcode11:   Total Sequences	1217908
barcode12:   Total Sequences	667557
barcode13:   Total Sequences	407956
barcode14:   Total Sequences	83556
barcode15:   Total Sequences	735701
barcode16:   Total Sequences	1156564

A table with more detailed information from fastqc_data.txt:

  • Encoding
  • Total Sequences
  • Sequences flagged as poor quality
  • Sequence length
  • %GC

cd ~/nanopore_analysis/2_fastq_decompressed
# print the horizontal separator of the table
print_separator() {
    printf "|-------------------------------------+-----------------------|\n"
}
for barcode in barcode09 barcode1{0..6}; do
  fastqc_file="../3_fastqc/${barcode}/combined_fastqc/fastqc_data.txt"

  # print the table header for this barcode
  print_separator
  printf "| dir: %-30s |  file: %-14s |\n" "${barcode}" "combined.fastq"
  print_separator

  # if the report exists, print lines 6-10 (Encoding to %GC) as table rows
  if [[ -f "$fastqc_file" ]]; then
    sed -n '6,10p' "$fastqc_file" | while IFS=$'\t' read -r measure value; do
      printf "| %-35s | %-21s |\n" "$measure" "$value"
    done
  else
    printf "| %-35s | %-21s |\n" "File Missing" "N/A"
  fi
done
print_separator

RESULTS:

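|-------------------------------------+-----------------------|
| dir: barcode09                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 1065437               |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 29-391635             |
| %GC                                 | 42                    |
|-------------------------------------+-----------------------|
| dir: barcode10                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 729889                |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 61-225281             |
| %GC                                 | 36                    |
|-------------------------------------+-----------------------|
| dir: barcode11                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 1217908               |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 64-305294             |
| %GC                                 | 44                    |
|-------------------------------------+-----------------------|
| dir: barcode12                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 667557                |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 44-259254             |
| %GC                                 | 44                    |
|-------------------------------------+-----------------------|
| dir: barcode13                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 407956                |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 42-380798             |
| %GC                                 | 40                    |
|-------------------------------------+-----------------------|
| dir: barcode14                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 83556                 |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 59-242572             |
| %GC                                 | 43                    |
|-------------------------------------+-----------------------|
| dir: barcode15                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 735701                |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 29-397562             |
| %GC                                 | 69                    |
|-------------------------------------+-----------------------|
| dir: barcode16                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 1156564               |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 59-241224             |
| %GC                                 | 45                    |
|-------------------------------------+-----------------------|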
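
For comparing barcodes side by side, one row per barcode can be easier to read than one block per barcode. The sketch below looks measures up by name inside the Basic Statistics module instead of relying on fixed line numbers. It assumes the module is delimited by ">>Basic Statistics" and ">>END_MODULE" lines and that fields are tab-separated; the function names basic_stat and fastqc_overview are hypothetical.

```shell
# basic_stat FILE MEASURE   (hypothetical helper)
# Prints the value of one measure from the Basic Statistics module.
basic_stat() {
  awk -F '\t' -v key="$2" '
    /^>>Basic Statistics/  { in_module = 1; next }
    /^>>END_MODULE/        { in_module = 0 }
    in_module && $1 == key { print $2; exit }
  ' "$1"
}

# fastqc_overview DIR   (hypothetical helper)
# Prints a tab-separated table with one row per barcode directory under DIR.
fastqc_overview() {
  printf 'barcode\ttotal_sequences\tsequence_length\tgc_percent\n'
  for fo_dir in "$1"/barcode*; do
    fo_file="$fo_dir/combined_fastqc/fastqc_data.txt"
    [ -f "$fo_file" ] || continue   # skip barcodes without a report
    printf '%s\t%s\t%s\t%s\n' "$(basename "$fo_dir")" \
      "$(basic_stat "$fo_file" 'Total Sequences')" \
      "$(basic_stat "$fo_file" 'Sequence length')" \
      "$(basic_stat "$fo_file" '%GC')"
  done
}

# Example call:
# fastqc_overview ~/nanopore_analysis/3_fastqc
```

In such a table the 69 %GC of barcode15 stands out against the 36 to 45 %GC of the other barcodes.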

QC: fastp

# cd ~/nanopore_analysis/0_scripts
mkdir -p ~/nanopore_analysis/4_fastp
cd $_
wget http://opengene.org/fastp/fastp
chmod a+x ./fastp
ls -hs fastp

QC: MinIONQC

R script

  • input: sequencing_summary.txt
  • output:
    • summary.yaml
    • different plots
  • combining different data sets for possible comparison

download Rscript minion_qc

mkdir -p ~/nanopore_analysis/5_minionqc
cd ~/nanopore_analysis/5_minionqc

curl https://raw.githubusercontent.com/roblanf/minion_qc/master/MinIONQC.R > MinIONQC.R

In this project, the 433 .fastq files contained in each barcode folder were joined into a single file called combined.fastq. As a result, each barcode folder contains only one sequencing_summary.txt, which is used as input for minion_qc.

run the script

cd ~/nanopore_analysis/5_minionqc
# $HOME is used instead of ~ because the shell does not expand ~ after --option=
Rscript MinIONQC.R \
	--input="$HOME"/nanopore_analysis/3_fastqc \
	--output="$HOME"/nanopore_analysis/5_minionqc \
	--processors=4 \
	--qscore_cutoff=7 \
	--format=tiff \
	--smallfigures=TRUE
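
Since MinIONQC works from sequencing_summary.txt files, it may be worth confirming that the input directory actually contains one before starting the run. The following is a minimal sketch; the function name list_sequencing_summaries is hypothetical, and the file name pattern sequencing_summary*.txt is an assumption about how the summaries are named on disk.

```shell
# list_sequencing_summaries DIR   (hypothetical helper)
# Prints every sequencing_summary*.txt below DIR; returns non-zero if there is none.
list_sequencing_summaries() {
  lss_found=$(find "$1" -type f -name 'sequencing_summary*.txt' | sort)
  if [ -z "$lss_found" ]; then
    echo "No sequencing_summary*.txt found under $1" >&2
    return 1
  fi
  printf '%s\n' "$lss_found"
}

# Example call, run before the Rscript command:
# list_sequencing_summaries ~/nanopore_analysis/3_fastqc
```

If the check fails for the chosen directory, the input path probably needs to point at the folder that holds the sequencing summary of the run instead.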

The data exploration and filtering process for the Nanopore metagenomics data can be structured step by step with the available tools:

  1. Data Exploration (Initial QC)
    • Input: .fastq files or a folder containing them, or sequencing_summary.txt files.
    • Tools:
      • MinIONQC: Generate diagnostic plots to explore read quality and sequencing performance.
      • NanoPlot: Visualize distributions (e.g., read length, quality scores) to identify data trends and potential issues.
    • Output:
      • Quality control plots (e.g., read quality histograms, length distributions).
      • Summary statistics about read counts, mean quality, and sequencing performance.
  2. Defining Filter Parameters: based on the QC results, establish the filtering criteria:
    • Trimming Read Ends:
      • Use NanoFilt (--headcrop / --tailcrop) to trim a fixed number of bases from the read ends.
    • Adapter/Barcode Removal:
      • If adapters or barcodes are present, use tools like Porechop to remove them.
    • Filtering Mean Read Quality:
      • Set a threshold (e.g., minimum Q-score) and filter using NanoFilt.
    • Filtering by Read Length:
      • Use NanoFilt or a custom script to remove reads below or above a specific length threshold.
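
The "custom script" option for length filtering can be as small as one awk call. The sketch below assumes plain four-line FASTQ records read from standard input; the function name filter_fastq_by_length is hypothetical, and it filters on length only, not on quality.

```shell
# filter_fastq_by_length MIN MAX < in.fastq > out.fastq   (hypothetical helper)
# Keeps only the reads whose sequence length is between MIN and MAX (inclusive).
filter_fastq_by_length() {
  awk -v min="$1" -v max="$2" '
    NR % 4 == 1 { header = $0 }   # @read_id line
    NR % 4 == 2 { seq = $0 }      # sequence line
    NR % 4 == 3 { plus = $0 }     # "+" separator line
    NR % 4 == 0 {                 # quality line completes the record
      len = length(seq)
      if (len >= min && len <= max) {
        print header; print seq; print plus; print $0
      }
    }
  '
}

# Example call (the thresholds are placeholders, to be chosen from the QC plots):
# filter_fastq_by_length 1000 100000 < combined.fastq > length_filtered.fastq
```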

NanoPlot: Visualize data

NanoPlot --fastq <path_to_fastq> --plots hex dot

NanoFilt : Filter reads

NanoFilt -q 7 --length 1000 < input.fastq > filtered.fastq

This example sets a minimum Q-score of 7 and a minimum read length of 1000 bp.

Relevant additional information

tools available in server

ls /group/bin/kaiju*
ls /group/bin/kraken*

databases for kraken:

tree -d /group/db

Delete the whole project

rm -rf ~/nanopore_analysis/
