Oxford Nanopore Technologies (ONT) sequencing - data analysis

creating project directory and linking ONT fastq.gz data

Create the project folder and links to the raw FASTQ (fastq.gz) files:

Create project folders

# create folders for compressed raw data, and decompressed raw data 
mkdir -p ~/nanopore_analysis/{1_raw_data,2_fastq_decompressed}
# create folders for QC
mkdir -p ~/nanopore_analysis/{3_fastqc,4_fastp,5_minionqc}
mkdir -p ~/nanopore_analysis/{6_nanofilt,7_nanoplot}
# create folders for Assembly
mkdir -p ~/nanopore_analysis/{9_metaflye,10_raven,11_canu}
# create folders for verification tools Assembly 
mkdir -p ~/nanopore_analysis/{12_metaquast,13_bandage}
# show definitive folder structure
tree ~/nanopore_analysis

Create links to FASTQ raw data

IMPORTANT: The Dorado tool is needed to convert pod5 to fastq, but it is unavailable on the server, so the provided fastq files were used instead. After creating the project folder, we link the source “fastq_pass” folder, which contains the barcode folders, to the destination folder ~/nanopore_analysis/1_raw_data:

# create links to raw data (all the barcode directories with fastq.gz files stored in fastq_pass)
ln -s /group/lectures/DTAN25data/BIOINF26_Gruppe2/no_sample_id/20250219_2052_MN35031_FBA50370_f12dc3bb/fastq_pass/* ~/nanopore_analysis/1_raw_data
# work inside the link folder, so that rm only touches the links
cd ~/nanopore_analysis/1_raw_data
# remove the links to non-relevant folders (barcodes 01-08, barcodes 17-24 and unclassified)
ls | grep -vE 'barcode09|barcode1[0-6]' | xargs rm
# now we have access to the barcode folders
ls -l
# and also to the fastq.gz files inside, as in this summary:
for folder in barcode09 barcode1{0..6}; do
  count=$(ls "$folder"/*.gz 2>/dev/null | wc -l) # count the .gz files in the folder
  echo "Folder: $folder ($count gz files)"
done
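
An alternative to linking everything and then removing the unwanted links is to link only barcode09 to barcode16 in the first place. The following is a minimal sketch of that idea, not the procedure used in this project: the `src` and `dest` variables and the fake barcode folders are assumptions created in temporary directories so the snippet can run anywhere.

```shell
set -eu

# Hypothetical stand-ins for the real locations: src plays the role of
# .../fastq_pass, dest the role of ~/nanopore_analysis/1_raw_data.
src=$(mktemp -d)
dest=$(mktemp -d)

# Demo data only: fake barcode01..barcode24 plus an unclassified folder.
for n in $(seq 1 24); do
  mkdir -p "$src/$(printf 'barcode%02d' "$n")"
done
mkdir -p "$src/unclassified"

# Link only barcode09..barcode16, so nothing has to be removed afterwards.
for n in $(seq 9 16); do
  bc=$(printf 'barcode%02d' "$n")
  if [ -d "$src/$bc" ]; then
    ln -s "$src/$bc" "$dest/$bc"
  else
    echo "warning: $src/$bc not found" >&2
  fi
done

ls "$dest"
```

With the real paths, `src` would point at the fastq_pass folder and `dest` at ~/nanopore_analysis/1_raw_data.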

Decompressing `fastq.gz` Files

Decompress all the fastq.gz files and join them into one combined.fastq per barcode (this takes less than 10 min).

# decompress all fastq.gz files (takes < 10 min)
cd ~/nanopore_analysis/1_raw_data
for barcode in barcode*; do
  mkdir -p ../2_fastq_decompressed/"$barcode"
  gunzip -c "$barcode"/*.fastq.gz > ../2_fastq_decompressed/"$barcode"/combined.fastq
done
ls -hs ~/nanopore_analysis/2_fastq_decompressed/*/*
# check that the decompression worked:
# head -n 10 ~/nanopore_analysis/2_fastq_decompressed/barcode09/combined.fastq
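
A quick sanity check after decompression is to count the reads in combined.fastq. The sketch below assumes that every FASTQ record takes exactly four lines (no wrapped sequences) and uses a tiny made-up data set in a temporary directory; the `work` directory and the two demo reads are not part of the project.

```shell
set -eu

# Demo data only: two reads split over two gzip files, standing in for
# one barcode folder in 1_raw_data.
work=$(mktemp -d)
mkdir -p "$work/barcode09"
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$work/barcode09/a.fastq.gz"
printf '@read2\nGGCCTT\n+\nIIIIII\n' | gzip -c > "$work/barcode09/b.fastq.gz"

# Same concatenation step as in the decompression loop.
gunzip -c "$work"/barcode09/*.fastq.gz > "$work/combined.fastq"

# A FASTQ record has 4 lines, so reads = lines / 4.
lines=$(wc -l < "$work/combined.fastq")
reads=$((lines / 4))
echo "reads: $reads"

# A line count that is not a multiple of 4 hints at a truncated file.
if [ $((lines % 4)) -ne 0 ]; then
  echo "warning: line count is not a multiple of 4" >&2
fi
```

The read count obtained this way should agree with the "Total Sequences" value that FastQC reports for the same file.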

Data Exploration (reference pdf)

input: .fastq file or sequencing_summary.txt
output: different plots and/or a summary

Depending on the results of the initial QC, the following filter parameters will be defined:

trimming of read ends
adapter/barcode removal
filtering by mean read quality
filtering by read length

fastQC

example with only barcode09

# fastqc requirement: the output directory must already exist
cd ~/nanopore_analysis/2_fastq_decompressed
for barcode in barcode09; do
  # mkdir -p ~/nanopore_analysis/3_fastqc/"$barcode"
  # fastqc "$barcode"/combined.fastq \
  #        --outdir ~/nanopore_analysis/3_fastqc/"$barcode" \
  #        --extract \
  #        --threads 4
  tree ~/nanopore_analysis/3_fastqc/"$barcode"
done

Result: all the fastqc results have the same structure

/home/rmedina/nanopore_analysis/3_fastqc/barcode09
├── combined_fastqc
│   ├── fastqc_data.txt
│   ├── fastqc.fo
│   ├── fastqc_report.html
│   ├── Icons
│   │   ├── error.png
│   │   ├── fastqc_icon.png
│   │   ├── tick.png
│   │   └── warning.png
│   ├── Images
│   │   ├── adapter_content.png
│   │   ├── duplication_levels.png
│   │   ├── per_base_n_content.png
│   │   ├── per_base_quality.png
│   │   ├── per_base_sequence_content.png
│   │   ├── per_sequence_gc_content.png
│   │   ├── per_sequence_quality.png
│   │   └── sequence_length_distribution.png
│   └── summary.txt
├── combined_fastqc.html
└── combined_fastqc.zip

3 directories, 18 files

Basic Statistics Fastqc: summary.txt

The Basic Statistics module provides key data about the analyzed file:

Filename: Name of the analyzed file.
File type: Base calls or colorspace.
Encoding: ASCII quality format.
Total Sequences: Number processed (actual/estimated).
Filtered Sequences: Removed sequences (Casava mode).
Sequence Length: Shortest to longest range.
%GC: Percentage of guanine and cytosine bases.

The summary.txt in the barcode09 folder looks like this:

cat ~/nanopore_analysis/3_fastqc/barcode09/combined_fastqc/summary.txt

RESULTS:

PASS	Basic Statistics	combined.fastq
FAIL	Per base sequence quality	combined.fastq
PASS	Per sequence quality scores	combined.fastq
FAIL	Per base sequence content	combined.fastq
PASS	Per sequence GC content	combined.fastq
PASS	Per base N content	combined.fastq
WARN	Sequence Length Distribution	combined.fastq
PASS	Sequence Duplication Levels	combined.fastq
PASS	Overrepresented sequences	combined.fastq
PASS	Adapter Content	combined.fastq

We will focus on the modules flagged with FAIL or WARN:

FAIL	Per base sequence quality	combined.fastq
FAIL	Per base sequence content	combined.fastq
WARN	Sequence Length Distribution	combined.fastq
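
The same selection can be made automatically for every barcode. The sketch below is only an illustration: it assumes that summary.txt is tab-separated (status, module name, file name) and it builds a shortened demo summary.txt in a temporary directory; the `qc` directory and the output file `flagged.tsv` are hypothetical names.

```shell
set -eu

# Demo input only: a shortened summary.txt that mimics
# 3_fastqc/<barcode>/combined_fastqc/summary.txt.
qc=$(mktemp -d)
mkdir -p "$qc/barcode09/combined_fastqc"
printf '%s\t%s\t%s\n' \
  PASS 'Basic Statistics' combined.fastq \
  FAIL 'Per base sequence quality' combined.fastq \
  WARN 'Sequence Length Distribution' combined.fastq \
  > "$qc/barcode09/combined_fastqc/summary.txt"

# Keep every line whose status is not PASS and prefix it with the barcode.
for f in "$qc"/*/combined_fastqc/summary.txt; do
  barcode=$(basename "$(dirname "$(dirname "$f")")")
  awk -F'\t' -v bc="$barcode" '$1 != "PASS" { print bc "\t" $1 "\t" $2 }' "$f"
done > "$qc/flagged.tsv"

cat "$qc/flagged.tsv"
```

Pointing `qc` at ~/nanopore_analysis/3_fastqc instead of the temporary directory would list the flagged modules of all barcodes that have been processed.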

generate the rest of the fastqc reports: from barcode10 to barcode16

cd ~/nanopore_analysis/2_fastq_decompressed
# only barcode10 is processed in this step
for barcode in barcode10; do
  mkdir -p ~/nanopore_analysis/3_fastqc/"$barcode"
  fastqc "$barcode"/combined.fastq \
         --outdir ~/nanopore_analysis/3_fastqc/"$barcode" \
         --extract \
         --threads 4
done

printf '\n==> summary.txt:\n'
cat ~/nanopore_analysis/3_fastqc/barcode10/combined_fastqc/summary.txt
printf '\n--> fastqc_data.txt:\n'
head ~/nanopore_analysis/3_fastqc/barcode10/combined_fastqc/fastqc_data.txt

RESULTS:

==> summary.txt:
PASS	Basic Statistics	combined.fastq
FAIL	Per base sequence quality	combined.fastq
PASS	Per sequence quality scores	combined.fastq
FAIL	Per base sequence content	combined.fastq
FAIL	Per sequence GC content	combined.fastq
PASS	Per base N content	combined.fastq
WARN	Sequence Length Distribution	combined.fastq
PASS	Sequence Duplication Levels	combined.fastq
PASS	Overrepresented sequences	combined.fastq
PASS	Adapter Content	combined.fastq

--> fastqc_data.txt:
##FastQC	0.11.9
>>Basic Statistics	pass
#Measure	Value
Filename	combined.fastq
File type	Conventional base calls
Encoding	Sanger / Illumina 1.9
Total Sequences	729889
Sequences flagged as poor quality	0
Sequence length	61-225281
%GC	36

Per base sequence quality: FAIL

Cause: sequencing chemistry degrades with increasing read length and over long runs, which lowers the base quality.
Possible solution: quality trimming of the read ends and filtering by mean read quality, as listed under Data Exploration.

ls ~/nanopore_analysis/3_fastqc/barcode09/combined_fastqc/fastqc_report.html
scp -r -P 1722 bioinf02:/home/rmedina/nanopore_analysis/3_fastqc/barcode09/combined_fastqc/fastqc_report.html /home/riccardo

generate fastqc of all barcodes

cd ~/nanopore_analysis/2_fastq_decompressed
for barcode in barcode09 barcode1{0..6}; do
  mkdir -p ~/nanopore_analysis/3_fastqc/"$barcode"
  fastqc "$barcode"/combined.fastq \
         --outdir ~/nanopore_analysis/3_fastqc/"$barcode" \
         --extract \
         --threads 4
done

check total sequences and the most relevant information from fastqc_data.txt

total sequences in the fastqc files

Path to the fastqc_data.txt file:

/home/rmedina/nanopore_analysis/3_fastqc/barcode09
├── combined_fastqc
│   ├── fastqc_data.txt

summary table:

cd /home/rmedina/nanopore_analysis/3_fastqc
for barcode in barcode*
do
  # print the barcode name, then the "Total Sequences" line of its report
  printf '%s:\t' "$barcode"
  grep '^Total' "$barcode"/combined_fastqc/fastqc_data.txt
done

RESULTS:

barcode09:	Total Sequences 1065437
barcode10:	Total Sequences 729889
barcode11:	Total Sequences 1217908
barcode12:	Total Sequences 667557
barcode13:	Total Sequences 407956
barcode14:	Total Sequences 83556
barcode15:	Total Sequences 735701
barcode16:	Total Sequences 1156564
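
To compare the sequencing depth of the barcodes, the counts can be summed and expressed as a share of all reads. This is a small sketch using the numbers from the listing above; the temporary file `counts` and the output layout are assumptions made for the example.

```shell
set -eu

# Read counts copied from the "Total Sequences" results of barcode09..16.
counts=$(mktemp)
printf '%s %s\n' \
  barcode09 1065437 \
  barcode10 729889 \
  barcode11 1217908 \
  barcode12 667557 \
  barcode13 407956 \
  barcode14 83556 \
  barcode15 735701 \
  barcode16 1156564 > "$counts"

# Sum all reads.
total=$(awk '{ sum += $2 } END { print sum }' "$counts")
echo "total reads: $total"

# Print each barcode with its read count and its percentage of the total.
awk -v total="$total" \
  '{ printf "%s\t%d\t%.1f%%\n", $1, $2, 100 * $2 / total }' "$counts"
```

The total over the eight barcodes is 6064568 reads; barcode14 contributes by far the fewest.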

a table with more detailed information from fastqc_data.txt:

Encoding
Total Sequences
Sequences flagged as poor quality
Sequence length
%GC

cd ~/nanopore_analysis/2_fastq_decompressed
# print the horizontal separator of the table
print_separator() {
    printf "|-------------------------------------+-----------------------|\n"
}
for barcode in barcode09 barcode1{0..6}; do
  fastqc_file="../3_fastqc/${barcode}/combined_fastqc/fastqc_data.txt"

  # print the table header
  print_separator
  printf "| dir: %-30s |  file: %-14s |\n" "${barcode}" "combined.fastq"
  print_separator

  # check that the file exists and format the output
  if [[ -f "$fastqc_file" ]]; then
    # lines 6-10 hold Encoding, Total Sequences, poor-quality count, Sequence length and %GC
    sed -n '6,10p' "$fastqc_file" | while IFS=$'\t' read -r measure value; do
      printf "| %-35s | %-21s |\n" "$measure" "$value"
    done
  else
    printf "| %-35s | %-21s |\n" "File Missing" "N/A"
  fi
done
print_separator

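RESULTS:

|-------------------------------------+-----------------------|
| dir: barcode09                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 1065437               |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 29-391635             |
| %GC                                 | 42                    |
|-------------------------------------+-----------------------|
| dir: barcode10                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 729889                |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 61-225281             |
| %GC                                 | 36                    |
|-------------------------------------+-----------------------|
| dir: barcode11                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 1217908               |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 64-305294             |
| %GC                                 | 44                    |
|-------------------------------------+-----------------------|
| dir: barcode12                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 667557                |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 44-259254             |
| %GC                                 | 44                    |
|-------------------------------------+-----------------------|
| dir: barcode13                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 407956                |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 42-380798             |
| %GC                                 | 40                    |
|-------------------------------------+-----------------------|
| dir: barcode14                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 83556                 |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 59-242572             |
| %GC                                 | 43                    |
|-------------------------------------+-----------------------|
| dir: barcode15                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 735701                |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 29-397562             |
| %GC                                 | 69                    |
|-------------------------------------+-----------------------|
| dir: barcode16                      |  file: combined.fastq |
|-------------------------------------+-----------------------|
| Encoding                            | Sanger / Illumina 1.9 |
| Total Sequences                     | 1156564               |
| Sequences flagged as poor quality   | 0                     |
| Sequence length                     | 59-241224             |
| %GC                                 | 45                    |
|-------------------------------------+-----------------------|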
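
Selecting the values by their measure name instead of by line number gives one row per barcode, which is easier to load into a spreadsheet or R. The following is a sketch under assumptions: the demo fastqc_data.txt is rebuilt from the barcode10 output shown earlier, and the `qc` directory, the column names and the file `basic_stats.tsv` are hypothetical.

```shell
set -eu

# Demo input only: the Basic Statistics block of a fastqc_data.txt,
# written to a temporary directory that mimics 3_fastqc/<barcode>/.
qc=$(mktemp -d)
mkdir -p "$qc/barcode10/combined_fastqc"
data="$qc/barcode10/combined_fastqc/fastqc_data.txt"
printf '%s\t%s\n' \
  '##FastQC' '0.11.9' \
  '>>Basic Statistics' 'pass' \
  '#Measure' 'Value' \
  'Filename' 'combined.fastq' \
  'File type' 'Conventional base calls' \
  'Encoding' 'Sanger / Illumina 1.9' \
  'Total Sequences' '729889' \
  'Sequences flagged as poor quality' '0' \
  'Sequence length' '61-225281' \
  '%GC' '36' > "$data"
printf '>>END_MODULE\n' >> "$data"

# One header line, then one tab-separated row per barcode.
printf 'barcode\ttotal_sequences\tsequence_length\tgc_percent\n' > "$qc/basic_stats.tsv"
for f in "$qc"/*/combined_fastqc/fastqc_data.txt; do
  barcode=$(basename "$(dirname "$(dirname "$f")")")
  awk -F'\t' -v bc="$barcode" '
    $1 == "Total Sequences" { total = $2 }
    $1 == "Sequence length" { len = $2 }
    $1 == "%GC"             { gc = $2 }
    /^>>END_MODULE/         { exit }   # stop after the first module
    END { print bc "\t" total "\t" len "\t" gc }
  ' "$f"
done >> "$qc/basic_stats.tsv"

cat "$qc/basic_stats.tsv"
```

Matching on the measure name keeps the extraction working even if a FastQC version adds or reorders lines in the Basic Statistics block.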

QC: fastp

# cd ~/nanopore_analysis/0_scripts
mkdir -p ~/nanopore_analysis/4_fastp
cd $_
wget http://opengene.org/fastp/fastp
chmod a+x ./fastp
ls -hs fastp

QC: MinIONQC

R script

input: sequencing_summary.txt
output:
- summary.yaml
- different plots
Different data sets can be combined for possible comparison.

download the MinIONQC R script

mkdir -p ~/nanopore_analysis/5_minionqc
cd ~/nanopore_analysis/5_minionqc

curl https://raw.githubusercontent.com/roblanf/minion_qc/master/MinIONQC.R > MinIONQC.R

In this project, the 433 .fastq files contained in each barcode folder were joined into a single file called combined.fastq. As a result, each barcode folder contains only one sequencing_summary.txt, which is used as input for MinIONQC.

run the script

cd ~/nanopore_analysis/5_minionqc
# "$HOME" is used instead of "~" because the shell may not expand "~" after "="
Rscript MinIONQC.R \
  --input="$HOME"/nanopore_analysis/3_fastqc \
  --output="$HOME"/nanopore_analysis/5_minionqc \
  --processors=4 \
  --qscore_cutoff=7 \
  --format=tiff \
  --smallfigures=TRUE
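
MinIONQC works from sequencing_summary.txt files, which are written during basecalling and are not produced by FastQC, so it may be worth checking that the input directory actually contains one before starting the R script. A minimal sketch, assuming the file name matches sequencing_summary*.txt; the `input_dir` variable and the demo layout in a temporary directory are assumptions for illustration.

```shell
set -eu

# Demo layout only: a temporary directory with one summary file, standing
# in for the directory that will be passed to --input.
input_dir=$(mktemp -d)
mkdir -p "$input_dir/run1"
: > "$input_dir/run1/sequencing_summary.txt"

# Count candidate summary files anywhere below the input directory.
n=$(find "$input_dir" -type f -name 'sequencing_summary*.txt' | wc -l)
n=$((n))   # normalise padding that some wc implementations add

if [ "$n" -eq 0 ]; then
  echo "no sequencing_summary*.txt below $input_dir" >&2
else
  echo "found $n summary file(s) below $input_dir"
fi
```

If the check finds nothing in the real input directory, the summary file has to be located in the original run folder first.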

The data exploration and filtering process for the Nanopore metagenomics data can be structured step by step with the available tools:

Data Exploration (Initial QC)
- Input: .fastq files or a folder containing them, or sequencing_summary.txt files.
- Tools:
  - MinIONQC: Generate diagnostic plots to explore read quality and sequencing performance.
  - NanoPlot: Visualize distributions (e.g., read length, quality scores) to identify data trends and potential issues.
- Output:
  - Quality control plots (e.g., read quality histograms, length distributions).
  - Summary statistics about read counts, mean quality, and sequencing performance.

Defining Filter Parameters
Based on the QC results, establish the filtering criteria:
- Trimming Read Ends:
  - Use NanoFilt to trim low-quality bases at read ends.
- Adapter/Barcode Removal:
  - If adapters or barcodes are present, use tools like Porechop to remove them.
- Filtering Mean Read Quality:
  - Set a threshold (e.g., minimum Q-score) and filter using NanoFilt.
- Filtering by Read Length:
  - Use NanoFilt or a custom script to remove reads below or above a specific length threshold.

NanoPlot: Visualize data

NanoPlot --fastq <path_to_fastq> --plots hex dot

NanoFilt : Filter reads

NanoFilt -q 7 --length 1000 < input.fastq > filtered.fastq

This example sets a minimum Q-score of 7 and a minimum read length of 1000 bp.
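
To make the length criterion concrete, the sketch below applies a minimum-length filter with plain awk. It is a stand-in for illustration only, not NanoFilt: it ignores the quality threshold, assumes four-line FASTQ records, and uses two made-up reads with a demo threshold of 10 bp (`work`, `min_len` and the read names are hypothetical).

```shell
set -eu

# Demo input only: two reads, 4 bp and 12 bp long.
work=$(mktemp -d)
printf '@short\nACGT\n+\nIIII\n@long\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' \
  > "$work/in.fastq"

min_len=10   # demo threshold; the NanoFilt example above uses 1000

# Collect one 4-line record at a time and print it only if the sequence
# line is at least min_len bases long.
awk -v min="$min_len" '
  NR % 4 == 1 { header = $0 }
  NR % 4 == 2 { seq = $0 }
  NR % 4 == 3 { plus = $0 }
  NR % 4 == 0 { if (length(seq) >= min) print header "\n" seq "\n" plus "\n" $0 }
' "$work/in.fastq" > "$work/filtered.fastq"

cat "$work/filtered.fastq"
```

Only the 12 bp read survives the filter here; for real data NanoFilt remains the better choice because it also evaluates the mean read quality.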

relevant additional information

tools available on the server

ls /group/bin/kaiju*
ls /group/bin/kraken*

databases for kraken:

tree -d /group/db

delete the whole project

rm -rf ~/nanopore_analysis/
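
Because this command is irreversible, a guarded variant can help when the path comes from a variable. A minimal sketch, assuming a hypothetical `project` variable; a throwaway directory stands in for ~/nanopore_analysis so the example is safe to run.

```shell
set -eu

# Demo only: a throwaway directory stands in for ~/nanopore_analysis.
project="$(mktemp -d)/nanopore_analysis"
mkdir -p "$project/1_raw_data"

# ${project:?} makes the shell abort if the variable is empty or unset,
# so rm is never called with an empty path.
if [ -d "$project" ]; then
  rm -rf -- "${project:?}"
  echo "removed $project"
fi
```

Only the links in 1_raw_data are deleted this way; the original data in fastq_pass is not touched, because rm does not follow symbolic links.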
