Playing with fastq/BAM using awk and Python

Jul 15, 2021

Just some notes for commonly-used commands/code.

Filter fastq based on read ids

Given a list of read ids, extract their sequences and read qualities from a fastq:

import gzip

def extractFq(read_ids, fq, out):
    '''Filter fastq'''
    ids, lineno, flag = set('@' + x + '\n' for x in read_ids), 0, False
    fq = gzip.open(fq, 'rb')
    out = gzip.open(out, 'wb')
    
    for line in fq:
        lineno += 1
        if lineno % 4 == 1: 
            flag = (line.decode('ascii') in ids)
        if flag: 
            out.write(line)

Filter BAM based on read ids

samtools view -h your.bam | awk -F"\t" -vOFS="\t" 'NR==FNR{id[$1]=$1; next} \
    {if($1 ~ /^@/) {print $0} else {if($1 in id) print $0}}' <(zcat ids.gz) - | \
    samtools view -bS - > your.sub.bam

samtools view -h prints the header as well, - means the stdout/stdin from samtools. Can also extract the files with other stuff.

Filter fastq by sequence length

def filterFq(fastq, out, cutoff_len): 
    lineno, flag = 0, False
    fq = gzip.open(fastq, 'rb')
    out = gzip.open(out, 'wb')
    
    for line in fq:
        lineno += 1
        if lineno % 4 == 1:
            read_id = line
        if lineno % 4 == 2: 
            flag = (len(line.decode('ascii').rstrip()) > cutoff_len)
            if flag: 
                out.write(read_id)
        if flag and lineno % 4 != 1:
        	out.write(line)

Count the occurrence of sequences/reads

awk 'NR%4 == 2 {seen[$0]++} END {for (i in seen) {print i"\t"seen[i]}}' \
  	<(zcat barcode.fastq.gz) | gzip > barcode.cnts.txt.gz

Take a peek at the sequence in fastq given a read id

zcat fastq.gz | awk '/@AAA:BBB:12345/{x=NR+3} (NR<=x) {print}'

To be continued...