Just some notes for commonly-used commands/code.
Filter fastq based on read ids
Given a list of read ids, extract their sequences and read qualities from a fastq:
import gzip
def extractFq(read_ids, fq, out):
'''Filter fastq'''
ids, lineno, flag = set('@' + x + '\n' for x in read_ids), 0, False
fq = gzip.open(fq, 'rb')
out = gzip.open(out, 'wb')
for line in fq:
lineno += 1
if lineno % 4 == 1:
flag = (line.decode('ascii') in ids)
if flag:
out.write(line)
Filter BAM based on read ids
samtools view -h your.bam | awk -F"\t" -vOFS="\t" 'NR==FNR{id[$1]=$1; next} \
{if($1 ~ /^@/) {print $0} else {if($1 in id) print $0}}' <(zcat ids.gz) - | \
samtools view -bS - > your.sub.bam
samtools view -h
prints the header as well, -
means the stdout/stdin from samtools. Can also extract the files with other stuff.
Filter fastq by sequence length
def filterFq(fastq, out, cutoff_len):
lineno, flag = 0, False
fq = gzip.open(fastq, 'rb')
out = gzip.open(out, 'wb')
for line in fq:
lineno += 1
if lineno % 4 == 1:
read_id = line
if lineno % 4 == 2:
flag = (len(line.decode('ascii').rstrip()) > cutoff_len)
if flag:
out.write(read_id)
if flag and lineno % 4 != 1:
out.write(line)
Count the occurrence of sequences/reads
awk 'NR%4 == 2 {seen[$0]++} END {for (i in seen) {print i"\t"seen[i]}}' \
<(zcat barcode.fastq.gz) | gzip > barcode.cnts.txt.gz
Take a peek at the sequence in fastq given a read id
zcat fastq.gz | awk '/@AAA:BBB:12345/{x=NR+3} (NR<=x) {print}'
To be continued...