Extracting GFF from Genbank file

2016-02-28

Here are two ways of getting sequence annotations from a Genbank-format file into GFF format.

Method 1: BioPerl

Use the Perl script provided by the BioPerl project at https://github.com/bioperl/bioperl-live/blob/master/examples/tools/gb_to_gff.pl:

#!/usr/bin/env perl
use strict;

use Bio::Tools::GFF;
use Bio::SeqIO;

my ($seqfile) = @ARGV;
die("must define a valid seqfile to read") unless ( defined $seqfile && -r $seqfile);

my $seqio = new Bio::SeqIO(-format => 'genbank',
                           -file   => $seqfile);
my $count = 0;
while( my $seq = $seqio->next_seq ) {
    $count++;
    # defined a default name
    my $fname = sprintf("%s.gff", $seq->display_id || "seq-$count");
    my $gffout = new Bio::Tools::GFF(-file => ">$fname" ,
                                     -gff_version => 1);

    foreach my $feature ( $seq->top_SeqFeatures() ) {
        $gffout->write_feature($feature);
    }
}

Method 2: ReadSeq

If you don’t want to install BioPerl (understandably), you could use readseq:

Extract sequence as FastA format file:

$ readseq -format=fasta -output=seq.fas seq.gbk

Extract feature table as GFF v.2 format file:

$ readseq -format=gff -output=seq.gff seq.gbk

Split a GenBank annotated sequence file into a GFF v.2 feature table file and a FastA sequence file:

$ readseq -format=gff -output=seq.gff -unpair=1 seq.gbk \
    && mv seq.gff.fasta seq.fas

The ReadSeq project page is at https://sourceforge.net/projects/readseq/, though you should be able to get it from your OS’s package manager, e.g. sudo apt-get install readseq.