Creating a command-line Python app with Click

2015-07-04

This tutorial demonstrates how to add a command-line interface to a script to turn it into a CLI utility program.

As a simple example, let’s write a script to convert a DNA sequence file from one format to another. We’ll actually just call Biopython to do the conversion for us.

First, let’s create a dummy EMBL file test.embl to use for testing:

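from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

# Written for Biopython 1.78 or later, where Bio.Alphabet no longer exists:
# the molecule type is given as an annotation, which the EMBL writer requires.
record = SeqRecord(
    Seq("ACGT"),
    id="1",
    name="A",
    description="A sp. genome",
    annotations={"organism": "A sp.", "molecule_type": "DNA"},
)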

with open("test.embl", "w") as f:
    f.write(record.format("embl"))

Simple script

# embl2fasta_v1.py
import sys
from Bio import SeqIO

embl_file = sys.argv[1]
fasta_file = sys.argv[2]

SeqIO.convert(embl_file, "embl", fasta_file, "fasta")

Test the script by running

$ python embl2fasta_v1.py test.embl test.fasta

test.fasta should look like this:

>1 A sp. genome
ACGT

Using functions

In this case, the script is super-simple. But usually, it is more useful to package the code up into a function, since you can then factor out anything that needs repeating. It also allows the code to be reused in other scripts.

# embl2fasta_v2.py
from Bio import SeqIO

def embl2fasta(embl_file, fasta_file):
    """Convert EMBL_FILE to FASTA_FILE."""
    SeqIO.convert(embl_file, "embl", fasta_file, "fasta")

if __name__ == "__main__":
    import sys
    embl_file = sys.argv[1]
    fasta_file = sys.argv[2]
    embl2fasta(embl_file, fasta_file)

__name__ is "__main__" only when the script is run directly (e.g. from the command line). This means that the last block will not be executed if this script is imported by another script (e.g. from embl2fasta_v2 import embl2fasta).
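To see the effect of the guard without creating two separate scripts, here is a minimal sketch that runs the same one-line source twice using the standard library's runpy module: once under the name "__main__" and once under an ordinary module name. The file name demo_module.py and the variable ran_as_script are made up for illustration.

```python
import pathlib
import runpy
import tempfile

# One-line module that records how it was run (hypothetical example source).
source = 'ran_as_script = (__name__ == "__main__")\n'

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "demo_module.py"
    path.write_text(source)
    # run_path returns the module's globals; run_name sets __name__.
    as_script = runpy.run_path(str(path), run_name="__main__")
    as_import = runpy.run_path(str(path), run_name="demo_module")

print(as_script["ran_as_script"], as_import["ran_as_script"])  # True False
```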

Argument parsing

Let’s add some code to check the input.

Approach 1: Look before you leap

This approach is common in languages like R and (apparently) C.

# embl2fasta_v3.1.py
from Bio import SeqIO

def embl2fasta(embl_file, fasta_file):
    """Convert EMBL_FILE to FASTA_FILE."""
    SeqIO.convert(embl_file, "embl", fasta_file, "fasta")

if __name__ == "__main__":
    import sys

    if len(sys.argv) != 3:
        sys.exit("Error: Provide input and output file names.")
    else:
        # Unpack the arguments provided.
        script, embl_file, fasta_file = sys.argv

    embl2fasta(embl_file, fasta_file)

The disadvantage is that the len(), != and if operations are executed every time the script is run, even when the input is correct. For a single check like this the cost is negligible, so the difference is mostly one of style.

Approach 2: It’s better to beg forgiveness than to ask permission

It is considered more Pythonic to use a try/except block (a style often abbreviated EAFP, "easier to ask forgiveness than permission"), since the error-handling code only runs if an exception is raised. In other words, the code is more efficient when the input is assumed to be correct, and efficiency doesn’t matter when the input is wrong anyway.

# embl2fasta_v3.2.py
from Bio import SeqIO

def embl2fasta(embl_file, fasta_file):
    """Convert EMBL_FILE to FASTA_FILE."""
    SeqIO.convert(embl_file, "embl", fasta_file, "fasta")

if __name__ == "__main__":
    import sys

    try:
        script, embl_file, fasta_file = sys.argv
    except ValueError:
        sys.exit("Error: Provide input and output file names.")

    embl2fasta(embl_file, fasta_file)
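The two approaches can be compared side by side by moving the argument handling into functions that take the argument list as a parameter. This is only a sketch: the function names parse_lbyl and parse_eafp are invented here, and the conversion itself is left out so that the snippet runs without Biopython.

```python
ERROR_MESSAGE = "Error: Provide input and output file names."


def parse_lbyl(argv):
    """Look before you leap: check the length, then unpack."""
    if len(argv) != 3:
        raise SystemExit(ERROR_MESSAGE)
    return argv[1], argv[2]


def parse_eafp(argv):
    """Ask forgiveness: unpack first, handle the failure if it happens."""
    try:
        _script, embl_file, fasta_file = argv
    except ValueError:
        # Raised by the unpacking when argv has too few or too many items.
        raise SystemExit(ERROR_MESSAGE) from None
    return embl_file, fasta_file


good = ["embl2fasta.py", "test.embl", "test.fasta"]
print(parse_lbyl(good))  # ('test.embl', 'test.fasta')
print(parse_eafp(good))  # ('test.embl', 'test.fasta')
```

Both functions behave identically from the caller's point of view; raising SystemExit with a string is what sys.exit() does under the hood.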

Approach 3: Use an argument parser

We could use the argparse module in the standard library, but it would take at least five lines to set up. The Click package is more user-friendly. First install it with pip install click.

# embl2fasta_v3.3.py
from Bio import SeqIO
import click

@click.command()
@click.argument("embl_file")
@click.argument("fasta_file")
def embl2fasta(embl_file, fasta_file):
    """Convert EMBL_FILE to FASTA_FILE."""
    SeqIO.convert(embl_file, "embl", fasta_file, "fasta")

if __name__ == "__main__":
    embl2fasta()

Calling this script with the wrong number of arguments now prints an informative usage message.

Calling it as python embl2fasta_v3.3.py --help will print the following:

Usage: embl2fasta_v3.3.py [OPTIONS] EMBL_FILE FASTA_FILE

  Convert EMBL_FILE to FASTA_FILE.

Options:
  --help  Show this message and exit.
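For comparison, here is roughly what the same interface might look like with argparse. This is a sketch rather than a drop-in replacement: the helper name build_parser is invented, and the arguments are parsed from a hard-coded list so that the snippet runs on its own.

```python
import argparse


def build_parser():
    """Build a parser with the same two positional arguments as the Click version."""
    parser = argparse.ArgumentParser(description="Convert EMBL_FILE to FASTA_FILE.")
    parser.add_argument("embl_file", metavar="EMBL_FILE")
    parser.add_argument("fasta_file", metavar="FASTA_FILE")
    return parser


# In a real script, parse_args() with no argument would read sys.argv[1:].
args = build_parser().parse_args(["test.embl", "test.fasta"])
print(args.embl_file, args.fasta_file)  # test.embl test.fasta
```

Like Click, argparse prints a usage message and exits with status 2 when the wrong number of arguments is given.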

A generic converter

We can now easily add some options to allow conversion between various formats, so that we don’t need to write a separate script for conversion from GenBank format, for example. (The function and parameter names should be updated too.)

# convert_seq_v1.py
from Bio import SeqIO
import click

@click.command()
@click.argument("in_file")
@click.argument("out_file")
@click.option("-f", "--in-format", default="embl", show_default=True)
@click.option("-t", "--out-format", default="fasta", show_default=True)
def convert_seq(in_file, in_format, out_file, out_format):
    """Convert IN_FILE in IN_FORMAT to OUT_FILE in OUT_FORMAT."""
    SeqIO.convert(in_file, in_format, out_file, out_format)

if __name__ == "__main__":
    convert_seq()

The help message now reads as follows:

Usage: convert_seq_v1.py [OPTIONS] IN_FILE OUT_FILE

  Convert IN_FILE in IN_FORMAT to OUT_FILE in OUT_FORMAT.

Options:
  -f, --in-format TEXT   [default: embl]
  -t, --out-format TEXT  [default: fasta]
  --help                 Show this message and exit.

Try running, e.g.

$ python convert_seq_v1.py -t genbank test.embl test.genbank

But what happens if an invalid format is used? We could either specify a list of accepted formats, or handle the errors.

Click can take a list of valid choices for options:

SEQ_FORMATS = ("fasta", "fastq", "embl", "genbank")

@click.option("-f", "--in-format", default="embl", show_default=True,
    type=click.Choice(SEQ_FORMATS))
@click.option("-t", "--out-format", default="fasta", show_default=True,
    type=click.Choice(SEQ_FORMATS))

However, if Bio.SeqIO starts supporting additional formats, this list would have to be updated manually.

Instead, we could allow any input into the script and catch Bio.SeqIO’s own error:

import sys

def convert_seq(in_file, in_format, out_file, out_format):
    try:
        SeqIO.convert(in_file, in_format, out_file, out_format)
    except ValueError as err:
        sys.exit("Error: %s" % err)

Let’s assume we know all the formats we expect to deal with, and go with the first option.
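To check how Click reacts to a format that is not in the list, the command can be invoked in-process with click.testing.CliRunner. In this sketch the body of convert_seq is a stand-in that only echoes the request, so Biopython is not needed; the format name "xyz" is just an arbitrary invalid value.

```python
import click
from click.testing import CliRunner

SEQ_FORMATS = ("fasta", "fastq", "embl", "genbank")


@click.command()
@click.argument("in_file")
@click.argument("out_file")
@click.option("-f", "--in-format", default="embl", type=click.Choice(SEQ_FORMATS))
@click.option("-t", "--out-format", default="fasta", type=click.Choice(SEQ_FORMATS))
def convert_seq(in_file, in_format, out_file, out_format):
    """Stand-in command: report the request instead of converting."""
    click.echo(f"{in_file} ({in_format}) -> {out_file} ({out_format})")


runner = CliRunner()
ok = runner.invoke(convert_seq, ["-t", "genbank", "test.embl", "test.genbank"])
bad = runner.invoke(convert_seq, ["-t", "xyz", "test.embl", "test.xyz"])

print(ok.exit_code, ok.output.strip())  # 0 test.embl (embl) -> test.genbank (genbank)
print(bad.exit_code)  # 2
```

Click rejects the invalid choice before the function body runs and exits with status 2, its code for usage errors.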

Logging

Finally, let’s have the script produce some status messages to stderr. We’ll use the logging module in the standard library to produce the log and click.style() to colourize the output.

Here is the final script:

# convert_seq_v2.py
import logging
import sys
from Bio import SeqIO
import click

SEQ_FORMATS = ("fasta", "fastq", "embl", "genbank")

logging.basicConfig(
    level=logging.INFO,
    datefmt="%Y-%m-%d %X",
    format="%(asctime)s %(levelname)s %(message)s",
)

@click.command()
@click.argument("in_file")
@click.argument("out_file")
@click.option("-f", "--in-format", type=click.Choice(SEQ_FORMATS),
    default="embl", show_default=True)
@click.option("-t", "--out-format", type=click.Choice(SEQ_FORMATS),
    default="fasta", show_default=True)
def convert_seq(in_file, in_format, out_file, out_format):
    """Convert IN_FILE in IN_FORMAT to OUT_FILE in OUT_FORMAT."""
    logging.info(
        click.style("Converting %s from %s to %s", fg="green"),
        in_file, in_format, out_format,
    )
    try:
        SeqIO.convert(in_file, in_format, out_file, out_format)
    except Exception as err:
        logging.error(click.style("%s", fg="red"), err)
        sys.exit(1)  # non-zero exit status signals the failure to the shell
    else:
        logging.info(
            click.style("Output written to %s", fg="green"),
            out_file,
        )

if __name__ == "__main__":
    convert_seq()
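One thing to be aware of with this approach is that click.style() simply wraps the text in ANSI escape codes, and logging writes them out unchanged, even when stderr is redirected to a file. The sketch below makes this visible by logging to an in-memory stream; the logger name style_demo is arbitrary, and click.unstyle() is shown as one way to strip the codes again.

```python
import io
import logging

import click

# Log to an in-memory stream instead of stderr so the output can be inspected.
stream = io.StringIO()
logger = logging.getLogger("style_demo")  # arbitrary name for this sketch
logger.setLevel(logging.INFO)
logger.propagate = False
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)

# The styled string is the template; logging fills in %s afterwards.
logger.info(click.style("Converting %s", fg="green"), "test.embl")

raw = stream.getvalue()
print(repr(raw))  # the escape codes are part of the logged text
print(click.unstyle(raw), end="")  # INFO Converting test.embl
```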

To trigger the error handling, try converting our dummy EMBL, which doesn’t contain any quality values, into FastQ format:

$ python convert_seq_v2.py -t fastq test.embl test.fq
2015-07-04 07:35:27 INFO Converting test.embl from embl to fastq
2015-07-04 07:35:27 ERROR No suitable quality scores found in letter_annotations of SeqRecord (id=1).
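As an alternative to calling sys.exit() by hand, Click provides click.ClickException, which prints "Error: <message>" and exits with status 1. The sketch below is not what the script above does; it uses a hypothetical fake_convert function in place of SeqIO.convert so that it runs without Biopython.

```python
import click
from click.testing import CliRunner


def fake_convert(out_format):
    """Hypothetical stand-in for SeqIO.convert that fails for FASTQ output."""
    if out_format == "fastq":
        raise ValueError("No suitable quality scores found")
    return 1  # pretend one record was converted


@click.command()
@click.option("-t", "--out-format", default="fasta")
def convert(out_format):
    """Convert, turning conversion errors into Click errors."""
    try:
        fake_convert(out_format)
    except ValueError as err:
        # Click catches this, prints "Error: ..." and exits with status 1.
        raise click.ClickException(str(err))
    click.echo("done")


runner = CliRunner()
failed = runner.invoke(convert, ["-t", "fastq"])
passed = runner.invoke(convert, [])
print(failed.exit_code, passed.exit_code)  # 1 0
```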