InterProScan 6: a modern large-scale protein function annotation pipeline
- Track: Bioinformatics & Computational Biology
- Room: K.4.601
- Day: Saturday
- Start: 16:35
- End: 16:50
- Video only: k4601
The rate at which novel biological sequences are produced far outpaces the capacity for accurate laboratory-based functional annotation. Computational methods help prioritise research and manage resources, yet protein function prediction remains a core challenge in bioinformatics. InterProScan is a widely adopted tool for functional annotation: it scans protein sequences against predictive models, including HMMs and BLAST PSSMs, and maps the results to InterPro entries. This facilitates assignment of Gene Ontology (GO) terms, pathways, and other curated data. InterProScan is integral to the annotation pipelines of resources such as UniProt, Ensembl, and MGnify Genomes.
InterProScan 5 introduced a Java-based architecture capable of coordinating multiple analyses across compute environments. However, its monolithic design and tight coupling to specific data releases created challenges: users had to download large bundles and manage dependencies manually, complicating installation and reproducibility.
To address these issues, we present InterProScan 6, a complete reimplementation using the Nextflow workflow management system. Designed for flexibility, scalability, and reproducibility, it incorporates modular data handling, modern container technologies, and improved integration mechanisms.
As a modular Nextflow pipeline, InterProScan 6 parallelises and scales across environments ranging from local systems to HPC clusters (e.g. Slurm, LSF) and cloud platforms (e.g. AWS, Google Cloud). Users need only install Nextflow and a container engine (e.g. Docker, Singularity); all other dependencies are bundled in containers and automatically retrieved by the workflow, ensuring consistent environments and simplifying setup.
InterProScan 6 supports predictive tools from InterPro member databases (e.g. CDD, Pfam) and others like AntiFam, Coils, and MobiDB-lite. It also integrates advanced deep learning predictors: TMHMM is replaced by TMbed for transmembrane helix prediction, and SignalP 4.1 by SignalP 6.0. The architecture allows easy integration of new methods without altering the core pipeline.
Another major improvement is the decoupling of code and data. Unlike prior versions, software and data are no longer distributed as a single package. Users can specify the InterPro data version at runtime (via ‘--interpro
To lower the storage burden and improve usability, InterProScan 6 introduces on-demand data retrieval. When the pipeline is invoked with the ‘--applications’ and ‘--interpro’ parameters, the workflow automatically fetches only the signature data required for the selected analyses and InterPro version. This significantly reduces disk usage and simplifies setup for users interested in only a subset of the available tools.
Containerisation is central to InterProScan 6’s design. Every pipeline step is executed within a defined container image, bundling all required software and system libraries. Profiles for Docker, Singularity, and Apptainer are included, allowing users to run the pipeline consistently across different environments. This approach removes the need for manual dependency management, simplifies troubleshooting, and greatly enhances reproducibility.
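As an illustration of how such a run might be launched, here is a minimal Python sketch that assembles a Nextflow command line from a container profile, a selection of analyses, and an InterPro version. The ‘--applications’ and ‘--interpro’ parameters come from the description above, and ‘-profile’ is a standard Nextflow option. The ‘--input’ parameter name, the lowercase profile identifiers, the comma-separated application list, and the version string are assumptions made for illustration; the project documentation is the authority on the actual interface.

```python
import shlex

# Container profiles named in the abstract; the exact profile
# identifiers used by the pipeline are an assumption.
KNOWN_PROFILES = {"docker", "singularity", "apptainer"}


def build_command(fasta, applications, interpro_version, profile="docker"):
    """Return an argv list for a hypothetical InterProScan 6 run."""
    if profile not in KNOWN_PROFILES:
        raise ValueError(f"unknown container profile: {profile}")
    if not applications:
        raise ValueError("select at least one application")
    return [
        "nextflow", "run", "ebi-pf-team/interproscan6",
        "-profile", profile,                       # Nextflow option (single dash)
        "--input", fasta,                          # assumed parameter name
        "--applications", ",".join(applications),  # assumed comma-separated list
        "--interpro", interpro_version,            # placeholder version string
    ]


# Hypothetical example: scan with two member databases only.
cmd = build_command("proteins.fasta", ["Pfam", "CDD"], "105.0")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument intact if it is later passed to `subprocess.run`, and restricting the application list is what would let the workflow fetch only the matching signature data.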
InterProScan 5 uses the Match Lookup Service, a web service that provides precomputed matches for known sequences. When sequences are scanned with InterProScan, their MD5 checksums are submitted to this service, which returns existing annotations for recognised sequences, allowing InterProScan to bypass redundant local computation and improve performance. However, the original Match Lookup Service was purpose-built for InterProScan 5 and not designed for broader accessibility, representing a missed opportunity to share over 14 billion matches derived from 1 billion sequences with the wider research community.
To address this, the Match Lookup Service has been reimplemented as the Matches API (https://www.ebi.ac.uk/interpro/matches/api), a modern, developer-friendly, RESTful web service. The Matches API supports programmatic submission of up to 100 sequence checksums per request and returns results in JSON format consistent with InterProScan 6 output. Unlike the original service, the API also returns associated InterPro entries for matches linked to integrated signatures, as well as residue-level annotations for CDD, SFLD, and PIRSR. Furthermore, support for Cross-Origin Resource Sharing (CORS) allows client-side web applications to directly access the API, greatly facilitating integration of InterPro annotations into external tools and analysis pipelines.
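The client-side preparation for such a lookup can be sketched in Python: compute an MD5 checksum per sequence and group the checksums into batches that respect the 100-per-request limit stated above. The normalisation (whitespace removed, sequence upper-cased) and the uppercase hexadecimal digest are assumptions, as are the function names; the sketch deliberately stops before any HTTP call, because the endpoint paths and request format are not described here.

```python
import hashlib

MAX_CHECKSUMS_PER_REQUEST = 100  # limit stated in the abstract


def md5_checksum(sequence):
    """MD5 of a protein sequence as an uppercase hex string.

    Assumption: whitespace is removed and the sequence is upper-cased
    before hashing; check the API documentation for the exact rules.
    """
    normalised = "".join(sequence.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest().upper()


def batches(items, size=MAX_CHECKSUMS_PER_REQUEST):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical input: 250 short, distinct toy sequences.
sequences = ["MKV" + "A" * i for i in range(250)]
checksums = [md5_checksum(seq) for seq in sequences]
batch_sizes = [len(batch) for batch in batches(checksums)]
print(batch_sizes)
```

Each batch would then be sent as one request, and sequences whose checksums are not recognised by the service would fall back to local computation.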
InterProScan 6 is available under the Apache 2.0 open-source license and is distributed via GitHub (https://github.com/ebi-pf-team/interproscan6/), with containers hosted on Docker Hub. Extensive documentation and example configurations are provided to help users deploy and customise the pipeline for diverse annotation scenarios.
Speakers
- Matthias Blum