rncl: A new package to import NEXUS and Newick trees in R

4 minute read

rncl is now officially on CRAN. This package provides an interface with the NEXUS Class Library (NCL). NCL, written in C++, is included in several popular phylogenetic packages including Garli. NCL is comprehensive and is intended to parse and write valid NEXUS files, and has been extended to also support Newick files. rncl provides an interface to this library to import phylogenetic files in R.

What are NEXUS files? What are Newick files?

NEXUS files are commonly used files in phylogenetic and can be used to represent data used to build phylogenetic trees (e.g., DNA sequence alignment), phylogenetic trees, and/or data associated with species in the tree. Each element of a NEXUS file is found in a “block”.

The trees in NEXUS files are represented using the Newick notation (e.g., ((A,B),C);), but typically the list of taxa are defined first in a TAXA block, and represented later in the tree with a corresponding number.

NEXUS is an extension of the Newick file format. The Newick file format could represent either sequence data or trees. It was first designed in the mid-1980's and was most famously implemented in Felsenstein's PHYLIP program. NEXUS extended Newick by allowing the inclusion in a single file of both sequence and trees, and providing more flexibility by allowing programs to add additional blocks. If you are interested more in the topic, I suggest you read the original paper by Maddison, Swofford, Maddison describing the NEXUS file format.

Why rncl?

Because NEXUS files can contain a lot information, that can be presented in different ways, writing a robust parser is very difficult. The NEXUS Class Library has been developed since the early 2000's, first by Paul O. Lewis and more recently by Mark Holder. Because it is mature and used by several popular programs, its implementation is comprehensive and robust.

Importing NEXUS and Newick files in R works in many cases, but is not ideal. Additionally, NCL can parse trees, and is able to deal with situations that the phylogenetic packages in R cannot deal with.

Last September, I participated in the hackathon for the Open Tree of Life. There, with David Winter and Joseph Brown, we started to develop a package that interfaces with the Open Tree of Life API (rotl). It allows users to work with the data from the Open Tree of Life directly in R. However, we couldn’t use the functions provided by ape to import the trees, as many trees contained singletons (a node doesn’t lead to two descendants) represented as in this example:

'(((((A)cats,B)dogs,(C,D)mammals)tetrapods)animals,E)life;

Currently, it’s difficult to read such trees in R. When using ape:

library(ape)
singTree <- "(((((A)cats,B)dogs,(C,D)mammals)tetrapods)animals,E)life;"
ape::read.tree(text=singTree)
## Error in if (sum(obj[[i]]$edge[, 1] == ROOT) == 1  dim(obj[[i]]$edge)[1]
   missing value where TRUE/FALSE needed

phytools has a function that allows it but is slow on large trees as it has to parse characters one by one:

library(phytools)
phytools::read.newick(text=singTree)
## Phylogenetic tree with 5 tips and 6 internal nodes.
##
## Tip labels:
## [1] "A" "B" "C" "D" "E"
## Node labels:
## [1] "life"      "animals"   "tetrapods" "dogs"      "cats"      "mammals"
##
## Rooted; no branch lengths.

rncl can read files with singleton without issues.

library(rncl)
tmpFile <- tempfile()
cat(singTree, file=tmpFile)
rncl::read_newick_phylo(file=tmpFile)
## Phylogenetic tree with 5 tips and 4 internal nodes.
##
## Tip labels:
## [1] "A" "B" "C" "D" "E"
## Node labels:
## [1] "life"      "tetrapods" "dogs"      "mammals"
##
## Rooted; no branch lengths.

You may notice that the tree returned doesn’t include the node labels associated with the singleton nodes whereas phytools does. This is because the exported function for rncl collapses the singletons before returning the object while phytools does not.

rncl is also efficient on large trees. For instance here is how the three functions perform on a 3000+ tip tree:

library(microbenchmark)
microbenchmark(tr_ape(), tr_phytools(), tr_rncl(), times=10)
<pre><code>## Unit: milliseconds
##           expr        min         lq      mean    median        uq
##       tr_ape()  543.02340  551.91244  583.0824  572.0407  588.4832
##  tr_phytools() 3947.90586 4029.03325 4337.5111 4168.4067 4643.2461
##      tr_rncl()   84.62412   93.07477  103.2693  100.4018  114.4568
##        max neval
##   706.5012    10
##  5069.9927    10
##   124.0237    10

rncl is about 7 times faster than ape and 50 times faster than phytools.

rncl and phylobase

Most of the current code in rncl actually comes from the phylobase package. Given the need to have an efficient parser for the package that will interface with the Open Tree of Life's API, it made sense to take NCL out of phylobase and have it in a package of its own. The main difference is that rncl can build the edge matrix directly from NCL outputs, and does not rely on ape's parser for newick strings, making it faster and allowing for the singleton nodes.

Not having ncl inside phylobase will also make the development of phylobase easier as it will include a lot less code to be compiled. However, phylobase will remain very efficient at parsing NEXUS files that contain data associated with tips of trees but will depend on rncl to do so. These changes will be part of the next version of phylobase (0.8.0) that I am planning to release in January.

What's next?

Currently rncl only parses trees and data associated with tips in the trees. It should be relatively easy to extend it to also parse sequence data. rncl could also be used to generate NEXUS and Newick files directly from phylogenetic objects in R.

In the mean time, please report bugs and request features on github where the development of rncl takes place.

Comments