A curated dataset of peste des petits ruminants virus sequences for molecular epidemiological analyses
Peste des petits ruminants (PPR) is a highly contagious and devastating viral disease infecting predominantly sheep and goats. Tracking outbreaks of disease and analysing the movement of the virus often involves sequencing part or all of the genome and comparing the sequence obtained with sequences from other outbreaks, obtained from the public databases. However, there are a very large number (>1800) of PPRV sequences in the databases, a large majority of them relatively short, and not always well-documented. There is also a strong bias in the composition of the dataset, with countries with good sequencing capabilities (e.g. China, India, Turkey) being overrepresented, and most sequences coming from isolates in the last 20 years. In order to facilitate future analyses, we have prepared sets of PPRV sequences, sets which have been filtered for sequencing errors and unnecessary duplicates, and for which date and location information has been obtained, either from the database entry or from other published sources. These sequence datasets are freely available for download, and include smaller datasets which maximise phylogenetic information from the minimum number of sequences, and which will be useful for simple lineage identification. Their utility is illustrated by uploading the data to the MicroReact platform to allow simultaneous viewing of lineage date and geographic information on all the viruses for which we have information. While preparing these datasets, we identified a significant number of public database entries which contain clear errors, and propose guidelines on checking new sequences and completing metadata before submission.