Towards more reproducibility in vegetation research

Post provided by Marta Gaia Sperandii, Manuele Bazzichetto, Glenda Mendieta-Leiva, Sebastian Schmidtlein, Michael Bott, Renato Augusto Ferreira de Lima, Valério D. Pillar, Jodi N. Price, Viktoria Wagner and Milan Chytrý

Figure 1. Temporal trend in the percentage of articles published in the Journal of Vegetation Science (JVS) and Applied Vegetation Science (AVS) with data (left) or code (right) made available by the authors. The gray band highlights 2019, the year in which a mandatory Data Availability Statement was introduced.

This post refers to the article “Towards more reproducibility in vegetation research” by Sperandii et al., published in the Journal of Vegetation Science (https://doi.org/10.1111/jvs.13224).

The 2024 Editorial of the Journal of Vegetation Science (JVS) was prepared jointly by the Steering Committee of the IAVS Ecoinformatics Working Group and the JVS Chief Editors. We examined two key elements of research reproducibility: sharing data and sharing code used in research published in the Journal of Vegetation Science and Applied Vegetation Science (AVS).

Computational reproducibility is the ability to obtain consistent results using the same input data, computational steps, methods, and code from a previous study. It plays an important role in science because it: (1) ensures the credibility of scientific results; (2) improves the understanding of complex analytical workflows; (3) promotes knowledge sharing; and (4) saves research funds. In vegetation science, as in other natural sciences, standards for the computational reproducibility of analyses (hereafter referred to as “reproducibility”) have increased recently, driven in part by the widespread digitization of data and literature. The open sharing of data and code (the full source computer code used to generate research results) through repositories is a crucial step in promoting reproducibility, collaboration and scientific progress. In practice, however, reproducibility can be a challenge for researchers because it involves ensuring the permanent accessibility of data and code in repositories, creating sufficient descriptive metadata, providing unambiguous identifiers (e.g., Digital Object Identifiers, DOIs) and using a versioning system. Code also needs to be thoroughly commented, documented and generalizable so that analyses can be not only reproduced but also extended to other contexts, such as different study areas or groups of organisms.

The International Association for Vegetation Science (IAVS) strives to improve the reproducibility and transparency of its publications. In 2019, a “Data Availability Statement” was introduced as a mandatory section in JVS and AVS. The new policy follows Wiley’s “Mandates Data Sharing” line, which states that “it is required, as a condition for publication, that the data supporting the results in the paper will be securely archived in a public repository with DOI or in electronic Supplementary Information related to the paper. Whenever possible, the program source code, scripts, and other artefacts used to generate the analyses presented in the paper should also be publicly archived” (quoted from the JVS and AVS Author Guidelines in November 2023). Here, we examine how data and code are shared in JVS and AVS and whether there is an increasing trend over time.

We looked at articles published in JVS and AVS to assess the level of data and code availability achieved. Specifically, we scanned all issues of the two journals published from January 2013 to October 2023, which resulted in a total of 1,902 articles (1,184 from JVS and 718 from AVS). Commentaries and obituaries were excluded from the analysis. For each article, we evaluated, separately for data and code, whether these were:

1. Available: the authors stated that data/code were made available and provided a link, either to the supplementary material or to an external repository.

2. Accessible: the link worked, and the data/code could be downloaded and opened.

For data/code that were available but not accessible, we recorded whether this was due to a broken link, a private repository or other reasons.
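The two-step evaluation described above can be expressed as a small decision rule. The following Python sketch is only an illustration, not the procedure used for the editorial: the names `Record`, `Status` and `classify`, and the order in which the reasons are checked, are assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    NOT_AVAILABLE = "not available"
    ACCESSIBLE = "available and accessible"
    BROKEN_LINK = "available, but the link is broken"
    PRIVATE_REPOSITORY = "available, but the repository is private"
    OTHER = "available, but inaccessible for other reasons"


@dataclass
class Record:
    """Hypothetical notes taken for the data (or the code) of one article."""
    link: Optional[str]        # link stated by the authors, None if absent
    link_works: bool = False   # the link could be resolved
    is_private: bool = False   # the repository required special permission
    opened: bool = False       # files could be downloaded and opened


def classify(record: Record) -> Status:
    # Step 1, availability: the authors must have provided a link.
    if not record.link:
        return Status.NOT_AVAILABLE
    # Step 2, accessibility: record why an available item cannot be used.
    if not record.link_works:
        return Status.BROKEN_LINK
    if record.is_private:
        return Status.PRIVATE_REPOSITORY
    if record.opened:
        return Status.ACCESSIBLE
    return Status.OTHER
```

Applying `classify` once to the data record and once to the code record of each article keeps the two assessments separate, as in the evaluation described above.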

Our working definition of data was limited to the numerical or categorical information necessary to reproduce at least the main analyses (and results) presented in the article, and thus included either raw data or derived data used as a starting point of the respective analysis. The latter category has limitations in terms of reproducibility, but in practice, it has proved difficult to separate raw data from derived data because there is a continuum between them.

The percentage of articles with available data (standardized by the total number of articles published per year) ranged from 1.5% in 2014 (AVS) to 82.3% in 2022 (JVS). The percentage of articles with available code ranged from 0% in 2013 (AVS) to 26.3% in 2023 (JVS). Both journals showed a steep increase in data availability from 2019 onwards. In comparison, the percentage of articles with available code has increased only moderately since 2019. Comparing the two journals, JVS had a slightly higher proportion of articles with available data than AVS in all years except 2015–2017 (Figure 1). Similarly, the proportion of articles with available code was slightly higher in JVS than in AVS in almost all years (Figure 1). The data for 2023 (to mid-October) revealed an overall drop in the availability of code and data (except for code in JVS). Overall, accessibility was quite high, with 26.5% of the available data also being accessible (25.8% for JVS and 27.6% for AVS) and 8.5% of the available code also being accessible (9.6% for JVS and 6.5% for AVS). For articles with available but inaccessible data, the most common cause was a broken link to the data (52.9% of cases), although the use of private repositories was also common (35.3% of cases). For code, the most frequent cause of inaccessibility was a broken link (69.2%), whereas the use of private repositories was much less common (7.7%).
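As an illustration of how percentages standardized by the yearly number of articles can be computed, here is a minimal Python sketch. The function name `yearly_percentage` and the toy records are assumptions made for this example; they are not the data or code behind the editorial.

```python
from collections import defaultdict


def yearly_percentage(articles):
    """Percentage of articles with an item (data or code) per journal and year.

    `articles` is an iterable of (journal, year, has_item) tuples.
    """
    totals = defaultdict(int)
    with_item = defaultdict(int)
    for journal, year, has_item in articles:
        key = (journal, year)
        totals[key] += 1
        if has_item:
            with_item[key] += 1
    # Divide by the number of articles published in that journal and year.
    return {key: round(100 * with_item[key] / totals[key], 1) for key in totals}


# Toy records for illustration only (not taken from the study).
toy_articles = [
    ("JVS", 2022, True),
    ("JVS", 2022, True),
    ("JVS", 2022, False),
    ("AVS", 2014, False),
]
percentages = yearly_percentage(toy_articles)
```

Dividing by the yearly total makes years with different numbers of published articles comparable.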

Our analysis shows a remarkable increase in the availability of data and a much slower increase in code availability since 2019. In terms of the average number of articles per year, data sharing increased much more steeply (JVS: from an average of 8 articles with available data in the period 2015–2018 to 67.5 in the period 2020–2023; AVS: 6.5 to 46 articles) than code sharing (JVS: from an average of 6.75 articles with available code in the period 2015–2018 to 16.8 in the period 2020–2023; AVS: 1.25 to 9.5 articles). The slower increase in code availability could be because: (1) data analysis is often not code-based but relies on point-and-click software; (2) the journals' rules are less strict for code, making data sharing mandatory but encouraging code sharing only “whenever possible”; (3) authors are more familiar with data-sharing than code-sharing practices; and (4) data sharing is less time-consuming than code sharing, which typically requires time to clean up, annotate and organize scripts. In addition, there may be other barriers to archiving code, such as uncertainty about how to prepare code for sharing, complexity of the workflow, fear that revealing the details of data management and analysis might lead others to spot inaccuracies or errors, or embarrassment due to a perceived lack of style in code writing; Gomes et al. (2022) discuss possible solutions to these barriers.
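The difference between the two trends can be made explicit by subtracting the 2015–2018 means from the 2020–2023 means reported above. The short Python sketch below does only this arithmetic; the variable and function names are chosen here for illustration.

```python
# Mean number of articles per period, as reported in the text:
# (mean for 2015-2018, mean for 2020-2023)
reported_means = {
    ("JVS", "data"): (8.0, 67.5),
    ("AVS", "data"): (6.5, 46.0),
    ("JVS", "code"): (6.75, 16.8),
    ("AVS", "code"): (1.25, 9.5),
}


def absolute_increase(before, after):
    """Difference between the two period means, rounded to two decimals."""
    return round(after - before, 2)


increases = {
    key: absolute_increase(*means) for key, means in reported_means.items()
}

for (journal, item), gain in increases.items():
    print(f"{journal} {item}: +{gain} articles")
```

In absolute terms, the gain for data (59.5 articles in JVS, 39.5 in AVS) is several times the gain for code (10.05 in JVS, 8.25 in AVS).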

The steep increase in data archiving that we have highlighted in our analysis represents a commendable effort towards improving reproducibility. Yet, this should not stop us from striving for further improvements, especially in terms of code sharing. One way to achieve this is to provide more information and training on data- and code-sharing practices. There are convenient options for making data and code repositories permanent, accessible and unambiguously identifiable (through a DOI), and several of these options are free.

Advocating the open sharing of scientific data and code via repositories is a pragmatic choice. It facilitates research efficiency by encouraging collaboration among researchers. The commitment to transparency is a practical step towards more reproducibility in the future. However, there are also issues of governance, protection of local community knowledge and restrictions on third-party data sharing, as well as equity in data use that should be considered and respected by authors and editors.

Brief personal summary: Marta Gaia Sperandii, Manuele Bazzichetto, Glenda Mendieta-Leiva, Sebastian Schmidtlein, and Renato Augusto Ferreira de Lima are members of the Steering Committee of the Ecoinformatics Working Group of the International Association for Vegetation Science. Valério D. Pillar, Jodi N. Price, Viktoria Wagner and Milan Chytrý are Chief Editors of the Journal of Vegetation Science.