This repository contains insights into to the risk-assessment process for validating R packages at Novartis prior to their use on our SCE (Scientific Computing Environment) for analysis and reporting in clinical trials, i.e. GxP activities.
The version of R and associated set of packages are updated periodically as part of operational releases. The validation process prevents the need for double programming of the functionality of the package but users remain accountable for understanding the implementation of the methodology (e.g. solvers and defaults).
The outcome of the package validation is stored in the ./data directory and the most recent released data is rendered below:
Rows: 1148 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): package, source, version, r_version, rating
date (1): assessment_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Risk assessment timing
As part of the software update cycle, packages are risk assessed and if their risk level is deemed acceptable, included in the next software release on our SCE.
The risk assessment dates shown above capture 2 scenarios:
Risk assessments carried out as part of the previous release
New risk assessments carried out on the lead up to the latest release
The package versions reported here are the versions of the package installed in the R module on the SCE, which may be a slightly later version of the package than that originally assessed.
We conducted an analysis of changes in risk assessment ratings over time and found that the risk rating changed in only minority of packages (7.4%). In these small number of cases the changes were primarily non-impactful, e.g. a package transitioning from medium risk to low risk, with changes primarily driven by fluctuation in download rates. These data helped inform our strategy for the risk assessment of existing packages - we re-assess the risk for major milestones like updates to the R version but for more incremental releases we carry over the risk assessment from the previous release.
Risk assessment calculations are mostly automated, with some intervention from reviewers when metrics require manual verification or calculation (e.g. trusted source type metrics), and to give overall approval of a package’s risk assessment.
Low or medium risk packages are automatically approved for installation on our SCE.
If a package has been deemed to be high risk, the requester is required to perform additional checks to verify the functionality of the package through performance qualification (PQ) tests.
If a package receives a rating of “ineligible”, it has failed all basic pre-requisites and cannot be installed as part of this process.
Risk rating criteria
Each risk rating has a set of defined criteria. If a package meets at least one of these criteria, then it is assigned that particular risk rating. In the section below, we briefly describe the criteria at a high-level, and in the next section we expand on the details of some of the more complex ones.
Prerequisites
Does the package fulfill the following criteria?
Source code is publicly available (e.g., via CRAN, Bioconductor, GitHub, …)
Package has not been archived more than 180 days ago on CRAN or Bioconductor
If the answer is “no” to both of these questions, then the package is ineligible.
Low risk packages
The package meets at least one of these low-risk criteria:
Does the package belong to any of the following groups?
base or recommended package
tidyverse package
imports/suggestions of a base or recommended package
Does the package come from a recognizable source or reputable institution?
Does the package have at least 50,000 downloads per month? Use the average over the last 6 months or since release
Does the package have at least 5 reverse dependencies?
Is the package a “low risk support package”, e.g., one devoted solely to format, display or visualize the output or is simply a data package?
Medium risk packages
The package meets at least one of these medium-risk criteria:
Does the package have at least 1,000 downloads per month?
Are there any peer-reviewed publications about the package in reputable journals?
Is there unit testing covering more than 60%?
High risk packages
The package meets at least one of these high-risk criteria:
Was the first CRAN / Bioconductor production release of the package more than 1 year ago?
Does the package contain any additional documentation? (vignettes, manuals, website, newsfeed)
Acceptance criteria guidelines
Questions 2, 5, and 7 rely on more complex decision criteria for determining a package’s risk rating, and more details of these are found below.
Trusted sources and institutions
We maintain a list of trusted sources, consisting of individuals who are considered to be experienced developers with numerous contributions to the open source community, and institutions who are considered to be well-established.
Users can request that new sources are added to this list, and each request is evaluated individually. As a minimum, an individual must fulfill one of the following criteria to be included:
a h-index greater than or equal to 20
have 5 or more active packages on CRAN or Bioconductor
For institutions, the both of the following criteria must be met:
have existed for 5 or more years
have 5 or more active packages on CRAN or Bioconductor
For both institutions and individuals, the criteria above are a minimum, and meeting these criteria does not guarantee that the request for their addition will be granted.
Journal publication criteria
For a package to be deemed medium-risk due to having an associated journal article publication, the journal must be one from a list of approved journals. A journal must fulfill all of the following minimum criteria:
publisher maturity of 5 or more years
peer review is an obligatory part of the publishing process
In addition, a journal must fulfill at least one of the following conditions:
A trustworthy journal should have an above average citation potential (>= 1.0 on the SCImago Journal Rank)
Impact factor of 2 or more
Low-risk support package
Low-risk support packages are those that do not have the capacity to alter the scientific analysis or conclusion. This includes pure display or data packages. Packages that merely provide some limited level of interactivity (e.g., zoom, hover text) may still be considered low-risk support packages, but packages that contain complex or impactful mathematical operations, generate or manipulate code used in scientific analysis, or do any kind of computations cannot be considered to be low-risk support packages.