Core GRADE
Frequently Asked Questions
If you have a question about Core GRADE that you would like answered
please email guyatt@mcmaster.ca.
Core GRADE provides the essentials for using GRADE to address paired comparisons of treatments in systematic reviews, clinical practice guidelines, and health technology assessments.
Core GRADE 1, overview
Doi:10.1136/bmj-2024-081903
https://www.bmj.com/content/389/bmj-2024-081903
Core GRADE 2, target and precision
Doi:10.1136/bmj-2024-081904
https://www.bmj.com/content/389/bmj-2024-081904
Core GRADE 3, inconsistency
Doi:10.1136/bmj-2024-081905
https://www.bmj.com/content/389/bmj-2024-081905
Core GRADE 4, risk of bias
Doi:10.1136/bmj-2024-083864
https://www.bmj.com/content/389/bmj-2024-083864
Core GRADE 5, indirectness
Doi:10.1136/bmj-2024-083865
https://www.bmj.com/content/389/bmj-2024-083865
Core GRADE 6, summary of findings
Doi:10.1136/bmj-2024-083866
https://www.bmj.com/content/389/bmj-2024-083866
Core GRADE 7, evidence to decision
Doi:10.1136/bmj-2024-083867
Question from Lukman Thalib <l.thalib@griffith.edu.au>
I am currently working on updating a systematic review. As you are aware, Cochrane now requires a GRADE assessment for all outcomes, including those based on a single study. While we are fully supportive of the principles and value of GRADE, we are encountering some challenges with its application to single studies.
For example, in one outcome, there is only a single RCT evaluating a newer intervention, with very few events overall—only a handful in the control arm and just one in the intervention arm. This resulted in a small relative risk, and while the confidence interval did not cross 1, the data are clearly sparse and highly uncertain. However, as we cannot rate down for heterogeneity or publication bias in a single-study context, the resulting GRADE assessment over-estimated the certainty of evidence—something that feels misleading given the limited data and early nature of the intervention.
Our group lead feels strongly that GRADE should primarily be applied to a body of evidence rather than to individual studies in isolation, particularly in situations like this. However, Cochrane’s expectations are quite firm.
Given your leadership role in developing GRADE, we would greatly appreciate any insights or guidance you might be able to offer. Specifically, do you believe GRADE can or should be adapted in such contexts, or are there ways within the current framework to appropriately reflect the uncertainty in these situations?
Response
Lukman, if you are thinking of undertaking a review of the intervention, it means that people are thinking of using it. And if they are, they should be aware of the best estimates of effect and the certainty of those estimates. That is true however many studies are available.
The challenge is to get the certainty rating correct. If your intuition tells you that the certainty is lower than what your GRADE rating indicates, then the rating is incorrect (a fundamental principle of GRADE is that ratings should reflect the educated, thoughtful gestalt of the people making them).
From what you tell me, the evidence is clearly low certainty and quite possibly very low. Further, from what you tell me, the point estimate of the effect is large. Whenever the effect is large (and for binary outcomes this can mean a relative risk reduction greater than 30%), you should consider invoking an assessment against the optimal information size (OIS; see https://www.bmj.com/content/389/bmj-2024-081904). As described in that article, the OIS is the sample size calculation that would be undertaken when planning a single randomised controlled trial. For binary outcomes, this involves specifying the acceptable error rates, α (typically 0.05) and β (typically 0.20), the control group event rate (chosen from the context), and a modest relative risk reduction, typically 20% or 25%. From what you tell me, it is a sure thing that your study will fail the OIS criterion.
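For readers who want to see the arithmetic, here is a minimal sketch in Python of the standard two-proportion sample size calculation on which the OIS is based. The function name and the illustrative 10% control event rate are my own choices for the example, not values from the question or the Core GRADE papers.

from scipy.stats import norm

def ois_per_arm(control_event_rate, rrr, alpha=0.05, beta=0.20):
    # Event rates in the two arms under the assumed relative risk reduction
    p_control = control_event_rate
    p_intervention = control_event_rate * (1 - rrr)
    # Standard normal quantiles for a two-sided alpha and power of 1 - beta
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(1 - beta)
    # Classical formula for comparing two independent proportions
    n = ((z_alpha + z_beta) ** 2
         * (p_control * (1 - p_control) + p_intervention * (1 - p_intervention))
         / (p_control - p_intervention) ** 2)
    return int(n) + 1  # round up to a whole number of patients

# Illustrative values only: 10% control event rate, conservative 20% relative risk reduction
n_per_arm = ois_per_arm(0.10, 0.20)
print(f"OIS is roughly {n_per_arm} per arm ({2 * n_per_arm} patients in total)")

With these assumed inputs the OIS runs to several thousand patients, which is why a single small trial with a handful of events falls far short of it.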
Moreover, if the sample size is a long way from the OIS (and for the OIS calculation you can use a conservative 20% RRR), you can rate down twice for imprecision. From what you tell me about the situation, that is almost certainly the case. So now we’ve arrived at low certainty of evidence.
Your enlightened gestalt may well be that this represents very low certainty. In that case, we need to think how to apply GRADE to come to the correct conclusion. One possibility is that you think the number of events is so small you are very uncertain whether there is any effect at all. If the sample size is sufficiently far from the OIS you may consider rating down three levels for imprecision.
Alternatively, there may be other reasons for your judgment that the evidence is very low certainty. Are there risk of bias concerns? Or do you suspect that there may be one or more other studies out there that failed to show an effect and so were not published (there is no reason not to invoke publication bias concerns when there is only a single study)? Rating down for either of these reasons (or for the combination, if there are concerns in both but neither is sufficient to rate down by itself) would also get you to a GRADE rating of very low certainty, if that is your judgment of what is appropriate.
Question from Xiaomei Yao <yaoxia@mcmaster.ca>
I have a question regarding Table 1 in Core GRADE Paper 1. The table lists “Large baseline risk in unvaccinated people” as an example of varying baseline risk leading to different absolute effects, and “Large beneficial effects in unvaccinated people” as an example of varying relative risk.
For a single factor like vaccination, it seems it could be considered either a baseline risk factor or an effect modifier. My question is:
At the project planning stage, how do we decide whether to categorize a factor as influencing baseline risk or relative risk?
Should we consider listing the same factor under both categories?
Response
Let’s consider the two issues separately.
First, baseline risk.
One tries to think of easily identifiable patient characteristics that are likely to be associated with big differences in risk of the patient-important outcome one is considering. Typical such characteristics would be age, disease severity, and presence of comorbidity. For such factors, one considers whether it is plausible that investigators – ideally in well done cohort studies, but if those are not available, in less well done cohort studies or randomized trials – might have provided data for your groups of patients (the old and the young; greater and less disease severity; with and without comorbidity). If the answer is yes, and you suspect that differences in baseline risk might be large enough that the optimal action (and, in a guideline, the recommendations) might differ for the different groups of patients, one includes this as an a priori hypothesis and seeks the evidence.
Second, relative effects.
One considers patient characteristics that might modify the relative effect of the intervention. Might the relative effect be larger in the old versus the young, in those with greater versus less disease severity, or in those with comorbidity versus without (or, in each case, vice versa: larger in the young versus the old, and so on)?
In considering these possible effect modifications (synonyms: subgroup effects, interactions), one remembers that relative subgroup effects are much, much rarer than differences in baseline risk. Bearing that in mind, one considers the direction of any effect modification that might be present. Is there a compelling biological rationale that the relative effect will be larger in the old than in the young (or the opposite)? Is there a compelling biological rationale that the relative effect will be larger in those with more versus less severe disease (or the opposite)? Are you ready to state explicitly the direction of your subgroup hypothesis? If you think “well, age might modify the effect, but it could go either way (bigger effect in the old or the young)”, you reject this candidate relative subgroup hypothesis. Similarly with disease severity or comorbidity. In other words: if there is no compelling biological reason that dictates the direction of effect, dismiss that hypothesis.
The upshot of this process is that it is more likely that one will have one or more hypotheses about the impact of baseline risk on magnitude of effect than hypotheses about relative effect. It is possible, however, as in the example, that the same variable will generate a hypothesis about baseline risk (larger baseline risk in the unvaccinated) and a hypothesis about relative effect modification (larger relative effect in the unvaccinated).
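To make the baseline risk point concrete, here is a small numerical sketch in Python. The relative risk of 0.75 and the two baseline risks are invented purely for illustration; they are not the numbers behind Table 1.

# Illustrative numbers only: the same relative risk applied to two groups
# that differ only in their baseline risk of the outcome.
relative_risk = 0.75
for label, baseline_risk in [("higher baseline risk subgroup", 0.20),
                             ("lower baseline risk subgroup", 0.02)]:
    absolute_risk_reduction = baseline_risk * (1 - relative_risk)
    print(f"{label}: {round(absolute_risk_reduction * 1000)} fewer events per 1000 patients")

With these assumed numbers, the identical relative effect yields 50 fewer events per 1000 in the higher risk group but only 5 fewer per 1000 in the lower risk group, a difference that could easily lead to different recommendations and that therefore justifies an a priori baseline risk hypothesis at the planning stage.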
Question from Shripada Rao <Shripada.Rao@health.wa.gov.au>
In a hypothetical robust systematic review that included two large and well conducted RCTs, the results were as follows for the outcome of mortality: Relative Risk: 0.88, 95% CI 0.75 to 1.13. The total sample size was 2950.
In GRADEpro, should we downgrade the evidence by one level, two levels, or three levels for imprecision? Or should we not downgrade at all?
Response
Dr. Rao, I am not sure why you refer to GRADEpro. If I were going to use software to assist with the presentation, I’d use MAGICapp.
But neither software program provides answers to interpretation issues that go beyond the basics (and both may sometimes mislead with respect to these). So I’d be disinclined to refer to the programs when discussing methods issues such as the ones you raise.
To begin to address your issues, the first step in deciding whether to rate down for imprecision is to decide what it is in which we are rating our certainty – that is, the target of certainty rating. The process of deciding on the target begins with setting a threshold. Using Core GRADE, the two possible thresholds are the null and the minimally important difference (MID). I’ll address the null first.
When one chooses the null as the threshold, one begins by rating the certainty in a non-zero effect (that is, the non-zero effect is the target). In this case, if the point estimate represents the truth, then the intervention results in a 12% relative risk reduction, clearly a non-zero effect. One then examines the confidence interval to see whether it overlaps the chosen threshold, the null. In this case, it does. When there is such an overlap, one will always rate down for imprecision at least once. So that’s settled, and the only remaining issue is whether one rates down once or twice. Deciding between the two involves a judgement of whether, considering imprecision alone, one would conclude “the intervention probably provides a non-zero effect” (in this case a non-zero benefit) or whether the more appropriate, less certain conclusion is “the intervention possibly provides a non-zero effect”.
That is a matter of intuitive judgment regarding the message one feels is most appropriate for one’s target audience. Personally, I am reluctant to convey the “probably” message when there is a substantial possibility of harm, and to me the 13% relative increase in risk at the upper boundary of the confidence interval constitutes a substantial possibility of harm. Therefore, if there were no other domains (risk of bias, inconsistency, indirectness, publication bias) in which one had rated down, I’d be inclined to rate down twice.
However, as Core GRADE 1 emphasizes, one must ultimately step back and take a gestalt look at the final certainty rating. If one had already decided to rate down for one of the other domains, then the best possible certainty rating would be low, and rating down twice for imprecision would result in a rating of very low. Looking at the whole picture of the evidence, one may feel that low is the more appropriate certainty rating rather than very low. If that were the case, then rating down only once for imprecision would be the way to ensure the most appropriate final certainty rating.
Consider now the alternative choice of threshold, the MID. Using the MID for the threshold would involve deciding on a baseline risk, applying the relative risk point estimate and confidence intervals to that baseline risk, and proceeding from there. Since you’ve provided only the relative effect, it seems implicit that you are interested in rating certainty in a non-zero effect, and I won’t proceed farther in the MID direction here.
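Had you wished to pursue the MID route, the arithmetic would look like the sketch below, in Python. The 10% baseline mortality risk is an assumption made purely for illustration; only the relative risk and its confidence interval come from your question.

# Relative effect reported in the question
rr_point, rr_low, rr_high = 0.88, 0.75, 1.13

# Assumed baseline (control group) mortality risk; replace with a value chosen from the context
baseline_risk = 0.10

def fewer_per_1000(rr, baseline=baseline_risk):
    # Positive values mean fewer deaths per 1000 with the intervention, negative values mean more
    return round((baseline - baseline * rr) * 1000)

print("Point estimate:", fewer_per_1000(rr_point), "fewer deaths per 1000")
print("95% CI: from", fewer_per_1000(rr_low), "fewer to", -fewer_per_1000(rr_high), "more deaths per 1000")

With this assumed baseline risk, the interval runs from 25 fewer to 13 more deaths per 1000. One would then ask whether that interval crosses the chosen MID and rate down for imprecision against that threshold rather than against the null.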
