D. Strengths and Limitations of RCTs

For demonstrating the internal validity of a causal relationship between an intervention and one or more outcomes of interest, the well-designed, blinded (where feasible), appropriately powered, well-conducted, and properly reported RCT has dominant advantages over other study designs. Among these advantages, the RCT minimizes selection bias because randomization gives every enrolled patient the same probability of being assigned to the intervention group or the control group. Randomization also minimizes the potential impact of known and unknown confounding factors (e.g., risk factors present at baseline), because it tends to distribute such confounders evenly across the groups being compared.
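
The balancing effect of randomization can be seen in a simple simulation. The following sketch uses illustrative assumptions only (1,000 patients, a baseline risk factor present in 30% of them, 1:1 random assignment); it shows that the risk factor ends up at roughly the same prevalence in both arms even though no one measures or adjusts for it.

```python
# Minimal simulation: randomization tends to balance an unmeasured
# baseline confounder across arms. All numbers are illustrative.
import random

random.seed(42)

def simulate_trial(n_patients=1000):
    """Randomize patients 1:1 and compare the prevalence of a baseline
    risk factor (present in ~30% of patients) between the two arms."""
    arms = {"intervention": [], "control": []}
    for _ in range(n_patients):
        has_risk_factor = random.random() < 0.30  # unknown confounder
        arm = random.choice(["intervention", "control"])
        arms[arm].append(has_risk_factor)
    return {name: sum(group) / len(group) for name, group in arms.items()}

print(simulate_trial())
# Typical output: prevalences near 0.30 in both arms, without any
# explicit matching or adjustment.
```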

When the sample size of an RCT is calculated to achieve sufficient statistical power, the probability that an observed treatment effect is due to random error is minimized. Further, especially with larger groups, randomization enables patient subgroup comparisons between intervention and control groups. The primacy of the RCT remains even in an era of genomic testing, expanding use of biomarkers to better target selection of patients for adaptive clinical trials of new drugs and biologics, and advances in computer-based modeling that may replicate certain aspects of RCTs (Ioannidis 2013).
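
As a hedged illustration of such a power calculation, the sketch below uses the standard normal-approximation formula for comparing two proportions; the event rates (15% vs. 10%), significance level, and power are assumed values chosen for the example, not figures from the text.

```python
# Approximate sample size per arm for a two-sided two-proportion test
# (normal approximation). Parameters here are illustrative assumptions.
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Sample size per arm to detect a difference between event rates
    p1 and p2 at the given two-sided alpha and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p1 - p2) ** 2

# Detecting a 15% vs. 10% event-rate difference at 80% power:
print(round(n_per_group(0.15, 0.10)))  # ~686 patients per arm
```

Note how quickly the required sample grows as the anticipated effect shrinks, which is the cost driver discussed later in this section.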

Box III-5. Jadad Instrument to Assess the Quality of RCT Reports

This is not the same as being asked to review a paper. It should not take more than 10 minutes to score a report and there are no right or wrong answers.

Please read the article and try to answer the following questions:

  1. Was the study described as randomized (this includes the use of words such as randomly, random, and randomization)?
  2. Was the study described as double blind?
  3. Was there a description of withdrawals and dropouts?

Scoring the items:

Either give a score of 1 point for each “yes” or 0 points for each “no.” There are no in-between marks.

Give 1 additional point if: For question 1, the method to generate the sequence of randomization was described and it was appropriate (table of random numbers, computer generated, etc.)

and/or: If for question 2, the method of double blinding was described and it was appropriate (identical placebo, active placebo, dummy, etc.)

Deduct 1 point if: For question 1, the method to generate the sequence of randomization was described and it was inappropriate (patients were allocated alternately, or according to date of birth, hospital number, etc.)

and/or: for question 2, the study was described as double blind but the method of blinding was inappropriate (e.g., comparison of tablet vs. injection with no double dummy)

Guidelines for Assessment

1. Randomization: A method to generate the sequence of randomization will be regarded as appropriate if it allowed each study participant to have the same chance of receiving each intervention and the investigators could not predict which treatment was next. Methods of allocation using date of birth, date of admission, hospital numbers, or alternation should not be regarded as appropriate.

2. Double blinding: A study must be regarded as double blind if the word “double blind” is used. The method will be regarded as appropriate if it is stated that neither the person doing the assessments nor the study participant could identify the intervention being assessed, or if in the absence of such a statement the use of active placebos, identical placebos, or dummies is mentioned.

3. Withdrawals and dropouts: Participants who were included in the study but did not complete the observation period or who were not included in the analysis must be described. The number and the reasons for withdrawal in each group must be stated. If there were no withdrawals, it should be stated in the article. If there is no statement on withdrawals, this item must be given no points.

Reprinted from: Jadad AR, et al. Assessing the quality of reports of randomized clinical trials: Is blinding necessary? Control Clin Trials. 1996;17:1-12, Copyright © (1996) with permission from Elsevier.
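
For readers who tabulate Jadad scores across many reports, the rules in Box III-5 reduce to simple arithmetic on a 0 to 5 scale. The sketch below is one possible encoding; the function and argument names are hypothetical, and the rules follow the box above.

```python
# Hypothetical encoding of the Jadad scoring rules from Box III-5.
def jadad_score(randomized, double_blind, withdrawals_described,
                randomization_method="not described",
                blinding_method="not described"):
    """Return a Jadad score (0-5). The method arguments take
    'appropriate', 'inappropriate', or 'not described'."""
    score = int(randomized) + int(double_blind) + int(withdrawals_described)
    if randomized:
        if randomization_method == "appropriate":
            score += 1  # e.g., table of random numbers, computer generated
        elif randomization_method == "inappropriate":
            score -= 1  # e.g., alternation, date of birth, hospital number
    if double_blind:
        if blinding_method == "appropriate":
            score += 1  # e.g., identical placebo, double dummy
        elif blinding_method == "inappropriate":
            score -= 1  # e.g., tablet vs. injection with no double dummy
    # Deductions apply only where the base point was awarded,
    # so the total stays within 0-5.
    return score

print(jadad_score(True, True, True,
                  randomization_method="appropriate",
                  blinding_method="appropriate"))  # 5
```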

Box III-6. The Cochrane Collaboration’s Tool for Assessing Risk of Bias

Selection bias.

Domain: Random sequence generation.
Support for judgement: Describe the method used to generate the allocation sequence in sufficient detail to allow an assessment of whether it should produce comparable groups.
Review authors’ judgement: Selection bias (biased allocation to interventions) due to inadequate generation of a randomised sequence.

Domain: Allocation concealment.
Support for judgement: Describe the method used to conceal the allocation sequence in sufficient detail to determine whether intervention allocations could have been foreseen in advance of, or during, enrolment.
Review authors’ judgement: Selection bias (biased allocation to interventions) due to inadequate concealment of allocations prior to assignment.

Performance bias

Domain: Blinding of participants and personnel. Assessments should be made for each main outcome (or class of outcomes).
Support for judgement: Describe all measures used, if any, to blind study participants and personnel from knowledge of which intervention a participant received. Provide any information relating to whether the intended blinding was effective.
Review authors’ judgement: Performance bias due to knowledge of the allocated interventions by participants and personnel during the study.

Detection bias.

Domain: Blinding of outcome assessment. Assessments should be made for each main outcome (or class of outcomes).
Support for judgement: Describe all measures used, if any, to blind outcome assessors from knowledge of which intervention a participant received. Provide any information relating to whether the intended blinding was effective.
Review authors’ judgement: Detection bias due to knowledge of the allocated interventions by outcome assessors.

Attrition bias.

Domain: Incomplete outcome data. Assessments should be made for each main outcome (or class of outcomes).
Support for judgement: Describe the completeness of outcome data for each main outcome, including attrition and exclusions from the analysis. State whether attrition and exclusions were reported, the numbers in each intervention group (compared with total randomized participants), reasons for attrition/exclusions where reported, and any re-inclusions in analyses performed by the review authors.
Review authors’ judgement: Attrition bias due to amount, nature or handling of incomplete outcome data.

Reporting bias.

Domain: Selective reporting.
Support for judgement: State how the possibility of selective outcome reporting was examined by the review authors, and what was found.
Review authors’ judgement: Reporting bias due to selective outcome reporting.

Other bias.

Domain: Other sources of bias.
Support for judgement: State any important concerns about bias not addressed in the other domains in the tool. If particular questions/entries were pre-specified in the review’s protocol, responses should be provided for each question/entry.
Review authors’ judgement: Bias due to problems not covered elsewhere in the table.

Reprinted with permission: Higgins JPT, Altman DG, Sterne JAC, eds. Chapter 8: Assessing risk of bias in included studies. In: Higgins JPT, Green S, eds. Cochrane Handbook for Systematic Reviews of Interventions Version 5.1.0 [updated March 2011]. The Cochrane Collaboration, 2011.

Box III-7. Criteria for Assessing Internal Validity of Individual Studies: Randomized Controlled Trials and Cohort Studies, USPSTF

Criteria:

  • Initial assembly of comparable groups:
    • For RCTs: adequate randomization, including first concealment and whether potential confounders were distributed equally among groups.
    • For cohort studies: consideration of potential confounders with either restriction or measurement for adjustment in the analysis; consideration of inception cohorts.
  • Maintenance of comparable groups (includes attrition, cross-overs, adherence, contamination).
  • Important differential loss to follow-up or overall high loss to follow-up.
  • Measurements: equal, reliable, and valid (includes masking of outcome assessment).
  • Clear definition of interventions.
  • All important outcomes considered.
  • Analysis: adjustment for potential confounders for cohort studies, or intention to treat analysis for RCTs.

Definitions of ratings based on above criteria:

Good: Meets all criteria: Comparable groups are assembled initially and maintained throughout the study (follow-up at least 80 percent); reliable and valid measurement instruments are used and applied equally to the groups; interventions are spelled out clearly; all important outcomes are considered; and appropriate attention to confounders in analysis. In addition, for RCTs, intention to treat analysis is used.

Fair: Studies will be graded “fair” if any or all of the following problems occur, without the fatal flaws noted in the "poor" category below: Generally comparable groups are assembled initially but some question remains whether some (although not major) differences occurred with follow-up; measurement instruments are acceptable (although not the best) and generally applied equally; some but not all important outcomes are considered; and some but not all potential confounders are accounted for. Intention to treat analysis is done for RCTs.

Poor: Studies will be graded “poor” if any of the following fatal flaws exists: Groups assembled initially are not close to being comparable or maintained throughout the study; unreliable or invalid measurement instruments are used or not applied at all equally among groups (including not masking outcome assessment); and key confounders are given little or no attention. For RCTs, intention to treat analysis is lacking.

Source: US Preventive Services Task Force Procedure Manual. AHRQ Pub. No. 08-05118-EF, July 2008.

Box III-8. Criteria for Assessing Internal Validity of Individual Studies: Diagnostic Accuracy Studies, USPSTF

Criteria:

  • Screening test relevant, available for primary care, adequately described.
  • Study uses a credible reference standard, performed regardless of test results.
  • Reference standard interpreted independently of screening test.
  • Handles indeterminate results in a reasonable manner.
  • Spectrum of patients included in study.
  • Sample size.
  • Administration of reliable screening test.

Definitions of ratings based on above criteria:

Good: Evaluates relevant available screening test; uses a credible reference standard; interprets reference standard independently of screening test; reliability of test assessed; has few or handles indeterminate results in a reasonable manner; includes large number (more than 100) broad-spectrum patients with and without disease.

Fair: Evaluates relevant available screening test; uses reasonable although not best standard; interprets reference standard independent of screening test; moderate sample size (50 to 100 subjects) and a "medium" spectrum of patients.

Poor: Has fatal flaw such as: Uses inappropriate reference standard; screening test improperly administered; biased ascertainment of reference standard; very small sample size or very narrow selected spectrum of patients.

Source: US Preventive Services Task Force Procedure Manual. AHRQ Pub. No. 08-05118-EF, July 2008.

Box III-9. Global Rating of External Validity (Generalizability) of Individual Studies, US Preventive Services Task Force

External validity is rated "good" if:

  • The study differs minimally from the US primary care population/situation/providers and only in ways that are unlikely to affect the outcome; it is highly probable (>90%) that the clinical experience with the intervention observed in the study will be attained in the US primary care setting.

External validity is rated "fair" if:

  • The study differs from the US primary care population/situation/providers in a few ways that have the potential to affect the outcome in a clinically important way; it is only moderately probable (50%-89%) that the clinical experience with the intervention in the study will be attained in the US primary care setting.

External validity is rated "poor" if:

  • The study differs from the US primary care population/situation/providers in many ways that have a high likelihood of affecting the clinical outcomes; the probability is low (<50%) that the clinical experience with the intervention observed in the study will be attained in the US primary care setting.

Source: US Preventive Services Task Force Procedure Manual. AHRQ Pub. No. 08-05118-EF, July 2008.
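
The three USPSTF bands in Box III-9 map directly onto probability thresholds. A minimal sketch of that mapping follows; the function name is hypothetical, and note that the published bands (>90%, 50%-89%, <50%) leave exactly 90% unassigned, which this sketch groups with "fair".

```python
# Hypothetical mapping of the USPSTF probability bands in Box III-9
# to a global external-validity rating.
def external_validity_rating(prob_attained_in_us_primary_care):
    """Map the probability (0-1) that the study's clinical experience
    would be attained in US primary care to good/fair/poor."""
    if prob_attained_in_us_primary_care > 0.90:
        return "good"
    elif prob_attained_in_us_primary_care >= 0.50:
        return "fair"  # note: exactly 0.90 falls here (gap in the bands)
    return "poor"

print(external_validity_rating(0.95))  # good
print(external_validity_rating(0.70))  # fair
print(external_validity_rating(0.30))  # poor
```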

Box III-10. Characteristics of Individual Studies That May Affect Applicability (AHRQ)

Population

  • Narrow eligibility criteria and exclusion of those with comorbidities
  • Large differences between demographics of study population and community patients
  • Narrow or unrepresentative severity, stage of illness, or comorbidities
  • Run-in period with high exclusion rate for non-adherence or side effects
  • Event rates much higher or lower than observed in population-based studies

Intervention

  • Doses or schedules not reflected in current practice
  • Intensity and delivery of behavioral interventions that may not be feasible for routine use
  • Monitoring practices or visit frequency not used in typical practice
  • Older versions of an intervention no longer in common use
  • Co-interventions that are likely to modify effectiveness of therapy
  • Highly selected intervention team or level of training/proficiency not widely available

Comparator

  • Inadequate dose of comparison therapy
  • Use of substandard alternative therapy

Outcomes

  • Composite outcomes that mix outcomes of different significance
  • Short-term or surrogate outcomes

Setting

  • Standards of care differ markedly from setting of interest
  • Specialty population or level of care differs from that seen in community

Source: Atkins D, et al. Chapter 6. Assessing the Applicability of Studies When Comparing Medical Interventions. In: Methods Guide for Effectiveness and Comparative Effectiveness Reviews. AHRQ Publication No. 10(12)-EHC063-EF. Rockville, MD: Agency for Healthcare Research and Quality. September 2013.

As described below, despite its advantages for demonstrating the internal validity of causal relationships, the RCT is not the best study design for all evidence questions. Like all methods, RCTs have limitations, particularly regarding external validity. The relevance or impact of these limitations varies according to the purposes and circumstances of a study. To help inform health care decisions in real-world practice, evidence from RCTs and other experimental study designs should be augmented by evidence from other types of studies. These and related issues are described below.

RCTs can cost tens or hundreds of millions of dollars, exceeding $1 billion in some instances. Costs can be particularly high for phase III trials of drugs and biologics conducted to gain market approval by regulatory agencies. These include the costs of usual care plus the additional costs of conducting research. Usual care costs include those for, e.g., physician visits, hospital stays, laboratory tests, radiology procedures, and standard medications, which are typically covered by third-party payers. Research-only costs (which would not otherwise occur for usual care) include patient enrollment and related management; investigational technologies; additional tests and procedures done for research purposes; additional time by clinical investigators; data infrastructure, management, collection, analysis, and reporting; and regulatory compliance and reporting (DiMasi 2003; Morgan 2011; Roy 2012). Costs are higher for trials with large numbers of enrollees, large numbers of primary and secondary endpoints (requiring more data collection and analysis), and longer duration. Costs are generally high for trials designed to detect treatment effects that are anticipated to be small (therefore requiring large sample sizes to achieve statistical significance) or that require extended follow-up to detect differences in, e.g., survival and certain health events.

A clinical trial is the best way to assess whether an intervention works, but it is arguably the worst way to assess who will benefit from it (Mant 1999).

Most RCTs are designed to investigate the effects of a uniformly delivered intervention in a specific type of patient in specific circumstances. This helps to ensure that any observed difference in outcomes between the investigational treatment and comparator is less likely to be confounded by variations in the patient groups compared, the mode of delivering the intervention, other previous and current treatments, health care settings, and other factors. However, while this approach strengthens internal validity, it can weaken external validity.

Patients who enroll in an RCT are typically subject to inclusion and exclusion criteria pertaining to, e.g., age, comorbidities, other risk factors, and previous and current treatments. These criteria tend to yield homogeneous patient groups that may not represent the diversity of patients who would receive the interventions in real practice. RCTs often involve special protocols of care and testing that may not be characteristic of general care, and are often conducted in university medical centers or other special settings. Findings from these RCTs may not be applicable to other practice settings, given variations in how the intervention is delivered.

When RCTs are conducted to generate sufficient evidence for gaining market approval or clearance, they are sometimes known as “efficacy trials” in that they may establish only short-term efficacy (rather than effectiveness) and safety in a narrowly selected group of patients. Given the patient composition and the choice of comparator, results from these RCTs can overstate how well a technology works as well as under-represent the diversity of the population that will ultimately use the technology.

Given the high costs of RCTs and sponsors’ incentives to generate findings, such as to gain market approval for regulated technologies, these trials may be too small (i.e., have insufficient statistical power) or too short in duration to detect rare or delayed outcomes, including adverse events, and other unintended impacts. On the other hand, even in large, long-term RCTs (as well as other large studies), an observed statistically significant difference in adverse events may arise from random error, or these events may simply happen to co-occur with the intervention rather than being caused by it (Rawlins 2008). As such, the results from RCTs may be misleading or insufficiently informative for clinicians, patients, and payers who make decisions pertaining to more heterogeneous patients and care settings.

Given their resource constraints and use to gain market approval for regulated technologies, RCTs may be designed to focus on a small number of outcomes, especially shorter-term intermediate endpoints or surrogate endpoints rather than ultimate endpoints such as mortality, morbidity, or quality of life. As such, findings from these RCTs may be of limited use to clinicians and patients. Of course, the use of validated surrogate endpoints is appropriate in many instances, including when the health impact of interventions for some health care conditions will not be realized for years or decades, e.g., screening for certain cancers, prevention of risky health behaviors, and management of hypertension and dyslipidemia to prevent strokes and myocardial infarction in certain patient groups.

RCTs are traditionally designed to test a null hypothesis, i.e., the assumption by investigators that there is no difference between intervention and control groups. This assumption often does not hold, for several reasons. Among these, the assumption may be unrealistic when findings of other trials (including phase II trials for drugs and biologics) of the same technology have detected a treatment effect. Further, it is relevant only when the trial is designed to determine whether one intervention is better than another, in contrast to whether they can be considered equivalent or whether one is no worse than the other (Rawlins 2008). Testing of an “honest” null hypothesis in an RCT is consistent with the principle of equipoise, which refers to a presumed state of uncertainty regarding whether any one of alternative health care interventions will confer more favorable outcomes, including the balance of benefits and harms (Freedman 1987). However, there is controversy regarding whether this principle is realistic and even whether it is always ethical (Djulbegovic 2009; Fries 2004; Veatch 2007).

RCTs depend on principles of probability theory whose validity may be diminished in health care research, including certain aspects of the use of p-values and of multiplicity. Multiplicity refers to repeated tests of statistical significance within the same trial: analyses of numerous endpoints in the same data set, stopping rules that involve “multiple looks” at data emerging from the trial, and analyses of numerous patient subgroups. Each of these involves iterative (repeated) tests of statistical significance based on conventional p-value thresholds (e.g., <0.05). Such iterative testing is increasingly likely to produce at least one false-positive finding, whether for an endpoint, a decision to stop a trial, or a patient subgroup in which there appears to be a statistically significant treatment effect (Rawlins 2008; Wang 2007).
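
The arithmetic behind this multiplicity problem is straightforward: for k independent tests at significance level α, the probability of at least one false positive is 1 − (1 − α)^k. The short sketch below computes this family-wise error rate and, for comparison, the Bonferroni-corrected per-test threshold (one common, if conservative, remedy; the choice of correction is illustrative, not prescribed by the text).

```python
# Family-wise error rate across repeated significance tests, and the
# Bonferroni-corrected per-test threshold as one common remedy.
def familywise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 10, 20):
    print(f"{k:>2} tests: FWER = {familywise_error_rate(k):.2f}, "
          f"Bonferroni threshold = {0.05 / k:.4f}")
# Output:
#  1 tests: FWER = 0.05, Bonferroni threshold = 0.0500
#  5 tests: FWER = 0.23, Bonferroni threshold = 0.0100
# 10 tests: FWER = 0.40, Bonferroni threshold = 0.0050
# 20 tests: FWER = 0.64, Bonferroni threshold = 0.0025
```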

Using a p-value threshold (e.g., p<0.01 or p<0.05) as the basis for accepting a treatment effect can be misleading. There remains a chance (e.g., 1% or 5%) that the observed difference is due to random error. Also, a statistically significant difference detected with a large sample size may have no clinical significance. On the other hand, a finding of no statistical significance (e.g., p>0.01 or p>0.05) does not prove the absence of a treatment effect; among other reasons, the sample size of the RCT may have been too small to detect a true treatment effect. The reliance of most RCTs on p-values, particularly the notion that the probability that a conclusion is in error can be determined from the data in a single trial, ignores evidence from other sources and the plausibility of the underlying cause-and-effect mechanism (Goodman 2008).
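
The gap between statistical and clinical significance can be made concrete with a two-proportion z-test. In the sketch below, the event rates (10.0% vs. 10.5%, a clinically trivial half-percentage-point difference) and sample sizes are assumed values; the same difference moves from clearly non-significant to “highly significant” purely as the sample grows.

```python
# A clinically trivial difference becomes "statistically significant"
# once the sample is large enough. Rates here are illustrative.
from statistics import NormalDist

def two_proportion_p_value(p1, p2, n):
    """Two-sided z-test p-value for observed event rates p1 vs. p2,
    with n participants per group."""
    p_pool = (p1 + p2) / 2
    se = (2 * p_pool * (1 - p_pool) / n) ** 0.5
    z = abs(p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(z))

# A 0.5 percentage-point difference (10.0% vs. 10.5%):
for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7,} per arm: p = {two_proportion_p_value(0.105, 0.100, n):.4f}")
# n =   1,000 per arm: p ≈ 0.71
# n =  10,000 per arm: p ≈ 0.24
# n = 100,000 per arm: p ≈ 0.0002 (statistically significant, clinically trivial)
```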

As noted below, other study designs are preferred for many types of evidence questions, even in some instances when the purpose is to determine the causal effect of a technology. For investigating technologies for treating rare diseases, the RCT may be impractical for enrolling and randomizing sufficient numbers of patients to achieve the statistical power to detect treatment effects. On the other hand, RCTs may be unnecessary for detecting very large treatment effects, especially where patient prognosis is well established and historical controls suffice.

To conduct an RCT may be judged unethical in some circumstances, such as when patients have a largely fatal condition for which no effective therapy exists. Use of a placebo control alone can be unethical when an effective standard of care exists and withholding it poses great health risk to patients, such as for HIV/AIDS prevention and therapy and certain cancer treatments. RCTs that are underpowered (i.e., with sample sizes too small to reliably detect a true treatment effect) can yield overestimated treatment effects and poor reproducibility of results, thereby raising ethical concerns about wasted resources and patients’ commitments (Button 2013).
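
The effect-inflation concern raised by Button (2013) can be illustrated by simulation. The sketch below uses assumed parameters throughout (a true standardized effect of 0.2, 20 patients per arm, many repeated trials): it selects only the trials reaching p < 0.05 and shows that their average estimated effect substantially exceeds the true effect.

```python
# "Winner's curse" in underpowered trials: among trials that reach
# p < 0.05, the reported effect overstates the true effect.
# All parameters are illustrative assumptions.
import random
from statistics import NormalDist, mean

random.seed(0)

def significant_effect_estimates(true_effect=0.2, n=20, trials=20_000):
    """Simulate small two-arm trials of a continuous outcome (SD = 1)
    and collect the estimated effects from trials reaching p < 0.05."""
    z_crit = NormalDist().inv_cdf(0.975)
    se = (2 / n) ** 0.5                 # SE of the difference in means
    estimates = []
    for _ in range(trials):
        estimate = random.gauss(true_effect, se)
        if abs(estimate) / se > z_crit:  # "statistically significant"
            estimates.append(estimate)
    return estimates

hits = significant_effect_estimates()
print(f"true effect: 0.20, mean 'significant' estimate: {mean(hits):.2f}")
# Typical output: a mean "significant" estimate around 0.7, i.e., a
# several-fold exaggeration of the true effect.
```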
