Effect size varies based on calculation method and may affect interpretation of treatment effect: an illustration using randomised clinical trials in osteoarthritis

Background To illustrate how (standardised) effect sizes (ES) vary based on calculation method and to provide considerations for improved reporting. Methods Data from three trials of tanezumab in subjects with osteoarthritis were analyzed. ES of tanezumab versus comparator for WOMAC Pain (outcome) was defined as least squares difference between means (mixed model for repeated measures analysis) divided by a pooled standard deviation (SD) of outcome scores. Three approaches to computing the SD were evaluated: Baseline (the pooled SD of WOMAC Pain values at baseline [pooled across treatments]); Endpoint (the pooled SD of these values at the time primary endpoints were assessed); and Median (the median pooled SD of these values based on the pooled SDs across available timepoints). Bootstrap analyses were used to compute 95% confidence intervals (CI). Results ES (95% CI) of tanezumab 2.5 mg based on Baseline, Endpoint, and Median SDs in one study were −0.416 (−0.796, −0.060), −0.195 (−0.371, −0.028), and −0.196 (−0.373, −0.028), respectively; negative values indicate pain improvement. This pattern of ES differences (largest with Baseline SD, smallest with Endpoint SD, Median SD similar to Endpoint SD) was consistent across all studies and doses of tanezumab. Conclusion Differences in ES affect interpretation of treatment effect. Therefore, we advocate clearly reporting individual elements of ES in addition to its overall calculation. This is particularly important when ES estimates are used to determine sample sizes for clinical trials, as larger ES will lead to smaller sample sizes and potentially underpowered studies.


Background
Effect sizes (ES) provide information about the magnitude of differences between groups in interventional studies [1,2]. While treatment differences should be based primarily on the original metric of the outcome (e.g., difference in mean scores between two treatments), the ES when standardised and expressed in standard deviation units can lend further interpretation to the magnitude of effect. Standardised ES are also used to calculate sample sizes for studies and to support comparisons of effects across studies [3,4]. Comparing standardised ES across interventions or studies, however, must be done with caution as ES may vary depending on study design, outcome measures, and approach to calculation of the standard deviation (SD) [5].
The (standardised) ES metric for a parallel-group clinical trial is defined as the difference in mean scores between two treatments (numerator) divided by the SD of these two treatments (denominator) [6]. However, there are different approaches to defining the SD to be used when computing ES. Therefore, it is of interest to assess the impact of different approaches to defining the SD on ES using data from well-controlled clinical studies. Here, we report results from three phase 3 trials of tanezumab, an antibody to nerve growth factor, in participants with painful knee and hip osteoarthritis. We focus on the ES for the pain response, as it is the outcome most commonly evaluated.
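As a notational sketch of the definition above, the ES and one common form of pooled SD can be written as follows. The symbols (subscripts T and C for the two treatment arms, n for arm sizes, s for per-arm SDs) are introduced here for illustration only, and the degrees-of-freedom-weighted pooled SD is the usual textbook form, assumed rather than taken from the trials' analysis plans.

```latex
\[
\mathrm{ES} = \frac{\bar{x}_{T} - \bar{x}_{C}}{SD_{\mathrm{pooled}}},
\qquad
SD_{\mathrm{pooled}} = \sqrt{\frac{(n_{T}-1)\,s_{T}^{2} + (n_{C}-1)\,s_{C}^{2}}{n_{T}+n_{C}-2}}
\]
```

The numerator is fixed by the analysis of the outcome; the approaches compared in this paper differ only in the timepoint from which the per-arm SDs in the denominator are taken.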
ES were defined as the least squares mean difference (from the mixed model for repeated measures [MMRM]) in each score divided by a pooled SD of the outcome scores. Three different approaches to computing the pooled SD (in the denominator of the ES) were used: the pooled SD of WOMAC Pain values at baseline (combined across treatments); the pooled SD of these values at the time when the primary endpoints were assessed (Week 16 for Studies 1 and 3, Week 24 for Study 2); and the median pooled SD of these values based on the pooled SDs across all available timepoints (baseline, intermediate post-baseline timepoints, and primary timepoint at the end of a trial). Specifically, the median pooled SD was computed as the median of pooled SD from baseline to Week 16 (Studies 1 and 3) or Week 24 (Study 2).
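The three denominators can be illustrated with a minimal Python sketch. It assumes the least squares mean difference has already been estimated elsewhere (no MMRM is fitted here), that the pooled SD takes the usual degrees-of-freedom-weighted form, and that the input contains only visits from baseline up to the primary timepoint. The function names, the `"baseline"` key, and the toy scores are hypothetical.

```python
import math
import statistics


def pooled_sd(arms):
    """Pooled SD across treatment arms: square root of the
    degrees-of-freedom-weighted average of the per-arm variances."""
    num = sum((len(a) - 1) * statistics.variance(a) for a in arms)
    den = sum(len(a) - 1 for a in arms)
    return math.sqrt(num / den)


def effect_sizes(ls_mean_diff, scores_by_visit, primary_visit):
    """Return the ES under the Baseline, Endpoint, and Median SD approaches.

    scores_by_visit maps a visit label to a list of per-arm score lists and
    must include a "baseline" entry (an assumption of this sketch).
    """
    pooled = {visit: pooled_sd(arms) for visit, arms in scores_by_visit.items()}
    denominators = {
        "baseline": pooled["baseline"],
        "endpoint": pooled[primary_visit],
        "median": statistics.median(pooled.values()),
    }
    return {name: ls_mean_diff / sd for name, sd in denominators.items()}


# Toy data (not trial data): two arms, three visits, spread growing over time.
toy_scores = {
    "baseline": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],
    "week8": [[0.0, 2.0, 4.0], [1.0, 3.0, 5.0]],
    "week16": [[0.0, 3.0, 6.0], [1.0, 4.0, 7.0]],
}
es = effect_sizes(-0.6, toy_scores, primary_visit="week16")
print(es)
```

With the same numerator, the smallest SD (baseline in this toy example) yields the ES of largest magnitude, which is the mechanism examined in this paper.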
Given there is no convenient closed-form solution to derive standard errors and confidence intervals (CI) for ES statistics, the non-parametric bootstrap approach is recommended to compute a 95% CI for an ES and was applied to individual WOMAC Pain patient data [11]. One thousand data sets were sampled from individual patient WOMAC Pain data. The bootstrap was done at the patient level; if a patient was selected, all WOMAC Pain data (at all visits) for this patient were selected. The bootstrap was performed with replacement, using the same number of patients as the original sample. The bootstrap sample data set was used to compute pooled SDs. For each study, each treatment comparison, and each approach to calculating SD (baseline, endpoint, and median), the 95% CI (2.5th percentile, 97.5th percentile) of the ES was reported.
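The patient-level resampling described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the analysis code used for the trials: the numerator is a raw difference in mean endpoint scores standing in for the MMRM least squares mean difference, patients are resampled across both arms together (the text does not say whether resampling was stratified by arm), and the record layout and function names are hypothetical.

```python
import math
import random
import statistics


def pooled_sd(arms):
    """Degrees-of-freedom-weighted pooled SD across treatment arms."""
    num = sum((len(a) - 1) * statistics.variance(a) for a in arms)
    den = sum(len(a) - 1 for a in arms)
    return math.sqrt(num / den)


def es_at(patients, sd_visit, endpoint_visit):
    """ES = (active mean - control mean at endpoint) / pooled SD at sd_visit.

    Each patient is a dict: {"arm": "active" | "control", "scores": {visit: value}}.
    The raw mean difference is a simplification of the MMRM estimate.
    """
    active = [p["scores"] for p in patients if p["arm"] == "active"]
    control = [p["scores"] for p in patients if p["arm"] == "control"]
    diff = (statistics.mean(s[endpoint_visit] for s in active)
            - statistics.mean(s[endpoint_visit] for s in control))
    sd = pooled_sd([[s[sd_visit] for s in active],
                    [s[sd_visit] for s in control]])
    return diff / sd


def bootstrap_ci(patients, sd_visit, endpoint_visit, n_boot=1000, seed=0):
    """Percentile 95% CI from a patient-level bootstrap with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Selecting a patient carries along all of that patient's visits.
        sample = rng.choices(patients, k=len(patients))
        estimates.append(es_at(sample, sd_visit, endpoint_visit))
    # 39 cut points at 2.5% steps; the first and last bound the central 95%.
    cuts = statistics.quantiles(estimates, n=40, method="inclusive")
    return cuts[0], cuts[-1]
```

Calling `bootstrap_ci(patients, "baseline", "week16")` and `bootstrap_ci(patients, "week16", "week16")` on the same patient list would give the intervals for the Baseline and Endpoint approaches, respectively; the visit labels are placeholders.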

Standard deviations
The pooled baseline SDs were the smallest and the pooled SDs at the time when the primary endpoints were assessed were the largest for the WOMAC Pain endpoint in all studies. The SDs for the median of pooled SD were similar to those determined at the primary endpoint (Table 1). SDs across studies were comparable (Table 1).

Discussion
Different approaches to calculating pooled SD affect the magnitude of ES, which in turn affects interpretation of treatment effect and complicates comparisons across different studies. Our results showed that ES derived from pooled SDs at the time when the primary endpoints were assessed and ES derived from the median pooled SDs (from baseline to the time when the primary endpoints were assessed) were similar for all endpoint comparisons in all three studies. However, ES derived from pooled SDs at baseline were larger than the ES derived from the other two SDs for all endpoint comparisons in all studies.
All three approaches to calculating SD attempt to estimate the "true" variability of the measured outcome in the sample. Use of only baseline data for the SD represents natural variability in the sample, which is not affected by introduction of a treatment (assuming the outcome was not an entry criterion). SDs at the primary endpoint are calculated by pooling data by treatment and, thus, effectively exclude the treatment effect (as the pooled SD is based on a weighted average of each treatment's SD of scores rather than an overall SD of scores from both treatment groups lumped together as one group; see Supplementary Text 1 for more detail). Using the median SD from the set of pooled SDs represents an attempt to use a representative value of variability.
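The distinction between a pooled SD and an overall SD computed on both arms lumped together can be seen in a small Python example. The scores are toy values, and the degrees-of-freedom-weighted pooled SD is assumed; the exact formula used in the trials is given in the supplementary material, which is not reproduced here.

```python
import math
import statistics


def pooled_sd(arms):
    """Degrees-of-freedom-weighted pooled SD across treatment arms."""
    num = sum((len(a) - 1) * statistics.variance(a) for a in arms)
    den = sum(len(a) - 1 for a in arms)
    return math.sqrt(num / den)


# Toy endpoint scores: identical spread in each arm, means shifted by 2 points.
control = [4.0, 5.0, 6.0, 7.0, 8.0]
active = [x - 2.0 for x in control]

pooled = pooled_sd([control, active])        # within-arm variability only
lumped = statistics.stdev(control + active)  # also absorbs the mean shift

print(f"pooled SD = {pooled:.3f}, lumped SD = {lumped:.3f}")
```

Because the pooled SD averages within-arm variances, the 2-point shift between arms does not inflate it, whereas the lumped SD grows with the size of the treatment difference.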
For patient-reported outcome studies, ES using baseline SD or SD of individual changes are typically used for within-group pre- versus post-intervention comparisons. For ES comparison between treatment groups, the pooled SD from scores of the treatment groups at baseline, pooled SD from scores of the treatment groups at time of post-treatment assessment, or pooled SD from scores of individual changes (when mean change from baseline is the outcome) have been applied [5,12]. For a clinical trial where the outcome measures also serve as inclusion/exclusion criteria, the population studied at baseline will not represent an unbiased sample. Indeed, the goal of entry criteria is to define a homogeneous population, and it is expected that baseline SD values will be smaller. Furthermore, since response to treatment varies across individuals, SDs based on data after treatment initiation will likely be confounded by effects of treatment and time. Therefore, pooled SD at baseline and pooled SD at post-treatment assessment could be different, which would lead to the differences in ES presented here. Different factors have been shown to have an impact on the ES of scores in randomised controlled trials [13]. However, our analyses have shown that the methods used to calculate the SD directly affect the calculated ES. ES derived from baseline SD tend to be more optimistic (i.e., larger) than ES derived from SD post-treatment. It is noteworthy that the commonly used Cohen thresholds (in which an ES < 0.20 indicates a trivial effect, while a small, moderate, large, or very large effect is represented by an ES of ≥ 0.20 and < 0.50, ≥ 0.50 and < 0.80, ≥ 0.80 and < 1.30, or ≥ 1.30, respectively [14]) were developed for use in the social sciences and are based on Cohen's d, for which, when gauging the magnitude of the difference in means between treatment groups, the pooled SD of scores (pooled across treatments) is taken from the same time as when the means are assessed. In contrast, the Cochrane Handbook recommends using the SD from the pooled outcome data (known as Hedges' g). Thus, when describing the magnitude of an ES, and particularly when comparing across different studies and interventions, it is essential to describe how the SD was determined in order to make appropriate comparisons. This is of even greater importance when using ES estimates to determine sample sizes for clinical trials, as larger effect sizes will lead to smaller sample sizes for equivalent power and may lead to underpowered studies.
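The link between ES and sample size can be made concrete with a short Python sketch. It uses the standard normal-approximation formula for a two-sided, two-sample comparison of means with equal allocation, n per arm = 2(z_{1−α/2} + z_{1−β})² / ES²; the two-sided α of 0.05 and power of 0.80 are assumptions chosen for illustration, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two-sided, two-sample comparison
    of means (normal approximation, equal allocation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


# Magnitudes of the Baseline-SD and Endpoint-SD ES reported for one study.
for es in (0.416, 0.195):
    print(f"|ES| = {es}: about {n_per_arm(es)} patients per arm")
```

Under these assumptions, planning with the Baseline-based magnitude (0.416) calls for roughly 91 patients per arm, whereas the Endpoint-based magnitude (0.195) calls for roughly 413. Because the required sample size scales with the inverse square of the ES, a trial sized on the larger value would be underpowered if the smaller value better reflects the effect.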
Generally, if an outcome scale was not used as part of a study's entry criteria, we recommend using baseline SD for calculations of ES in longitudinal studies since those SDs are not affected by treatments. If an outcome scale was part of the entry criteria or was highly similar to a measurement used as part of the study's entry criteria, then baseline SD will be artificially attenuated. In this case, we recommend using the largest pooled post-baseline SD measured at different time points across two (or more) treatment arms since it would lead to the smallest (most conservative) ES. However, ES based on pooled SDs at end of study can also be reported in sensitivity analyses.
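The recommendation above amounts to a simple selection rule, sketched here in Python. The function name, the `"baseline"` key, and the example SD values are hypothetical; the pooled SDs per visit are assumed to have been computed beforehand.

```python
def choose_sd(pooled_sd_by_visit, outcome_in_entry_criteria, baseline_key="baseline"):
    """Pick the SD for the ES denominator following the rule in the text.

    Returns (visit, sd). If the outcome (or a highly similar measure) was an
    entry criterion, the largest post-baseline pooled SD is used, which gives
    the smallest (most conservative) ES; otherwise the baseline SD is used.
    """
    if not outcome_in_entry_criteria:
        return baseline_key, pooled_sd_by_visit[baseline_key]
    post_baseline = {v: sd for v, sd in pooled_sd_by_visit.items()
                     if v != baseline_key}
    visit = max(post_baseline, key=post_baseline.get)
    return visit, post_baseline[visit]


# Toy pooled SDs (not trial data).
sds = {"baseline": 1.2, "week8": 2.4, "week16": 2.6}
print(choose_sd(sds, outcome_in_entry_criteria=False))
print(choose_sd(sds, outcome_in_entry_criteria=True))
```

An ES based on the end-of-study pooled SD can still be computed alongside the selected one for a sensitivity analysis.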

Conclusion
Standardisation of the method used to determine SD would allow researchers to more accurately compare the magnitude of treatment effects across studies, including when different measures are being used to assess the same concept of interest. In the absence of such standardisation, we advocate for reporting, in addition to ES, information about how the individual elements (e.g., means, SDs) were defined/calculated.

CI: confidence interval; ES: effect size; NSAID: nonsteroidal anti-inflammatory drug; SD: standard deviation; WOMAC: Western Ontario and McMaster Universities Osteoarthritis Index
* Week 16 for Studies 1 and 3, Week 24 for Study 2
Trial details have been published previously.

Table 1
Standard deviations used to calculate the ES

Table 2
ES of tanezumab on pain as measured by WOMAC Pain score based on bootstrap samples