1
Session 4
  • Analyses Adjusted for Stopping Rules
    • Reporting Results
    • Adjustment for Stopping Rules
  • Choice of Inferential Methods
    • Methods of Point Estimation
    • Orderings of the Outcome Space
    • Relative Advantages
  • Sensitivity to Poorly Specified Stopping Rules
    • Case Study I: Absence of Formal Stopping Rule
    • Case Study II: Unexpected Toxicities
  • Documentation of Design, Monitoring and Analysis
2
 
3
Reporting Results
  • At the end of the study analyze the data to provide


    • Estimate of the treatment effect
      • Single best estimate
      • Range of reasonable estimates

    • Decision of efficacy, equivalence, harm, or futility
      • Binary decision
      • Quantification of strength of evidence
4
Reporting Results
  • Methods of point estimation
    • Frequentist methods
      • Find estimates which minimize bias
      • Find estimates with minimal variance
      • Find estimates which minimize mean squared error


    • Bayesian methods
      • Use mean, median, or mode of posterior distribution of θ based on some prespecified prior

5
Reporting Results
  • Methods of point estimation (cont.)
    • Method of moments
      • Use a function of sample moments to estimate a function of moments of the sampling distribution
      • For example
        • if θ is the mean of the sampling distribution, use sample mean as an estimate of θ
        • if θ is the variance of the sampling distribution, use sample variance as an estimate of θ
6
Reporting Results
  • Methods of point estimation (cont.)
    • Maximum likelihood estimation
      • Find the value of θ such that the sampling density evaluated at the observed data is maximized
      • E.g., in one-sample inference about a normal mean, the density is maximized when θ equals the sample mean
7
Reporting Results
  • Methods of point estimation (cont.)
    • Median unbiased estimation
      • Assume that the observed statistic is the median of its sampling distribution
      • E.g., if observed T = t, then find θ such that Pr(T ≤ t; θ) = 0.5
8
Reporting Results
  • Methods of point estimation (cont.)
    • Bias adjusted
      • Assume that the observed statistic is the mean of its sampling distribution
      • E.g., if observed T = t, then find θ such that E(T; θ) = t
9
Reporting Results
  • Methods of point estimation (cont.)
    • Variance improvement for unbiased estimators
      • Use Rao-Blackwell improvement theorem to find expectation of unbiased estimate conditioned on sufficient statistic
      • E.g., for S unbiased and T sufficient, use E(S | T)
10
Reporting Results
  • Methods of interval estimation
    • Confidence interval
      • 100(1-α)% confidence interval for θ is (θL, θU) where, for observed T = t, Pr(T ≥ t; θL) = α/2 and Pr(T ≤ t; θU) = α/2

11
Reporting Results
  • Methods of interval estimation (cont.)
    • Bayesian methods


      • Use central 100(1-α)% of posterior distribution of θ based on some prespecified prior

12
Reporting Results
  • Criteria for decisions


    • Hypothesis tests
      • Reject hypothesis that θ = θ0 with a level α test if T > cα where Pr(T > cα; θ0) = α

13
Reporting Results
  • Criteria for decisions (cont.)


    • Bayesian Methods
      • Reject hypothesis that θ = θ0 based on posterior distribution, e.g.,

14
Reporting Results
  • Quantification of Evidence for Decisions
    • Hypothesis testing


      • P value



    • Bayesian Methods


      • Posterior probability

15
 
16
Adjustment for Stopping Rules
  • Fixed sample methods for testing and estimation are well developed


    • Many methods of point estimation yield same estimate (including Bayesian with noninformative prior)


    • Confidence intervals easily computed


    • Testing well developed


17
Adjustment for Stopping Rules
  • Stopping rule greatly affects sampling distribution for estimates of treatment effect
    • Data which lead to normally distributed sampling distributions under fixed sample testing lead to skewed, multimodal densities with jump discontinuities under sequential testing


    • Treatment effect is no longer a shift parameter


    • Exact shape of sampling distribution therefore depends upon stopping rule and alternative
18
Adjustment for Stopping Rules
  • Sampling Densities for Estimate of Treatment Effect
19
Adjustment for Stopping Rules
  • Failure to adjust estimates and P values for stopping rule is tantamount to repeated significance testing
      • P values will tend to be wrong
      • Estimates will tend to be biased toward the extremes
      • Confidence intervals will have the wrong coverage probabilities

  • (No effect on Bayesian analysis)
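The repeated significance testing problem can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the slides: standard normal observations with known variance, a true null effect, and four equally spaced analyses of 25 observations each, with an unadjusted two-sided level 0.05 test applied at every analysis.

```python
import random
from statistics import NormalDist

def naive_type_one_error(n_looks=4, n_per_look=25, n_sims=20000, alpha=0.05, seed=1):
    """Estimate P(reject at any look) under the null when every look uses an unadjusted test."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    rejections = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            # The sum of n_per_look independent N(0, 1) observations is N(0, n_per_look).
            total += rng.gauss(0.0, n_per_look ** 0.5)
            n += n_per_look
            if abs(total) / n ** 0.5 > z_crit:  # unadjusted fixed-sample Z test
                rejections += 1
                break  # the trial stops at the first "significant" result
    return rejections / n_sims

rate = naive_type_one_error()
print(f"Nominal level 0.05; simulated type I error with 4 unadjusted looks: {rate:.3f}")
```

With a single look the simulated rate should sit near the nominal 0.05; with several looks it is noticeably larger, which is why the unadjusted P value is wrong.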
20
Coverage probability of unadjusted CI
21
Adjustment for Stopping Rules
  • Frequentist inferential techniques can still be used, providing we can compute the sampling density for the test statistic under arbitrary choices for q


    • In these techniques, the stopping rule is just viewed as a sampling distribution


    • cf: binomial versus geometric sampling
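The binomial versus geometric comparison can be made concrete with a toy calculation. The data (five failures followed by one success) and the null value p = 0.5 are hypothetical choices for illustration. The observed sequence and its likelihood are identical under both sampling schemes, but the one-sided P value for "p is small" differs because the set of possible outcomes differs.

```python
from math import comb

def binomial_p_value(successes, n, p0):
    """P(S <= successes) when the number of trials n is fixed in advance."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(successes + 1))

def geometric_p_value(n_trials, p0):
    """P(N >= n_trials) when sampling continues until the first success."""
    # Needing at least n_trials trials means the first n_trials - 1 were all failures.
    return (1 - p0) ** (n_trials - 1)

p_binom = binomial_p_value(successes=1, n=6, p0=0.5)  # 7/64
p_geom = geometric_p_value(n_trials=6, p0=0.5)        # 1/32
print(f"binomial sampling:  {p_binom:.4f}")
print(f"geometric sampling: {p_geom:.4f}")
```

The stopping rule plays the same role in a group sequential trial: it changes the sampling distribution, so frequentist inference has to be computed under that distribution.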

22
Adjustment for Stopping Rules
  • P values adjusted for stopping rule


    • Probability of observing more extreme results under the null hypothesis


    • Compute sampling distribution of test statistic under the null


    • Requires a definition of “extreme” across analysis times
      • Ordering of the outcome space
23
Adjustment for Stopping Rules
  • Point estimates adjusted for stopping rule


    • Maximum likelihood estimate is unadjusted estimate
      • Generally biased
      • Tends to have large mean squared error

    • Find estimates that decrease the bias and mean squared error
24
Adjustment for Stopping Rules
  • Confidence interval adjusted for stopping rule


    • Based on duality of testing and CI


    • Exact coverage probability under normal probability model


    • Requires definition of an ordering of the outcome space
25
 
26
Methods of Point Estimation
  • Point estimates adjusted for stopping rule


    • Maximum likelihood estimate is unadjusted estimate
      • Generally biased
      • Large mean squared error
27
Methods of Point Estimation
  • Point estimates adjusted for stopping rule


    • Bias adjusted mean (Whitehead, 1986)
      • Assume observed outcome is mean of true distribution
      • Requires knowing number and timing of future analyses
      • Generally still biased
      • Often least mean squared error
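A minimal sketch of the bias adjusted mean for a binary endpoint, using the Pocock-type toxicity boundaries that appear later in these slides (Case Study 2, analyses every 5 patients up to 35). The helper names are hypothetical, the design is assumed to stop early only when the toxicity count reaches the boundary, and the bisection assumes the expected value of the MLE increases with p. Note that the full schedule of analyses is needed, as the slide says.

```python
from math import comb

def stopping_density(p, times, bounds):
    """P(stop at analysis j with s events), keyed by (j, s); stop when s >= bounds[j]."""
    density, alive, prev_n = {}, {0: 1.0}, 0
    for j, (n, bound) in enumerate(zip(times, bounds)):
        m = n - prev_n  # observations accrued since the previous analysis
        reached = {}
        for s, pr in alive.items():
            for k in range(m + 1):
                step = comb(m, k) * p**k * (1 - p)**(m - k)
                reached[s + k] = reached.get(s + k, 0.0) + pr * step
        alive = {}
        for s, pr in reached.items():
            if s >= bound or j == len(times) - 1:
                density[(j, s)] = pr  # terminal outcome
            else:
                alive[s] = pr         # trial continues
        prev_n = n
    return density

def expected_mle(p, times, bounds):
    """Expected value of the unadjusted estimate (events / patients at stopping)."""
    return sum(pr * s / times[j] for (j, s), pr in stopping_density(p, times, bounds).items())

def bias_adjusted_mean(observed, times, bounds, tol=1e-10):
    """Find p whose expected MLE equals the observed MLE (bisection, assumes monotonicity)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_mle(mid, times, bounds) < observed:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

TIMES = (5, 10, 15, 20, 25, 30, 35)
POCOCK = (4, 6, 8, 10, 11, 13, 14)  # "Poc" column of the Candidate Designs slide

for p in (0.2, 0.3, 0.4):
    print(f"true p = {p:.1f}, bias of MLE = {expected_mle(p, TIMES, POCOCK) - p:+.4f}")
bam = bias_adjusted_mean(4 / 5, TIMES, POCOCK)
print(f"bias adjusted mean for 4 / 5 at the first analysis: {bam:.3f}")
```

The printed value can be compared with the bias adjusted mean reported on the Clinical Trial Results slide for the same observed outcome.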
28
Methods of Point Estimation
  • Point estimates adjusted for stopping rule


    • Median unbiased estimate (Whitehead, 1984)
      • Assume observed outcome is median of true distribution
      • Requires an ordering of the outcome space
      • Some orderings require knowledge of number and timing of future analyses
      • Generally still biased for mean
29
Methods of Point Estimation
  • Point estimates adjusted for stopping rule


    • UMVUE-like estimate
      • Uses Rao-Blackwell improvement theorem
      • Unbiased for normal probability model
      • Does not require knowledge of number and timing of future analyses
30
 
31
Orderings of the Outcome Space
  • Ordering of the outcome space


    • Orderings of outcomes within an analysis time intuitive
      • Based on the value of Tj at that analysis


    • Need to define ordering between outcomes at successive analyses
      • How does sample mean of 3.5 at second analysis compare to sample mean of 3 at first analysis (when estimate more variable)?
32
Orderings of the Outcome Space
  • Analysis time ordering (Jennison and Turnbull, 1983; Tsiatis, Rosner, and Mehta, 1984)


    • Results leading to earlier stopping are more extreme
      • Linearizes the outcome space


      • Does not require knowledge of future analysis times


      • Not defined for two-sided tests with early stopping for both null and alternative
33
Orderings of the Outcome Space
  • Analysis time ordering


34
Orderings of the Outcome Space
  • Sample mean ordering (Duffy and Santner, 1987; Emerson and Fleming, 1990)


    • Consider only magnitude of sample mean
      • Requires knowledge of future analysis times


      • Tends to result in narrower CI and less biased median unbiased estimates
35
Orderings of the Outcome Space
  • Sample mean ordering contours
36
 
37
Relative Advantages
  • Properties of methods for inference


    • Point estimates differ in bias reduction, mean squared error


    • Confidence intervals differ in
      • average width of CI
      • inclusion of various point estimates
      • need for knowledge about future analyses


    • (ref: Emerson and Fleming, 1990)
38
Relative Advantages
  • Choice of methods for inference
    • Fixed sample tests
      • All frequentist methods described here agree with each other

    • Group sequential tests
      • No method is uniformly better
      • Usually fairly good agreement between various methods
      • Failure to agree can be informative regarding time trends in data
39
Relative Advantages
  • Point estimates: General tendencies for bias from least to most


    • (best)
    • UMVUE-like (in normal model)
    • Bias adjusted mean
    • Median unbiased with sample mean ordering
    • Median unbiased with analysis time ordering
    • Maximum likelihood estimate
    • (worst)
40
Relative Advantages
  • Point estimates: General tendencies for bias
    • (Figure panels: O’Brien-Fleming and Pocock designs)
41
Relative Advantages
  • Point estimates: General tendencies for mean squared error (MSE) from least to most


    • (best)
    • Bias adjusted mean
    • Median unbiased with sample mean ordering
    • UMVUE-like
    • Median unbiased with analysis time ordering
    • Maximum likelihood estimate
    • (worst)
42
Relative Advantages
  • Point estimates: General tendencies for (MSE)
    • (Figure panels: O’Brien-Fleming and Pocock designs)
43
Relative Advantages
  • Point estimates: Dependence on timing of future analyses


    • (None)
    • UMVUE-like
    • Median unbiased with analysis time ordering
    • Maximum likelihood estimate


    • (Some)
    • Bias adjusted mean
    • Median unbiased with sample mean ordering
44
Relative Advantages
  • Point estimates: Spectrum of group sequential designs for which defined


    • (All)
    • Bias adjusted mean
    • Median unbiased with sample mean ordering
    • UMVUE-like
    • Maximum likelihood estimate


    • (Not two-sided tests with stopping under both hypotheses)
    • Median unbiased with analysis time ordering
45
Relative Advantages
  • Interval estimates: General tendencies toward narrower confidence intervals


    • (Narrowest)
    • Sample mean ordering based
    • Analysis time ordering based
    • (Widest)


46
Relative Advantages
  • Interval estimates: Average length of confidence intervals
    • (Figure panels: O’Brien-Fleming and Pocock designs)
47
Relative Advantages
  • Interval estimates: Dependence on timing of future analyses


    • (None)
    • Analysis time ordering based


    • (Some)
    • Sample mean ordering based
48
Relative Advantages
  • Interval estimates: Coverage probability for CI using estimated schedule of analyses
    • (Figure panels: O’Brien-Fleming and Pocock designs)
49
Relative Advantages
  • Interval estimates: Spectrum of group sequential designs for which defined


    • (All)
    • Sample mean ordering based


    • (Not two-sided tests with stopping under both hypotheses)
    • Analysis time ordering based
50
Relative Advantages
  • Interval estimates: Possible exclusion of point estimates
    • (Tends to occur with less than 0.5% probability)

    • Analysis time ordering might not include
      • Bias adjusted mean
      • Sample mean ordering based MUE
      • Maximum likelihood estimate

    • Sample mean ordering might not include
      • UMVUE-like
      • Analysis time ordering based MUE

51
Relative Advantages
  • P values


    • Tend to agree for the sample mean and analysis time orderings for making typical decisions regarding statistical significance
52
Relative Advantages
  • P values: Spectrum of group sequential designs for which defined


    • (All)
    • Sample mean ordering based


    • (Not two-sided tests with stopping under both hypotheses)
    • Analysis time ordering based
53
 
54
Poorly Specified Stopping Rule Approach
  • Based on statistical inference


    • Consider class of stopping rules parameterized by
      • level of significance
      • boundary shape functions
      • number and timing of analyses

    • Adjust estimates, P values for stopping rules


    • Evaluate sensitivity of conclusions to choice of stopping rules within that class
55
Poorly Specified Stopping Rule Approach
  • Determining class of stopping rules to consider


    • Consider interim results of study at potential analysis times that did not result in stopping
      • True stopping rule must have been more extreme

    • Consider interim results of study at analysis times that did result in stopping
      • True stopping rule must have been less extreme
56
 
57
Case Study 1: Idarubicin in AML
  • Idarubicin in Acute Myelogenous Leukemia


    • Patients randomized to receive Idarubicin (Ida) or Daunorubicin (Dnr) in equal numbers


    • Primary response: Induction of complete remission


    • Secondary response: Survival


58
Case Study 1: Idarubicin in AML
  • Initial design


    • Fixed sample study


    • Two-sided level 0.05 hypothesis test


    • 80% power to detect absolute difference in response rates of 0.20


    • 90 patients per treatment arm
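As a rough check on the sample size, the usual normal-approximation formula for comparing two proportions can be evaluated. The slides do not state the response rates or the formula behind the design, so the rates 0.50 and 0.70 (an absolute difference of 0.20) and the unpooled-variance formula below are assumptions made only for illustration.

```python
from statistics import NormalDist

def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the target power
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return (z_alpha + z_beta) ** 2 * variance_sum / (p_treatment - p_control) ** 2

n = n_per_arm(0.50, 0.70)  # hypothetical response rates
print(f"approximate patients per arm: {n:.1f}")
```

Under these assumed rates the formula gives roughly 90 patients per arm, in line with the design on the slide; other base rates with the same 0.20 difference give somewhat different numbers.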
59
Case Study 1: Idarubicin in AML
  • Chronology
    • Several informal analyses of the data


    • Formal analysis of the data when N=45 per arm
      • CR rate - Ida: 35/45 (78%); Dnr: 25/45 (56%)
      • Retrospective adoption of O’Brien-Fleming design
      • Trial continued

    • Formal analysis of the data when N=65 per arm
      • CR rate - Ida: 51/65 (78%); Dnr: 38/65 (58%)
      • Trial stopped
60
Case Study 1: Idarubicin in AML
  • FDA Questions
    • Was the O’Brien-Fleming design truly the one used?
      • Number and timing of analyses
      • Level of test
      • Boundary shape function

    • (Can we trust retrospective imposition of the stopping rule?) (Case Study 2)


    • (Interpretation of secondary endpoint of survival?)
61
Case Study 1: Idarubicin in AML
  • Selection of Class of Stopping Rules for Sensitivity Analysis
    • Study did not stop with treatment difference of 0.22 when N= 45 / arm


    • Study did stop with treatment difference of 0.20 when N= 65 / arm


    • Consider stopping boundaries that are between those two points
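The selection logic can be sketched as a simple compatibility check. The observed complete remission counts come from the chronology slide; the candidate boundary values are hypothetical, and boundaries are assumed to be expressed on the scale of the difference in remission proportions.

```python
def observed_difference(responders_ida, responders_dnr, n_per_arm):
    """Difference in complete remission proportions (Ida minus Dnr)."""
    return (responders_ida - responders_dnr) / n_per_arm

diff_45 = observed_difference(35, 25, 45)  # about 0.22, trial continued
diff_65 = observed_difference(51, 38, 65)  # 0.20, trial stopped

def is_compatible(boundary_45, boundary_65):
    """True if a candidate efficacy boundary agrees with what actually happened:
    no stop at N = 45 per arm, stop at N = 65 per arm."""
    continued_at_45 = diff_45 < boundary_45
    stopped_at_65 = diff_65 >= boundary_65
    return continued_at_45 and stopped_at_65

# Hypothetical candidate boundaries (difference scale) at N = 45 and N = 65 per arm.
print(is_compatible(0.30, 0.18))  # would have continued, then stopped
print(is_compatible(0.20, 0.18))  # would already have stopped at N = 45
print(is_compatible(0.30, 0.25))  # would not have stopped at N = 65
```

Only boundaries passing this check belong in the class of stopping rules examined in the sensitivity analysis.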


62
Case Study 1: Idarubicin in AML
  • Observed results
63
Case Study 1: Idarubicin in AML
  • Stopping rules
    • (Figure regions, in order: Impossible, Possible, Impossible)
64
Case Study 1: Idarubicin in AML
  • Parameterization
    • Number and timing of analyses


    • Boundary shape function


    • Level of significance
      • Worst case: just barely continued at N= 45
      • Best case: just barely stopped at N= 65
65
Case Study 1: Idarubicin in AML
  • Sensitivity analysis


    • Analyses at N = (45, 65, 90)    Best Case       Worst Case
      • Level                         .958            .868
      • P value                       .008            .015
      • Estimate                      .184            .181
      • 95% CI                        (.034, .325)    (.018, .348)

66
Case Study 1: Idarubicin in AML
  • Sensitivity analysis


    • Analyses at N = (25, 45, 65, 90)    Best Case       Worst Case
      • Level                             .958            .868
      • P value                           .008            .016
      • Estimate                          .184            .175
      • 95% CI                            (.034, .325)    (.017, .348)

67
Case Study 1: Idarubicin in AML
  • Sensitivity analysis


    • Analyses at N = (12, 25, 35, 45, 65, 90)    Best Case       Worst Case
      • Level                                     .958            .866
      • P value                                   .008            .017
      • Estimate                                  .182            .171
      • 95% CI                                    (.034, .325)    (.015, .347)

68
 
69
Case Study 2: Unexpected Toxicities
  • Background
    • Clinical trial of G-CSF to reduce a certain type of toxicity in cancer chemotherapy


    • Early in trial, high rates of another toxicity noted


    • Ed Korn at NCI consulted re early stopping


    • Much later, Ed Korn invites panel to address this problem as an unknown at the JSM
70
Clinical Setting
  • Clinical trial of Granulocyte Colony Stimulating Factor (G-CSF)
    • Oral mucositis toxicity with 5-FU/LV chemotherapy


    • Observation of decreased incidence when G-CSF was given for other indications


    • Hence clinical trial planned to address role in reducing oral mucositis
71
Clinical Setting
  • Clinical trial design
    • Fixed sample design


    • 35 patients to receive G-CSF in first chemo cycle; nothing in second


    • Primary endpoint: difference in oral mucositis between cycles
72
Clinical Setting
  • Chronology
    • 3 of 4 first patients experience life threatening leukopenia


    • A fifth patient currently under treatment


    • Question: When should we be concerned enough to stop the trial?
73
Clinical Setting
  • Biological issues
    • G-CSF stimulates division of leukocytes; chemotherapy kills rapidly dividing cells


    • Leukopenia was a secondary endpoint


    • Current trial included patients with prior chemotherapy unlike previous trials
      • 2 of 3 toxicities were with prior chemotx
      • 2 of 2 patients with prior chemotx had toxicities

74
Clinical Setting
  • Acceptable levels of toxicity


    • Only 12 / 176 (6.8%) of patients on 5-FU / LV in previous study experienced leukopenia


    • Clinical researcher: Maybe 50% toxicity rate would be acceptable
75
Approach to Problem
  • Outline


    • Selection of a group sequential stopping rule


    • Analysis of results


    • Sensitivity of analysis to data driven selection of stopping rule


    • Bayesian analysis
76
Selection of Stopping Rule
  • Selection of hypotheses


    • Power to detect toxicity rate greater than 50%


    • Null hypothesis: toxicity rate less than 20%
      • arbitrary choice
      • allows for prior chemotherapy
77
Selection of Stopping Rule
  • Schedule of analyses


    • First analysis at N = 5


    • Additional analyses every 5 patients to maximum of 35
78
Selection of Stopping Rule
  • Structure of stopping rule


    • Early stopping only for excess toxicity


    • Boundaries defined for number of toxicities


    • Consider boundary shape functions of
      • O’Brien-Fleming
      • Pocock
79
Binary Endpoint
  • Issues in small studies with binary endpoint


    • Size, power not attained exactly


    • Large sample approximations not appropriate


    • Implementation of boundary relationships approximate
      • rounding vs truncation of boundaries
80
Sampling Density
  • Group Sequential Test Statistic
    • Observations           Xi ~ B(1,p)
    • Analysis times             N1, N2, N3, ..., NJ
    • Continuation sets        (aj, bj)


    • Statistics
81
Sampling Density
  • After Armitage, McPherson, and Rowe (1969)
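The formula from the slide is not reproduced here, but the recursive idea can be sketched for a binary endpoint: carry forward the probability of each cumulative count that has not yet crossed a boundary, convolve with the binomial distribution of the new group, and set aside whatever crosses. This sketch assumes early stopping only when the count reaches an upper boundary, and uses the Pocock-type boundaries from the Candidate Designs slide; the function names are hypothetical.

```python
from math import comb

def stopping_density(p, times, bounds):
    """P(stop at analysis j with s events), keyed by (j, s).

    The trial stops at analysis j if the cumulative count s >= bounds[j];
    every outcome at the last analysis is terminal.
    """
    density = {}
    alive = {0: 1.0}  # alive[s]: P(trial still running with s events so far)
    prev_n = 0
    for j, (n, bound) in enumerate(zip(times, bounds)):
        m = n - prev_n  # new observations since the previous analysis
        reached = {}
        for s, pr in alive.items():
            for k in range(m + 1):
                step = comb(m, k) * p**k * (1 - p)**(m - k)
                reached[s + k] = reached.get(s + k, 0.0) + pr * step
        alive = {}
        for s, pr in reached.items():
            if s >= bound or j == len(times) - 1:
                density[(j, s)] = pr
            else:
                alive[s] = pr
        prev_n = n
    return density

def reject_probability(p, times, bounds):
    """P(count reaches the boundary at some analysis), i.e. size or power."""
    d = stopping_density(p, times, bounds)
    return sum(pr for (j, s), pr in d.items() if s >= bounds[j])

def average_sample_n(p, times, bounds):
    """Expected number of patients at the time the trial stops."""
    d = stopping_density(p, times, bounds)
    return sum(pr * times[j] for (j, s), pr in d.items())

TIMES = (5, 10, 15, 20, 25, 30, 35)
POCOCK = (4, 6, 8, 10, 11, 13, 14)  # "Poc" column of the Candidate Designs slide

size = reject_probability(0.2, TIMES, POCOCK)
power = reject_probability(0.5, TIMES, POCOCK)
print(f"size at p = 0.2: {size:.4f}, power at p = 0.5: {power:.4f}")
print(f"ASN at p = 0.2: {average_sample_n(0.2, TIMES, POCOCK):.1f}")
print(f"ASN at p = 0.5: {average_sample_n(0.5, TIMES, POCOCK):.1f}")
```

The printed size, power, and ASN can be compared with the operating characteristics tabulated for the Pocock design later in the slides.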


82
Candidate Designs
  • Threshold for rejecting the null hypothesis


    •                       OBF    Poc
    • Boundaries  N1 =  5     6      4
    •             N2 = 10     7      6
    •             N3 = 15     8      8
    •             N4 = 20     9     10
    •             N5 = 25    10     11
    •             N6 = 30    11     13
    •             N7 = 35    12     14



83
Candidate Designs
  • Operating characteristics


    •                            OBF     Poc
    • Hypotheses  Null           .183    .209
    •             Alternative    .488    .542


    • ASN         p = 0.2        34.7    34.6
    •             p = 0.5        18.6    17.5



84
Candidate Designs
  • Inference at the boundaries


    • Earliest possible stopping time


    •                          OBF           Poc
    • SM / NM                  7 / 10        4 / 5
    • P val (p = 0.2) (SM)     .0009         .0067
    • Estimate (BAM)           .675          .753
    • 95% CI (SM)              .347, .859    .283, .915



85
Candidate Designs
  • Inference at the boundaries


    • Smallest rejection of Null


    •                          OBF           Poc
    • SM / NM                  12 / 35       14 / 35
    • P val (p = 0.2) (SM)     .0447         .0196
    • Estimate (BAM)           .321          .354
    • 95% CI (SM)              .183, .488    .209, .542



86
Candidate Designs
  • Inference at the boundaries


    • Largest nonrejection of Null


    •                          OBF           Poc
    • SM / NM                  11 / 35       13 / 35
    • P val (p = 0.2) (SM)     .0774         .0259
    • Estimate (BAM)           .297          .333
    • 95% CI (SM)              .167, .462        .199, .518



87
Stopping Rule
  • Pocock bounds


    • Conservatism of O’Brien-Fleming less desirable for new therapy


    • Fifth patient (no prior chemotherapy) had toxicity


    • Trial stopped (modified)
88
Clinical Trial Results
    • Toxicities 4 / 5
    • P values
      • Sample Mean .00674
      • Analysis Time .00672

    • Point Estimates
      • Bias adjusted mean .753
      • UMVUE .800
      • MUE (Sample Mean) .784
      • MUE (Analysis Time) .767
      • MLE .800

    • Confidence Intervals
      • Sample mean .283, .915
      • Analysis time .284, .947
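The two P values above can be reproduced from the sampling density under the null (p = 0.2). This is a sketch that assumes the Pocock-type boundaries from the Candidate Designs slide and early stopping only for excess toxicity; the helper name is hypothetical. Under the sample mean ordering, an outcome is at least as extreme if its toxicity proportion at stopping is at least 4 / 5; under the analysis time ordering, with only an upper boundary, an outcome is at least as extreme if the trial stopped earlier, or at the same analysis with at least as many toxicities.

```python
from math import comb

def stopping_density(p, times, bounds):
    """P(stop at analysis j with s events), keyed by (j, s); stop when s >= bounds[j]."""
    density, alive, prev_n = {}, {0: 1.0}, 0
    for j, (n, bound) in enumerate(zip(times, bounds)):
        m = n - prev_n  # observations accrued since the previous analysis
        reached = {}
        for s, pr in alive.items():
            for k in range(m + 1):
                step = comb(m, k) * p**k * (1 - p)**(m - k)
                reached[s + k] = reached.get(s + k, 0.0) + pr * step
        alive = {}
        for s, pr in reached.items():
            if s >= bound or j == len(times) - 1:
                density[(j, s)] = pr  # terminal outcome
            else:
                alive[s] = pr         # trial continues
        prev_n = n
    return density

TIMES = (5, 10, 15, 20, 25, 30, 35)
POCOCK = (4, 6, 8, 10, 11, 13, 14)
j_obs, s_obs = 0, 4  # stopped at the first analysis with 4 of 5 toxicities

null_density = stopping_density(0.2, TIMES, POCOCK)

# Sample mean ordering: compare s / n across analyses (cross-multiplied to avoid rounding).
p_sample_mean = sum(pr for (j, s), pr in null_density.items()
                    if s * TIMES[j_obs] >= s_obs * TIMES[j])

# Analysis time ordering (upper boundary only): earlier stopping is more extreme.
p_analysis_time = sum(pr for (j, s), pr in null_density.items()
                      if j < j_obs or (j == j_obs and s >= s_obs))

print(f"sample mean ordering:   {p_sample_mean:.5f}")
print(f"analysis time ordering: {p_analysis_time:.5f}")
```

The small gap between the two values comes from one extra outcome counted by the sample mean ordering: 3 toxicities in the first 5 patients followed by 5 in the next 5, which gives 8 / 10 at the second analysis.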
89
Data Driven Selection of Stopping Rule
  • Model hybrid test
    • Y is number of toxicities in first four patients


    • If Y < c, stay with fixed sample design
    • If Y ≥ c, switch to group sequential test




    • First term computed under FST or (shifted) GST according to value of c
90
Sensitivity Analysis Results
  • Size, Power of Hybrid Tests


    • Threshold for switch to GST    Size (p = 0.2)    Power (p = 0.5)
    • 0 / 4 (GST)                    .0196             .9292
    • 1 / 4                          .0205             .9343
    • 2 / 4                          .0218             .9466
    • 3 / 4                          .0208             .9551
    • 4 / 4                          .0155             .9557
    • 5 / 4 (FST)                    .0142             .9552
91
Selected Bayesian Analysis Results
  • Uniform prior


    • Obs Tox              E(p|S)         Pr(p<0.2|S)         Pr(p>0.5|S)
    •    0 / 1 .333 .360 .250
    •    1 / 1 .667 .040 .750
    •    1 / 2 .500 .104 .500
    •    2 / 2 .750 .008 .875
    •    2 / 3 .600 .027 .688
    •    3 / 3 .800 .002 .938
    •    3 / 4 .667 .007 .812
    •    3 / 5 .571 .017 .656
    •    4 / 5 .714 .002 .891
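The uniform-prior table can be checked with a short calculation. With a uniform prior and s toxicities in n patients, the posterior for p is Beta(s + 1, n − s + 1). The sketch below avoids external libraries by using the identity that, for integer parameters, the Beta(a, b) distribution function at x equals P(Bin(a + b − 1, x) ≥ a); the function names are hypothetical.

```python
from math import comb

def beta_cdf_integer(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a and b (binomial tail identity)."""
    n = a + b - 1
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) for k in range(a, n + 1))

def uniform_prior_summary(tox, n):
    """Posterior mean, Pr(p < 0.2 | S), and Pr(p > 0.5 | S) under a uniform prior."""
    a, b = tox + 1, n - tox + 1  # posterior is Beta(tox + 1, n - tox + 1)
    mean = a / (a + b)
    return mean, beta_cdf_integer(0.2, a, b), 1 - beta_cdf_integer(0.5, a, b)

for tox, n in [(0, 1), (1, 1), (1, 2), (2, 2), (3, 4), (4, 5)]:
    mean, pr_low, pr_high = uniform_prior_summary(tox, n)
    print(f"{tox} / {n}: E(p|S) = {mean:.3f}, Pr(p<0.2|S) = {pr_low:.3f}, Pr(p>0.5|S) = {pr_high:.3f}")
```

For example, 4 / 5 gives a Beta(5, 2) posterior with mean 5/7, matching the last row of the table.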
92
Selected Bayesian Analysis Results
  • Ad hoc prior (uniform mass on null: 0.5)


    • Obs Tox              E(p|S)         Pr(p<0.2|S)         Pr(p>0.5|S)
    •    0 / 1 .209 .628 .091
    •    1 / 1 .471 .176 .460
    •    1 / 2 .365 .289 .271
    •    2 / 2 .590 .050 .671
    •    2 / 3 .483 .104 .476
    •    3 / 3 .664 .013 .808
    •    3 / 4 .565 .032 .646
    •    3 / 5 .490 .061 .485
    •    4 / 5 .623 .009 .770
93
Selected Bayesian Analysis Results
  • Ad hoc prior (uniform mass on null: 0.8)


    • Obs Tox              E(p|S)         Pr(p<0.2|S)         Pr(p>0.5|S)
    •    0 / 1 .135 .871 .031
    •    1 / 1 .354 .462 .300
    •    1 / 2 .256 .619 .145
    •    2 / 2 .532 .174 .584
    •    2 / 3 .404 .316 .363
    •    3 / 3 .645 .049 .778
    •    3 / 4 .530 .116 .590
    •    3 / 5 .438 .208 .410
    •    4 / 5 .611 .035 .750
94
Final comments
  • Hybrid rule could have been more complicated to account for later decisions to switch


  • Sensitivity analysis suggests appropriate inference in this case (could use as a criterion for GST)


  • Adjusted inference possible, but more complex


  • Bayesian analysis of some interest, but it is questionable that a proper prior could ever be selected to detect unexpected toxicities
95
 
96
Documentation of Design
  • Specification of stopping rule
      • Null, design alternative hypotheses
      • Type I error (alpha, beta parameters)
      • Power to detect design alternative
      • One-sided,  two-sided hypotheses (epsilon parameters)
      • Boundary scale for design family
      • Boundary shape function parameters (P, R, A) for each boundary
      • Constraints (minimum, maximum, exact)
97
Documentation of Design
  • Documentation of stopping rule
      • Specification of stopping rule
      • Estimation of sample size requirements
      • Example of stopping boundaries under estimated schedule of analyses
        • sample mean scale
        • other scales?
      • Inference at the boundaries
      • Futility, Bayesian properties?
98
Documentation of Implementation
  • Specification of implementation methods
      • Method for determining analysis times
      • Operating characteristics to be maintained
        • power (up to some maximum N?)
        • maximal sample size
      • Method for measuring study time
      • Boundary scale for making decisions
      • Boundary scale for constraining boundaries at previously conducted analyses
      • (Conditions under which stopping rule might be modified)
99
Documentation of Analysis
  • Specification of analysis methods
      • Method for determining P values
      • Method for point estimation
      • Method for confidence intervals
      • (Handling additional data that accrues after decision to stop)