1
Session 2
  • Group Sequential Stopping Rules
    • Need for Monitoring a Trial
    • Criteria for Early Stopping
    • Inadequacy of Fixed Sample Methods
    • Stopping Rules
  • Families of Designs
    • Boundary Scales
    • Unified Family (Sample Mean Scale)
    • Error Spending Family
    • Comparison of Parameterizations
  • Evaluation of Group Sequential Designs
2
 
3
Need for Monitoring a Trial
  • Fixed sample two-sided tests
    • Test of a two-sided alternative (θ+ > θ0 > θ−)
      • Upper Alternative:    H+: θ ≥ θ+     (superiority)
      • Null:                 H0: θ = θ0     (equivalence)
      • Lower Alternative:    H−: θ ≤ θ−     (inferiority)

    • Data analyzed once at the end of all data accrual


    • Decisions:
      • Reject H0, H−   (for H+)   ⇔       T ≥ cU
      • Reject H+, H−   (for H0)   ⇔   cL ≤ T ≤ cU
      • Reject H+, H0   (for H−)   ⇔       T ≤ cL
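A minimal sketch of the three-way decision rule above, assuming the critical values cL and cU have already been chosen. The function name and the returned labels are illustrative and do not come from the slides.

```python
def fixed_sample_decision(t, c_lower, c_upper):
    """Three-way decision for a fixed-sample two-sided test.

    t is the test statistic computed once, after all data have accrued.
    Returns the label of the hypothesis that is NOT rejected.
    """
    if c_lower > c_upper:
        raise ValueError("c_lower must not exceed c_upper")
    if t >= c_upper:
        return "H+"   # reject H0 and H- (superiority)
    if t <= c_lower:
        return "H-"   # reject H+ and H0 (inferiority)
    return "H0"       # reject H+ and H- (equivalence)
```

For example, with cL = -1.96 and cU = 1.96, a statistic of 2.5 leads to a decision for H+.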
4
Need for Monitoring a Trial
  • Ethical concerns


    • Patients already on trial


      • Avoid continued administration of harmful treatments


      • Maintain validity of informed consent
5
Need for Monitoring a Trial
  • Ethical concerns (cont.)


    • Patients not yet on trial


      • Start treatment with best therapy


      • Ensure informed consent is valid

6
Need for Monitoring a Trial
  • Ethical concerns (cont.)


    • Patients never on trial


      • Facilitate rapid introduction of beneficial treatments


      • Warn about risks of existing treatments

7
Need for Monitoring a Trial
  • Efficiency considerations


    • Fewer patients may be needed on average
      • Decreases costs associated with number of patients


    • Time savings
      • Decreases costs associated with monitoring patients
8
Need for Monitoring a Trial
  • Futility considerations: Efficiency and Ethics
    • Efficiency
      • Stop a study when it is known (or reasonably certain) that no effect will be demonstrated
      • Can perform more studies with limited resources
    • Ethics
      • Is it ever ethical to expose patients to experimental treatments when no meaningful information will be gained?
      • Can devote resources to study of more promising agents
9
 
10
Criteria for Stopping a Trial
  • Sufficient evidence available to be confident of rejecting specific hypotheses


    • Stopping early for
      • Efficacy (superiority)
      • Harm (inferiority)
      • Equivalence
11
Criteria for Stopping a Trial
  • Futility of demonstrating effect that would change behavior


    • Stopping early for futility
      • Not sufficiently superior
      • Not dangerously harmful
12
Criteria for Stopping a Trial
  • And there is no advantage in continuing


    • Even if confident of ultimate decision about primary endpoint, may want to continue trial to gain more information on
      • Safety
      • Longer term follow-up
      • Secondary outcomes
13
Criteria for Stopping a Trial
  • Statistical basis for stopping criteria


    • Curtailment
      • Boundary has been reached early


      • E.g., one arm study with binary endpoint
        • Critical value for rejection of null might be observation of K events

        • Kth event may occur well before all subjects accrued
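A small sketch of deterministic curtailment for the one-arm binary example. It assumes the null is rejected once K events are observed. The second stopping condition (K events can no longer be reached) is an added assumption not stated on the slide, and the function name is hypothetical.

```python
def curtailed_stop(outcomes, n_max, k):
    """Return (n_used, reject_null) for a one-arm trial with binary outcomes.

    outcomes: sequence of 0/1 results in order of accrual.
    n_max:    planned maximal number of subjects.
    k:        number of events needed to reject the null.
    """
    events = 0
    for n, x in enumerate(outcomes[:n_max], start=1):
        events += x
        if events >= k:
            # Boundary reached early: the final decision is already known.
            return n, True
        if events + (n_max - n) < k:
            # Assumption: also stop once k events are out of reach.
            return n, False
    raise ValueError("fewer than n_max outcomes supplied and no decision reached")
```

With K = 3 and 10 planned subjects, the sequence 1, 1, 0, 1 already settles the decision after the fourth subject.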
14
Criteria for Stopping a Trial
  • Statistical basis for stopping criteria (cont.)


    • Stochastic Curtailment
      • High probability that a particular decision will be made at final analysis


      • Calculate probability of exceeding some critical value conditional on data observed so far


      • Probability calculated based on hypothesized treatment effect (which hypothesis?) or current estimate
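A hedged sketch of the conditional probability calculation, assuming independent normal observations with known standard deviation and a final decision based on the sample mean exceeding a threshold. The function and argument names are illustrative. Passing a hypothesized mean or the current estimate (s_n / n) as mu gives the two variants mentioned above.

```python
from math import sqrt
from statistics import NormalDist


def conditional_power(s_n, n, n_max, threshold, mu, sigma=1.0):
    """P(final sample mean >= threshold | partial sum s_n after n observations).

    The remaining n_max - n observations are assumed N(mu, sigma^2).
    """
    remaining = n_max - n
    if remaining <= 0:
        raise ValueError("n must be smaller than n_max")
    # Sum that the future observations still have to contribute.
    needed = n_max * threshold - s_n
    z = (needed - remaining * mu) / (sigma * sqrt(remaining))
    return 1.0 - NormalDist().cdf(z)
```

Stopping for futility would then correspond to this probability falling below some small value chosen in advance.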
15
Criteria for Stopping a Trial
  • Statistical basis for stopping criteria (cont.)


    • Predictive probability of final statistic


      • A special form of stochastic curtailment


      • Uses a Bayesian prior distribution on the treatment effect
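A sketch of a predictive probability calculation under assumptions that are not on the slide: normal data with known standard deviation, a conjugate normal prior on the mean, and a final decision based on the sample mean exceeding a threshold. All names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def predictive_probability(s_n, n, n_max, threshold,
                           prior_mean, prior_sd, sigma=1.0):
    """P(final sample mean >= threshold | data so far), averaging over the posterior."""
    remaining = n_max - n
    if remaining <= 0:
        raise ValueError("n must be smaller than n_max")
    # Conjugate normal update for the mean.
    precision = 1.0 / prior_sd**2 + n / sigma**2
    post_mean = (prior_mean / prior_sd**2 + s_n / sigma**2) / precision
    post_var = 1.0 / precision
    # Predictive distribution of the sum of the remaining observations:
    # sampling variance plus uncertainty about the mean itself.
    pred_sd = sqrt(remaining * sigma**2 + remaining**2 * post_var)
    needed = n_max * threshold - s_n
    return 1.0 - NormalDist().cdf((needed - remaining * post_mean) / pred_sd)
```

Compared with conditional power at a single hypothesized value, the extra variance term reflects that the treatment effect is not assumed known.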
16
Criteria for Stopping a Trial
  • Statistical basis for stopping criteria (cont.)


    • Group sequential test


      • Sufficient evidence to make decision in classical frequentist framework


      • Type I and II errors controlled at desired levels
17
Criteria for Stopping a Trial
  • Statistical basis for stopping criteria (cont.)


    • Bayesian analysis


      • Compute the probability that the treatment effect is in some specified range


      • Calculations based on a user specified prior distribution for the treatment effect (which is treated as a random variable)
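As an illustration only, the posterior probability that the treatment effect lies in a range can be sketched for normal data with known standard deviation and a normal prior. The conjugate model and every name below are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist


def posterior_prob_in_range(xbar, n, lower, upper,
                            prior_mean, prior_sd, sigma=1.0):
    """Posterior P(lower < effect < upper) given sample mean xbar of n observations."""
    precision = 1.0 / prior_sd**2 + n / sigma**2
    post_mean = (prior_mean / prior_sd**2 + n * xbar / sigma**2) / precision
    posterior = NormalDist(post_mean, sqrt(1.0 / precision))
    return posterior.cdf(upper) - posterior.cdf(lower)
```

A different prior generally gives a different probability from the same data, which is why the prior has to be specified by the user.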
18
 
19
Inadequacy of Fixed Sample Methods
  • Sequential monitoring of a trial
    • Data are analyzed after accrual of each observation


      • (Group sequential monitoring: analysis after groups of observations accrued)


      • Analyses must take into account the repeated analyses of the same data
        • Sampling distribution of the test statistic is altered
        • Frequentist properties are altered
20
Inadequacy of Fixed Sample Methods
  • Setting for demonstration of the problem


    • Observations:
  •      X1, X2, X3, …, XN
  •          Xi ~ N(μ, σ²)


    • Hypothesis:
  •                  H0: μ = μ0
21
Inadequacy of Fixed Sample Methods
  • Test Statistic


    • Sample mean computed after each observation: X̄n = (X1 + X2 + … + Xn) / n
22
Inadequacy of Fixed Sample Methods
  • Fixed sample decision rule


    • Hypothesis test when all data accrued:
      • Reject H0 when
23
Inadequacy of Fixed Sample Methods
  • Sample path for sample mean
24
Inadequacy of Fixed Sample Methods
  • Sample path for sample mean
25
Inadequacy of Fixed Sample Methods
  • Repeated significance testing
    • Continuous monitoring:
      • Reject H0 the first time
26
Inadequacy of Fixed Sample Methods
  • Simulated trials when H0 is true:
27
Inadequacy of Fixed Sample Methods
  • Simulated trials when H0 is true:
28
Inadequacy of Fixed Sample Methods
  • Repeated significance testing
    • Monitoring after each of J groups of observations:
      • Analyses at N1, N2, …, NJ
      • Reject H0 the first time
29
Inadequacy of Fixed Sample Methods
  • Simulated trials when H0 is true:
30
Inadequacy of Fixed Sample Methods
  • Simulated trials when H0 is true:
31
Inadequacy of Fixed Sample Methods
  • Simulate 100,000 Trials under the Null Hypothesis
    • Three equally spaced level .05 analyses

  •                            Proportion Significant
  •                        1st


  •                      .05038
32
Inadequacy of Fixed Sample Methods
  • Simulate 100,000 Trials under the Null Hypothesis
    • Three equally spaced level .05 analyses

  •                            Proportion Significant
  •                        1st      2nd


  •                      .05038   .05022
33
Inadequacy of Fixed Sample Methods
  • Simulate 100,000 Trials under the Null Hypothesis
    • Three equally spaced level .05 analyses

  •                            Proportion Significant
  •                        1st      2nd      3rd


  •                      .05038   .05022   .05056
34
Inadequacy of Fixed Sample Methods
  • Simulate 100,000 Trials under the Null Hypothesis
    • Three equally spaced level .05 analyses

  • Pattern of                 Proportion Significant
  • Significance           1st


  • 1st only             .03046
  • 1st, 2nd             .00807
  • 1st, 3rd             .00317
  • 1st, 2nd, 3rd        .00868





  • Any pattern          .05038
35
Inadequacy of Fixed Sample Methods
  • Simulate 100,000 Trials under the Null Hypothesis
    • Three equally spaced level .05 analyses

  • Pattern of                 Proportion Significant
  • Significance           1st      2nd


  • 1st only             .03046
  • 1st, 2nd             .00807   .00807
  • 1st, 3rd             .00317
  • 1st, 2nd, 3rd        .00868   .00868
  • 2nd only                      .01921
  • 2nd, 3rd                      .01426



  • Any pattern          .05038   .05022
36
Inadequacy of Fixed Sample Methods
  • Simulate 100,000 Trials under the Null Hypothesis
    • Three equally spaced level .05 analyses

  • Pattern of                 Proportion Significant
  • Significance           1st      2nd      3rd


  • 1st only             .03046
  • 1st, 2nd             .00807   .00807
  • 1st, 3rd             .00317            .00317
  • 1st, 2nd, 3rd        .00868   .00868   .00868
  • 2nd only                      .01921
  • 2nd, 3rd                      .01426   .01426
  • 3rd only                               .02445


  • Any pattern          .05038   .05022   .05056
37
Inadequacy of Fixed Sample Methods
  • Simulate 100,000 Trials under the Null Hypothesis
    • Three equally spaced level .05 analyses

  • Pattern of                 Proportion Significant
  • Significance           1st      2nd      3rd      Ever


  • 1st only             .03046                      .03046
  • 1st, 2nd             .00807   .00807             .00807
  • 1st, 3rd             .00317            .00317    .00317
  • 1st, 2nd, 3rd        .00868   .00868   .00868    .00868
  • 2nd only                      .01921             .01921
  • 2nd, 3rd                      .01426   .01426    .01426
  • 3rd only                               .02445    .02445


  • Any pattern          .05038   .05022   .05056    .10830
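The inflation seen in the "Ever" column can be reproduced with a short simulation. This sketch assumes two-sided level .05 tests on the normalized (Z) statistic, known variance, and equal group sizes. The function name, seed, and number of trials are illustrative choices, so the estimates will only approximate the table.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_repeated_tests(n_trials=20000, n_analyses=3, level=0.05, seed=1):
    """Estimate how often naive repeated tests reject a true null hypothesis.

    Returns (per_analysis, ever): the proportion significant at each analysis
    and the proportion significant at one or more analyses.
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1.0 - level / 2.0)   # two-sided critical value
    sig_at = [0] * n_analyses
    ever = 0
    for _ in range(n_trials):
        partial_sum = 0.0
        any_sig = False
        for j in range(1, n_analyses + 1):
            # Each equal-sized group adds an independent N(0, 1) increment
            # to the standardized partial sum when the null is true.
            partial_sum += rng.gauss(0.0, 1.0)
            z = partial_sum / sqrt(j)
            if abs(z) > crit:
                sig_at[j - 1] += 1
                any_sig = True
        ever += any_sig
    return [count / n_trials for count in sig_at], ever / n_trials


per_analysis, ever = simulate_repeated_tests()
print(per_analysis, ever)
```

Each individual analysis should come out near .05, while the proportion ever significant should be roughly twice that.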
38
Inadequacy of Fixed Sample Methods
  • Group sequential test: Pocock (1977) level .05
    • Three equally spaced level .022 analyses

  • Pattern of                 Proportion Significant
  • Significance           1st      2nd      3rd      Ever


  • 1st only             .01520                      .01520
  • 1st, 2nd             .00321   .00321             .00321
  • 1st, 3rd             .00113            .00113    .00113
  • 1st, 2nd, 3rd        .00280   .00280   .00280    .00280
  • 2nd only                      .01001             .01001
  • 2nd, 3rd                      .00614   .00614    .00614
  • 3rd only                               .01250    .01250


  • Any pattern          .02234   .02216   .02257    .05099
39
Inadequacy of Fixed Sample Methods
  • Critical values depend on spacing of analyses
    • Level .022 analyses at 10%, 20%, 100% of data

  • Pattern of                 Proportion Significant
  • Significance           1st      2nd      3rd      Ever


  • 1st only             .01509                      .01509
  • 1st, 2nd             .00521   .00521             .00521
  • 1st, 3rd             .00068            .00068    .00068
  • 1st, 2nd, 3rd        .00069   .00069   .00069    .00069
  • 2nd only                      .01473             .01473
  • 2nd, 3rd                      .00165   .00165    .00165
  • 3rd only                               .01855    .01855


  • Any pattern          .02167   .02228   .02157    .05660
40
Inadequacy of Fixed Sample Methods
  • The critical values can be varied across analyses
    • Level 0.10 O’Brien-Fleming (1979); equally spaced tests at .003, .036, .087
  • Pattern of                 Proportion Significant
  • Significance           1st      2nd      3rd      Ever


  • 1st only             .00082                      .00082
  • 1st, 2nd             .00036   .00036             .00036
  • 1st, 3rd             .00037            .00037    .00037
  • 1st, 2nd, 3rd        .00127   .00127   .00127    .00127
  • 2nd only                      .01164             .01164
  • 2nd, 3rd                      .02306   .02306    .02306
  • 3rd only                               .06223    .06223


  • Any pattern          .00282   .03633   .08693    .09975
41
Inadequacy of Fixed Sample Methods
  • Error spending function: Pocock (1977) level .05
  • Pattern of                 Proportion Significant
  • Significance           1st      2nd      3rd      Ever


  • 1st only             .01520                      .01520
  • 1st, 2nd             .00321   .00321             .00321
  • 1st, 3rd             .00113            .00113    .00113
  • 1st, 2nd, 3rd        .00280   .00280   .00280    .00280
  • 2nd only                      .01001             .01001
  • 2nd, 3rd                      .00614   .00614    .00614
  • 3rd only                               .01250    .01250


  • Any pattern          .02234   .02216   .02257    .05099
  • Incremental error    .02234   .01615   .01250
  • Cumulative error     .02234   .03849   .05099
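The incremental and cumulative error rows follow directly from the pattern table: the error spent at an analysis is the proportion of trials whose first significant result occurs there. A short sketch using the simulated proportions above (the dictionary layout and function name are illustrative):

```python
# Simulated proportions from the Pocock (1977) level .05 table, keyed by the
# set of analyses at which the trial was significant.
patterns = {
    (1,): 0.01520, (1, 2): 0.00321, (1, 3): 0.00113, (1, 2, 3): 0.00280,
    (2,): 0.01001, (2, 3): 0.00614,
    (3,): 0.01250,
}


def error_spent(patterns, n_analyses=3):
    """Return (incremental, cumulative) type I error by analysis."""
    incremental = [0.0] * n_analyses
    for pattern, proportion in patterns.items():
        # A trial stops at its first significant analysis.
        incremental[min(pattern) - 1] += proportion
    cumulative = []
    total = 0.0
    for spent in incremental:
        total += spent
        cumulative.append(total)
    return incremental, cumulative


incremental, cumulative = error_spent(patterns)
print([round(x, 5) for x in incremental])
print([round(x, 5) for x in cumulative])
```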


42
 
43
Stopping Rules
  • Basic Strategy


    • Find stopping boundaries at each analysis such that desired operating characteristics (e.g., type I and type II statistical errors) are attained



44
Stopping Rules
  • Issues
      • Conditions under which the trial might be stopped early
      • When to perform analyses
      • Test statistic to use
      • Relative position of boundaries at successive analyses
      • Desired operating characteristics
45
Stopping Rules
  • Choice of Test Statistic


    • Let Tn(X1, ..., Xn) be any test statistic such that Tn tends to be large for larger values of θ


    • (Later we will consider possible choices for Tn)
46
Stopping Rules
  • Conditions for Early Stopping: One-sided tests
    • Test of a greater alternative (θ+ > θ0)
      • Null:               H0: θ ≤ θ0
      • Alternative:    H1: θ ≥ θ+

    • Possibilities for early stopping:
      • Stop only for the null (when  Tn small)
      • Stop only for the alternative (when  Tn large)
      • Stop either for the null or for the alternative


47
Stopping Rules
  • Conditions for Early Stopping: One-sided tests
    • Test of a lesser alternative (θ− < θ0)
      • Null:               H0: θ ≥ θ0
      • Alternative:    H1: θ ≤ θ−


    • Possibilities for early stopping:
      • Stop only for the null (when  Tn large)
      • Stop only for the alternative (when  Tn small)
      • Stop either for the null or for the alternative
48
Stopping Rules
  • One-sided Test Boundaries: Sample Mean Statistic
49
Stopping Rules
  • Conditions for Early Stopping: Two-sided tests
    • Test of a two-sided alternative (θ+ > θ0 > θ−)
      • Upper Alternative:    H+: θ ≥ θ+
      • Null:                 H0: θ = θ0
      • Lower Alternative:    H−: θ ≤ θ−

    • Possibilities for early stopping:
      • Stop only for the null (when  Tn intermediate)
      • Stop only for the alternative (when  Tn small or large)
      • Stop either for the null or for the alternative


50
Stopping Rules
  • Two-sided Test Boundaries: Sample Mean Statistic
51
Stopping Rules
  • General stopping rule
    • Maximum of four boundaries
      • ‘d’ boundary: upper outer boundary
      • ‘c’ boundary: upper inner boundary
      • ‘b’ boundary: lower inner boundary
      • ‘a’ boundary: lower outer boundary


    • Early stopping
      • Tn greater than ‘d’ boundary
      • Tn between ‘b’ and ‘c’ boundaries
      • Tn less than ‘a’ boundary
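A minimal sketch of the general four-boundary rule at a single analysis. Strict inequalities and the use of infinite values for absent boundaries are assumptions of this example, as are the function name and return labels.

```python
from math import inf


def stopping_decision(t, a, b, c, d):
    """Apply the 'a' <= 'b' <= 'c' <= 'd' boundaries to the statistic t."""
    if not (a <= b <= c <= d):
        raise ValueError("boundaries must satisfy a <= b <= c <= d")
    if t > d:
        return "stop: above d"
    if t < a:
        return "stop: below a"
    if b < t < c:
        return "stop: between b and c"
    return "continue"


# Setting b equal to c removes the intermediate stopping region;
# a = -inf or d = inf removes the corresponding outer boundary.
print(stopping_decision(0.1, -inf, 0.5, 0.5, 2.4))
```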
52
Stopping Rules
  • One-sided tests of greater hypotheses


    • ‘b’ and ‘c’ boundaries are always equal
      • so no early stopping for intermediate Tn


    • Early stopping
      •  If ‘a’ boundary at −∞: no early stopping for null
      •  If ‘d’ boundary at ∞: no early stopping for alternative
53
Stopping Rules
  • One-sided Test Boundaries: Sample Mean Statistic
54
Stopping Rules
  • One-sided tests of lesser hypotheses


    • ‘b’ and ‘c’ boundaries are always equal
      • so no early stopping for intermediate Tn


    • Early stopping
      •  If ‘a’ boundary at −∞: no early stopping for alternative
      •  If ‘d’ boundary at ∞: no early stopping for null
55
Stopping Rules
  • One-sided Test Boundaries: Sample Mean Statistic
56
Stopping Rules
  • Two-sided tests
    • Early stopping
      • If ‘a’ boundary at −∞: no early stopping for lower alternative


      • If ‘b’ and ‘c’ boundaries equal: no early stopping for null


      • If ‘d’ boundary at ∞: no early stopping for upper alternative
57
Stopping Rules
  • Two-sided Test Boundaries: Sample Mean Statistic
58
Stopping Rules
  • Representation of two-sided hypothesis tests
    • Two-sided tests take on appearance of two superposed hypothesis tests
      • Lower test
        • H0−: θ ≥ θ0−  versus H−: θ ≤ θ−
      • Upper test
        • H0+: θ ≤ θ0+ versus H+: θ ≥ θ+

      • Classic two-sided test:
        • θ0− = θ0+ = θ0
        • θ− = −θ+
59
Stopping Rules
  • Generalization of hypothesis tests
    • Require only  θ− ≤ θ0+ ≤ θ0− ≤ θ+


    • Correspondence between hypotheses and boundaries
      • ‘a’ boundary rejects H0−: θ ≥ θ0−
      • ‘b’ boundary rejects H−: θ ≤ θ−
      • ‘c’ boundary rejects H+: θ ≥ θ+
      • ‘d’ boundary rejects H0+: θ ≤ θ0+
60
Stopping Rules
  • Correspondence to classical tests of H0: q = q0
      • One-sided tests of greater alternative (upper and lower tests coincident)
        • θ− < θ0− = θ0  (define θ0+ = θ− and θ+ = θ0−)
      • One-sided tests of lesser alternative (upper and lower tests coincident)
        • θ0 = θ0+ < θ+ (define θ− = θ0+ and θ0− = θ+)

      • Two-sided tests
        • θ− < θ0− = θ0 = θ0+ < θ+ (with θ− = −θ+)
61
Stopping Rules
  • Parameterize hypotheses by shift parameters εL, εU
      • 0 ≤ εL ≤ 1 is shift of θ0− away from θ+ toward θ0
        • θ0− = θ+ − εL × (θ+ − θ0)


      • 0 ≤ εU ≤ 1 is shift of θ0+ away from θ− toward θ0
        • θ0+ = θ− + εU × (θ0 − θ−)


      • Constraint: 1 ≤ εL + εU ≤ 2


      • Test can be thought of as (εL + εU)-sided
62
Stopping Rules
  • Parameterization special cases
    • One-sided test of greater alternative:
      • εL  =  0   εU  =  1

    • One-sided test of lesser alternative:
      • εL  =  1   εU  =  0

    • Two-sided test:
      • εL  =  1   εU  =  1

    • One-sided equivalence (noninferiority) test:
      • εL  =  0.5   εU  =  0.5
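The shift parameterization can be written out as a small helper. This is a sketch that applies the two shift formulas and the stated constraints literally. The function and argument names are illustrative.

```python
def shifted_nulls(theta_minus, theta_0, theta_plus, eps_lower, eps_upper):
    """Return (theta_0_minus, theta_0_plus) for the lower and upper tests."""
    if not (0.0 <= eps_lower <= 1.0 and 0.0 <= eps_upper <= 1.0):
        raise ValueError("each shift parameter must lie in [0, 1]")
    if not (1.0 <= eps_lower + eps_upper <= 2.0):
        raise ValueError("shift parameters must sum to a value in [1, 2]")
    # Null of the lower test: starts at theta_plus, shifted toward theta_0.
    theta_0_minus = theta_plus - eps_lower * (theta_plus - theta_0)
    # Null of the upper test: starts at theta_minus, shifted toward theta_0.
    theta_0_plus = theta_minus + eps_upper * (theta_0 - theta_minus)
    return theta_0_minus, theta_0_plus


# Two-sided test: both nulls coincide with theta_0.
print(shifted_nulls(-1.0, 0.0, 1.0, 1.0, 1.0))
# Shifts of 0.5 each: nulls sit halfway between theta_0 and the alternatives.
print(shifted_nulls(-1.0, 0.0, 1.0, 0.5, 0.5))
```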
63
Stopping Rules
  • Number and timing of analyses


    • N counts the sampling units accrued to the study


    • Up to J analyses of the data to be performed


    • Analyses performed after accruing sample sizes of N1 < N2 < ⋯ < NJ


    • (More generally, N measures statistical information)
64
Stopping Rules
  • Boundaries at the analyses


    • aj ≤ bj ≤ cj ≤ dj are the ‘a’, ‘b’, ‘c’, and ‘d’ boundaries at the j-th analysis (when Nj observations have accrued)


    • At the final (J-th) analysis aJ = bJ and cJ = dJ to guarantee stopping
65
Stopping Rules
  • Boundary shape functions


    • Pj measures the proportion of information accrued at the j-th analysis
      • often Pj = Nj / NJ


    • Boundary shape function f(Pj)  is a monotonic function used to relate the dependence of boundaries at successive analyses on the information accrued to the study at that analysis
66
Stopping Rules
  • Formulation of stopping boundaries
    • At the j-th analysis


      • aj is determined by θa = θ0−  and fa(Pj)


      • bj is determined by θb = θ−  and fb(Pj)

      • cj is determined by θc = θ+  and fc(Pj)

      • dj is determined by θd = θ0+  and fd(Pj)
67
Stopping Rules
  • Parameterization of boundary shape functions






    • Distinct parameters possible for each boundary


    • Parameters A*, P*, R* typically chosen by user
    • Critical value G* usually calculated from search
68
 
69
Boundary Scales
  • Choices for test statistic Tn
    • Sum of observations
    • Point estimate of treatment effect
    • Normalized (Z) statistic
    • Fixed sample P value
    • Error spending function
    • Conditional probability
    • Predictive probability
    • Bayesian posterior probability
70
Boundary Scales
  • Choices for test statistic Tn


    • All of those choices for test statistics can be shown to be transformations of each other


    • Hence, a stopping rule for one test statistic is easily transformed to a stopping rule for a different test statistic


    • We regard these statistics as representing different scales for expressing the boundaries
71
Boundary Scales: Notation
  • One sample inference about means
    • Generalizable to most other commonly used models

72
Boundary Scales
  • Partial Sum Scale:




  • Uses:
    • Cumulative number of events
    • Convenient when computing density
73
Boundary Scales
  • Sample Mean Scale:








  • Uses:
    • Natural estimate of treatment effect


74
Boundary Scales
  • Normalized Statistic Scale:






  • Uses:
    • Commonly computed in analysis routines
75
Boundary Scales
  • Fixed Sample P value Scale:





  • Uses:
    • Commonly computed in analysis routine
    • Robust to use with other distributions for estimates of treatment effect
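The first four scales are simple transformations of one another in the one-sample normal setting. A sketch, assuming known standard deviation and an upper one-sided fixed sample P value (the direction of the P value and all names are assumptions of this example):

```python
from math import sqrt
from statistics import NormalDist


def boundary_scales(s_n, n, mu_0=0.0, sigma=1.0):
    """Express a partial sum s_n after n observations on several boundary scales."""
    sample_mean = s_n / n
    z = sqrt(n) * (sample_mean - mu_0) / sigma
    p_value = 1.0 - NormalDist().cdf(z)      # upper one-sided, by assumption
    return {
        "partial_sum": s_n,
        "sample_mean": sample_mean,
        "z": z,
        "p_value": p_value,
    }


print(boundary_scales(9.8, 25))
```

Because each transformation is monotone for fixed n, a boundary stated on one scale determines the boundary on the others.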
76
Boundary Scales
  • Bayesian Posterior Scale:
    • Prior






  • Uses:
    • Bayesian inference (unaffected by stopping)
    • Posterior probability of hypotheses
77
Boundary Scales
  • Conditional Probability Scale:
    • Threshold at final analysis
    • Hypothesized value of mean






  • Uses:
    • Conditional power
    • Futility of continuing under specific hypothesis
78
Boundary Scales
  • Conditional Probability (estimate) Scale:
    • Threshold at final analysis








  • Uses:
    • Futility of continuing using best estimate


79
Boundary Scales
  • Predictive Probability Scale:
    • Prior  distribution







  • Uses:
    • Futility of continuing study
80
Boundary Scales
  • Predictive Probability Scale:
    • Noninformative Prior







  • Uses:
    • Futility of continuing study
81
Boundary Scales
  • Error Spending (outer lower boundary) Scale:





  • Uses:
    •  Implementation of stopping rules with flexible determination of number and timing of analyses


82
Boundary Scales
  • Error Spending (inner lower boundary) Scale:





  • Uses:
    • Implementation of stopping rules with flexible determination of number and timing of analyses
83
Boundary Scales
  • Error Spending (inner upper boundary) Scale:





  • Uses:
    •  Implementation of stopping rules with flexible determination of number and timing of analyses
84
Boundary Scales
  • Error Spending (outer upper boundary) Scale:





  • Uses:
    •  Implementation of stopping rules with flexible determination of number and timing of analyses
85
Boundary Scales
  • Use in evaluating designs
    • Several of the boundary scales have interpretations that are useful in evaluating the operating characteristics of a design


      • Sample Mean Scale
      • Conditional Probability Futility Scales
      • Predictive Probability Futility Scale
      • Bayesian Posterior Probability Scale
      • (Error Spending Scale)
86
 
87
Unified Design Family
  • Unifying parameterization for the most commonly used group sequential designs (Kittelson & Emerson, 1999)
    • Rich parameterization facilitates search for stopping rule appropriate for specific applications


    • Inclusion of broad spectrum of designs means that comparisons within this family will consider full range of possible designs


    • (Default family in S+SeqTrial)
88
Unified Design Family
  • Stopping Boundaries for Sample Mean Statistic:
  •      aj =    μa   −    fa(Pj)
  •      bj =    μb   +    fb(Pj)
  •      cj =    μc   −    fc(Pj)
  •      dj =    μd   +    fd(Pj)
89
Unified Design Family
  • Parameterization of boundary shape functions






    • Distinct parameters possible for each boundary


    • Parameters A*, P*, R* typically chosen by user
    • Critical value G* usually calculated from search
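The formula for the boundary shape function did not survive in the slide text. The sketch below assumes the form published for the unified family by Kittelson & Emerson (1999), f(Π) = [A + Π^(−P) (1 − Π)^R] G, where Π is the proportion of information accrued. Treat that form, and the function name, as assumptions to check against the original slide.

```python
def boundary_shape(pi, A=0.0, P=1.0, R=0.0, G=1.0):
    """Assumed unified-family shape: (A + pi**(-P) * (1 - pi)**R) * G, 0 < pi <= 1."""
    if not (0.0 < pi <= 1.0):
        raise ValueError("pi must be in (0, 1]")
    return (A + pi ** (-P) * (1.0 - pi) ** R) * G


# With A = R = 0 the shape is G * pi**(-P): larger P gives a wider
# boundary at early analyses, so early stopping is harder.
for pi in (0.25, 0.5, 1.0):
    print(pi, boundary_shape(pi, P=0.5), boundary_shape(pi, P=1.0))
```

In a design search, A, P, and R would be fixed by the user and G found numerically so that the desired operating characteristics are met.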
90
Unified Design Family
  • Choice of P parameter
    • P ≥ 0:
      •  Larger positive values of P make early stopping more difficult (impossible when P infinite)
      • When A=R=0, 0.5 < P < 1 corresponds to power family parameter (Δ) in Wang & Tsiatis (1987): P = 1 − Δ
      • Reasonable range of values: 0 < P < 2.5
      • P=0 with A=R=0 possible for some (not all) boundaries, but not particularly useful
91
Unified Design Family
  • Effect of varying P>0 (when A=0, R=0)
    • Higher P leads to early conservatism
    • P > 0 has infinite boundaries when N=0
92
Unified Design Family
  • Choice of P parameter


    • P < 0:
      •  Must have R = 0 and (typically) A < 0
      • More negative values of P make early stopping more difficult
93
Unified Design Family
  • Effect of varying P<0 (when A=2, R=0)
    • More negative P leads to early conservatism
    • P < 0 has finite boundaries when N=0
94
Unified Design Family
  • Choice of R parameter
    • R > 0:
      • Larger positive values of R make early stopping easier
      • When R>0 and P=0, typically need A>0
      • Reasonable range of values: 0.1 < R < 20
      • R < 1 is convex outward
      • R > 1 is convex inward
      • When R>0 and P>0, can get change in convexity of boundaries
95
Unified Design Family
  • Effect of varying R (when A=1, P=0)
    • R < 1 leads to convex outward
    • R > 1 leads to convex inward
96
Unified Design Family
  • Effect of varying R (when A=1, P=0.5)
    • With P > 0, boundaries infinite when N=0
    • R < 1 and P > 0 has change in convexity
97
Unified Design Family
  • Choice of A parameter
      • Lower absolute values of A make it harder to stop at early analyses
      • Valid choices of A depend upon choices of P and R
      • Useful ranges for A
        • P ≥ 0, R ≥ 0:      0.2  ≤  A  ≤  15
        • P ≤ 0, R = 0:      −15  ≤  A  ≤  −1.25



98
Unified Design Family
  • Effect of varying A (when P=0, R=1.2)
    • Values of A closer to 0 make it harder to stop early
    • Higher absolute value of A makes flatter boundaries
99
Unified Design Family
  • Parameterization of boundary shape function includes many previously described approaches


    • Wang & Tsiatis Boundary Shape Functions:
      • A*  =     0,  R*  =     0, P*  >     0
      • P* measures early conservatism
        • P*  =  0.5   Pocock (1977)
        • P*  =  1.0   O’Brien-Fleming (1979)
      • (P* = ∞ precludes early stopping)
100
Unified Design Family
  • Parameterization of boundary shape function includes many previously described approaches
    • Triangular Test Boundary Shape Functions (Whitehead)
      • A*  =     1,  R*  =     0, P*  = 1
    • Sequential Conditional Probability Ratio Test (Xiong):
      • R*  =     0.5, P*  = 0.5
101
Unified Design Family
  • Parameterization of hypothesis shifts and boundary shape function unifies what were discrete families
    • Triangular tests vs Wang and Tsiatis based families
      • Choice of A*

    • One-sided vs two-sided tests
      • Choice of εL, εU

    • Early stopping under one hypothesis vs both hypotheses
      • Choice of P*
102
Unified Design Family
  • Spectrum of designs
    • εL increases across rows
    • Pa  and/or  Pc  increases down columns
103
Unified Design Family
  • Operating characteristics


    • User specifies size αU, αL of upper and lower tests


    • User specifies power βU, βL of upper and lower tests


    • Computer search for Ga, Gb, Gc, Gd that attains those operating characteristics


    • (Sample size can be computed using some other power besides βU, βL)
104
 
105
Error Spending Family
  • Lan and DeMets (1983) approach
    • At each analysis, some of the type I error is ‘used up’


    • Describe a stopping rule according to the proportion of αU, αL used at each analysis
      • General case: alpha used by the j-th analysis determined by some function of the proportion of maximal information available
106
Error Spending Family
  • Lan and DeMets (1983) approach (cont.)
    • Lan and DeMets (1983) describe error spending functions comparable to O’Brien-Fleming or Pocock designs
      • O’Brien-Fleming




      • Pocock
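The two spending functions are not legible in the slide text. The sketch below uses the forms usually attributed to Lan and DeMets (1983): 2 − 2Φ(z(1−α/2) / √t) for the O’Brien-Fleming type and α ln(1 + (e − 1)t) for the Pocock type, with t the proportion of maximal information. These forms and the function names are assumptions to check against the original slide.

```python
from math import e, log, sqrt
from statistics import NormalDist


def obf_type_spending(t, alpha=0.05):
    """Cumulative type I error spent by information fraction t (O'Brien-Fleming type)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - NormalDist().cdf(z / sqrt(t)))


def pocock_type_spending(t, alpha=0.05):
    """Cumulative type I error spent by information fraction t (Pocock type)."""
    return alpha * log(1.0 + (e - 1.0) * t)


for t in (1 / 3, 2 / 3, 1.0):
    print(round(t, 3), obf_type_spending(t), pocock_type_spending(t))
```

Both functions spend the full α at t = 1, but the O’Brien-Fleming type spends very little of it at early analyses.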
107
Error Spending Family
  • Lan and DeMets (1983) approach (cont.)
    • Lan and DeMets (1983) describe error spending functions comparable to O’Brien-Fleming or Pocock designs for specific type I errors
108
Error Spending Family
  • Lan and DeMets (1983) approach (cont.)
    • More recently authors have focussed on error spending functions of the form






    • (Kim and DeMets, 1987; Jennison and Turnbull, 1989; Hwang, Shih, and DeCani, 1990)
109
Error Spending Family
  • Lan and DeMets (1983) approach (cont.)
    • Kim and DeMets (1987) and Jennison and Turnbull (1989) consider an error spending family corresponding to




    • Useful special cases identified by those authors:
      • P = 1 is similar to Pocock (1977)
      • P = 3 is similar to O’Brien and Fleming (1979)
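A sketch of that family, assuming the power form α · Π^P in which it is usually written (the formula itself is missing from the slide text, so the form and the names below are assumptions):

```python
def power_family_spending(pi, alpha=0.05, P=1.0):
    """Cumulative error spent when a proportion pi of information has accrued."""
    if not (0.0 <= pi <= 1.0):
        raise ValueError("pi must be in [0, 1]")
    return alpha * pi ** P


def spending_increments(n_analyses, alpha=0.05, P=1.0):
    """Error spent at each of n_analyses equally spaced analyses."""
    cumulative = [power_family_spending(j / n_analyses, alpha, P)
                  for j in range(1, n_analyses + 1)]
    return [cumulative[0]] + [cumulative[j] - cumulative[j - 1]
                              for j in range(1, n_analyses)]


print(spending_increments(3, P=1.0))   # equal amounts at each analysis
print(spending_increments(3, P=3.0))   # most error saved for the final analysis
```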

110
Error Spending Family
  • Pampallona, Tsiatis, and Kim (1995) extension


    • Defines type II error spending functions


    • At each analysis, recompute maximal sample size which will maintain planned level of significance and power
111
Error Spending Family
  • Implementation of an Error Spending Family
    • Define stopping rule on error spending function scale by defining Eaj, Ebj, Ecj, Edj


    • Use framework of superposed one-sided hypothesis tests described by Kittelson and Emerson (1999) to define relationships among hypotheses rejected by each of the four possible stopping boundaries
112
Error Spending Family
  • Correspondence with type I and II error spending
    • For user specified size αU, αL of upper and lower tests and power βU, βL of upper and lower tests, error spent at the j-th analysis specified as:
113
Error Spending Family
  • Boundary shape functions
    • Boundary shape function can be defined separately for each of the four boundaries
114
Error Spending Family
  • Constraints on parameters
    • f(0) = 0 and f(1) = 1


    • If P < 0
      • R = 0,  A = 1, G = 1


    • If R > 0
      • P = 0, A = -1, G = -1


    • If P = 0 and R = 0, no early stopping
115
Error Spending Family
  • Computer search for stopping boundaries
    • Error spending family defines Eaj, Ebj, Ecj, Edj


    • Appendix of Kittelson and Emerson (1999) describes general algorithm for finding design when hypotheses known


    • At design stage, must search for standardized hypotheses that result in a valid design, and then compute sample size to map standardized design to specified alternative hypotheses.


116
Error Spending Family
  • Computer search for stopping boundaries (cont.)


    • In order to more easily obtain more efficient designs, when designing a study using error spending functions, the specified type II error spending functions are only used as upper bounds on the true type II error spending function.
117
 
118
Comparison of Parameterizations
  • General comments
    • Families also defined for other boundary scales
      • Partial sum and Z statistic scale families implemented in S+SeqTrial
      • Bayesian and Futility scale families under construction

    • If stopping rules are carefully evaluated, it does not matter too much which scale (and therefore family) is used to derive the stopping rule.
119
Comparison of Parameterizations
  • General comments (cont.)
    • The best design family to use will be the one which allows a user to most quickly find a stopping rule having desirable operating characteristics


    • The ease of use will therefore depend in part on
      • Interpretability of boundary scale
      • Interpretability of parameters
120
Comparison of Parameterizations
  • General comments (cont.)
    • My view:
      • Sample mean scale (unified family) has easier scientific interpretation than the error spending scale which has a purely statistical interpretation that, in my experience, is poorly understood by both users and researchers
      • The parameterization of the unified family produces a more useful grouping of designs on some level than does the parameterization of the error spending family
121
Comparison of Parameterizations
  • ASSERTION: Interpretability of boundary scales
    • The concept of an error spending scale is less relevant to clinical researchers
      • Type I error reflects only statistical evidence
      • May conflict with scientific importance
        • Underpowered studies: Failure to reject the null in the face of large estimates of treatment effect
        • Overpowered studies: Rejection of the null hypothesis when differences are scientifically unimportant


122
Comparison of Parameterizations
  • ASSERTION: Interpretability of boundary scales
    • The formulation of error spending scales is not well understood by the researchers developing such methods
      • Lan & DeMets (1983), Kim & DeMets (1987), Jennison & Turnbull (1989 and 2000) all describe error spending functions which mimic O’Brien-Fleming (1979) or Pocock (1977) group sequential designs
      • In fact, for different levels of type I (or type II) error, the error spending functions are different within those families of designs
123
Comparison of Parameterizations
  • Error spent at each analysis for O’Brien-Fleming (1979) designs depends on the level of type I or type II error
124
Comparison of Parameterizations
  • Error spent at each analysis for Pocock (1977) designs depends on the level of type I or type II error
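The dependence noted on these two slides can be checked numerically in the simplest case. The following is a minimal sketch, assuming a one-sided design with two equally spaced analyses and Z-scale boundaries b_j = c * t_j^(0.5 - P), with P = 1 taken as the O’Brien-Fleming shape and P = 0.5 as the Pocock shape. The function names are illustrative and are not part of S+SeqTrial.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def crossing_probs(c, P, n_grid=1000):
    """Return (error spent at analysis 1, total one-sided type I error)
    for boundaries b1 = c * 0.5**(0.5 - P) and b2 = c under the null."""
    b1 = c * 0.5 ** (0.5 - P)
    b2 = c
    lo = -8.0  # effectively minus infinity for a standard normal
    h = (b1 - lo) / n_grid
    total = 0.0
    for i in range(n_grid + 1):  # Simpson's rule over the continuation region
        z1 = lo + i * h
        weight = 1 if i in (0, n_grid) else (4 if i % 2 else 2)
        # Under the null, Z2 = (Z1 + W) / sqrt(2) with W ~ N(0, 1),
        # so P(Z2 >= b2 | Z1 = z1) = 1 - Phi(sqrt(2) * b2 - z1).
        total += weight * STD.pdf(z1) * (1.0 - STD.cdf(sqrt(2.0) * b2 - z1))
    spent_2 = total * h / 3.0
    spent_1 = 1.0 - STD.cdf(b1)
    return spent_1, spent_1 + spent_2


def solve_constant(alpha, P):
    """Bisection for the constant c that gives total type I error alpha."""
    lo, hi = 0.5, 6.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if crossing_probs(mid, P)[1] > alpha:
            lo = mid  # boundary too low: too much error, raise c
        else:
            hi = mid
    return 0.5 * (lo + hi)


def proportion_spent_first(alpha, P):
    """Proportion of the total type I error spent at the first analysis."""
    c = solve_constant(alpha, P)
    return crossing_probs(c, P)[0] / alpha


for alpha in (0.005, 0.025, 0.05):
    print(alpha,
          round(proportion_spent_first(alpha, 1.0), 3),   # O'Brien-Fleming shape
          round(proportion_spent_first(alpha, 0.5), 3))   # Pocock shape
```

If the proportion spent at the first analysis were the same for every alpha, a single error spending function would describe the whole family; the printed proportions let that be checked directly.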
125
Comparison of Parameterizations
  • Is there a problem?
    • Parameterization of stopping rule families induces a grouping of designs:


      • Unified family: Pocock (1977) designs, O’Brien-Fleming (1979) designs, Triangular designs (Whitehead & Stratton, 1983)


      • Error spending families: All designs that spend the same proportion of type I or II error at each analysis
126
Comparison of Parameterizations
  • Is there a problem? (cont.)


    • Best parameterization might be defined according to whether such groupings correspond to similar operating characteristics
      • efficiency
      • Bayesian properties
      • futility properties
      • others

127
Comparison of Parameterizations
  • Efficiency
    • Consider how well the choice of boundary shape parameter predicts the efficiency of a design


      • No uniformly most powerful design


      • Efficiency measured in terms of smallest average sample size for specific hypothesis
        • Index each alternative hypothesis by the power of the test to detect it
128
Comparison of Parameterizations
  • Methods for comparison
    • Find optimal designs in terms of average sample size (ASN) within family of Wang and Tsiatis (1987) boundary shape functions for one-sided symmetric designs (Emerson and Fleming, 1989)
      • Family found to be approximately optimal


    • Find optimal designs for various choices of type I error and statistical power
129
Comparison of Parameterizations
  • Methods for comparison (cont.)
    • For each optimal design, examine the boundary shape function on
      • Sample mean scale
      • Error spending scale
      • Futility scales
130
Comparison of Parameterizations
  • Criteria for “good” parameterizations
    • If the boundary shape function on a given scale is not independent of choice of type I and II errors, then that would argue that grouping of designs according to parameterization of that scale will not correspond to similar efficiency properties


    • As it is unlikely that boundary shape parameters for efficient designs will be constant across all choices of type I and type II errors, we can also compare the degree to which boundary shape parameters change for each boundary scale


131
Comparison of Parameterizations
  • Proportion of error spent at each analysis for approximately efficient designs
    • Power varies across panels
    • Type I error varies across lines within each panel
132
Comparison of Parameterizations
  • Conditional power (using MLE) at the boundary for each analysis for approximately efficient designs
    • Power varies across panels
    • Type I error varies across lines within each panel
133
Comparison of Parameterizations
  • Comparison of optimal unified family P parameter as a function of type I errors
    • Compared to best fitting P or R parameter in error spending family
134
Comparison of Parameterizations
  • Search for stopping rule is generally iterative
      • An initial design is specified
      • Operating characteristics are examined
      • Modifications are made to the design

  • Availability of tools for evaluation of operating characteristics lessens impact of family used to define a stopping rule
      • Appropriate designs can be found from almost any starting point
135
Comparison of Parameterizations
  • To the extent that the parameterization of the sample mean family predicts efficiency behavior, use of that family may allow a more intuitive search for suitable stopping rules


    • However, efficiency is not always of paramount concern
136
Comparison of Parameterizations
  • Interpretation of unified family boundaries as estimates of treatment effect is meaningful to clinical researchers


  • Error spending functions are less interpretable, and thus seem less useful when designing a clinical trial or evaluating its operating characteristics
    • However, error spending scale can be useful in implementing a stopping rule
137
Comparison of Parameterizations
  • It is not clear that conditional probabilities are particularly useful in the definition of a stopping rule
      • Design family does not have a particularly intuitive parameterization
      • Unconditional power considerations would seem more straightforward


138
 
139
Evaluation of Designs
  • Process of choosing a trial design


    • Define candidate design


    • Evaluate operating characteristics


    • Modify design


    • Iterate
140
Evaluation of Designs
  • Operating characteristics for fixed sample studies


    • Level of Significance (often pre-specified)
    • Sample size requirements
    • Power Curve
    • Decision Boundary
    • Frequentist inference on the Boundary
    • Bayesian posterior probabilities


141
Evaluation of Designs
  • Additional operating characteristics for group sequential studies


    • Probability distribution for sample size
    • Stopping probabilities
    • Boundaries at each analysis
    • Frequentist inference at each analysis
    • Bayesian inference at each analysis
    • Futility measures at each analysis


142
Evaluation of Designs
  • Sample size requirements
    • Number of subjects needed is a random variable


    • Quantify summary measures of sample size distribution
      • maximum (feasibility of accrual)
      • mean (average sample number, ASN)
      • median, quartiles

    • (Particularly consider tradeoffs between power and sample size distribution)
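The summary measures above follow directly from the stopping probabilities once the analysis times are fixed. A minimal sketch, assuming the sample size at each analysis and the probability of stopping there are already known; the numbers in the usage lines are hypothetical and do not come from any design in these notes.

```python
def sample_size_summary(sizes, stop_probs):
    """Summarize the distribution of the sample size at stopping.

    sizes      -- cumulative sample size at each analysis (increasing)
    stop_probs -- probability of stopping at each analysis (sums to 1)
    """
    if abs(sum(stop_probs) - 1.0) > 1e-9:
        raise ValueError("stopping probabilities must sum to 1")

    def quantile(q):
        # Smallest analysis sample size whose cumulative probability reaches q
        cum = 0.0
        for n, p in zip(sizes, stop_probs):
            cum += p
            if cum >= q - 1e-12:
                return n
        return sizes[-1]

    return {
        "max": sizes[-1],  # feasibility of accrual
        "asn": sum(n * p for n, p in zip(sizes, stop_probs)),  # mean
        "q25": quantile(0.25),
        "median": quantile(0.50),
        "q75": quantile(0.75),
    }


# Hypothetical three-analysis design
print(sample_size_summary([100, 200, 300], [0.2, 0.5, 0.3]))
```

Because the stopping probabilities depend on the true treatment effect, the same summary would be recomputed under each alternative of interest.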


143
Evaluation of Designs
  • Stopping probabilities


    • Consider probability of stopping at each analysis for arbitrary alternatives


    • Consider probability of each decision (for null or alternative) at each analysis
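For a two-analysis design these probabilities reduce to a one-dimensional integral. A minimal sketch, assuming a normally distributed Z statistic with drift theta (the mean of the final Z statistic), a first analysis at information fraction t1, an upper boundary (b1, b2) for deciding in favor of the alternative, and a finite lower boundary a1 at the first analysis for deciding in favor of the null. The boundary values in the usage lines are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def stopping_probs(theta, a1, b1, b2, t1=0.5, n_grid=1000):
    """Probability of each (analysis, decision) pair under drift theta."""
    mu1 = theta * sqrt(t1)                    # mean of Z1
    stop1_upper = 1.0 - STD.cdf(b1 - mu1)     # Z1 >= b1
    stop1_lower = STD.cdf(a1 - mu1)           # Z1 <= a1
    cont = 1.0 - stop1_upper - stop1_lower    # a1 < Z1 < b1

    # Simpson's rule over the continuation region (a1, b1)
    h = (b1 - a1) / n_grid
    total = 0.0
    for i in range(n_grid + 1):
        z1 = a1 + i * h
        weight = 1 if i in (0, n_grid) else (4 if i % 2 else 2)
        # Z2 = sqrt(t1) * Z1 + increment, increment ~ N(theta * (1 - t1), 1 - t1)
        cond = 1.0 - STD.cdf((b2 - sqrt(t1) * z1 - theta * (1.0 - t1)) / sqrt(1.0 - t1))
        total += weight * STD.pdf(z1 - mu1) * cond
    stop2_upper = total * h / 3.0
    stop2_lower = cont - stop2_upper

    return {
        "analysis 1, for alternative": stop1_upper,
        "analysis 1, for null": stop1_lower,
        "analysis 2, for alternative": stop2_upper,
        "analysis 2, for null": stop2_lower,
    }


# Illustrative boundaries: stop early for futility if Z1 <= 0
for theta in (0.0, 1.5, 3.0):
    print(theta, stopping_probs(theta, a1=0.0, b1=2.796, b2=1.977))
```

Summing the two "for alternative" entries over a grid of theta values gives one point of the power curve per theta.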
144
Evaluation of Designs
  • Power curve
    • Probability of rejecting null for arbitrary alternatives
      • Power under null: level of significance
      • Power for specified alternative


    • Alternative rejected by design
      • Alternative for which study has high power

    • S+SeqTrial defines
      • Power curves for upper and lower boundaries
      • Alternatives having specified power for each boundary
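The two quantities on this slide have closed forms in the fixed sample case, which is the reference against which a group sequential power curve is usually compared. A minimal sketch, assuming a one-sided level alpha test of theta = 0 based on a sample mean with known standard deviation sigma; the function names are illustrative and are not S+SeqTrial functions.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def power(theta, n, sigma=1.0, alpha=0.025):
    """Probability of rejecting the null when the true effect is theta."""
    z_crit = STD.inv_cdf(1.0 - alpha)
    return 1.0 - STD.cdf(z_crit - theta * sqrt(n) / sigma)


def alternative_with_power(target, n, sigma=1.0, alpha=0.025):
    """Effect size that the test detects with probability `target`."""
    return (STD.inv_cdf(1.0 - alpha) + STD.inv_cdf(target)) * sigma / sqrt(n)


n = 100
print(power(0.0, n))                     # power under the null: the level
print(alternative_with_power(0.80, n))   # alternative detected with 80% power
print(alternative_with_power(0.975, n))  # alternative detected with 97.5% power
```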
145
Evaluation of Designs
  • Decision boundary at each analysis
    • Value of test statistic leading to rejection of null
      • Variety of boundary scales possible

    • Often has meaning for applied researchers (especially on scale of estimated treatment effect)
      • Estimated treatment effects may be viewed as unacceptable for ethical reasons based on prior notions
      • Estimated treatment effect may be of little interest due to lack of clinical importance or futility of marketing
146
Evaluation of Designs
  • Frequentist inference on the boundary at each analysis
    • Consider P values, confidence intervals when observation corresponds to decision boundary at each analysis


    • Ensure desirable precision for negative studies
      • Confidence interval identifies hypotheses not rejected by analysis
      • Have all scientifically meaningful hypotheses been rejected?
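A fixed sample sketch shows what inference on the boundary looks like. It assumes a one-sided level alpha test of theta = 0 with a normally distributed estimate of known standard error, and it makes no adjustment for a stopping rule, which inference after a group sequential test would need. The function names are illustrative.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def boundary_inference(se, alpha=0.025):
    """Estimate, confidence interval and P value when the observed
    estimate falls exactly on the decision boundary."""
    z = STD.inv_cdf(1.0 - alpha)
    crit = z * se                          # smallest estimate that rejects the null
    ci = (crit - z * se, crit + z * se)    # two-sided 100 * (1 - 2 * alpha)% interval
    p_value = 1.0 - STD.cdf(crit / se)     # one-sided P value
    return crit, ci, p_value


def alternative_with_power(se, target, alpha=0.025):
    """Effect size detected with probability `target`."""
    return (STD.inv_cdf(1.0 - alpha) + STD.inv_cdf(target)) * se


crit, ci, p = boundary_inference(se=0.5)
print(crit, ci, p)
print(alternative_with_power(0.5, 0.975))
```

In this setting the interval at the boundary runs from the null value up to the alternative detected with power 1 - alpha, so an estimate just below the boundary yields an interval that excludes that alternative. Whether that is precise enough depends on which effects are scientifically meaningful.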
147
Evaluation of Designs
  • Bayesian posterior probabilities at each analysis


    • Examine the degree to which the frequentist inference leads to sensible decisions under a range of prior distributions for the treatment effect
      • Posterior probability of hypotheses

    • Bayesian estimates of treatment effect
      • Median (mode) of posterior distribution
      • Credible interval (quantiles of posterior distribution)
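These quantities are easy to tabulate over a range of priors in the conjugate normal case. A minimal sketch, assuming a normal prior for the treatment effect and a normally distributed estimate with known standard error; the priors and the estimate in the usage lines are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def normal_posterior(prior_mean, prior_sd, estimate, se):
    """Posterior for the treatment effect under a normal prior and a
    normal likelihood with known standard error (precision weighting)."""
    w_prior = 1.0 / prior_sd ** 2
    w_data = 1.0 / se ** 2
    post_var = 1.0 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * estimate)
    return NormalDist(post_mean, sqrt(post_var))


# Hypothetical estimate on a decision boundary, examined under three priors
estimate, se = 0.98, 0.5
for prior_mean, prior_sd in ((0.0, 0.5), (0.0, 2.0), (1.0, 0.5)):
    post = normal_posterior(prior_mean, prior_sd, estimate, se)
    print(
        (prior_mean, prior_sd),
        round(post.mean, 3),                                        # median (mode) of posterior
        (round(post.inv_cdf(0.025), 3), round(post.inv_cdf(0.975), 3)),  # 95% credible interval
        round(1.0 - post.cdf(0.0), 3),                              # posterior P(theta > 0)
    )
```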

148
Evaluation of Designs
  • Futility measures
    • Consider the probability that a different decision would result if trial continued


    • Can be based on particular hypotheses, current best estimate, or predictive probabilities


    • (Perhaps best measure of futility is whether the stopping rule has changed the power curve substantially)
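Conditional power is one such measure. The sketch below uses the standard result for a normally distributed Z statistic: given Z_t at information fraction t, the final statistic equals sqrt(t) * Z_t plus an independent increment with mean theta * (1 - t) and variance 1 - t, where theta is the mean of the final Z statistic. The value of theta can be a particular hypothesis or the current estimate (MLE), Z_t / sqrt(t). The function names and the final critical value are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def conditional_power(z_t, t, theta, z_crit=1.96):
    """P(final Z >= z_crit | interim Z = z_t at information fraction t),
    assuming the drift of the final Z statistic is theta."""
    b_t = z_t * sqrt(t)  # interim statistic on the partial sum (B-value) scale
    return 1.0 - STD.cdf((z_crit - b_t - theta * (1.0 - t)) / sqrt(1.0 - t))


def conditional_power_mle(z_t, t, z_crit=1.96):
    """Conditional power with theta set to its current estimate."""
    return conditional_power(z_t, t, z_t / sqrt(t), z_crit)


z_t, t = 0.5, 0.5
print(conditional_power(z_t, t, theta=0.0))    # under the null
print(conditional_power(z_t, t, theta=3.24))   # under a hypothetical design alternative
print(conditional_power_mle(z_t, t))           # under the current estimate
```

The three values can differ widely for the same interim result, which is one reason to compare them with the change in the unconditional power curve.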
149
S+SeqTrial Implementation
  • Evaluation of Designs


    • Forms of output from S+SeqTrial
      • Printed output in report window or command line window
      • Plots
      • Named seqDesign object
150
S+SeqTrial Implementation
  • Evaluation of Designs (cont.)
    • Sample size requirements
      • Printed with boundaries
      • X axis with plots of boundaries
      • Plots of average sample size, quantiles of sample size distribution


    • Stopping probabilities
      • Printed with operating characteristics
      • Plots with color coded decisions
151
S+SeqTrial Implementation
  • Evaluation of Designs (cont.)
    • Power Curve
      • Hypotheses, size, power printed with boundaries
      • Tabled power with summaries
      • Plots of power curve
      • Plots versus reference power curve
152
S+SeqTrial Implementation
  • Evaluation of Designs (cont.)
    • Decision Boundary
      • Printed on specified boundary scale
      • Plots


    • Frequentist inference on the boundary
      • Printed with summaries
      • Plots
153
S+SeqTrial Implementation
  • Evaluation of Designs (cont.)
    • Bayesian inference
      • Posterior probabilities implemented as a boundary scale
      • Median (mode) of posterior distribution
      • Credible intervals


    • Futility measures
      • Implemented as boundary scale
      • Conditional and predictive approaches