|
1
|
- Group Sequential Stopping Rules
- Need for Monitoring a Trial
- Criteria for Early Stopping
- Inadequacy of Fixed Sample Methods
- Stopping Rules
- Families of Designs
- Boundary Scales
- Unified Family (Sample Mean Scale)
- Error Spending Family
- Comparison of Parameterizations
- Evaluation of Group Sequential Designs
|
|
2
|
|
|
3
|
- Fixed sample two-sided tests
- Test of a two-sided alternative (θ+ > θ0 > θ-)
- Upper Alternative: H+: θ ≥ θ+ (superiority)
- Null: H0: θ = θ0 (equivalence)
- Lower Alternative: H-: θ ≤ θ- (inferiority)
- Data analyzed once at the end of all data accrual
- Decisions:
- Reject H0, H- (for H+) ⇔ T ≥ cU
- Reject H+, H- (for H0) ⇔ cL ≤ T ≤ cU
- Reject H+, H0 (for H-) ⇔ T ≤ cL
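The three-way decision rule above can be sketched in a few lines. This is only an illustration: the function name and the example critical values are hypothetical, and ties at a critical value are resolved in favor of the alternatives, a choice the slide leaves open.

```python
def fixed_sample_decision(t, c_lower, c_upper):
    """Return the hypothesis retained by a two-sided fixed sample test.

    t        : test statistic computed once, after all data have accrued
    c_lower  : lower critical value cL
    c_upper  : upper critical value cU
    """
    if c_lower > c_upper:
        raise ValueError("c_lower must not exceed c_upper")
    if t >= c_upper:
        return "H+"  # reject H0 and H-
    if t <= c_lower:
        return "H-"  # reject H+ and H0
    return "H0"      # reject H+ and H-


# Hypothetical critical values on a Z statistic scale.
print(fixed_sample_decision(2.3, -1.96, 1.96))
```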
|
|
4
|
- Ethical concerns
- Patients already on trial
- Avoid continued administration of harmful treatments
- Maintain validity of informed consent
|
|
5
|
- Ethical concerns (cont.)
- Patients not yet on trial
- Start treatment with best therapy
- Ensure informed consent valid
|
|
6
|
- Ethical concerns (cont.)
- Patients never on trial
- Facilitate rapid introduction of beneficial treatments
- Warn about risks of existing treatments
|
|
7
|
- Efficiency considerations
- Fewer patients may be needed on average
- Decreases costs associated with number of patients
- Time savings
- Decreases costs associated with monitoring patients
|
|
8
|
- Futility considerations: Efficiency and Ethics
- Efficiency
- Stop a study when it is known (or reasonably certain) that no effect
will be demonstrated
- Can perform more studies with limited resources
- Ethics
- Is it ever ethical to expose patients to experimental treatments when
no meaningful information will be gained?
- Can devote resources to study of more promising agents
|
|
9
|
|
|
10
|
- Sufficient evidence available to be confident of rejecting specific
hypotheses
- Stopping early for
- Efficacy (superiority)
- Harm (inferiority)
- Equivalence
|
|
11
|
- Futility of demonstrating effect that would change behavior
- Stopping early for futility
- Not sufficiently superior
- Not dangerously harmful
|
|
12
|
- And there is no advantage in continuing
- Even if confident of ultimate decision about primary endpoint, may want
to continue trial to gain more information on
- Safety
- Longer term follow-up
- Gather additional data on secondary outcomes
|
|
13
|
- Statistical basis for stopping criteria
- Curtailment
- Boundary has been reached early
- E.g., one arm study with binary endpoint
- Critical value for rejection of null might be observation of K events
- Kth event may occur well before all subjects accrued
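A minimal sketch of curtailment for the one arm binary endpoint example, assuming outcomes are coded 0/1 in order of accrual. The function name and the data are made up for illustration.

```python
def first_curtailment_index(outcomes, k):
    """Return the 1-based subject index at which the k-th event occurs.

    Returns None if fewer than k events are observed, in which case the
    boundary is never reached and the trial runs to full accrual.
    """
    events = 0
    for i, y in enumerate(outcomes, start=1):
        events += y
        if events >= k:
            return i
    return None


# Hypothetical data: 10 planned subjects, rejection requires K = 4 events.
outcomes = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
print(first_curtailment_index(outcomes, 4))  # 6: decision known after 6 of 10 subjects
```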
|
|
14
|
- Statistical basis for stopping criteria (cont.)
- Stochastic Curtailment
- High probability that a particular decision will be made at final
analysis
- Calculate probability of exceeding some critical value conditional on
data observed so far
- Probability calculated based on hypothesized treatment effect (which
hypothesis?) or current estimate
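The conditional probability described above can be written down directly for the simplest setting. The sketch below assumes independent normal observations with known standard deviation and a final analysis that rejects when the sample mean of all N observations is at least `crit_mean`. All names and numbers are illustrative. It requires n < N.

```python
from math import sqrt
from statistics import NormalDist


def conditional_power(xbar_n, n, N, crit_mean, mu, sigma):
    """P(final sample mean >= crit_mean | first n observations), given true mean mu."""
    remaining = N - n
    # Sum still needed from the remaining observations to reach the threshold.
    needed_sum = N * crit_mean - n * xbar_n
    z = (needed_sum - remaining * mu) / (sigma * sqrt(remaining))
    return 1.0 - NormalDist().cdf(z)


# Halfway through a hypothetical trial of N = 100 with observed mean 0.05.
cp_hypothesized = conditional_power(0.05, 50, 100, 0.196, mu=0.30, sigma=1.0)
cp_estimate = conditional_power(0.05, 50, 100, 0.196, mu=0.05, sigma=1.0)
print(cp_hypothesized, cp_estimate)
```

The two calls differ only in the value plugged in for the mean, which is exactly the "which hypothesis?" question: the same interim data look far more or less futile depending on that choice.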
|
|
15
|
- Statistical basis for stopping criteria (cont.)
- Predictive probability of final statistic
- A special form of stochastic curtailment
- Uses a Bayesian prior distribution on the treatment effect
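A sketch of a predictive probability, assuming normal data with known standard deviation and a conjugate normal prior on the mean (both are assumptions made here for tractability, not choices stated on the slide). Instead of fixing the mean at one hypothesized value, the calculation averages over its posterior distribution. It requires n < N.

```python
from math import sqrt
from statistics import NormalDist


def predictive_probability(xbar_n, n, N, crit_mean, sigma, prior_mean, prior_sd):
    """P(final sample mean >= crit_mean | data), averaged over the posterior for mu."""
    # Conjugate normal update for mu with known sigma.
    post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
    post_mean = post_var * (prior_mean / prior_sd**2 + n * xbar_n / sigma**2)
    remaining = N - n
    needed_sum = N * crit_mean - n * xbar_n
    # Predictive distribution of the sum of the remaining observations:
    # sampling variance plus uncertainty about mu.
    pred_mean = remaining * post_mean
    pred_sd = sqrt(remaining * sigma**2 + remaining**2 * post_var)
    return 1.0 - NormalDist().cdf((needed_sum - pred_mean) / pred_sd)


# Hypothetical interim look with a fairly diffuse prior.
print(predictive_probability(0.05, 50, 100, 0.196, 1.0, prior_mean=0.0, prior_sd=10.0))
```

With a prior concentrated at a single value, the result reduces to the conditional probability under that value, which is the sense in which this is a special form of stochastic curtailment.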
|
|
16
|
- Statistical basis for stopping criteria (cont.)
- Group sequential test
- Sufficient evidence to make decision in classical frequentist
framework
- Type I and II errors controlled at desired levels
|
|
17
|
- Statistical basis for stopping criteria (cont.)
- Bayesian analysis
- Compute the probability that the treatment effect is in some specified
range
- Calculations based on a user specified prior distribution for the
treatment effect (which is treated as a random variable)
|
|
18
|
|
|
19
|
- Sequential monitoring of a trial
- Data are analyzed after accrual of each observation
- (Group sequential monitoring: analysis after groups of observations
accrued)
- Analyses must take into account the repeated analyses of the same data
- Sampling distribution of the test statistic is altered
- Frequentist properties are altered
|
|
20
|
- Setting for demonstration of the problem
- X1, X2, X3, …, XN
- Xi ~ N(μ, σ²)
- H0: μ = μ0
|
|
21
|
- Test Statistic
- Sample mean computed after each observation: X̄n = (X1 + … + Xn) / n
|
|
22
|
- Fixed sample decision rule
- Hypothesis test when all data accrued:
|
|
23
|
- Sample path for sample mean
|
|
24
|
- Sample path for sample mean
|
|
25
|
- Repeated significance testing
|
|
26
|
- Simulated trials when H0 is true:
|
|
27
|
- Simulated trials when H0 is true:
|
|
28
|
- Repeated significance testing
- Monitoring after each of J groups of observations:
- Analyses at N1, N2, …, NJ
- Reject H0 the first time
|
|
29
|
- Simulated trials when H0 is true:
|
|
30
|
- Simulated trials when H0 is true:
|
|
31
|
- Simulate 100,000 Trials under the Null Hypothesis
- Three equally spaced level .05 analyses
  Analysis                  1st
  Proportion significant    .05038
|
|
32
|
- Simulate 100,000 Trials under the Null Hypothesis
- Three equally spaced level .05 analyses
  Analysis                  1st      2nd
  Proportion significant    .05038   .05022
|
|
33
|
- Simulate 100,000 Trials under the Null Hypothesis
- Three equally spaced level .05 analyses
  Analysis                  1st      2nd      3rd
  Proportion significant    .05038   .05022   .05056
|
|
34
|
- Simulate 100,000 Trials under the Null Hypothesis
- Three equally spaced level .05 analyses
  Pattern of        Proportion significant
  significance      1st
  1st only          .03046
  1st, 2nd          .00807
  1st, 3rd          .00317
  1st, 2nd, 3rd     .00868
  Any pattern       .05038
|
|
35
|
- Simulate 100,000 Trials under the Null Hypothesis
- Three equally spaced level .05 analyses
  Pattern of        Proportion significant
  significance      1st      2nd
  1st only          .03046
  1st, 2nd          .00807   .00807
  1st, 3rd          .00317
  1st, 2nd, 3rd     .00868   .00868
  2nd only                   .01921
  2nd, 3rd                   .01426
  Any pattern       .05038   .05022
|
|
36
|
- Simulate 100,000 Trials under the Null Hypothesis
- Three equally spaced level .05 analyses
  Pattern of        Proportion significant
  significance      1st      2nd      3rd
  1st only          .03046
  1st, 2nd          .00807   .00807
  1st, 3rd          .00317            .00317
  1st, 2nd, 3rd     .00868   .00868   .00868
  2nd only                   .01921
  2nd, 3rd                   .01426   .01426
  3rd only                            .02445
  Any pattern       .05038   .05022   .05056
|
|
37
|
- Simulate 100,000 Trials under the Null Hypothesis
- Three equally spaced level .05 analyses
  Pattern of        Proportion significant
  significance      1st      2nd      3rd      Ever
  1st only          .03046                     .03046
  1st, 2nd          .00807   .00807            .00807
  1st, 3rd          .00317            .00317   .00317
  1st, 2nd, 3rd     .00868   .00868   .00868   .00868
  2nd only                   .01921            .01921
  2nd, 3rd                   .01426   .01426   .01426
  3rd only                            .02445   .02445
  Any pattern       .05038   .05022   .05056   .10830
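The inflation in the "Ever" column can be reproduced with a small simulation. The sketch below assumes two-sided tests on the Z statistic and simulates the partial sum process directly at the analysis times instead of generating individual observations. It uses fewer trials than the slide, so the estimates will only be close to the tabulated values, not identical. Function and argument names are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_repeated_tests(n_trials, fractions, nominal_level, seed=0):
    """Estimate rejection proportions under H0 for repeated two-sided tests.

    fractions     : increasing information fractions of the analyses, ending at 1.0
    nominal_level : two-sided level applied at every analysis
    Returns (proportion significant at each analysis, proportion ever significant).
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - nominal_level / 2)
    per_analysis = [0] * len(fractions)
    ever = 0
    for _ in range(n_trials):
        b, prev, any_sig = 0.0, 0.0, False
        for j, t in enumerate(fractions):
            # Independent increment of the standardized partial sum process.
            b += rng.gauss(0.0, sqrt(t - prev))
            prev = t
            if abs(b) / sqrt(t) >= crit:  # Z statistic at information fraction t
                per_analysis[j] += 1
                any_sig = True
        ever += any_sig
    return [c / n_trials for c in per_analysis], ever / n_trials


marginal, ever = simulate_repeated_tests(20000, [1 / 3, 2 / 3, 1.0], 0.05)
print(marginal, ever)
```

Each analysis taken alone holds its level near .05, while the probability of being significant at least once is roughly double that.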
|
|
38
|
- Group sequential test: Pocock (1977) level .05
- Three equally spaced level .022 analyses
  Pattern of        Proportion significant
  significance      1st      2nd      3rd      Ever
  1st only          .01520                     .01520
  1st, 2nd          .00321   .00321            .00321
  1st, 3rd          .00113            .00113   .00113
  1st, 2nd, 3rd     .00280   .00280   .00280   .00280
  2nd only                   .01001            .01001
  2nd, 3rd                   .00614   .00614   .00614
  3rd only                            .01250   .01250
  Any pattern       .02234   .02216   .02257   .05099
|
|
39
|
- Critical values depend on spacing of analyses
- Level .022 analyses at 10%, 20%, 100% of data
  Pattern of        Proportion significant
  significance      1st      2nd      3rd      Ever
  1st only          .01509                     .01509
  1st, 2nd          .00521   .00521            .00521
  1st, 3rd          .00068            .00068   .00068
  1st, 2nd, 3rd     .00069   .00069   .00069   .00069
  2nd only                   .01473            .01473
  2nd, 3rd                   .00165   .00165   .00165
  3rd only                            .01855   .01855
  Any pattern       .02167   .02228   .02157   .05660
|
|
40
|
- The critical values can be varied across analyses
- Level 0.10 O’Brien-Fleming (1979); equally spaced tests at .003, .036, .087
  Pattern of        Proportion significant
  significance      1st      2nd      3rd      Ever
  1st only          .00082                     .00082
  1st, 2nd          .00036   .00036            .00036
  1st, 3rd          .00037            .00037   .00037
  1st, 2nd, 3rd     .00127   .00127   .00127   .00127
  2nd only                   .01164            .01164
  2nd, 3rd                   .02306   .02306   .02306
  3rd only                            .06223   .06223
  Any pattern       .00282   .03633   .08693   .09975
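The nominal levels .003, .036, .087 quoted above follow a simple pattern. The sketch below assumes the usual O'Brien-Fleming shape on the Z statistic scale for J equally spaced analyses, with the critical value at analysis j equal to the final critical value times sqrt(J / j), and recovers the earlier nominal levels from the final one. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def obf_nominal_levels(final_level, n_analyses):
    """Two-sided nominal levels when z_j = z_final * sqrt(J / j), j = 1..J."""
    std = NormalDist()
    z_final = std.inv_cdf(1 - final_level / 2)
    levels = []
    for j in range(1, n_analyses + 1):
        z_j = z_final * sqrt(n_analyses / j)  # more extreme critical value early on
        levels.append(2 * (1 - std.cdf(z_j)))
    return levels


levels = obf_nominal_levels(0.087, 3)
print([round(p, 3) for p in levels])
```

Because .087 is itself a rounded value, the reconstructed early levels agree with the slide only to about three decimal places.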
|
|
41
|
- Error spending function: Pocock (1977) level .05
  Pattern of        Proportion significant
  significance      1st      2nd      3rd      Ever
  1st only          .01520                     .01520
  1st, 2nd          .00321   .00321            .00321
  1st, 3rd          .00113            .00113   .00113
  1st, 2nd, 3rd     .00280   .00280   .00280   .00280
  2nd only                   .01001            .01001
  2nd, 3rd                   .00614   .00614   .00614
  3rd only                            .01250   .01250
  Any pattern       .02234   .02216   .02257   .05099
  Incremental error .02234   .01615   .01250
  Cumulative error  .02234   .03849   .05099
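The incremental and cumulative error rows can be derived from the pattern rows: a trial spends its type I error at the first analysis at which it is significant. A minimal sketch using the simulated Pocock proportions from the table (the data structure and function name are illustrative):

```python
# Proportion of simulated null trials with each pattern of significant analyses.
patterns = {
    (1,): 0.01520, (1, 2): 0.00321, (1, 3): 0.00113, (1, 2, 3): 0.00280,
    (2,): 0.01001, (2, 3): 0.00614, (3,): 0.01250,
}


def error_spent(patterns, n_analyses):
    """Return (incremental, cumulative) type I error spent at each analysis."""
    incremental = [0.0] * n_analyses
    for pattern, prob in patterns.items():
        # The trial would have stopped at its first significant analysis.
        incremental[min(pattern) - 1] += prob
    cumulative, total = [], 0.0
    for inc in incremental:
        total += inc
        cumulative.append(total)
    return incremental, cumulative


incremental, cumulative = error_spent(patterns, 3)
print([round(x, 5) for x in incremental])  # [0.02234, 0.01615, 0.0125]
print([round(x, 5) for x in cumulative])   # [0.02234, 0.03849, 0.05099]
```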
|
|
42
|
|
|
43
|
- Basic Strategy
- Find stopping boundaries at each analysis such that desired operating
characteristics (e.g., type I and type II statistical errors) are
attained
|
|
44
|
- Issues
- Conditions under which the trial might be stopped early
- When to perform analyses
- Test statistic to use
- Relative position of boundaries at successive analyses
- Desired operating characteristics
|
|
45
|
- Choice of Test Statistic
- Let Tn(X1, ..., Xn) be any test statistic such that Tn tends to be large for larger values of θ
- (Later we will consider possible choices for Tn)
|
|
46
|
- Conditions for Early Stopping: One-sided tests
- Test of a greater alternative (θ+ > θ0)
- Null: H0: θ ≤ θ0
- Alternative: H1: θ ≥ θ+
- Possibilities for early stopping:
- Stop only for the null (when Tn small)
- Stop only for the alternative (when Tn large)
- Stop either for the null or for the alternative
|
|
47
|
- Conditions for Early Stopping: One-sided tests
- Test of a lesser alternative (θ- < θ0)
- Null: H0: θ ≥ θ0
- Alternative: H1: θ ≤ θ-
- Possibilities for early stopping:
- Stop only for the null (when Tn large)
- Stop only for the alternative (when Tn small)
- Stop either for the null or for the alternative
|
|
48
|
- One-sided Test Boundaries: Sample Mean Statistic
|
|
49
|
- Conditions for Early Stopping: Two-sided tests
- Test of a two-sided alternative (θ+ > θ0 > θ-)
- Upper Alternative: H+: θ ≥ θ+
- Null: H0: θ = θ0
- Lower Alternative: H-: θ ≤ θ-
- Possibilities for early stopping:
- Stop only for the null (when Tn intermediate)
- Stop only for the alternative (when Tn small or large)
- Stop either for the null or for the alternative
|
|
50
|
- Two-sided Test Boundaries: Sample Mean Statistic
|
|
51
|
- General stopping rule
- Maximum of four boundaries
- ‘d’ boundary: upper outer boundary
- ‘c’ boundary: upper inner boundary
- ‘b’ boundary: lower inner boundary
- ‘a’ boundary: lower outer boundary
- Early stopping
- Tn greater than ‘d’ boundary
- Tn between ‘b’ and ‘c’ boundaries
- Tn less than ‘a’ boundary
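A sketch of the four-boundary rule at a single analysis. The handling of ties is a choice made here, not something the slide specifies: the outer boundaries are treated as inclusive and the inner interval as open, so that equal ‘b’ and ‘c’ boundaries give no stopping for intermediate values, while a = b and c = d forces a decision. Infinite boundaries disable stopping on that side.

```python
from math import inf


def stopping_decision(t, a, b, c, d):
    """Classify the statistic t against the a <= b <= c <= d boundaries."""
    if not (a <= b <= c <= d):
        raise ValueError("boundaries must satisfy a <= b <= c <= d")
    if t >= d:
        return "stop: above d"
    if t <= a:
        return "stop: below a"
    if b < t < c:
        return "stop: between b and c"
    return "continue"


# Hypothetical interim analysis with no lower boundary and b = c.
print(stopping_decision(3.0, -inf, 0.0, 0.0, 2.5))  # stop: above d
print(stopping_decision(1.0, -inf, 0.0, 0.0, 2.5))  # continue
# Hypothetical final analysis with a = b and c = d: every value stops.
print(stopping_decision(0.3, -1.0, -1.0, 1.0, 1.0))  # stop: between b and c
```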
|
|
52
|
- One-sided tests of greater hypotheses
- The ‘b’ and ‘c’ boundaries are always equal
- so no early stopping for intermediate Tn
- Early stopping
- If ‘a’ boundary at -∞: no early stopping for null
- If ‘d’ boundary at ∞: no early stopping for alternative
|
|
53
|
- One-sided Test Boundaries: Sample Mean Statistic
|
|
54
|
- One-sided tests of lesser hypotheses
- The ‘b’ and ‘c’ boundaries are always equal
- so no early stopping for intermediate Tn
- Early stopping
- If ‘a’ boundary at -∞: no early stopping for alternative
- If ‘d’ boundary at ∞: no early stopping for null
|
|
55
|
- One-sided Test Boundaries: Sample Mean Statistic
|
|
56
|
- Two-sided tests
- Early stopping
- If ‘a’ boundary at -∞: no early stopping for lower alternative
- If ‘b’ and ‘c’ boundaries equal: no early stopping for null
- If ‘d’ boundary at ∞: no early stopping for upper alternative
|
|
57
|
- Two-sided Test Boundaries: Sample Mean Statistic
|
|
58
|
- Representation of two-sided hypothesis tests
- Two-sided tests take on appearance of two superposed hypothesis tests
- Lower test
- H0-: θ ≥ θ0- versus H-: θ ≤ θ-
- Upper test
- H0+: θ ≤ θ0+ versus H+: θ ≥ θ+
- Classic two-sided test:
|
|
59
|
- Generalization of hypothesis tests
- Require only θ- ≤ θ0+ ≤ θ0- ≤ θ+
- Correspondence between hypotheses and boundaries
- ‘a’ boundary rejects H0-: θ ≥ θ0-
- ‘b’ boundary rejects H-: θ ≤ θ-
- ‘c’ boundary rejects H+: θ ≥ θ+
- ‘d’ boundary rejects H0+: θ ≤ θ0+
|
|
60
|
- Correspondence to classical tests of H0: θ = θ0
- One-sided tests of greater alternative (upper and lower tests
coincident)
- θ0 = θ0+ < θ+ (define θ- = θ0+ and θ0- = θ+)
- One-sided tests of lesser alternative (upper and lower tests
coincident)
- θ- < θ0- = θ0 (define θ0+ = θ- and θ+ = θ0-)
- Two-sided tests
- θ- < θ0- = θ0 = θ0+ < θ+ (with θ- = −θ+)
|
|
61
|
- Parameterize hypotheses by shift parameters εL, εU
- 0 ≤ εL ≤ 1 is shift of θ0- away from θ+ toward θ0
- θ0- = θ+ − εL × (θ+ − θ0)
- 0 ≤ εU ≤ 1 is shift of θ0+ away from θ- toward θ0
- θ0+ = θ- + εU × (θ0 − θ-)
- Constraint: 1 ≤ εL + εU ≤ 2
- Test can be thought of as (εL + εU)-sided
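The two shift formulas can be checked numerically. The sketch below simply evaluates them and enforces the stated constraints; the function name and example values are illustrative.

```python
def shifted_nulls(theta_minus, theta_0, theta_plus, eps_l, eps_u):
    """Return (theta_0_minus, theta_0_plus), the nulls of the lower and upper tests."""
    if not (0 <= eps_l <= 1 and 0 <= eps_u <= 1):
        raise ValueError("each shift parameter must lie in [0, 1]")
    if not (1 <= eps_l + eps_u <= 2):
        raise ValueError("shift parameters must satisfy 1 <= eps_l + eps_u <= 2")
    theta_0_minus = theta_plus - eps_l * (theta_plus - theta_0)
    theta_0_plus = theta_minus + eps_u * (theta_0 - theta_minus)
    return theta_0_minus, theta_0_plus


# Both shifts equal to 1: both nulls sit at theta_0 (a "2-sided" test).
print(shifted_nulls(-1, 0, 1, 1, 1))  # (0, 0)
# eps_l = 0, eps_u = 1: the lower null sits at theta_plus, the upper null at theta_0.
print(shifted_nulls(-1, 0, 1, 0, 1))  # (1, 0)
```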
|
|
62
|
- Parameterization special cases
- One-sided test of greater alternative:
- One-sided test of lesser alternative:
- One-sided equivalence (noninferiority) test:
|
|
63
|
- Number and timing of analyses
- N counts the sampling units accrued to the study
- Up to J analyses of the data to be performed
- Analyses performed after accruing sample sizes of N1 < N2 < ⋯ < NJ
- (More generally, N measures statistical information)
|
|
64
|
- Boundaries at the analyses
- aj ≤ bj ≤ cj ≤ dj are the ‘a’, ‘b’, ‘c’, and ‘d’ boundaries at the j-th analysis (when Nj observations have accrued)
- At the final (J-th) analysis aJ = bJ and cJ = dJ to guarantee stopping
|
|
65
|
- Boundary shape functions
- Pj measures the proportion of information accrued at the j-th analysis
- Boundary shape function f(Pj) is a monotonic function that describes how
the boundaries at successive analyses depend on the information accrued to
the study at each analysis
|
|
66
|
- Formulation of stopping boundaries
- At the j-th analysis
- aj is determined by θa = θ0- and fa(Pj)
- bj is determined by θb = θ- and fb(Pj)
- cj is determined by θc = θ+ and fc(Pj)
- dj is determined by θd = θ0+ and fd(Pj)
|
|
67
|
- Parameterization of boundary shape functions
- Distinct parameters possible for each boundary
- Parameters A*, P*, R* typically chosen by user
- Critical value G* usually calculated from search
|
|
68
|
|
|
69
|
- Choices for test statistic Tn
- Sum of observations
- Point estimate of treatment effect
- Normalized (Z) statistic
- Fixed sample P value
- Error spending function
- Conditional probability
- Predictive probability
- Bayesian posterior probability
|
|
70
|
- Choices for test statistic Tn
- All of those choices for test statistics can be shown to be
transformations of each other
- Hence, a stopping rule for one test statistic is easily transformed to
a stopping rule for a different test statistic
- We regard these statistics as representing different scales for
expressing the boundaries
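For the simplest boundary scales the transformations are one-liners. The sketch below assumes the one sample normal setting with known standard deviation and an upper one-sided fixed sample P value. It covers only four of the scales listed, and the names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def to_scales(partial_sum, n, mu0, sigma):
    """Express one interim result on several boundary scales."""
    mean = partial_sum / n
    z = sqrt(n) * (mean - mu0) / sigma
    p_value = 1 - NormalDist().cdf(z)  # upper one-sided fixed sample P value
    return {"partial_sum": partial_sum, "sample_mean": mean, "z": z, "p_value": p_value}


def mean_from_z(z, n, mu0, sigma):
    """Map a Z scale boundary back to the sample mean scale."""
    return mu0 + z * sigma / sqrt(n)


scales = to_scales(partial_sum=30.0, n=100, mu0=0.0, sigma=1.0)
print(scales)
print(mean_from_z(scales["z"], 100, 0.0, 1.0))
```

Because each map is monotone for fixed n, a boundary stated on one scale picks out exactly the same set of outcomes when restated on another.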
|
|
71
|
- One sample inference about means
- Generalizable to most other commonly used models
|
|
72
|
- Partial Sum Scale:
- Uses:
- Cumulative number of events
- Convenient when computing density
|
|
73
|
- Sample Mean Scale:
- Uses:
- Natural estimate of treatment effect
|
|
74
|
- Normalized Statistic Scale:
- Uses:
- Commonly computed in analysis routines
|
|
75
|
- Fixed Sample P value Scale:
- Uses:
- Commonly computed in analysis routine
- Robust to use with other distributions for estimates of treatment
effect
|
|
76
|
- Bayesian Posterior Scale:
- Uses:
- Bayesian inference (unaffected by stopping)
- Posterior probability of hypotheses
|
|
77
|
- Conditional Probability Scale:
- Threshold at final analysis
- Hypothesized value of mean
- Uses:
- Conditional power
- Futility of continuing under specific hypothesis
|
|
78
|
- Conditional Probability (estimate) Scale:
- Threshold at final analysis
- Uses:
- Futility of continuing using best estimate
|
|
79
|
- Predictive Probability Scale:
- Uses:
- Futility of continuing study
|
|
80
|
- Predictive Probability Scale:
- Uses:
- Futility of continuing study
|
|
81
|
- Error Spending (outer lower boundary) Scale:
- Uses:
- Implementation of stopping rules
with flexible determination of number and timing of analyses
|
|
82
|
- Error Spending (inner lower boundary) Scale:
- Uses:
- Implementation of stopping rules with flexible determination of number
and timing of analyses
|
|
83
|
- Error Spending (inner upper boundary) Scale:
- Uses:
- Implementation of stopping rules
with flexible determination of number and timing of analyses
|
|
84
|
- Error Spending (outer upper boundary) Scale:
- Uses:
- Implementation of stopping rules
with flexible determination of number and timing of analyses
|
|
85
|
- Use in evaluating designs
- Several of the boundary scales have interpretations that are useful in
evaluating the operating characteristics of a design
- Sample Mean Scale
- Conditional Probability Futility Scales
- Predictive Probability Futility Scale
- Bayesian Posterior Probability Scale
- (Error Spending Scale)
|
|
86
|
|
|
87
|
- Unifying parameterization for the most commonly used group sequential
designs (Kittelson & Emerson, 1999)
- Rich parameterization facilitates search for stopping rule appropriate
for specific applications
- Inclusion of broad spectrum of designs means that comparisons within
this family will consider full range of possible designs
- (Default family in S+SeqTrial)
|
|
88
|
- Stopping Boundaries for Sample Mean Statistic:
- aj = μa − fa(Pj)
- bj = μb + fb(Pj)
- cj = μc − fc(Pj)
- dj = μd + fd(Pj)
|
|
89
|
- Parameterization of boundary shape functions
- Distinct parameters possible for each boundary
- Parameters A*, P*, R* typically chosen by user
- Critical value G* usually calculated from search
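The explicit form of the boundary shape function did not survive in this text. The sketch below assumes the form given by Kittelson & Emerson (1999), f(Π) = (A + Π^(−P) (1 − Π)^R) G, with Π the proportion of information, and combines it with the four sample mean boundary relations. The G values used in the example are arbitrary placeholders, not the result of a search, so the output is not a calibrated design.

```python
def boundary_shape(pi, A, P, R, G):
    """Assumed unified family shape function at information fraction pi in (0, 1]."""
    if not 0 < pi <= 1:
        raise ValueError("pi must lie in (0, 1]")
    return (A + pi ** (-P) * (1 - pi) ** R) * G


def sample_mean_boundaries(pi, centers, shapes):
    """Boundaries a, b, c, d on the sample mean scale.

    centers : dict mapping 'a', 'b', 'c', 'd' to the boundary's hypothesis value
    shapes  : dict mapping 'a', 'b', 'c', 'd' to a tuple (A, P, R, G)
    """
    f = {k: boundary_shape(pi, *shapes[k]) for k in "abcd"}
    return {
        "a": centers["a"] - f["a"],
        "b": centers["b"] + f["b"],
        "c": centers["c"] - f["c"],
        "d": centers["d"] + f["d"],
    }


# Placeholder parameters: A = R = 0 for every boundary, P = 1, arbitrary G.
centers = {"a": 0.0, "b": -0.5, "c": 0.5, "d": 0.0}
shapes = {k: (0.0, 1.0, 0.0, 0.5) for k in "abcd"}
for pi in (0.25, 0.5, 1.0):
    print(pi, sample_mean_boundaries(pi, centers, shapes))
```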
|
|
90
|
- Choice of P parameter
- P ≥ 0:
- Larger positive values of P
make early stopping more difficult (impossible when P is infinite)
- When A=R=0, 0.5 < P < 1 corresponds to power family parameter (Δ) in Wang & Tsiatis (1987):
P = 1 − Δ
- Reasonable range of values: 0 < P < 2.5
- P=0 with A=R=0 possible for some (not all) boundaries, but not
particularly useful
|
|
91
|
- Effect of varying P>0 (when A=0, R=0)
- Higher P leads to early conservatism
- P > 0 has infinite boundaries when N=0
|
|
92
|
- Choice of P parameter
- P < 0:
- Must have R = 0 and (typically)
A < 0
- More negative values of P make early stopping more difficult
|
|
93
|
- Effect of varying P<0 (when A=2, R=0)
- More negative P leads to early conservatism
- P < 0 has finite boundaries when N=0
|
|
94
|
- Choice of R parameter
- R > 0:
- Larger positive values of R make early stopping easier
- When R>0 and P=0, typically need A>0
- Reasonable range of values: 0.1 < R < 20
- R < 1 is convex outward
- R > 1 is convex inward
- When R>0 and P>0, can get change in convexity of boundaries
|
|
95
|
- Effect of varying R (when A=1, P=0)
- R < 1 leads to convex outward
- R > 1 leads to convex inward
|
|
96
|
- Effect of varying R (when A=1, P=0.5)
- With P > 0, boundaries infinite when N=0
- R < 1 and P > 0 has change in convexity
|
|
97
|
- Choice of A parameter
- Lower absolute values of A make it harder to stop at early analyses
- Valid choices of A depend upon choices of P and R
- Useful ranges for A
- P ≥ 0, R ≥ 0: 0.2 ≤ A ≤ 15
- P ≤ 0, R = 0: −15 ≤ A ≤ −1.25
|
|
98
|
- Effect of varying A (when P=0, R=1.2)
- Values of A closer to 0 make it harder to stop early
- Higher absolute value of A makes flatter boundaries
|
|
99
|
- Parameterization of boundary shape function includes many previously
described approaches
- Wang & Tsiatis Boundary Shape Functions:
- A* = 0, R* = 0, P* > 0
- P* measures early conservatism
- P* = 0.5 Pocock (1977)
- P* = 1.0 O’Brien-Fleming (1979)
- (P* = ∞ precludes early stopping)
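Why P* = 0.5 and P* = 1.0 give the familiar Pocock and O'Brien-Fleming shapes is easiest to see after converting to the Z statistic scale. Assuming A* = R* = 0, so that the sample mean boundary is proportional to Π^(−P*), and noting that the standard error of the sample mean is proportional to Π^(−1/2), the Z scale critical value is proportional to Π^(0.5 − P*). The sketch below evaluates that relation. `z_final` is an arbitrary illustrative value.

```python
def z_scale_boundary(pi, P, z_final):
    """Z scale critical value at information fraction pi for a boundary with A = R = 0.

    z_final is the critical value at the final analysis (pi = 1).
    """
    return z_final * pi ** (0.5 - P)


for label, P in (("P = 0.5", 0.5), ("P = 1.0", 1.0)):
    print(label, [z_scale_boundary(pi, P, 2.0) for pi in (0.25, 0.5, 1.0)])
```

With P = 0.5 the Z critical value is constant across analyses (Pocock), and with P = 1.0 it is inversely proportional to the square root of the information fraction (O'Brien-Fleming), which is the sense in which P measures early conservatism.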
|
|
100
|
- Parameterization of boundary shape function includes many previously
described approaches
- Triangular Test Boundary Shape Functions (Whitehead)
- Sequential Conditional Probability Ratio Test (Xiong):
|
|
101
|
- Parameterization of hypothesis shifts and boundary shape function
unifies what were discrete families
- Triangular tests vs Wang and Tsiatis based families
- One-sided vs two-sided tests
- Early stopping under one hypothesis vs both hypotheses
|
|
102
|
- Spectrum of designs
- εL increases across rows
- Pa and/or Pc increases down columns
|
|
103
|
- Operating characteristics
- User specifies size αU, αL of upper and lower tests
- User specifies power βU, βL of upper and lower tests
- Computer search for Ga, Gb, Gc, Gd that attains those operating characteristics
- (Sample size can be computed using some other power besides βU, βL)
|
|
104
|
|
|
105
|
- Lan and DeMets (1983) approach
- At each analysis, some of the type I error is ‘used up’
- Describe a stopping rule according to the proportion of αU, αL used at each analysis
- General case: alpha used by the j-th analysis determined by some
function of the proportion of maximal information available
|
|
106
|
- Lan and DeMets (1983) approach (cont.)
- Lan and DeMets (1983) describe error spending functions comparable to
O’Brien-Fleming or Pocock designs
|
|
107
|
- Lan and DeMets (1983) approach (cont.)
- Lan and DeMets (1983) describe error spending functions comparable to
O’Brien-Fleming or Pocock designs for specific type I errors
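For reference, the two spending functions usually attributed to Lan and DeMets (1983) are α*(t) = 2 − 2Φ(z_{α/2} / √t) for the O'Brien-Fleming type and α*(t) = α ln(1 + (e − 1)t) for the Pocock type, where t is the proportion of maximal information. The formulas are quoted from the standard literature because they do not appear in this text. The sketch evaluates them and compares the proportion of α spent halfway through the trial at two type I error levels.

```python
from math import e, log, sqrt
from statistics import NormalDist


def obf_type_spending(t, alpha):
    """O'Brien-Fleming type cumulative type I error spent at information fraction t."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / sqrt(t)))


def pocock_type_spending(t, alpha):
    """Pocock type cumulative type I error spent at information fraction t."""
    return alpha * log(1 + (e - 1) * t)


for alpha in (0.05, 0.01):
    print(alpha,
          obf_type_spending(0.5, alpha) / alpha,
          pocock_type_spending(0.5, alpha) / alpha)
```

The O'Brien-Fleming type proportion changes with α, while the Pocock type proportion does not, so a spending function tuned to resemble a design at one type I error level need not resemble it at another.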
|
|
108
|
- Lan and DeMets (1983) approach (cont.)
- More recently authors have focussed on error spending functions of the
form
- (Kim and DeMets, 1987; Jennison and Turnbull, 1989; Hwang, Shih, and
DeCani, 1990)
|
|
109
|
- Lan and DeMets (1983) approach (cont.)
- Kim and DeMets (1987) and Jennison and Turnbull (1989) consider an
error spending family corresponding to
- Useful special cases identified by those authors:
- P = 1 is similar to Pocock (1977)
- P = 3 is similar to O’Brien and Fleming (1979)
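The formula for this family is missing from the text. The sketch assumes the power family commonly associated with Kim and DeMets (1987), α*(t) = α t^P, which is consistent with the special cases P = 1 and P = 3 listed above. Names are illustrative.

```python
def power_family_spending(t, alpha, P):
    """Assumed power family: cumulative error alpha * t**P at information fraction t."""
    if not 0 <= t <= 1:
        raise ValueError("t must lie in [0, 1]")
    return alpha * t ** P


fractions = (1 / 3, 2 / 3, 1.0)
for P in (1, 3):
    print(P, [round(power_family_spending(t, 0.05, P), 5) for t in fractions])
```

With P = 1 the error is spent evenly over the course of the trial, while P = 3 saves most of it for the later analyses.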
|
|
110
|
- Pampallona, Tsiatis, and Kim (1995) extension
- Defines type II error spending functions
- At each analysis, recompute maximal sample size which will maintain
planned level of significance and power
|
|
111
|
- Implementation of an Error Spending Family
- Define stopping rule on error spending function scale by defining Eaj,
Ebj, Ecj, Edj
- Use framework of superposed one-sided hypothesis tests described by
Kittelson and Emerson (1999) to define relationships among hypotheses
rejected by each of the four possible stopping boundaries
|
|
112
|
- Correspondence with type I and II error spending
- For user specified size αU, αL of upper and lower tests and power βU, βL of
upper and lower tests, error spent at the j-th analysis specified as:
|
|
113
|
- Boundary shape functions
- Boundary shape function can be defined separately for each of the four
boundaries
|
|
114
|
- Constraints on parameters
- f(0) = 0 and f(1) = 1
- If P < 0
- If R > 0
- If P = 0 and R = 0, no early stopping
|
|
115
|
- Computer search for stopping boundaries
- Error spending family defines Eaj, Ebj,
Ecj, Edj
- Appendix of Kittelson and Emerson (1999) describes general algorithm
for finding design when hypotheses known
- At design stage, must search for standardized hypotheses that result in
a valid design, and then compute sample size to map standardized design
to specified alternative hypotheses.
|
|
116
|
- Computer search for stopping boundaries (cont.)
- In order to more easily obtain more efficient designs, when designing a
study using error spending functions, the specified type II error
spending functions are only used as upper bounds on the true type II
error spending function.
|
|
117
|
|
|
118
|
- General comments
- Families also defined for other boundary scales
- Partial sum and Z statistic scale families implemented in S+SeqTrial
- Bayesian and Futility scale families under construction
- If stopping rules are carefully evaluated, it does not matter too much
which scale (and therefore family) is used to derive the stopping rule.
|
|
119
|
- General comments (cont.)
- The best design family to use will be the one which allows a user to
most quickly find a stopping rule having desirable operating
characteristics
- The ease of use will therefore depend in part on
- Interpretability of boundary scale
- Interpretability of parameters
|
|
120
|
- General comments (cont.)
- My view:
- Sample mean scale (unified family) has easier scientific
interpretation than the error spending scale which has a purely
statistical interpretation that, in my experience, is poorly
understood by both users and researchers
- The parameterization of the unified family produces a more useful
grouping of designs on some level than does the parameterization of
the error spending family
|
|
121
|
- ASSERTION: Interpretability of boundary scales
- The concept of an error spending scale is less relevant to clinical
researchers
- Type I error reflects only statistical evidence
- May conflict with scientific importance
- Underpowered studies: Failure to reject the null in the face of large
estimates of treatment effect
- Overpowered studies: Rejection of the null hypothesis when
differences are scientifically unimportant
|
|
122
|
- ASSERTION: Interpretability of boundary scales
- The formulation of error spending scales is not well understood by the
researchers developing such methods
- Lan & DeMets (1983), Kim & DeMets (1987), Jennison &
Turnbull (1989 and 2000) all describe error spending functions which
mimic O’Brien-Fleming (1979) or Pocock (1977) group sequential designs
- In fact, for different levels of type I (or type II) error, the error
spending functions are different within those families of designs
|
|
123
|
- Error spent at each analysis for O’Brien and Fleming (1979) designs depends on Type I or Type II
errors
|
|
124
|
- Error spent at each analysis for Pocock (1977) designs depends on Type I
or Type II errors
|
|
125
|
- Is there a problem?
- Parameterization of stopping rule families induces a grouping of
designs:
- Unified family: Pocock (1977) designs, O’Brien-Fleming (1979) designs,
Triangular designs (Whitehead & Stratton, 1983)
- Error spending families: All designs that spend the same proportion of
type I or II error at each analysis
|
|
126
|
- Is there a problem? (cont.)
- Best parameterization might be defined according to whether such
groupings correspond to similar operating characteristics
- efficiency
- Bayesian properties
- futility properties
- others
|
|
127
|
- Efficiency
- Consider ability of choice of boundary shape parameter to predict
efficiency of design
- No uniformly most powerful design
- Efficiency measured in terms of smallest average sample size for
specific hypothesis
- Measure alternative hypothesis according to the power of the test to
detect it
|
|
128
|
- Methods for comparison
- Find optimal designs in terms of average sample size (ASN) within
family of Wang and Tsiatis (1987) boundary shape functions for
one-sided symmetric designs (Emerson and Fleming, 1989)
- Family found to be approximately optimal
- Find optimal designs for various choices of type I error and
statistical power
|
|
129
|
- Methods for comparison (cont.)
- For each optimal design, examine the boundary shape function on
- Sample mean scale
- Error spending scale
- Futility scales
|
|
130
|
- Criteria for “good” parameterizations
- If the boundary shape function on a given scale is not independent of
choice of type I and II errors, then that would argue that grouping of
designs according to parameterization of that scale will not correspond
to similar efficiency properties
- As it is unlikely that boundary shape parameters for efficient designs
will be constant across all choices of type I and type II errors, we
can also compare the degree that boundary shape parameters change for
each boundary scale
|
|
131
|
- Proportion of error spent at each analysis for approximately efficient
designs
- Power varies across panels
- Type I error varies across lines within each panel
|
|
132
|
- Conditional power (using MLE) at the boundary for each analysis for
approximately efficient designs
- Power varies across panels
- Type I error varies across lines within each panel
|
|
133
|
- Comparison of optimal unified family P parameter as a function of type I
errors
- Compared to best fitting P or R parameter in error spending family
|
|
134
|
- Search for stopping rule is generally iterative
- An initial design is specified
- Operating characteristics are examined
- Modifications are made to the design
- Availability of tools for evaluation of operating characteristics
lessens impact of family used to define a stopping rule
- Appropriate designs can be found from almost any starting point
|
|
135
|
- To the extent that parameterization of sample mean family predicts
efficiency behavior, use of that family may allow more intuitive search
for suitable stopping rules
- However, efficiency is not always of paramount concern
|
|
136
|
- Interpretation of unified family boundaries as estimate of treatment
effect is meaningful to clinical researcher
- Error spending functions are less interpretable, and thus seem less
useful when designing a clinical trial or evaluating its operating
characteristics
- However, error spending scale can be useful in implementing a stopping
rule
|
|
137
|
- It is not clear that conditional probabilities are particularly useful
in the definition of a stopping rule
- Design family does not have a particularly intuitive parameterization
- Unconditional power considerations would seem more straightforward
|
|
138
|
|
|
139
|
- Process of choosing a trial design
- Define candidate design
- Evaluate operating characteristics
- Modify design
- Iterate
|
|
140
|
- Operating characteristics for fixed sample studies
- Level of Significance (often pre-specified)
- Sample size requirements
- Power Curve
- Decision Boundary
- Frequentist inference on the Boundary
- Bayesian posterior probabilities
|
|
141
|
- Additional operating characteristics for group sequential studies
- Probability distribution for sample size
- Stopping probabilities
- Boundaries at each analysis
- Frequentist inference at each analysis
- Bayesian inference at each analysis
- Futility measures at each analysis
|
|
142
|
- Sample size requirements
- Number of subjects needed is a random variable
- Quantify summary measures of sample size distribution
- maximum (feasibility of accrual)
- mean (average sample number, ASN)
- median, quartiles
- (Particularly consider tradeoffs between power and sample size
distribution)
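Given the probability of stopping at each analysis under some hypothesized treatment effect, the summary measures listed above follow directly. A minimal sketch with made-up stopping probabilities (the function name and numbers are illustrative):

```python
def sample_size_summary(sizes, stop_probs):
    """Summarize the sample size distribution of a group sequential design.

    sizes      : increasing sample sizes N1 < ... < NJ at the analyses
    stop_probs : probability of stopping at each analysis (must sum to 1)
    """
    if abs(sum(stop_probs) - 1.0) > 1e-9:
        raise ValueError("stopping probabilities must sum to 1")

    def quantile(q):
        cumulative = 0.0
        for n, p in zip(sizes, stop_probs):
            cumulative += p
            if cumulative >= q - 1e-12:
                return n
        return sizes[-1]

    return {
        "max": max(sizes),
        "asn": sum(n * p for n, p in zip(sizes, stop_probs)),
        "q25": quantile(0.25),
        "median": quantile(0.50),
        "q75": quantile(0.75),
    }


print(sample_size_summary([100, 200, 300], [0.2, 0.5, 0.3]))
```

Since the stopping probabilities depend on the true treatment effect, these summaries are functions of the alternative and are best examined across a range of alternatives.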
|
|
143
|
- Stopping probabilities
- Consider probability of stopping at each analysis for arbitrary
alternatives
- Consider probability of each decision (for null or alternative) at each
analysis
|
|
144
|
- Power curve
- Probability of rejecting null for arbitrary alternatives
- Power under null: level of significance
- Power for specified alternative
- Alternative rejected by design
- Alternative for which study has high power
- S+SeqTrial defines
- Power curves for upper and lower boundaries
- Alternatives having specified power for each boundary
|
|
145
|
- Decision boundary at each analysis
- Value of test statistic leading to rejection of null
- Variety of boundary scales possible
- Often has meaning for applied researchers (especially on scale of
estimated treatment effect)
- Estimated treatment effects may be viewed as unacceptable for ethical
reasons based on prior notions
- Estimated treatment effect may be of
little interest due to lack of clinical importance or futility
of marketing
|
|
146
|
- Frequentist inference on the boundary at each analysis
- Consider P values, confidence intervals when observation corresponds to
decision boundary at each analysis
- Ensure desirable precision for negative studies
- Confidence interval identifies hypotheses not rejected by analysis
- Have all scientifically meaningful hypotheses been rejected?
|
|
147
|
- Bayesian posterior probabilities at each analysis
- Examine the degree to which the frequentist inference leads to sensible
decisions under a range of prior distributions for the treatment effect
- Posterior probability of hypotheses
- Bayesian estimates of treatment effect
- Median (mode) of posterior distribution
- Credible interval (quantiles of posterior distribution)
|
|
148
|
- Futility measures
- Consider the probability that a different decision would result if
trial continued
- Can be based on particular hypotheses, current best estimate, or
predictive probabilities
- (Perhaps best measure of futility is whether the stopping rule has
changed the power curve substantially)
|
|
149
|
- Evaluation of Designs
- Forms of output from S+SeqTrial
- Printed output in report window or command line window
- Plots
- Named seqDesign object
|
|
150
|
- Evaluation of Designs (cont.)
- Sample size requirements
- Printed with boundaries
- X axis with plots of boundaries
- Plots of average sample size, quantiles of sample size distribution
- Stopping probabilities
- Printed with operating characteristics
- Plots with color coded decisions
|
|
151
|
- Evaluation of Designs (cont.)
- Power Curve
- Hypotheses, size, power printed with boundaries
- Tabled power with summaries
- Plots of power curve
- Plots versus reference power curve
|
|
152
|
- Evaluation of Designs (cont.)
- Decision Boundary
- Printed on specified boundary scale
- Plots
- Frequentist inference on the boundary
- Printed with summaries
- Plots
|
|
153
|
- Evaluation of Designs (cont.)
- Bayesian inference
- Posterior probabilities implemented as a boundary scale
- Median (mode) of posterior distribution
- Credible intervals
- Futility measures
- Implemented as boundary scale
- Conditional and predictive approaches
|