Q&A - PPV, NPV, Sens, Spec with censored data
QUESTION:
How can you estimate sensitivity, specificity, positive predictive value,
negative predictive value, and prevalence with censored data?
ANSWER:
First, let's pretend I had complete data (no censoring, no missing data) on the
random variables CRP and T0 measuring C-reactive protein levels and time to
death, respectively.
We will use the event CRP > c as a "positive test".
We will use T0 < k as "disease".
Now, with cross-sectional sampling (i.e., I pre-specified neither the number
of subjects with disease in my sample nor the number of subjects with a
positive test in my sample), I can DIRECTLY estimate all of the following:
Prevalence of disease is Pr(T0 < k). With no censoring, I would just count up
the number of people who had T0 < k and divide by the number of people.
Prev = #(T0 < k) / #(in sample)
Sensitivity of test is Pr(CRP > c | T0 < k). With no censoring, I would
consider only the people who had T0 < k. Then of those people, I would count
how many had CRP > c, and divide by the total number of people with T0 < k. So
Sens = #(CRP > c and T0 < k) / #(T0 < k)
Similarly, specificity of test is Pr(CRP < c | T0 > k). With no censoring, I
would consider only the people who had T0 > k. Then of those people, I would
count how many had CRP < c, and divide by the total number of people with T0 >
k. So
Spec = #(CRP < c and T0 > k) / #(T0 > k)
Predictive value of positive is Pr(T0 < k | CRP > c). With no censoring, I
would consider only the people who had CRP > c. Then of those people, I would
count how many had T0 < k, and divide by the total number of people with CRP >
c. So
PVP = #(CRP > c and T0 < k) / #(CRP > c)
Predictive value of negative is Pr(T0 > k | CRP < c). With no censoring, I
would consider only the people who had CRP < c. Then of those people, I would
count how many had T0 > k, and divide by the total number of people with CRP <
c. So
PVN = #(CRP < c and T0 > k) / #(CRP < c)
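The counting above can be sketched in a few lines of Python. This is only an illustration: the function name and the toy numbers are made up, and I am assuming that ties (CRP exactly equal to c, or T0 exactly equal to k) count as "negative" and "healthy", which the definitions above do not address.

```python
def complete_data_estimates(crp, t0, c, k):
    """Counting estimates with complete data and cross-sectional sampling."""
    pos = [x > c for x in crp]   # positive test: CRP > c
    dis = [t < k for t in t0]    # disease: T0 < k
    n = len(crp)
    n_pos = sum(pos)
    n_dis = sum(dis)
    # Positive test and disease (numerator of Sens and PVP)
    tp = sum(p and d for p, d in zip(pos, dis))
    # Negative test and healthy (numerator of Spec and PVN)
    tn = sum((not p) and (not d) for p, d in zip(pos, dis))
    return {
        "prev": n_dis / n,
        "sens": tp / n_dis,
        "spec": tn / (n - n_dis),
        "pvp": tp / n_pos,
        "pvn": tn / (n - n_pos),
    }


# Hypothetical toy data: 8 subjects, no censoring, no missing values
crp = [1.0, 5.0, 7.0, 2.0, 8.0, 3.0, 9.0, 0.5]
t0 = [10.0, 2.0, 3.0, 8.0, 12.0, 1.0, 4.0, 9.0]
est = complete_data_estimates(crp, t0, c=4.0, k=5.0)
```

Every estimate here is a ratio of counts, which is exactly what stops working once some of the T0 values are censored.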
Okay. So far so good. Now what happens if we have censored data for T0? What
can we estimate directly in this case?
-- We estimate Pr(T0 > k) DIRECTLY using the KM curve. (We can estimate the
prevalence as 1 - Pr(T0 > k).)
-- We estimate Pr(T0 > k | CRP > c) DIRECTLY by restricting our data analysis
to people with CRP > c, and then using a KM curve just in that subset. Doing
the same in the subset with CRP < c gives Pr(T0 > k | CRP < c). (We can
estimate PVP = 1 - Pr(T0 > k | CRP > c) and PVN = Pr(T0 > k | CRP < c).)
-- We cannot estimate Pr(CRP > c | T0 > k) DIRECTLY if there are some subjects
censored before time k, because we do not have any way of restricting our
analysis to the people who have T0 > k. (Some of the subjects censored before
time k might in truth have died before time k.) The same problem arises when
conditioning on T0 < k. (We cannot directly estimate sensitivity and
specificity.)
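Here is a minimal sketch of the quantities we CAN get directly, using a hand-rolled Kaplan-Meier estimator so that nothing beyond plain Python is needed. The function names and toy data are hypothetical. I am also assuming that a death exactly at time k counts as disease, that CRP exactly equal to c counts as negative, and that follow-up extends past k in each subset.

```python
def km_survival(times, events, t):
    """Kaplan-Meier estimate of Pr(T0 > t).

    times:  follow-up time for each subject (death or censoring time)
    events: 1 if the death was observed at that time, 0 if censored
    """
    surv = 1.0
    # Distinct observed death times up to and including t
    death_times = sorted({u for u, d in zip(times, events) if d and u <= t})
    for u in death_times:
        at_risk = sum(1 for v in times if v >= u)
        deaths = sum(1 for v, d in zip(times, events) if d and v == u)
        surv *= 1.0 - deaths / at_risk
    return surv


def censored_estimates(crp, times, events, c, k):
    """Prevalence, PVP and PVN from KM curves overall and within CRP subsets."""
    def km_in(rows):
        return km_survival([times[i] for i in rows], [events[i] for i in rows], k)

    everyone = range(len(crp))
    pos = [i for i in everyone if crp[i] > c]    # positive test
    neg = [i for i in everyone if crp[i] <= c]   # negative test
    return {
        "prev": 1.0 - km_in(everyone),  # Pr(T0 < k)
        "pvp": 1.0 - km_in(pos),        # Pr(T0 < k | CRP > c)
        "pvn": km_in(neg),              # Pr(T0 > k | CRP < c)
    }


# Hypothetical toy data: event = 0 means the subject was censored at that time
crp = [6.0, 7.0, 8.0, 9.0, 1.0, 2.0, 3.0, 2.5]
times = [2.0, 3.0, 6.0, 4.0, 8.0, 1.0, 9.0, 7.0]
events = [1, 0, 0, 1, 0, 1, 0, 0]
est = censored_estimates(crp, times, events, c=4.0, k=5.0)
```

Note that there is no analogous function for sensitivity or specificity: the subsets are defined by CRP, which is fully observed, never by T0, which is not.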
Can we INDIRECTLY estimate Sens and Spec in this setting by using Bayes Rule?
                                Pr (Dis | Pos) Pr(Pos)
Pr (Pos | Dis)  = --------------------------------------------------
                    Pr (Dis | Pos) Pr(Pos) + Pr(Dis | Neg) Pr(Neg)

                                Pr (Hlth | Neg) Pr(Neg)
Pr (Neg | Hlth) = --------------------------------------------------
                   Pr (Hlth | Pos) Pr(Pos) + Pr(Hlth | Neg) Pr(Neg)
Provided we had cross-sectional sampling, we can estimate everything on the
right-hand side (Pr(Pos) and Pr(Neg) are just the sample proportions with CRP >
c and CRP < c), so that would allow us to get a sensitivity and a specificity
through these indirect means. (If we had sampled according to CRP results--
purposely getting a certain number of positive and negative subjects-- we could
not do this, because estimating the Pr(Pos) and Pr(Neg) would not have been
possible.)
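As a sketch of that indirect calculation, the function below applies the two Bayes Rule formulas to a PVP, a PVN, and a Pr(Pos). The function name and the input values are made up for illustration. In practice the PVP and PVN would come from KM curves within the CRP subsets and Pr(Pos) from the sample proportion with CRP > c.

```python
def sens_spec_from_predictive_values(pvp, pvn, p_pos):
    """Bayes Rule: recover Sens and Spec from PVP, PVN and Pr(Pos)."""
    p_neg = 1.0 - p_pos
    # Pr(Dis | Pos) = PVP and Pr(Dis | Neg) = 1 - PVN
    p_dis = pvp * p_pos + (1.0 - pvn) * p_neg
    sens = pvp * p_pos / p_dis
    # Pr(Hlth | Neg) = PVN and Pr(Hlth | Pos) = 1 - PVP
    p_hlth = (1.0 - pvp) * p_pos + pvn * p_neg
    spec = pvn * p_neg / p_hlth
    return sens, spec


# Hypothetical inputs: PVP = 0.6, PVN = 0.9, and 25% of the sample test positive
sens, spec = sens_spec_from_predictive_values(pvp=0.6, pvn=0.9, p_pos=0.25)
```

The two denominators are Pr(Dis) and Pr(Hlth), so they should add to 1. That makes a handy check on the arithmetic.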
Now if we want to consider what would happen in a population with a different
mix of diseased and non-diseased...
If we presume that sensitivity and specificity are always the same, then the
PVP and PVN we estimate from these data depend on the prevalence of disease in
our sample. Hence, those estimates do not carry over to our new population.
But we can use the estimated sensitivity and specificity for the timeframes
involving the censored data, and then use Bayes Rule to get the PVP and PVN for
the new prevalences. This works provided our new population would have the
same sensitivity and specificity (an assumption we usually make, but it is
not always strictly true-- see the example in class).
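To illustrate that last step, here is a small sketch that takes a sensitivity and specificity as given and computes the PVP and PVN at whatever prevalence we like. The function name and the numbers (Sens = 0.8, Spec = 0.9, prevalences of 10% and 50%) are invented for the example.

```python
def predictive_values(sens, spec, prev):
    """Bayes Rule: PVP and PVN for a population with the given prevalence."""
    # Pr(Pos) = Pr(Pos | Dis) Pr(Dis) + Pr(Pos | Hlth) Pr(Hlth)
    p_pos = sens * prev + (1.0 - spec) * (1.0 - prev)
    pvp = sens * prev / p_pos
    pvn = spec * (1.0 - prev) / (1.0 - p_pos)
    return pvp, pvn


# Same hypothetical test applied to a low- and a high-prevalence population
pvp_low, pvn_low = predictive_values(sens=0.8, spec=0.9, prev=0.1)
pvp_high, pvn_high = predictive_values(sens=0.8, spec=0.9, prev=0.5)
```

With the sensitivity and specificity held fixed, raising the prevalence raises the PVP and lowers the PVN, which is why the predictive values from one sample cannot simply be reused in another population.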
And lastly we need to consider the missing CRP data. If we assume they are
"missing at random", we can safely ignore those cases-- basically pretend that
those people were not in our sample. (If we cannot assume this or reliably
impute the data in some other way, we are completely lost.)
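"Pretending those people were not in our sample" amounts to a complete-case filter applied before any of the other calculations. A minimal sketch, assuming missing CRP values are coded as None or NaN (the coding and the function name are my assumptions, not part of the data description):

```python
import math


def complete_cases(crp, times, events):
    """Drop every subject whose CRP is missing (None or NaN)."""
    keep = [
        i for i, x in enumerate(crp)
        if x is not None and not math.isnan(x)
    ]
    return (
        [crp[i] for i in keep],
        [times[i] for i in keep],
        [events[i] for i in keep],
    )


# Hypothetical toy data: subjects 2 and 3 have no CRP measurement
crp_cc, times_cc, events_cc = complete_cases(
    [1.0, None, float("nan"), 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [1, 0, 1, 0],
)
```

This is only defensible under the missingness assumption stated above. The code itself does nothing to check it.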
Scott
#####################################################################
Scott S. Emerson, M.D., Ph.D. Biost Dept: (O) 206-543-1044
Professor of Biostatistics (F) 206-543-3286
Department of Biostatistics Box 357232 ROC: (O) 206-221-4185
University of Washington (F) 206-543-0131
Seattle, Washington 98195 semerson@u.washington.edu
#####################################################################