BIOST 517 Q & A 10/11/2005
Classification of scientific questions

QUESTIONS 1 & 2 (of three):

1. When clustering observations you mention that "all variables are treated symmetrically: No delineation between outcomes and groups." I'm not sure if I understand what you mean by this...could you please clarify.

2. In the genomics/proteomics example you say it is a combination of clustering cases and variables. I take this to mean that clustering genes that are regulated similarly is an example of clustering variables? And that identifying groups of patients that tend to have the same pattern of gene expression is an example of clustering of cases (groups that share similar patterns in their measured variables)?

ANSWER:

I will use the genomics example to contrast the various approaches possible: We take, say, 100 patients and measure gene expression with, say, 10,000 probes.

The cluster analysis (clustering of cases) approach is to put all 10,000 probes into an analysis, with no particular focus on any one of the probes. Then we try to identify cases that have similar profiles across all probes. Having decided that there are 2 or more different clusters, we might then try to guess what patients within a cluster might have in common. (Maybe the clusters correspond to different diseases.)

The comparing distributions approach that might be used to answer a similar question is to define groups by some patient characteristics, and see whether the prevalence of probe (gene) expression differs across those pre-defined groups. (I believe the cluster analysis people sometimes refer to this as "supervised" learning, as opposed to "unsupervised" cluster analysis.) This approach based on comparing distributions across groups lends itself easily to statistical inference and to quantifying our confidence in the conclusions we reach.
Cluster analysis does not lend itself to such inference as easily, because it is hard to decide on agreement between clusters: If we repeat the experiment, we are not likely to come up with exactly the same clusters, and we do not always agree on how to measure concordance of results across experiments. (Suppose the first experiment clustered 15 patients together. If a second experiment created a cluster including 18 patients, of which 10 were from the cluster of 15 identified in the first experiment and 8 were from the 85 patients not in that cluster, is this good agreement or bad agreement? Is it even the same cluster?)

The other question addressed in these microarray type experiments is: Which probes cluster together? We thus might use the data on the 100 patients to identify which probes tend always to be expressed together in the same patient or not expressed together in the same patient. By looking at such patterns we might identify some "latent variable" and attach a name to it (e.g., the apoptosis pathway) according to some prior knowledge about the function of the genes corresponding to the probes (or many of the probes) that were clustered together.

As knowledge is gained, we would tend to further explore such pathways by focusing, perhaps, on a "sentinel" probe (my words, but perhaps used by others) that is known to be associated with, say, apoptosis, and then compare the prevalence of other probe expression among the patients who do or do not show expression of the sentinel probe. This latter approach focusing on comparing distributions across groups again allows statistical inference, while the "latent variable" type analysis makes it difficult to assess whether similar results are obtained across experiments.

QUESTION 3:

3. In the last point on comparing distributions (slide 19), you say that comparing distributions can include quantifying differences in effects across subgroups (interactions or effect modification). Could you explain how this differs from point 4b?
I would imagine that it might be the same except that you further sub-divide your predefined populations and compare intra-group distributions?

ANSWER:

In the following, I use the term "effect" to really mean just an association. That is all we can tell statistically. (Of course, we are hoping to be able to eventually determine cause and effect.)

The focus of question 4b is to determine, say, the "age effect" on blood pressure. We might look at the mean SBP in old people minus the mean SBP in young people. A pure subgroup question might be to answer question 4b just among males: The mean SBP in old males minus the mean SBP in young males. Similarly, we could have answered question 4b in females: The mean SBP in old females minus the mean SBP in young females.

The "effect modification" or "interaction" question (question 4c in my parlance) is not just interested in seeing what the age effect might be in a subgroup, but instead is trying to answer whether the age effect in one subgroup is different from the age effect in the other subgroup. So this might be answered by a difference between differences: (mean SBP in old males - mean SBP in young males) minus (mean SBP in old females - mean SBP in young females).

I note that in my world of clinical trials, we are very often quite interested in exploring question 4b in subgroups: Does the drug work in some particular subset of the patients? But we are also interested in question 4c, because it may give us hints about either mechanisms of disease or mechanisms of drug action: If the drug works differentially in different subgroups, could this be an indication of different metabolic pathways or different pathogens?

As you might imagine, teasing out when we are interested in a pure question within subgroups (question 4b) rather than contrasting subgroups' effects (question 4c) is not always easy.
When considering subgroups, it is also often difficult to tell whether what concerns us is confounding or effect modification. We will be discussing these issues all quarter (and next). And it only gets more confusing with three-way interactions, something that we will address in Biost 518.
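
The distinction between questions 4b and 4c in the answer above can be sketched numerically. This is only an illustration under stated assumptions: the SBP values are invented, and the names sbp, age_effect, and interaction are hypothetical, not taken from the course materials. It computes point estimates only and says nothing about their precision.

```python
from statistics import mean

# Hypothetical SBP readings (mmHg), invented purely for illustration.
sbp = {
    ("male", "old"): [142, 138, 150, 146],
    ("male", "young"): [122, 118, 126, 130],
    ("female", "old"): [136, 140, 132, 144],
    ("female", "young"): [112, 116, 120, 108],
}


def age_effect(sex):
    """Question 4b within one subgroup: mean SBP in old minus mean SBP in young."""
    return mean(sbp[(sex, "old")]) - mean(sbp[(sex, "young")])


# Question 4b answered separately in each subgroup.
effect_males = age_effect("male")
effect_females = age_effect("female")

# Question 4c: the difference between differences (age effect in males
# minus age effect in females).
interaction = effect_males - effect_females

print("Age effect in males:  ", effect_males)
print("Age effect in females:", effect_females)
print("Difference between differences:", interaction)
```

With these made-up numbers, each subgroup shows a clearly positive age effect (question 4b), while the difference between differences is small (question 4c). The two questions can therefore have quite different answers from the same data.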