quietly: infile mcode ht age sga parity smoker bweight sex gesage /// using z:documents/teach/datasets/pregout.txt using "http://www.emersonstatistics.com/datasets/pregout.txt * The first line was the variable names, so I drop that case drop in 1 * Creating variable FEMALE * The variable "sex" has two problems. It is coded 1=male, 2= female, * and the name is such that I have difficulty remembering the coding. * I greatly prefer that binary variables be coded as indicator variables: * the variable is 0 or 1, with the variable named according to the group * corresponding to 1: g FEMALE= sex - 1 * Creating SMOKER, NONSMOKER, SGA, NOTSGA * The variable "smoker" was coded 1=yes, 2= no. It is far easier to work * with an indicator variable coded 0= no, 1= yes. So I create a variable * SMOKER (Stata is case-sensitive) that is an indicator of smoking (0= nonsmoker, * 1= smoker), and a variable NONSMOKER that is an indicator of being a * non-smoker (0= smoker, 1= non-smoker). I note that variable "sga" was * already coded the way I like it, but I create SGA and NOTSGA just * so the commands will agree with the way I specified variable names * in the homework questions. g SMOKER= 2 - smoker g NONSMOKER= 1 - SMOKER g SGA= sga g NOTSGA= 1 - SGA ****************** * Problem 1: Descriptive statistics ****************** * Find the number of cases in the dataset (755 it turns out) describe * Look at patterns of missing data for subjects missing smoking behavior list age ht parity gesage sga bweight FEMALE if SMOKER==. * For parity, I want to present descriptive statistics both as a count * variable and within key categories. So I create a new categorization of * parity recode parity 3/6=3, gen(parctg) * Descriptive statistics for all scientific variables by smoking category tabstat age ht parity sga gesage bweight sex FEMALE, by(SMOKER) col(stat) /// stat(n mean sd min q max) long tabulate parctg SMOKER, col * Descriptive statistics for all scientific variables by SGA category tabstat age ht parity smoker SMOKER gesage bweight sex FEMALE, by(SGA) col(stat) /// stat(n mean sd min q max) long tabulate parctg SGA, col ****************** * Problem 2: Associations between SGA and smoking using Odds Ratios (OR) ****************** * Fitting the model the way I would prefer with a 0-1 indicator logit SGA SMOKER logistic SGA SMOKER di exp(-2.055861), exp(-2.055861 + 0.6367768 * 1) di .1279826 / (1 + .1279826), .24193548 / (1 + .24193548) predict fit2a bysort SMOKER: summ fit2a * Fitting the model using the originally coded variable logit SGA smoker logistic SGA smoker di exp(-.7823069 - 0.6367768 * 1), exp(-.7823069 - 0.6367768 * 2) di .2419356 / (1 + .2419356), .12798266 / (1 + .12798266) predict fit2aa bysort smoker: summ fit2aa bysort SMOKER: summ fit2aa * Fitting the alternative models logit SGA NONSMOKER predict fit2c1 logit NOTSGA SMOKER predict fit2c2 logit NOTSGA NONSMOKER predict fit2c3 bysort SMOKER: summ fit2c1 fit2c2 fit2c3 ****************** * Problem 3: Associations between SGA and smoking using Risk Difference (RD) ****************** * Fitting the model the way I would prefer with a 0-1 indicator regress SGA SMOKER regress SGA SMOKER, robust di .1134615 + .0813437 * 1 predict fit3a bysort SMOKER: summ fit3a * Fitting the model using the originally coded variable regress SGA smoker, robust di .2761489 - .0813437 * 1, .2761489 - .0813437 * 2 predict fit3aa bysort smoker: summ fit3aa bysort SMOKER: summ fit3aa * Fitting the alternative models logit SGA NONSMOKER predict fit3c1 logit NOTSGA SMOKER predict fit3c2 logit NOTSGA NONSMOKER predict fit3c3 bysort SMOKER: summ fit2c1 fit2c2 fit2c3 ****************** * Problem 4: Associations between SGA and smoking using Risk Ratios (RR) ****************** * Fitting the model the way I would prefer with a 0-1 indicator poisson SGA SMOKER poisson SGA SMOKER, robust poisson SGA SMOKER, robust ir di exp(-2.176291), exp(-2.176291 + .5405362 * 1) predict fit4a bysort SMOKER: summ fit4a * Fitting the model using the originally coded variable poisson SGA smoker poisson SGA smoker, robust predict fit4aa di exp(-1.095219 - .5405362 * 1), exp(-1.095219 - .5405362 * 2) bysort smoker: summ fit4aa bysort SMOKER: summ fit4aa * Fitting the alternative models logit SGA NONSMOKER predict fit4c1 logit NOTSGA SMOKER predict fit4c2 logit NOTSGA NONSMOKER predict fit4c3 bysort SMOKER: summ fit4c1 fit4c2 fit4c3 ****************** * Problem 5: Two sample analyses ****************** cc SGA SMOKER cc SGA SMOKER, exact ttest SGA, by(SMOKER) ttest SGA, by(SMOKER) unequal cs SGA SMOKER ir SGA SMOKER ****************** * Problem 6: Continuously modeled age ****************** g age5= age/5 regress SGA age5, robust predict fit6a poisson SGA age5, robust ir predict fit6b logistic SGA age5 predict fit6c tabulate SGA if age==20, col summ fit6a fit6b fit6c if age==20 ****************** * Problem 7: Fitted values ****************** egen sampprop= mean(SGA), by(age) twoway (scatter sampprop fit6a fit6b fit6c age, col(green blue black pink) /// xtitle("Maternal Age (years)") ytitle("Fitted SGA Probabilities") /// t1("Fitted Probability of Small for Gestational Age (SGA)") /// legend(label(1 "Sample proportions by age") label(2 "Linear regression") /// label(3 "Poisson (RR) regression") label(4 "Logistic (OR) regression"))) ****************** * Problem 8: Log transformed age ****************** g logage= log(age) / log(1.5) logistic SGA logage predict fit8 scatter logage age, xtitle("Maternal Age (years)") ytitle("Log Maternal Age (base 1.5)") scatter age age, yscale(log) xtitle("Maternal Age (years)") ytitle("Maternal Age (years)") twoway (scatter fit6c fit8 age, col(blue black) /// xtitle("Maternal Age (years)") ytitle("Fitted SGA Probabilities") /// t1("Fitted Probability of Small for Gestational Age (SGA)") /// legend(label(1 "Untransformed age") label(2 "Log transformed age")))