quietly: infile mcode ht age sga parity smoker bweight sex gesage ///
	using z:documents/teach/datasets/pregout.txt
	using "http://www.emersonstatistics.com/datasets/pregout.txt

* The first line was the variable names, so I drop that case
drop in 1

* Creating variable FEMALE
* The variable "sex" has two problems. It is coded 1=male, 2= female,
* and the name is such that I have difficulty remembering the coding.
* I greatly prefer that binary variables be coded as indicator variables:
* the variable is 0 or 1, with the variable named according to the group
* corresponding to 1:
g FEMALE= sex - 1

* Creating SMOKER, NONSMOKER, SGA, NOTSGA
* The variable "smoker" was coded 1=yes, 2= no. It is far easier to work
* with an indicator variable coded 0= no, 1= yes. So I create a variable
* SMOKER (Stata is case-sensitive) that is an indicator of smoking (0= nonsmoker,
* 1= smoker), and a variable NONSMOKER that is an indicator of being a 
* non-smoker (0= smoker, 1= non-smoker). I note that variable "sga" was
* already coded the way I like it, but I create SGA and NOTSGA just
* so the commands will agree with the way I specified variable names
* in the homework questions.
g SMOKER= 2 - smoker
g NONSMOKER= 1 - SMOKER
g SGA= sga
g NOTSGA= 1 - SGA

******************
*   Problem 1: Descriptive statistics
******************

* Find the number of cases in the dataset (755 it turns out)
describe

* Look at patterns of missing data for subjects missing smoking behavior
list age ht parity gesage sga bweight FEMALE if SMOKER==.

* For parity, I want to present descriptive statistics both as a count
* variable and within key categories. So I create a new categorization of
* parity
recode parity 3/6=3, gen(parctg)

* Descriptive statistics for all scientific variables by smoking category
tabstat age ht parity sga gesage bweight sex FEMALE, by(SMOKER) col(stat) ///
	stat(n mean sd min q max) long
tabulate parctg SMOKER, col
	
* Descriptive statistics for all scientific variables by SGA category
tabstat age ht parity smoker SMOKER gesage bweight sex FEMALE, by(SGA) col(stat) ///
	stat(n mean sd min q max) long
tabulate parctg SGA, col

******************
*   Problem 2: Associations between SGA and smoking using Odds Ratios (OR)
******************

* Fitting the model the way I would prefer with a 0-1 indicator
logit SGA SMOKER
logistic SGA SMOKER
di exp(-2.055861), exp(-2.055861 + 0.6367768 * 1)
di .1279826 / (1 + .1279826), .24193548 / (1 + .24193548)
predict fit2a
bysort SMOKER: summ fit2a

* Fitting the model using the originally coded variable
logit SGA smoker
logistic SGA smoker
di exp(-.7823069 - 0.6367768 * 1), exp(-.7823069 - 0.6367768 * 2)
di .2419356 / (1 + .2419356), .12798266 / (1 + .12798266)
predict fit2aa
bysort smoker: summ fit2aa
bysort SMOKER: summ fit2aa

* Fitting the alternative models
logit SGA NONSMOKER
predict fit2c1
logit NOTSGA SMOKER
predict fit2c2
logit NOTSGA NONSMOKER
predict fit2c3
bysort SMOKER: summ fit2c1 fit2c2 fit2c3

******************
*   Problem 3: Associations between SGA and smoking using Risk Difference (RD)
******************

* Fitting the model the way I would prefer with a 0-1 indicator
regress SGA SMOKER
regress SGA SMOKER, robust
di .1134615 + .0813437 * 1
predict fit3a
bysort SMOKER: summ fit3a

* Fitting the model using the originally coded variable
regress SGA smoker, robust
di .2761489 - .0813437 * 1, .2761489 - .0813437 * 2
predict fit3aa
bysort smoker: summ fit3aa
bysort SMOKER: summ fit3aa

* Fitting the alternative models
logit SGA NONSMOKER
predict fit3c1
logit NOTSGA SMOKER
predict fit3c2
logit NOTSGA NONSMOKER
predict fit3c3
bysort SMOKER: summ fit2c1 fit2c2 fit2c3

******************
*   Problem 4: Associations between SGA and smoking using Risk Ratios (RR)
******************

* Fitting the model the way I would prefer with a 0-1 indicator
poisson SGA SMOKER
poisson SGA SMOKER, robust
poisson SGA SMOKER, robust ir
di exp(-2.176291), exp(-2.176291 + .5405362 * 1)
predict fit4a
bysort SMOKER: summ fit4a

* Fitting the model using the originally coded variable
poisson SGA smoker
poisson SGA smoker, robust
predict fit4aa
di exp(-1.095219 - .5405362 * 1), exp(-1.095219 - .5405362 * 2)
bysort smoker: summ fit4aa
bysort SMOKER: summ fit4aa

* Fitting the alternative models
logit SGA NONSMOKER
predict fit4c1
logit NOTSGA SMOKER
predict fit4c2
logit NOTSGA NONSMOKER
predict fit4c3
bysort SMOKER: summ fit4c1 fit4c2 fit4c3

******************
*   Problem 5: Two sample analyses
******************

cc SGA SMOKER
cc SGA SMOKER, exact
ttest SGA, by(SMOKER)
ttest SGA, by(SMOKER) unequal
cs SGA SMOKER
ir SGA SMOKER

******************
*   Problem 6: Continuously modeled age
******************

g age5= age/5
regress SGA age5, robust
predict fit6a
poisson SGA age5, robust ir
predict fit6b
logistic SGA age5
predict fit6c
tabulate SGA if age==20, col
summ fit6a fit6b fit6c if age==20

******************
*   Problem 7: Fitted values
******************

egen sampprop= mean(SGA), by(age)
twoway (scatter sampprop fit6a fit6b fit6c age, col(green blue black pink) ///
	xtitle("Maternal Age (years)") ytitle("Fitted SGA Probabilities")        ///
	t1("Fitted Probability of Small for Gestational Age (SGA)")              ///
	legend(label(1 "Sample proportions by age") label(2 "Linear regression") ///
	label(3 "Poisson (RR) regression") label(4 "Logistic (OR) regression")))
	
******************
*   Problem 8: Log transformed age
******************

g logage= log(age) / log(1.5)
logistic SGA logage
predict fit8
scatter logage age, xtitle("Maternal Age (years)") ytitle("Log Maternal Age (base 1.5)")
scatter age age, yscale(log) xtitle("Maternal Age (years)") ytitle("Maternal Age (years)")
twoway (scatter fit6c fit8 age, col(blue black) ///
	xtitle("Maternal Age (years)") ytitle("Fitted SGA Probabilities")        ///
	t1("Fitted Probability of Small for Gestational Age (SGA)")              ///
	legend(label(1 "Untransformed age") label(2 "Log transformed age")))