****************************************
** Lab 5  							  **
**									  **
** 1) Imputation of Top-coded Income  **
** 2) Smoothing (Lowess) 			  **
** 3) Regression Splines			  **
** 4) Information Criteria			  **
**									  **
****************************************


* SETUP 
set maxvar 20000
use gss7214.dta


/** Generate family income measure in constant 2014 dollars **/

* assign mid points to scales and generate new vars
recode income 1=.5 2=2 3=3.5 4=4.5 5=5.5 6=6.5 7=7.5 8=9 9=12.5 10=17.5 11=22.5 12=170 13=.a if year<1977, gen(incomemp)
recode income72 1=1 2=3 3=5 4=7 5=9 6=11.25 7=13.75 8=16.25 9=18.75 10=22.5 11=27.5 12=170 13=.a, gen(income72mp)
recode income77 1=.5 2=2 3=3.5 4=4.5 5=5.5 6=6.5 7=7.5 8=9 9=11.25 10=13.75 11=16.25 12=18.75 13=21.25 14=23.75 15=37.5 16=170 17=.a, gen(income77mp)
recode income82 1=.5 2=2 3=3.5 4=4.5 5=5.5 6=6.5 7=7.5 8=9 9=11.25 10=13.75 11=16.25 12=18.75 13=21.25 14=23.75 15=30 16=42.5 17=170 18=.a, gen(income82mp)
recode income86 1=.5 2=2 3=3.5 4=4.5 5=5.5 6=6.5 7=7.5 8=9 9=11.25 10=13.75 11=16.25 12=18.75 13=21.25 14=23.75 15=27.5 16=32.5 17=37.5 18=45 19=55 20=170 21=.a, gen(income86mp)
recode income91 1=.5 2=2 3=3.5 4=4.5 5=5.5 6=6.5 7=7.5 8=9 9=11.25 10=13.75 11=16.25 12=18.75 13=21.25 14=23.75 15=27.5 16=32.5 17=37.5 18=45 19=55 20=67.5 21=170 22=.a, gen(income91mp)
recode income98 1=.5 2=2 3=3.5 4=4.5 5=5.5 6=6.5 7=7.5 8=9 9=11.25 10=13.75 11=16.25 12=18.75 13=21.25 14=23.75 15=27.5 16=32.5 17=37.5 18=45 19=55 20=67.5 21=82.5 22=100 23=170 24=.a, gen(income98mp)
recode income06 1=.5 2=2 3=3.5 4=4.5 5=5.5 6=6.5 7=7.5 8=9 9=11.25 10=13.75 11=16.25 12=18.75 13=21.25 14=23.75 15=27.5 16=32.5 17=37.5 18=45 19=55 20=67.5 21=82.5 22=100 23=120 24=140 25=170 26=.a, gen(income06mp)
* stack all values into "incomemp" for the corresponding years
replace incomemp = income72mp if year==1972
replace incomemp = income77mp if year>=1977
replace incomemp = income82mp if year>=1982
replace incomemp = income86mp if year>=1986
replace incomemp = income91mp if year>=1991
replace incomemp = income98mp if year>=1998
replace incomemp = income06mp if year>=2006
lab var incomemp "Family income (current $)"

* estimate top income code using Pareto distribution
* The Pareto distribution is defined by the CDF
*
*  P(X \le x) = 1 - (x_c / x )^\alpha
*
* where x is a given value (assumed to be greater or equal to x_c),
* x_c is the cut-off parameter, and \alpha is the shape parameter.
*
* An interesting property of the Pareto distribution is that the 
* mean above any threshold y is given as
*
* I = E[X|X \ge y] = [ \alpha / (\alpha - 1) ] * y
*
* This property makes the Pareto distribution, together with 
* the belief that income at the top can be 
* approximated by a Pareto distribution, especially useful for 
* imputing top-coded income variables.
* 
* There are two complications, however. First, as Mike argued,
* it turns out that the imputed values consistently overestimate the 
* income of the top category (according to Mike). To remedy this problem,
* he suggests to weight down the imputed income by using the formula
*
* \hat\income = y + (1/2)[I + y]
*
* The second difficulty lies in estimating the shape parameter \alpha.
* The most common approach is to assume that the income distribution is Pareto
* above a threshold j and use a second threshold k to estimate \alpha.
* In the current code we use k as the top-coded cutoff and j as the second-last income estimate.
* Then the estimate for \alpha is given as
* 
* \hat\alpha = [\ln (C_j/C_k)]/[\ln(k/j)] = [\ln C_j - \ln C_k] / [\ln k - \ln j]
*
* where C_j is the number of individuals with earnings above the first threshold,
* C_k the number of individuals with earnings above the second threshold,
* and j and k the corresponding income.
* Note that the numerator in this expression is simply the number of individuals
* which have earnings between the first and the second threshold, and that
* the denominator is the difference in the income corresponding to the thresholds.
* (This estimate of \alpha is actually the MLE of \alpha.)
*
* With this estimate of \alpha, we obtain the predicted income of the top category, i.e.,
*
* Imputed income of top-cat = k * [\hat\alpha / (\hat\alpha - 1 ) ]
*
* For example, let us look at the year 1972.
* You can see the income intervals by typing
*
* STATA CODE: tab income72 if year==1972
*
* The second highest threshold is        j = 20 (in thousands) 
* and the top-coded threshold is         k = 25
* The number of individuals above j is C_j = 33 + 32 = 65
* The number of individuals above k is C_k = 32
* So, our estimate of \alpha is 
*
* \hat \alpha = [\ln C_j - \ln C_k]/[\ln k - \ln j] = [\ln 65 - \ln 32] / [\ln 25 - \ln 20]
*
* which might be coded as

gen V72=(ln(65) - ln(32))/(ln(30)-ln(25))

* Next, the imputed value of income is 
*
* \hat\income = k + (1/2)[I - k] = (1/2) * [k + I] = (1/2)*[k + k(\hat\alpha)/(\hat\alpha -1)] 
*             = (1/2) * k * [1 + \hat\alpha/ (\hat\alpha - 1)]
*
* (If these expressions look complicated, I encourage you to write them down on paper...
*  There is a limit to writing equations in a do-file...)
* 
* As we have coded \hat\alpha as V72, the imputed income can be calculated as

gen M72=30/2 * (1 + V72/(V72-1))

* where the 30/2 is simply (1/2) * k, as k=30.
*
* Lastly, we save our estimates of alpha and the impluted income for the top-category
gen V=V72 if year==1972
gen M=M72 if year==1972
* and replace the income of the top category for the year 1972
replace incomemp=M72 if income72==12 & year==1972
drop V72 M72

/** The chunk of code that follows repeats this for every year **/

gen V73=(ln(195) - ln(92))/(ln(25)-ln(20))
gen M73=25/2 * (1 + V73/(V73-1))
replace V=V73 if year==1973
replace M=M73 if year==1973
replace incomemp=M73 if income==12 & year==1973
drop V73 M73

gen V74=(ln(218)-ln(115))/(ln(25)-ln(20))
gen M74=25/2 * (1 + V74/(V74-1))
replace V=V74 if year==1974
replace M=M74 if year==1974
replace incomemp=M74 if income==12 & year==1974
drop V74 M74

gen V75=(ln(239)-ln(128))/(ln(25)-ln(20))
gen M75=25/2 * (1 + V75/(V75-1))
replace V=V75 if year==1975
replace M=M75 if year==1975
replace incomemp=M75 if income==12 & year==1975
drop V75 M75

gen V76=(ln(276) - ln(148))/(ln(25)-ln(20))
gen M76=25/2 * (1 + V76/(V76-1))
replace V=V76 if year==1976
replace M=M76 if year==1976
replace incomemp=M76 if income==12 & year==1976
drop V76 M76

gen V77=(ln(208) - ln(38))/(ln(50)-ln(25))
gen M77=50/2 * (1 + V77/(V77-1))
replace V=V77 if year==1977
replace M=M77 if year==1977
replace incomemp=M77 if income77==16 & year==1977
drop V77 M77

gen V78=(ln(214) - ln(33))/(ln(50)-ln(25))
gen M78=50/2 * (1 + V78/(V78-1))
replace V=V78 if year==1978
replace M=M78 if year==1978
replace incomemp=M78 if income77==16 & year==1978
drop V78 M78

gen V80=(ln(330) - ln(61))/(ln(50)-ln(25))
gen M80=50/2 * (1 + V80/(V80-1))
replace V=V80 if year==1980
replace M=M80 if year==1980
replace incomemp=M80 if income77==16 & year==1980
drop V80 M80

gen V82=(ln(244) - ln(83))/(ln(50)-ln(35))
gen M82=50/2 * (1 + V82/(V82-1))
replace V=V82 if year==1982
replace M=M82 if year==1982
replace incomemp=M82 if income82==17 & year==1982
drop V82 M82

gen V83=(ln(142+153) - ln(142))/(ln(50)-ln(35))
gen M83=50/2 * (1 + V83/(V83-1))
replace V=V83 if year==1983
replace M=M83 if year==1983
replace incomemp=M83 if income82==17 & year==1983
drop V83 M83

gen V84=(ln(288) - ln(120))/(ln(50)-ln(35))
gen M84=50/2 * (1 + V84/(V84-1))
replace V=V84 if year==1984
replace M=M84 if year==1984
replace incomemp=M84 if income82==17 & year==1984
drop V84 M84

gen V85=(ln(347) - ln(162))/(ln(50)-ln(35))
gen M85=50/2 * (1 + V85/(V85-1))
replace V=V85 if year==1985
replace M=M85 if year==1985
replace incomemp=M85 if income82==17 & year==1985
drop V85 M85

gen V86=(ln(156) - ln(99))/(ln(60)-ln(50))
gen M86=60/2 * (1 + V86/(V86-1))
replace V=V86 if year==1986
replace M=M86 if year==1986
replace incomemp=M86 if income86==20 & year==1986
drop V86 M86

gen V87=(ln(228) - ln(130))/(ln(60)-ln(50))
gen M87=60/2 * (1 + V87/(V87-1))
replace V=V87 if year==1987
replace M=M87 if year==1987
replace incomemp=M87 if income86==20 & year==1987
drop V87 M87

gen V88=(ln(197) - ln(120))/(ln(60)-ln(50))
gen M88=60/2 * (1 + V88/(V88-1))
replace V=V88 if year==1988
replace M=M88 if year==1988
replace incomemp=M88 if income86==20 & year==1988
drop V88 M88

gen V89=(ln(269) - ln(164))/(ln(60)-ln(50))
gen M89=60/2 * (1 + V89/(V89-1))
replace V=V89 if year==1989
replace M=M89 if year==1989
replace incomemp=M89 if income86==20 & year==1989
drop V89 M89

gen V90=(ln(254) - ln(172))/(ln(60)-ln(50))
gen M90=60/2 * (1 + V90/(V90-1))
replace V=V90 if year==1990
replace M=M90 if year==1990
replace incomemp=M90 if income86==20 & year==1990
drop V90 M90

gen V91=(ln(166) - ln(80))/(ln(75)-ln(60))
gen M91=75/2 * (1 + V91/(V91-1))
replace V=V91 if year==1991
replace M=M91 if year==1991
replace incomemp=M91 if income91==21 & year==1991
drop V91 M91

gen V93=(ln(268) - ln(160))/(ln(75)-ln(60))
gen M93=75/2 * (1 + V93/(V93-1))
replace V=V93 if year==1993
replace M=M93 if year==1993
replace incomemp=M93 if income91==21 & year==1993
drop V93 M93

gen V94=(ln(454) - ln(271))/(ln(75)-ln(60))
gen M94=75/2 * (1 + V94/(V94-1))
replace V=V94 if year==1994
replace M=M94 if year==1994
replace incomemp=M94 if income91==21 & year==1994
drop V94 M94

gen V96=(ln(529) - ln(329))/(ln(75)-ln(60))
gen M96=75/2 * (1 + V96/(V96-1))
replace V=V96 if year==1996
replace M=M96 if year==1996
replace incomemp=M96 if income91==21 & year==1996
drop V96 M96

gen V98=(ln(243) - ln(136))/(ln(110)-ln(90))
gen M98=90/2 * (1 + V98/(V98-1))
replace V=V98 if year==1998
replace M=M98 if year==1998
replace incomemp=M98 if income98==23 & year==1998
drop V98 M98

gen V00=(ln(264) - ln(174))/(ln(110)-ln(90))
gen M00=90/2 * (1 + V00/(V00-1))
replace V=V00 if year==2000
replace M=M00 if year==2000
replace incomemp=M00 if income98==23 & year==2000
drop V00 M00

gen V02=(ln(344) - ln(230))/(ln(110)-ln(90))
gen M02=90/2 * (1 + V02/(V02-1))
replace V=V02 if year==2002
replace M=M02 if year==2002
replace incomemp=M02 if income98==23 & year==2002
drop V02 M02

gen V04=(ln(447) - ln(295))/(ln(110)-ln(90))
gen M04=90/2 * (1 + V04/(V04-1))
replace V=V04 if year==2004
replace M=M04 if year==2004
replace incomemp=M04 if income98==23 & year==2004
drop V04 M04

gen V06=(ln(302) - ln(213))/(ln(150)-ln(130))
gen M06=130/2 * (1 + V06/(V06-1))
replace V=V06 if year==2006
replace M=M06 if year==2006
replace incomemp=M06 if income06==25 & year==2006
drop V06 M06

gen V08=(ln(167) - ln(123))/(ln(150)-ln(130))
gen M08=130/2 * (1 + V08/(V08-1))
replace V=V08 if year==2008
replace M=M08 if year==2008
replace incomemp=M08 if income06==25 & year==2008
drop V08 M08

gen V10=(ln(172) - ln(115))/(ln(150)-ln(130))
gen M10=130/2 * (1 + V10/(V10-1))
replace V=V10 if year==2010
replace M=M10 if year==2010
replace incomemp=M10 if income06==25 & year==2010
drop V10 M10

gen V12=(ln(189) - ln(144))/(ln(150)-ln(130))
gen M12=130/2 * (1 + V12/(V12-1))
replace V=V12 if year==2012
replace M=M12 if year==2012
replace incomemp=M12 if income06==25 & year==2012
drop V12 M12


/** Inflation Adjustments **/

* As the value of the dollar falls with rising inflation,
* Income is not directly comparable over long runs of time.
* The standard way to use income in a longitudinal analysis is, therefore,
* to transform income into constant dollars (an alternative is to use percentiles)
* To do so, we need the Consumer Prince Index 
* (usually the CPI-U-RS, which can be found on BLS.)
* (Mike used the CPI-U-RS less food and energy, so we will use the same series)

* Now, I could not find the numbers that Mike enters in his do-file. 
* This is probably because the BLS updates their estimates of the CPI when new 
* methodologies are developed or new information becomes available.
* On the other hand, the published indices by the BLS start from the year 1977. 
* So, let us use Mike's estimates for pre 1977 years and use the new estimates from the BLS
* for the post 1977 years
* A last thing to note is that the GSS asks about income in the previous year,
* so, we'll assign the CPI of year t to that of year t+1
* (If you have to do recodings like these, consider generating the code in excel..)
gen cpi_urs = .
* for example 61.9 is the CPI for year 1971, but as the GSS in 1972 asks for income in 
* the previous (1971) year, we'll assign its value to the year 1972
replace cpi_urs = 61.9 if year ==1972
replace cpi_urs = 63.9 if year ==1973
replace cpi_urs = 67.9 if year ==1974
replace cpi_urs = 75.4 if year ==1975
replace cpi_urs = 82.3 if year ==1976
replace cpi_urs = 87 if year ==1977
replace cpi_urs = 100 if year ==1978
replace cpi_urs = 103.5 if year ==1979
replace cpi_urs = 111 if year ==1980
replace cpi_urs = 120.9 if year ==1981
replace cpi_urs = 132.2 if year ==1982
replace cpi_urs = 142.4 if year ==1983
replace cpi_urs = 150.4 if year ==1984
replace cpi_urs = 158.1 if year ==1985
replace cpi_urs = 165 if year ==1986
replace cpi_urs = 171.7 if year ==1987
replace cpi_urs = 178.1 if year ==1988
replace cpi_urs = 185.2 if year ==1989
replace cpi_urs = 192.6 if year ==1990
replace cpi_urs = 201.4 if year ==1991
replace cpi_urs = 209.9 if year ==1992
replace cpi_urs = 216.4 if year ==1993
replace cpi_urs = 222.5 if year ==1994
replace cpi_urs = 227.7 if year ==1995
replace cpi_urs = 233.4 if year ==1996
replace cpi_urs = 239.1 if year ==1997
replace cpi_urs = 244.4 if year ==1998
replace cpi_urs = 249.6 if year ==1999
replace cpi_urs = 254.6 if year ==2000
replace cpi_urs = 260.9 if year ==2001
replace cpi_urs = 267.9 if year ==2002
replace cpi_urs = 274.1 if year ==2003
replace cpi_urs = 278.1 if year ==2004
replace cpi_urs = 283.1 if year ==2005
replace cpi_urs = 289.2 if year ==2006
replace cpi_urs = 296.4 if year ==2007
replace cpi_urs = 303.4 if year ==2008
replace cpi_urs = 310.3 if year ==2009
replace cpi_urs = 315.6 if year ==2010
replace cpi_urs = 318.6 if year ==2011
replace cpi_urs = 323.9 if year ==2012
replace cpi_urs = 330.8 if year ==2013
replace cpi_urs = 336.6 if year ==2014
lab var cpi_urs "Inflation Adjustment Factor"

* Now as you can see, the CPI is 100 for the year 1978, this means that the index
* is standardized for the value of the dollar in 1977 
* (again, because GSS asks about the previous year's income)
* To rescale the index to 2014 dollars, we simly divide all its values by the value of the year 2015 
* (here, the value for 2014 because of income question)

gen CPI_u14=cpi_u/336.6

* Thereafter we divide the nominal income "by" the CPI, to obtain constant 2014 dollars
* Note that the income in 2014 will not change as the index is 336.6/336.6 = 1 in year 2014
* On the other hand, the income of all previous years will be scaled up as the renormalized CPI 
* is smaller than 1 for these years
gen income14 = incomemp/CPI_u14
lab var income14 "Family Income (2014$)"


*** WE ARE DONE **


/** Smoothing **/

* Let us generate some variables:
* party identification: -3 =strong Democrat up to 3 =strong Republican
gen pid = partyid - 3 if partyid < 7
* log income
gen lbinc = ln(income14)/ln(2)
* female
gen female = sex ==2 if sex < .

* race (this is a little bit more complex)
gen lat=hispanic>1 if hispanic<.
replace lat = 1 if hispanic>=. & ethnic==17
replace lat = 1 if hispanic>=. & ethnic==22
replace lat = 1 if hispanic>=. & ethnic==25
replace lat = 1 if hispanic>=. & ethnic==28
replace lat = 1 if hispanic>=. & ethnic==38
replace lat = 1 if hispanic>=. & eth1==17
replace lat = 1 if hispanic>=. & eth1==22
replace lat = 1 if hispanic>=. & eth1==25
replace lat = 1 if hispanic>=. & eth1==28
replace lat = 1 if hispanic>=. & eth1==38
replace lat = 1 if hispanic>=. & eth2==17
replace lat = 1 if hispanic>=. & eth2==22
replace lat = 1 if hispanic>=. & eth2==25
replace lat = 1 if hispanic>=. & eth2==28
replace lat = 1 if hispanic>=. & eth2==38
replace lat = 1 if hispanic>=. & eth3==17
replace lat = 1 if hispanic>=. & eth3==22
replace lat = 1 if hispanic>=. & eth3==25
replace lat = 1 if hispanic>=. & eth3==28
replace lat = 1 if hispanic>=. & eth3==38
lab var lat "Hispanic Heritage"
lab def lat 0 "Other" 1 "Hispanic"
lab val lat lat

recode race 1=1 2=2 3=4
replace race = 3 if lat==1 & race!=2
lab var race "Racial ancestry"
lab def race 1 White 2 Black 3 Latino 4 Other, modify
lab val race race


* let us look into the post-2008 years 
keep if year > 2008

* we might explore the association between pid and education by using loess
tw lowess pid educ, bw(1) 
* so it seems that up to 6 years, pid remains stable at -.6 (slightly Democratic leaning),
* then from 6 to 15 years, individuals become, on average, more Republican,
* and after a BA, they become more Democrat

* overall the association between education and pid is relative weak considering that
* pid ranges from -3 to 3
tw (scatter pid educ, m(oh) mc(blue) msize(small)) ///
	(lowess pid educ, bw(1)), ylabel(-3(1)3) xlabel(0(4)20)
* note that the scatterplot doesn't really show what's going on in the data
* this is because pid and educ are both actually "discrete" variables (although we treat them as continuous)
* so no matter how many points are concentrated in one spot, STATA will show only one point

* to remedy this, we might use the jitter option	
tw (scatter pid educ, m(oh) mc(blue) msize(vsmall) jitter(4)) ///
	(lowess pid educ, bw(1)), ylabel(-3(1)3) xlabel(0(4)20)
* now, we see that most of the points are concentrated at education year 12 and 16	

* you can also save the smoothed curve using the "gen" option
lowess pid educ, bw(1) gen(spid)

* Let us compare the nonparametric smooth with parametric curves	
reg pid c.educ // linear fit
predict yhat_linear if e(sample)
reg pid c.educ##c.educ // quadratic fit 
predict yhat_quad if e(sample)
reg pid c.educ##c.educ##c.educ // cubic fit
predict yhat_cubic if e(sample)
reg pid c.educ##c.educ##c.educ##c.educ // quartic fit, not significant

* comparing these models to the nonparametric smooth, we get
# delimit
tw  (line yhat_linear educ, sort) 
	(line yhat_quad educ, sort lc(blue)) 
	(line yhat_cubic educ, sort lc(green)) 
	(line spid educ if e(sample), lc(red) sort), 
	legend(pos(3) 
		   label(1 "linear") 
		   label(2 "quadratic") 
		   label(3 "cubic")
		   label(4 "smoothed")
	)
;

lab def pid -3 "Strong Democrat" -2 "Democrat" -1 "Democrat Leaning" 0 "Independent" ///
		1 "Republican Leaning" 2 "Republican" 3 "Strong Republican"
lab val pid pid

# delimit
tw  (scatter pid educ, m(oh) mc(blue*.25) msize(vsmall) jitter(4)) 
	(line yhat_linear educ, sort) 
	(line yhat_quad educ, sort lc(blue)) 
	(line yhat_cubic educ, sort lc(green)) 
	(line spid educ if e(sample), lc(red) sort), 
	legend(pos(6) row(2) symxsize(small) size(small) holes(4)
		   label(1 "Observed (Jittered)")
		   label(2 "Linear Fit") 
		   label(3 "Quadratic Fit") 
		   label(4 "Cubic Fit")
		   label(5 "Lowess")
		  )
	ylabel(-3(1)3, labels)
	xlabel(0(4)20, angle(horizontal))
;

	

/** Controling for other variables **/

* We might do the same exercise, but this time while controling for other variables
* first we drop the predicted values
drop yhat_*

* Next we run the regressions with age, gender, income, and race as controls
reg pid c.educ c.age female lbinc i.race // linear fit
* Let's store these estimates
estimates store linear

* To get predictions for education we have to hold constant other variables at certain values
* here we set race to "white," gender to "male" and the rest to their mean
* (the choice of these categories is because they are the baseline categories for the dummies)
*
* THE FOLLOWING CODE MIGHT BE A LITTLE BIT COMPLEX. BUT BECOMING USED TO THESE KIND OF THINGS
* WILL BE VERY HELPFUL, WHEN YOU ARE PROGRAMMING IN STATA (AND ANY OTHER CODING LANGUAGE)

* So, first we save the coefficients into a matrix
mat B = e(b)
* we might look into the saved matrix
mat list B

* next we define some "local variables"
* local variables are variables (which hold usualy only a number or a string) that disappear after the execution of a command
* Thus the next chunk of code has to be executed in one run.

* If you want to refer to a local variable in the code, you have to enclose the name of the variable in ` and '.
* for example if you want refer to the local variable x, you have to use `x' in your code.
* Also, some commands in stata work only for strings. So, if you want to convert a number that is stored in a
* local variable into a string, you will need to write "`x'".
* you can search "help local" in stata for more details

* Lastly, after you run the command "summarize" in STATA, STATA automatically saves some values such as the mean, min, and max
* into local variables, which you can access after running the command.
* For example, you might type
*
* 		sum age
*		return list
*
* to see which values are saved under what names. The list will also show that the mean of "age" is saved in a scalar
* named r(mean). After running the summaize command, you can therefore use these values for other computations.
* From the example above, you might type
*
*		sum age
*		display r(mean)
*
* which will show you the mean of age
*
*
* OKAY, now, here is the code we want to run:

* first we define a local variable "cons_sum" and set its value to zero
local cons_sum = 0
* The "foreach" command is a looping command:
* at each iteration of the loop, age and lbinc (in turn) are substituted into the local variable `v'
foreach v of varlist age lbinc {
	* first we quietely summarize `v' (the "mean" option tells STATA to summarize only the mean)
	qui sum `v', mean
	* we define a new local variable that contains the mean of `v' 
	* the name of this local variable will be mean_`v', e.g., if v is "age", the name of the local 
	* variable is thus "mean_age"
	local mean_`v' = r(mean)
	* we display the mean
	di "mean of variable `v' : `mean_`v''"
	* (note that we have used `mean_`v'' to refer to the new local variable)
	* we extract the column number of B that corresponds to the variable `v' and save into `z'
	* This is simply the coefficient for `v' in the previous regression
	local z = colnumb(B, "`v'")
	* we multiply this coefficient by the mean of `v'
	local cons_sum = `cons_sum' + B[1,`z'] * `mean_`v''
}
* lastly, we add the constant to `cons_sum'
local z = colnumb(B, "_cons")
local cons_sum = `cons_sum' + B[1,`z']

* Now, what is contained in `cons_sum' is the follwing:
*
* cons_sum = constant + coef_age * mean_age + coef_lbinc * mean_lbinc + coef_female * 0 + coef_race * 0
*
* In other words, cons_sum contains the predicted outcome for individuals with
* average age, average income, who are male, and who are white (and for which educ is zero)
*
* Now, the last step is to let education vary to get predictions of how the outcome changes with education
* we generate the prediction of the outcome (yhat_linear),
* by generating the variable
gen yhat_linear = `cons_sum' + B[1,1]*educ if e(sample)

* This variable represents, therefore, the predicted change in the outcome (pid) when 
* age and income is hold constant at their mean, and female and race is set, respectively, to "male" and "white"



** quadratic fit **
reg pid c.educ##c.educ age female lbinc i.race // quadratic fit 
* again we store the estimates
estimates store quadratic
* and use the same routine to get predictions
mat B = e(b)
mat list B

local cons_sum = 0
foreach v of varlist age lbinc {
	qui sum `v', mean
	local mean_`v' = r(mean)
	local z = colnumb(B, "`v'")
	local cons_sum = `cons_sum' + B[1,`z'] * `mean_`v''
}
local z = colnumb(B, "_cons")
local cons_sum = `cons_sum' + B[1,`z']
* NOTE: now we have to use both the linear term and the quadratic term!!
local l = colnumb(B, "educ")
local q = colnumb(B, "c.educ#c.educ")
gen yhat_quadratic = `cons_sum' + B[1,`l']*educ + B[1, `q']*educ^2 if e(sample)

** cubic fit **
reg pid c.educ##c.educ##c.educ age female lbinc i.race // cubic fit
estimates store cubic

mat B = e(b)
mat list B

local cons_sum = 0
foreach v of varlist age lbinc {
	qui sum `v', mean
	local mean_`v' = r(mean)
	local z = colnumb(B, "`v'")
	local cons_sum = `cons_sum' + B[1,`z'] * `mean_`v''
}
local z = colnumb(B, "_cons")
local cons_sum = `cons_sum' + B[1,`z']
* NOW WE NEED THE LINEAR, QUADRATIC, AND CUBIC TERM
local l = colnumb(B, "educ")
local q = colnumb(B, "c.educ#c.educ")
local c = colnumb(B, "c.educ#c.educ#c.educ")
gen yhat_cubic = `cons_sum' + B[1,`l']*educ + B[1, `q']*educ^2 + B[1,`c']*educ^3 if e(sample)

* comparing these models, we get
# delimit
tw  (line yhat_linear educ, sort) 
	(line yhat_quad educ, sort lc(blue)) 
	(line yhat_cubic educ, sort lc(green)), 
	legend(pos(3) 
		   label(1 "linear") 
		   label(2 "quadratic") 
		   label(3 "cubic")
	) 
;


* To get a "feeling" of how the pattern in the data look like after controlling for a set of variables,
* we might also use lowess on two sets of residuals: 
* 1) residuals after regressing pid on the controls, and
* 2) residuals after regressing educ on the controls

* The procedure is as follows:

reg pid age female lbinc i.race // regress outcome on controls
predict e_cont if e(sample), res // save residuals
reg educ age female lbinc i.race // regress education on controls
predict e_educ if e(sample), res // save residuals
lowess e_cont e_educ, bw(1) nograph gen(res_smooth) // Run Lowess

* Now, looking at the pattern ...
tw line res_smooth e_educ,sort
tw (scatter e_cont e_educ, mc(blue*.25) m(oh) msize(vsmall)) ///
	(line res_smooth e_educ, sort)
* it seems that the first part of the lowess curve is estimated with only a few observations.

* (I HAVE ADDED A SHORT DISCUSSION AT THE END OF THE DO-FILE TO EXPLAIN, INTUITIVELY, WHY THIS PROCEDURE WORKS)

* Notice that both the x and the y-axis are residuals and not the original scales.
* So, let us also look how the smoothed residuals relate to education
tw scatter res_smooth educ, mc(blue*.5) m(oh) msize(vsmall) jitter(2)
* This plot is only "suggestive" and "might" be misleading since it does not take into account
* the correlation between educ and the control variables
* Still, it let's us guess where the major changes happens

* It appears that the major change in the relationship between pid and education occurs
* either at 12 or 16 of the education scale
* Let's try out splines to mimic the curvature of the function
* first we start with an additional `boost' for those who have more a hs diploma
gen educ_from12 = 0
replace educ_from12 = educ - 11 if educ > 11 & educ <.
* a look how the additional variable is defined
scatter educ_from12 educ

* Now, let's run the regression and look at the results
reg pid educ educ_from12 age female lbinc i.race 
* the slope for education is .0197374 for those with education lower than 12
* for those with more than a hs diploma the slope is .0197374 + ( -.0814923)* (educ - 11)
* The significant coefficient on from_12 shows us that there is indeed an additional (negative) boost
* after 12 years of education.

* lastly, let us store the estimates of the models
estimates store from_12

* Next, we run the same model for a change in the slopes after obtaining a BA
gen educ_from16 = 0
replace educ_from16 = educ - 15 if educ > 15 & educ < .
* run regression
reg pid educ educ_from16 age female lbinc i.race
estimates store from_16

* We can also get the predictions of the model using the code from above:
mat B = e(b)
mat list B

local cons_sum = 0
foreach v of varlist age lbinc {
	qui sum `v', mean
	local mean_`v' = r(mean)
	local z = colnumb(B, "`v'")
	local cons_sum = `cons_sum' + B[1,`z'] * `mean_`v''
}
local z = colnumb(B, "_cons")
local cons_sum = `cons_sum' + B[1,`z']

local base = colnumb(B, "educ")
local boost = colnumb(B, "educ_from16")
gen yhat_spline16 = `cons_sum' + B[1,`base']*educ + B[1, `boost']*educ_from16 if e(sample)

* The prediction looks something like...

tw line yhat_spline16 educ, sort ylabel(-.5(.25).5) 


* A last complication when running all these models is to decide which model to choose as the final model
* For the models that we've run so far, these might be some guidelines ...
* 1) note that the linear model and the splines are "nested". 
*    Thus a significant coefficient on the "boosting" term indicates that
*    its inclusion is statistically justified.
*    Also, the linear, quadratic, and cubic functions are nested as well.
*    So, again, a significance test is a good way to go.
* 2) models with different spline functions, or with the cubic term
*    are, however, not nested. In these situations,
*    information criteria such as BIC or AIC are often used.

* So, as both spline functions are significant, 
* and the cubic term in the cubic regression is significant as well, 
* we can rule out the linear and the quadratic model

* To select a model among the remaining ones, we might use
estimates stats cubic from_12 from_16
* Lower AIC and BIC values are "better". So, it seems that the spline with BA-boost seems to be prefered 


** WHY RUNNING REGRESSIONS OF RESIDUALS ON RESIDUALS?

* To get a "feeling" of how the pattern in the data look like after controlling for a set of variables,
* we might use lowess two sets of residuals.
* The procedure is as follows:

reg pid age female lbinc i.race if !missing(educ) // regress outcome on controls
* we have to specify !missing(educ) so that we use the same observations in both regressions
predict e_cont if e(sample), res // save residuals
reg educ age female lbinc i.race if !missing(pid) // regress education on controls
predict e_educ if e(sample), res // save residuals
lowess e_cont e_educ, bw(1) gen(res_smooth) // run loess on both residuals and save predictions

tw line res_smooth e_educ,sort
tw (scatter e_cont e_educ, mc(blue*.25) m(oh) msize(vsmall)) ///
	(line res_smooth e_educ, sort)


/** Why does this work? **/

* Intuitively, the very concept of "controlling for" or "holding constant" control variables
* is nothing else than "partialling out" the effect of these variables 
* from "both" 1) the focal predictor and 2) the outcome
* 
* The residual of a regression, on the other hand, "is" the variation of the outcome that remains
* after the "effect" of the predictors are paritalled out
* 
* Hence, if we residualize both the outcome and the focal predictor
* and regressing these residuals on one another, we recover the origian regression coefficient
* from a multiple regression!
*
* Or, at least, this is the intuition...
* For a more detailed explanation, you might search for the Frisch-Waugh Theorem.
* (The theorem should not be too technical if you are familiar with basic linear algebra..)

* Just to demonstrate that it works, you might try the following:
* regress pid on education and controls, and save coefficient in a scalar
reg pid educ age female lbinc i.race
mat B = e(b)
scalar educ_b = B[1,1]

* regress pid on controls, save residuals 
qui reg pid age female lbinc i.race if !missing(educ)
predict e1 if e(sample), res
* regress education on controls, save residuals
qui reg educ age female lbinc i.race if !missing(pid)
predict e2 if e(sample), res
* regress residuals of pid on residuals on education, save coefficient in scalar
qui reg e1 e2
mat B=e(b)
scalar educ_res = B[1,1]
* compare the two coefficients
di educ_b
di educ_res
* Voila!