Clase 20: Regresiones¶
En esta clase revisaremos los conceptos econométricos de Mínimos Cuadrados Ordinarios (OLS por sus siglas en inglés) y Modelo Autorregresivo (AR por sus siglas en inglés).
Para esto vamos a utilizar la librería statsmodels
: https://www.statsmodels.org/stable/install.html
Mínimos Cuadrados Ordinarios (Ordinary Least Squares)¶
Buscamos estimar la Función de Regresión Poblacional (FRP), sin embargo esta no es observable directamente. Para esto vamos a utilizar una estimación que llamaremos la Función de Regresión Muestral (FRM):
FRP: Yi=β1+β2Xi+μi
FRM: Yi=ˆβ1+ˆβ2Xi+ˆμi=ˆYi−ˆμi, donde ˆYi es el valor estimado (media condicional) de Yi.
Entonces, en OLS vamos a minimizar el error ˆμi=Yi−^Yi $∑u2i=∑(Yi−ˆYi)2=∑(Yi−ˆβ1−ˆβ2Xi)2$
/home/felix/miniconda3/lib/python3.9/site-packages/pandas/io/stata.py:1417: UnicodeWarning:
One or more strings in the dta file could not be decoded using utf-8, and
so the fallback encoding of latin-1 is being used. This can happen when a file
has been incorrectly encoded by Stata or some other software. You should verify
the string values returned are correct.
warnings.warn(msg, UnicodeWarning)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-5-b7adf0787577> in <module>
----> 1 plt.plot(df2.ytrabajocor, 'g')
2 plt.xlabel("Observación")
3 plt.ylabel('Ingreso del trabajo')
NameError: name 'plt' is not defined


Datos para la regresión
const | esc | |
---|---|---|
0 | 1.0 | 12.0 |
3 | 1.0 | 15.0 |
5 | 1.0 | 16.0 |
7 | 1.0 | 15.0 |
11 | 1.0 | 14.0 |
OLS Regression Results
==============================================================================
Dep. Variable: ytrabajocor R-squared: 0.183
Model: OLS Adj. R-squared: 0.183
Method: Least Squares F-statistic: 1.601e+04
Date: Mon, 26 Jul 2021 Prob (F-statistic): 0.00
Time: 22:21:34 Log-Likelihood: -1.0554e+06
No. Observations: 71466 AIC: 2.111e+06
Df Residuals: 71464 BIC: 2.111e+06
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -3.694e+05 7751.498 -47.662 0.000 -3.85e+05 -3.54e+05
esc 7.572e+04 598.451 126.532 0.000 7.45e+04 7.69e+04
==============================================================================
Omnibus: 64620.243 Durbin-Watson: 1.630
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3238679.663
Skew: 4.257 Prob(JB): 0.00
Kurtosis: 34.861 Cond. No. 43.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
No. Observations: número de observaciones ocupadas en la regresión.
R-squared: El R2 es la proporción de la varianza en la variable dependiente que puede ser predicha por las variables independientes. En este caso un 18% de la varianza es explicada por las variables independientes.
Adj R-squared (R2 ajustado): el R2 se ajusta para penalizar que se agreguen regresores extraños.
F-statistic (F(Z,)): Es la razón entre el error cuadrático medio del modelo y el error cuadrático medio de los residuos. Nos sirve para determinar la importancia general del modelo.
Criterios de información (AIC y BIC): nos sirven para ver el trade-off entre complejidad del modelo y bondad de ajuste de este.
En las filas están las variables con las que regresionamos: constante y educación. En las columnas tenemos:
El coeficiente de las variables independientes y la constante. Acá estamos estimando los βi: Yi=ˆβ1+ˆβ2Xi+ˆμi
std err: error estándar asociado a los coeficientes.
t: estadístico t usado para testear la significancia de los coeficientes.
P>|t|: p-value de dos colas usado para testear hipótesis nula de que el coeficiente es cero (α=0.05).
[0.025 - 0.975]: intervalo de confianza de coeficiente con α=0.05.
/home/felix/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py:1599: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self.obj[key] = infer_fill_value(value)
/home/felix/miniconda3/lib/python3.9/site-packages/pandas/core/indexing.py:1720: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_column(loc, value, pi)
const | esc | edad | sexo2 | esc2 | |
---|---|---|---|---|---|
0 | 1.0 | 12.0 | 34 | 1.0 | 144.0 |
3 | 1.0 | 15.0 | 45 | 0.0 | 225.0 |
5 | 1.0 | 16.0 | 57 | 0.0 | 256.0 |
7 | 1.0 | 15.0 | 56 | 1.0 | 225.0 |
11 | 1.0 | 14.0 | 54 | 1.0 | 196.0 |
<ipython-input-18-79fecfcd28f3>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df3['exp'] = df3.edad-df3.esc
<ipython-input-18-79fecfcd28f3>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df3['exp2'] = df3.exp**2
<ipython-input-18-79fecfcd28f3>:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df3['exp_sexo'] = df3.exp*df3.sexo2
const | esc | exp | exp2 | sexo2 | exp_sexo | |
---|---|---|---|---|---|---|
0 | 1.0 | 12.0 | 22.0 | 484.0 | 1.0 | 22.0 |
3 | 1.0 | 15.0 | 30.0 | 900.0 | 0.0 | 0.0 |
5 | 1.0 | 16.0 | 41.0 | 1681.0 | 0.0 | 0.0 |
7 | 1.0 | 15.0 | 41.0 | 1681.0 | 1.0 | 41.0 |
11 | 1.0 | 14.0 | 40.0 | 1600.0 | 1.0 | 40.0 |
OLS Regression Results
==============================================================================
Dep. Variable: ytrabajocor R-squared: 0.265
Model: OLS Adj. R-squared: 0.265
Method: Least Squares F-statistic: 5151.
Date: Tue, 26 Oct 2021 Prob (F-statistic): 0.00
Time: 11:08:12 Log-Likelihood: -1.1887e+05
No. Observations: 71466 AIC: 2.378e+05
Df Residuals: 71460 BIC: 2.378e+05
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 9.7314 0.030 326.486 0.000 9.673 9.790
esc 0.1522 0.002 100.943 0.000 0.149 0.155
exp 0.0820 0.001 71.234 0.000 0.080 0.084
exp2 -0.0011 1.56e-05 -68.238 0.000 -0.001 -0.001
sexo2 -0.1503 0.021 -7.152 0.000 -0.191 -0.109
exp_sexo -0.0136 0.001 -23.036 0.000 -0.015 -0.012
==============================================================================
Omnibus: 31679.342 Durbin-Watson: 1.650
Prob(Omnibus): 0.000 Jarque-Bera (JB): 197590.127
Skew: -2.052 Prob(JB): 0.00
Kurtosis: 10.037 Cond. No. 1.12e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
const | esc | edad | sexo2 | |
---|---|---|---|---|
0 | 1.0 | 12.0 | 34 | 1.0 |
3 | 1.0 | 15.0 | 45 | 0.0 |
5 | 1.0 | 16.0 | 57 | 0.0 |
7 | 1.0 | 15.0 | 56 | 1.0 |
11 | 1.0 | 14.0 | 54 | 1.0 |
OLS Regression Results
==============================================================================
Dep. Variable: ytrabajocor R-squared: 0.212
Model: OLS Adj. R-squared: 0.212
Method: Least Squares F-statistic: 6423.
Date: Mon, 26 Jul 2021 Prob (F-statistic): 0.00
Time: 23:38:58 Log-Likelihood: -1.2134e+05
No. Observations: 71466 AIC: 2.427e+05
Df Residuals: 71462 BIC: 2.427e+05
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 10.5982 0.028 380.688 0.000 10.544 10.653
esc 0.1696 0.001 124.958 0.000 0.167 0.172
edad 0.0039 0.000 10.597 0.000 0.003 0.005
sexo2 -0.5787 0.010 -58.179 0.000 -0.598 -0.559
==============================================================================
Omnibus: 32963.117 Durbin-Watson: 1.642
Prob(Omnibus): 0.000 Jarque-Bera (JB): 198868.520
Skew: -2.170 Prob(JB): 0.00
Kurtosis: 9.925 Cond. No. 269.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Comparar resultados predichos
Para ver ejemplos de gráficos en modelos de regresión pueden utilizar https://www.statsmodels.org/stable/examples/notebooks/generated/regression_plots.html
Por ejemplo, una visualización de las regresiones parciales se puede obtener mediante plot_partregress_grid
. Esto muestra la relación entre la respuesta y la variable explicativa después de eliminar el efecto de todas las otras variables exógenas.

Para ver información de la regresión con respecto a una variable en específico podemos ocupar plot_regress_exog
. Acá tenemos
Fitted plot
Residuo
Regresión parcial
Component-Component plus Residual (CCPR) plot: nos sirve para ver la relación entre uan variable independiente particular y la variable de respuesta, dado que otras variables indpendientes también están presentes en el modelo: Res+ˆβiXi versus Xi
Ejemplo caso con menos ruido: ejemplo de juguete
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.974
Model: OLS Adj. R-squared: 0.973
Method: Least Squares F-statistic: 1807.
Date: Mon, 26 Jul 2021 Prob (F-statistic): 1.73e-77
Time: 23:31:42 Log-Likelihood: -66.869
No. Observations: 100 AIC: 139.7
Df Residuals: 97 BIC: 147.6
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0904 0.141 -0.641 0.523 -0.370 0.189
x1 1.0865 0.065 16.670 0.000 0.957 1.216
x2 -0.0098 0.006 -1.557 0.123 -0.022 0.003
==============================================================================
Omnibus: 0.781 Durbin-Watson: 2.028
Prob(Omnibus): 0.677 Jarque-Bera (JB): 0.817
Skew: -0.006 Prob(JB): 0.665
Kurtosis: 2.557 Cond. No. 144.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
prstd, iv_l, iv_u = wls_prediction_std(res)
#Retorno:
#-standard error of prediction same length as rows of exog
#-lower und upper confidence bounds
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x, y, 'o', label="data")
ax.plot(x, res.fittedvalues, 'r--.', label="OLS")
ax.plot(x, iv_u, 'r--')
ax.plot(x, iv_l, 'r--')
ax.legend(loc='best');

Modelo Autorregresivo (AR)¶
Podemos estimar un modelo del tipo Yt=β0+β1Yt−1+μt
En python podemos utilizar AutoReg
: https://www.statsmodels.org/stable/examples/notebooks/generated/autoregressions.html
Periodo | PIB | |
---|---|---|
0 | 1986-01-01 | 7899.666547 |
1 | 1986-04-01 | 8220.056395 |
2 | 1986-07-01 | 8355.276664 |
3 | 1986-10-01 | 8555.927503 |
4 | 1987-01-01 | 8626.993492 |
AutoReg Model Results
==============================================================================
Dep. Variable: PIB No. Observations: 141
Model: AutoReg(1) Log Likelihood -1085.062
Method: Conditional MLE S.D. of innovations 562.000
Date: Tue, 27 Jul 2021 AIC 12.706
Time: 08:19:50 BIC 12.769
Sample: 1 HQIC 12.731
141
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 257.5324 123.602 2.084 0.037 15.278 499.787
PIB.L1 0.9985 0.005 203.603 0.000 0.989 1.008
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 1.0015 +0.0000j 1.0015 0.0000
-----------------------------------------------------------------------------
Podemos actualizar el modelo a Yt=β0+β1Yt−1+β2Yt−2+β3Yt−3+β4Yt−4+β5Yt−5+μt
AutoReg Model Results
==============================================================================
Dep. Variable: PIB No. Observations: 141
Model: AutoReg(5) Log Likelihood -1047.475
Method: Conditional MLE S.D. of innovations 535.438
Date: Tue, 27 Jul 2021 AIC 12.669
Time: 08:21:12 BIC 12.819
Sample: 5 HQIC 12.730
141
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 423.0718 133.111 3.178 0.001 162.179 683.965
PIB.L1 0.8942 0.083 10.802 0.000 0.732 1.056
PIB.L2 0.2591 0.123 2.105 0.035 0.018 0.500
PIB.L3 -0.3180 0.120 -2.658 0.008 -0.553 -0.083
PIB.L4 -0.5179 0.223 -2.323 0.020 -0.955 -0.081
PIB.L5 0.6816 0.195 3.488 0.000 0.299 1.065
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 -0.8762 -0.6971j 1.1197 -0.3930
AR.2 -0.8762 +0.6971j 1.1197 0.3930
AR.3 0.7558 -0.7735j 1.0814 -0.1268
AR.4 0.7558 +0.7735j 1.0814 0.1268
AR.5 1.0006 -0.0000j 1.0006 -0.0000
-----------------------------------------------------------------------------
Para un resumen de la visualización de los residuos estandarizados podemos ocupar plot_diagnostics
fig = plt.figure(figsize=(16,9))
fig = res.plot_diagnostics(fig=fig, lags=30)
#Standarized residual:
#-residuo: Valor observado - valor predicho
#-standarized residual: residuo normalizado por el error estandar del modelo (RSE)
#Histogram plus: estima la distribución de los residuos estandarizados. Nos muestra la distribución normal (línea verde) como punto de comparación.
#Normal Q-Q : muestra si el residuo está distribuido normalmente. Lo que se espera es que los puntos se encuentren sobre la línea roja o cercanos a ella.
#Correlograma: nos muestra un gráfico de autocorrelaciones

Autocorrelaciones: https://www.statsmodels.org/stable/generated/statsmodels.graphics.tsaplots.plot_acf.html
Actividad¶
Estime la ecuación de mincer: Ingresoi=β0+β1esci+β2expi+β3exp2i+β4sexoi+μi donde :
ingreso: log(ingreso del trabajo)
esc: escolaridad
exp: experiencia = edad - escolaridad
exp2: expediencia al cuadrado
sexo: variable dicotómica si es hombre (0) o mujer (1).
Estime un modelo AR del tipo Δyt=Δyt−1 donde:
yt es el PIB trimestral en el periodo t.
Δyt es la variación del PIB trimestral en el periodo t.
Δyt−1 es la variación del PIB trimestral en el periodo t-1.