Passing thoughts ..: R Statistical Tool Assignment-6

Topic of Discussion - Introduction to Pooled, Fixed and Random Effects Model Estimation for Panel Data

Panel data, also known as longitudinal or cross-sectional time-series data, is a data set in which the behavior of entities is observed across time. These entities could be countries, cities, individuals, companies, etc.
Panel data helps us control for variables that we cannot observe or measure, such as cultural factors or differences in business practices across companies, as long as they stay constant over time.

An example of a panel data set:

Gasoline

country year lgaspcar lincomep lrpmg lcarpcap
1 AUSTRIA 1960 4.173244 -6.474277 -0.33454761 -9.766840
2 AUSTRIA 1961 4.100989 -6.426006 -0.35132761 -9.608622
3 AUSTRIA 1962 4.073177 -6.407308 -0.37951769 -9.457257
4 AUSTRIA 1963 4.059509 -6.370679 -0.41425139 -9.343155
5 AUSTRIA 1964 4.037689 -6.322247 -0.44533536 -9.237739
6 AUSTRIA 1965 4.033983 -6.294668 -0.49706066 -9.123903
7 AUSTRIA 1966 4.047537 -6.252545 -0.46683773 -9.019822
8 AUSTRIA 1967 4.052911 -6.234581 -0.50588340 -8.934403
9 AUSTRIA 1968 4.045507 -6.206894 -0.52241255 -8.847967
10 AUSTRIA 1969 4.046355 -6.153140 -0.55911051 -8.788686
11 AUSTRIA 1970 4.080888 -6.081712 -0.59656122 -8.728200
12 AUSTRIA 1971 4.106720 -6.043626 -0.65445914 -8.635898
13 AUSTRIA 1972 4.128018 -5.981052 -0.59633184 -8.538338
14 AUSTRIA 1973 4.199381 -5.895153 -0.59444681 -8.487289
15 AUSTRIA 1974 4.018495 -5.852381 -0.46602693 -8.430404
16 AUSTRIA 1975 4.029018 -5.869363 -0.45414221 -8.382815
17 AUSTRIA 1976 3.985412 -5.811703 -0.50008372 -8.322232
18 AUSTRIA 1977 3.931676 -5.833288 -0.42191563 -8.249563
19 AUSTRIA 1978 3.922750 -5.762023 -0.46960312 -8.211041
20 BELGIUM 1960 4.164016 -6.215091 -0.16570961 -9.405527
21 BELGIUM 1961 4.124356 -6.176843 -0.17173098 -9.303149
22 BELGIUM 1962 4.075962 -6.129638 -0.22229138 -9.218070
23 BELGIUM 1963 4.001266 -6.094019 -0.25046225 -9.114932
24 BELGIUM 1964 3.994375 -6.036461 -0.27591057 -9.005491
25 BELGIUM 1965 3.951531 -6.007252 -0.34493695 -8.862581

As you can see, each row is indexed by a country/state element as well as a time element, so every country is followed across a time horizon. A time series, by contrast, has only a time element in the index, for one particular country, state, individual, etc.
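To make the two-part index concrete, here is a minimal sketch in base R. The toy data frame `panel` and its values are simulated for illustration and are not the Gasoline data; the column name `lgaspcar` is only borrowed from the table above.

```r
# Hypothetical toy panel: 3 countries observed over 4 years (simulated values)
set.seed(1)
panel <- expand.grid(year = 1960:1963,
                     country = c("AUSTRIA", "BELGIUM", "CANADA"),
                     stringsAsFactors = FALSE)
panel$lgaspcar <- round(rnorm(nrow(panel), mean = 4, sd = 0.1), 3)

# Each row is identified by the (country, year) pair, not by year alone
obs_per_country <- table(panel$country)

# Balanced panel: every country has the same number of yearly observations
is_balanced <- length(unique(obs_per_country)) == 1

# No (country, year) pair may appear twice
no_duplicates <- !any(duplicated(panel[, c("country", "year")]))

print(obs_per_country)
print(is_balanced)
print(no_duplicates)
```

The same two checks are a reasonable first step on any real panel before a model is estimated.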

Class Objective:- Panel Data Analysis

Panel (data) analysis is a statistical method, widely used in social science, epidemiology, and econometrics, which deals with two-dimensional panel data.^[1] The data are usually collected over time and over the same individuals, and then a regression is run over these two dimensions. Multidimensional analysis is an econometric method in which data are collected over more than two dimensions (typically time, individuals, and some third dimension).

A common panel data regression model looks like $y_{it}=a+bx_{it}+\epsilon_{it}$, where y is the dependent variable, x is the independent variable, a and b are coefficients, and i and t are indices for individuals and time. The error term is very important in this analysis. Assumptions about the error term determine whether we speak of fixed effects or random effects. In a fixed effects model, the error term is assumed to vary non-stochastically over i or t, making the fixed effects model analogous to a dummy variable model in one dimension. In a random effects model, the error term is assumed to vary stochastically over i or t, requiring special treatment of the error variance matrix.
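The link between fixed effects and a dummy variable model can be checked numerically. The sketch below uses base R on simulated data; the entity intercepts `alpha`, the true slope of 1.5 and the noise level are all assumptions made up for the illustration. It compares the slope from a regression with one dummy per entity against the slope from a regression on entity-demeaned data.

```r
set.seed(42)
n_id <- 5                                 # number of entities (assumed)
n_t  <- 10                                # time periods per entity (assumed)
id    <- rep(1:n_id, each = n_t)
alpha <- c(-2, -1, 0, 1, 2)[id]           # entity-specific intercepts
x     <- rnorm(n_id * n_t) + 0.5 * alpha  # regressor correlated with the entity effect
y     <- alpha + 1.5 * x + rnorm(n_id * n_t, sd = 0.3)

# (a) Dummy variable regression: one intercept shift per entity
b_dummy <- coef(lm(y ~ x + factor(id)))["x"]

# (b) Within transformation: subtract each entity's mean, then run OLS without intercept
x_w <- x - ave(x, id)
y_w <- y - ave(y, id)
b_within <- coef(lm(y_w ~ x_w - 1))["x_w"]

# (c) Pooled OLS ignores the entity effect entirely
b_pool <- coef(lm(y ~ x))["x"]

cat("dummy:", b_dummy, " within:", b_within, " pooled:", b_pool, "\n")
```

The dummy variable and within slopes agree up to floating-point error, while the pooled slope differs because x is correlated with the omitted entity effect.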

Important Facts:

1- The key assumption of the pooled model is that individuals (countries, or any other unit in the measurement set) have no unique attributes and that there are no universal effects over time.

2- In the fixed model, each individual has a unique attribute that is not random and does not vary across time. This model is suitable when we want to draw inferences about the particular individuals in the sample.

3- In the random model, individuals/countries have unique, time-constant attributes that are the result of random variation and are not correlated with the regressors. This model is suitable when we want to draw inferences about the entire population and not only the sample.
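One way to write the three cases side by side, reusing the notation of the regression model above. The symbols $a_i$ (an intercept specific to individual i) and $u_i$ (a random individual effect) are notation introduced here for illustration and do not appear in the original text.

```latex
\begin{aligned}
\text{Pooled:} \quad & y_{it} = a + b x_{it} + \epsilon_{it} \\
\text{Fixed:}  \quad & y_{it} = a_i + b x_{it} + \epsilon_{it} \\
\text{Random:} \quad & y_{it} = a + b x_{it} + u_i + \epsilon_{it},
                       \qquad \operatorname{Cov}(u_i, x_{it}) = 0
\end{aligned}
```

The pooled model is the special case in which all the $a_i$ are equal (or, equivalently, the variance of $u_i$ is zero).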

The given problem uses the panel data set named "Produc", which ships with the plm package.

We have to find which of the models (pooled, fixed or random) is applicable to this panel data.

Tools: pFtest(fixed, pool) ; plmtest(pool) ; phtest(fixed, random)

Commands :-

> library(plm)
> data("Produc", package="plm")
> head(Produc)

> pool <- plm(log(pcap)~log(hwy)+log(water)+log(util)+log(pc)+log(gsp)+log(emp)+log(unemp), data=Produc, model="pooling", index=c("state","year"))
> summary(pool)

# the pooling model is the regular OLS (ordinary least squares) regression model.

# log(pcap) is the outcome variable while the others are predictor variables.

# index=c("state","year") is the panel setting for the analysis: state identifies the entity and year the time period.

# Pr(>|t|) = two-tailed p-values testing the null hypothesis that each coefficient is equal to 0. To reject this, the p-value has to be lower than 0.05 (a 95% confidence level; you could also choose an alpha of 0.10). If this is the case, you can say that the variable has a significant influence on your dependent variable (y).

# if the p-value of the F-statistic at the bottom of the summary is <0.05 then our model is ok. This is a test of whether the coefficients in the model are jointly different from zero (the null hypothesis is that all of them are zero).
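As a hedged illustration of the two points above, the sketch below refits the pooled model, compares its coefficients with a plain `lm()` fit of the same formula, and pulls the p-value column out of the summary table. It assumes the plm package is installed and skips everything otherwise; the names `f`, `ols` and `p_values` are introduced here for the example.

```r
if (requireNamespace("plm", quietly = TRUE)) {
  data("Produc", package = "plm")

  # Same formula as in the commands above
  f <- log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) +
    log(gsp) + log(emp) + log(unemp)

  pool <- plm::plm(f, data = Produc, model = "pooling",
                   index = c("state", "year"))
  ols  <- lm(f, data = Produc)

  # TRUE if the pooling estimates coincide with ordinary least squares
  same_coefs <- isTRUE(all.equal(unname(coef(pool)), unname(coef(ols))))
  print(same_coefs)

  # Fourth column of the coefficient table holds the two-tailed p-values
  p_values <- summary(pool)$coefficients[, 4]

  # Names of the terms that are significant at the 5% level
  print(names(p_values)[p_values < 0.05])
}
```

Selecting the p-value column by position rather than by name is a deliberate choice here, since the column label can differ between model types.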

> fixed <- plm(log(pcap)~log(hwy)+log(water)+log(util)+log(pc)+log(gsp)+log(emp)+log(unemp), data=Produc, model="within", index=c("state","year"))

> summary(fixed)

> random <- plm(log(pcap)~log(hwy)+log(water)+log(util)+log(pc)+log(gsp)+log(emp)+log(unemp), data=Produc, model="random", index=c("state","year"))

> summary(random)

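> pFtest(fixed,pool)

# the pFtest is an F test for individual effects and helps us decide between the fixed effects model and the pooled OLS model. The null hypothesis is that the pooled model is adequate. If the p-value is <0.05, the fixed effects model is the better choice.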

> plmtest(pool)

# plmtest, the Lagrange Multiplier test, helps us decide between a random effects regression and the simple OLS regression. The null hypothesis states that the variance across entities is zero, i.e. there is no significant difference across units (no panel effect). Here the p-value comes out as <2.26*10^-16, which is far below our 0.05 criterion. So in this case we can reject the null hypothesis and say that the random effects regression is better suited to this data than pooled OLS.

> phtest(fixed,random)

# to decide between the fixed and random models we run the Hausman test, phtest(fixed, random). The null hypothesis is that there is no correlation between the unique errors and the regressors, so the null hypothesis favors the random model. If the p-value that comes out of the test is significant, i.e. <0.05, then we go for the fixed model, or else the random model.
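The three tests can be chained into one rough decision rule. The helper `choose_panel_model()` below is hypothetical (it is not part of plm) and encodes a simplified reading of the steps above: stay with pooling only if neither the F test nor the LM test rejects, otherwise let the Hausman test pick between fixed and random. The plm part runs only if the package is installed.

```r
# Hypothetical helper: maps three p-values to a model name (simplified rule)
choose_panel_model <- function(p_f, p_lm, p_hausman, alpha = 0.05) {
  # Neither test finds individual effects: pooled OLS is enough
  if (p_f >= alpha && p_lm >= alpha) return("pooling")
  # Hausman rejects: effects are correlated with the regressors, use fixed effects
  if (p_hausman < alpha) return("within")
  # Otherwise random effects is retained
  "random"
}

if (requireNamespace("plm", quietly = TRUE)) {
  data("Produc", package = "plm")
  f <- log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) +
    log(gsp) + log(emp) + log(unemp)
  idx <- c("state", "year")

  pool   <- plm::plm(f, data = Produc, model = "pooling", index = idx)
  fixed  <- plm::plm(f, data = Produc, model = "within",  index = idx)
  random <- plm::plm(f, data = Produc, model = "random",  index = idx)

  # Each test returns an "htest" object whose p-value is stored in $p.value
  choice <- choose_panel_model(
    p_f       = plm::pFtest(fixed, pool)$p.value,
    p_lm      = plm::plmtest(pool)$p.value,
    p_hausman = plm::phtest(fixed, random)$p.value
  )
  print(choice)
}
```

Note that `plmtest()` uses the Honda variant of the LM test by default; passing `type = "bp"` requests the Breusch-Pagan form. The rule is only a sketch, and borderline p-values still call for judgment.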