Friday, 29 March 2013

R Statistical Tool Assignment-7

In this class we were introduced to creating 3-dimensional plots of a given data set.

Assignment 1: Create 3 vectors x, y, z of equal length with random values, combine them with T<-cbind(x,y,z), and create a 3-dimensional plot of the result.
 Command:-
> # generating a random sample using the function rnorm. The call is rnorm(n, mean, sd); if mean and sd are not specified they default to 0 and 1 respectively. The function generates n random numbers from a normal distribution; the optional arguments are the mean and the standard deviation.
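Because rnorm() draws are random, the numbers change on every run; setting a seed first makes the sample reproducible. A minimal sketch (the seed value 123 is an arbitrary choice):

> set.seed(123)      # fix the random number generator state so the draws can be reproduced
> rnorm(5)           # five standard normal draws (mean 0, sd 1)
> rnorm(5, 25, 6)    # five draws with mean 25 and sd 6, as used below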

> sample<-rnorm(50,25,6)

> sample 

> x<-sample(sample,10)

> x

> y<-sample(sample,10)

> y

> z<-sample(sample,10)

> z

> T<-cbind(x,y,z)

> T



> library(rgl)   # plot3d() comes from the rgl package

> plot3d(T)

> plot3d(T, col=rainbow(1000))

> # here "col" denotes the color of the data points;


> plot3d(T,col=rainbow(1000),type="s")   # type="s" draws the data points as spheres



Assignment-2: Create 2 random variables and create 3 plots:
> # plots: X-Y and X-Y|Z (introducing a variable z with 5 different categories and cbind-ing it to x and y)

> x<-rnorm(1500,100,10)

> y<-rnorm(1500,85,5)

> z1<-sample(letters,5)

> z2<-sample(z1,1500,replace=TRUE)

> z<-as.factor(z2)

> t<-cbind(x,y,z)   # note: cbind() coerces the factor z to its integer codes; data.frame(x,y,z) would keep it as a factor

> library(ggplot2)   # qplot() comes from the ggplot2 package

> qplot(x,y)


> qplot(x,z)

> qplot(x,z,alpha=I(1/10))   # I() keeps the transparency value 1/10 from being rescaled

> qplot(x,z,geom=c("point","smooth"))

> qplot(x,y,color=z)

> qplot(log(x),log(y),color=z)
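The "X-Y|Z" part of the assignment (x against y conditional on z) can also be drawn as one panel per category using facets; a minimal ggplot2 sketch, offered as an alternative to colouring by z rather than as part of the original commands:

> ggplot(data.frame(x,y,z), aes(x,y,colour=z)) + geom_point() + facet_wrap(~z)   # one scatter panel per level of z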









Friday, 15 March 2013

R Statistical Tool Assignment-6

Topic of Discussion - Introduction to pooled, fixed-effects and random-effects estimation of panel data


Panel data, also known as longitudinal or cross-sectional time-series data, is a data set in which the behaviour of entities is observed across time. These entities could be countries, cities, individuals, companies, etc.
Panel data helps us control for variables we cannot observe or measure, such as cultural factors or differences in business practices across companies.


An example of panel data:


Gasoline


      country    year  lgaspcar    lincomep      lrpmg           lcarpcap
1    AUSTRIA 1960 4.173244 -6.474277 -0.33454761  -9.766840
2    AUSTRIA 1961 4.100989 -6.426006 -0.35132761  -9.608622
3    AUSTRIA 1962 4.073177 -6.407308 -0.37951769  -9.457257
4    AUSTRIA 1963 4.059509 -6.370679 -0.41425139  -9.343155
5    AUSTRIA 1964 4.037689 -6.322247 -0.44533536  -9.237739
6    AUSTRIA 1965 4.033983 -6.294668 -0.49706066  -9.123903
7    AUSTRIA 1966 4.047537 -6.252545 -0.46683773  -9.019822
8    AUSTRIA 1967 4.052911 -6.234581 -0.50588340  -8.934403
9    AUSTRIA 1968 4.045507 -6.206894 -0.52241255  -8.847967
10   AUSTRIA 1969 4.046355 -6.153140 -0.55911051  -8.788686
11   AUSTRIA 1970 4.080888 -6.081712 -0.59656122  -8.728200
12   AUSTRIA 1971 4.106720 -6.043626 -0.65445914  -8.635898
13   AUSTRIA 1972 4.128018 -5.981052 -0.59633184  -8.538338
14   AUSTRIA 1973 4.199381 -5.895153 -0.59444681  -8.487289
15   AUSTRIA 1974 4.018495 -5.852381 -0.46602693  -8.430404
16   AUSTRIA 1975 4.029018 -5.869363 -0.45414221  -8.382815
17   AUSTRIA 1976 3.985412 -5.811703 -0.50008372  -8.322232
18   AUSTRIA 1977 3.931676 -5.833288 -0.42191563  -8.249563
19   AUSTRIA 1978 3.922750 -5.762023 -0.46960312  -8.211041
20   BELGIUM 1960 4.164016 -6.215091 -0.16570961  -9.405527
21   BELGIUM 1961 4.124356 -6.176843 -0.17173098  -9.303149
22   BELGIUM 1962 4.075962 -6.129638 -0.22229138  -9.218070
23   BELGIUM 1963 4.001266 -6.094019 -0.25046225  -9.114932
24   BELGIUM 1964 3.994375 -6.036461 -0.27591057  -9.005491
25   BELGIUM 1965 3.951531 -6.007252 -0.34493695  -8.862581

As you can see, the data has a country/state element in the index, and each country's values are given across a particular time horizon, unlike time-series data, which only has a time element in the index for a single country/state/individual, etc.
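In R the panel structure can be declared explicitly before estimation. A minimal sketch using the Gasoline panel that ships with the plm package (assuming plm is installed):

> library(plm)
> data("Gasoline", package="plm")
> gas<-pdata.frame(Gasoline, index=c("country","year"))   # country is the individual index, year the time index
> pdim(gas)   # reports the number of countries and time periods (here a balanced panel)
> head(gas)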



Class Objective:- Panel Data Analysis 

Panel (data) analysis is a statistical method, widely used in social science, epidemiology, and econometrics, which deals with two-dimensional panel data.[1] The data are usually collected over time and over the same individuals, and then a regression is run over these two dimensions. Multidimensional analysis is an econometric method in which data are collected over more than two dimensions (typically time, individuals, and some third dimension).

A common panel data regression model looks like y_it = a + b*x_it + e_it, where y is the dependent variable, x is the independent variable, a and b are coefficients, and i and t are indices for individuals and time. The error term e_it is very important in this analysis: assumptions about it determine whether we speak of fixed effects or random effects. In a fixed effects model, the error term is assumed to vary non-stochastically over i or t, making the fixed effects model analogous to a dummy variable model in one dimension. In a random effects model, the error term is assumed to vary stochastically over i or t, requiring special treatment of the error variance matrix.
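The "analogous to a dummy variable model" point can be checked directly: the within (fixed-effects) estimator and an OLS regression with one dummy per individual recover the same slope coefficients. A minimal sketch, continuing with the Gasoline data loaded above (the variable names belong to that data set and are used here only for illustration):

> fixed.gas<-plm(lgaspcar~lincomep+lrpmg+lcarpcap, data=Gasoline, index=c("country","year"), model="within")
> lsdv.gas<-lm(lgaspcar~lincomep+lrpmg+lcarpcap+factor(country), data=Gasoline)   # one intercept (dummy) per country
> coef(fixed.gas)       # slopes from the within (fixed-effects) estimator
> coef(lsdv.gas)[2:4]   # the same slopes recovered by the dummy-variable regression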

Important Facts.....
1- The key assumption in the pooled model is that there are no unique attributes of the individuals/countries/objects over the measurement set and no universal effects across time.

2- In the fixed-effects model there is a unique attribute for each individual that is not random and does not vary across time. This model is suitable when we want to draw inferences about the particular individuals observed.

3- In the random-effects model there are unique, time-constant attributes of individuals/countries that are the result of random variation and do not correlate with the individual regressors. This model is suitable if we want to draw inferences about the entire population and not only the sample.

The given problem uses the panel data set "Produc", which is bundled with the plm package.
We have to find which of the models (pooled / fixed / random) is applicable to this panel data.



Tools: pFtest(fixed,pool) ; plmtest(pool) ; phtest( fixed, random) 

Commands :- 

> data("Produc", package="plm")
> head(Produc)
> pool<-plm(log(pcap)~log(hwy)+log(water)+log(util)+log(pc)+log(gsp)+log(emp)+log(unemp), data=Produc,model=("pooling"),index=c("state","year"))
> summary(pool)
# the pooling model is the ordinary least squares (OLS) regression model.
# log(pcap) is the outcome variable; the others are predictor variables.
# index=c("state","year") is the panel setting for the analysis.
# Pr(>|t|): the two-tail p-value tests the hypothesis that a coefficient is different from 0. To reject the null, the p-value has to be lower than 0.05 (or another chosen alpha, e.g. 0.10); if it is, the variable has a significant influence on the dependent variable (y).
# if the p-value of the overall F-statistic is < 0.05 then the model is ok; this tests whether all the coefficients in the model are jointly different from zero.
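Since the pooled model is just OLS on the stacked data, the same coefficients should come out of a plain lm() fit on the same formula; a minimal cross-check (not part of the original commands):

> ols<-lm(log(pcap)~log(hwy)+log(water)+log(util)+log(pc)+log(gsp)+log(emp)+log(unemp), data=Produc)
> coef(ols)   # identical to the coefficients reported by summary(pool)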

>fixed<-plm(log(pcap)~log(hwy)+log(water)+log(util)+log(pc)+log(gsp)+log(emp)+log(unemp), data=Produc,model=("within"),index=c("state","year"))

> summary(fixed)

> random<-plm(log(pcap)~log(hwy)+log(water)+log(util)+log(pc)+log(gsp)+log(emp)+log(unemp), data=Produc,model=("random"),index=c("state","year"))
> summary(random)

> pFtest(fixed,pool)   # F test for individual effects; the null hypothesis is the pooled (OLS) model, so a p-value < 0.05 favours the fixed-effects model

> plmtest(pool)
# the plmtest, or Lagrange multiplier test, helps us decide between the random-effects regression and the simple OLS regression. The null hypothesis states that there is no variance across entities, i.e. no significant difference across units and hence no panel effect. Here the p-value comes out below 2.26*10^-16, far smaller than our 0.05 criterion, so we reject the null hypothesis and conclude that the random-effects regression suits this model better than pooled OLS.

> phtest(fixed,random) # to decide between the fixed and random models we run the Hausman test, phtest(fixed, random), whose null hypothesis is that there is no correlation between the errors and the regressors, i.e. the null hypothesis favours the random model. If the p-value from the test is significant (< 0.05) we go for the fixed model; otherwise the random model.












Wednesday, 13 February 2013

R Statistical Tool Assignment-5

This class on the 12th of February was mainly conceptual, covering how to take a time-series data set and:

- Find its returns;
- Conduct an ACF plot to check the stationarity of the data;
- Analyse the data through the Augmented Dickey-Fuller test;
- Calculate the historical volatility and standard deviation of a data set;
- Standardise a given data set.

Assignment: 
Create log returns data and calculate its historical volatility.
Formulae:
1) (log S_t - log S_t-1) / log S_t-1
OR
2) log((S_t - S_t-1) / S_t-1)
Create an ACF plot for the log returns, run the ADF test and analyse the results.
Data is as follows:
NSE Index –Jan 2012 –Jan 2013
NIFTY data –Closing prices

Commands:-

> niftychart<-read.csv(file.choose(),header=T)
> closingval<-niftychart$Close
> closingval.ts<-ts(closingval,frequency=252)
> plot(log( closingval.ts))
> minusone.ts<-lag(closingval.ts,k=-1)   # lower-case k; shifts the series so each point lines up with the previous day's value
> plot(log( minusone.ts))
> z<-log(closingval.ts)-log(minusone.ts)
> z

> returns<-z/log(minusone.ts)
> plot(returns,main="Plot of Log Returns;CNX NSE Nifty Jan-2012 to Jan-2013" )
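For reference, the more usual definition of the log return, log(S_t/S_t-1) = log S_t - log S_t-1, can be computed in one step with diff(); a minimal sketch (this is the series z above, without the division by log S_t-1):

> logret<-diff(log(closingval.ts))   # log(S_t) - log(S_t-1) for every consecutive pair of closing values
> plot(logret, main="Log returns computed with diff(log(.))")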



> acf(returns,main=" The Auto Correlation Plot;   Dotted line shows 95% confidence interval ")


The ACF plot shows that the correlations lie within the 95% confidence bands, so there is a fairly good case for considering the data "STATIONARY".


> library(tseries)   # adf.test() comes from the tseries package

> adf.test(returns)



Now, with the ADF test and its p-value, we can confirm that the data is "stationary": the null hypothesis of the test is that the series has a unit root (is non-stationary), so a small p-value lets us reject it.

# Now calculating the Historical volatility of the Data

> T<-252^0.5
> histvolatility<-sd(returns)*T   # annualised volatility: the standard deviation of the daily returns scaled by sqrt(252) trading days
> histvolatility


Tuesday, 5 February 2013

R Statistical Tool Assignment-4

This class on the 5th of February revolved around reading a particular data set, converting it into a time-series format and then calculating the returns from it.
Data set: CNX Mid-cap Index downloaded from NSE, August 2012 - January 2013, 10th reading to 95th reading.


Commands:-
> z<-read.csv(file.choose(),header=T)
> Close<-z$Close
> Close
> Close.ts<-ts(Close)
> Close.ts<-ts(Close,deltat= 1/252)
> z1<-ts(data=Close.ts[10:95],frequency=1,deltat=1/252)
> z1.ts<-ts(z1)
> z1.ts
> z1.diff<-diff(z1)
> z2<-lag(z1.ts,k=-1)   # lower-case k; lines each point up with the previous day's value
> Returns<-z1.diff/z2
> plot(Returns,main=" Returns from 10 th to 95th day of NSE Mid-cap Index ")
> z3<-cbind(z1.ts,z1.diff,Returns)
> plot(z3,main=" Data from 10th-95th day ; Difference ; Returns")





Assignment:-2

Question: Data for observations 1-700 is available; predict observations 701-850 using GLM estimation with LOGIT analysis.

Commands:-

> z<-read.csv(file.choose(),header=T)
> z1<-z[1:700,1:9]
> head(z1)
> z1$ed<-factor(z1$ed)
> z1.est<-glm(default ~ age + ed + employ + address + income + debtinc + creddebt + othedebt, data=z1, family ="binomial")
> summary(z1.est)
> forecast<-z[701:850,1:8]
> forecast$ed<-factor(forecast$ed)
> forecast$probability<-predict(z1.est,newdata=forecast,type="response")
> head(forecast)
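To turn the fitted probabilities into predicted default / no-default labels, a cut-off can be applied; a minimal sketch using a 0.5 threshold (both the threshold and the new column name are illustrative choices, not part of the original assignment):

> forecast$predicted<-ifelse(forecast$probability>0.5,1,0)   # classify as default (1) when the predicted probability exceeds 0.5
> table(forecast$predicted)   # counts of predicted defaults and non-defaults among observations 701-850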















Wednesday, 23 January 2013

R Statistical Tool Assignment-3

Purpose:
The class focused on using regression analysis on a data set. The user needs to identify whether a linear model can be fitted at all, which means checking for non-linearity. The importance of the QQ plot is also shown, from the point of view of finding the range of the independent variable over which the regression analysis can be done.

Assignment 1: Using mileage groove data,   fit 'lm' and comment on the applicability of 'lm'.

>Data<-read.csv(file.choose(),header=T)
>Data
>z1<-Data[,1]
>z2<-Data[,2]
>reg1<-lm(z1~z2)
>reg1

Checking the residuals for a normal distribution pattern...
>res<-resid(reg1)
>res

Plotting the residuals vs the independent variable

>plot(z2,res)

Now the QQ plot 
> qqnorm(res)
> qqline(res)

Verdict: As the plot of the residuals versus the independent variable shows a parabolic pattern, a linear regression cannot be fitted to this data set; the underlying relationship is non-linear.


Assignment 2: The alpha-pluto Data

>Data<-read.csv(file.choose( ), header=T)
>Data
>reg1<-lm(Data[,2]~Data[,1])
>res<-resid(reg1)
>res
>plot(Data[,1],res)

Now plotting the standardised residuals vs the independent variable

>stdres<-rstandard(reg1)
>stdres
>plot(Data[,1],stdres)
>qqnorm(stdres)
>qqline(stdres)

Assignment 3: Hypothesis testing using Anova

>Data<-read.csv(file.choose( ), header=T)
>Data
>Data.anova<-aov(Data[,2]~Data[,1])
>summary(Data.anova)


The p-value comes out to be 0.687, which is greater than 0.05, so we do not have sufficient evidence to reject the null hypothesis.





Tuesday, 15 January 2013

R Statistical Tool Assignment-2

Second day - matrix operations: 1- Transpose; 2- Inverse; 3- Merging two columns from two different matrices; 4- Regression using data from NSE.

Q1- Create two matrices of size 3 x 3 and select column 1 from the first matrix and column 3 from the second matrix. After selecting the columns into objects, say z1 and z2, merge these two columns using cbind to create a new matrix.

> matrix1<-c(1:9)
> matrix2<-c(10:18)
> dim(matrix1)<-c(3,3)
> dim(matrix2)<-c(3,3)
> z1<-matrix1[,1]
> z2<-matrix2[,3]   # column 3 of the second matrix, as asked in the question
> z<-cbind(z1,z2)
> z

Q2- Multiply two matrices 

> multipliedmatrix<-matrix1 %*% matrix2
> multipliedmatrix
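The transpose and inverse listed at the top of this post have no commands of their own, so here is a minimal sketch using t() and solve() (matrix1 built from 1:9 is singular, so a different invertible matrix is used for the inverse):

> t(matrix1)                       # transpose: rows become columns
> m<-matrix(c(2,1,1,3),nrow=2)     # a small invertible 2 x 2 matrix
> solve(m)                         # matrix inverse
> m %*% solve(m)                   # check: multiplying by the inverse gives the identity matrix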

Q3-Read historical data of indices from NSE for the period 1st Dec 2012 to 31st Dec 2012. Find regression and residuals

> z<-read.csv(file.choose(),header=T)
> high<-z[,3]
> open<-z[,2]
> low<-z[,4]
> z1<-cbind(high,low)
> reg1<-lm(high~open)
> reg1
> resid(reg1)   # the residuals asked for in the question

Screenshot of the above three commands:










Q4- Create a normal distribution and plot it 

>x<-rnorm(50,0,1)
> y<-dnorm(x)
> plot(x,y)
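Because x is not sorted, plot(x,y) shows the bell shape as scattered points; to draw it as a smooth curve, the density can be plotted directly with curve() (an optional extra, not in the original commands):

> curve(dnorm(x), from=-4, to=4, main="Standard normal density")   # dnorm evaluated over a fine grid of x values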





Tuesday, 8 January 2013

R Statistical Tool Introduction Assignment-1

Welcome to the Introduction of the "R x64 2.15.2" Data Analysis and Statistical tool

Assignment-1 
Creating a histogram from the given Data set:
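No commands are shown for this one; a minimal sketch, assuming the data has been read into z and the column of interest is the third column:

> z<-read.csv(file.choose(),header=T)
> hist(z[,3], main="Histogram of the given data")   # frequency histogram of the chosen column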

Assignment-2
Creating a graph with both lines and points of the S&P CNX Nifty High from October 2012 to 8th January 2013

 
Assignment-3
Creating a graph with both lines and points of the S&P CNX Nifty High and Low from October 2012 to 8th January 2013
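For Assignments 2 and 3 the graphs can be drawn with type="b" (both lines and points); a minimal sketch, assuming the High and Low values sit in columns 3 and 4 of z:

> plot(z[,3], type="b", main="Nifty High, Oct 2012 - 8 Jan 2013")                 # lines and points for the High series
> plot(z[,3], type="b", ylim=range(c(z[,3],z[,4])), main="Nifty High and Low")    # High series, with room for Low
> lines(z[,4], type="b", col="red")                                               # add the Low series in red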

Assignment-4
The command for finding the difference between the highest and the lowest value reached by the S&P CNX Nifty 50 from October 2012 to 8th January 2013

> z<-read.csv(file.choose(),header=T)   # the Nifty data must be read in first
> high<-z[,3]
> low<-z[,4]
> mergedata<-c(high,low)       # "merge-data" is not a valid R object name, so the hyphen is dropped
> range(mergedata)             # the minimum and maximum values
> diff(range(mergedata))       # the difference between the highest and the lowest value