Regression

In a multivariate distribution there is more than one variable, and relationships may exist among the paired variables. If the variables are related, they are said to be correlated, and it then becomes possible to predict the value of one variable from the others. This gives rise to what is called regression analysis.

Regression Definition

Regression is a measure of the average relationship between two or more variables, expressed in the original units of the data.
Regression analysis is a statistical device. In regression analysis, the independent variable is also known as the “regressor”, “predictor” or “explanatory variable”, and the dependent variable is known as the “regressand” or “explained variable”.

Regression Line of Y on X

Regression line of Y on X: It gives the most probable value of Y for given values of X.
The regression equation of Y on X is given by
Yc = a + bX
Here a and b are constants, called the parameters, which completely determine the position of the line.

'a' - the level of the fitted line.
'b' - the degree of the slope of the line, which is the regression coefficient.

The symbol 'Yc' stands for the value of Y computed from the relationship for a given X.

The values of the parameters 'a' and 'b' are obtained by the method of least squares and the line is completely determined.

To determine the values of 'a' and 'b', the following two normal equations are solved simultaneously.

We obtain the first by taking the sum on both sides of the regression equation over the n observations:
$\sum_{i}$ y = a $\sum_{i}$ 1 + b $\sum_{i}$ x
$\sum_{i}$ y = na + b $\sum_{i}$ x

Multiplying both sides of the regression equation by x and then summing gives the second:

 $\sum_{i}$ xy = a $\sum_{i}$ x + b $\sum_{i}$  $x^{2}$

These equations are usually called the normal equations. In them, $\sum_{i}$ x, $\sum_{i}$ y, $\sum_{i}$ xy and $\sum_{i} x^{2}$ denote totals computed from the observed pairs of the two variables X and Y to which the least-squares line is to be fitted, and 'n' is the number of observed pairs.

Regression Line of X on Y

Regression line of X on Y: It gives the most probable value of X for given values of Y.
The regression equation of X on Y is given by
Xc = $\alpha$ + $\beta$Y
Here $\alpha$ and $\beta$ are constants, called the parameters, which completely determine the position of the line.

'$\alpha$' - the level of the fitted line.
'$\beta$' - the degree of the slope of the line, which is the regression coefficient.

The symbol '$X_{c}$' stands for the value of X computed from the relationship for a given Y.

The values of the parameters '$\alpha$' and '$\beta$' are obtained by the method of least squares and the line is completely determined.

Regression Formula

The normal equations are obtained in the same way. Summing both sides of the regression equation over the n observations:

$\sum_{i} x$ = $\alpha\sum_{i}1$ + $\beta\sum_{i}y$

$\sum_{i} x$ = $n\alpha + \beta\sum_{i}y$

Multiplying both sides of the regression equation by y and then summing:

$\sum_{i}$ xy = $\alpha\sum_{i}y$ + $\beta \sum_{i} y^{2}$

These equations are called the normal equations. In them, $\sum_{i}$x, $\sum_{i}$y, $\sum_{i}$ xy and $\sum_{i} y^{2}$ denote totals computed from the observed pairs of the two variables X and Y to which the least-squares line is to be fitted, and 'n' is the number of observed pairs.

Regression Analysis Example

Let's consider an example of regression and see how the data can be interpreted.

Consider the following pairs of data. Using them, form the two regression lines corresponding to these variables:
(23,69), (29,95), (29,102), (35,118), (42,126), (46,125), (50,138), (54,178), (64,156), (66,184), (76,176), (78,225)

Solution:
Let's take the first term of each ordered pair as X and the second as Y.
The regression equation of Y on X is given by
$Y_{c}$ = a + bX
So the computation will be as follows

  No      X      Y   $X^{2}$      XY
   1     23     69      529     1587
   2     29     95      841     2755
   3     29    102      841     2958
   4     35    118     1225     4130
   5     42    126     1764     5292
   6     46    125     2116     5750
   7     50    138     2500     6900
   8     54    178     2916     9612
   9     64    156     4096     9984
  10     66    184     4356    12144
  11     76    176     5776    13376
  12     78    225     6084    17550
 Total  592   1692    33044    92038

Plugging these totals into the normal equations:
$\sum_{i}$ y = na + b $\sum_{i}$ x             =>   1692 = 12a + 592b
$\sum_{i}$xy = a $\sum_{i}$ x + b $\sum_{i} x^{2}$   =>   92038 = 592a + 33044b
Solving them, we get
a = 30.888
b = 2.232
So the regression line of Y on X is given by Y = 30.888 + 2.232X
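The normal-equation arithmetic above can be checked in a few lines of Python (a sketch, not part of the original solution). One rounding note: solving the system exactly gives a ≈ 30.91; the 30.888 in the text comes from rounding b to 2.232 before computing a.

```python
X = [23, 29, 29, 35, 42, 46, 50, 54, 64, 66, 76, 78]
Y = [69, 95, 102, 118, 126, 125, 138, 178, 156, 184, 176, 225]

n = len(X)
sx, sy = sum(X), sum(Y)                     # 592, 1692
sxy = sum(x * y for x, y in zip(X, Y))      # 92038
sxx = sum(x * x for x in X)                 # 33044

# Solve  sy = n*a + b*sx  and  sxy = a*sx + b*sxx  (Cramer's rule).
det = n * sxx - sx * sx
a = (sy * sxx - sx * sxy) / det
b = (n * sxy - sx * sy) / det
print(a, b)
```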

Next, the regression equation of X on Y is given by
$X_{c}$ = $\alpha$ + $\beta$Y
So the computation will be as follows

  No      X      Y   $Y^{2}$      XY
   1     23     69     4761     1587
   2     29     95     9025     2755
   3     29    102    10404     2958
   4     35    118    13924     4130
   5     42    126    15876     5292
   6     46    125    15625     5750
   7     50    138    19044     6900
   8     54    178    31684     9612
   9     64    156    24336     9984
  10     66    184    33856    12144
  11     76    176    30976    13376
  12     78    225    50625    17550
 Total  592   1692   260136    92038

Plugging these totals into the normal equations:
$\sum_{i}$x = n$\alpha$ + $\beta \sum_{i}$ y             =>   592 = 12$\alpha$ + 1692$\beta$
$\sum_{i}$xy = $\alpha \sum_{i}$ y + $\beta \sum_{i} y^{2}$  =>   92038 = 1692$\alpha$ + 260136$\beta$
Solving them, we get
$\alpha$ = -6.677
$\beta$ = 0.397
So the regression line of X on Y is given by X = -6.677 + 0.397Y
So we get the two regression equations
Y = 30.888 + 2.232X
X = -6.677 + 0.397Y
Rewriting the second equation, 0.397Y = X + 6.677, that is, Y = 2.519X + 16.82.
Solving the two equations simultaneously gives their point of intersection:
X = 49.04
Y = 140.35
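As a quick check, substituting one fitted line into the other reproduces the intersection point quoted above; the two regression lines always intersect (up to rounding of the coefficients) at the mean point $(\bar{x}, \bar{y})$ = (49.33, 141).

```python
a1, b1 = 30.888, 2.232       # Y on X:  Y = a1 + b1*X
a2, b2 = -6.677, 0.397       # X on Y:  X = a2 + b2*Y

# Substitute X = a2 + b2*Y into Y = a1 + b1*X and solve for Y.
y = (a1 + b1 * a2) / (1 - b1 * b2)
x = a2 + b2 * y
print(x, y)                  # close to the mean point (49.33, 141)
```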
Hence, the graph of the original data, with the two fitted regression lines, is as follows:

(Figure: Regression Analysis Example)

Stepwise Regression

It is a method in which variables are added and removed in order to identify a useful subset of the predictors, since the dependent variable may depend on many independent variables.

Variables are added and removed sequentially, and at each step the model is re-assessed. If adding a variable contributes to the model, it is retained, but all the other variables already in the model are then re-tested to check whether they still contribute; any that no longer contribute are removed. In the end, the model contains the smallest possible set of useful predictor variables.
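The add-and-reassess procedure can be sketched as forward selection. This is a minimal illustration, not a full stepwise implementation: the function names and the toy data are ours, and NumPy's least-squares solver stands in for the regression fit at each step.

```python
import numpy as np

def rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

def forward_stepwise(predictors, y, tol=1e-6):
    """Greedily add the predictor that most reduces the RSS; stop when
    no remaining candidate gives a meaningful improvement."""
    chosen, remaining = [], dict(predictors)
    best = rss([], y)
    while remaining:
        trial = {name: rss([predictors[c] for c in chosen] + [col], y)
                 for name, col in remaining.items()}
        name = min(trial, key=trial.get)
        if best - trial[name] <= tol:        # no real improvement: stop
            break
        chosen.append(name)
        best = trial[name]
        del remaining[name]
    return chosen

# Made-up data: y depends on x1 and x2 only; "junk" is an irrelevant column.
x1 = np.arange(1.0, 13.0)
x2 = x1 ** 2
junk = np.array([0.3, -1.1, 2.2, 0.7, -0.4, 1.5,
                 -2.0, 0.9, 1.1, -0.6, 0.2, -1.3])
y = 5.0 + 2.0 * x1 + 3.0 * x2

selected = forward_stepwise({"x1": x1, "x2": x2, "junk": junk}, y)
print(selected)
```

The irrelevant column is never retained, so the final model contains only the useful predictors.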

Logistic Regression

It can be defined as a statistical technique in which the dependent variable is dichotomous, that is, binary.
Binary and ordinal responses arise in many fields of study.
Logistic regression is mostly used to investigate the relationship between such responses and a set of explanatory variables.
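A hedged sketch of the idea for one explanatory variable: the model P(y = 1 | x) = 1/(1 + exp(-(b0 + b1·x))) is fitted here by plain gradient ascent on the log-likelihood. The data and names are made up for illustration; real analyses use a statistics package.

```python
import math

def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1*x)))."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p                  # gradient of the log-likelihood
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

hours = [1, 2, 3, 4, 5, 6, 7, 8]         # explanatory variable (made up)
passed = [0, 0, 0, 1, 0, 1, 1, 1]        # dichotomous (binary) response
b0, b1 = fit_logistic(hours, passed)

def prob(x):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
```

The fitted slope is positive, so the predicted probability of the binary outcome rises with x.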

Poisson Regression

It is a type of regression model which assumes that its outcome variable follows a Poisson distribution, that is, the response variable is Poisson-distributed.

These models are mostly used to predict rate or count variables. Poisson regression is a natural choice when the outcome variable is a small integer count. The explanatory variables are used to model the mean of the response variable.
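A hedged sketch of the simplest case, log(mean) = b0 + b1·x, fitted by Newton's method on the Poisson log-likelihood. The count data below are made up for illustration, and the clip on the exponent is just a safeguard for this sketch.

```python
import math

def fit_poisson(xs, ys, iterations=25):
    """Fit E[y | x] = exp(b0 + b1*x) by Newton's method."""
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0    # start at the mean rate
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = math.exp(min(b0 + b1 * x, 30.0))  # clip to avoid overflow
            g0 += y - mu                 # gradient of the log-likelihood
            g1 += (y - mu) * x
            h00 += mu                    # (negated) Hessian entries
            h01 += mu * x
            h11 += mu * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det        # Newton step
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

exposure = [0, 1, 2, 3, 4, 5]            # explanatory variable (made up)
events = [1, 2, 3, 6, 10, 18]            # small integer counts
b0, b1 = fit_poisson(exposure, events)
rate_at_5 = math.exp(b0 + 5 * b1)        # fitted mean count at x = 5
```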

Regression Analysis

In order to analyze the regression equation formed, it is important to know how well the equation fits the given data.
This can be checked using the coefficient of determination, $R^{2}$, a key output of regression analysis.
It has the following characteristics:
•    Its value ranges from 0 to 1.
•    A value of 0 means that the dependent variable is not linearly related to the independent variable.
•    A value of 1 means that the dependent variable can be predicted exactly from the independent variable.
•    A value of $R^{2}$ between 0 and 1 indicates the extent to which the dependent variable is predictable.
Its formula is $R^{2}$ = $\left [ \frac{1}{N}\frac{\sum(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sigma_{x}\sigma_{y}} \right ]^{2}$
where N is the number of observations,
            $x_{i}$ is the i-th value of the x observations,
            $\bar{x}$ is the mean of the x values,
            $y_{i}$ is the i-th value of the y observations,
            $\bar{y}$ is the mean of the y values,
            $\sigma_{x}$ is the standard deviation of x, and
            $\sigma_{y}$ is the standard deviation of y.

Regression Analysis Example

Let's analyze the data from the question used above. Take the regression line of Y on X, which is given by
Y = 30.888 + 2.232X
To find the coefficient of determination, first calculate the following:

Mean of x = $\frac{\sum x_{i}}{N}$ =$\frac{592}{12}$  = 49.33

Mean of y = $\frac{\sum y_{i}}{N}$ =$\frac{1692}{12}$  = 141

  No   $x_{i}$   $y_{i}$   $x_{i}-\bar{x}$   $y_{i}-\bar{y}$   $(x_{i}-\bar{x})^{2}$   $(y_{i}-\bar{y})^{2}$   $(x_{i}-\bar{x})(y_{i}-\bar{y})$
   1      23        69        -26.33             -72                693.2689                5184                   1895.76
   2      29        95        -20.33             -46                413.3089                2116                    935.18
   3      29       102        -20.33             -39                413.3089                1521                    792.87
   4      35       118        -14.33             -23                205.3489                 529                    329.59
   5      42       126         -7.33             -15                 53.7289                 225                    109.95
   6      46       125         -3.33             -16                 11.0889                 256                     53.28
   7      50       138          0.67              -3                  0.4489                   9                     -2.01
   8      54       178          4.67              37                 21.8089                1369                    172.79
   9      64       156         14.67              15                215.2089                 225                    220.05
  10      66       184         16.67              43                277.8889                1849                    716.81
  11      76       176         26.67              35                711.2889                1225                    933.45
  12      78       225         28.67              84                821.9689                7056                   2408.28
 Total   592      1692                                             3838.667               21564                      8566

Now $\sigma_{x}$ = $\sqrt{\frac{\sum(x_{i}-\bar{x})^{2}}{N}}$ = 17.88

$\sigma_{y}$ = $\sqrt{\frac{\sum(y_{i}-\bar{y})^{2}}{N}}$ = 42.39

so $R^{2}$ = $\left [\frac{1}{N} \frac{\sum(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sigma_{x}\sigma_{y}}  \right ]^{2}$

                 = $\left [\frac{1}{N} \frac{8566}{(17.88)(42.39)} \right ]^{2}$

                 = 0.887

A coefficient of determination of 0.887 indicates that about 88.7% of the variation in Y is explained by the regression on X, so there is a good linear relationship between the two variables.
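The arithmetic above can be re-checked in a few lines (a sketch, pure Python). Exact arithmetic gives about 0.886; the 0.887 quoted above comes from using the rounded standard deviations 17.88 and 42.39.

```python
X = [23, 29, 29, 35, 42, 46, 50, 54, 64, 66, 76, 78]
Y = [69, 95, 102, 118, 126, 125, 138, 178, 156, 184, 176, 225]

n = len(X)
mean_x = sum(X) / n                      # 49.33...
mean_y = sum(Y) / n                      # 141
sd_x = (sum((x - mean_x) ** 2 for x in X) / n) ** 0.5
sd_y = (sum((y - mean_y) ** 2 for y in Y) / n) ** 0.5
cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))   # 8566

r_squared = ((1 / n) * cross / (sd_x * sd_y)) ** 2
print(r_squared)
```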

Linear Regression

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.

There are three basic steps in a linear regression:
  1. Draw a scatter plot.
  2. Perform the regression analysis, which generates estimates of the population parameters.
  3. Interpret the analysis.

The regression procedure becomes easier if we start with a scatter plot, since the plot tells us what type of regression is appropriate. Sometimes a relationship exists but is non-linear, and sometimes there is no relationship at all. Choose a linear function only when the data show a linear tendency.

The standard linear regression equation is

Y = $\alpha$ + $\beta$X + $\varepsilon$

where $\alpha$ is the y-intercept of the line, $\beta$ is the slope of the line and $\varepsilon$  is the error term.

So the dependent variable Y depends on the independent variable X.

The error term $\varepsilon$ is the residual; it reminds us that X cannot fully explain Y.
The fitted line is the line of best fit.
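The residuals can be seen directly in a fit: e_i = y_i - (α + βx_i). For a least-squares line with an intercept they always sum to zero (that is exactly the first normal equation), and Σe_i·x_i = 0 as well (the second). A quick check on made-up data:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 2.5, 4.0, 4.5, 6.0]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Least-squares slope and intercept.
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# The residuals are what the line leaves unexplained.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(sum(residuals))                    # ~0, up to floating-point error
```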

Non-Linear Regression

In some situations the data form a curve rather than a straight line. A linear regression cannot fit such data, so we need to switch to non-linear regression.

Hence, nonlinear regression can be defined as an extension of linear regression in which a nonlinear mathematical model is used to describe the relationship between the response variable and the predictor variables.
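A hedged sketch of one common nonlinear case, y = a·e^(bx): taking logarithms gives ln(y) = ln(a) + b·x, which is linear in the parameters, so an ordinary least-squares line on (x, ln y) recovers a and b. (This minimises error on the log scale; true nonlinear least squares fits on the original scale. The data below are made up.)

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * exp(b*x) via a least-squares line on (x, ln y)."""
    lys = [math.log(y) for y in ys]
    n = len(xs)
    mx, mly = sum(xs) / n, sum(lys) / n
    b = sum((x - mx) * (ly - mly) for x, ly in zip(xs, lys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(mly - b * mx)
    return a, b

xs = [0, 1, 2, 3, 4]
ys = [2.0, 5.4, 14.8, 40.2, 109.2]       # roughly 2 * e^x (made-up data)
a, b = fit_exponential(xs, ys)
print(a, b)
```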