# Regression

In a multivariate distribution there is more than one variable, and relationships may exist between pairs of them. If two variables are related, they are said to be correlated, and it then becomes possible to predict the value of one variable from the other. This gives rise to what is called regression analysis.

## Regression Definition

Regression is a measure of the average relationship between two or more variables, expressed in the original units of the data.
Regression analysis is a statistical device. In regression analysis, the independent variable is also known as the "regressor", "predictor" or "explanatory variable", and the dependent variable is known as the "regressand" or "explained variable".

## Regression Line of Y on X

Regression line of Y on X: it gives the most probable value of Y for given values of X.
The regression equation of Y on X is given by
$Y_{c}$ = a + bX
Here a and b are constants, called the parameters, which completely determine the position of the line.

'a' - the level of the fitted line, i.e., the Y-intercept.
'b' - the slope of the line, which is the regression coefficient of Y on X.

The symbol 'Yc' stands for the value of Y computed from the relationship for a given X.

The values of the parameters 'a' and 'b' are obtained by the method of least squares and the line is completely determined.

To determine the values of 'a' and 'b', the following equations are solved simultaneously.

We obtain the first equation by summing both sides of the regression equation over all observations:

$\sum_{i} y = a \sum_{i} 1 + b \sum_{i} x = na + b \sum_{i} x$

Multiplying both sides of the regression equation by x and then summing gives

$\sum_{i} xy = a \sum_{i} x + b \sum_{i} x^{2}$

These equations are usually called the normal equations. In them, $\sum_{i} x$, $\sum_{i} y$, $\sum_{i} xy$ and $\sum_{i} x^{2}$ denote totals computed from the observed pairs of the two variables X and Y to which the least-squares line is to be fitted, and n is the number of observed pairs.
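The normal equations above are just a 2×2 linear system in a and b. As an illustrative sketch (the data here are made up, not from the text), they can be solved directly with NumPy:

```python
import numpy as np

# Solve the normal equations
#   sum(y)  = n*a + b*sum(x)
#   sum(xy) = a*sum(x) + b*sum(x^2)
# as a 2x2 linear system, for a small made-up data set where y = 2x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
n = len(x)

# Coefficient matrix and right-hand side of the normal equations
A = np.array([[n, x.sum()],
              [x.sum(), (x ** 2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])

a, b = np.linalg.solve(A, rhs)
print(a, b)  # for y = 2x the least-squares fit is a = 0, b = 2
```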

## Regression Coefficient

Regression line of X on Y: It gives the most probable value of X for given values of Y.
The regression equation of X on Y is given by
Xc = $\alpha$ + $\beta$Y
Here $\alpha$ and $\beta$ are constants, called the parameters, which completely determine the position of the line.

'$\alpha$' - the level of the fitted line, i.e., the X-intercept.
'$\beta$' - the slope of the line, which is the regression coefficient of X on Y.

The symbol '$X_{c}$' stands for the value of X computed from the relationship for a given Y.

The values of the parameters '$\alpha$' and '$\beta$' are obtained by the method of least squares and the line is completely determined.

## Regression Formula

Summing both sides of the regression equation of X on Y over all observations gives

$\sum_{i} x$ = $\alpha\sum_{i}1$ + $\beta\sum_{i}y$ = $n\alpha + \beta\sum_{i}y$

Multiplying both sides by y and then summing gives

$\sum_{i} xy$ = $\alpha\sum_{i}y$ + $\beta \sum_{i} y^{2}$

These equations are called the normal equations. In them, $\sum_{i}x$, $\sum_{i}y$, $\sum_{i}xy$ and $\sum_{i} y^{2}$ denote totals computed from the observed pairs of the two variables X and Y to which the least-squares line is to be fitted, and n is the number of observed pairs.

## Regression Analysis Example

Let's consider an example of regression and see how the data can be interpreted.

Consider the following pairs of data and form the two regression lines corresponding to these variables:
(23,69), (29,95), (29,102), (35,118), (42,126), (46,125), (50,138), (54,178), (64,156), (66,184), (76,176), (78,225)

Solution:
Take the first element of each ordered pair as X and the second as Y.
The regression equation of Y on X is given by
$Y_{c}$ = a + bX
So the computation will be as follows

| No | X | Y | $X^{2}$ | XY |
|---|---|---|---|---|
| 1 | 23 | 69 | 529 | 1587 |
| 2 | 29 | 95 | 841 | 2755 |
| 3 | 29 | 102 | 841 | 2958 |
| 4 | 35 | 118 | 1225 | 4130 |
| 5 | 42 | 126 | 1764 | 5292 |
| 6 | 46 | 125 | 2116 | 5750 |
| 7 | 50 | 138 | 2500 | 6900 |
| 8 | 54 | 178 | 2916 | 9612 |
| 9 | 64 | 156 | 4096 | 9984 |
| 10 | 66 | 184 | 4356 | 12144 |
| 11 | 76 | 176 | 5776 | 13376 |
| 12 | 78 | 225 | 6084 | 17550 |
| Total | 592 | 1692 | 33044 | 92038 |

Plug them in the normal equations
$\sum_{i}$ y = na + b $\sum_{i}$ x             =>   1692 = 12a + 592b
$\sum_{i}$xy = a $\sum_{i}$ x + b $\sum_{i} x^{2}$   =>   92038 = 592a + 33044b
Solving them simultaneously, we get
a = 30.888
b = 2.232
So the regression line of Y on X is given by Y = 30.888 + 2.232X
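This fit can be checked numerically. The sketch below (illustrative, using NumPy's `polyfit` rather than the hand-solved normal equations) fits Y on X for the example data:

```python
import numpy as np

# Least-squares fit of Y on X for the worked example.
# np.polyfit with degree 1 returns [slope, intercept].
X = np.array([23, 29, 29, 35, 42, 46, 50, 54, 64, 66, 76, 78], dtype=float)
Y = np.array([69, 95, 102, 118, 126, 125, 138, 178, 156, 184, 176, 225], dtype=float)

b, a = np.polyfit(X, Y, 1)
print(a, b)  # roughly a = 30.91, b = 2.232 (30.888 in the text comes from rounding b first)
```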

Next, the regression equation of X on Y is given by
$X_{c}$ = $\alpha$ + $\beta$Y
So the computation will be as follows

| No | X | Y | $Y^{2}$ | XY |
|---|---|---|---|---|
| 1 | 23 | 69 | 4761 | 1587 |
| 2 | 29 | 95 | 9025 | 2755 |
| 3 | 29 | 102 | 10404 | 2958 |
| 4 | 35 | 118 | 13924 | 4130 |
| 5 | 42 | 126 | 15876 | 5292 |
| 6 | 46 | 125 | 15625 | 5750 |
| 7 | 50 | 138 | 19044 | 6900 |
| 8 | 54 | 178 | 31684 | 9612 |
| 9 | 64 | 156 | 24336 | 9984 |
| 10 | 66 | 184 | 33856 | 12144 |
| 11 | 76 | 176 | 30976 | 13376 |
| 12 | 78 | 225 | 50625 | 17550 |
| Total | 592 | 1692 | 260136 | 92038 |

Plug them in the normal equations
$\sum_{i}x = n\alpha + \beta\sum_{i}y$   =>   592 = 12$\alpha$ + 1692$\beta$
$\sum_{i}xy = \alpha \sum_{i} y + \beta \sum_{i} y^{2}$   =>   92038 = 1692$\alpha$ + 260136$\beta$
Solving them we get,
$\alpha$ = -6.677
$\beta$ = 0.397
So the regression line of X on Y is given by X = -6.677 + 0.397Y
So we get two equations:
Y = 30.888 + 2.232X
X = -6.677 + 0.397Y
Rewriting the second equation, 0.397Y = X + 6.677, so Y = 2.519X + 16.82.
Solving the two equations simultaneously gives
X = 49.04
Y = 140.35
(In fact, the two regression lines always intersect at the point of means $(\bar{x}, \bar{y})$ = (49.33, 141); the small difference here comes from rounding the coefficients.)
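As a numerical check, the sketch below (illustrative, not from the original text) fits both regression lines with NumPy and computes their intersection, which lands exactly on the point of means:

```python
import numpy as np

# Both regression lines for the example data, and their intersection.
X = np.array([23, 29, 29, 35, 42, 46, 50, 54, 64, 66, 76, 78], dtype=float)
Y = np.array([69, 95, 102, 118, 126, 125, 138, 178, 156, 184, 176, 225], dtype=float)

b, a = np.polyfit(X, Y, 1)          # Y on X:  Y = a + b*X
beta, alpha = np.polyfit(Y, X, 1)   # X on Y:  X = alpha + beta*Y

# Substitute Y = a + b*X into X = alpha + beta*Y and solve for X
x_int = (alpha + beta * a) / (1 - beta * b)
y_int = a + b * x_int
print(x_int, y_int)  # equals (mean of X, mean of Y) = (49.33..., 141)
```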

## Stepwise Regression

Stepwise regression is a method in which variables are added and removed for the purpose of identifying a useful subset of the predictors, since the dependent variable may depend on many independent variables.

Variables are added and removed sequentially, and at each step the model is reassessed. If adding a variable contributes to the model, it is retained, but all of the other variables in the model are then re-tested to check whether they still contribute; any that do not are removed. In the end, the model contains the smallest possible set of predictor variables.
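The add/re-test/remove cycle above can be sketched in a few lines. This is a minimal illustration (the data and the residual-sum-of-squares stopping rule are made up for the example; real analyses would typically use a criterion such as AIC or partial F-tests):

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of a least-squares fit on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def stepwise(X, y, tol=1e-8):
    selected = []
    while True:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]), default=None)
        # stop when no remaining variable improves the fit
        if best is None or rss(X, y, selected + [best]) >= rss(X, y, selected) - tol:
            return selected
        selected.append(best)
        # backward step: drop any earlier variable that no longer contributes
        for j in list(selected[:-1]):
            trial = [k for k in selected if k != j]
            if rss(X, y, trial) <= rss(X, y, selected) + tol:
                selected = trial

# Made-up data: y depends only on the first two columns; the third is irrelevant
X = np.column_stack([[1, 2, 3, 4, 5, 6, 7, 8],
                     [2, 1, 4, 3, 6, 5, 8, 7],
                     [5, 3, 8, 1, 9, 2, 7, 4]]).astype(float)
y = 2 * X[:, 0] + 0.5 * X[:, 1]
print(stepwise(X, y))  # keeps columns 0 and 1, drops column 2
```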

## Logistic Regression

Logistic regression can be defined as a statistical technique in which the dependent variable is dichotomous, that is, binary.
Binary and ordinal responses arise in many fields of study.
Logistic regression is mostly used to investigate the relationship between such responses and a set of explanatory variables.
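A minimal sketch of the idea, fitted by gradient descent on made-up data (in practice one would use a library such as scikit-learn or statsmodels):

```python
import numpy as np

# Tiny made-up data: x is an exposure level, y a binary outcome.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

a, b = 0.0, 0.0   # intercept and slope of the linear predictor
lr = 0.1          # learning rate
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))  # predicted probabilities
    # gradient of the negative log-likelihood
    a -= lr * np.sum(p - y)
    b -= lr * np.sum((p - y) * x)

pred = (1.0 / (1.0 + np.exp(-(a + b * x))) > 0.5).astype(int)
print(pred)  # classifies the sample correctly: [0 0 0 1 1 1]
```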

## Poisson Regression

Poisson regression is a type of regression model which assumes that the outcome variable follows a Poisson distribution, that is, the response variable is Poisson-distributed.

These models are mostly used to predict rate or count variables. Poisson regression is a natural choice when the outcome variable is a small count. The explanatory variables model the mean of the response variable.
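A minimal sketch of fitting such a model with a log link by Newton's method (IRLS). The counts below are made up so that the mean doubles per unit of x, i.e. the true coefficients are b0 = 0 and b1 = ln 2; a real analysis would use a GLM library such as statsmodels:

```python
import numpy as np

# Made-up count data: the mean doubles with each unit of x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 8.0, 16.0, 32.0])
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept

# Start from an OLS fit of log(y), a common initial guess for the log link
b1, b0 = np.polyfit(x, np.log(y), 1)
beta = np.array([b0, b1])
for _ in range(10):
    mu = np.exp(X @ beta)                   # fitted Poisson means
    W = np.diag(mu)                         # Poisson variance equals the mean
    # Newton/IRLS step: solve (X^T W X) d = X^T (y - mu)
    beta += np.linalg.solve(X.T @ W @ X, X.T @ (y - mu))

print(beta)  # close to [0, ln 2] = [0, 0.693...]
```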

## Regression Analysis

In order to analyze the regression equation formed, it is important to know how well the equation fits the given data.
This can be checked using a quantity called the coefficient of determination, a key output of regression analysis.
It has the following characteristics:
- Its value ranges from 0 to 1.
- When the value equals 0, the dependent variable has no linear relationship with the independent variable.
- When the value equals 1, the dependent variable can be predicted exactly from the independent variable.
- When the value of $R^{2}$ is between 0 and 1, it indicates the extent to which the dependent variable is predictable.

Its formula is $R^{2} = \left [ \frac{1}{N}\frac{\sum(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sigma_{x}\sigma_{y}} \right ]^{2}$
where N is the number of observations ,
$x_{i}$ is the ith value of the x observation,
$\bar{x}$ is the mean of x value,
$y_{i}$ is the ith value of the y observation,
$\bar{y}$ is the mean y value,
$\sigma_{x}$ is the standard deviation of x, and
$\sigma_{y}$ is the standard deviation of y

Regression analysis example

Let's analyze the data from the question used above. From the example, take the regression line of Y on X, which is given by
Y = 30.888 + 2.232X
To find the coefficient of determination, calculate the following:

Mean of x = $\frac{\sum x_{i}}{N}$ =$\frac{592}{12}$  = 49.33

Mean of y = $\frac{\sum y_{i}}{N}$ =$\frac{1692}{12}$  = 141

| No | $x_{i}$ | $y_{i}$ | $x_{i}-\bar{x}$ | $y_{i}-\bar{y}$ | $(x_{i}-\bar{x})^{2}$ | $(y_{i}-\bar{y})^{2}$ | $(x_{i}-\bar{x})(y_{i}-\bar{y})$ |
|---|---|---|---|---|---|---|---|
| 1 | 23 | 69 | -26.33 | -72 | 693.2689 | 5184 | 1895.76 |
| 2 | 29 | 95 | -20.33 | -46 | 413.3089 | 2116 | 935.18 |
| 3 | 29 | 102 | -20.33 | -39 | 413.3089 | 1521 | 792.87 |
| 4 | 35 | 118 | -14.33 | -23 | 205.3489 | 529 | 329.59 |
| 5 | 42 | 126 | -7.33 | -15 | 53.7289 | 225 | 109.95 |
| 6 | 46 | 125 | -3.33 | -16 | 11.0889 | 256 | 53.28 |
| 7 | 50 | 138 | 0.67 | -3 | 0.4489 | 9 | -2.01 |
| 8 | 54 | 178 | 4.67 | 37 | 21.8089 | 1369 | 172.79 |
| 9 | 64 | 156 | 14.67 | 15 | 215.2089 | 225 | 220.05 |
| 10 | 66 | 184 | 16.67 | 43 | 277.8889 | 1849 | 716.81 |
| 11 | 76 | 176 | 26.67 | 35 | 711.2889 | 1225 | 933.45 |
| 12 | 78 | 225 | 28.67 | 84 | 821.9689 | 7056 | 2408.28 |
| Total | 592 | 1692 | | | 3838.667 | 21564 | 8566 |

Now $\sigma_{x}$ = $\sqrt{\frac{\sum(x_{i}-\bar{x})^{2}}{N}}$ = 17.88

$\sigma_{y}$ = $\sqrt{\frac{\sum(y_{i}-\bar{y})^{2}}{N}}$ = 42.39

so $R^{2}$ = $\left [\frac{1}{N} \frac{\sum(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sigma_{x}\sigma_{y}} \right ]^{2}$

= $\left [\frac{1}{N} \frac{8566}{(17.88)(42.39)} \right ]^{2}$

= 0.887

A coefficient of determination of 0.887 indicates that about 88.7% of the variation in Y is explained by its linear relationship with X. So there is a good relationship between the two variables.
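Since $R^{2}$ for a simple linear fit is just the squared Pearson correlation, the computation above can be verified in a couple of lines (an illustrative check, not part of the original worked solution):

```python
import numpy as np

# Coefficient of determination for the example data,
# computed as the square of the Pearson correlation coefficient.
X = np.array([23, 29, 29, 35, 42, 46, 50, 54, 64, 66, 76, 78], dtype=float)
Y = np.array([69, 95, 102, 118, 126, 125, 138, 178, 156, 184, 176, 225], dtype=float)

r = np.corrcoef(X, Y)[0, 1]
r2 = r ** 2
print(r2)  # about 0.886; the text's 0.887 reflects intermediate rounding
```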

## Linear Regression

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.

There are three basic steps in a linear regression:
1. Make a scatter plot.
2. Perform the regression analysis, which generates estimates of the population parameters.
3. Interpret the analysis.

The regression procedure becomes easier if we start with a scatter plot, as it tells us what type of regression is appropriate. Sometimes the plot shows that a relationship exists but is non-linear, and sometimes there is no relationship at all. A linear function should be chosen only when the plot shows a linear tendency.

The standard linear regression equation is

Y = $\alpha$ + $\beta$X + $\varepsilon$

where $\alpha$ is the y-intercept of the line, $\beta$ is the slope of the line and $\varepsilon$  is the error term.

So the dependent variable Y depends on the independent variable X.

The error term $\varepsilon$ is the residual; it reminds us that X cannot fully explain Y.
The fitted equation is the line of best fit.

## Non- Linear Regression

In some situations, the data provided forms a curve rather than a straight line. To fit such data, linear regression is not appropriate and we need to switch to non-linear regression.

Hence, nonlinear regression can be defined as an extension of the linear regression technique in which a nonlinear mathematical model describes the relationship between the response variable and the predictor variables.
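As an illustrative sketch (assuming SciPy is available; the exponential model and the data here are made up), a nonlinear model can be fitted with `scipy.optimize.curve_fit`:

```python
import numpy as np
from scipy.optimize import curve_fit

# Nonlinear model: y = a * exp(b * x), with made-up noise-free data
# generated from a = 2, b = 0.5.
def model(x, a, b):
    return a * np.exp(b * x)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * np.exp(0.5 * x)

# curve_fit performs nonlinear least squares from the initial guess p0
params, _ = curve_fit(model, x, y, p0=[1.0, 0.1])
print(params)  # recovers roughly [2.0, 0.5]
```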