In a multivariate distribution there is more than one variable, and relationships may exist among them. If there is a relation between the variables, they are said to be correlated, and it then becomes possible to predict the value of one variable from the others. This gives rise to what is called regression analysis.

Regression analysis is a statistical device. In regression analysis, the independent variable is also known as the "regressor", "predictor", or "explanatory variable", and the dependent variable is known as the "regressed" or "explained" variable.

The regression equation of Y on X is given by

$Y_{c} = a + bX \qquad (1)$

Here 'a' and 'b' are constants, called the parameters, which determine the position of the line completely.

'a' - the level of the fitted line (its intercept).

'b' - the degree of the slope of the line, i.e. the change in Y per unit change in X.

The symbol 'Yc' stands for the value of Y computed from the relationship for a given X.

The values of the parameters 'a' and 'b' are obtained by the method of least squares and the line is completely determined.

To determine the values of 'a' and 'b', the following equations are to be solved simultaneously.

We obtain them by first taking the sum on both sides of equation (1):

$\sum_{i}$ y = a $\sum_{i}$ 1 + b $\sum_{i}$ x

$\sum_{i}$ y = na + b $\sum_{i}$ x

Multiplying equation (1) by x and then summing both sides:

$\sum_{i}$ xy = a $\sum_{i}$ x + b $\sum_{i}$ $x^{2}$

These equations are usually called the normal equations.

The regression equation of X on Y is given by

$X_{c} = \alpha + \beta Y \qquad (2)$

Here $\alpha$ and $\beta$ are constants, called the parameters, which determine the position of the line completely.

'$\alpha$' - the level of the fitted line.

'$\beta$' - the degree of the slope of the line

The symbol '$X_{c}$' stands for the value of X computed from the relationship for a given Y.

The values of the parameters '$\alpha$' and '$\beta$' are obtained by the method of least squares, which determines the line completely. Taking the sum on both sides of the regression equation gives

$\sum_{i} x$ = $n\alpha + \beta\sum_{i}y$

Multiplying the regression equation of X on Y by y and summing both sides:

$\sum_{i}$ xy = $\alpha\sum_{i}y$ + $\beta \sum_{i} y^{2}$

These equations are called the normal equations. In them, $\sum_{i}x$, $\sum_{i}y$, $\sum_{i}xy$, and $\sum_{i}y^{2}$ denote totals computed from the observed pairs of the two variables X and Y to which the least-squares line is to be fitted, and 'n' is the number of observed pairs of values.
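As an illustrative sketch (the function name here is ours, not from the text), the two normal equations can be solved directly with a small amount of linear algebra:

```python
import numpy as np

def fit_line(x, y):
    """Solve the normal equations
       sum(y)  = n*a + b*sum(x)
       sum(xy) = a*sum(x) + b*sum(x^2)
    for the intercept a and slope b of the line y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    A = np.array([[n, x.sum()],
                  [x.sum(), (x ** 2).sum()]])
    rhs = np.array([y.sum(), (x * y).sum()])
    a, b = np.linalg.solve(A, rhs)
    return a, b
```

Calling `fit_line(y, x)` with the arguments swapped gives the regression of X on Y.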

Consider the following pairs of data, and form the two regression lines corresponding to these variables:

(23,69), (29,95), (29,102), (35,118), (42,126), (46,125), (50,138), (54,178), (64,156), (66,184), (76,176), (78,225)

Let's take the first term of each ordered pair as X and the second as Y.

The regression equation of Y on X is given by

$Y_{c}$ = a + bX

So the computation will be as follows

| No | X | Y | $X^{2}$ | XY |
|----|----|----|------|------|
| 1 | 23 | 69 | 529 | 1587 |
| 2 | 29 | 95 | 841 | 2755 |
| 3 | 29 | 102 | 841 | 2958 |
| 4 | 35 | 118 | 1225 | 4130 |
| 5 | 42 | 126 | 1764 | 5292 |
| 6 | 46 | 125 | 2116 | 5750 |
| 7 | 50 | 138 | 2500 | 6900 |
| 8 | 54 | 178 | 2916 | 9612 |
| 9 | 64 | 156 | 4096 | 9984 |
| 10 | 66 | 184 | 4356 | 12144 |
| 11 | 76 | 176 | 5776 | 13376 |
| 12 | 78 | 225 | 6084 | 17550 |
| Total | 592 | 1692 | 33044 | 92038 |

Substituting these totals into the normal equations:

$\sum_{i}$ y = na + b $\sum_{i}$ x => 1692 = 12a + 592b

$\sum_{i}$xy = a $\sum_{i}$ x + b $\sum_{i} x^{2}$ => 92038 = 592a + 33044b

Solving them we get,

a = 30.888

b = 2.232

So the regression line of Y on X is given by Y = 30.888 + 2.232X
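As a quick check (a sketch, not part of the original text), `numpy.polyfit` recovers essentially the same coefficients; the intercept comes out near 30.91 because the hand computation above back-substitutes the rounded slope 2.232:

```python
import numpy as np

x = np.array([23, 29, 29, 35, 42, 46, 50, 54, 64, 66, 76, 78])
y = np.array([69, 95, 102, 118, 126, 125, 138, 178, 156, 184, 176, 225])

# np.polyfit returns coefficients highest degree first: [slope, intercept]
b, a = np.polyfit(x, y, 1)
# a is approximately 30.91 and b approximately 2.232
```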

Next

The regression equation of X on Y is given by

$X_{c}$ = $\alpha$ + $\beta$Y

So the computation will be as follows

| No | X | Y | $Y^{2}$ | XY |
|----|----|----|------|------|
| 1 | 23 | 69 | 4761 | 1587 |
| 2 | 29 | 95 | 9025 | 2755 |
| 3 | 29 | 102 | 10404 | 2958 |
| 4 | 35 | 118 | 13924 | 4130 |
| 5 | 42 | 126 | 15876 | 5292 |
| 6 | 46 | 125 | 15625 | 5750 |
| 7 | 50 | 138 | 19044 | 6900 |
| 8 | 54 | 178 | 31684 | 9612 |
| 9 | 64 | 156 | 24336 | 9984 |
| 10 | 66 | 184 | 33856 | 12144 |
| 11 | 76 | 176 | 30976 | 13376 |
| 12 | 78 | 225 | 50625 | 17550 |
| Total | 592 | 1692 | 260136 | 92038 |

Substituting these totals into the normal equations:

$\sum_{i}$x = n$\alpha$ + $\beta \sum_{i}$y => 592 = 12$\alpha$ + 1692$\beta$

$\sum_{i}$xy = $\alpha \sum_{i}$ y + $\beta \sum_{i} y^{2}$ => 92038 = 1692$\alpha$ + 260136 $\beta$

Solving them we get,

$\alpha$ = -6.677

$\beta$ = 0.397

So the regression line of X on Y is given by X = -6.677 + 0.397Y

So we get two equations

Y = 30.888 + 2.232X

X = -6.677 + 0.397Y

Rearranging the second equation: 0.397Y = X + 6.677, so Y = 2.519X + 16.82.

Solving this simultaneously with Y = 30.888 + 2.232X gives the point of intersection:

X = 49.04

Y = 140.35
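With exact (unrounded) coefficients, both least-squares lines pass through the mean point $(\bar{x}, \bar{y}) = (49.33, 141)$; the slight discrepancy above comes from rounding the coefficients. A quick numerical check (a sketch in Python, not part of the original text):

```python
import numpy as np

x = np.array([23, 29, 29, 35, 42, 46, 50, 54, 64, 66, 76, 78])
y = np.array([69, 95, 102, 118, 126, 125, 138, 178, 156, 184, 176, 225])

# Coefficients of Y on X and of X on Y, computed without rounding
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()
beta = ((x - x.mean()) * (y - y.mean())).sum() / ((y - y.mean()) ** 2).sum()
alpha = x.mean() - beta * y.mean()

# Intersection of Y = a + bX with X = alpha + beta*Y
x0 = (alpha + beta * a) / (1 - beta * b)
y0 = a + b * x0
# x0 and y0 equal the sample means (49.33, 141)
```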

Hence, the graph of the original data is as follows.

Variables are added and removed sequentially, and at each step all the variables in the model are reassessed. If adding a variable contributes to the model, it is retained, but all the other variables already present are then re-tested to check whether they still contribute; any that do not are removed. In the end, the model contains the smallest possible set of predictor variables.
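A minimal sketch of this stepwise idea, assuming an RSS-improvement threshold as the retention criterion (the function names and the criterion are illustrative; real stepwise procedures usually use F-tests or information criteria):

```python
import numpy as np

def _rss(cols, X, y):
    """Residual sum of squares of a least-squares fit (with intercept)
    using the predictor columns listed in `cols`."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

def stepwise_select(X, y, improve=0.05):
    """Greedy stepwise selection: repeatedly add the predictor that most
    reduces the residual sum of squares, keep it only if the reduction
    exceeds `improve` (a fraction of the current RSS), and after each
    addition re-test the retained predictors, dropping any that no
    longer contribute."""
    selected = []
    remaining = list(range(X.shape[1]))
    current = _rss(selected, X, y)
    while remaining:
        # forward step: try each remaining predictor
        best_j = min(remaining, key=lambda j: _rss(selected + [j], X, y))
        best = _rss(selected + [best_j], X, y)
        if current - best < improve * current:
            break  # no candidate contributes enough
        selected.append(best_j)
        remaining.remove(best_j)
        current = best
        # backward step: re-test the predictors already in the model
        for j in list(selected):
            without = _rss([k for k in selected if k != j], X, y)
            if without - current < improve * current:
                selected.remove(j)
                remaining.append(j)
                current = _rss(selected, X, y)
    return selected
```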

Binary and ordinal responses arise in many fields of study.

These models are mostly used to predict a rate or count variable. Poisson regression is a natural choice when the outcome variable is a small integer count. The explanatory variables are used to model the mean of the response variable.
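As an illustrative sketch (not from the text), a Poisson regression with a log link can be fitted by Newton-Raphson; in practice a statistics library would be used:

```python
import numpy as np

def poisson_fit(x, y, iters=25):
    """Fit the Poisson regression mean mu = exp(b0 + b1*x) by
    Newton-Raphson on the log-likelihood: the gradient is X^T(y - mu)
    and the Hessian is -X^T diag(mu) X."""
    X = np.column_stack([np.ones(len(x)), x])
    beta = np.zeros(2)
    for _ in range(iters):
        mu = np.exp(X @ beta)
        hess = (X.T * mu) @ X          # X^T diag(mu) X
        beta = beta + np.linalg.solve(hess, X.T @ (y - mu))
    return beta
```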

This can be checked using a quantity called the coefficient of determination, $R^{2}$. It is a key output of regression analysis.

It has the following characteristics:

• Its value ranges from 0 to 1.

• When the value equals 0, the dependent variable is not linearly related to the independent variable.

• When the value equals 1, the dependent variable can be predicted exactly from the independent variable.

• When the value of $R^{2}$ is between 0 and 1, it indicates the extent to which the dependent variable is predictable.

Its formula is $R^{2}$ = $\left [ \frac{1}{N}\frac{\sum(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sigma_{x}\sigma_{y}} \right ]^{2}$

where N is the number of observations,

$x_{i}$ is the i-th value of the x observations,

$\bar{x}$ is the mean of the x values,

$y_{i}$ is the i-th value of the y observations,

$\bar{y}$ is the mean of the y values,

$\sigma_{x}$ is the standard deviation of x, and

$\sigma_{y}$ is the standard deviation of y

Let's analyze the data from the question used above. From the above example, take the regression line of Y on X, which is given by

Y = 30.888 + 2.232X

To find the coefficient of determination, let's calculate the following:

Mean of x = $\frac{\sum x_{i}}{N}$ =$\frac{592}{12}$ = 49.33

Mean of y = $\frac{\sum y_{i}}{N}$ =$\frac{1692}{12}$ = 141

| No | $x_{i}$ | $y_{i}$ | $x_{i}-\bar{x}$ | $y_{i}-\bar{y}$ | $(x_{i}-\bar{x})^{2}$ | $(y_{i}-\bar{y})^{2}$ | $(x_{i}-\bar{x})(y_{i}-\bar{y})$ |
|----|----|----|------|-----|----------|-------|---------|
| 1 | 23 | 69 | -26.33 | -72 | 693.2689 | 5184 | 1895.76 |
| 2 | 29 | 95 | -20.33 | -46 | 413.3089 | 2116 | 935.18 |
| 3 | 29 | 102 | -20.33 | -39 | 413.3089 | 1521 | 792.87 |
| 4 | 35 | 118 | -14.33 | -23 | 205.3489 | 529 | 329.59 |
| 5 | 42 | 126 | -7.33 | -15 | 53.7289 | 225 | 109.95 |
| 6 | 46 | 125 | -3.33 | -16 | 11.0889 | 256 | 53.28 |
| 7 | 50 | 138 | 0.67 | -3 | 0.4489 | 9 | -2.01 |
| 8 | 54 | 178 | 4.67 | 37 | 21.8089 | 1369 | 172.79 |
| 9 | 64 | 156 | 14.67 | 15 | 215.2089 | 225 | 220.05 |
| 10 | 66 | 184 | 16.67 | 43 | 277.8889 | 1849 | 716.81 |
| 11 | 76 | 176 | 26.67 | 35 | 711.2889 | 1225 | 933.45 |
| 12 | 78 | 225 | 28.67 | 84 | 821.9689 | 7056 | 2408.28 |
| Total | 592 | 1692 | | | 3838.667 | 21564 | 8566 |

Now $\sigma_{x}$ = $\sqrt{\frac{\sum(x_{i}-\bar{x})^{2}}{N}}$ = 17.88

$\sigma_{y}$ = $\sqrt{\frac{\sum(y_{i}-\bar{y})^{2}}{N}}$ = 42.39

so $R^{2}$ = $\left [\frac{1}{N} \frac{\sum(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sigma_{x}\sigma_{y}} \right ]^{2}$

= $\left [\frac{1}{N} \frac{8566}{(17.88)(42.39)} \right ]^{2}$

= 0.887

A coefficient of determination of 0.887 indicates that about 88.7% of the variation in Y is explained by the regression on X. So there is a good relationship between the two variables.
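The computation can be verified directly from the formula (a sketch in Python, not part of the original text):

```python
import numpy as np

x = np.array([23, 29, 29, 35, 42, 46, 50, 54, 64, 66, 76, 78])
y = np.array([69, 95, 102, 118, 126, 125, 138, 178, 156, 184, 176, 225])

n = len(x)
sigma_x = np.sqrt(((x - x.mean()) ** 2).sum() / n)   # population SD, as in the text
sigma_y = np.sqrt(((y - y.mean()) ** 2).sum() / n)
r = ((x - x.mean()) * (y - y.mean())).sum() / (n * sigma_x * sigma_y)
r_squared = r ** 2   # close to the 0.887 computed above
```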

There are three basic steps in a linear regression.

- Draw a scatter plot.
- Perform the regression analysis, which generates estimates of the population parameters.
- Interpret the analysis.

The regression procedure becomes easier if we start with the scatter plot, as it tells what type of regression is appropriate. Sometimes a relationship exists but is nonlinear, and sometimes there is no relationship at all between the variables. Choose a linear model only when the data show a linear tendency.

The standard linear regression equation is

$Y = \alpha + \beta X + \varepsilon$

where $\alpha$ is the y-intercept of the line, $\beta$ is the slope of the line and $\varepsilon$ is the error term.

So the dependent variable Y depends on the independent variable X.

The error term $\varepsilon$ is the residual; it reminds us that X helps explain Y but cannot explain it fully.

The fitted line, which minimizes the sum of squared residuals, is the best-fit line.
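One consequence of least squares with an intercept is that the residuals sum to zero; a quick check with the example data (a sketch, not from the text):

```python
import numpy as np

x = np.array([23, 29, 29, 35, 42, 46, 50, 54, 64, 66, 76, 78])
y = np.array([69, 95, 102, 118, 126, 125, 138, 178, 156, 184, 176, 225])

b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)
# With an intercept in the model, the least-squares residuals sum to zero,
# which is one sense in which this is the best-fit line.
```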

Nonlinear regression can be defined as an extension of the linear regression technique in which a nonlinear mathematical model is used to describe the relationship between the response variable and the predictor variables.