# Descriptive Statistics

Statistics provides methods for collecting, organizing, summarizing, presenting, analyzing and interpreting data. For this purpose the study is broadly divided into two branches.

1. Descriptive statistics
2. Inferential statistics

Descriptive statistics provides methods to process raw data and present it for analysis. Inferential statistics deals with methods that generalize population traits from sample characteristics.

## What is Descriptive Statistics?

Descriptive statistics, as the name suggests, is used to describe data.

Descriptive Statistics Definition: Descriptive statistics is the branch of statistics which provides methods and tools for the collection, organization, summarization and presentation of data.

It is also often referred to as exploratory data analysis (EDA). The exploration of data is done both analytically and graphically. Data can be analyzed using measures of central tendency, variability and position. The data can also be displayed using different graphing techniques. The purpose of applying descriptive statistics methods is only to display and summarize data, not to generalize the results.

### Example of Descriptive Statistics

The US Government takes a national census once every 10 years to gather information about the US population, such as average age, income, housing and educational details. The Census Bureau employs various means to collect, organize and summarize the data. Finally, it publishes the information collected in the form of charts, graphs and tables.

## Descriptive Statistics Vs Inferential Statistics

Descriptive statistics provides tools only for collecting and exploring data. For example, if data is collected for a sample, we can find the measures of central tendency, variation and so on, and also present the data in different forms. We can identify outliers and gaps in the data. Beyond this, we cannot say anything about the population characteristics using this kind of exploration.

Inferential Statistics complements descriptive statistics by providing techniques like estimation and hypothesis tests to generalize population characteristics using sample data. The methods applied in inferential statistics make use of probability as a measure.

Suppose we collect data from 50 university students on their GPA scores. We may find the mean GPA score and the standard deviation, calculate the five-number summary and present it as a box plot. We may also make comparative graphs like bar diagrams and pie charts for the grades obtained in different subjects. These activities are related to descriptive statistics.

If, in addition, we use the averages calculated to estimate a national average and draw conclusions from it, stating the margin of error or level of significance, then these activities fall within the boundaries of inferential statistics.

## Types of Descriptive Statistics

A variable is a characteristic under study that can assume various values, and data are the values assumed by the variables.

For example, if we are collecting data on the heights and weights of children in preschools, then the variables are height and weight. The data are the recorded heights and weights of the children chosen.

The variables are classified on the basis of the type of data they assume.

- Data
  - Qualitative data
  - Quantitative data
    - Discrete variables
    - Continuous variables

Qualitative variables can be categorized according to some characteristic or attribute. For example the eye color of the subjects studied makes up qualitative data.

Quantitative variables are numerical and can be ordered or ranked. They are further categorized into two types, discrete and continuous. Discrete variables assume values that can be counted, while continuous variables can take any value within an interval and are obtained by measuring.

Data is also classified according to the number of variables it deals with.

### Univariate Data

When data is collected on one variable, the data distribution is termed univariate. The organization, summarization and presentation of univariate data only describe the data collected; they do not explore the cause.

For example, the frequency distribution of GPA scores of university students can give you the average score, the standard deviation and a graph relating the score ranges to their respective frequencies. But it cannot hint at the cause of such a distribution.

### Bivariate Data

Bivariate data contains two variables whose values are recorded together. The examination of such data can tell us about the relationship between the two variables and can show how one variable changes with a change in the other.

If the above data on GPA scores also includes the students' graduation scores at school level, the data can be explored to find the relationship between these two scores.

## Descriptive Statistics Analysis

The most common method used in data collection is through surveys. Data are also collected by examining records or by directly recording observations. The four most commonly used survey methods are

1. Telephone surveys
2. Mailed questionnaire surveys
3. Personal interview surveys
4. Online surveys via the internet

Often samples are used to collect data to study the characteristics of a population. The sample formed for this purpose needs to be unbiased, meaning all the members of the population should have the same chance of being included in the sample. The four basic methods of sampling applied for this purpose are random, systematic, stratified and cluster sampling. All four types use random methods to include population elements in the sample.
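As a rough illustration of two of these methods, the sketch below draws a simple random sample and a systematic sample. The population of 100 numbered members, the sample size of 10 and the function names are assumptions made for this example only.

```python
import random

def simple_random_sample(population, k, rng):
    """Pick k members so that every member has the same chance of selection."""
    return rng.sample(population, k)

def systematic_sample(population, k, rng):
    """Pick a random starting point, then take every step-th member."""
    step = len(population) // k
    start = rng.randrange(step)
    return population[start::step][:k]

rng = random.Random(0)               # fixed seed so the run is repeatable
population = list(range(1, 101))     # hypothetical members numbered 1..100

random_sample = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
print(random_sample)
print(systematic)
```

Stratified and cluster sampling would apply the same random selection within subgroups or to whole groups of the population.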

## Simple Descriptive Statistics

Data collected is organized in frequency tables.

Frequency is the number of times a value occurs in a data set. The count can be made using tally marks.

Frequency distribution is a method of organizing raw data in tabular form using classes and frequencies. This type of organization is necessary for summarizing data using numerical measures and also for graphical display of the data set. When the range of the data set is large, the values are grouped into class intervals, and the class limits are shown in the frequency table instead of individual values.
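The tallying described above can be sketched in a few lines of Python. The raw scores below are invented for illustration, and the class limits are an assumption chosen to cover marks from 0 to 100.

```python
from collections import Counter

# Hypothetical raw test scores, invented for this illustration.
scores = [12, 35, 47, 58, 61, 61, 74, 88, 93, 35, 47, 47]

# Frequency of each distinct value.
value_counts = Counter(scores)

# Classes given as (lower limit, upper limit), both inclusive.
classes = [(0, 20), (21, 40), (41, 60), (61, 80), (81, 100)]
class_counts = {
    (low, high): sum(low <= s <= high for s in scores)
    for low, high in classes
}

for (low, high), f in class_counts.items():
    print(f"{low:>3} - {high:<3} {f}")
```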

Example:

The Math test scores of 100 students can be tabulated as follows:

| Mark range (class) | Frequency f |
|---|---|
| 0 - 20 | 4 |
| 21 - 40 | 25 |
| 41 - 60 | 28 |
| 61 - 80 | 26 |
| 81 - 100 | 17 |
| Total | 100 |

Columns are appended to this table depending on the measures calculated to summarize the data or the graph used to display it. For calculating the mean of the distribution, a column for the class mid-point x and another column for the product f × x are added.
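A minimal sketch of that calculation, using the class limits and frequencies from the Math test table. The variable names are illustrative, and the mid-point is assumed to be the average of the two class limits.

```python
# Class limits and frequencies from the Math test score table.
classes = [(0, 20), (21, 40), (41, 60), (61, 80), (81, 100)]
frequencies = [4, 25, 28, 26, 17]

# Class mid-point x = (lower limit + upper limit) / 2.
midpoints = [(low + high) / 2 for low, high in classes]

# Product column f * x.
products = [f * x for f, x in zip(frequencies, midpoints)]

grouped_mean = sum(products) / sum(frequencies)
print(midpoints)      # [10.0, 30.5, 50.5, 70.5, 90.5]
print(grouped_mean)   # 55.88
```

The result is an estimate of the mean, since every score in a class is treated as if it were equal to the class mid-point.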

Bivariate data is also arranged in tabular form for analysis using scatter plots or measures such as covariance and correlation.

| Student | Days absent (x) | Grade in finals (y) |
|---|---|---|
| Maria | 3 | 87 |
| Philip | 15 | 42 |
| Danny | 6 | 82 |
| Betty | 10 | 75 |
| Sally | 11 | 60 |
| Christopher | 5 | 92 |
| John | 8 | 78 |

As for the univariate data, columns are suitably added for further manipulation of data.

## Univariate Descriptive Statistics

The data distribution is summarized using

1. Measures of central tendency
2. Measures of dispersion
3. Measures of position

### Measures of Central Tendency

A measure of central tendency conveys the idea of centralness or the average of the data set. The three most commonly used measures for this purpose are

1. Mean
2. Median
3. Mode

While the mean is the arithmetic average of the data set, the median is the middle value of the distribution when the data is ordered from the lowest to the highest. The mode is the value that occurs most often, that is, the one with the highest frequency. Descriptive statistics consists of clear formulas and detailed methods to evaluate these numerical measures.
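The three measures can be computed with Python's standard `statistics` module. The data set below is hypothetical and was chosen so that each measure gives a whole number.

```python
import statistics

data = [4, 7, 7, 8, 10, 12, 15]   # hypothetical data set

mean = statistics.mean(data)       # sum of the values / number of values
median = statistics.median(data)   # middle value of the ordered data
mode = statistics.mode(data)       # most frequent value

print(mean, median, mode)          # 9 8 7
```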

### Measures of Dispersion

A measure of dispersion tells about the spread of the data. The common measures of variability or dispersion are

1. Range
2. Standard deviation/Variance
3. Interquartile range

The range is the difference between the highest and lowest values in the data set.
Variance is the average of the squared deviations from the mean. Standard deviation is the positive square root of the variance. The standard deviation conveys an idea of the spread of data about the mean.
The interquartile range is the difference between two measures of position, the third and the first quartile. This value gives a measure of the spread of the central 50% of the data set.
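The sketch below works through these measures for a small hypothetical data set. Quartiles can be computed under several conventions; the "inclusive" method of `statistics.quantiles` is used here as an assumption, and other conventions may give a slightly different interquartile range.

```python
import statistics

data = [4, 7, 7, 8, 10, 12, 15]   # hypothetical data set

data_range = max(data) - min(data)

# Population variance: average of the squared deviations from the mean.
mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = variance ** 0.5

# Quartiles; the "inclusive" method is one of several conventions in use.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(data_range, round(variance, 3), round(std_dev, 3), iqr)
```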

### Measures of Position

A measure of position tells about the relative position of a data value in the data set. The median is a measure of position as well as a measure of central tendency. The important measures of position used in the exploration of univariate data are

1. Percentiles
2. Quartiles
3. z-scores
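As a small illustration, the sketch below computes z-scores and a percentile rank for a hypothetical data set. The helper `percentile_rank` is not a library function; it follows one common convention (percentage of values strictly below the given value), which is an assumption here.

```python
data = [4, 7, 7, 8, 10, 12, 15]   # hypothetical data set

n = len(data)
mean = sum(data) / n
std_dev = (sum((x - mean) ** 2 for x in data) / n) ** 0.5

# z-score: how many standard deviations a value lies from the mean.
z_scores = [(x - mean) / std_dev for x in data]

def percentile_rank(value, values):
    """Percentage of values strictly below `value` (one common convention)."""
    below = sum(v < value for v in values)
    return 100 * below / len(values)

print([round(z, 2) for z in z_scores])
print(percentile_rank(10, data))
```

Values below the mean get negative z-scores and values above it get positive ones.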

## Descriptive Statistics Sample

Samples and populations are terms dealt with mainly in inferential statistics, where a sample statistic such as the mean is used to estimate population parameters and to check claims about population characteristics using hypothesis testing. For evaluating sample statistics, the formulas of descriptive statistics vary depending on two factors: the sample size and the degrees of freedom.

The sample mean is considered an unbiased estimate of the population mean. Hence the same methods and formulas used in descriptive statistics are used for computing the sample mean.

The formula used in descriptive statistics to compute the standard deviation is $\sigma =\sqrt{\frac{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}{n}}$

σ is used to denote the population standard deviation. The standard deviation of the sample is denoted by 's'. When finding interval estimates of a population parameter such as the mean, the standard deviation used is that of the sample mean, which is determined by the sample size 'n' and the population standard deviation: $\sigma_{\overline{x}} = \frac{\sigma }{\sqrt{n}}$.
If the population standard deviation is not known, the standard error is used instead, calculated from the standard deviation 's' of the sample as $\frac{s}{\sqrt{n}}$.

The sample standard deviation is calculated using degrees of freedom so that the sample variance is an unbiased estimate of the population variance. The degrees of freedom are the number of values that remain free to vary after an estimate has been used in the estimation of another measure. The formula for the standard deviation makes use of the mean, so the mean is calculated first. Once the mean is known, only n - 1 of the data values are free to vary, because the last value is then fixed by the mean. Hence n - 1 is the degrees of freedom used in the calculation of the sample standard deviation.

Sample standard deviation: $s=\sqrt{\frac{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}{n-1}}$.
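The effect of the two divisors can be seen in a short sketch. The sample values are hypothetical, and the variable names are chosen only for this example.

```python
import math
import statistics

sample = [4, 7, 7, 8, 10, 12, 15]   # hypothetical sample
n = len(sample)
mean = sum(sample) / n
squared_deviations = sum((x - mean) ** 2 for x in sample)

sigma_style = math.sqrt(squared_deviations / n)        # divisor n
s = math.sqrt(squared_deviations / (n - 1))            # divisor n - 1
standard_error = s / math.sqrt(n)

# The standard library offers both divisors directly.
assert math.isclose(sigma_style, statistics.pstdev(sample))
assert math.isclose(s, statistics.stdev(sample))

print(round(sigma_style, 3), round(s, 3), round(standard_error, 3))
```

Dividing by n - 1 always gives a slightly larger value than dividing by n, and the difference shrinks as the sample size grows.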

## Descriptive Statistics Graphs

The common types of graphs and plots used for displaying univariate data are

1. Dot plots
2. Bar graphs
3. Pie charts
4. Histograms
5. Stem Plots
6. Box Plots
7. Cumulative frequency charts and Ogives.

Scatter plots are used to graph bivariate data. Regression lines and curves can be fitted to a scatter plot to build an estimating model based on the data.
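As a sketch of such an estimating model, the code below fits a straight line to the days-absent and grade data of the seven students tabulated earlier. The text does not say which fitting method is meant; ordinary least squares is assumed here, and `predict` is a hypothetical helper name.

```python
# Days absent (x) and final grade (y) from the table of seven students.
x = [3, 15, 6, 10, 11, 5, 8]
y = [87, 42, 82, 75, 60, 92, 78]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept of the line y = intercept + slope * x.
slope = (
    sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    / sum((xi - mean_x) ** 2 for xi in x)
)
intercept = mean_y - slope * mean_x

def predict(days_absent):
    """Estimated grade for a given number of days absent."""
    return intercept + slope * days_absent

print(round(slope, 2), round(intercept, 2))
```

The slope is negative for this data, so the fitted line estimates a lower grade for each additional day of absence.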

## Bivariate Descriptive Statistics

Covariance measures how the two variables in a bivariate distribution vary together; its sign shows the direction of the relationship. The symbol $\sigma _{xy}$ is used to denote the covariance between the variables x and y. The formula used to compute the covariance in terms of recorded data is

$\sigma _{xy}= \sum_{i=1}^{n}\frac{(x_{i}-\overline{x})(y_{i}-\overline{y})}{n}$

As the covariance is expressed in terms of the units of x and y, a unit-free measure is also used, which is called the correlation coefficient and denoted by $\rho$. The formula used to find the correlation coefficient is

$\rho =\frac{\sigma _{xy}}{\sigma _{x}\sigma _{y}}$

That is, the correlation coefficient $\rho$ is obtained by dividing the covariance by the product of the standard deviations of the two variables.

Interestingly, cov(x, x) is the variance of x:

$\sigma _{xx}=\sigma ^{2}_{x}$
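These formulas can be checked against the days-absent and grade data of the seven students tabulated earlier. The functions below are a minimal sketch written for this example, using the divisor n as in the formulas above; they are not taken from any library.

```python
import math

x = [3, 15, 6, 10, 11, 5, 8]          # days absent
y = [87, 42, 82, 75, 60, 92, 78]      # grade in finals

def covariance(a, b):
    """Population covariance: mean product of deviations from the means."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / n

def correlation(a, b):
    """Covariance divided by the product of the two standard deviations."""
    sigma_a = math.sqrt(covariance(a, a))   # cov(a, a) is the variance of a
    sigma_b = math.sqrt(covariance(b, b))
    return covariance(a, b) / (sigma_a * sigma_b)

print(round(covariance(x, y), 2))    # negative: more absences, lower grades
print(round(correlation(x, y), 2))
```

The correlation coefficient always lies between -1 and 1, whatever the units of x and y.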