Classification of Statistics based on data handling

Janani Nagarajan
Jul 20, 2021

Statistics is a branch of applied mathematics that deals with the collection, analysis, interpretation, and presentation of data. Data mining refers to extracting data, discovering patterns, building models, and testing those models to draw useful information from the collected data. It borrows techniques from many domains, including statistics, machine learning, data warehousing, database systems, and algorithms.

Statistics plays a crucial role in building and testing the models on which data mining tasks are carried out. It helps distinguish random noise from significant findings, which leads to better models. To understand the various statistical models, it helps to first understand the core branches of statistics.

Depending on how the data is handled, statistics can be broadly classified into:

  1. Inferential statistics, and
  2. Descriptive statistics.

Descriptive statistics:

Descriptive statistics simplifies large volumes of data in a sensible way by presenting it as tables, charts, graphs, and summary measures. These summary measures are broadly classified into:

  1. Measures of frequency: count, percent, frequency
  2. Measures of central tendency: mean, median, mode
  3. Measures of dispersion: range, variance, standard deviation
  4. Measures of position: percentile ranks, quartile ranks

In short, the full data set or a sample of it is usually presented as graphs and charts, backed by the measures above to summarize the information quantitatively.
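As a rough illustration of the four groups of measures, here is a minimal sketch using Python's standard `statistics` module. The `scores` list is made-up data (hypothetical satisfaction ratings from 1 to 5), and the variance and standard deviation are the population versions, on the assumption that the list is the whole data set we want to describe.

```python
import statistics
from collections import Counter

# Hypothetical satisfaction ratings (1-5) from ten customers; made up for illustration.
scores = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5]

# Measures of frequency: how often each rating occurs, as a count and a percentage
counts = Counter(scores)
percents = {rating: 100 * n / len(scores) for rating, n in counts.items()}

# Measures of central tendency
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Measures of dispersion (population versions, since we describe only this data)
value_range = max(scores) - min(scores)
variance = statistics.pvariance(scores)
std_dev = statistics.pstdev(scores)

# Measures of position: the three quartile cut points
q1, q2, q3 = statistics.quantiles(scores, n=4)

print("counts:", dict(counts))
print("mean:", mean, "median:", median, "mode:", mode)
print("range:", value_range, "variance:", variance, "std dev:", round(std_dev, 3))
print("quartiles:", q1, q2, q3)
```

Every number printed here describes these ten ratings only; nothing is claimed about customers who are not in the list.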

The key point about descriptive statistics is that, whether it is computed on a population or on a sample, the results apply only to the data they were computed from. Unlike inferential statistics, they cannot be extrapolated or generalized.

Before we get into inferential statistics, let’s define two terms. The “population” is the entire data set, and a “sample” is a part of that data set, selected at random from all the data available for the study. Also, note that “population” doesn’t necessarily mean people; what it refers to depends on the study.

Inferential statistics:

While descriptive statistics describes data in a sensible way, inferential statistics lets us make predictions or inferences from sample data and generalize them to the entire population under consideration. The key difference is that inferential statistics works on sample data and generalizes the findings to the entire population, whereas descriptive statistics only summarizes the data it is given, whether that is a sample or the entire population, and makes no claims beyond it.

Here we are allowed to extrapolate from the data and draw conclusions, unlike descriptive statistics, where we only summarize the data quantitatively.

Comparison with an example:

Let’s consider a shop with 1000 regular customers. To measure customer satisfaction, descriptive statistics collects data from the entire population, here all 1000 customers, and reports the results. In inferential statistics, a sample is drawn at random from the population and the results are generalized to the whole population. That is, if 85% of a random sample of 100 customers are satisfied, we estimate that roughly 85% of all 1000 customers are satisfied, within some margin of error.
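The shop example can be simulated in a few lines. This is only a sketch: the `population` list is invented so that exactly 85% of the 1000 customers are satisfied, and the seed is fixed just to make the run repeatable.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1000 customers, 1 = satisfied, 0 = not satisfied.
# The individual records are made up; only the sizes come from the example.
population = [1] * 850 + [0] * 150
random.shuffle(population)

# Descriptive approach: summarize all 1000 customers
population_rate = sum(population) / len(population)

# Inferential approach: survey a random sample of 100 customers
# and use the sample rate as an estimate for everyone
sample = random.sample(population, 100)
sample_rate = sum(sample) / len(sample)

print(f"Population satisfaction (all 1000): {population_rate:.1%}")
print(f"Sample estimate (100 customers):    {sample_rate:.1%}")
```

The sample estimate will usually land close to the population value but rarely match it exactly, which is why inferential results are reported with a margin of error.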

Now, you might think inferential statistics is less useful because it is not as exact as descriptive statistics. That’s not the case: it helps extensively in forming new conclusions from existing data. It is especially valuable when the population is not known exactly, or when it’s practically impossible to collect data from every member, say a population of 1 million records. In such cases, collecting a random sample and extrapolating the results works well. One important point is that the sample size needed to generalize fairly to the entire population is estimated beforehand, and it depends on the population size, among other things such as the precision required.
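One common way to estimate the sample size beforehand is Cochran's formula with a finite population correction. The article does not say which method it has in mind, so treat this as an assumption; the function name `required_sample_size` and its defaults (95% confidence via z = 1.96, a 5% margin of error, and the most conservative proportion p = 0.5) are illustrative choices.

```python
import math

def required_sample_size(population_size, margin_of_error=0.05, z=1.96, p=0.5):
    """Estimate how many random samples are needed to estimate a proportion.

    Assumes Cochran's formula with a finite population correction.
    z=1.96 corresponds to 95% confidence; p=0.5 is the most conservative guess.
    """
    # Sample size for an effectively infinite population
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    # Correct for the finite size of the actual population
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)

print(required_sample_size(1_000))      # shop with 1000 customers -> 278
print(required_sample_size(1_000_000))  # 1 million records -> 385
```

Under these assumptions the required sample grows very slowly with the population: going from 1000 to 1 million records raises it from 278 to only 385, which is why sampling stays practical for very large data sets.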

To make this generalization reliable, we use various tools. Some standard ones are hypothesis tests, confidence intervals, and regression analysis.
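As a small preview of one of these tools, here is a sketch of a confidence interval for the shop example, where 85 of 100 sampled customers are satisfied. It assumes the simple normal-approximation (Wald) interval at 95% confidence; the function name `proportion_confidence_interval` is hypothetical, and other interval methods exist.

```python
import math

def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    p_hat = successes / n                                # sample proportion
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)  # spread of the estimate
    margin = z * standard_error                          # margin of error
    return p_hat - margin, p_hat + margin

low, high = proportion_confidence_interval(85, 100)
print(f"95% CI for satisfaction: {low:.3f} to {high:.3f}")  # 0.780 to 0.920
```

Read this as: based on the sample, the satisfaction rate of the whole population is plausibly between about 78% and 92%, rather than exactly 85%.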

Let’s discuss the tools of inferential statistics in detail in the next post. Thanks for reading!
