IF DATA IS THE FUEL OF THE FUTURE ECONOMY: getting to know a few basic mathematical statistics terms better

Janani Nagarajan
5 min readApr 23, 2021

For most people, even the term "math" seems so abstract that we never bother to understand and explore its beauty a bit more. Math naturally demands some patience while understanding its concepts, and even more when we have to dig down or build on top of them. This is not because it's complex, full of a dozen equations and a band of numbers/data points that make no sense to the common man (obviously I don't belong to that bracket, maybe that's why I'm able to pen this down now :p), but because we as human beings are always interested in the end result and never pay much attention to the process.

As the Confucius quote goes, "Everything has beauty, but not everyone sees it." Sometimes it takes extra effort to appreciate the beauty of so-called "abstract objects"; in my opinion, they are not really as abstract as they seem.

When great algorithms are built, or when huge amounts of data are turned into useful information, we admire the final structure. But we are the same strange people who sometimes miss the clarity in the basic terms. Let's call these the prerequisite basics: terms that are at times used interchangeably and end up causing incorrect data interpretation, or that leave a void in our concepts which we commonly ignore by tagging them "simple ideas". Understanding these simple nuances helps in the long run.

So I thought, let's address a few really basic terminologies in mathematics/statistics, but this time not the way our academic books did during school days, when we solved problems just for the sake of completing the exercises (replicating my thoughts those days, lol). Instead, let's dwell on clarity and a broad idea of how things work in real life.

Let us dive into a few concepts that are indeed basic yet highly essential to understand conceptually.

MEAN :

First things first: as we all know, the mean is otherwise known as the average, where we use the formula (sum of all terms)/(total number of terms). But the thing to be understood is that it is highly sensitive in nature; in other words, the mean tends to get pulled towards extreme values. The resulting skewness can be further classified as negative skew or positive skew, and it affects the insight that we derive from the data.
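To see how sensitive the mean really is, here is a quick sketch using only Python's standard library (the salary figures are made up for illustration): a single extreme value drags the average far away from what is typical for the group.

```python
from statistics import mean

# Five similar salaries: the mean describes the group well
salaries = [30000, 32000, 31000, 29000, 33000]
print(mean(salaries))    # 31000

# Add one extreme value: the mean is pulled towards it
salaries.append(500000)
print(mean(salaries))    # ~109167, typical of no one in the group
```

Every value except the outlier sits within a few thousand of 31000, yet the new mean is more than three times larger than any of them.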

Hence, when the given data points are not evenly distributed, i.e. they contain extreme values, the resulting mean gets skewed. In such cases the median comes into the picture.

MEDIAN:

As mentioned, the median gives more reliable results in the cases above. When the data points are evenly arranged, the mean and the median become equal to each other. One common example: placement reports of various colleges quote the "median salary" as the average package, so as to avoid any skewness in the data.

Let's take a simple example for each of the cases above.

Case 1: Mean = Median

Take a few numbers: 1000, 3000, 5000, 7000, 9000, 11000.

Here median = mean = 6000 (the data is evenly distributed with a common difference of 2000).
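We can verify Case 1 with Python's built-in statistics module (just a sanity check on the numbers above):

```python
from statistics import mean, median

data = [1000, 3000, 5000, 7000, 9000, 11000]
print(mean(data))    # 6000
print(median(data))  # 6000.0 — equal to the mean, since the data is evenly spaced
```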

Case 2: Mean > Median

Take a sample placement report of a college/university for 10 random students in terms of their package per annum (INR), and let the figures be as follows: 1L, 50K, 2.5L, 7L, 10L, 70K, 9.5L, 13L, 65K, 5L (here L stands for lakh, i.e. 1,00,000, and K for thousand).

If we try to plot them, they won't form a normal distribution curve spread out symmetrically; in simple terms, the data appears skewed towards one end. Do you think computing the average would be fair here? If we call a package the "average", it should be close to what most students actually received (maybe with a slight deviation here and there), but it can't be a fair summary when the average is, say, 5L per annum and one student gets placed at 50K per annum.

Here ,

Mean = 4,98,500 INR

Median = 3,75,000 INR (the average of the two middle values, 2.5L and 5L)

Here the median seems fairer, since the lowest package is 50K while the other extreme is 13L. As the data is not evenly spread and is skewed towards the right (higher) end, we observe that the mean is noticeably higher than the median.

Case 3: Mean < Median

If the above data gets heavily extreme on the left side, that is towards the lower-end values, then the mean would turn out to be less than the median. So in both of these cases the median acts as the better indicator of a typical value.

This is why the median comes in handy. Here we have used just a small sample from a very large population to show the results, but the same reasoning holds for any data set, whether it has billions or trillions of points.

Now you might ask: is the median generally the better indicator?

No, not at all. Both have their own significance and unique value, even though we might use them interchangeably at times. The simple conclusion is that it all depends on the data points we take into consideration to arrive at better approximate results.

One tip: use the mean whenever we plan monthly expenses such as grocery purchases and electricity bill budgets, as there won't be high deviations or skewness there. Yes, there could be a small increase or decrease due to inflation in the economy or a change in our consumption, but that will not affect the mean to a large extent, unlike the previous placement example where the extreme ends caused intense skewness.

From here it is clear that the arrangement of the data points guides us to the better choice. So the next time you come across any set of data, whether planning a monthly budget or doing something as big as analysing huge data sets, just take a moment and reflect on which would be the better indicator.

Before ending this topic, let's touch upon two final ideas: standard deviation and the skewness we have been discussing all along.

When data points are plotted as a normal distribution graph, the mean sits at the middle, highest point of the curve, and the standard deviation represents the spread of the data points. In other words, it shows how skinny or wide the curve is.
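As a small illustration of spread (the two data sets here are made up for the example), two samples can share the same mean while having very different standard deviations:

```python
from statistics import mean, pstdev

narrow = [48, 49, 50, 51, 52]   # tightly clustered around 50
wide   = [10, 30, 50, 70, 90]   # same mean, much more spread out

print(mean(narrow), pstdev(narrow))  # 50 and ~1.41  (a skinny curve)
print(mean(wide), pstdev(wide))      # 50 and ~28.28 (a wide curve)
```

The mean alone cannot tell these two sets apart; the standard deviation is what captures how far typical values sit from the centre.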

Hence the mean, median, and deviations play an elementary but indispensable part in any graphical representation whenever we need to draw conclusions from a set of data that is continuous in nature (which reflects real-world data, as it is not discrete in most cases).

MODE :

Let's discuss this final concept, which is naturally intertwined with the concepts above. The mode represents the value that occurs most frequently in a data set.

For example, when an algorithm is built for an online retail service, it tracks consumer behaviour and displays recommendations accordingly. It will also match items: when you shop for bread, it might recommend that you purchase milk or some other commodity, based on matches in the past data of other consumers, meaning that most of them bought milk + bread. Hence it recommends on the guess that you will probably need the same set of commodities.

Such systems rely on frequency, essentially the mode, as a fundamental criterion in machine learning tasks such as clustering data to identify unique patterns or behaviours.
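A minimal sketch of that frequency idea (the item names and baskets here are entirely made up for illustration): count how often items appear together in past purchases, then recommend the most frequent companion, i.e. the mode of the co-purchased items.

```python
from collections import Counter

# Hypothetical past baskets: items bought together in one purchase
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

def most_frequent_companion(item):
    """Return the item most often bought alongside `item` (its mode)."""
    companions = Counter()
    for basket in baskets:
        if item in basket:
            companions.update(basket - {item})
    return companions.most_common(1)[0][0]

print(most_frequent_companion("bread"))  # milk — bought with bread 3 times
```

Real recommendation engines are far more sophisticated, but at their core many start from exactly this kind of co-occurrence counting.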

Hope this article adds some value for you. Happy learning!
