Data Standardization

Janani Nagarajan
3 min read · May 20, 2022
https://unsplash.com/photos/6EnTPvPPL6I

Data standardization is one of the most important concepts in predictive modeling. Normalization and standardization are the two techniques most commonly used to scale data. This article gives a brief overview of how, when, and why standardization is carried out.

When the continuous independent variables in a data set are measured on different scales, they do not contribute equally to the analysis: variables with larger numeric ranges tend to dominate. Another scenario is an organization that draws data from a number of sources, including data warehouses, cloud storage, and databases. Here too, data standardization becomes indispensable.
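To make the unequal-contribution point concrete, here is a minimal sketch using made-up values (the people, ages, and incomes are hypothetical and only for illustration). It computes plain Euclidean distances on unscaled data, where income in dollars swamps age in years.

```python
import math

# Hypothetical data: (age in years, annual income in dollars).
alice = (25, 50_000)
bob = (60, 51_000)    # 35 years older than Alice, similar income
carol = (26, 58_000)  # 1 year older than Alice, higher income

# Raw Euclidean distances on the unscaled values.
d_alice_bob = math.dist(alice, bob)      # sqrt(35**2 + 1000**2)
d_alice_carol = math.dist(alice, carol)  # sqrt(1**2 + 8000**2)

# The income difference dominates: Alice looks "closer" to Bob,
# even though the two differ in age by 35 years.
print(d_alice_bob < d_alice_carol)  # True
```

Because the income column has a much larger numeric range, the age column has almost no influence on the distance until both are put on a common scale.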

In a nutshell, data standardization is the process of converting data to a common format so that users can process and analyze it. In statistics, it is the process of putting different variables on the same scale.

Standardization allows us to directly compare data measured in different units and draw conclusions from the comparison. It makes the data easier to work with by bringing every variable onto the same scale.

The z-score is one of the most popular methods to standardize data. To compute it, we calculate the mean and standard deviation of each variable.

The mean, as we know, is the simple average of the available data points, a measure of central tendency.

Standard deviation is also a summary measure, but instead of the center it indicates how much the data is spread around the mean. For example: are all your scores close to the average, or are lots of scores way above (or way below) it? Technically, it is the square root of the variance, where the variance is the average of the squared differences from the mean.
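The definitions above can be sketched in a few lines of Python. The scores are made-up sample values, and the sketch uses the population form (dividing by n), which matches the "average of the squared differences" wording; the sample standard deviation would divide by n - 1 instead.

```python
import math

# Hypothetical test scores, chosen so the arithmetic is easy to follow.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Mean: the simple average of the data points.
mean = sum(scores) / len(scores)

# Variance: the average of the squared differences from the mean.
squared_diffs = [(x - mean) ** 2 for x in scores]
variance = sum(squared_diffs) / len(scores)

# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)  # 5.0 4.0 2.0
```

Here the scores sit, on average, about 2 points away from the mean of 5.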

Z-scores (also called standard scores) are the values rescaled so that they are centered on the mean with unit standard deviation. That is, if we calculate the mean and standard deviation of the standard scores, they will be 0 and 1 respectively. If the original data is normally distributed, the z-scores follow the standard normal distribution (bell curve). A z-score tells us how far a value is from the mean, in units of standard deviation.

Normal distribution curve. Image by author
Z-score formula. Image by author

Here, z = (X - µ) / σ, where X = value of the data point, µ = mean, and σ = standard deviation.
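As a rough sketch of the formula in code, the helper below (the name `z_scores` and the sample scores are assumptions made for illustration) standardizes a list of values with Python's standard `statistics` module, using the population standard deviation. It then checks that the resulting standard scores have mean 0 and standard deviation 1.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / std for each x, using the population std."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # All values are identical, so the z-score is undefined.
        raise ValueError("standard deviation is zero")
    return [(x - mu) / sigma for x in values]

# Hypothetical scores with mean 5 and standard deviation 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(scores)

print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

# The standardized values are centered on 0 with unit spread.
z_mean = statistics.fmean(z)
z_std = statistics.pstdev(z)
```

The score 9 maps to a z-score of 2.0, meaning it lies 2 standard deviations above the mean.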

The z-score puts the variables on a single common scale and so facilitates further analysis. A z-score of 2 means the value is 2 standard deviations above the mean of the data set. We typically use standardization when the data has outliers or no natural minimum and maximum, or when it roughly follows a normal distribution. If instead we need to shift and rescale the data points into the range [0, 1], we use normalization (min-max scaling).
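To contrast the two techniques, here is a minimal side-by-side sketch. The function names `min_max_scale` and `standardize` and the data are hypothetical, chosen only for illustration, and the standardization again uses the population standard deviation.

```python
import statistics

def min_max_scale(values):
    """Normalization: rescale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical")
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Standardization: rescale values to mean 0 and std 1."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero")
    return [(x - mu) / sigma for x in values]

data = [10, 20, 30, 40, 50]

scaled = min_max_scale(data)
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]

standardized = standardize(data)
print([round(v, 3) for v in standardized])
# [-1.414, -0.707, 0.0, 0.707, 1.414]
```

The min-max result always stays within [0, 1], while the standardized values are centered on 0 and can fall outside that range in either direction.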

Unlike normalization, where the minimum and maximum values bound the result to a fixed range, standardization has no such bounds: the outcome can be any real number.

More broadly, data standardization is about making sure that the data is internally consistent, that is, each type of data has the same content and format, which makes it ready for analysis.

Thanks for reading & hope this article adds value to you. Happy learning!
