Regression Analysis

Janani Nagarajan
5 min read · Mar 14, 2022

Regression analysis is one of the most basic tools used for prediction. It is an estimation process that turns a set of scattered data points into an equation that models the whole set. In a nutshell, regression analysis is a technique that helps us analyze the relationship between variables. Fitting such data points to a curve or a line is called curve fitting, and the fitted line is known as the regression line.

The main goals of regression analysis are:

i) To measure the influence of one or more variables on another variable. Technically, this means using the values of an independent variable x to estimate the value of the dependent variable y.

The variable that we want to predict is called the dependent variable, and the variables that we use for prediction are called independent variables.

ii) Prediction, i.e., the ability to estimate the quantity we are after by using one or more available independent variables to estimate the dependent variable.

Let's understand this better with an example. Suppose we have a table with an independent data point (month) and a dependent data point (revenue) for a shoe company, and we are asked to predict the revenue of this same company in its 30th month. We can use regression analysis to fit the data and turn it into useful information.
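The shoe-company idea above can be sketched in a few lines of code. The revenue figures here are made up purely for illustration; we fit a straight line to the monthly data by least squares and extrapolate it to month 30:

```python
import numpy as np

# Hypothetical monthly revenue figures (in thousands) for the shoe example.
# These numbers are illustrative, not from the article.
months = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
revenue = np.array([12, 15, 14, 18, 21, 23, 22, 27, 29, 31], dtype=float)

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(months, revenue, 1)

# Extrapolate the fitted line to month 30
predicted_30 = slope * 30 + intercept
print(f"month 30 revenue estimate: {predicted_30:.1f}")
```

With this toy data the fitted slope is about 2.1, so the model projects revenue of roughly 73 at month 30. Extrapolating this far beyond the observed range should, of course, be done with caution.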


Linear regression and logistic regression are the two most prominent regression techniques. There are also many other types of regression analysis, whose usage differs according to the kind and size of the data available and which technique gives the best accuracy for the requirement. In this article, we'll build a brief understanding of the two techniques above.


Linear Regression:

When we use one independent variable to infer the dependent variable, it is known as simple linear regression; when several independent variables are used to infer the dependent variable, it is multiple linear regression.

Simple Linear Regression

Here, X is the independent variable and Y is the dependent variable. For example, we might estimate the relationship between weekly working hours and the wages of the employees in a given data set, and then extrapolate to predict, just as we did for the shoe-company example above. Another example could be estimating the number of days a patient has to stay in the hospital after surgery (dependent variable) with respect to the age of the patient (independent variable).

Note that here we have only one independent variable in the process, and this is known as simple linear regression. The calculation can be done using the least-squares method.

This is not an ordinary line equation, where every point satisfies the equation exactly; it is an estimating line of the form y = b0 + b1·x + ε.

Here ε is the error, calculated as the difference between the actual value and the estimated value.


This ŷ equation forms the equation of the regression line: when a new x value is substituted into it, the equation returns the corresponding point on the fitted line. This model is susceptible to outliers, so extreme values should be handled with care before fitting.
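The least-squares coefficients mentioned above have closed-form formulas: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept follows from the means. A minimal sketch on small made-up data (a useful sanity check is that least-squares residuals always sum to zero):

```python
import numpy as np

# Closed-form least-squares estimates for y-hat = b0 + b1*x
# (small illustrative data, not from the article)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()
b1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()  # slope
b0 = y_bar - b1 * x_bar                                            # intercept

y_hat = b0 + b1 * x        # estimated values on the regression line
residuals = y - y_hat      # the epsilon terms: actual minus estimated
print(b1, b0, residuals.sum())
```

For this data the slope comes out to 1.99, and the residuals sum to (numerically) zero, as the least-squares method guarantees.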

Multiple Linear Regression

When several independent variables are used to infer the dependent variable, it is called multiple linear regression. For example, estimating the relationship between weekly working hours, age, and qualification (independent variables) and the wages of an employee (dependent variable).


In general, in linear regression the dependent variable is metric and continuous. When we deal with a discrete or categorical dependent variable, we turn to logistic regression.

Logistic Regression:

The goal of logistic regression is to estimate the probability of an occurrence, not the value of the variable itself. Here the dependent variable is discrete/dichotomous, as the output of the model can be a Yes/No, a Male/Female, and the like.

When choosing this technique, it should be noted that the data set should be large. Just like multiple linear regression, we have several independent variables to deal with, and importantly, there should be no correlation between the independent variables, i.e., no multicollinearity.

Correlation determines how strong the relationship between variables x and y is. The value of the correlation coefficient r always falls within the interval [-1, 1]. When r = -1, a regression line with negative slope perfectly describes the data; when r = 1, a line with positive slope does; and when r = 0, the data may be a big blob (no association) or parabolic, i.e., the relationship is non-linear.
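Both extremes from the paragraph above are easy to demonstrate: a perfectly linear relationship gives r = 1, while a parabola centered on the data gives r = 0 even though the variables are clearly related. A short sketch with constructed data:

```python
import numpy as np

# Pearson correlation r for two constructed relationships
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_linear = 2 * x + 1        # perfect positive linear relationship
y_parabola = (x - 3) ** 2   # non-linear (parabolic) relationship, symmetric in x

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_parabola = np.corrcoef(x, y_parabola)[0, 1]
print(r_linear, r_parabola)   # 1.0 and 0.0
```

The second result is the cautionary case: r = 0 does not mean "no relationship," only "no linear relationship," which is why plotting the data before fitting is always worthwhile.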

For example: do the weekly working hours and age of employees (independent variables) influence the probability that they are at risk of burn-out? The output here is a Yes/No, so it falls under the logistic regression model. Similarly, do age, gender, and smoking habit (independent variables) result in a particular disease (dependent variable)? This too is answered by Yes/No.

Will a person vote for a certain party in the election? Will the customer buy this newly launched product? All these predictions result in a dichotomous output and hence fall under the logistic regression model.


Here z = b1·x1 + b2·x2 + … + bk·xk + ε, just as in the multiple linear regression representation. One more difference: in linear regression the predicted value can range over ±∞, but in logistic regression the prediction is restricted to the range between 0 and 1.
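The squashing from ±∞ into (0, 1) is done by the logistic (sigmoid) function, 1 / (1 + e^(−z)). A minimal sketch of the burn-out example, with made-up coefficients (in practice they would be estimated from data, e.g. by maximum likelihood):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical, hand-picked weights for weekly hours and age (not fitted)
b_hours, b_age = 0.08, -0.05

hours, age = 55, 30
z = b_hours * hours + b_age * age   # 0.08*55 - 0.05*30 = 2.9
p_burnout = sigmoid(z)
print(f"estimated burn-out probability: {p_burnout:.3f}")
```

The output is a probability near 0.95 for this profile, which would typically be thresholded (e.g. at 0.5) to produce the Yes/No answer the article describes.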

Hope this article gives you a brief, basic understanding of regression analysis. Happy learning!
