Regression in Data Mining
Regression is a data mining function that predicts a number. Age, weight, distance, temperature, income, or sales could all be predicted using regression techniques. For example, a regression model could be used to predict children's height, given their age, weight, and other factors. Regression involves a predictor variable (whose values are known) and a response variable (whose values are to be predicted).
The two basic types of regression are:
1. Linear regression
- It is the simplest form of regression. Linear regression attempts to model the relationship between two variables by fitting a linear equation to the observed data.
- Linear regression attempts to find the mathematical relationship between variables.
- If the fitted outcome is a straight line, the model is considered linear; if it is a curved line, the model is non-linear.
- The relationship with the dependent variable is given by a straight line, and the model has only one independent variable:
Y = α + βX
- The model states that 'Y' is a linear function of 'X'.
- The value of 'Y' increases or decreases linearly as the value of 'X' changes.
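The least-squares fit of this straight-line model can be sketched in plain Python; the data points below are invented for illustration and lie exactly on the line Y = 2 + 3X:

```python
# A minimal sketch of fitting Y = alpha + beta*X by ordinary least squares.
# The sample data points are made up for illustration.

def fit_simple_linear(xs, ys):
    """Return (alpha, beta) minimizing squared error for y = alpha + beta*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # beta = covariance(x, y) / variance(x)
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
         / sum((x - mean_x) ** 2 for x in xs)
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Example: points lying exactly on y = 2 + 3x
alpha, beta = fit_simple_linear([1, 2, 3, 4], [5, 8, 11, 14])
print(alpha, beta)  # → 2.0 3.0
```

Because the example points fall exactly on a line, the fitted intercept and slope recover 2 and 3; with noisy data the same formulas return the best-fitting line in the least-squares sense.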
2. Multiple regression model
- Multiple linear regression is an extension of linear regression analysis.
- It uses two or more independent variables to predict an outcome and a single continuous dependent variable.
Y = a0 + a1X1 + a2X2 + … + akXk + e
'Y' is the response variable.
X1, X2, …, Xk are the independent predictors.
'e' is the random error.
a0, a1, a2, …, ak are the regression coefficients.
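A least-squares fit of this multiple regression model can be sketched in plain Python by solving the normal equations; the data below are invented and generated without noise, so e = 0 and the true coefficients are recovered exactly:

```python
# Hedged sketch: fitting Y = a0 + a1*X1 + a2*X2 by least squares,
# solving the normal equations (X^T X) a = X^T Y with a small
# Gaussian elimination. Data are invented for illustration.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_multiple(rows, ys):
    """rows: list of predictor tuples; a column of 1s is prepended for a0."""
    X = [[1.0, *row] for row in rows]
    k = len(X[0])
    XtX = [[sum(xi[p] * xi[q] for xi in X) for q in range(k)] for p in range(k)]
    Xty = [sum(X[i][p] * ys[i] for i in range(len(X))) for p in range(k)]
    return solve(XtX, Xty)

# Points generated from y = 1 + 2*x1 + 3*x2 (no noise, so e = 0)
coef = fit_multiple([(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)],
                    [1, 3, 4, 6, 8])
print(coef)  # ≈ [1.0, 2.0, 3.0]
```

In practice the normal equations are solved with a numerically stabler decomposition (QR or SVD), but the small direct solve above shows the idea.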
Top 6 Regression Algorithms Used In Data Mining And Their Applications In Industry
1. Simple Linear Regression model: Simple linear regression is a statistical method that enables users to summarize and study relationships between two continuous (quantitative) variables. Linear regression is a linear model, that is, a model that assumes a linear relationship between the input variables (x) and the single output variable (y), so that y can be calculated from a linear combination of the input variables (x). When there is a single input variable (x), the method is called simple linear regression. When there are multiple input variables, the procedure is referred to as multiple linear regression.
Application: Some of the most popular applications of the linear regression algorithm are in financial portfolio prediction, salary forecasting, real estate valuation, and traffic analysis for arriving at ETAs.
2. Lasso Regression algorithm: Lasso (Least Absolute Shrinkage and Selection Operator) regression is a linear regression technique that adds an L1 penalty to the least-squares objective, shrinking some coefficients exactly to zero and thereby performing variable selection.
Application: Lasso regression algorithms have been widely used in financial networks and economics. In finance, they are applied to forecasting probabilities of default, and Lasso-based forecasting models are used in assessing enterprise-wide risk frameworks. Lasso-type regressions are also used in stress-test platforms to analyze multiple stress scenarios.
3. Logistic regression: One of the most commonly used regression techniques in the industry, applied extensively in fraud detection, credit card scoring and clinical trials, and generally wherever the response is binary. One of the major upsides of this popular algorithm is that one can include more than one independent variable, which can be continuous or dichotomous. The other major advantage of this supervised machine learning algorithm is that it provides a quantified value for the strength of association between each variable and the outcome. Despite its popularity, researchers have pointed out its limitations, citing a lack of robustness and a strong dependency on the model being correctly specified.
Application: Today enterprises deploy logistic regression to predict binary outcomes, such as whether a customer can buy/will buy, whether a policyholder will renew, or whether a transaction is fraudulent; continuous outcomes such as house values or customer lifetime value call for linear rather than logistic regression.
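A logistic regression for a binary buy/no-buy outcome can be sketched in plain Python with gradient descent; the single feature, data points, and learning rate below are illustrative assumptions:

```python
# Hedged sketch of logistic regression for a binary outcome
# (e.g. buy = 1 / no-buy = 0), trained with plain gradient descent.
# The data set and learning rate are invented for illustration.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Single-feature model: P(y=1 | x) = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average cross-entropy loss
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Small separable data set: larger x means class 1 (will buy)
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(sigmoid(w * 1.0 + b) < 0.5, sigmoid(w * 3.5 + b) > 0.5)  # → True True
```

The model outputs a probability; thresholding it at 0.5 gives the binary yes/no prediction.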
4. Support Vector Regression algorithm: Support vector regression applies the support vector machine idea to regression, fitting a function that keeps most training points within a margin of tolerance around the prediction.
Application: Support vector machine regression algorithms have found several applications in the oil and gas industry, in classification of images and text, and in hypertext categorization. In the oilfields, the technique is specifically leveraged in exploration to understand the position of layers of rock and to create 2D and 3D models as a representation of the subsoil.
5. Multivariate Regression algorithm: This technique is used when there is more than one response variable; when the model also has more than one predictor variable, it is called multivariate multiple regression. Termed one of the simplest supervised machine learning algorithms by researchers, this regression algorithm is used to predict the response variables for a set of explanatory variables. It can be implemented efficiently with matrix operations; in Python, this can be done via the "numpy" library, which provides definitions and operations for matrix objects.
Application: Industry application of the multivariate regression algorithm is seen heavily in the retail sector, where customers base their choices on a number of variables such as brand, price and product. Multivariate analysis helps decision makers find the best combination of factors to increase footfall in the store.
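The matrix-based numpy implementation mentioned above can be sketched as follows; two response variables are predicted from two explanatory variables plus an intercept, with invented noise-free data so the true coefficients are recovered:

```python
# Hedged sketch of multivariate multiple regression using numpy matrix
# operations: two responses predicted from two explanatory variables
# plus an intercept. The data are invented for illustration.
import numpy as np

# Explanatory variables with an intercept column of 1s prepended
X = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 1, 1],
              [1, 2, 1]], dtype=float)

# Two responses: y1 = 1 + 2*x1 + 3*x2 and y2 = -1 + x1 + 4*x2 (noise-free)
Y = np.column_stack([
    1 + 2 * X[:, 1] + 3 * X[:, 2],
    -1 + X[:, 1] + 4 * X[:, 2],
])

# One least-squares solve fits a coefficient column per response at once
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.round(B, 6))
```

Each column of the coefficient matrix B corresponds to one response variable, which is what makes the multivariate case a direct extension of the single-response matrix solution.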
6. Multiple Regression Algorithm: This regression algorithm has several applications across the industry, including product pricing, real estate pricing, and marketing analyses of campaign impact. Unlike the simple linear regression technique, multiple regression is a broader class of regressions that encompasses linear and nonlinear models with multiple explanatory variables.