A Variance Inflation Factor (VIF) value of 1 indicates no correlation between the independent variable in question and the rest of the model, which is ideal. As the VIF increases, so does the level of multicollinearity, with values above roughly 5 or 10 typically signaling problematic levels that can distort regression results. One of the most direct impacts of multicollinearity is a loss of precision in the estimated coefficients. This loss shows up as inflated standard errors, which makes it harder to determine whether an independent variable is statistically significant. The significance of multicollinearity extends beyond theoretical concerns; it has practical implications in the real world. When independent variables are not distinctly separable because of their inter-correlations, the stability and interpretability of the coefficient estimates are compromised.
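For reference, the standard formula behind these statements (written in general notation, not taken verbatim from this article) shows exactly how the VIF enters the variance of an OLS coefficient:

```latex
\operatorname{Var}\!\left(\hat{\beta}_j\right)
  = \frac{\sigma^2}{(n-1)\,\operatorname{Var}(x_j)} \cdot \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

Here R²ⱼ is the R-squared from regressing the j-th predictor on all the other predictors; the closer it is to 1, the more the variance, and hence the standard error, of the j-th coefficient is inflated.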
Linear Combination of Variables
This approach not only reduces multicollinearity but also helps extract the most relevant features from a set of variables, thereby improving the model’s efficiency. For instance, if a market research survey asks multiple questions that are closely related (such as different aspects of customer satisfaction), the responses may be highly correlated. That correlation becomes embedded in the dataset, creating multicollinearity that can skew analytical outcomes. This form of multicollinearity was noted by Thorndike as far back as 1920 and is known colloquially as the “halo effect”. This type of multicollinearity is a consequence of the way the data or the model is structured.
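As a rough sketch of the idea (the column names and data below are hypothetical), a set of correlated satisfaction ratings can be collapsed into a single linear combination with principal component analysis before fitting the regression:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical survey data: three closely related satisfaction ratings
rng = np.random.default_rng(0)
base = rng.normal(size=200)
survey = pd.DataFrame({
    "sat_support": base + rng.normal(scale=0.2, size=200),
    "sat_product": base + rng.normal(scale=0.2, size=200),
    "sat_pricing": base + rng.normal(scale=0.2, size=200),
})

# Replace the three correlated columns with a single linear combination
pca = PCA(n_components=1)
survey["satisfaction_index"] = pca.fit_transform(survey)[:, 0]

# The new index captures most of the shared variation in the original items
print(pca.explained_variance_ratio_)
```

The combined index can then stand in for the original items in the regression, removing the near-duplicate information that caused the multicollinearity.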
An alternative method for addressing multicollinearity is to collect more data under different conditions. Consider, for example, a multivariate regression model that attempts to anticipate stock returns based on metrics such as the price-to-earnings (P/E) ratio, market capitalization, and other financial data. The stock return is the dependent variable (the outcome), and the various pieces of financial data are the independent variables. In general, multicollinearity leads to wider confidence intervals, which makes conclusions about the effect of each independent variable in the model less reliable.
Highly Correlated Independent Variables
Consider a pricing research study that uses regression analysis to predict housing prices from features such as size, age, and proximity to the city center. For a given regressor, the VIF equals 1 / (1 − R²), where R² comes from regressing that variable on the other regressors. In the limit, when this R² tends to 1, that is, in the case of perfect multicollinearity examined above, the VIF tends to infinity. If the regressor being examined is highly correlated with a linear combination of the other regressors, then this R² is close to 1 and the variance inflation factor is large. When this happens, the OLS estimator of the regression coefficients tends to be very imprecise, that is, it has high variance, even if the sample size is large. For example, the standard errors for Weight and Height are much larger in a model that also contains BMI, because BMI is itself a function of weight and height. Interpreting a regression coefficient as the effect of a single predictor assumes that we’re able to change the values of that predictor without changing the values of the other predictor variables, an assumption that multicollinearity undermines.
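A minimal simulation illustrates the point (the data are synthetic and not from the original article): fit the same outcome on Weight and Height, with and without a BMI-like variable derived from them, and compare the coefficient standard errors.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Synthetic predictors: weight (kg), height (m), and BMI derived from them
height = rng.normal(1.70, 0.10, n)
weight = rng.normal(70, 10, n)
bmi = weight / height**2

# Synthetic outcome depending on weight and height plus noise
y = 2.0 * weight - 10.0 * height + rng.normal(0, 5, n)

def ols_std_errors(X, y):
    """OLS coefficient standard errors via sigma^2 * (X'X)^-1."""
    X = np.column_stack([np.ones(len(y)), X])      # add intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return np.sqrt(np.diag(cov))[1:]               # drop the intercept

print("SEs without BMI:", ols_std_errors(np.column_stack([weight, height]), y))
print("SEs with BMI:   ", ols_std_errors(np.column_stack([weight, height, bmi]), y))
```

Because BMI is almost a linear function of weight and height over this range, the standard errors for Weight and Height jump noticeably in the second fit.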
- This condition can result in inflated standard errors, leading to less statistically significant coefficients.
- Identifying the root causes of multicollinearity is crucial for effectively managing its impact in regression analysis.
- One method for detecting whether multicollinearity is a problem is to compute the variance inflation factor, or VIF (a short computational sketch follows this list).
- Imperfect multicollinearity refers to a situation where the predictor variables have a nearly exact, but not perfectly exact, linear relationship.
- This indicates that they’re likely suffering from multicollinearity and that their coefficient estimates and p-values are likely unreliable.
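As a sketch of that VIF check (synthetic data, plain NumPy rather than any particular statistics package), each predictor is regressed on the remaining predictors and the result is plugged into 1 / (1 − R²):

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress it on the other columns, use 1/(1 - R^2)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # intercept + other predictors
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Example: x3 is almost a linear combination of x1 and x2
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = 0.7 * x1 + 0.3 * x2 + rng.normal(scale=0.05, size=300)
print(vif(np.column_stack([x1, x2, x3])))   # all three predictors show inflated VIFs
```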
This tutorial explains why multicollinearity is a problem, how to detect it, and how to resolve it. This section delves into the different types of multicollinearity and provides real-world examples to illustrate their impacts on statistical models. If there is only moderate multicollinearity, you likely don’t need to resolve it in any way.
Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model; it indicates that these collinear variables are not truly independent. Financial data offer a natural example: the stocks of businesses that have performed well attract investor confidence, which increases demand for the company’s stock and in turn its market value, so many firm-level metrics tend to move together. VIF quantifies the extent to which the variance of an estimated regression coefficient is increased because of multicollinearity.
- For investing, multicollinearity is a common consideration when performing technical analysis to predict probable future price movements of a security, such as a stock or a commodity future.
- If both are included in a regression model aiming to predict customer retention, it may be difficult to determine the distinct impact of each factor.
- In the case of perfect multicollinearity, at least one regressor is a linear combination of the other regressors (a quick numerical illustration follows this list).
- It’s important for analysts to consider the context of their specific analysis, as different fields may have different thresholds for acceptable VIF levels.
- You can also try to combine or transform the offending variables to lower their correlation.
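To make the perfect-multicollinearity case concrete, here is a small NumPy check (synthetic data, not from the original article) showing that when one column is an exact linear combination of the others, the design matrix loses rank and the usual OLS computation breaks down:

```python
import numpy as np

rng = np.random.default_rng(7)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 2.0 * x1 - 0.5 * x2        # exact linear combination of x1 and x2

X = np.column_stack([x1, x2, x3])

# Only 2 of the 3 columns are linearly independent, so X'X is (numerically) singular
print(np.linalg.matrix_rank(X))        # 2, not 3
print(np.linalg.cond(X.T @ X))         # enormous condition number: coefficients are not identifiable
```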
Trying many different models or estimation procedures (for example, ordinary least squares, ridge regression, and so on) until finding one that can “deal with” the collinearity creates a forking-paths problem. P-values and confidence intervals derived from such post hoc analyses are invalidated because the uncertainty in the model-selection procedure is ignored. There are many ways to prevent multicollinearity from affecting results by planning ahead of time; however, these methods all require the researcher to decide on a procedure and analysis before the data have been collected (see post hoc analysis).
Detecting multicollinearity is a critical step in ensuring the reliability of regression analyses, and one of the most effective tools for this purpose is the Variance Inflation Factor (VIF). This section explains how VIF is used to measure the level of multicollinearity among independent variables in a regression model, and demonstrates how to interpret its values to assess the severity of multicollinearity. Multicollinearity impacts the coefficient estimates and the p-values, but it doesn’t impact predictions or goodness-of-fit statistics. Two relatively simple tools for measuring multicollinearity are a scatter plot and correlation matrix of independent variables.
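As a quick sketch of those simpler checks (the column names and data are hypothetical, with pandas and matplotlib assumed), a correlation matrix and a scatter-plot matrix of the independent variables can flag highly correlated pairs before any VIFs are computed:

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Hypothetical predictors for a housing-price model
rng = np.random.default_rng(3)
size_sqm = rng.normal(120, 30, 200)
rooms = size_sqm / 25 + rng.normal(0, 0.5, 200)     # strongly tied to size
age_years = rng.uniform(0, 60, 200)

X = pd.DataFrame({"size_sqm": size_sqm, "rooms": rooms, "age_years": age_years})

# Pairwise correlations: values near +/-1 point to potential multicollinearity
print(X.corr().round(2))

# A scatter-plot matrix shows the same relationships visually
scatter_matrix(X, figsize=(6, 6))
plt.show()
```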
The most extreme case is that of perfect multicollinearity, in which at least one regressor can be expressed as a linear combination of the other regressors. Short of that extreme, a VIF of 100, for example, would mean that the other predictors explain 99% of the variation in the given predictor. As a rule of thumb, a VIF above 5 or 10 indicates that the multicollinearity might be problematic.
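The rule-of-thumb thresholds map directly onto the R-squared from regressing one predictor on the others; the following arithmetic (standard algebra, not specific to this article) makes the correspondence explicit:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\quad\Longrightarrow\quad
R_j^2 = 0.80 \;\Rightarrow\; \mathrm{VIF}_j = 5,
\qquad
R_j^2 = 0.90 \;\Rightarrow\; \mathrm{VIF}_j = 10,
\qquad
R_j^2 = 0.99 \;\Rightarrow\; \mathrm{VIF}_j = 100
```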