
Regression

tags: Statistics, Bayesian Statistics

Introduction

Regression analysis is a conceptually simple method for investigating functional relationships between variables. The relationship is expressed in the form of an equation or a model connecting the response or dependent variable and one or more explanatory or predictor variables.

We denote the response variable by \(Y\), and the set of predictor variables by \(X_1, X_2, \ldots, X_p\), where \(p\) denotes the number of predictor variables.

The general regression model is specified as:

\[ Y = f(X_1, X_2, \ldots, X_p) + \epsilon \]

An example is the linear regression model:

\[ Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \]

where \(\beta_0, \ldots, \beta_p\), called the regression parameters or coefficients, are unknown constants to be estimated from the data, and \(\epsilon\) is the random error representing the discrepancy in the approximation.

Steps in Regression Analysis

1. Statement of the problem

Formulation of the problem includes determining the questions to be addressed.

2. Selection of Potentially Relevant Variables

We select a set of variables that are thought by the experts in the area of study to explain or predict the response variable.

Each of these variables could be qualitative or quantitative. A technique used in cases where the response variable is binary is called logistic regression. If all predictor variables are qualitative, the techniques used in the analysis of the data are called analysis of variance techniques. If some of the predictor variables are quantitative while others are qualitative, the regression analysis in these cases is called analysis of covariance.

| Type | Conditions |
|------|------------|
| Univariate | Only one quantitative response variable |
| Multivariate | Two or more quantitative response variables |
| Simple | Only one predictor variable |
| Multiple | Two or more predictor variables |
| Linear | All parameters enter the equation linearly, possibly after transformation of the data |
| Non-linear | The relationship between the response and some of the predictors is non-linear (no transformation can make the parameters appear linearly) |
| Analysis of variance | All predictors are qualitative variables |
| Analysis of covariance | Some predictors are quantitative variables, others are qualitative |
| Logistic | The response variable is qualitative |

3. Model specification

The form of the model that is thought to relate the response variable to the set of predictor variables can be specified initially by experts in the area of study based on their knowledge or their objective and/or subjective judgments.

We need to select the form of the function \(f(X_1, X_2, \ldots, X_p)\) in the general regression model above.

4. Method of Fitting

We want to perform parameter estimation or model fitting after defining the model and collecting the data. The most commonly used method of estimation is the least squares method. Other estimation methods we consider are the maximum likelihood method, ridge regression, and the principal components method.

5. Model Fitting

The estimates of the regression parameters \(\beta_0, \beta_1, \ldots, \beta_p\) are denoted by \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p\). The fitted value \(\hat{Y}\) denotes the predicted value.

6. Model Criticism and Selection

The validity of statistical methods depends on certain assumptions about the data and the model. We need to address the following questions:

  1. What are the required assumptions?
  2. For each of these assumptions, how do we determine if they are valid?
  3. What can be done in cases where assumptions do not hold?
Figure 1: Flowchart illustrating the dynamic iterative regression process

Simple Linear Regression

Simple linear regression is a straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X. Mathematically, we write this linear relationship as:

\[ Y \approx \beta_0 + \beta_1 X \]

We describe this as regressing Y on X.

We wish to measure both the direction and strength of the relationship between \(Y\) and \(X\). Two related measures, known as the covariance and the correlation coefficient, are developed later.

We use our training data to produce estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) for the model coefficients, and we can predict outputs by computing:

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]

Let \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\) be the prediction of \(Y\) based on the ith value of \(X\). Then \(e_i = y_i - \hat{y}_i\) represents the ith residual. We can define the residual sum of squares (RSS) as:

\[ \text{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2 \]

The least squares approach chooses \(\hat{\beta}_0\) and \(\hat{\beta}_1\) to minimise the RSS.
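A minimal sketch of these closed-form least squares estimates, assuming NumPy and using synthetic data purely for illustration (the true coefficients and variable names below are not from the source):

```python
# Minimal sketch: simple linear regression fit by least squares, using the
# closed-form estimates for beta0 and beta1. Synthetic data for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=50)   # assumed true beta0=2, beta1=0.5

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

residuals = y - (beta0_hat + beta1_hat * x)
rss = np.sum(residuals ** 2)                          # the quantity being minimised
print(beta0_hat, beta1_hat, rss)
```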

If f is to be approximated by a linear function, then we can write the relationship as:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

where ϵ is an error term – the catch-all for what we miss with this simple model. For example, the true relationship is probably not linear, and other variables cause variation in Y. It is typically assumed that the error term is independent of X.

We need to assess the accuracy of our estimates. How far off will a single estimate \(\hat{\mu}\) be? In general, we can answer this question by computing the standard error of \(\hat{\mu}\), written as \(\text{SE}(\hat{\mu})\). This is given by:

\[ \text{Var}(\hat{\mu}) = \text{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n} \]

where \(\sigma\) is the standard deviation of each of the realizations \(y_i\) of \(Y\).

Standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. For linear regression, the 95% confidence interval for \(\beta_1\) approximately takes the form:

\[ \hat{\beta}_1 \pm 2 \cdot \text{SE}(\hat{\beta}_1) \]

Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis of:

\[ H_0: \text{There is no relationship between } X \text{ and } Y \quad (\beta_1 = 0) \]

versus the alternative hypothesis:

\[ H_a: \text{There is a relationship between } X \text{ and } Y \quad (\beta_1 \neq 0) \]

To test the null hypothesis, we need to determine whether \(\hat{\beta}_1\) is sufficiently far from zero that we can be confident that \(\beta_1\) is non-zero. How far is far enough? This depends on the accuracy of \(\hat{\beta}_1\), measured by \(\text{SE}(\hat{\beta}_1)\). In practice, we compute a t-statistic, given by:

\[ t = \frac{\hat{\beta}_1 - 0}{\text{SE}(\hat{\beta}_1)} \]

which measures the number of standard deviations that \(\hat{\beta}_1\) is away from 0. If there really is no relationship between \(X\) and \(Y\), then we expect this statistic to have a t-distribution with \(n - 2\) degrees of freedom. The t-distribution has a bell shape, and for values of \(n\) greater than approximately 30 it is similar to the normal distribution.

It is a simple matter to compute the probability of observing any value equal to \(|t|\) or larger, assuming \(\beta_1 = 0\). We call this probability the p-value. A small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response by chance.
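A minimal sketch of this inference step, assuming SciPy is available; the synthetic data are illustrative only, and the interval uses the approximate two-standard-error form given above:

```python
# Minimal sketch: SE(beta1), t-statistic, two-sided p-value, and 95% CI
# for beta1 in simple linear regression, on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

x_bar = x.mean()
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
beta0_hat = y.mean() - beta1_hat * x_bar

resid = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)            # unbiased estimate of sigma^2
se_beta1 = np.sqrt(sigma2_hat / np.sum((x - x_bar) ** 2))

t_stat = (beta1_hat - 0) / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)      # two-sided p-value
ci_95 = (beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)
print(t_stat, p_value, ci_95)
```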

Assessing the accuracy of the model

Once we have rejected the null hypothesis in favour of the alternative hypothesis, it is natural to want to quantify the extent to which the model fits the data.

Residual Standard Error (RSE)

After we compute the least squares estimates of the parameters of a linear model, we can compute the following quantities:

\[ \text{SST} = \sum (y_i - \bar{y})^2, \quad \text{SSR} = \sum (\hat{y}_i - \bar{y})^2, \quad \text{SSE} = \sum (y_i - \hat{y}_i)^2 \]

A fundamental equality in both simple and multiple regression is \(\text{SST} = \text{SSR} + \text{SSE}\): the deviation from the mean equals the deviation due to the fit plus the residual. The residual standard error is an estimate of the standard deviation of the error term, \(\text{RSE} = \sqrt{\text{SSE}/(n - p - 1)}\).

R2 statistic

The RSE provides an absolute measure of the lack of fit of the model to the data. But since it is measured in the units of \(Y\), it is not always clear what constitutes a good RSE. The \(R^2\) statistic provides an alternative measure of fit. It takes the form of a proportion, and takes values between 0 and 1, independent of the scale of \(Y\).

\[ R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}} \]

where \(\text{SST} = \sum (y_i - \bar{y})^2\) is the total sum of squares. SST measures the total variance in the response \(Y\), and can be thought of as the amount of variability inherent in the response before the regression is performed. Hence, \(R^2\) measures the proportion of variability in \(Y\) that can be explained using \(X\).

Note that the correlation coefficient \(r = \text{Cor}(X, Y)\) is related to \(R^2\) in the simple linear regression setting: \(r^2 = R^2\).
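A minimal sketch, on synthetic data, of the \(\text{SST} = \text{SSR} + \text{SSE}\) decomposition and the identity \(r^2 = R^2\) for simple linear regression:

```python
# Minimal sketch: sums of squares decomposition and R^2, checked against the
# squared correlation coefficient, for a simple linear regression fit.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=60)
y = 1.0 + 0.8 * x + rng.normal(scale=2.0, size=60)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # sum of squares due to regression
sse = np.sum((y - y_hat) ** 2)         # residual sum of squares

r_squared = ssr / sst
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(sst, ssr + sse), np.isclose(r_squared, r ** 2))
```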

Multiple Linear Regression

We can extend the simple linear regression model to accommodate multiple predictors:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon \]

We choose \(\beta_0, \beta_1, \ldots, \beta_p\) to minimise the sum of squared residuals:

\[ \text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

Unlike the simple regression estimates, the multiple regression coefficient estimates have complicated forms that are most easily represented using matrix algebra.
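A minimal sketch of the matrix-algebra fit, assuming NumPy; the least squares estimate solves the normal equations \(\hat{\beta} = (X^T X)^{-1} X^T y\), computed here via `np.linalg.lstsq` for numerical stability, with synthetic data:

```python
# Minimal sketch: multiple linear regression in matrix form on synthetic data
# with p = 3 predictors plus an intercept column.
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 3.0])               # assumed intercept + slopes
y = beta_true[0] + X @ beta_true[1:] + rng.normal(size=n)

X_design = np.column_stack([np.ones(n), X])               # add intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)   # least squares solution
y_hat = X_design @ beta_hat
rss = np.sum((y - y_hat) ** 2)
print(beta_hat, rss)
```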

Interpreting Regression Coefficients

\(\beta_0\), the constant coefficient, is the value of \(Y\) when \(X_1 = X_2 = \cdots = X_p = 0\), as in simple regression. The regression coefficient \(\beta_j\) has several interpretations.

First, it may be interpreted as the change in \(Y\) corresponding to a unit change in \(X_j\) when all other predictor variables are held constant. In practice, predictor variables may be inherently related, and it is impossible to hold some of them constant while varying others. The regression coefficient \(\beta_j\) is also called the partial regression coefficient, because \(\beta_j\) represents the contribution of \(X_j\) to the response variable \(Y\) adjusted for the other predictor variables.

Centering and Scaling

The magnitudes of the regression coefficients in a regression equation depend on the units of measurement of the variables. To make the regression coefficients unit-less, one may first center or scale the variables before performing the regression computations.

When dealing with constant term models, it is convenient to center and scale the variables, but when dealing with no-intercept models, we need only to scale the variables.

A centered variable is obtained by subtracting from each observation the mean of all observations. For example, the centered response variable is \(Y - \bar{y}\), and the centered jth predictor variable is \(X_j - \bar{x}_j\). The mean of a centered variable is 0.

The centered variables can also be scaled. Two types of scaling are usually performed: unit-length scaling and standardizing.

Unit-length scaling of the response variable \(Y\) and the jth predictor variable \(X_j\) is obtained as follows:

\[ \tilde{Z}_y = (Y - \bar{y}) / L_y, \qquad \tilde{Z}_j = (X_j - \bar{x}_j) / L_j \]

where:

\[ L_y = \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2} \quad \text{and} \quad L_j = \sqrt{\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}, \quad j = 1, 2, \ldots, p \]

The quantity \(L_y\) is referred to as the length of the centered variable \(Y - \bar{y}\) because it measures the size or the magnitude of the observations in \(Y - \bar{y}\). \(L_j\) has a similar interpretation.

Unit length scaling has the following property:

\[ \text{Cor}(X_j, X_k) = \sum_{i=1}^n \tilde{z}_{ij} \tilde{z}_{ik} \]

The second type of scaling is called standardizing, which is defined by:

\[ \tilde{Y} = \frac{Y - \bar{y}}{s_y}, \qquad \tilde{X}_j = \frac{X_j - \bar{x}_j}{s_j}, \quad j = 1, \ldots, p \]

where \(s_y\) and \(s_j\) are the standard deviations of the response and the jth predictor variable, respectively. The standardized variables have mean zero and unit standard deviation.

Since correlations are unaffected by centering or scaling, it is sufficient and convenient to deal with either unit-length scaled or standardized models.
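A minimal sketch of centering, unit-length scaling, and standardizing a single synthetic predictor column, following the definitions above:

```python
# Minimal sketch: centering, unit-length scaling, and standardizing one
# predictor column (synthetic data, for illustration only).
import numpy as np

rng = np.random.default_rng(3)
xj = rng.normal(loc=5.0, scale=2.0, size=40)

centered = xj - xj.mean()                      # centered variable, mean 0
L_j = np.sqrt(np.sum(centered ** 2))           # "length" of the centered variable
unit_length = centered / L_j                   # unit-length scaled: sum of squares = 1
standardized = centered / xj.std(ddof=1)       # standardized: mean 0, sd 1

print(np.isclose(np.sum(unit_length ** 2), 1.0),
      np.isclose(standardized.std(ddof=1), 1.0))
```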

Properties of Least-square Estimators

Under certain regression assumptions, the least-square estimators have the following properties:

  1. The estimator \(\hat{\beta}_j\) is an unbiased estimate of \(\beta_j\) and has a variance of \(\sigma^2 c_{jj}\), where \(c_{jj}\) is the jth diagonal element of the inverse of a matrix known as the corrected sums of squares and products matrix. The covariance between \(\hat{\beta}_i\) and \(\hat{\beta}_j\) is \(\sigma^2 c_{ij}\). Among all unbiased estimates that are linear in the observations, the least squares estimators have the smallest variance.

  2. The estimator \(\hat{\beta}_j\) is normally distributed with mean \(\beta_j\) and variance \(\sigma^2 c_{jj}\).

  3. \(W = \text{SSE}/\sigma^2\) has a \(\chi^2\) distribution with \(n - p - 1\) degrees of freedom, and \(\hat{\beta}_j\) and \(\hat{\sigma}^2\) are distributed independently of each other.

  4. The vector \(\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p)\) has a \((p+1)\)-dimensional normal distribution with mean vector \(\beta = (\beta_0, \beta_1, \ldots, \beta_p)\) and variance-covariance matrix with elements \(\sigma^2 c_{ij}\).

Important Questions in Multiple Regression Models

We can answer some important questions using the multiple regression model:

1. Is there a relationship between the response and the predictors?

The strength of the linear relationship between \(Y\) and the set of predictors \(X_1, X_2, \ldots, X_p\) can be assessed through the examination of the scatter plot of \(Y\) versus \(\hat{Y}\), and the correlation coefficient \(\text{Cor}(Y, \hat{Y})\). The coefficient of determination \(R^2 = [\text{Cor}(Y, \hat{Y})]^2\) may be interpreted as the proportion of the total variability in the response variable \(Y\) that can be accounted for by the set of predictor variables \(X_1, X_2, \ldots, X_p\).

A quantity related to \(R^2\), known as the adjusted R-squared, \(R_a^2\), is also used for judging the goodness of fit. It is defined as:

\[ R_a^2 = 1 - \frac{\text{SSE}/(n - p - 1)}{\text{SST}/(n - 1)} = 1 - \frac{n - 1}{n - p - 1}(1 - R^2) \]

\(R_a^2\) is sometimes used to compare models having different numbers of predictor variables.

To test the hypothesis \(H_0: \beta_j = \beta_j^0\), where \(\beta_j^0\) is a specified value, we can use a t-test:

\[ t_j = \frac{\hat{\beta}_j - \beta_j^0}{\text{s.e.}(\hat{\beta}_j)} \]

which has a Student's t-distribution with \(n - p - 1\) degrees of freedom. The test is carried out by comparing the observed value with the appropriate critical value \(t_{(n - p - 1), \alpha/2}\).

To compare \(H_0: \beta_1 = \cdots = \beta_p = 0\) against the alternative \(H_a\) that at least one coefficient is non-zero, we compute the F-statistic:

\[ F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n - p - 1)} \]

If the linear model is correct, one can show that:

\[ E\{\text{RSS}/(n - p - 1)\} = \sigma^2 \]

and that provided H0 is true,

\[ E\{(\text{TSS} - \text{RSS})/p\} = \sigma^2 \]

Hence, when there is no relationship between the response and the predictors, one would expect the F-statistic to be close to 1. If \(H_a\) is true, then we expect \(F\) to be greater than 1.
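A minimal sketch of the overall F-test, assuming SciPy for the F distribution; the data are synthetic and only illustrative:

```python
# Minimal sketch: the overall F-statistic for H0: beta_1 = ... = beta_p = 0
# in a multiple regression, with its p-value from the F distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 80, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.0, 0.0, -0.5]) + rng.normal(size=n)

X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta_hat

tss = np.sum((y - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = stats.f.sf(f_stat, p, n - p - 1)
print(f_stat, p_value)
```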

2. Deciding on important variables

It is possible that all of the predictors are associated with the response, but it is more often the case that the response is only related to a subset of the predictors. The task of determining which predictors are associated with the response is referred to as variable selection. Various statistics can be used to judge the quality of a model. These include Mallows' \(C_p\), the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and adjusted \(R^2\).

There are \(2^p\) models that contain a subset of the \(p\) variables. Unless \(p\) is small, we cannot consider all \(2^p\) models, and we need an efficient approach for choosing a smaller set of models to consider. There are three classical approaches for this task (a minimal sketch of the first follows the list below):

  1. Forward selection: We begin with the null model – a model that contains an intercept but no predictors. We then fit p simple linear regressions and add to the null model the variable that results in the lowest RSS. We then add to that model the variable that results in the lowest RSS for the new two-variable model, and repeat.
  2. Backward selection: We start with all variables in the model, and remove the variable with the largest p-value – that is, the variable that is the least statistically significant. The new \((p-1)\)-variable model is fit, the variable with the largest p-value is removed, and we repeat.
  3. Mixed selection: This is a combination of forward and backward selection. We once again start with the null model. The p-values of the variables can become larger as new variables are added to the model. Once the p-value of one of the variables in the model rises above a certain threshold, that variable is removed.
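A minimal sketch of forward selection under the RSS criterion described above; the helper names and synthetic data are assumptions, not from the source:

```python
# Minimal sketch of forward selection: starting from the intercept-only model,
# greedily add the predictor whose inclusion gives the lowest RSS at each step.
import numpy as np

def rss_of_fit(X_design, y):
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return np.sum((y - X_design @ beta) ** 2)

def forward_selection(X, y, n_steps):
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(n_steps):
        # pick the remaining predictor whose addition minimises the RSS
        best_j = min(
            remaining,
            key=lambda j: rss_of_fit(
                np.column_stack([np.ones(n)] + [X[:, k] for k in selected + [j]]), y))
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 2] - 2.0 * X[:, 4] + rng.normal(size=100)
print(forward_selection(X, y, n_steps=2))   # expected to pick columns 2 and 4
```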

3. Model Fit

Two of the most common numerical measures of model fit are the RSE and R2, the fraction of variance explained.

It turns out that R2 will always increase when more variables are added to the model, even if those variables are only weakly associated with the response. This is because adding another variable to the least squares equations must allow us to fit the training data more accurately. Models with more variables can have higher RSE if the decrease in RSS is small relative to the increase in p.

4. Predictions

Once we have fit the multiple regression model, it is straightforward to predict the response \(Y\) on the basis of a set of values for the predictors \(X_1, X_2, \ldots, X_p\). However, there are three sorts of uncertainty associated with this prediction:

  1. The coefficient estimates \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p\) are estimates for \(\beta_0, \beta_1, \ldots, \beta_p\).

That is, the least squares plane:

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p \]

is only an estimate for the true population regression plane:

\[ f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \]

This inaccuracy is related to the reducible error. We can compute a confidence interval in order to determine how close \(\hat{Y}\) will be to \(f(X)\).

  2. In practice, assuming a linear model for \(f(X)\) is almost always an approximation of reality, so there is an additional source of potentially reducible error which we call model bias.

  3. Even if we knew \(f(X)\), the response value cannot be predicted perfectly because of the random error \(\epsilon\) in the model. How much will \(Y\) vary from \(\hat{Y}\)? We use prediction intervals to answer this question. Prediction intervals are always wider than confidence intervals, because they incorporate both the error in the estimate for \(f(X)\) and the irreducible error.

Qualitative Variables

Thus far our discussion has been limited to quantitative variables. How can we incorporate qualitative variables such as gender into our regression model?

For variables that take on only two values, we can create a dummy variable of the form (for example in gender):

\[ x_i = \begin{cases} 1 & \text{if the ith person is female} \\ 0 & \text{if the ith person is male} \end{cases} \]

and use this variable as a predictor in the equation. We can also use the {-1, 1} encoding. For qualitative variables that take on more than 2 values, a single dummy variable cannot represent all values. We can add additional variables, essentially performing a one-hot encoding.
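A minimal sketch of building dummy-variable columns by hand, assuming NumPy; the category labels are purely illustrative:

```python
# Minimal sketch: a 0/1 dummy variable for a two-level factor, and a
# one-hot style encoding (with a baseline level) for a three-level factor.
import numpy as np

gender = np.array(["female", "male", "female", "male"])
x_dummy = (gender == "female").astype(float)          # 1 if female, 0 if male

region = np.array(["east", "west", "south", "east"])
levels = ["east", "west", "south"]
# one column per level except a baseline (here "south" is the baseline)
one_hot = np.column_stack([(region == lvl).astype(float) for lvl in levels[:-1]])
print(x_dummy, one_hot, sep="\n")
```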

Extending the Linear Model

The standard linear regression model provides interpretable results and works quite well on many real-world problems. However, it makes highly restrictive assumptions that are often violated in practice.

The two most important assumptions of the linear regression model are that the relationship between the response and the predictors is:

  1. additive: the effect of changes in a predictor \(X_j\) on the response \(Y\) is independent of the values of the other predictors.
  2. linear: the change in the response \(Y\) due to a one-unit change in \(X_j\) is constant, regardless of the value of \(X_j\).

How can we remove the additive assumption? We can add an interaction term for two variables \(X_i\) and \(X_j\) as follows:

\[ \hat{Y}_2 = \hat{Y}_1 + \beta_{p+1} X_i X_j \]

We can analyze the importance of the interaction term by looking at its p-value. The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
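A minimal sketch comparing the fit with and without an interaction term, on synthetic data generated with a true interaction effect:

```python
# Minimal sketch: adding an interaction term X1*X2 to a two-predictor model
# and comparing the residual sum of squares with and without it.
import numpy as np

rng = np.random.default_rng(11)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 1.5 * x1 * x2 + rng.normal(size=n)

def fit_rss(design, y):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ beta) ** 2)

main_effects = np.column_stack([np.ones(n), x1, x2])
with_interaction = np.column_stack([main_effects, x1 * x2])
print(fit_rss(main_effects, y), fit_rss(with_interaction, y))
```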

How can we remove the assumption of linearity? A simple way is to use polynomial regression.

Potential Problems

When we fit a linear regression model to a particular data set, many problems may occur. Most common among these are the following:

  1. Non-linearity of the response-predictor relationships
  2. Correlation of error terms
  3. Non-constant variance of error terms
  4. Outliers
  5. High-leverage points
  6. Collinearity

Non-linearity of the Data

The assumption of a linear relationship between the response and the predictors doesn't always hold. Residual plots are a useful graphical tool for identifying non-linearity. They are obtained by plotting the residuals \(e_i = y_i - \hat{y}_i\) versus the predictor \(x_i\). Ideally, the residual plot will show no discernible pattern. The presence of a pattern may indicate a problem with some aspect of the linear model.

If the residual plots show that there are non-linear associations in the data, then a simple approach is to use non-linear transformations of the predictors, such as \(\log X\), \(\sqrt{X}\), and \(X^2\).

Correlation of Error Terms

An important assumption of the linear regression model is that the error terms, \(\epsilon_1, \epsilon_2, \ldots, \epsilon_n\), are uncorrelated. The standard errors for the estimated regression coefficients are computed based on this assumption. This is mostly mitigated by proper experimental design.

Non-constant Variance of Error Terms

Variances of the error terms may increase with the value of the response. One can identify non-constant variances in the errors, or heteroscedasticity, from the presence of a funnel shape in the residual plot.

Figure 2: Left: the funnel shape indicates heteroscedasticity. Right: the response has been log transformed, and there is now no evidence of heteroscedasticity.

Outliers

An outlier is a point for which \(y_i\) is far from the value predicted by the model. It is typical for an outlier that does not have an unusual predictor value to have little effect on the least squares fit. However, it can cause other problems, such as dramatically altering the computed values of RSE, \(R^2\), and p-values.

Residual plots can clearly identify outliers. One solution is to simply remove the observation, but care must be taken to first identify whether the outlier is indicative of a deficiency with the model, such as a missing predictor.

High Leverage Points

Observations with high leverage have an unusual value for \(x_i\). High-leverage observations typically have a substantial impact on the regression line. They are easy to identify, by looking for values outside the normal range of the observations. We can also compute the leverage statistic; for a simple linear regression:

\[ h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^n (x_{i'} - \bar{x})^2} \]
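A minimal sketch of this leverage statistic; with an intercept, the leverages sum to 2, and the artificially extreme \(x\) value below receives the largest leverage:

```python
# Minimal sketch: leverage statistics for simple linear regression, computed
# directly from the formula above, on synthetic data with one unusual x value.
import numpy as np

rng = np.random.default_rng(6)
x = np.append(rng.normal(size=30), 8.0)      # last observation has an unusual x
x_bar = x.mean()
h = 1 / len(x) + (x - x_bar) ** 2 / np.sum((x - x_bar) ** 2)
print(h.max(), h.argmax(), np.isclose(h.sum(), 2.0))
```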

Collinearity

Collinearity refers to the situation in which two or more predictor variables are closely related to one another. The presence of collinearity can pose problems in the regression context: it can be difficult to separate out the individual effects of collinear variables on the response. A contour plot of the RSS associated with different possible coefficient estimates can show collinearity.

Figure 3: Left: the minimum value is well defined. Right: because of collinearity, there are many pairs \((\beta_{\text{Limit}}, \beta_{\text{Rating}})\) with a similar value for RSS.

Another way to detect collinearity is to look at the correlation matrix of the predictors. An element of this matrix that is large in absolute value indicates a pair of highly correlated variables, and therefore a collinearity problem in the data.

Not all collinearity problems can be detected by inspection of the correlation matrix: it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation. This situation is called multicollinearity. We instead compute the variance inflation factor (VIF). The VIF is the ratio of the variance of \(\hat{\beta}_j\) when fitting the full model to the variance of \(\hat{\beta}_j\) if fit on its own. The smallest possible value for VIF is 1, indicating a complete absence of collinearity. As a rule of thumb, a VIF exceeding 5 or 10 indicates a problematic amount of collinearity.

\[ \text{VIF}(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j | X_{-j}}} \]

where \(R^2_{X_j | X_{-j}}\) is the \(R^2\) from a regression of \(X_j\) onto all of the other predictors.
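A minimal sketch of computing VIFs directly from this definition, by regressing each predictor on the others; the synthetic predictors include a nearly collinear pair:

```python
# Minimal sketch: variance inflation factors via VIF_j = 1 / (1 - R_j^2),
# where R_j^2 comes from regressing predictor j on all other predictors.
import numpy as np

def vif(X):
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        fitted = others @ beta
        r2 = 1 - np.sum((X[:, j] - fitted) ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)          # nearly collinear with x1
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))     # large VIFs for x1 and x2
```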

The data consist of \(n\) observations on a dependent or response variable \(Y\), and \(p\) predictor or explanatory variables. The relationship between \(Y\) and \(X_1, X_2, \ldots, X_p\) is represented by the regression model described above.

Linear Basis Function Models

We can extend the class of models by considering linear combinations of fixed non-linear functions of the input variables, of the form:

\[ y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x) = \mathbf{w}^T \boldsymbol{\phi}(x) \]

There are many choices of non-linear basis functions, such as the Gaussian basis function:

\[ \phi_j(x) = \exp\left\{ -\frac{(x - \mu_j)^2}{2s^2} \right\} \]

or the sigmoidal basis function:

\[ \phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right) \]
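A minimal sketch of fitting a linear basis function model with Gaussian basis functions by least squares; the centres, the width \(s\), and the data are assumptions for illustration:

```python
# Minimal sketch: build a design matrix of Gaussian basis functions and fit
# the weights w by least squares (the model is linear in w).
import numpy as np

def gaussian_basis(x, centres, s):
    # one column per basis function phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(8)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

centres = np.linspace(0, 1, 9)
Phi = np.column_stack([np.ones_like(x), gaussian_basis(x, centres, s=0.1)])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w.shape, np.sum((y - Phi @ w) ** 2))
```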

Regression Diagnostics

In fitting a model to a given body of data, we would like to ensure that the fit is not overly determined by one or a few observations. The distribution theory, confidence intervals, and tests of hypotheses are valid and have meaning only if the standard regression assumptions are satisfied.

The Standard Regression Assumptions

Assumptions about the form of the model: The model that relates the response \(Y\) to the predictors \(X_1, X_2, \ldots, X_p\) is assumed to be linear in the regression parameters \(\beta_0, \beta_1, \ldots, \beta_p\):

\[ Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \]

which implies that the ith observation can be written as:

\[ y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i, \quad i = 1, 2, \ldots, n \]

Checking the linearity assumption in simple regression can be done by examining the scatter plot of \(Y\) versus \(X\), but checking it in multiple regression is more difficult due to the high dimensionality of the data. In some cases, data transformations can lead to linearity.

Assumptions about the errors: The errors \(\epsilon_1, \epsilon_2, \ldots, \epsilon_n\) are assumed to be independently and identically distributed normal random variables, each with mean zero and a common variance \(\sigma^2\). Four assumptions are made here:

  • The error ϵi has a normal distribution. This normality assumption can be validated by examining appropriate graphs of the residuals.

  • The errors ϵi have mean 0.

  • The errors ϵi have the same (but unknown) variance σ2. This is the constant variance assumption, also known as the homogeneity or homoscedasticity assumption. When this assumption does not hold, the problem is called the heterogeneity or the heteroscedasticity problem.

  • The errors ϵi are independent of each other. When this assumption doesn’t hold, we have the auto-correlation problem.

Assumption about the predictors: Three assumptions are made here:

  • The predictor variables X1,X2,,Xp are nonrandom. This assumption is satisfied only when the experimenter can set the values of the predictor variables at predetermined levels. When the predictor variables are random variables, all inferences are conditional, conditioned on the observed data.

  • The values \(x_{1j}, x_{2j}, \ldots, x_{nj}\) are measured without error. This assumption is hardly ever satisfied, and errors in measurement will affect the residual variance, the multiple correlation coefficient, and the individual estimates of the regression coefficients. Correcting the estimated regression coefficients for measurement errors, even in the simplest case where all measurement errors are uncorrelated, requires knowledge of the ratio between the variances of the measurement errors and the variance of the random error. These quantities are seldom known.

  • The predictor variables X1,X2,,Xp are assumed to be linearly independent of each other. This assumption is required to guarantee the uniqueness of the least squares solution. If this assumption is violated, the problem is referred to as the collinearity problem.

Assumption about the observations: All observations are equally reliable and have an approximately equal role in determining the regression results.

A feature of the method of least squares is that minor violations of the underlying assumptions do not invalidate the inferences or conclusions drawn from the analysis in a major way. However, gross violations can severely distort conclusions.

Types of Residuals

A simple method for detecting model deficiencies in regression analysis is the examination of residual plots. Residual plots will point to serious violations in one or more of the standard assumptions when they exist. The analysis of residuals may lead to suggestions of structure or point to information in the data that might be missed or overlooked if the analysis is based only on summary statistics.

When fitting the linear model to a set of data by least squares, we obtain the fitted values:

\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_p x_{ip} \]

and the corresponding ordinary least squares residuals:

\[ e_i = y_i - \hat{y}_i \]

The fitted values can also be written in an alternative form as:

\[ \hat{y}_i = p_{i1} y_1 + p_{i2} y_2 + \cdots + p_{in} y_n \]

where the \(p_{ij}\) are quantities that depend only on the values of the predictor variables. In simple regression, \(p_{ij}\) is given by:

\[ p_{ij} = \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{\sum_k (x_k - \bar{x})^2} \]

In multiple regression, the \(p_{ij}\) are elements of a matrix known as the hat or projection matrix.

The value \(p_{ii}\) is called the leverage value for the ith observation, because \(\hat{y}_i\) is a weighted sum of all the observations in \(Y\) and \(p_{ii}\) is the weight (leverage) given to \(y_i\) in determining the ith fitted value \(\hat{y}_i\). Thus, we have \(n\) leverage values, and they are denoted by:

\[ p_{11}, p_{22}, \ldots, p_{nn} \]

When the standard assumptions hold, the ordinary residuals \(e_1, e_2, \ldots, e_n\) will sum to zero, but they will not have the same variance because:

\[ \text{Var}(e_i) = \sigma^2 (1 - p_{ii}) \]

To overcome the problem of unequal variances, we standardize the ith residual \(e_i\) by dividing by its standard deviation:

\[ z_i = \frac{e_i}{\sigma \sqrt{1 - p_{ii}}} \]

This is called the ith standardized residual because it has mean zero and standard deviation 1. The standardized residuals depend on \(\sigma\), the unknown standard deviation of \(\epsilon\). An unbiased estimate of \(\sigma^2\) is given by:

\[ \hat{\sigma}^2 = \frac{\sum e_i^2}{n - p - 1} = \frac{\sum (y_i - \hat{y}_i)^2}{n - p - 1} = \frac{\text{SSE}}{n - p - 1} \]

An alternative unbiased estimate of \(\sigma^2\) is given by:

\[ \hat{\sigma}_{(i)}^2 = \frac{\text{SSE}_{(i)}}{n - p - 2} \]

where \(\text{SSE}_{(i)}\) is the sum of squared residuals when we fit the model to the \(n - 1\) observations obtained by omitting the ith observation.

Using \(\hat{\sigma}\) as an estimate of \(\sigma\), we obtain:

\[ r_i = \frac{e_i}{\hat{\sigma} \sqrt{1 - p_{ii}}} \]

which we term the internally studentized residual. Using the other unbiased estimate, we get:

\[ r_i^* = \frac{e_i}{\hat{\sigma}_{(i)} \sqrt{1 - p_{ii}}} \]

which is a monotonic transformation of \(r_i\). This is termed the externally studentized residual. The standardized residuals do not sum to zero, but they all have the same variance.
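A minimal sketch computing internally and externally studentized residuals from these definitions, with the leave-one-out refits done explicitly for clarity rather than speed; the data are synthetic:

```python
# Minimal sketch: internally and externally studentized residuals for a
# multiple regression, following the definitions above.
import numpy as np

rng = np.random.default_rng(9)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + p predictors
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat (projection) matrix
e = y - H @ y                                  # ordinary residuals
pii = np.diag(H)                               # leverage values

sigma_hat = np.sqrt(np.sum(e ** 2) / (n - p - 1))
r_internal = e / (sigma_hat * np.sqrt(1 - pii))

r_external = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i                   # refit without observation i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    sse_i = np.sum((y[keep] - X[keep] @ beta_i) ** 2)
    sigma_i = np.sqrt(sse_i / (n - p - 2))
    r_external[i] = e[i] / (sigma_i * np.sqrt(1 - pii[i]))

print(r_internal[:3], r_external[:3])
```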

The externally studentized residuals follow a t-distribution with \(n - p - 2\) degrees of freedom, but the internally studentized residuals do not. However, with a moderately large sample, these residuals should have approximately a standard normal distribution. The residuals are not strictly independently distributed, but with a large number of observations, the lack of independence may be ignored.

Graphical methods

Before fitting the model

Graphs plotted before fitting the model serve as exploratory tools. There are four categories of graphs:

1. One-dimensional graphs

One-dimensional graphs give a general idea of the distribution of each individual variable. One of the following graphs may be used:

  • histogram
  • stem-and-leaf display
  • dot-plot
  • box-plot

These graphs indicate whether the variable is symmetric, or skewed. When a variable is skewed, it should be transformed, generally using a logarithmic transformation. Univariate graphs also point out the presence of outliers in the variables. However, no observations should be deleted at this stage.

2. Two-dimensional graphs

We can take the variables in pairs and look at the scatter plots of each variable versus each other variable in the data set. These explore relationships between each pair of variables and identify general patterns. These pairwise plots can be arranged in a matrix format, known as the draftsman’s plot or the plot matrix. In simple regression, we expect the plot of Y versus X to show a linear pattern. However, scatter plots of Y versus each predictor variable may or may not show linear patterns.

Beyond these, there are rotation plots and dynamic graphs which serve as powerful visualizations of the data in more than 2 dimensions.

Graphs after fitting a model

The graphs after fitting a model help check the assumptions and assess the adequacy of the fit of a given model.

  1. Graphs checking linearity and normality assumptions

When the number of variables is small, the assumption of linearity can be checked by interactively and dynamically manipulating the plots discussed in the previous section. However, this quickly becomes difficult with many predictor variables. Plotting the standardized residuals can help check the linearity and normality assumptions; a minimal sketch of these plots follows the list below.

  • Normal probability plot of the standardized residuals: This is a plot of the ordered standardized residuals versus the normal scores. The normal scores are what we would expect to obtain if we take a sample of size n from a standard normal distribution. If the residuals are normally distributed, the ordered residuals should be approximately the same as the ordered normal scores. Under the normality assumption, this plot should resemble a straight line with intercept zero and slope of 1.

  • Scatter plots of the standardized residual against each of the predictor variables: Under the standard assumptions, the standardized residuals are uncorrelated with each of the predictor variables. If the assumptions hold, the plot should be a random scatter of points.

  • Scatter plot of the standardized residual versus the fitted values: Under the standard assumptions, the standardized residuals are also uncorrelated with the fitted values. Hence, this plot should also be a random scatter of points.

  • Index plot of the standardized residuals: We display the standardized residuals versus the observation number. If the order in which the observations were taken is immaterial, this plot is not needed. However, if the order is important, a plot of the residuals in serial order may be used to check the assumption of the independence of the errors.
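A minimal sketch of several of these diagnostic plots, assuming NumPy, SciPy, and matplotlib are available; the model and data are synthetic:

```python
# Minimal sketch: normal probability plot of standardized residuals,
# residuals versus fitted values, and an index plot.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(10)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
e = y - y_hat
sigma_hat = np.sqrt(np.sum(e ** 2) / (n - X.shape[1]))
z = e / (sigma_hat * np.sqrt(1 - np.diag(H)))          # standardized residuals

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
normal_scores = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
axes[0].scatter(normal_scores, np.sort(z))
axes[0].set_title("Normal probability plot")
axes[1].scatter(y_hat, z)
axes[1].set_title("Standardized residuals vs fitted values")
axes[2].plot(z, marker="o", linestyle="")
axes[2].set_title("Index plot")
plt.tight_layout()
plt.show()
```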

Figure 4: Scatter plots of residuals: (a) shows nonlinearity, and (b) shows heterogeneity

References

Chatterjee, S., and Hadi, A. S. (n.d.).
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning.