standardized residuals and outliers

The data for the problem are in the files:
housePrM.mtp housePrM.txt

Column 2 is the price of the house in thousands of dollars
and column 3 is the size in hundreds of square feet.

(a)

Plot price vs size.
Notice that there is a house whose price is unusually low
given its size.

Click on the little brush in the Minitab menu and then
click on the unusual point.

Which observation is it ?

(b)

Run the regression of price on size and obtain the
residuals.

Plot the residuals vs size.

Overall, do you see any pattern in the plot of residuals vs size ?

Do you see an unusual point ?
Which observation does it correspond to ?

(c)

In part (b) we found an unusual point.
This happens quite frequently in practice.
Maybe the y value for this observation is in error ?
Maybe there is something special about this house
which makes it different from the rest ?
In practice we would have to check into it.

Points that have unusually large (or small) residuals are
called outliers.
This means the y value is larger (or smaller) than you expect
given the x values.

How can we quantify how unusual an outlier is ?
We standardize it.

In Minitab we can obtain the standardized residuals by using
the storage option in the regression dialog.
Check "standardized residuals" and Minitab will create a new column
containing the standardized residuals.

Under the assumptions of the model the standardized residuals should
look like iid draws from the standard normal distribution.


If an observation has an unusually large standardized residual, some
"special cause" may have affected that one in particular.
In practice it is often well worth the time to investigate and find out
why an observation is different from the rest.

Plot the standardized residuals vs size.

How unusual is our outlier ?

(d)

What are the standardized residuals ?
Well, it is a long story, but I can give you a simple
approximate answer which gets at the intuition.

If we knew the true parameters we could calculate the
true errors:

$\epsilon_i = y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_k x_{ik}$

Since we don't know the $\beta$'s, we plug in the estimates $b_0, b_1, \ldots, b_k$,
giving the residuals:

$e_i = y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \cdots - b_k x_{ik}$

Thus, we can think of the residuals (the $e_i$) as
estimates of the true errors (the $\epsilon_i$).

Under the assumptions of our model we have
$\epsilon_i \sim N(0,\sigma^2)$, so the standardized values would be
$(\epsilon_i - 0)/\sigma = \epsilon_i/\sigma$, which we can estimate by $e_i/s$.

It turns out it can be more complicated than this,
but approximately the standardized residuals are
$e_i/s$.

For the data in this question obtain the values $e_i/s$
(that is, get a new column of numbers by dividing each
residual by the s value on the Minitab regression output).
Plot these values vs the standardized residuals given
by Minitab. How do they compare ?
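
For readers working outside Minitab, the comparison in part (d) can be sketched in Python. This is a minimal illustration, not Minitab's own code: it assumes that the "more complicated" version is the usual leverage-adjusted form, e_i / (s * sqrt(1 - h_i)), where h_i is the i-th diagonal element of the hat matrix. The function name `standardized_residuals` and the data are made up for the example (the data are simulated, with one outlier planted at index 7); they are not the house price data.

```python
import numpy as np

def standardized_residuals(x, y):
    """Return (e/s, leverage-adjusted standardized residuals) for y = b0 + b1*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.column_stack([np.ones(n), x])        # design matrix with an intercept column
    b, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimates b0, b1
    e = y - X @ b                               # residuals
    p = X.shape[1]                              # number of estimated coefficients
    s = np.sqrt(e @ e / (n - p))                # residual standard error
    # leverages: diagonal of the hat matrix X (X'X)^{-1} X'
    h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    return e / s, e / (s * np.sqrt(1.0 - h))

# Simulated data (an assumption for illustration only, not housePrM.txt)
rng = np.random.default_rng(0)
size = rng.uniform(10, 35, 50)
price = 20 + 4 * size + rng.normal(0, 10, 50)
price[7] -= 80                                  # plant one unusually low price

simple, adjusted = standardized_residuals(size, price)
worst = int(np.argmax(np.abs(adjusted)))        # index of the largest |standardized residual|
print(worst, simple[worst], adjusted[worst])
```

Because 1 - h_i is at most 1, the adjusted values are never smaller in absolute value than e_i/s, and with moderate leverages the two columns fall close to a straight line when plotted against each other.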

(e)

Note that Minitab routinely prints out a list of "unusual observations".
Any observation which has a standardized residual bigger than 2
(in absolute value) is listed.

If the regression model is correct and there is nothing really unusual
about any of the observations, and we have 1000 observations,
how many observations would you expect Minitab to print out
because the standardized residual is bigger than 2 (in abs val) ?
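
One way to check your answer to part (e) is the short Python sketch below. It assumes the standardized residuals behave exactly like iid standard normal draws, which is only an approximation; the function name `expected_flagged` is made up for the example.

```python
import math

def expected_flagged(n, cutoff=2.0):
    """Expected number of |Z| > cutoff among n iid standard normal draws."""
    tail = math.erfc(cutoff / math.sqrt(2.0))   # two-sided tail probability P(|Z| > cutoff)
    return n * tail

print(expected_flagged(1000))
```

The point of the calculation is that a list of "unusual observations" is expected even when nothing is wrong, so a flagged point is a reason to look closer rather than proof of a problem.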

(f)

The data for this problem are in the files:
zagat.mtp zagat.txt

For each of 114 restaurants in New York we have ratings on
food, decor, and service.
We also have a value for price of a meal.

Our goal is to see how the three characteristics of the restaurant
are related to the price.

Regress price on food, decor, and service.

Are there any observations with large standardized residuals ?

Can you find this observation in any of the plots of price
vs the explanatory variables ?


solution