Linear Regression


**Scatterplots and Correlation**

A response variable measures the outcome of a study, whilst an explanatory variable attempts to explain the observed outcomes. Explanatory variables are also known as independent variables and are put on the x-axis; response variables are known as dependent variables and are put on the y-axis.

When interpreting a graph, look at the overall pattern and for striking deviations (known as outliers) from that pattern. You can describe the overall pattern by the form, direction, and strength of the relationship.
 * There can be a positive association (direct relationship) or a negative association (inverse relationship)
 * The strength of a relationship depends on how closely the points follow a clear form

In general, a linear relationship is strong if the points lie close to a straight line, and weak if they are widely scattered about a line. Correlation (written as //r//) measures the direction and strength of the linear relationship between two quantitative variables.

Key things to know about correlation:
 * 1) It makes no difference which variable is the explanatory variable and which is the response variable
 * 2) Both variables must be quantitative
 * 3) The units of measurement don't matter--correlation stays the same
 * 4) A positive //r// indicates a positive association, and a negative //r// indicates a negative association
 * 5) The //r// is always a number between -1 and 1
 * 6) Values of //r// near 0 indicate a weak relationship, but values of //r// closer to -1 or 1 indicate that the points in a scatterplot lie close to a straight line
 * 7) Correlation is not resistant--outliers do affect it
 * 8) Correlation is NOT a complete description of two-variable data
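The properties above can be checked numerically. Here is a minimal Python sketch with made-up data that computes //r// and illustrates properties 1 (symmetry) and 3 (unit invariance):

```python
import math

def correlation(xs, ys):
    """Pearson correlation r between two quantitative variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# hypothetical example data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = correlation(x, y)
# Property 1: swapping explanatory and response gives the same r.
# Property 3: converting x to new units (e.g. inches -> cm) leaves r unchanged.
x_cm = [2.54 * v for v in x]
```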

Here are some examples of how correlation measures the strength of a linear relationship.



Strength:
 * A strong correlation has |r| between 0.8 and 1.0
 * A moderate correlation has |r| between 0.6 and 0.8
 * A weak correlation has |r| between 0 and 0.6

To get an LSRL and your correlation, put x into L1 and y into L2. Go to Stat, Calc, #8 (LinReg a+bx). When interpreting your //r// value, be sure to indicate the strength, direction, and form.
 * "There is a strong, negative linear relationship between x and y in the sixth dot plot"


**Least-Squares Regression**

Least-squares regression is a method for finding a line that summarizes the relationship between two variables. A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. So we use a regression line to predict the value of y for a given value of x. The more linear the data, the more appropriate it is to use an LSRL.

We want a regression line that makes the sum of the squared residuals (the vertical distances from the points to the line) as small as possible. To determine the equation for the line, we need to solve for the intercept //a// and the slope //b//. EVERY LSRL passes through the point (x bar, y bar). The slope of a regression line is important for the interpretation of data. Remember the following sentence to interpret the slope on the AP test: //For every (x in words), there is an average increase/decrease of (slope and y in words)//.
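The intercept and slope can be computed directly from the data. A minimal sketch with hypothetical data, checking that the line really does pass through (x bar, y bar):

```python
def lsrl(xs, ys):
    """Least-squares slope b and intercept a for y-hat = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx        # slope
    a = my - b * mx      # intercept: this forces the line through (x bar, y bar)
    return a, b

# hypothetical example data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = lsrl(x, y)
# plugging x bar = 3 into the line returns y bar = 4 exactly
assert abs((a + b * 3) - 4) < 1e-12
```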

The r² is known as the coefficient of determination. Essentially, r² tells you what proportion of the variation in y is explained by the LSRL. On the AP test, use this sentence to interpret r²: //(r² as a percent) of the variation in (y in words) can be explained by the LSRL.//
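A quick numeric sketch of r² with made-up data; note that r² is just the correlation squared:

```python
import math

# hypothetical example data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)
r2 = r ** 2
# here r2 = 0.6: "60% of the variation in y can be explained by the LSRL"
```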

A residual is the difference between an observed value of the response variable and the value predicted by the regression line. Residual = observed - predicted. There is a residual for each data point. Positive residuals fall above the line, and negative residuals fall below the line. Something to remember is that the mean of all of the residuals for a least-squares line is ALWAYS 0.
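The zero-mean property is easy to verify. A sketch using hypothetical data whose LSRL (computed separately for this data) is y-hat = 2.2 + 0.6x:

```python
# hypothetical example data with LSRL y-hat = 2.2 + 0.6x
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# residual = observed - predicted, one per data point
residuals = [yi - (2.2 + 0.6 * xi) for xi, yi in zip(x, y)]
# positive residuals are points above the line, negative ones below;
# for a least-squares line they always sum (and average) to 0
```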

Outliers are basically observations that lie outside the overall pattern of the other observations. Influential outliers markedly change the position of the regression line when they are removed; usually they are outliers in the x direction. Regression outliers are points that are far away from the LSRL in the y direction.

Residual plots can be useful for examining whether the regression line captures the overall relationship between x and y. If the line fits well, then the residuals should have no pattern and be random. If there is any kind of pattern or curve, then the relationship is not very linear. Look at some examples below:

The first residual plot is of data that demonstrates a strong linear relationship. The second residual plot is of data that does not.

To do residuals in your calculator:
 * 1) Put data into L1 and L2
 * 2) Go to Stat, Calc, #8 (LinReg a+bx)
 * 3) 2nd Stat Plot, turn on Plot 2
 * 4) Adjust x to be L1 and y to RESID
 * 5) Press Zoom 9 (ZoomStat)

To get the sum of your residuals: go to 2nd List, Math, #5 (sum), 2nd List, #8 (RESID)


**Transformations**

Nonlinear relationships between two quantitative variables can sometimes be changed into linear relationships by transforming one or both variables. When the variable being transformed takes only positive values, the power transformations are all monotonic.

To make exponential functions (y=ab^x) linear: log y and keep x (since log y = log a + x log b). To make power functions (y=ax^p) linear: log y and log x (since log y = log a + p log x).

Linear growth increases by a fixed amount (your slope). Exponential growth increases by a fixed percentage (ratio) of the previous total.
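A sketch of why logging y linearizes exponential growth, using made-up data with ratio 2 (y = 3·2^x is an assumption for illustration):

```python
import math

# hypothetical exponential data: y = 3 * 2**x (each step multiplies by 2)
xs = [1, 2, 3, 4, 5]
ys = [3 * 2 ** x for x in xs]

# taking log y turns y = a*b**x into log y = log a + (log b)*x, a straight line
logy = [math.log10(y) for y in ys]
# with x increasing by 1 each step, successive differences of log y are the slope
slopes = [logy[i + 1] - logy[i] for i in range(len(logy) - 1)]
# every difference equals log10(2): the (x, log y) plot is perfectly linear
```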

Exponential transformations in the calculator:
 * 1) Enter data into the calculator
 * 2) Make sure Stat Plot 1 says (L1, L2)
 * 3) Use Linear Regression (#8)
 * 4) Look at RESID by making Stat Plot 2 say (L1, RESID)
 * 5) Make L3 into log(L2)
 * 6) Graph
 * 7) Use Linear Regression (#8) L1, L3
 * 8) Look at r-value from your linear regression and look at RESID

If the residuals are still not random, proceed to do power transformations in the calculator:
 * 1) Make L4 into log(L1)
 * 2) Graph
 * 3) Use Linear Regression (#8) L4, L3
 * 4) Look at r-value from your linear regression
 * 5) In Plot 2, change x to L4 and y to RESID (L4, RESID)
 * 6) Look at residuals
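The calculator steps above (regressing L3 = log y on L4 = log x) can be sketched in Python. With made-up power data y = 2·x^3, the log-log regression recovers the power p as the slope and the coefficient as 10^intercept:

```python
import math

def lsrl(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

# hypothetical power data: y = 2 * x**3
xs = [1, 2, 3, 4, 5]
ys = [2 * x ** 3 for x in xs]
logx = [math.log10(x) for x in xs]   # plays the role of L4
logy = [math.log10(y) for y in ys]   # plays the role of L3
a, p = lsrl(logx, logy)
# slope p recovers the power (3); 10**a recovers the coefficient (2)
```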


**Things to Keep in Mind with Correlation and Regression**

First and foremost, correlation and regression describe LINEAR relationships, and correlation //r// and the LSRL are not resistant to outliers. Extrapolation is the use of a regression line to predict outside of your data (not very accurate). Often the relationship between two variables is strongly influenced by other variables that we did not measure or even think about. These variables are called lurking variables, and they may influence the interpretation of relationships among the explanatory and response variables. A lurking variable can falsely suggest a strong relationship between x and y, or it can hide a relationship that is really there.

One of the most important things to remember in Statistics is that ASSOCIATION DOES NOT IMPLY CAUSATION.



Clearly, even a strong association between two variables is not enough to say that there is a cause-and-effect link between the variables. The best method to establish causation is by performing an experiment in which the effects of possible lurking variables are controlled (see other page).


**Inference for Regression**

When a scatterplot shows a linear relationship between x and y, we can use the LSRL to predict y for a given value of x. Now, we want to do tests and confidence intervals in this setting.

The equation we use for statistics is: ŷ = a + bx, where //a// is the intercept and //b// is the slope. The equation we use for parameters is: μ_y = α + βx, where α (alpha) is the true intercept and β (beta) is the true slope. (Remember: define ŷ, a, and b in the context of the problem.)

There are some conditions that must be met before performing a regression inference test:
 * Assume normality--for any fixed value of x, the response y varies according to a normal distribution
 * Assume linear
 * The standard deviation of y is the same for all values of x
 * SRS or random sample--if not, "proceed with caution"

Things to keep in mind before performing an inference test:
 * Residual = observed y - predicted y
 * Degrees of freedom = n - 2 (because we estimate two parameters, the slope and the intercept)
 * Slope is the average rate of change
 * The standard error about the line (s) is used to estimate the unknown standard deviation (σ)
 * To calculate the sum of the residuals--enter x into L1 and y into L2; LinReg #8; 2nd List, Math, Sum; 2nd List, Resid


**Significance Test for Regression**

STEP 1
 * Ho: β = 0, there is NO linear association
 * Ha: β > 0, there is a positive linear association; or β < 0, there is a negative linear association

STEP 2
 * Linear regression t test
 * Assume approximately normal
 * Assume linear

STEP 3 (see formula sheet)
 * t value = slope / standard error
 * degrees of freedom = n - 2
 * p-value = tcdf(t, large #, df)

STEP 4
Reject or fail to reject Ho based on the size of your p-value. If the p-value is greater than α, fail to reject Ho; if the p-value is less than α, reject Ho. There is (not) enough evidence to say that there is a positive/negative linear association.
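Steps 1-3 can be sketched numerically. A minimal Python version with made-up data: the t statistic is the slope over its standard error, with SE_b = s / √(Σ(x - x̄)²) and s the standard error about the line (these formulas match the AP formula sheet; the data are hypothetical). The resulting t and df would then go into tcdf on the calculator:

```python
import math

def slope_t_stat(xs, ys):
    """t statistic and df for Ho: beta = 0 (linear regression t test)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    # standard error about the line, using df = n - 2
    s = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
    se_b = s / math.sqrt(sxx)
    return b / se_b, n - 2

# hypothetical example data
t, df = slope_t_stat([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
# compare t with the t distribution on df degrees of freedom
# (tcdf(t, large #, df) on the calculator gives the one-sided p-value)
```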



**Confidence Intervals for Regression Slope**

The slope β of the true regression line is usually the most important parameter in a regression problem. The slope is the rate of change of the mean response as the explanatory variable increases. Because we're dealing with statistics, we use //b// as an unbiased estimator of β. A confidence interval shows how accurate the estimate //b// is likely to be.

The confidence interval for the slope β of the true regression line is: b +/- t*(SE_b)

STEP 1
Estimate the average (y in words) of all...

STEP 2
 * Assume approximately normal
 * Assume linear
 * SRS or random sample--if not, "proceed with caution"

STEP 3
 * degrees of freedom = n - 2
 * t* = the upper (1 - C)/2 critical value from the t distribution with n - 2 degrees of freedom

STEP 4
I am (C)% confident that the average (y in words) will increase/decrease by [the confidence interval found in step 3] for each (x in words) in repeated samples.
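The interval itself is quick to compute. A sketch with made-up data (n = 5, so df = 3; the 95% critical value t* = 3.182 comes from a t table and is an assumption tied to this example):

```python
import math

# hypothetical example data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
a = my - b * mx
# standard error about the line, then standard error of the slope
s = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
se_b = s / math.sqrt(sxx)
t_star = 3.182           # from a t table: 95% confidence, df = n - 2 = 3
lo, hi = b - t_star * se_b, b + t_star * se_b   # b +/- t*(SE_b)
```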
