Linear Regression

*This is another version of Linear Regression. There are notes, and at the bottom there are websites with tutorials and practice problems.*

=Overview=

In a cause-and-effect relationship, the **independent variable** is the cause, and the **dependent variable** is the effect. **Least squares linear regression** is a method for predicting the value of a dependent variable //Y// based on the value of an independent variable //X//. In this tutorial, we focus on the case where there is only one independent variable. This is called simple regression (as opposed to multiple regression, which handles two or more independent variables).

Prerequisites for Regression
Simple linear regression is appropriate when the following conditions are satisfied.
 * The dependent variable //Y// has a linear relationship to the independent variable //X//. To check this, make sure that the XY scatterplot is linear and that the residual plot shows a random pattern.
 * For each value of X, the probability distribution of Y has the same standard deviation σ. When this condition is satisfied, the variability of the residuals will be relatively constant across all values of X, which is easily checked in a residual plot.
 * For any given value of X,
 * The Y values are independent, as indicated by a random pattern on the residual plot.
 * The Y values are roughly normally distributed (i.e., symmetric and unimodal). A little skewness is OK if the sample size is large. A histogram or a dotplot will show the shape of the distribution.

The Least Squares Regression Line
Linear regression finds the straight line, called the **least squares regression line** or LSRL, that best represents observations in a bivariate data set. Suppose //Y// is a dependent variable, and //X// is an independent variable. The population regression line is:

Y = β0 + β1X

where β0 is a constant, β1 is the regression coefficient, X is the value of the independent variable, and Y is the value of the dependent variable. Given a random sample of observations, the population regression line is estimated by:

ŷ = b0 + b1x

where b0 is a constant, b1 is the regression coefficient, x is the value of the independent variable, and ŷ is the //predicted// value of the dependent variable.

How to Define a Regression Line
Normally, you will use a computational tool - a software package (e.g., Excel) or a graphing calculator - to find b0 and b1. You enter the //X// and //Y// values into your program or calculator, and the tool solves for each parameter. In the unlikely event that you find yourself on a desert island without a computer or a graphing calculator, you can solve for b0 and b1 "by hand". Here are the equations:

b1 = Σ [ (xi - x̄)(yi - ȳ) ] / Σ [ (xi - x̄)² ]
b1 = r * (sy / sx)
b0 = ȳ - b1 * x̄

where b0 is the constant in the regression equation, b1 is the regression coefficient, r is the correlation between x and y, xi is the //X// value of observation //i//, yi is the //Y// value of observation //i//, x̄ is the mean of //X//, ȳ is the mean of //Y//, sx is the standard deviation of //X//, and sy is the standard deviation of //Y//.
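The "by hand" formulas above can be sketched in plain Python. This is a minimal illustration with invented data values, not part of the original lesson:

```python
# Sketch of the "by hand" LSRL formulas above (data invented for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# b1 = sum[(xi - x_bar)(yi - y_bar)] / sum[(xi - x_bar)^2]
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
den = sum((x - x_bar) ** 2 for x in xs)
b1 = num / den

# b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(b0, b1)
```

The same b0 and b1 come out of any statistics package or a graphing calculator's LinReg command.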

Properties of the Regression Line
When the regression parameters (b0 and b1) are defined as described above, the regression line has the following properties. The least squares regression line is the only straight line that has all of these properties.
 * The line minimizes the sum of squared differences between observed values (the //y// values) and predicted values (the ŷ values computed from the regression equation).
 * The regression line passes through the mean of the //X// values (x̄) and the mean of the //Y// values (ȳ).
 * The regression constant (b0) is equal to the y-intercept of the regression line.
 * The regression coefficient (b1) is the average change in the dependent variable (//Y//) for a 1-unit change in the independent variable (//X//). It is the slope of the regression line.
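Two of these properties are easy to check numerically. The sketch below uses `numpy.polyfit` for the fit and invented data values:

```python
# Checking two LSRL properties numerically (data invented for illustration).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1, b0 = np.polyfit(x, y, 1)      # degree-1 fit returns (slope, intercept)
y_hat = b0 + b1 * x

# Property: the line passes through (x_bar, y_bar)
assert abs((b0 + b1 * x.mean()) - y.mean()) < 1e-9

# Property: the residuals (y - y_hat) sum to zero
assert abs((y - y_hat).sum()) < 1e-9

print(round(b1, 3), round(b0, 3))
```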

The Coefficient of Determination
The **coefficient of determination** (denoted by R²) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable. The formula for computing the coefficient of determination for a linear regression model with one independent variable is given below.

**Coefficient of determination.** The coefficient of determination (R²) for a linear regression model with one independent variable is:

R² = { ( 1 / N ) * Σ [ (xi - x̄)(yi - ȳ) ] / (σx * σy) }²

where N is the number of observations used to fit the model, Σ is the summation symbol, xi is the x value for observation i, x̄ is the mean x value, yi is the y value for observation i, ȳ is the mean y value, σx is the standard deviation of x, and σy is the standard deviation of y.
 * The coefficient of determination ranges from 0 to 1.
 * An R² of 0 means that the dependent variable cannot be predicted from the independent variable.
 * An R² of 1 means the dependent variable can be predicted without error from the independent variable.
 * An R² between 0 and 1 indicates the extent to which the dependent variable is predictable. An R² of 0.10 means that 10 percent of the variance in //Y// is predictable from //X//; an R² of 0.20 means that 20 percent is predictable; and so on.
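The R² formula above is just the square of the correlation r. A minimal numpy sketch, using population standard deviations (`ddof=0`, matching the σ's in the formula) and invented data:

```python
# R^2 via the formula above, with population standard deviations
# (data invented for illustration).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

n = len(x)
r = (1 / n) * np.sum((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())
r_squared = r ** 2

# Cross-check: r should match numpy's built-in correlation coefficient
assert abs(r - np.corrcoef(x, y)[0, 1]) < 1e-9
print(round(r_squared, 4))
```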

Standard Error
The **standard error** about the regression line (often denoted by SE) is a measure of the average amount by which the regression equation over- or under-predicts. The higher the coefficient of determination, the lower the standard error, and the more accurate predictions are likely to be.
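One common form of this standard error is s = √(Σ(y − ŷ)² / (n − 2)), the square root of the sum of squared residuals over n − 2 degrees of freedom. A sketch with invented data, assuming that definition:

```python
# Standard error about the regression line: sqrt(SSR / (n - 2))
# (data invented for illustration).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

n = len(x)
se = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # df = n - 2 for simple regression
print(round(se, 4))
```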

Test Your Understanding of This Lesson
A researcher uses a regression equation to predict home heating bills (dollar cost), based on home size (square feet). The correlation between predicted bills and home size is 0.70. What is the correct interpretation of this finding?

(A) 70% of the variability in home heating bills can be explained by home size.
(B) 49% of the variability in home heating bills can be explained by home size.
(C) For each added square foot of home size, heating bills increased by 70 cents.
(D) For each added square foot of home size, heating bills increased by 49 cents.
(E) None of the above.

The correct answer is (B). The coefficient of determination measures the proportion of variation in the dependent variable that is predictable from the independent variable. The coefficient of determination is equal to R²; in this case, (0.70)² or 0.49. Therefore, 49% of the variability in heating bills can be explained by home size.

=Chapter 3=

Response Variable (y): measures the outcome of a study
Explanatory Variable (x): attempts to explain the outcome
(x̄, ȳ) must be a point on the LSRL
Residuals: observed − predicted, or y − ŷ; the sum and mean of the residuals always equal 0
Correlation Coefficient (r): measures the strength and direction of the linear relationship between two quantitative variables
Sentence: **//__"Strong positive linear association between __X-Value__ and __Y-Value__"__//**
Coefficient of Determination (r²)
Influential Outlier: a point that, if removed, markedly changes the position of the regression line
Regression Outlier: a point far away from the LSRL in the y direction
 * Example: child's age (explanatory/x) predicting height (response/y)
 * There is a residual for each data point
 * + means the point falls above the line, or prediction
 * - means the point falls below the line, or prediction
 * If the residual plot is random, the data are linear; if it shows a pattern, the data are not linear
 * A strong r does not necessarily mean a linear relationship
 * A correlation at or near 0 doesn't mean there isn't a relationship between the variables; there may be a strong nonlinear relationship
 * Facts about r:
 * A positive r means a positive correlation; a negative r means a negative correlation
 * r is between [-1, 1]; the closer to -1 or 1, the stronger the correlation
 * r is strongly affected by outliers
 * **//__"r²% of the variation in y (in words) can be explained by the LSRL"__//**
 * r² is the proportion of the variance that is predictable from a knowledge of x
 * r² gives the % of variation in y that is explained by the variation in x
 * When calculating r from r², remember r could be +/-
 * Influential outliers are points that are outliers in the x direction
 * Influential Outlier Example (Regression Applet)
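The influential-outlier idea above can be demonstrated numerically: a single point far out in the x direction can drag the fitted slope. A sketch with invented data, using `numpy.polyfit`:

```python
# How one outlier in the x direction changes the LSRL slope
# (data invented for illustration).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # perfectly linear: slope 2

slope_before, _ = np.polyfit(x, y, 1)

# Add one point far out in the x direction that breaks the trend
x2 = np.append(x, 20.0)
y2 = np.append(y, 5.0)
slope_after, _ = np.polyfit(x2, y2, 1)

print(round(slope_before, 3), round(slope_after, 3))
```

Removing (or adding) that one point markedly changes the position of the regression line, which is exactly the definition of an influential outlier above.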

The Least Squares Regression Line (LSRL; Line of Best Fit) is used to predict points.

Real World Example: Okun's law in macroeconomics is an example of simple linear regression, in which the dependent variable (GDP growth) is in a linear relationship with changes in the unemployment rate.

Slope: **//__"For every x (in words) there is an average (increase or decrease) of y (in words) of # of slope"__//**
 * Formula is based on standardizing scores (z scores) therefore changing units does not change correlation
 * Best fitting line means the line that minimizes the sum of the squares of the vertical differences between the observed values and the values predicted by the line
 * ŷ = a+bx
 * a = y-intercept, b = slope

DO NOT FORGET CONTEXT IN THE SENTENCES!
=Chapter 4=

Transformations:
 * increasing preserves order
 * decreasing reverses order
Power Functions:
Lurking Variables: a variable that influences the response (y) and/or the explanatory variable (x) but was not included in the modeling effort (ex: time or weather)
Correlation Based on Averages: TOO HIGH when applied to individuals
Extrapolation: predicting outside the data - not accurate - NOT OKAY!
Simpson's Paradox: the reversal of an association when data from several groups are combined to form one big group
 * Transform data that is not linear by using log, square root (√), x2, 1/x
 * A monotonic function f(t) - moves in one direction as its argument (t) increases
 * Linear (+): a + bt
 * Linear (-): a - bt
 * Square: t²
 * Reciprocal square root: 1/√t
 * Log: log(t)
 * Reciprocal: 1/t
 * Exponential growth becomes linear when you log y and x stays the same
 * Power becomes linear when you log y and log x
 * Growth rate: 10
 * ex- heat days per day would be all over average days for month
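The two linearizing transformations above can be verified numerically. A sketch with data generated to be exactly exponential and exactly power-shaped (curve parameters invented):

```python
# Logging y linearizes exponential data; logging x and y linearizes power data
# (curve parameters invented for illustration).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)

y_exp = 3.0 * 2.0 ** x          # exponential growth: y = 3 * 2^x
y_pow = 3.0 * x ** 2            # power relationship: y = 3 * x^2

# log(y) vs x is linear for exponential data; slope = ln 2
slope_exp, _ = np.polyfit(x, np.log(y_exp), 1)

# log(y) vs log(x) is linear for power data; slope = the exponent, 2
slope_pow, _ = np.polyfit(np.log(x), np.log(y_pow), 1)

print(round(slope_exp, 4), round(slope_pow, 4))
```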


 * ASSOCIATION DOES NOT IMPLY CAUSATION!

=**Chapter 14**=

ŷ = a + bx; use s to estimate the unknown σ

**Four Step Procedure (Significance Test for the Slope):**

 * STEP 1: H0: β = 0 (there is no linear association); Ha: β > 0 or β < 0 (there is a positive/negative association). State the alpha level!
 * STEP 2: Assume approximately normal and linear; name the linear regression t test
 * STEP 3: t = b / SE_b; df = n - 2; find p (if the computer output gives a two-sided p-value and your test is one-sided, divide the given p by 2)
 * STEP 4: Reject/Fail to Reject H0 by comparing p to α; conclude linear association/no linear association

Four Step Procedure (CI for Regression):
 * STEP 1: Estimate...
 * STEP 2: Assume normal and linear; linear regression t interval
 * STEP 3: df = n - 2; b ± t*(SE_b)
 * STEP 4: I am __% confident ...
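The STEP 3 test statistic can be sketched in numpy. Invented data; the p-value itself would come from a t table or software (e.g., the t distribution with df = n − 2):

```python
# Test statistic for the slope: t = b1 / SE_b with df = n - 2
# (data invented for illustration).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

s = np.sqrt(np.sum(residuals ** 2) / (n - 2))        # standard error about the line
se_b = s / np.sqrt(np.sum((x - x.mean()) ** 2))      # standard error of the slope

t = b1 / se_b
df = n - 2
print(round(t, 2), df)
```

The same se_b plugs into the STEP 3 confidence interval, b ± t*(SE_b), with t* taken from the t distribution with df = n − 2.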

How to in the Calculator:
1) Enter data into L1 and L2
2) STAT, over to CALC, down to 8 (linear regression), and ENTER
3) To graph:
 a) go to Y=
 b) VARS, down to 5 (Statistics), ENTER, over to EQ, ENTER
 c) ZOOM 9
4) To graph residuals: turn stat plot #2 on with x: L1 and y: RESID
 * this gives you slope, y intercept, r2, and r
 * also will put RESIDS list in for you (if the list exists in your lists)
 * to sum up the residuals: 2ND STAT, over to MATH, down to 5, ENTER, then 2ND STAT, down to RESID, ENTER. For the sum of squared residuals, add the ² inside the parentheses so it looks like: sum( LRESID² )


 * Previous AP Exam Free Response questions with Linear Regression: __**Statistics AP Central**__

 * 2001 Question 6
 * 2002 Question 4
 * 2004 (Form B) Question 1
 * 2005 Question 3
 * 2005 (Form B) Question 5
 * 2006 Question 2
 * 2007 (Form B) Question 4
 * 2007 (Form B) Question 6
 * 2008 Question 6

Websites:
 * Audio Lecture with examples

Practice Problems:
(Embedded practice problem media)