The general syntax of the regression operator is:
MODEL ← ResponseVariable [transformation] regress PredictorVariables.
The following table shows the various designs handled by the regress operator:

| Design | Left Argument | Operand / Function | Right Argument | Result |
|---|---|---|---|---|
| Simple Linear Regression | Response Variable* | regress | Predictor Variable | Intercept, Slope |
| Multiple Linear Regression | Response Variable* | regress | Vector of Predictor Variables | Intercept, Coefficients for each predictor variable |
| Multiple Linear Regression | Response Variable* | regress | Matrix whose columns are predictor variables | Intercept, Coefficients for each predictor variable |
| Multiple Linear Regression | Name of Response Variable* (Character string) | regress | Namespace containing all variables | Intercept, Coefficients for each predictor variable |
| Simple Quadratic Regression | Response Variable | ⊥ regress | Predictor Variable | Intercept, Coefficients for centered data and squared centered data |
| Multiple Quadratic Regression | Response Variable | ⊥ regress | Vector of Predictor Variables | Intercept, Linear, Quadratic and Interaction Coefficients |
| Polynomial Models | Response Variable | N ⊥ regress | Predictor Variable | Intercept, Coefficients for all powers up to N of predictor variable |
| Model with Indicator Variable(s) | Response Variable* | regress | Vector containing Predictor Variables and at least one Character Variable | Intercept, Coefficients for each predictor and indicator variable |
| Variance Stabilizing Transformations | Response Variable | fn regress, where fn is one of ln, sqrt, ÷, arcsin | Predictor Variable | Intercept, Coefficients |
| Multiplicative Regression | Response Variable | × regress | Predictor Variable | Constant, Powers |
| Indicator Response Variable | Boolean Variable | ≠ regress | Predictor Variable | Intercept, Coefficients |
| Custom Regression | [None] | transform regress | Database (Namespace) | Intercept, Coefficients |

* Pseudo left argument – actually, an array left operand.
A car dealer runs television ads for five weeks and records the number of cars sold that week:
| Week | Television Ads Run | Cars Sold |
|---|---|---|
| 1 | 1.0 | 14.0 |
| 2 | 3.0 | 24.0 |
| 3 | 2.0 | 18.0 |
| 4 | 1.0 | 17.0 |
| 5 | 3.0 | 27.0 |
The number of ads run each week is known as the independent or predictor variable X.
ADS←1 3 2 1 3
The number of cars sold in each of the 5 weeks is known as the dependent or response variable Y.
CARS←14 24 18 17 27
The regress function produces a namespace RG containing various outputs:
RG←CARS regress ADS
The estimates for the intercept and the slope are represented by the vector B.
RG.B ⍝ Intercept, Slope
10 5
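The same least-squares fit can be checked outside TamStat; a minimal numpy sketch (not TamStat code):

```python
import numpy as np

ads  = np.array([1, 3, 2, 1, 3])       # predictor X: TV ads run
cars = np.array([14, 24, 18, 17, 27])  # response Y: cars sold

# Ordinary least squares: solve [1 x] @ b ≈ y for b = (intercept, slope)
A = np.column_stack([np.ones(len(ads)), ads])
b, *_ = np.linalg.lstsq(A, cars, rcond=None)
print(b)  # intercept and slope, matching RG.B
```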
To obtain a full report apply the report function:
report RG
────────────────────────────────────────────────────────────────────────
The regression equation is:
Y←10+(5×X1)+E
ANOVA Table
SOURCE SS DF MS F P
------ --------------- ----- --------------- --------- ---------
Regression 100.00 1 100.00 21.43 0.01899
Error 14.00 3 4.67
---------- --------------- ----- --------------- --------- ---------
Total 114.00 4
S = 2.16025 R-Sq = 87.72% R-Sq(adj) = 83.63%
Solution
Variable Coeff SE T P
Intercept 10.00 2.37 4.22577 0.02424
B1 5.00 1.08 4.62910 0.01899
────────────────────────────────────────────────────────────────────────
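For reference, the ANOVA quantities in the report follow from the usual sum-of-squares decomposition; a numpy sketch using the same data (our own variable names, not TamStat's):

```python
import numpy as np

ads  = np.array([1, 3, 2, 1, 3])
cars = np.array([14, 24, 18, 17, 27])

# Fit, then decompose the total sum of squares into regression + error
A = np.column_stack([np.ones(len(ads)), ads])
b, *_ = np.linalg.lstsq(A, cars, rcond=None)
fitted = A @ b

sst = ((cars - cars.mean()) ** 2).sum()  # Total SS      (114.00)
sse = ((cars - fitted) ** 2).sum()       # Error SS      (14.00)
ssr = sst - sse                          # Regression SS (100.00)

n, p = len(cars), 1                      # p = number of predictors
f = (ssr / p) / (sse / (n - p - 1))      # F statistic
r_sq = ssr / sst                         # R-Sq
s = np.sqrt(sse / (n - p - 1))           # S, the residual standard error
print(f, r_sq, s)
```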
Sometimes it is useful to have more than one predictor variable. For example, we can estimate weight from both height and shoe size. The two-predictor model is Y←B0+(B1×X1)+(B2×X2)+E, where E represents the residual or error.
Multiple regression in TamStat requires the right argument to take one of three forms: a variable list, a matrix or a namespace. For both the variable list and matrix forms, the left argument is the response variable, a numeric vector.
MODEL←Weight regress Height ShoeSize ⍝ Variable List
XX←Height,⍪ShoeSize ⍝ XX is N×2 Matrix
MODEL←Weight regress XX ⍝ Multiple Regression
report MODEL
Observe that the intercept and the coefficient for shoe size are significant; the height coefficient is not, due to its large p-value. This shows that Height does not contribute significantly to Weight when shoe size is in the model.
To preserve the names of the variables, one can use a namespace which represents a database:
V←'Weight Height ShoeSize'
⍝ Variables of interest
DB←V selectFrom SD ⍝ Put variables into namespace
MODEL←'Weight' regress DB ⍝ Left argument is name of response variable.
report MODEL
MODEL.f 68 9.5 ⍝ Estimate weight from height, shoe size
158.84
0.9 MODEL.f confInt 68 9.5 ⍝ 90% Confidence interval
151.38 166.3
0.9 MODEL.f predInt 68 9.5 ⍝ 90% Prediction interval
114.96 202.72
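confInt and predInt are the standard confidence and prediction intervals for a regression estimate. Their formulas can be sketched in numpy with the earlier car-sales data (simple regression; the 0.975 t quantile for 3 degrees of freedom is hard-coded rather than looked up, e.g. via scipy.stats.t.ppf):

```python
import numpy as np

ads  = np.array([1, 3, 2, 1, 3])
cars = np.array([14, 24, 18, 17, 27])

n = len(ads)
xbar = ads.mean()
Sxx = ((ads - xbar) ** 2).sum()
b1 = ((ads - xbar) * (cars - cars.mean())).sum() / Sxx
b0 = cars.mean() - b1 * xbar
resid = cars - (b0 + b1 * ads)
s = np.sqrt((resid ** 2).sum() / (n - 2))   # residual standard error

x0 = 2.0
yhat = b0 + b1 * x0
t975 = 3.18245                               # t quantile, 0.975, df = 3

se_mean = s * np.sqrt(1/n + (x0 - xbar)**2 / Sxx)      # for the mean response
se_pred = s * np.sqrt(1 + 1/n + (x0 - xbar)**2 / Sxx)  # for an individual response
print(yhat - t975*se_mean, yhat + t975*se_mean)  # 95% confidence interval
print(yhat - t975*se_pred, yhat + t975*se_pred)  # 95% prediction interval
```

The prediction interval is always wider than the confidence interval: it adds the variance of a single new observation to the variance of the estimated mean.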
If there are character fields in a database, TamStat treats them as indicator variables. If there are more than two categories, TamStat will create multiple indicator variables. The indicator variable names will be taken from the value(s) of the character field. There will always be one fewer indicator variable than the number of categories; the remaining category serves as the base case.
V←'Height ShoeSize Sex' ⍝ “Sex” is a character field
DB←V selectFrom D ⍝ Create a namespace
MODEL←'Height' regress DB ⍝ “Height” is response variable
report MODEL ⍝ “F” is a value from “Sex” character field
Indicator Variables with more than two Categories
Let’s replace Sex with Party. There are 3 parties: Democrat, Independent and Republican. TamStat will create two indicator variables from Party: “Democrat” and “Independent”. “Republican” will be the base case.
V←'Height ShoeSize Party'
⍝ Replace “Sex” with “Party”
DB←V selectFrom D ⍝ Create new database
MODEL←'Height' regress DB ⍝ Height is response variable
report MODEL ⍝ “Republican” is base Case
Multiple Indicator Variables
When there is more than one character field in a database, the number of indicator variables becomes (k1−1)+(k2−1)+…+(km−1), where m is the number of character fields and ki is the number of categories in the i-th field:
DB←'Height ShoeSize Party Sex' selectFrom SD ⍝ Two character fields
report 'Height' regress DB ⍝ 3 Indicator Variables
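The indicator-variable construction described above can be sketched in plain Python (the sample values are made up, and the assumption that the base case is the last category in sorted order is ours, generalizing the Republican example):

```python
# Build k-1 indicator (dummy) columns per character field.
def indicator_columns(values):
    cats = sorted(set(values))
    base = cats[-1]                      # assumed base case: last category, no column
    return {c: [1 if v == c else 0 for v in values] for c in cats if c != base}

# Hypothetical sample values for two character fields
party = ['Democrat', 'Republican', 'Independent', 'Democrat']
sex = ['M', 'F', 'F', 'M']

cols = {**indicator_columns(party), **indicator_columns(sex)}
print(sorted(cols))  # (3-1) + (2-1) = 3 indicator variables
```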
We are trying to predict Y = tensile strength in p.s.i. from X = hardwood concentration (%).
X←1 1.5 2 3 4 4.5 5 5.5 6 6.5 7 8 9 10 11 12 13 14 15
Y←6.3 11.1 20 24 26.1 30 33.8 34 38.1 39.9 42 46.1 53.1 52 52.5 48 42.8 27.8 21.9
scatterPlot show Y X ⍝ Data are non-linear
Since the data above are clearly non-linear, perhaps a quadratic regression model would be appropriate. We use the base-value (⊥) or poly operand to indicate a quadratic model. Higher-order polynomials are indicated by (n⊥), where n is the degree of the polynomial. To eliminate correlation between the linear and squared terms, the data are centered by subtracting the mean:
MODEL←Y poly regress X ⍝ Quadratic regression
MODEL.B ⍝ Constant, linear and square coefficients
45.295 2.5463 ¯0.63455
⍝ ↑ ↑ ↑
⍝ Int Linear Square
MODEL.g 7 10 15 ⍝ The function g is the non-linear model:
44.581 47.511 27.012
MODEL.g confInt 10 ⍝ Confidence Interval
44.402 50.619
MODEL.g predInt 15 ⍝ Prediction Interval
15.752 38.273
report MODEL
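The centered quadratic fit can be reproduced in numpy (not TamStat code):

```python
import numpy as np

X = np.array([1, 1.5, 2, 3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 8, 9, 10,
              11, 12, 13, 14, 15])
Y = np.array([6.3, 11.1, 20, 24, 26.1, 30, 33.8, 34, 38.1, 39.9, 42, 46.1,
              53.1, 52, 52.5, 48, 42.8, 27.8, 21.9])

Xc = X - X.mean()   # centering decorrelates the linear and squared terms
A = np.column_stack([np.ones(len(X)), Xc, Xc**2])
b, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(b)  # intercept, linear and square coefficients, matching MODEL.B
```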
Polynomial Models in Two or More Variables
We are trying to predict percentage yield from reaction time and temperature. Both predictor variables are quadratic, so we have a linear and a quadratic term for each, plus an interaction term, for a total of five input variables:
⍝ Predictor Variable: Reaction Time
X←76 80.5 78 89 93 92.1 77.8 84 87.3 75 85 90 85 79.2 83 82 94 91.4 95 81.1 88.8 91 87 86
⍝ Predictor Variable: Temperature
Y←170 165 182 185 180 172 170 180 165 172 185 176 178 174 168 179 181 184 173 169 183 178 175 175
⍝ Response Variable: Yield
Z←50.95 47.35 50.99 44.96 41.89 41.44 51.79 50.78 42.48 49.8 48.74 46.2 50.49 52.78 49.71 52.75 39.41 43.63 38.19 50.92 46.55 44.28 48.72 49.13
MODEL←Z poly regress X Y
MODEL.B ⍝ Intercept, Linear, Quadratic and Interaction Coefficients
50.4 ¯0.72 ¯0.06 0.013 0.105 ¯0.038
⍝↑ ↑ ↑ ↑ ↑ ↑
⍝Int X X*2 X×Y Y Y*2
MODEL.g 90 176 ⍝ Reaction time = 90 sec, temp = 176 degrees C
45.96
MODEL.g confInt 90 176
45.481 46.439
MODEL.g predInt 90 176
44.542 47.378
report MODEL
The regression equation is:
Y←50.417+(¯0.71981×X1-85.467)+(¯0.059653×(X1-85.467)*2)+(0.012577×(X1-85.467)×(X2-175.79))+(0.10528×X2-175.79)+(¯0.037676×(X2-175.79)*2)+E
ANOVA Table
SOURCE SS DF MS F P
------ --------------- ----- --------------- --------- ---------
Regression 416.31 5 83.26 206.28 <0.00001
Error 7.27 18 0.40
---------- --------------- ----- --------------- --------- ---------
Total 423.58 23
S = 0.63532 R-Sq = 98.28% R-Sq(adj) = 97.81%
Solution
Variable Coeff SE T P
Intercept 50.42 0.26 192.84947 <0.00001
X1 ¯0.72 0.02 ¯29.36231 <0.00001
X2 ¯0.06 0.00 ¯13.09424 <0.00001
XY 0.01 0.01 2.40391 0.02721
Y1 0.11 0.02 4.42554 0.00033
Y2 ¯0.04 0.00 ¯9.18643 <0.00001
───────────────────────────────────────────────────────────────────────────
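The centered two-variable quadratic fit can be reproduced in numpy (not TamStat code; g is our own helper mirroring MODEL.g):

```python
import numpy as np

# Reaction time, temperature, and yield from the example above
X = np.array([76, 80.5, 78, 89, 93, 92.1, 77.8, 84, 87.3, 75, 85, 90,
              85, 79.2, 83, 82, 94, 91.4, 95, 81.1, 88.8, 91, 87, 86])
Y = np.array([170, 165, 182, 185, 180, 172, 170, 180, 165, 172, 185, 176,
              178, 174, 168, 179, 181, 184, 173, 169, 183, 178, 175, 175])
Z = np.array([50.95, 47.35, 50.99, 44.96, 41.89, 41.44, 51.79, 50.78, 42.48,
              49.8, 48.74, 46.2, 50.49, 52.78, 49.71, 52.75, 39.41, 43.63,
              38.19, 50.92, 46.55, 44.28, 48.72, 49.13])

xc, yc = X - X.mean(), Y - Y.mean()    # center both predictors
A = np.column_stack([np.ones(len(Z)), xc, xc**2, xc*yc, yc, yc**2])
b, *_ = np.linalg.lstsq(A, Z, rcond=None)

def g(x, y):
    """Evaluate the fitted quadratic surface at (x, y)."""
    dx, dy = x - X.mean(), y - Y.mean()
    return b @ np.array([1, dx, dx**2, dx*dy, dy, dy**2])

print(g(90, 176))  # estimated yield at 90 sec, 176 degrees C
```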
Linear regression assumes that the variance is constant regardless of the size of the response variable. In some cases this is not true. To compensate, we can transform the response variable to make the variance constant. The standard variance-stabilizing transformations, and the corresponding left operands to regress, are:

| Variance proportional to | Transformation | Left Operand to regress |
|---|---|---|
| Constant | y (none) | ⊣ |
| E(y) | √y | sqrt |
| E(y)×(1−E(y)) | arcsin √y | arcsin∘sqrt |
| E(y)² | ln y | ln |
| E(y)³ | 1÷√y | ÷∘sqrt |
| E(y)⁴ | 1÷y | ÷ |
An electric utility is developing a model relating peak demand to total monthly energy consumption. Data for 53 residential customers were collected. The variance is proportional to the mean demand, so we use a square-root transformation.
Demand←0.79 0.44 0.56 0.79 2.7 3.64 4.73 9.5 5.34 6.85 5.84 5.21 3.25 4.43 3.16 0.5 0.17 1.88 0.77 1.39 0.56 1.56 5.28 0.64 4 0.31 4.2 4.88 3.48 7.58 2.63 4.99 0.59 8.19 4.79 0.51 1.74 4.1 3.94 0.96 3.29 0.44 3.24 2.14 5.71 0.64 1.9 0.51 8.33 14.94 5.11 3.85 3.93
Usage←679 292 1012 493 582 1156 997 2189 1097 2078 1818 1700 747 2030 1643 414 354 1276 745 435 540 874 1543 1029 710 1434 837 1748 1381 1428 1255 1777 370 2316 1130 463 770 724 808 790 783 406 1242 658 1746 468 1114 413 1787 3560 1495 2221 1526
MODEL←Demand sqrt regress Usage ⍝ Use square-root transform
MODEL.B ⍝ Intercept and Slope
0.58225 0.00095286
MODEL.f 1800 ⍝ Least Squares Estimate of transformed demand.
2.2974
MODEL.g 1800 ⍝ Estimate demand in KW from 1800 KWH usage
5.2779
MODEL.g confInt 1800 ⍝ Average demand range
4.4799 6.1413
MODEL.g predInt 1800 ⍝ Individual demand range
1.8181 10.539
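The square-root transform fit can be reproduced in numpy (not TamStat code):

```python
import numpy as np

demand = np.array([0.79, 0.44, 0.56, 0.79, 2.7, 3.64, 4.73, 9.5, 5.34, 6.85,
                   5.84, 5.21, 3.25, 4.43, 3.16, 0.5, 0.17, 1.88, 0.77, 1.39,
                   0.56, 1.56, 5.28, 0.64, 4, 0.31, 4.2, 4.88, 3.48, 7.58,
                   2.63, 4.99, 0.59, 8.19, 4.79, 0.51, 1.74, 4.1, 3.94, 0.96,
                   3.29, 0.44, 3.24, 2.14, 5.71, 0.64, 1.9, 0.51, 8.33, 14.94,
                   5.11, 3.85, 3.93])
usage = np.array([679, 292, 1012, 493, 582, 1156, 997, 2189, 1097, 2078,
                  1818, 1700, 747, 2030, 1643, 414, 354, 1276, 745, 435,
                  540, 874, 1543, 1029, 710, 1434, 837, 1748, 1381, 1428,
                  1255, 1777, 370, 2316, 1130, 463, 770, 724, 808, 790,
                  783, 406, 1242, 658, 1746, 468, 1114, 413, 1787, 3560,
                  1495, 2221, 1526])

# Regress sqrt(demand) on usage; back-transform an estimate by squaring it
A = np.column_stack([np.ones(len(usage)), usage])
b, *_ = np.linalg.lstsq(A, np.sqrt(demand), rcond=None)
f1800 = b[0] + b[1] * 1800   # estimate on the square-root scale (MODEL.f)
print(f1800, f1800 ** 2)     # transformed and back-transformed (MODEL.g) estimates
```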
Multiplicative regression models the relationship Y←A×(X*B)×E. This transforms both the predictor and response variables, since taking logarithms rewrites it as: (⍟Y)←(⍟A)+(B×⍟X)+⍟E.
As an example, let X and Y be the predictor and response variables:
X←14 14 8 10 6 7 5 10 5 13
Y←864 870 83 176 37 50 8 164 26 584
MODEL←Y ×regress X
MODEL.B ⍝ Coefficient A and Exponent B
0.029161 3.8539
MODEL.g 9 ⍝ Estimate Y for X = 9
138.8
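The multiplicative fit is ordinary least squares on the logarithms; a numpy sketch (A_hat and B_hat are our own names for the constant and power):

```python
import numpy as np

X = np.array([14, 14, 8, 10, 6, 7, 5, 10, 5, 13])
Y = np.array([864, 870, 83, 176, 37, 50, 8, 164, 26, 584])

# Fit ln Y = ln A + B * ln X by least squares, then back-transform ln A
D = np.column_stack([np.ones(len(X)), np.log(X)])
coef, *_ = np.linalg.lstsq(D, np.log(Y), rcond=None)
A_hat, B_hat = np.exp(coef[0]), coef[1]
print(A_hat, B_hat)       # constant and power, matching MODEL.B
print(A_hat * 9 ** B_hat) # estimate Y for X = 9
```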
The user may create a function which selects and/or transforms any of the variables in a database. This is particularly useful if there are multiple transformations. In order to do this one must create a transform function. The variable named Y becomes the response variable; Int defaults to 1; all others are predictor variables:
makeTransFn 'Y←Height' 'X1←ShoeSize' 'X2←Sex eq ''M''' 'X3←Weight'
MODEL←transform regress #.SD
report MODEL
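The effect of a transform function can be sketched in numpy with made-up stand-ins for the database fields (the data below are hypothetical, not the SD database): each expression defines one column of the design, Y is the response, and the rest are predictors.

```python
import numpy as np

# Hypothetical stand-ins for the database fields
height   = np.array([66, 70, 72, 64, 69, 71])
shoesize = np.array([8.5, 10, 11, 7.5, 9.5, 10.5])
sex      = np.array(['M', 'M', 'F', 'F', 'M', 'F'])
weight   = np.array([150, 180, 195, 130, 170, 185])

# Mirror of: 'Y←Height' 'X1←ShoeSize' 'X2←Sex eq ''M''' 'X3←Weight'
Y = height
X = np.column_stack([
    np.ones(len(Y)),              # Int defaults to 1
    shoesize,                     # X1
    (sex == 'M').astype(float),   # X2: indicator built from a character field
    weight,                       # X3
])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(b)  # intercept followed by one coefficient per transformed predictor
```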