The general syntax of the regression operator is:
MODEL ← ResponseVariable [transformation] regress PredictorVariables.
The following table shows the various designs handled by the regress operator:

| Design | Left Argument | Operand / Function | Right Argument | Result |
|---|---|---|---|---|
| Simple Linear Regression | Response Variable* | regress | Predictor Variable | Intercept, Slope |
| Multiple Linear Regression | Response Variable* | regress | Vector of Predictor Variables | Intercept, Coefficients for each predictor variable |
| Multiple Linear Regression | Response Variable* | regress | Matrix whose columns are predictor variables | Intercept, Coefficients for each predictor variable |
| Multiple Linear Regression | Name of Response Variable* (Character string) | regress | Namespace containing all variables | Intercept, Coefficients for each predictor variable |
| Simple Quadratic Regression | Response Variable | ⊥ regress | Predictor Variable | Intercept, Coefficients for centered data and squared centered data |
| Multiple Quadratic Regression | Response Variable | ⊥ regress | Vector of Predictor Variables | Intercept, Linear, Quadratic and Interaction Coefficients |
| Polynomial Models | Response Variable | N ⊥ regress | Predictor Variable | Intercept, Coefficients for all powers up to N of predictor variable |
| Model with Indicator Variable(s) | Response Variable* | regress | Vector containing Predictor Variables and at least one Character Variable | Intercept, Coefficients for each predictor and indicator variable |
| Variance Stabilizing Transformations | Response Variable | fn regress, where fn is one of ln, sqrt, ÷, arcsin | Predictor Variable | Intercept, Coefficients |
| Multiplicative Regression | Response Variable | × regress | Predictor Variable | Constant, Powers |
| Indicator Response Variable | Boolean Variable | ≠ regress | Predictor Variable | Intercept, Coefficients |
| Custom Regression | [None] | transform regress | Database (Namespace) | Intercept, Coefficients |

* Pseudo left argument – actually, an array left operand.
A car dealer runs television ads for five weeks and records the number of cars sold that week:
| Week | Television Ads Run | Cars Sold |
|---|---|---|
| 1 | 1.0 | 14.0 |
| 2 | 3.0 | 24.0 |
| 3 | 2.0 | 18.0 |
| 4 | 1.0 | 17.0 |
| 5 | 3.0 | 27.0 |
The number of ads run each week is known as the independent or predictor variable X.
ADS←1 3 2 1 3
The number of cars sold in each of the 5 weeks is known as the dependent or response variable Y.
CARS←14 24 18 17 27
The regress function produces a namespace RG containing various outputs:
RG←CARS regress ADS
The estimates for the intercept and the slope are represented by the vector B.
RG.B ⍝ Intercept, Slope
10 5
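The same least-squares fit can be checked outside TamStat; a minimal numpy sketch (not TamStat code):

```python
import numpy as np

ads  = np.array([1, 3, 2, 1, 3])       # predictor X: TV ads run
cars = np.array([14, 24, 18, 17, 27])  # response Y: cars sold

# Ordinary least squares: solve [1 x] @ b ≈ y for b = (intercept, slope)
A = np.column_stack([np.ones(len(ads)), ads])
b, *_ = np.linalg.lstsq(A, cars, rcond=None)
print(b)  # intercept and slope, matching RG.B
```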
To obtain a full report apply the report function:
report RG
────────────────────────────────────────────────────────────────────────
The regression equation is:
Y←10+(5×X1)+E
ANOVA Table
SOURCE SS DF MS F P
------ --------------- ----- --------------- --------- ---------
Regression 100.00 1 100.00 21.43 0.01899
Error 14.00 3 4.67
---------- --------------- ----- --------------- --------- ---------
Total 114.00 4
S = 2.16025 R-Sq = 87.72% R-Sq(adj) = 83.63%
Solution
Variable Coeff SE T P
Intercept 10.00 2.37 4.22577 0.02424
B1 5.00 1.08 4.62910 0.01899
────────────────────────────────────────────────────────────────────────
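For reference, the ANOVA quantities in the report follow from the usual sum-of-squares decomposition; a numpy sketch using the same data (our own variable names, not TamStat's):

```python
import numpy as np

ads  = np.array([1, 3, 2, 1, 3])
cars = np.array([14, 24, 18, 17, 27])

# Fit, then decompose the total sum of squares into regression + error
A = np.column_stack([np.ones(len(ads)), ads])
b, *_ = np.linalg.lstsq(A, cars, rcond=None)
fitted = A @ b

sst = ((cars - cars.mean()) ** 2).sum()  # Total SS      (114.00)
sse = ((cars - fitted) ** 2).sum()       # Error SS      (14.00)
ssr = sst - sse                          # Regression SS (100.00)

n, p = len(cars), 1                      # p = number of predictors
f = (ssr / p) / (sse / (n - p - 1))      # F statistic
r_sq = ssr / sst                         # R-Sq
s = np.sqrt(sse / (n - p - 1))           # S, the residual standard error
print(f, r_sq, s)
```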
Sometimes it is useful to have more than one predictor variable. For example, we can estimate weight from both height and shoe size. The two-predictor model is Y←B0+(B1×X1)+(B2×X2)+E, where E represents the residual or error.
Multiple regression in TamStat requires the right argument to take one of three forms: a variable list, a matrix or a namespace. For both the variable list and matrix forms, the left argument is the response variable, a numeric vector.
MODEL←Weight regress Height ShoeSize ⍝ Variable List
XX←Height,⍪ShoeSize ⍝ XX is N×2 Matrix
MODEL←Weight regress XX ⍝ Multiple Regression
report MODEL
Observe that the intercept and the coefficient for shoe size are significant; the height coefficient is not, due to its large p-value. This shows that Height does not contribute significantly to Weight when shoe size is in the model.
To preserve the names of the variables, one can use a namespace which represents a database:
V←'Weight Height ShoeSize'
⍝ Variables of interest
DB←V selectFrom SD ⍝ Put variables into namespace
MODEL←'Weight' regress DB ⍝ Left argument is name of response variable.
report MODEL
MODEL.f 68 9.5 ⍝ Estimate weight from height, shoe size
158.84
0.9 MODEL.f confInt 68 9.5 ⍝ 90% Confidence interval
151.38 166.3
0.9 MODEL.f predInt 68 9.5 ⍝ 90% Prediction interval
114.96 202.72
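confInt and predInt are the standard confidence and prediction intervals for a regression estimate. Their formulas can be sketched in numpy with the earlier car-sales data (simple regression; the 0.975 t quantile for 3 degrees of freedom is hard-coded rather than looked up, e.g. via scipy.stats.t.ppf):

```python
import numpy as np

ads  = np.array([1, 3, 2, 1, 3])
cars = np.array([14, 24, 18, 17, 27])

n = len(ads)
xbar = ads.mean()
Sxx = ((ads - xbar) ** 2).sum()
b1 = ((ads - xbar) * (cars - cars.mean())).sum() / Sxx
b0 = cars.mean() - b1 * xbar
resid = cars - (b0 + b1 * ads)
s = np.sqrt((resid ** 2).sum() / (n - 2))   # residual standard error

x0 = 2.0
yhat = b0 + b1 * x0
t975 = 3.18245                               # t quantile, 0.975, df = 3

se_mean = s * np.sqrt(1/n + (x0 - xbar)**2 / Sxx)      # for the mean response
se_pred = s * np.sqrt(1 + 1/n + (x0 - xbar)**2 / Sxx)  # for an individual response
print(yhat - t975*se_mean, yhat + t975*se_mean)  # 95% confidence interval
print(yhat - t975*se_pred, yhat + t975*se_pred)  # 95% prediction interval
```

The prediction interval is always wider than the confidence interval: it adds the variance of a single new observation to the variance of the estimated mean.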
If there are character fields in a database, TamStat treats them as indicator variables. If there are more than two categories, TamStat will create multiple indicator variables. The indicator variable names will be taken from the value(s) of the character field. There will always be one fewer indicator variable than the number of categories; the remaining category serves as the base case.
V←'Height ShoeSize Sex' ⍝ “Sex” is a character field
DB←V selectFrom D ⍝ Create a namespace
MODEL←'Height' regress DB ⍝ “Height” is response variable
report MODEL ⍝ “F” is a value from “Sex” character field
Indicator Variables with more than two Categories
Let’s replace Sex with Party. There are 3 parties: Democrat, Independent and Republican. TamStat will create two indicator variables from Party: “Democrat” and “Independent”. “Republican” will be the base case.
V←'Height ShoeSize Party'
⍝ Replace “Sex” with “Party”
DB←V selectFrom D ⍝ Create new database
MODEL←'Height' regress DB ⍝ Height is response variable
report MODEL ⍝ “Republican” is base Case
Multiple Indicator Variables
When there is more than one character field in a database, the number of indicator variables becomes (k1−1)+(k2−1)+…+(km−1), where m is the number of character fields and ki is the number of categories in the i-th field:
DB←'Height ShoeSize Party Sex' selectFrom SD ⍝ Two character fields
report 'Height' regress DB ⍝ 3 Indicator Variables
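The indicator-variable construction described above can be sketched in plain Python (the sample values are made up, and the assumption that the base case is the last category in sorted order is ours, generalizing the Republican example):

```python
# Build k-1 indicator (dummy) columns per character field.
def indicator_columns(values):
    cats = sorted(set(values))
    base = cats[-1]                      # assumed base case: last category, no column
    return {c: [1 if v == c else 0 for v in values] for c in cats if c != base}

# Hypothetical sample values for two character fields
party = ['Democrat', 'Republican', 'Independent', 'Democrat']
sex = ['M', 'F', 'F', 'M']

cols = {**indicator_columns(party), **indicator_columns(sex)}
print(sorted(cols))  # (3-1) + (2-1) = 3 indicator variables
```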
We are trying to predict Y = tensile strength in p.s.i. from X = hardwood concentration (%).
X←1 1.5 2 3 4 4.5 5 5.5 6 6.5 7 8 9 10 11 12 13 14 15
Y←6.3 11.1 20 24 26.1 30 33.8 34 38.1 39.9 42 46.1 53.1 52 52.5 48 42.8 27.8 21.9
scatterPlot show Y X ⍝ Data are non-linear
Since the data above are clearly non-linear, perhaps a quadratic regression model would be appropriate. We use the base-value (⊥) or poly operand to indicate a quadratic model. Higher-order polynomials are indicated by (n⊥), where n is the degree of the polynomial. To eliminate correlation between the linear and squared terms, the data are centered by subtracting the mean:
MODEL←Y poly regress X ⍝ Quadratic regression
MODEL.B ⍝ Constant, linear and square coefficients
45.295 2.5463 ¯0.63455
⍝ ↑ ↑ ↑
⍝ Int Linear Square
MODEL.g 7 10 15 ⍝ The function g is the non-linear model:
44.581 47.511 27.012
MODEL.g confInt 10 ⍝ Confidence Interval
44.402 50.619
MODEL.g predInt 15 ⍝ Prediction Interval
15.752 38.273
report MODEL
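The centered quadratic fit can be reproduced in numpy (not TamStat code):

```python
import numpy as np

X = np.array([1, 1.5, 2, 3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 8, 9, 10,
              11, 12, 13, 14, 15])
Y = np.array([6.3, 11.1, 20, 24, 26.1, 30, 33.8, 34, 38.1, 39.9, 42, 46.1,
              53.1, 52, 52.5, 48, 42.8, 27.8, 21.9])

Xc = X - X.mean()   # centering decorrelates the linear and squared terms
A = np.column_stack([np.ones(len(X)), Xc, Xc**2])
b, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(b)  # intercept, linear and square coefficients, matching MODEL.B
```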
Polynomial Models in Two or More Variables
We are trying to predict percentage yield from reaction time and temperature. Both predictor variables are quadratic, so we have a linear and a quadratic term for each, plus an interaction term, for a total of five input variables:
⍝ Predictor Variable: Reaction Time
X←76 80.5 78 89 93 92.1 77.8 84 87.3 75 85 90 85 79.2 83 82 94 91.4 95 81.1 88.8 91 87 86
⍝ Predictor Variable: Temperature
Y←170 165 182 185 180 172 170 180 165 172 185 176 178 174 168 179 181 184 173 169 183 178 175 175
⍝ Response Variable: Yield
Z←50.95 47.35 50.99 44.96 41.89 41.44 51.79 50.78 42.48 49.8 48.74 46.2 50.49 52.78 49.71 52.75 39.41 43.63 38.19 50.92 46.55 44.28 48.72 49.13
MODEL←Z poly regress X Y
MODEL.B ⍝ Intercept, Linear, Quadratic and Interaction Coefficients
50.4 ¯0.72 ¯0.06 0.013 0.105 ¯0.038
⍝↑ ↑ ↑ ↑ ↑ ↑
⍝Int X X*2 X×Y Y Y*2
MODEL.g 90 176 ⍝ Reaction time = 90 sec, temp = 176 degrees C
45.96
MODEL.g confInt 90 176
45.481 46.439
MODEL.g predInt 90 176
44.542 47.378
report MODEL
The regression equation is:
Y←50.417+(¯0.71981×X1-85.467)+(¯0.059653×(X1-85.467)*2)+(0.012577×(X1-85.467)×(X2-175.79))+(0.10528×X2-175.79)+(¯0.037676×(X2-175.79)*2)+E
ANOVA Table
SOURCE SS DF MS F P
------ --------------- ----- --------------- --------- ---------
Regression 416.31 5 83.26 206.28 <0.00001
Error 7.27 18 0.40
---------- --------------- ----- --------------- --------- ---------
Total 423.58 23
S = 0.63532 R-Sq = 98.28% R-Sq(adj) = 97.81%
Solution
Variable Coeff SE T P
Intercept 50.42 0.26 192.84947 <0.00001
X1 ¯0.72 0.02 ¯29.36231 <0.00001
X2 ¯0.06 0.00 ¯13.09424 <0.00001
XY 0.01 0.01 2.40391 0.02721
Y1 0.11 0.02 4.42554 0.00033
Y2 ¯0.04 0.00 ¯9.18643 <0.00001
───────────────────────────────────────────────────────────────────────────
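The centered two-variable quadratic fit can be reproduced in numpy (not TamStat code; g is our own helper mirroring MODEL.g):

```python
import numpy as np

# Reaction time, temperature, and yield from the example above
X = np.array([76, 80.5, 78, 89, 93, 92.1, 77.8, 84, 87.3, 75, 85, 90,
              85, 79.2, 83, 82, 94, 91.4, 95, 81.1, 88.8, 91, 87, 86])
Y = np.array([170, 165, 182, 185, 180, 172, 170, 180, 165, 172, 185, 176,
              178, 174, 168, 179, 181, 184, 173, 169, 183, 178, 175, 175])
Z = np.array([50.95, 47.35, 50.99, 44.96, 41.89, 41.44, 51.79, 50.78, 42.48,
              49.8, 48.74, 46.2, 50.49, 52.78, 49.71, 52.75, 39.41, 43.63,
              38.19, 50.92, 46.55, 44.28, 48.72, 49.13])

xc, yc = X - X.mean(), Y - Y.mean()    # center both predictors
A = np.column_stack([np.ones(len(Z)), xc, xc**2, xc*yc, yc, yc**2])
b, *_ = np.linalg.lstsq(A, Z, rcond=None)

def g(x, y):
    """Evaluate the fitted quadratic surface at (x, y)."""
    dx, dy = x - X.mean(), y - Y.mean()
    return b @ np.array([1, dx, dx**2, dx*dy, dy, dy**2])

print(g(90, 176))  # estimated yield at 90 sec, 176 degrees C
```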
Linear regression assumes that the variance is constant regardless of the size of the response variable. In some cases this is not true. To compensate, we can transform the response variable to make the variance constant. The standard variance-stabilizing transformations, and the corresponding left operands to regress, are:

| Variance proportional to | Transformation | Left Operand to regress |
|---|---|---|
| Constant | y (none) | ⊣ |
| E(y) | √y | sqrt |
| E(y)×(1−E(y)) | arcsin √y | arcsin∘sqrt |
| E(y)² | ln y | ln |
| E(y)³ | 1÷√y | ÷∘sqrt |
| E(y)⁴ | 1÷y | ÷ |
An electric utility is developing a model relating peak demand to total monthly energy consumption. Data for 53 residential customers were collected. The variance is proportional to the mean demand, so we use a square-root transformation.
Demand←0.79 0.44 0.56 0.79 2.7 3.64 4.73 9.5 5.34 6.85 5.84 5.21 3.25 4.43 3.16 0.5 0.17 1.88 0.77 1.39 0.56 1.56 5.28 0.64 4 0.31 4.2 4.88 3.48 7.58 2.63 4.99 0.59 8.19 4.79 0.51 1.74 4.1 3.94 0.96 3.29 0.44 3.24 2.14 5.71 0.64 1.9 0.51 8.33 14.94 5.11 3.85 3.93
Usage←679 292 1012 493 582 1156 997 2189 1097 2078 1818 1700 747 2030 1643 414 354 1276 745 435 540 874 1543 1029 710 1434 837 1748 1381 1428 1255 1777 370 2316 1130 463 770 724 808 790 783 406 1242 658 1746 468 1114 413 1787 3560 1495 2221 1526
MODEL←Demand sqrt regress Usage ⍝ Use square-root transform
MODEL.B ⍝ Intercept and Slope
0.58225 0.00095286
MODEL.f 1800 ⍝ Least Squares Estimate of transformed demand.
2.2974
MODEL.g 1800 ⍝ Estimate demand in KW from 1800 KWH usage
5.2779
MODEL.g confInt 1800 ⍝ Average demand range
4.4799 6.1413
MODEL.g predInt 1800 ⍝ Individual demand range
1.8181 10.539
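The square-root transform fit can be reproduced in numpy (not TamStat code):

```python
import numpy as np

demand = np.array([0.79, 0.44, 0.56, 0.79, 2.7, 3.64, 4.73, 9.5, 5.34, 6.85,
                   5.84, 5.21, 3.25, 4.43, 3.16, 0.5, 0.17, 1.88, 0.77, 1.39,
                   0.56, 1.56, 5.28, 0.64, 4, 0.31, 4.2, 4.88, 3.48, 7.58,
                   2.63, 4.99, 0.59, 8.19, 4.79, 0.51, 1.74, 4.1, 3.94, 0.96,
                   3.29, 0.44, 3.24, 2.14, 5.71, 0.64, 1.9, 0.51, 8.33, 14.94,
                   5.11, 3.85, 3.93])
usage = np.array([679, 292, 1012, 493, 582, 1156, 997, 2189, 1097, 2078,
                  1818, 1700, 747, 2030, 1643, 414, 354, 1276, 745, 435,
                  540, 874, 1543, 1029, 710, 1434, 837, 1748, 1381, 1428,
                  1255, 1777, 370, 2316, 1130, 463, 770, 724, 808, 790,
                  783, 406, 1242, 658, 1746, 468, 1114, 413, 1787, 3560,
                  1495, 2221, 1526])

# Regress sqrt(demand) on usage; back-transform an estimate by squaring it
A = np.column_stack([np.ones(len(usage)), usage])
b, *_ = np.linalg.lstsq(A, np.sqrt(demand), rcond=None)
f1800 = b[0] + b[1] * 1800   # estimate on the square-root scale (MODEL.f)
print(f1800, f1800 ** 2)     # transformed and back-transformed (MODEL.g) estimates
```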
Multiplicative regression models the relationship Y←A×(X*B)×E. This transforms both the predictor and response variables, since taking logarithms rewrites it as: (⍟Y)←(⍟A)+(B×⍟X)+⍟E.
As an example, let X and Y be the predictor and response variables:
X←14 14 8 10 6 7 5 10 5 13
Y←864 870 83 176 37 50 8 164 26 584
MODEL←Y ×regress X
MODEL.B ⍝ Coefficient A and Exponent B
0.029161 3.8539
MODEL.g 9 ⍝ Estimate Y for X = 9
138.8
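The multiplicative fit is ordinary least squares on the logarithms; a numpy sketch (A_hat and B_hat are our own names for the constant and power):

```python
import numpy as np

X = np.array([14, 14, 8, 10, 6, 7, 5, 10, 5, 13])
Y = np.array([864, 870, 83, 176, 37, 50, 8, 164, 26, 584])

# Fit ln Y = ln A + B * ln X by least squares, then back-transform ln A
D = np.column_stack([np.ones(len(X)), np.log(X)])
coef, *_ = np.linalg.lstsq(D, np.log(Y), rcond=None)
A_hat, B_hat = np.exp(coef[0]), coef[1]
print(A_hat, B_hat)       # constant and power, matching MODEL.B
print(A_hat * 9 ** B_hat) # estimate Y for X = 9
```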
The user may create a function which selects and/or transforms any of the variables in a database. This is particularly useful if there are multiple transformations. In order to do this one must create a transform function. The variable named Y becomes the response variable; Int defaults to 1; all others are predictor variables:
makeTransFn 'Y←Height' 'X1←ShoeSize' 'X2←Sex eq ''M''' 'X3←Weight'
MODEL←transform regress #.SD
report MODEL
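The effect of a transform function can be sketched in numpy with made-up stand-ins for the database fields (the data below are hypothetical, not the SD database): each expression defines one column of the design, Y is the response, and the rest are predictors.

```python
import numpy as np

# Hypothetical stand-ins for the database fields
height   = np.array([66, 70, 72, 64, 69, 71])
shoesize = np.array([8.5, 10, 11, 7.5, 9.5, 10.5])
sex      = np.array(['M', 'M', 'F', 'F', 'M', 'F'])
weight   = np.array([150, 180, 195, 130, 170, 185])

# Mirror of: 'Y←Height' 'X1←ShoeSize' 'X2←Sex eq ''M''' 'X3←Weight'
Y = height
X = np.column_stack([
    np.ones(len(Y)),              # Int defaults to 1
    shoesize,                     # X1
    (sex == 'M').astype(float),   # X2: indicator built from a character field
    weight,                       # X3
])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(b)  # intercept followed by one coefficient per transformed predictor
```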