The Regression Operator

The general syntax of the regression operator is:

MODEL ← ResponseVariable [transformation] regress PredictorVariables

The various forms of regression are listed below:

  1. Simple Regression:  One predictor variable
  2. Multiple Regression:  Two or more predictor variables
  3. Indicator Variables:  One or more indicator variables with two or more categories in each
  4. Polynomial Regression:  One or two quadratic predictor variables
  5. Variance Stabilizing Transformations:  Log, square-root, and inverse transformations
  6. Multiplicative Regression:  Log transformation of both predictor and response variables
  7. Custom Regression:  Complex designs requiring a combination of the above designs

The following table shows the various designs handled by the regress operator:

| Design | Left Argument | Operand / Function | Right Argument | Result |
|---|---|---|---|---|
| Simple Linear Regression | Response Variable* | regress | Predictor Variable | Intercept, Slope |
| Multiple Linear Regression | Response Variable* | regress | Vector of Predictor Variables | Intercept, Coefficients for each predictor variable |
| Multiple Linear Regression | Response Variable* | regress | Matrix whose columns are predictor variables | Intercept, Coefficients for each predictor variable |
| Multiple Linear Regression | Name of Response Variable* (character string) | regress | Namespace containing all variables | Intercept, Coefficients for each predictor variable |
| Simple Quadratic Regression | Response Variable | ⊥ (poly) regress | Predictor Variable | Intercept, Coefficients for centered data and squared centered data |
| Multiple Quadratic Regression | Response Variable | ⊥ (poly) regress | Vector of Predictor Variables | Intercept, Linear, Quadratic and Interaction Coefficients |
| Polynomial Models | Response Variable | N⊥ regress | Predictor Variable | Intercept, Coefficients for all powers up to N of the predictor variable |
| Model with Indicator Variable(s) | Response Variable* | regress | Vector containing Predictor Variables and at least one Character Variable | Intercept, Coefficients for each predictor variable and k−1 coefficients for each character variable (k = unique character values) |
| Variance Stabilizing Transformations | Response Variable | fn regress, fn ∊ [ln\|sqrt\|÷\|arcsin] | Predictor Variable | Intercept, Coefficients |
| Multiplicative Regression | Response Variable | × regress | Predictor Variable | Constant, Powers |
| Indicator Response Variable | Boolean Variable | ≠ regress | Predictor Variable | Intercept, Coefficients |
| Custom Regression | [None] | transform regress | Database (Namespace) | Intercept, Coefficients |

*  Pseudo Left Argument – Actually, an array left operand.

Simple Regression

A car dealer runs television ads for five weeks and records the number of cars sold that week:

  Week   Television Ads Run   Cars Sold
   1            1.0             14.0
   2            3.0             24.0
   3            2.0             18.0
   4            1.0             17.0
   5            3.0             27.0

The number of ads run for 5 weeks is known as the independent or predictor variable X.      

      ADS←1 3 2 1 3

The number of cars sold in each of the 5 weeks is known as the dependent or response  variable Y.

      CARS←14 24 18 17 27

The regress function produces a namespace RG containing various outputs: 

      RG←CARS regress ADS

The estimates for the intercept and the slope are represented by the vector B.

      RG.B                   ⍝ Intercept, Slope

10 5
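As a cross-check, the same estimates can be reproduced from the classical least-squares formulas. The sketch below is plain Python, not TamStat; it is included only to make the arithmetic explicit:

```python
# Simple least-squares regression from first principles:
# slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x).

def simple_regress(y, x):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

ads = [1, 3, 2, 1, 3]
cars = [14, 24, 18, 17, 27]
print(simple_regress(cars, ads))  # → (10.0, 5.0), matching RG.B
```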

To obtain a full report apply the report function:  

     report RG
────────────────────────────────────────────────────────────────────────
 The regression equation is:                                           
                                                                       
 Y←10+(5×X1)+E                                                       
                                                                       
  ANOVA Table                                                          
                                                                       
  SOURCE                  SS    DF              MS         F         P 
  ------     --------------- ----- --------------- --------- --------- 
  Regression          100.00     1          100.00     21.43   0.01899 
  Error                14.00     3            4.67                     
  ---------- --------------- ----- --------------- --------- --------- 
  Total               114.00     4                                     
                                                                       
  S =    2.16025  R-Sq =  87.72%  R-Sq(adj) =   83.63%                 
                                                                        
 Solution                                                              
                                                                       
   Variable     Coeff        SE         T         P                    
 Intercept      10.00      2.37   4.22577   0.02424                    
 B1              5.00      1.08   4.62910   0.01899                    
────────────────────────────────────────────────────────────────────────      

Multiple Regression

Sometimes it is useful to have more than one predictor variable. For example, we can estimate weight from both height and shoe size. The two-independent-variable model is:

     Y = β0 + (β1×X1) + (β2×X2) + ε,  or in APL form  Y←B0+(B1×X1)+(B2×X2)+E,  where ε represents the residual or error.

Multiple regression in TamStat requires the right argument to take one of three forms: a variable list, a matrix, or a namespace. For both the variable-list and matrix forms, the left argument is the response variable, a numeric vector.

     MODEL←Weight regress Height ShoeSize ⍝  Variable List

     XX←Height,⍪ShoeSize                  ⍝  XX is N×2 Matrix

     MODEL←Weight regress XX              ⍝  Multiple Regression

     report MODEL

      

Observe that the intercept and the coefficient for shoe size are significant, while the coefficient for height is not, due to its large p-value. This shows that height does not contribute significantly to weight when shoe size is in the model.
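The normal-equations computation underlying multiple regression can be sketched in plain Python. The data below are synthetic, generated so the true coefficients are known; the names are illustrative only, and this is a sketch of the standard method rather than the TamStat implementation:

```python
# Multiple linear regression via the normal equations (X'X)b = X'y.

def solve(a, b):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [v - f * w for v, w in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def multi_regress(y, cols):
    X = [[1.0] + [col[i] for col in cols] for i in range(len(y))]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yv for r, yv in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

height = [60, 62, 65, 68, 70, 72, 74]
shoe   = [7, 7.5, 8, 9, 10, 11, 12]
# weight generated as 2 + 1.5*height + 4*shoe, so those are the true coefficients
weight = [2 + 1.5 * h + 4 * s for h, s in zip(height, shoe)]
print(multi_regress(weight, [height, shoe]))  # ≈ [2.0, 1.5, 4.0]
```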

To preserve the names of the variables, one can use a namespace which represents a database:

     V←'Weight Height ShoeSize'   ⍝ Variables of interest

     DB←V selectFrom SD          ⍝ Put variables into namespace  

     MODEL←'Weight' regress DB   ⍝ Left argument is name of response variable.

     report MODEL

  

     MODEL.f 68 9.5               ⍝ Estimate weight from height, shoe size

158.84

    0.9 MODEL.f confInt 68 9.5   ⍝ 90% Confidence interval

151.38 166.3

    0.9 MODEL.f predInt 68 9.5   ⍝ 90% Prediction interval

114.96 202.72
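The difference between the confidence interval (for the mean response) and the wider prediction interval (for a new observation) follows from the textbook formulas. The plain-Python sketch below applies them to the ads/cars data from the simple-regression section; the t critical value is hardcoded rather than computed, which is an assumption of this sketch:

```python
# Confidence vs. prediction intervals for simple regression.
import math

x = [1, 3, 2, 1, 3]          # ads
y = [14, 24, 18, 17, 27]     # cars
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
b0 = my - b1 * mx
# Residual standard error; matches S = 2.16025 in the report above.
s = math.sqrt(sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
t = 2.3534  # t critical value, 90% two-sided, n-2 = 3 df (hardcoded assumption)

def intervals(x0):
    """Return (confidence interval, prediction interval) at x0."""
    yhat = b0 + b1 * x0
    half_ci = t * s * math.sqrt(1 / n + (x0 - mx) ** 2 / sxx)      # mean response
    half_pi = t * s * math.sqrt(1 + 1 / n + (x0 - mx) ** 2 / sxx)  # new observation
    return (yhat - half_ci, yhat + half_ci), (yhat - half_pi, yhat + half_pi)

print(intervals(2))  # the prediction interval is always the wider of the two
```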
 

Indicator Variables

If there are character fields in a database, TamStat treats them as indicator variables.   If there are more than two categories, TamStat will create multiple indicator variables.   The indicator variable names will be taken from the value(s) of the indicator variable.   There will always be k-1 indicator variables when there are k unique values in the character field.  

    V←'Height ShoeSize Sex'    ⍝ “Sex” is a character field

   DB←V selectFrom D          ⍝ Create a namespace

   MODEL←'Height' regress DB  ⍝ “Height” is response variable

   report MODEL               ⍝ “F” is a value from “Sex” character field

Indicator Variables with more than two Categories

Let’s replace Sex with Party.  There are 3 parties:  Democrat, Independent and Republican.  TamStat will create two indicator variables from Party:  “Democrat” and “Independent”.  “Republican” will be the base case.

   V←'Height ShoeSize Party'   ⍝ Replace “Sex” with “Party”

   DB←V selectFrom D           ⍝ Create new database

   MODEL←'Height' regress DB   ⍝ Height is response variable

   report MODEL                ⍝ “Republican” is base Case

  

Multiple Indicator Variables

When there is more than one character field in a database, the total number of indicator variables becomes (k1−1)+(k2−1)+…+(km−1), where m is the number of character fields and ki is the number of unique values in the i-th field:

   DB←'Height ShoeSize Party Sex' selectFrom SD ⍝ Two-character fields

  report 'Height' regress DB                    ⍝ 3 Indicator Variables
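The k−1 coding rule described above can be sketched in plain Python. The field values are illustrative; following the Party example, the last unique value in alphabetical order is taken as the base case (an assumption inferred from "Republican" being the base):

```python
# Build k-1 indicator (dummy) variables from a character field: with k unique
# values, k-1 columns are produced and the last value in sorted order is the
# base case.

def indicator_columns(field):
    values = sorted(set(field))          # unique values, alphabetical order
    return {v: [1 if x == v else 0 for x in field] for v in values[:-1]}

party = ['Democrat', 'Republican', 'Independent', 'Democrat', 'Republican']
cols = indicator_columns(party)
print(sorted(cols))      # → ['Democrat', 'Independent']  ('Republican' is base)
print(cols['Democrat'])  # → [1, 0, 0, 1, 0]
```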

Polynomial Regression

We are trying to predict Y = tensile strength in p.s.i. from X = hardwood concentration (%).

      X←1 1.5 2 3 4 4.5 5 5.5 6 6.5 7 8 9 10 11 12 13 14 15

     Y←6.3 11.1 20 24 26.1 30 33.8 34 38.1 39.9 42 46.1 53.1 52 52.5 48 42.8 27.8 21.9

     scatterPlot show Y X ⍝ Data are non-linear

Since the data above are clearly non-linear, perhaps a quadratic regression model would be appropriate. We use the base-value (⊥) or poly operand to indicate a quadratic model. Higher-order polynomials are indicated by (n⊥), where n is the degree of the polynomial. To eliminate correlation between the linear and squared terms, we center the data by subtracting the mean:

   MODEL←Y poly regress X ⍝ Quadratic regression

   MODEL.B                ⍝ Constant, linear and square coefficients

45.295 2.5463 ¯0.63455

⍝  ↑       ↑       ↑

⍝ Int   Linear  Square

   MODEL.g 7 10 15        ⍝ The function g is the non-linear model:

44.581 47.511 27.012

   MODEL.g confInt 10     ⍝ Confidence Interval

44.402 50.619

   MODEL.g predInt 15     ⍝ Prediction Interval

15.752 38.273

   report MODEL
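The centered-quadratic idea can be sketched in plain Python. The data below are synthetic and noise-free (y = 1 + 2x + 3x²) so the fit is exact; this illustrates the method, not the TamStat implementation:

```python
# Centered quadratic regression: regress y on (x - mean x) and (x - mean x)^2.

def solve(a, b):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [v - f * w for v, w in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

xs = [1, 2, 3, 4, 5, 6, 7]
ys = [1 + 2 * v + 3 * v * v for v in xs]
mx = sum(xs) / len(xs)

# Design matrix columns: 1, centered x, centered x squared
X = [[1.0, v - mx, (v - mx) ** 2] for v in xs]
xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yv for r, yv in zip(X, ys)) for i in range(3)]
b0, b1, b2 = solve(xtx, xty)   # coefficients are for the *centered* data

def g(x0):
    """The fitted non-linear model, analogous to MODEL.g."""
    return b0 + b1 * (x0 - mx) + b2 * (x0 - mx) ** 2

print(g(10))  # equals 1 + 2*10 + 3*100 = 321 for this noise-free data
```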

Polynomial Models in Two or More Variables

We are trying to predict percentage yield from reaction time and temperature. Both predictor variables are quadratic, so we have a linear and a quadratic term for each, plus an interaction term, for a total of five input variables:

   ⍝ Predictor Variable: Reaction Time

   X←76 80.5 78 89 93 92.1 77.8 84 87.3 75 85 90 85 79.2 83 82 94 91.4 95 81.1 88.8 91 87 86

   ⍝ Predictor Variable: Temperature

   Y←170 165 182 185 180 172 170 180 165 172 185 176 178 174 168 179 181 184 173 169 183 178 175 175

   ⍝ Response Variable: Yield

   Z←50.95 47.35 50.99 44.96 41.89 41.44 51.79 50.78 42.48 49.8 48.74 46.2 50.49 52.78 49.71 52.75 39.41 43.63 38.19 50.92 46.55 44.28 48.72 49.13

   MODEL←Z poly regress X Y

   MODEL.B    ⍝ Intercept, Linear, Quadratic and Interaction Coefficients

50.4 ¯0.72 ¯0.06 0.013 0.105 ¯0.038

⍝↑      ↑     ↑     ↑    ↑       ↑

⍝Int    X    X*2   X×Y   Y      Y*2

  

   MODEL.g 90 176         ⍝ Reaction time = 90 sec, temp = 176 degrees C

45.96

   MODEL.g confInt 90 176

45.481 46.439

   MODEL.g predInt 90 176

44.542 47.378

   report MODEL

The regression equation is:                                         

 Y←50.417+(¯0.71981×X1-85.467)+(¯0.059653×(X1-85.467)*2)+(0.012577×(X1-85.467)×(X2-175.79))+(0.10528×X2-175.79)+(¯0.037676×(X2-175.79)*2)+E

                                                       
  ANOVA Table                                  

  SOURCE                  SS    DF              MS         F         P

  ------     --------------- ----- --------------- --------- ---------

  Regression          416.31     5           83.26    206.28  <0.00001

  Error                 7.27    18            0.40                   

  ---------- --------------- ----- --------------- --------- ---------

  Total               423.58    23                                   

                                                                     
  S =    0.63532  R-Sq =  98.28%  R-Sq(adj) =   97.81%  

 Solution                                              

 Variable       Coeff        SE         T         P                  

 Intercept      50.42      0.26 192.84947  <0.00001                  

 X1             ¯0.72      0.02 ¯29.36231  <0.00001                  

 X2             ¯0.06      0.00 ¯13.09424  <0.00001                  

 XY              0.01      0.01   2.40391   0.02721                  

 Y1              0.11      0.02   4.42554   0.00033

 Y2             ¯0.04      0.00  ¯9.18643  <0.00001

───────────────────────────────────────────────────────────────────────────

Variance Stabilizing Transformations

Linear regression assumes that the variance is constant regardless of the size of the response variable. In some cases this is not true. To compensate, we can transform the response variable to make the variance constant. There are several ways to do this:

| Variance proportional to | Transformation | Left Operand to regress |
|---|---|---|
| Constant | y (none) | [none] |
| E(y) | √y | sqrt |
| E(y)(1−E(y)) | arcsin √y | arcsin∘sqrt |
| E(y)² | ln y | ln |
| E(y)³ | 1÷√y | ÷∘sqrt |
| E(y)⁴ | 1÷y | ÷ |
An electric utility is developing a model relating peak demand to total monthly energy consumption. Data for 53 residential customers were collected. The variance is proportional to the mean demand, so we use a square-root transform.

Demand←0.79 0.44 0.56 0.79 2.7 3.64 4.73 9.5 5.34 6.85 5.84 5.21 3.25 4.43 3.16 0.5 0.17 1.88 0.77 1.39 0.56 1.56 5.28 0.64 4 0.31 4.2 4.88 3.48 7.58 2.63 4.99 0.59 8.19 4.79 0.51 1.74 4.1 3.94 0.96 3.29 0.44 3.24 2.14 5.71 0.64 1.9 0.51 8.33 14.94 5.11 3.85 3.93   

Usage←679 292 1012 493 582 1156 997 2189 1097 2078 1818 1700 747 2030 1643 414 354 1276 745 435 540 874 1543 1029 710 1434 837 1748 1381 1428 1255 1777 370 2316 1130 463 770 724 808 790 783 406 1242 658 1746 468 1114 413 1787 3560 1495 2221 1526

    MODEL←Demand sqrt regress Usage    ⍝ Use square-root transform

    MODEL.B                ⍝ Intercept and Slope

0.5822 0.0009529

    MODEL.f 1800             ⍝ Least Squares Estimate of transformed demand.

2.2974

    MODEL.g 1800           ⍝ Estimate demand in KW from  1800 KWH usage

5.2779

    MODEL.g confInt 1800   ⍝ Average demand range

4.4799 6.1413

    MODEL.g predInt 1800   ⍝ Individual demand range

1.8181 10.539
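The mechanics of the square-root transform can be sketched in plain Python on synthetic data constructed so that √demand is exactly linear in usage; the back-transform squares the fitted value, as MODEL.g does above. The usage values are illustrative, not the utility data:

```python
# Square-root variance-stabilizing transform: regress sqrt(y) on x, then
# square the fitted value to estimate y.
import math

def simple_regress(y, x):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1

usage = [200, 400, 600, 800, 1000, 1200]          # illustrative values only
demand = [(0.5 + 0.001 * u) ** 2 for u in usage]  # sqrt(demand) = 0.5 + 0.001*usage

b0, b1 = simple_regress([math.sqrt(d) for d in demand], usage)
est = (b0 + b1 * 1800) ** 2   # back-transform by squaring, like MODEL.g
print(round(b0, 4), round(b1, 6), round(est, 4))  # → 0.5 0.001 5.29
```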

 

Multiplicative Regression

Multiplicative regression models the relationship Y = a×(X*b). This transforms both the predictor and response variables, since it can be rewritten as ln Y = (ln a) + b×(ln X). The ln operand to regress would only transform the response variable, so we use × as the operand to indicate multiplicative regression. The parameters returned are the coefficient a and the exponent b.

As an example, let X and Y be the predictor and response variables:

   X←14 14 8 10 6 7 5 10 5 13

   Y←864 870 83 176 37 50 8 164 26 584

   MODEL←Y ×regress X

   MODEL.B          ⍝ Coefficient A and Exponent B

0.029161 3.8539

   MODEL.g 9        ⍝ Estimate Y for X = 9

138.8  
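The log-log mechanics can be sketched in plain Python with synthetic power-law data (y = 2x³), so the fit recovers a = 2 and b = 3 exactly:

```python
# Multiplicative (power-law) regression via the log-log transform:
# fit ln y = ln a + b * ln x, then exponentiate to recover a.
import math

def simple_regress(y, x):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sum((u - mx) ** 2 for u in x)
    return my - b1 * mx, b1

xs = [1, 2, 3, 4, 5, 6]
ys = [2 * v ** 3 for v in xs]   # exact power law y = 2 * x**3

ln_a, b = simple_regress([math.log(v) for v in ys], [math.log(v) for v in xs])
a = math.exp(ln_a)
print(round(a, 6), round(b, 6))  # → 2.0 3.0
```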

Custom Designed Regression

The user may create a function which selects and/or transforms any of the variables in a database. This is particularly useful if there are multiple transformations. In order to do this, one must create a transform function. The variable named Y becomes the response variable; Int defaults to 1; all others are predictor variables:

   makeTransFn 'Y←Height' 'X1←ShoeSize' 'X2←Sex eq ''M''' 'X3←Weight'

   MODEL←transform regress #.SD

   report MODEL
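The idea of a transform step can be sketched in plain Python: each model variable is built by an arbitrary function of the database's columns, mirroring the makeTransFn specification above. The database contents and field values here are illustrative, not taken from the text:

```python
# A custom transform step: derive a response vector Y and predictor columns
# X1..X3 from a database (a dict of named columns).

db = {
    'Height':   [64, 66, 70, 72, 75],
    'ShoeSize': [7.5, 8, 10, 11, 12],
    'Sex':      ['F', 'F', 'M', 'M', 'M'],
    'Weight':   [120, 130, 155, 170, 185],
}

transforms = [
    ('Y',  lambda d: d['Height']),                               # response variable
    ('X1', lambda d: d['ShoeSize']),
    ('X2', lambda d: [1 if s == 'M' else 0 for s in d['Sex']]),  # Sex eq 'M'
    ('X3', lambda d: d['Weight']),
]

model_vars = {name: fn(db) for name, fn in transforms}
print(model_vars['X2'])  # → [0, 0, 1, 1, 1]
```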