Using SPSS for bivariate and multivariate regression

One of the most commonly used and powerful tools of contemporary social science is regression analysis.


As you learn to use this procedure and interpret its results, it is critically important to keep in mind that regression procedures rely on a number of basic assumptions about the data you are analyzing. Advanced statistics courses can show you how to manipulate procedures to deal with most violations of regression's basic assumptions. For our purposes (learning how to interpret regression results by seeing how these statistics are calculated using SPSS), you will want to keep in mind that the basic regression model will not produce accurate results unless the variables you are analyzing have a linear relationship with one another. 


Using SPSS for regression analysis.

We want to build a regression model with one or more variables predicting a linear change in a dependent variable. To do this, open the SPSS dataset you want to analyze. You will see a data matrix (spreadsheet) that lists your cases (in the rows) and your variables (in the columns).


Selecting from the several different menus at the top of the data matrix, go down through the hierarchical menus choosing:


When you select the "linear regression" function, SPSS will provide a wizard that looks like the one portrayed in Figure 1:


Figure 1


In Figure 1, the left-hand window lists variables by their variable label (rather than the eight-character variable name that you probably have in your codebook).


Select the variables you want to analyze, and use the arrow button to send them to the appropriate right-hand windows. The dependent variable (the variable whose variation you want to explain) in your model goes in the top right-hand window. Only one dependent variable can be analyzed at a time; if you are interested in running a similar model with different dependent variables, you will need to repeat the procedure for each dependent variable.


Next, one or more independent variables should be listed in the bottom right-hand window. You may select multiple variables at a time by holding down the "control" key on your keyboard as you click on various variables.


In Figure 2, I have used the wizard to identify the several variables in which I am interested. In the example, I am trying to see if a college student's cumulative GPA (0.00-4.00, measured continuously) can be predicted by the following three variables:

1.  Political Ideology (coded 1-7, at discrete intervals, with one being strongly conservative and seven being strongly liberal),
2.  Gender (a dichotomous variable where male respondents are coded '1' and female respondents are coded '0'),
3.  Whether or not a student's parents pay half or more of his or her tuition (also a dichotomous variable: yes='1'; no='0').

Figure 2



For our purposes, we will leave all of the options at their SPSS defaults. Clicking the OK button will produce the following tables in our SPSS output:


Figure 3


The "R Square" statistic in Figure 3 (.073) is generally interpreted to mean that:


"The three independent variables in the regression model account for 7.3 percent of the total variation in a given student's GPA." 

The higher the R-squared statistic, the better the model fits our data. In this case, we would say that the model "modestly" fits our data (in other words, the model is not all that good, which is not surprising because there are lots of other variables not in our model which influence an individual's GPA...not the least of which is how many hours a day he studies).


The "Adjusted R Square" statistic (.062 in Figure 3) is a modified R-squared statistic that takes into account how many variables are included in the model. Adding a variable to a regression model never lowers the R-squared statistic, which means that R-squared will creep upward even when essentially irrelevant variables are added. The Adjusted R-squared statistic is typically smaller than the R-squared statistic because it adjusts R-squared downward when additional variables of limited significance are added to a model. It is a common practice to say that one regression model "fits" the data better than another regression model if its Adjusted R-squared statistic is higher.
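The adjustment can be sketched in a few lines of Python. This is a minimal illustration of the standard adjusted R-squared formula, not SPSS output. The sample size is not reported in the text, so n = 257 is a hypothetical value, chosen only because it is consistent with the two statistics quoted from Figure 3.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Standard adjusted R-squared: penalizes R-squared for each predictor used."""
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)

# R Square from Figure 3, with three independent variables.
# n_cases = 257 is an ASSUMPTION; the sample size is not given in the text.
adj = adjusted_r_squared(r_squared=0.073, n_cases=257, n_predictors=3)
print(round(adj, 3))

# Same R-squared "spent" on ten predictors instead of three: a larger penalty.
print(round(adjusted_r_squared(0.073, 257, 10), 3))
```

With the same R-squared, a model that uses more predictors receives a lower adjusted value, which is why the adjusted statistic is preferred when comparing models of different sizes.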


Figure 4



The second table of importance in our output reports the "F-statistic" for the model. Usually, regression tables will report both this statistic and its significance, but the one that is most important is the significance statistic (.000 in Figure 4). The test of significance for the F-statistic asks how likely it is that a relationship this strong would appear in the sample purely by chance (due to random sampling error) if none of the independent variables in the model were actually correlated with the dependent variable. In Figure 4, we might interpret the F-test's significance statistic in the following way:

"The regression model's significance statistic for the F-test indicates that there is essentially no chance (less than one in 1,000) that the observed correlation between one or more of the independent variables and the dependent variable is due solely to random sampling error."

You should note that this significance statistic is of limited utility because of its assumptions (sometimes it is not even reported in published regression tables): most regression models will report a statistically significant F-statistic even if the fit of the regression model, as measured by the R-squared statistic, is very low.
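One way to see why a weak model can still pass the F-test is to compute the overall F-statistic directly from R-squared. The sketch below uses the standard formula relating the two; the sample sizes are hypothetical values chosen for illustration, since the text does not report the number of cases.

```python
def f_statistic(r_squared, n_cases, n_predictors):
    """Overall F-statistic for a regression, computed from R-squared."""
    explained = r_squared / n_predictors
    unexplained = (1 - r_squared) / (n_cases - n_predictors - 1)
    return explained / unexplained

# Same R-squared (.073) and three predictors at different HYPOTHETICAL sample sizes.
for n_cases in (50, 257, 1000):
    print(n_cases, round(f_statistic(0.073, n_cases, 3), 2))
```

The F-statistic grows with the number of cases even though R-squared stays fixed, so with a large enough sample a model that explains little of the variation can still be statistically significant.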


Figure 5


Figure 5 reports the partial regression coefficients [both unstandardized (B) and standardized (Beta)] for each independent variable in the model and tests of significance for each of these statistics. 


The unstandardized constant statistic (2.906 in Figure 5) shows what the model would predict if all of the independent variables were zero. Following the coding schemes noted above, in this case a woman (woman = 0; man = 1) whose parents do not pay half or more of her tuition (no = 0; yes = 1) and whose ideology score is zero would be expected to have a GPA of 2.906 on the 4.0 scale. Note that a score of zero lies one unit below the lowest value actually used on the seven-point ideology scale (1 = "very conservative"), so the constant describes a point just outside the range of the data.


The other unstandardized regression coefficients (listed in column B) suggest that the fact that a respondent's parents pay half or more of tuition has only a minimal effect on GPA: on average, these students' GPAs are .01 lower on the 4.0 scale, after controlling for the other variables. I have rounded in reporting the data: the actual unstandardized coefficient for this independent variable is -.0108. Note that with scientific notation, you need to shift the decimal point to the left by the number of places given in the exponent; in this case -1.08E-02 equals -.0108.
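If you want to check a conversion from scientific notation yourself, most programming languages will do it for you. A minimal Python sketch, using the coefficient quoted above:

```python
# SPSS prints very small coefficients in scientific notation, e.g. -1.08E-02.
# "E-02" means "times 10 to the power -2": move the decimal point two places left.
coefficient = float("-1.08E-02")

# Show the same number in ordinary fixed-point form with four decimals.
print(f"{coefficient:.4f}")
```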


The model shows somewhat stronger findings for the gender variable's contribution to explaining variations in grade point averages: males (gender = 1) had GPAs that were .09 lower, on average, than those of women, after controlling for the other variables.


Every one-unit increase in liberal political ideology (1=very conservative; 7=very liberal) was associated with a nearly .07 increase in GPA, after controlling for gender and parental payment of tuition. Thus, the model predicts, all things being equal, that very liberal students will have, on average, a GPA that is roughly .40 higher than that of very conservative students (in other words, 6 times .07, because the distance between the two ideological poles, 1 and 7, is six units on the scale described above).
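Putting the coefficients together gives the model's prediction equation. The sketch below is an illustration only: the constant and the tuition coefficient are the values quoted from Figure 5, but the gender and ideology coefficients are the rounded figures used in the prose (-.09 and .07), so the predictions are approximate.

```python
# Unstandardized coefficients as quoted in the text.
# B_MALE and B_IDEOLOGY are ROUNDED values from the prose, not exact SPSS output.
CONSTANT = 2.906
B_IDEOLOGY = 0.07       # per one-unit increase on the 1-7 ideology scale
B_MALE = -0.09          # male = 1, female = 0
B_PARENTS_PAY = -0.0108 # parents pay half or more of tuition: yes = 1, no = 0

def predict_gpa(ideology, male, parents_pay):
    """Predicted GPA: the constant plus each coefficient times its variable's value."""
    return (CONSTANT
            + B_IDEOLOGY * ideology
            + B_MALE * male
            + B_PARENTS_PAY * parents_pay)

# Two otherwise identical students at opposite ends of the ideology scale.
very_conservative = predict_gpa(ideology=1, male=0, parents_pay=0)
very_liberal = predict_gpa(ideology=7, male=0, parents_pay=0)

print(round(very_conservative, 3), round(very_liberal, 3))
# The endpoints 1 and 7 are six units apart, so the gap is six times the coefficient.
print(round(very_liberal - very_conservative, 2))
```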


The standardized coefficients listed in the "Beta" column of Figure 5 report the effects of each independent variable on the dependent variable  in standard deviations. 

The main benefit of these standardized measures is that they allow a direct comparison of strength among the model's three independent variables. Political ideology is by far the most important predictor of GPA, followed by gender. Whether or not a parent pays half or more of a student's tuition has a very limited effect on GPA (-.014 standard deviations).


Finally, we must examine each variable's significance statistic, which is reported in the far-right column of Figure 5. Given the small sample size, are these statistics reliable? For a partial regression coefficient, the statistical test examines the probability that the observed relationship between a given independent variable and the dependent variable is the product of sampling error. Specifically, it tests the chance that, in the larger population from which the sample for this study was drawn, an increase in the independent variable would be associated with either no change in the dependent variable or a change in the opposite direction of that indicated by the regression coefficient's sign.


Is the partial correlation between the various variables and GPA possibly due to chance, that is, random sampling error? The answer is clearly no for political ideology and yes for parents' paying half or more of tuition. In the case of students whose parents pay for tuition, over 80 percent of the time another sample could be expected to show either no relationship or a positive relationship with GPA (in our table, the correlation is negative).


The significance statistic for the gender variable indicates that we can only be 94 percent certain that being a woman would partially correlate with a higher GPA if we were examining a whole population rather than a sample drawn from that population. Because our sample is small, we might decide that being 94 percent certain is good enough, but this is a judgment call that cannot be answered by statistics. With a larger sample, we would want to be at least 95 percent certain that our result was not due to chance, and might well decide not to accept any result in which we were not 99 percent certain.
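The decision rule described above can be written out as a short Python sketch. It follows this handout's convention of reading one minus the significance statistic as a level of certainty. The gender significance value of .06 is an assumption inferred from the "94 percent" figure in the text; the exact value in Figure 5 is not reproduced here.

```python
def certainty_percent(significance):
    """Level of certainty implied by a significance statistic, as a percentage."""
    return round((1 - significance) * 100)

def is_significant(significance, alpha=0.05):
    """True if the significance statistic falls below the chosen cutoff (alpha)."""
    return significance < alpha

gender_sig = 0.06  # ASSUMED from "94 percent certain"; not read from Figure 5

print(certainty_percent(gender_sig))             # level of certainty, in percent
print(is_significant(gender_sig, alpha=0.05))    # the conventional 95 percent standard
print(is_significant(gender_sig, alpha=0.10))    # a looser 90 percent standard
print(is_significant(gender_sig, alpha=0.01))    # the stricter 99 percent standard
```

The choice of cutoff is the judgment call: the same result passes a 90 percent standard and fails the 95 and 99 percent standards.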