Research Methods in Political
Logistic Regression Using SPSS
One of the most commonly used and powerful tools of contemporary social science is regression analysis. Unfortunately, regular bivariate and OLS multiple regression do not work well for dichotomous dependent variables, which are outcomes that can take only one of two values:
Did you vote?
Do you give blood regularly?
Do you belong to a civic organization?
The procedure that we use if we want to isolate the effect of multiple independent variables on these types of outcomes is called logistic regression.
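To see why a different procedure is needed, a minimal sketch may help. It uses made-up coefficients (the names `intercept` and `slope` and their values are illustrative assumptions, not estimates from any dataset) and applies the same numbers to a straight line and to the logistic curve purely for comparison. The straight line can produce predictions below 0 or above 1, which make no sense as the probability of a yes/no outcome; the logistic curve always stays between 0 and 1.

```python
import math

def linear_prediction(x, intercept, slope):
    """Straight-line prediction, as in regular (OLS) regression."""
    return intercept + slope * x

def logistic_prediction(x, intercept, slope):
    """Logistic curve: turns the same linear score into a probability."""
    score = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients chosen only for illustration.
intercept, slope = -0.5, 0.4

for x in [-5, 0, 5, 10]:
    lin = linear_prediction(x, intercept, slope)
    log = logistic_prediction(x, intercept, slope)
    print(f"x={x:>3}  linear={lin:6.2f}  logistic={log:5.3f}")
```

In a real analysis the two procedures would estimate different coefficients; the point here is only the shape of the two prediction rules.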
As with regular regression, as you learn to use this statistical procedure and interpret its results, it is critically important to keep in mind that regression procedures rely on a number of basic assumptions about the data you are analyzing. Many of the basic assumptions for regular regression, such as independent observations and independent variables that are not too highly correlated with one another, also hold true for logistic regression. Advanced statistics courses can show you how to adjust procedures to deal with most violations of regression's basic assumptions, but a strong understanding of these adjustments typically requires at least one semester-long course beyond introductory statistics. The primary goal of this class is to give you hands-on experience using regression so that you will be able to understand and critique one of the most common techniques of data analysis used in the social sciences. If you want to be a user of this procedure rather than just a consumer, I cannot encourage you enough to take a statistics class or two down the road. Enough warnings; let's analyze:
Using SPSS for regression analysis
Let us assume that we want to build a logistic regression model with two or more independent variables and a dichotomous dependent variable (if you were looking at the relationship between a single variable and a dichotomous variable, you would use some form of bivariate analysis relying on contingency tables).
Step one. Begin your analysis by opening the SPSS dataset you want to analyze. If you don't have your data already entered into SPSS, you will want to read a chapter on how data is entered for analysis in SPSS. When you open an SPSS dataset, you will see a data matrix (spreadsheet) that lists your cases (in the rows) and your variables (in the columns). If you look at the two tabs at the bottom of this spreadsheet, you will see that you can toggle back and forth between the actual entered data and summary information/labels for your variables.
Starting at the top of the data matrix, go down through the hierarchical menus, selecting Analyze, then Regression, then Binary Logistic.
When you select the "binary logistic regression" function, SPSS will provide a wizard that looks like the one portrayed in Figure 1:
In Figure 1, the list that you see in the left-hand window lists variables by their variable label (rather than the eight-character variable names that you probably have listed in your codebook). SPSS limits variable names to eight characters, so either you or another coder has typically provided a longer, more detailed label for the variable, which is what you see in the window.
Step two. Using your mouse, select the variables you want to analyze, and then click on the arrow button to send your variables to the right-hand windows. The dichotomous dependent variable (the variable whose variation you want to explain) goes into the top right-hand window. Note that only one dependent variable can be analyzed at a time; if you are interested in running a similar model with different dependent variables (e.g., did they vote? did they write a politician? did they make political contributions?), you will need to repeat the entire procedure for each dependent variable (although SPSS remembers your variables from one procedure to the next).
Step three. Next, one or more independent variables should be moved to the bottom right-hand window. These are the independent variables that will be used in your model. You may select multiple variables at a time by holding down the "control" button on your keyboard as you click on various variables. Unfortunately, the logistic regression module of SPSS differs from its module for regular regression in that it lists independent variables by their eight (or fewer) character variable names rather than by their variable labels. This is unfortunate because it often means that you have to have your codebook handy to remind you which variable is which.
In Figure 2, I have used the wizard to identify the several variables in which I am interested in an analysis of a dataset of college student political behaviors and attitudes. In the example, I am trying to see if whether or not a student reported voting (coded 1="yes"; 0="no") can be predicted by the following four variables:
1. Gender (Variable "q1," a dichotomous variable where male respondents are coded "1" and women="0")
2. Political ideology (Variable "q16," which is coded in discrete, one-unit intervals where 1="very conservative" and 7="very liberal").
3. Cumulative GPA (Variable "q9," coded as a continuous variable that can range from 0.0-4.0).
4. Follow politics closely (Variable "q17L," which is coded in discrete, one-unit intervals ranging from 1-5. Students' answers to the statement, "I follow politics closely," in this survey ranged from 1 ("very much disagree") to 5 ("very much agree")).
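Before running the model, it can be worth confirming that every case is coded within the ranges listed above. The sketch below is one hedged way to do that outside SPSS. The variable names q1, q16, q9, and q17L come from the list; the name `voted` for the dependent variable and the example respondent are hypothetical, since the handout does not give the dependent variable's actual name.

```python
# Codebook ranges taken from the list above; "voted" is a hypothetical
# name for the dependent variable, whose real name is not given.
CODEBOOK = {
    "voted": (0, 1),      # 1 = yes, 0 = no
    "q1":    (0, 1),      # gender: 1 = male, 0 = female
    "q16":   (1, 7),      # ideology: 1 = very conservative, 7 = very liberal
    "q9":    (0.0, 4.0),  # cumulative GPA
    "q17L":  (1, 5),      # follows politics: 1 = very much disagree, 5 = very much agree
}

def out_of_range(record):
    """Return the names of variables that are missing or outside the codebook range."""
    problems = []
    for name, (low, high) in CODEBOOK.items():
        value = record.get(name)
        if value is None or not (low <= value <= high):
            problems.append(name)
    return problems

# A made-up respondent, for illustration only.
student = {"voted": 1, "q1": 0, "q16": 5, "q9": 3.4, "q17L": 4}
print(out_of_range(student))
```

An empty list means the case is coded consistently with the codebook; any names returned point to values worth checking before they distort the results.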
Step four. Leave all of the other options listed at the bottom of the logistic regression wizard at their SPSS defaults. Clicking the OK button will produce an output file (a separate window) with our results.
Step five. Now it is time to analyze the data output. Warning: the results in the SPSS output window will have many tables and will be confusing at first! While you would learn much more about how to interpret this data in a second-year statistics course, for right now we are interested only in the following two tables in our SPSS output window (pay close attention to the table labels listed below because many of the tables in the output window will look quite similar; the ones you are looking for are listed towards the very end of the output):
Step 6: Assessing whether or not we have a good model to explain variation in the dependent variable. Although the logic and method of calculation used in logistic regression differ from those used for regular regression, SPSS provides two "pseudo R-squared" statistics (this is the term we use when we report these data) that can be interpreted in a way that is similar to R-squared in multiple regression. The main difference between the Cox and Snell measure and the Nagelkerke measure is that the former tends to produce more conservative (that is, lower) pseudo R-squared values than the latter.
In political science, most researchers use the more conservative Cox and Snell pseudo R-squared statistic. The Cox and Snell pseudo R-squared statistic reported in Figure 3 is generally interpreted to mean:
"the four independent variables in the logistic model together account for 15.7 percent of the explanation for why a student votes or not."
Generally speaking, the higher the pseudo R-squared statistic, the better the model fits our data. In this case, we would probably say that the model we have built "moderately" fits our data (in other words, although the model accounts for a significant amount of the variation in whether or not a student votes, there are also lots of other variables not in our model which influence this decision).
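For readers curious about where the two pseudo R-squared statistics come from, the sketch below uses the standard textbook formulas, which work from the "-2 log likelihood" of a model with no independent variables and of the fitted model. The function names and the input numbers are assumptions made up for illustration; they are not the values behind Figure 3, and SPSS's own computation may differ in details such as rounding.

```python
import math

def cox_snell_r2(minus2ll_null, minus2ll_model, n):
    """Cox and Snell pseudo R-squared from the two -2 log likelihood values."""
    return 1.0 - math.exp(-(minus2ll_null - minus2ll_model) / n)

def nagelkerke_r2(minus2ll_null, minus2ll_model, n):
    """Nagelkerke pseudo R-squared: Cox and Snell rescaled so its maximum is 1."""
    max_cox_snell = 1.0 - math.exp(-minus2ll_null / n)
    return cox_snell_r2(minus2ll_null, minus2ll_model, n) / max_cox_snell

# Hypothetical values for illustration only (not the numbers behind Figure 3).
n = 200                 # number of cases in the model
minus2ll_null = 250.0   # -2 log likelihood with no independent variables
minus2ll_model = 215.0  # -2 log likelihood of the fitted model

print(f"Cox and Snell: {cox_snell_r2(minus2ll_null, minus2ll_model, n):.3f}")
print(f"Nagelkerke:    {nagelkerke_r2(minus2ll_null, minus2ll_model, n):.3f}")
```

Because the Nagelkerke measure divides the Cox and Snell value by a number smaller than 1, it is always the larger of the two, which is why the Cox and Snell statistic is described as the more conservative choice.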
You should be aware that there is much debate among scholars over which statistics should be reported when using logistic regression, and many articles and books using this technique will employ other measures to assess how well a given logistic regression model "fits" (that is, whether it includes the correct independent variables and only those variables). Nevertheless, SPSS automatically calculates the Cox and Snell pseudo R-squared statistic because it is both widely reported and fairly straightforward to understand and explain. It closely resembles the much more universally accepted R-squared statistic that we use to assess model fit when using OLS multiple regression.
Step 7: Interpreting how much each independent variable contributes to variations in the dependent variable when controlling for other variables. Figure 4 reports the partial logistic regression coefficients for each independent variable in the model in the column marked "B." PLEASE NOTE: THESE COEFFICIENTS DO NOT HAVE THE SAME MEANING IN LOGISTIC REGRESSION THAT THEY HAVE IN REGULAR BIVARIATE AND MULTIPLE REGRESSION!!! By themselves, these coefficients do not have a meaning that is easily explained or understood except by experts.
To assess the isolated impact of each independent variable, we instead want to look at what are called the "odds ratios," which are listed in Figure 4 in the column titled Exp(B).
The easiest way to explain how to interpret an odds ratio is to use an example from the table. Recall that Q1 is the variable name for the gender variable (male=1; female=0). What the logistic regression results in Figure 4's Exp(B) column say is:
"Controlling for differences in political ideology, GPA, and the extent to which one agrees with the statement 'I follow politics closely,' being a male student multiplies the odds of voting by 1.46."
The ideology variable is coded 1-7, where each one-unit increase means that a student self-identified as being increasingly "liberal." Interpreting the odds ratio for Q16 (the variable name for the ideology variable) indicates that:
"For every one-unit increase in a student's liberalism (as measured by a 7-unit index), the odds of voting decreased slightly (they were multiplied by 0.96, a drop of about 4 percent), after controlling for the other factors in the model."
The other two independent variables might be interpreted in the following manner:
"For each full grade increase in cumulative GPA (one full point on the four-point grading scale), the odds of a student voting were nearly four times as high, controlling for all factors included in the model. In interpreting this figure, it is important to keep in mind that the great majority of students in this sample have grade point averages that range between 3.0 and 3.9."
"Students who report that they follow politics regularly were much more likely to vote. Students were coded on a five-point discrete scale ranging from strongly disagree (1) to strongly agree (5) with the statement 'I follow politics closely.' In the logistic regression model, every one-unit shift towards the 'strongly agree' category multiplied the odds of voting by 1.87."
As with the pseudo R-squared statistic, there is some debate over how partial logistic regression coefficients should be interpreted, which means that you may read logistic regression tables where other measures are used. Unfortunately, not all social scientists using logistic regression will report odds ratios. SPSS reports this statistic because it is a widely used and easily understood measure of how each independent variable influences the value a dichotomous variable will take, controlling for the other independent variables in the model.
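If you want to check your reading of the Exp(B) column, the arithmetic can be sketched in a few lines. The three odds ratios below are the ones quoted above for gender, ideology, and following politics; the 50 percent baseline chance of voting is a hypothetical starting point chosen only for illustration. The sketch also shows that an odds ratio works on the odds, so a ratio of 1.46 does not multiply the probability of voting by 1.46.

```python
import math

# Exp(B) values quoted in the text for three of the variables.
odds_ratios = {
    "q1 (male)": 1.46,
    "q16 (ideology)": 0.96,
    "q17L (follows politics)": 1.87,
}

def percent_change_in_odds(odds_ratio):
    """Percent change in the odds for a one-unit increase in the variable."""
    return (odds_ratio - 1.0) * 100.0

def new_probability(baseline_probability, odds_ratio):
    """Apply an odds ratio to a baseline probability and return the new probability."""
    odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)

for name, ratio in odds_ratios.items():
    b = math.log(ratio)  # the "B" coefficient is the natural log of Exp(B)
    change = percent_change_in_odds(ratio)
    print(f"{name}: B={b:.3f}, odds change={change:+.0f}%")

# Hypothetical baseline: a student with a 50 percent chance of voting.
print(f"{new_probability(0.50, 1.46):.3f}")
```

How much the probability moves depends on the starting point, which is one reason odds ratios are reported instead of a single "change in probability" figure.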
Step 8: Verifying that independent variables are statistically significant explanations for variations in the dependent variable. Finally, we must look at Figure 4 one more time to examine each independent variable's significance statistic. The odds ratio statistics we have just analyzed are based on a fairly small sample (the first table in the SPSS output for this regression model, not shown above, indicates that fewer than 200 students were included in the sample). We need to figure out whether or not these statistics that show an increase or decrease in the likelihood of a given student voting are reliable. In other words, we want to know if the odds ratios in Figure 4 that show our independent variables affecting the likelihood of voting are possibly due to random chance because these statistics were generated from a sample rather than a survey of the entire population.
Could the relationships we see in Figure 4's odds ratios (as listed in the Exp(B) column) be due solely to chance? The significance statistics in Figure 4 show that the answer to this question is clearly yes for both the gender (Q1) and political ideology (Q16) variables. We cannot say with a high degree of confidence (better than 95 percent certainty) that the relationships we found between these variables and a change in the likelihood of voting in our sample would hold true in the population as a whole, because there is more than a .05 chance that our observed relationship between voting and these two variables is due to random sampling error.
For GPA (Q9), the significance statistic indicates that the probability that a full grade increase in GPA actually corresponds to no increase (or a decrease) in the likelihood of a student voting is essentially nonexistent. Our odds-ratio is statistically significant because there is only one chance in a thousand that we have observed a relationship in our sample that would not be found in the larger population from which our sample was drawn.
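The significance statistic itself can be reproduced by hand from two other columns SPSS prints next to it: the coefficient (B) and its standard error (S.E.). The sketch below assumes the significance comes from a Wald test with one degree of freedom, which is what the Wald column in the SPSS table reports. The coefficient and standard error used in the example are hypothetical and are not taken from Figure 4.

```python
import math

def wald_test(b, standard_error):
    """Return the Wald statistic and its p-value (chi-square, 1 df)."""
    z = b / standard_error
    wald = z ** 2
    # For 1 degree of freedom, the chi-square tail probability equals
    # the two-sided tail probability of a standard normal z score.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return wald, p_value

# Hypothetical coefficient and standard error, not taken from Figure 4.
wald, p = wald_test(b=0.38, standard_error=0.33)
print(f"Wald={wald:.2f}, Sig.={p:.3f}, significant at .05: {p < 0.05}")
```

A variable whose p-value falls below .05 would be treated as statistically significant by the standard used in this handout; a larger value means the relationship could plausibly be due to chance.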