All models are wrong, but some are useful. – G.E.P. Box
Earlier this year, I caught up with a colleague at a national conference who had recently taken a new position in student affairs assessment. Before she took the position we had discussed what to expect, but by this point she had been in the role for a few months, so I was excited to hear how things were going. We discussed some of the requests she had been receiving, the projects she was working on, and her early challenges and successes.
Much of what came up was familiar: requests from colleagues to demonstrate or “prove” that something was working, and requests to measure the impact a program was having on students. This discussion, though, naturally led to a broader conversation about our role as assessment professionals and the value we provide.
As I reflected on my own experiences, I realized that my assessment contributions rarely “proved” something or provided definitive evidence in support of a specific program or decision. Instead, some of the most valuable discussions I’ve had as an assessment professional have focused on the notion of uncertainty. Some of the value came in helping my colleagues contextualize the uncertainty of their assessment results. Some of the value came in positing alternative explanations and subjecting those explanations to empirical scrutiny. Some of the value came in helping to ensure that any “data-driven” decisions were made with a clear understanding of how certain we were about the conclusions we were drawing from the data.
Providing context and clearly articulating the level of certainty we have in our results is one of the most important tasks we have as assessment professionals. Articulating our level of certainty, though, requires striking a delicate balance. On the one hand, we want to be clear about the limitations of any analysis that we conduct so that our colleagues do not draw any invalid or suspect conclusions. On the other hand, we do not want to place so much emphasis on uncertainty that our results are dismissed or ignored. So how do we go about striking that balance? How do we discuss uncertainty in a way that actually adds to the analysis? Furthermore, how do we go about increasing the certainty we have in our results?
How to Express Uncertainty
One of the biggest challenges in articulating the level of certainty we have in our results is that uncertainty can be difficult to talk about. The danger is that this difficulty can encourage us to take shortcuts. For example, I’ve had countless conversations with colleagues about different models and assessment results where one of the first questions I’m asked is whether a given finding is statistically significant. Regardless of my response, I rarely get a follow-up question.
In some ways this is a good thing. Not every conversation about assessment results should be paired with long discussions of model selection and specification, a diatribe about model diagnostics, and a laundry list of caveats. It does, however, highlight the fact that, for some, the question about statistical significance serves as a proxy for “true or false” or “impact or no impact.”
As assessment professionals, we have a responsibility to find creative ways to help our colleagues understand the relative level of certainty we have in a model or a given finding. How accurately does our model make predictions in the aggregate? Do we visualize the results using confidence intervals and out-of-sample predictions so our colleagues have a reasonable understanding of the substantive impact of a given variable and how wide a range that impact might span? Do we convey how stable the results are and the limitations of the design in clear and understandable ways?
Naturally, this can be challenging, but when results are used to make decisions it’s essential that we quantify and convey our level of uncertainty in a clear way. Just because a result is statistically significant based on one test or model specification does not mean there’s no longer room for skepticism. On the other hand, just because two outcomes are both possible does not mean that they are equally probable.
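One way to move a conversation past “significant or not” is to report an estimate together with a plausible range. The sketch below is only an illustration: the GPA values are invented, the function name `mean_difference_ci` is my own, and the interval uses a simple normal approximation that will be somewhat too narrow for groups this small (a t-based interval would be wider).

```python
import math
import statistics

def mean_difference_ci(treated, comparison, z=1.96):
    """Return (difference, lower, upper) for a difference in group means.

    Uses a normal approximation with unpooled standard errors;
    z=1.96 corresponds to roughly 95% coverage in large samples.
    """
    diff = statistics.mean(treated) - statistics.mean(comparison)
    se = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(comparison) / len(comparison)
    )
    return diff, diff - z * se, diff + z * se

# Hypothetical end-of-term GPAs, invented purely for illustration.
participants = [3.1, 3.4, 2.9, 3.6, 3.2, 3.0, 3.5, 3.3]
non_participants = [2.8, 3.2, 3.0, 2.7, 3.3, 2.9, 3.1, 2.6]

diff, low, high = mean_difference_ci(participants, non_participants)
print(f"Estimated difference: {diff:.2f} GPA points "
      f"(approx. 95% interval: {low:.2f} to {high:.2f})")
```

Sharing the interval alongside the point estimate gives colleagues a sense of both the likely size of a difference and how much it could reasonably vary.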
Self-Selection and Uncertainty: Articulating (and Testing) Alternative Explanations
When we focus our assessment work on demonstrating or “proving” that something has an impact on our students, we run the risk of failing to examine alternative explanations for our results. In many instances it may be the case that students who participate in an academic support program have better outcomes than students who do not. The students who participate may have higher grades at the end of the semester and they may persist at a higher rate the following year. It may even be the case that the difference between the two groups of students is statistically significant when we subject the results to a simple hypothesis test.
An entirely different level of evidence would be required, though, to reach the conclusion that participation in the program caused an improvement in outcomes. Is it possible that students who had higher grades and standardized test scores in the past were more likely to self-select into the program? Is it possible that the results would no longer be statistically significant if we controlled for additional variables in a more robust model? Without asking these questions and subjecting these alternatives to empirical scrutiny, our level of uncertainty (and our level of skepticism) should be high. If our results remain consistent and stable across different model specifications, we can be more confident in them.
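To make the self-selection concern concrete, here is a small simulation under assumptions I have chosen for illustration: students with higher prior GPAs are more likely to opt in, and the program itself has no effect at all. Every number and function name is hypothetical. Comparing participants with non-participants only within narrow bands of prior GPA is one simple way of “controlling for” that variable; a regression model would be a more typical choice in practice.

```python
import math
import random
import statistics

def simulate_students(n=20000, true_effect=0.0, seed=1):
    """Simulate (prior_gpa, participated, outcome_gpa) for n hypothetical students."""
    rng = random.Random(seed)
    students = []
    for _ in range(n):
        prior = rng.gauss(3.0, 0.5)
        # Self-selection: stronger prior performance makes opting in more likely.
        p_join = 1 / (1 + math.exp(-3 * (prior - 3.0)))
        joined = rng.random() < p_join
        # Outcome depends on prior GPA plus the program's true effect (zero here).
        outcome = prior + true_effect * joined + rng.gauss(0, 0.3)
        students.append((prior, joined, outcome))
    return students

def naive_difference(students):
    """Raw gap in mean outcomes between participants and non-participants."""
    joined = [y for _, j, y in students if j]
    others = [y for _, j, y in students if not j]
    return statistics.mean(joined) - statistics.mean(others)

def stratified_difference(students, width=0.1):
    """Average the gap within narrow bands of prior GPA, weighted by band size."""
    bands = {}
    for prior, j, y in students:
        key = math.floor(prior / width)
        bands.setdefault(key, {True: [], False: []})[j].append(y)
    total, weight = 0.0, 0
    for groups in bands.values():
        if groups[True] and groups[False]:  # skip bands with no comparison group
            size = len(groups[True]) + len(groups[False])
            gap = statistics.mean(groups[True]) - statistics.mean(groups[False])
            total += gap * size
            weight += size
    return total / weight

students = simulate_students()
print(f"Naive difference:      {naive_difference(students):.2f}")
print(f"Stratified difference: {stratified_difference(students):.2f}")
```

In this simulated world the naive comparison shows a sizable gap even though the program does nothing, while the within-band comparison shrinks toward zero. Real data will rarely be this tidy, but the exercise shows why a raw group difference alone should leave us skeptical.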
Universal Treatment Group: Uncertainty by Design
In addition to examining alternative explanations, discussing uncertainty also requires having a clear understanding of the strengths and weaknesses of different research designs. In many instances, we onboard large programs or initiatives that impact nearly every student and then try to assess their effectiveness after the fact. Since we lack a meaningful comparison group, though, we resort to looking for a difference in results between a pre-treatment and post-treatment measurement.
Even if the difference between the pre-treatment and post-treatment results is large in magnitude, we have reason to question whether the intervention had a causal impact. Was there another intervention that occurred between the two measurements that could have caused the change? Are we capturing a cycle, a larger trend, or a natural regression to the mean in the data that we can’t observe with only two points of measurement? Is our outcome of interest measured in a way that is fine-grained enough to capture meaningful differences?
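Regression to the mean is easy to underestimate, so a rough simulation may help. The assumptions here are entirely mine: a retention rate that fluctuates randomly around a stable 80 percent, and a program that is launched only after an unusually poor year and has no effect whatsoever. The function name and every parameter are hypothetical.

```python
import random
import statistics

def pre_post_changes(n_cohorts=5000, mean=80.0, sd=3.0, trigger=77.0, seed=7):
    """Simulate pre/post retention rates where a program launches after a bad year.

    Both years are drawn from the same distribution, so the program has no
    true effect; any average "improvement" is regression to the mean.
    """
    rng = random.Random(seed)
    changes = []
    for _ in range(n_cohorts):
        pre = rng.gauss(mean, sd)   # rate in the year before the launch decision
        post = rng.gauss(mean, sd)  # rate in the following year
        if pre < trigger:           # program launched in response to a poor year
            changes.append(post - pre)
    return changes

changes = pre_post_changes()
print(f"Simulated launches: {len(changes)}")
print(f"Average pre-to-post change: {statistics.mean(changes):+.1f} points")
```

Because the launches are triggered by unusually low values, the next measurement tends to drift back toward the long-run average on its own, and a two-point pre/post comparison cannot distinguish that drift from a program effect.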
Of course, there are alternative research designs that reduce some of this uncertainty and give us greater confidence that we have captured a causal relationship. A randomized field experiment where students are assigned to treatment and control groups provides us with additional control, limits the number of competing explanations, and increases internal validity. Naturally, this type of design presents some of its own challenges.
Some of these challenges are technical. Have we ensured that randomized groups are balanced demographically and not unbalanced simply by chance? Have we avoided the spillover effect and ensured that only those in the treatment group, and no one in the control group, received the treatment?
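A quick balance check after randomization can be sketched as follows. The student IDs and prior GPAs are simulated, the helper names are my own, and the 0.1 cutoff for the standardized mean difference is a common rule of thumb rather than a firm standard.

```python
import math
import random
import statistics

def randomize(student_ids, seed=None):
    """Shuffle the IDs and split them into treatment and control halves."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def standardized_mean_difference(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd

# Hypothetical prior GPAs keyed by student ID, simulated for illustration only.
rng = random.Random(42)
prior_gpa = {sid: rng.gauss(3.0, 0.5) for sid in range(200)}

treatment_ids, control_ids = randomize(prior_gpa, seed=2024)
smd = standardized_mean_difference(
    [prior_gpa[s] for s in treatment_ids],
    [prior_gpa[s] for s in control_ids],
)
print(f"Standardized mean difference in prior GPA: {smd:.2f}")
if abs(smd) > 0.1:  # rule-of-thumb threshold, an assumption of this sketch
    print("Groups differ noticeably on prior GPA; consider adjusting or re-checking.")
```

The same check can be repeated for each background characteristic we care about, which gives colleagues a simple, shareable summary of whether chance alone left the groups lopsided.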
Other challenges to this approach are practical or based on ethical considerations. While these considerations are essential, we also need to avoid simply assuming that our new initiatives and programs will have a positive impact. Does adding a curricular requirement to a student’s residential experience or requiring additional check-in meetings with their advisor improve outcomes or add an additional burden to an overworked student population?
These are empirical questions, and ones that we have an ethical obligation to answer with a reasonable degree of certainty. The right design approach can add to our level of confidence. It can also allow us to compare programs to each other to see which are the most cost-effective and which have the greatest impact overall.
When reporting out on assessment results, it’s important to craft a narrative that is easily understood and provides a clear course of action. With that said, it is also essential that we help our colleagues understand how much emphasis they should place on our conclusions. Some findings are more robust and provide us with a greater degree of certainty. As a result, we have to be able to demonstrate that there is a middle ground between complete belief in a quantitative analysis and complete skepticism.
So what have you found on your campus? How have you been able to strike this balance?
Eric Walsh, University at Buffalo