The Value of Expressing Uncertainty

All models are wrong, but some are useful. – G.E.P. Box

Earlier this year, I caught up with a colleague at a national conference who had recently taken a new position in student affairs assessment. Before she took the position we had discussed what to expect, but by this point she had been in the role for a few months, so I was excited to hear how things were going. We got the chance to discuss some of the requests she had been receiving, the projects she was working on, and her early challenges and successes.

A lot of the things that came up were common: requests by colleagues to demonstrate or “prove” that something was working, and requests to measure the impact that a program was having on students. This discussion, though, naturally led to a broader conversation about our role as assessment professionals and the value we provide.

As I reflected on my own experiences, I realized that my assessment contributions rarely “proved” something or provided definitive evidence in support of a specific program or decision. Instead, some of the most valuable discussions I’ve had as an assessment professional have focused on the notion of uncertainty. Some of the value came in helping my colleagues contextualize the uncertainty of their assessment results. Some of the value came in positing alternative explanations and subjecting those explanations to empirical scrutiny. Some of the value came in helping to ensure that any “data-driven” decisions are made with a clear understanding of how certain we are about the conclusions we are drawing from the data.

Providing context and clearly articulating the level of certainty we have in our results is one of the most important tasks we have as assessment professionals. Articulating our level of certainty, though, requires striking a delicate balance. On the one hand, we want to be clear about the limitations of any analysis that we conduct so that our colleagues do not draw any invalid or suspect conclusions. On the other hand, we do not want to place so much emphasis on uncertainty that our results are dismissed or ignored. So how do we go about striking that balance? How do we discuss uncertainty in a way that actually adds to the analysis? Furthermore, how do we go about increasing the certainty we have in our results?

How to Express Uncertainty

One of the biggest challenges in articulating the level of certainty we have in our results is that uncertainty can be difficult to talk about. The danger is that this difficulty can encourage us to take shortcuts. For example, I’ve had countless conversations with colleagues about different models and assessment results where one of the first questions I’m asked is whether a given finding is statistically significant. Regardless of my response, I rarely get a follow-up question.

In some ways this is a good thing. Not every conversation about assessment results should be paired with long discussions of model selection and specification, a diatribe about model diagnostics, and a laundry list of caveats. It does, however, highlight the fact that, for some, the question about statistical significance serves as a proxy for “true or false” or “impact or no impact.”

As assessment professionals, we have a responsibility to find creative ways to help our colleagues understand the relative level of certainty we have in a model or a given finding. How accurately does our model make predictions in the aggregate? Do we visualize the results using confidence intervals and out-of-sample predictions so our colleagues have a reasonable understanding of the substantive impact of a given variable and how wide a range that impact might cover? Do we convey how stable the results are and the limitations of the design in clear and understandable ways?

Naturally, this can be challenging, but when results are used to make decisions it’s essential that we quantify and convey our level of uncertainty in a clear way. Just because a result is statistically significant based on one test or model specification does not mean there’s no longer room for skepticism. On the other hand, just because two outcomes are both possible does not mean that they are equally probable.
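
One way to move the conversation past “significant or not” is to report the estimated difference together with an interval around it. The sketch below is only an illustration: the GPA values are made up, the helper name mean_difference_ci is my own, and the interval uses a simple normal approximation (with samples this small, a t-based interval would be somewhat wider).

```python
import math
import statistics


def mean_difference_ci(treated, comparison, z=1.96):
    """Return (difference, lower, upper) for a difference in means.

    Uses a normal approximation with unpooled sample variances.
    z=1.96 corresponds to an approximate 95% interval.
    """
    diff = statistics.mean(treated) - statistics.mean(comparison)
    standard_error = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(comparison) / len(comparison)
    )
    return diff, diff - z * standard_error, diff + z * standard_error


# Hypothetical end-of-semester GPAs (invented for illustration only).
participants = [3.1, 3.4, 2.9, 3.6, 3.2, 3.0, 3.5, 3.3]
non_participants = [3.0, 3.2, 2.8, 3.3, 3.1, 2.9, 3.4, 2.7]

diff, lower, upper = mean_difference_ci(participants, non_participants)
print(f"Estimated difference: {diff:.2f} GPA points "
      f"(approx. 95% interval: {lower:.2f} to {upper:.2f})")
```

Reporting the full interval lets a colleague see both the size of the estimated difference and how wide the plausible range is. In this made-up example the interval includes zero, so the data are consistent with no difference at all as well as with a fairly large one.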

Self-Selection and Uncertainty: Articulating (and Testing) Alternative Explanations

When we focus our assessment work on demonstrating or “proving” that something has an impact on our students, we run the risk of failing to examine alternative explanations for our results. In many instances it may be the case that students who participate in an academic support program have better outcomes than students who do not participate in the program. The students who participate may have higher grades at the end of the semester and they may persist at a higher rate the following year. It may even be the case that the difference between the two groups of students is statistically significant when we subject the results to a simple hypothesis test.

An entirely different level of evidence would be required, though, to reach the conclusion that participation in the program caused an improvement in outcomes. Is it possible that students who had higher grades and standardized test scores in the past were more likely to self-select into participating in the program? Is it possible that if we controlled for additional variables in a more robust model, the results would no longer be statistically significant? Without asking these questions and subjecting these alternatives to empirical scrutiny, our level of uncertainty (and our level of skepticism) should be high. If our results remain consistent and stable across different model specifications, we can be more confident in our results.
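
To make the self-selection concern concrete, here is a small simulation sketch. Everything in it is an assumption made for illustration: the students are simulated, stronger prior achievement makes a student more likely to opt in, and the program is built to have no effect at all. The naive comparison still shows a sizable gap, which largely disappears once participants are compared with non-participants who have similar prior achievement.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is repeatable


def simulate_student():
    """Return (prior_achievement, participates, outcome) for one simulated student."""
    prior = random.gauss(0, 1)  # standardized prior achievement
    # Assumption: stronger students are more likely to opt in.
    join_probability = 0.8 if prior > 0 else 0.2
    participates = random.random() < join_probability
    # Assumption: the program has NO effect; outcome depends only on prior + noise.
    outcome = prior + random.gauss(0, 1)
    return prior, participates, outcome


students = [simulate_student() for _ in range(20000)]


def gap(rows):
    """Mean outcome of participants minus mean outcome of non-participants."""
    treated = [outcome for _, joined, outcome in rows if joined]
    untreated = [outcome for _, joined, outcome in rows if not joined]
    return statistics.mean(treated) - statistics.mean(untreated)


# Naive comparison that ignores who chose to participate.
naive_gap = gap(students)

# Crude adjustment: compare within strata of prior achievement, then average.
high_prior = [s for s in students if s[0] > 0]
low_prior = [s for s in students if s[0] <= 0]
adjusted_gap = (gap(high_prior) + gap(low_prior)) / 2

print(f"Naive gap:    {naive_gap:.2f}")
print(f"Adjusted gap: {adjusted_gap:.2f}")
```

Splitting into two strata is deliberately crude. In practice a regression model with several control variables would play the same role, and even then unobserved differences such as motivation could remain.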

Universal Treatment Group: Uncertainty by Design

In addition to examining alternative explanations, discussing uncertainty also requires having a clear understanding of the strengths and weaknesses of different research designs. In many instances, we launch large programs or initiatives that impact nearly every student and then try to assess their effectiveness after the fact. Because we lack a meaningful comparison group, though, we resort to looking for a difference in results between a pre-treatment and post-treatment measurement.

Even if the difference between the pre-treatment and post-treatment results is large in magnitude, we have reason to question whether the intervention had a causal impact. Was there another intervention that occurred between the two measurements that could have caused the change? Are we capturing a cycle, a larger trend, or a natural regression to the mean in the data that we can’t observe with only two points of measurement? Is our outcome of interest measured in a way that is fine-grained enough to capture meaningful differences?
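
Regression to the mean is the easiest of these to underestimate, so here is a brief simulation sketch under stated assumptions: each simulated student has a stable underlying ability, every measurement adds independent noise, and nothing at all happens between the pre and post measurements. Students flagged for a low pre-score still appear to improve.

```python
import random
import statistics

random.seed(2024)  # fixed seed so the illustration is repeatable


def noisy_score(ability):
    """One measurement: stable ability plus independent measurement noise."""
    return ability + random.gauss(0, 1)


# Simulated students with a stable underlying ability (standardized units).
abilities = [random.gauss(0, 1) for _ in range(10000)]

# Two measurements per student as (pre, post). No intervention occurs in between.
scores = [(noisy_score(a), noisy_score(a)) for a in abilities]

# Flag roughly the bottom 20% of students based on their pre-score.
cutoff = sorted(pre for pre, _ in scores)[len(scores) // 5]
flagged = [(pre, post) for pre, post in scores if pre < cutoff]


def mean_change(pairs):
    """Average post-minus-pre change across a list of (pre, post) pairs."""
    return statistics.mean([post - pre for pre, post in pairs])


overall_change = mean_change(scores)
flagged_change = mean_change(flagged)

print(f"Average change, all students:     {overall_change:.2f}")
print(f"Average change, flagged students: {flagged_change:.2f}")
```

The flagged group’s apparent gain comes entirely from selecting students on a noisy low score. With only a pre and post measurement and no comparison group, that gain would be easy to mistake for a program effect.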

Of course, there are alternative research designs that reduce some of this uncertainty and give us greater confidence that we have captured a causal relationship.  A randomized field experiment where students are assigned to treatment and control groups provides us with additional control, limits the number of competing explanations, and increases internal validity.  Naturally, this type of design presents some of its own challenges.

Some of these challenges are technical.  Have we ensured that randomized groups are balanced demographically and not unbalanced simply by chance?  Have we avoided the spillover effect and ensured that only those in the treatment group, and no one in the control group, received the treatment?
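
The balance question can be checked directly once groups are assigned. The sketch below is a minimal illustration with simulated data: the baseline GPA values are invented, the function name standardized_mean_difference is my own, and the threshold mentioned afterward is a common rule of thumb rather than a hard rule.

```python
import random
import statistics

random.seed(7)  # fixed seed so the illustration is repeatable

# Hypothetical baseline covariate: high school GPA for 2,000 simulated students,
# clipped to the 0.0 to 4.0 range.
baseline_gpa = [min(4.0, max(0.0, random.gauss(3.2, 0.4))) for _ in range(2000)]

# Randomly assign half of the students to the treatment group.
indices = list(range(len(baseline_gpa)))
random.shuffle(indices)
treatment_ids = set(indices[:1000])

treatment = [gpa for i, gpa in enumerate(baseline_gpa) if i in treatment_ids]
control = [gpa for i, gpa in enumerate(baseline_gpa) if i not in treatment_ids]


def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


smd = standardized_mean_difference(treatment, control)
print(f"Standardized mean difference in baseline GPA: {smd:.3f}")
```

A common rule of thumb treats absolute values above roughly 0.1 as worth a closer look. The same check could be repeated for each baseline characteristic we care about, such as demographics or prior test scores.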

Other challenges to this approach are practical or based on ethical considerations.  While these considerations are essential, we also need to avoid simply assuming that our new initiatives and programs will have a positive impact.  Does adding a curricular requirement to a student’s residential experience or requiring additional check-in meetings with their advisor improve outcomes or add an additional burden to an overworked student population?

These are empirical questions and ones that we have an ethical obligation to answer with a reasonable degree of certainty. The right design approach can add to our level of confidence. It can also allow us to compare programs to each other to see which are the most cost-effective and which have the greatest impact overall.

When reporting on assessment results, it’s important to craft a narrative that is easily understood and provides a clear course of action. With that said, it is also essential that we help our colleagues understand how much emphasis they should place on our conclusions. Some findings are more robust and provide us with a greater degree of certainty. As a result, we have to be able to demonstrate that there is a middle ground between complete belief in a quantitative analysis and complete skepticism.

So what have you found on your campus?  How have you been able to strike this balance?

Eric Walsh, University at Buffalo

Thank you for sharing this timely posting! We are currently really focused on demonstrating our impact on student success and increasing retention. You pulled together so well some of the conversations that we are having on campus currently regarding program effectiveness. I am hoping that we can work to better embrace the uncertainty!


Thank you, Eric, for this post. Your comments about uncertainty, statistical significance, and proof are bringing me joy. Thanks for that!

When I hear administrators or programming staff talk about making data-driven decisions, I cringe (usually in an exaggerated if not dramatic way to help emphasize my forthcoming point). I urge them to make data-informed decisions where data is considered in the context of assessment design, theory or programming model, professional ethics, cost considerations, revenues, campus policies, and other applicable rules, regulations, and laws. Balancing these (and other) considerations can be tricky, and each decision may require a different decisional balance. Learning to keep one's balance on this balance beam of decision-making means continuously monitoring the relevant factors. Explicating them for each decision may be useful at first. After a while, that is, after much practice, we gain some efficiency in the task of decision-making, so we might not need to explicate each relevant factor.

Like Gavin, when I talk about or am asked about statistical significance, I remind my audience that it is an estimate of how likely it is that the observed difference is due to measurement error. It is not "proof" that the program or service worked as intended. Rather, it is permission for us then to consider effect size to decide whether or not the effect is big enough in a practical sense for us to do anything. I also take the opportunity to remind them that statistical significance strongly relates to sample size, so it is important to choose one's critical p value carefully. Is .05 the right significance level? Or should we use something more conservative like .01 (or .001)? One-tailed or two-?

Finally, to the point of "proof," my playful side comes out. After reminding folks that the hypothetico-deductive model of science never proves anything, I tell them not to lose heart. They can still find proof in one of three places: mathematics, a court of law, or the liquor store.

Thanks so much for this post. With a 99% confidence level and +/- 1% confidence interval, I am certain that I appreciate you. :-)


Oops! I meant to say "Like Gavin and Darby..." in the paragraph about statistical significance. Sorry, Darby!


"They can still find proof in one of three places: mathematics, a court of law, or the liquor store."

:) I may have to borrow this phrase!


Judd, thank you so much for the kind comment and for sharing your experiences! I think that your point that each decision may require a different balance is very insightful. The decisions we are faced with do not follow a simple formula of maximizing profit or cutting costs. We have various institutional goals and we need to balance between them. Spending resources in one area naturally means we have fewer resources to devote to our other priorities.

I agree with Sara, I love the phrase about proof! I also share your aversion to the term “data-driven” decision making. Whenever I see it in a presentation I think it should be accompanied by a picture of Thelma and Louise. Thanks again for the great post!


Hi Michael, that book is my next read, so I look forward to learning from it and applying the concepts.


Well said, Eric. I caution staff against drawing conclusions about causation/impact and proof when looking at most assessment results. If you have students who didn't meet your expected outcome, have they disproved your outcome/impact? That doesn't mean you should cancel the program; it could be that there were other factors at play, there is some benefit for most people, or something else.

College students don't live in a vacuum, so saying that one voluntary, brief program has an impact on retention or time to graduation or GPA worries me. We are rarely looking at inputs or other experiences that could account for some of the difference. It is a challenge to get a lot of programs, services, and employment together to track students across their college experience to see what has "impact" or not.

I think this is a topic we (student affairs assessment professionals) need to talk about and examine as the field matures.


Thank you for commenting, Darby! I agree with you, it is concerning when we try to conclude that a single voluntary program has an impact on student retention, academic performance, etc. When we come across these claims, we should approach them with skepticism and try to take a more holistic approach to understand the broader context. I think you’re right, these conversations will become increasingly important as the field continues to grow. As we progress as a field, we may need to place less emphasis on simply gathering evidence and more emphasis on understanding which designs and techniques provide us with the greatest level of certainty and the most reliable insights.

Thanks again for the post!


I read a book recently that was premised on what I found to be a very freeing approach to assessment. The book, How to Measure Anything: Finding the Value of Intangibles in Business by DW Hubbard, defines measurement as "A quantitatively expressed reduction of uncertainty based on one or more observations."

In this framework, the measurement goal we are aiming for is not "Big T" truth, or even statistical significance. What all assessment should be doing is reducing the uncertainty that faculty and staff have in making decisions about the experiences they are providing to students.

In the applied settings that higher education institutions invariably are, there will always be an element of uncertainty. Acknowledging uncertainty and using our observations of students, their work, and their behaviors to reduce it is the very definition of measurement. Disentangling ourselves and colleagues from the need for perfect knowledge can do a lot to make assessment useful and meaningful.


Thanks, Michael! I am adding this book to my reading list.


Eric, I really appreciated your comments regarding uncertainty and the issues we need to consider when we implement assessment. There is one comment that you made that I continue to emphasize when I talk about assessment and research. That issue is what statistical significance really means. As you note, for many, statistical significance means that something is true or false or there is an impact or no impact. What I remind my colleagues and students is that all statistical significance means is the probability that the results found from a sample would be found in a population if the same test were run. What statistical significance doesn't note is the magnitude of the finding. That is why it's important to run an effect size analysis as well. A statistically significant finding may not be practically significant.

Excellent post reminding all of us regarding the importance of considering uncertainty in our work.


Gavin, thank you so much for the kind and thoughtful comment. I could not agree with you more on the importance of understanding the substantive impact of a given finding rather than simply looking at statistical significance. I’m reminded of the article, “Sinning in the Basement: The Ten Commandments of Applied Econometrics” by Peter Kennedy. It is required reading in our early graduate level statistics courses and is based on the premise that while all of the proper assumptions of econometric theory are taught on the higher floors, all these rules are broken by applied researchers when they run their models in the basement (a call back to when all of the models had to be run in the basement). As a result, Kennedy attempts to outline some practical guidelines for applied researchers including, “thou shalt not confuse significance with substance.” We always placed additional emphasis on that one. Thanks again for commenting!