
Caution: Warnings Research Is Unreliable

Marc Green


I have already discussed the difficulty of going from research to the real world and shown the problems that arise when investigating road accidents. Here, I show why most of the research performed on warnings cannot be taken at face value and often says little about the behavior of real people in the real world.

The problems with warnings are highlighted by a typical court case. A common dispute occurs when the user of a product or environment is injured by an action that a warning prohibits. Those defending the matter then point to the warning and say that the injury is the user's fault: he should have read and complied with the warning. The other side will say that the fault lay with a defective and inadequate warning.

To decide such issues, the parties may turn to experts who give opinions on warning adequacy. Frequently, an expert will claim that a warning of a different design, color, wording, etc., would have prevented the accident. The expert explains that his opinion should carry great weight because it is based on "scientific research".

Most of the research touting warning effectiveness arises from controlled, experimental studies. The problem is that the controlled, scientific research world is not the real world. Observational studies of real people in the real world, on the contrary, find that warnings frequently, some would say usually, fail to alter behavior unless the hazard is open and obvious - in which case the warning is superfluous anyway.

Why the discrepancy? The answer is methodology. This article examines the experimental warning research literature and comments on its relevance for predicting warning effectiveness in the real world. Rather than reviewing the specific research studies, however, I analyze the entire enterprise of evaluating warnings and self-protective behavior in controlled, experimental settings. The conclusion is that most controlled research studies reveal little about how real people behave in the real world, despite the protestations of their authors.

Controlled experiments employ a methodology that limits their generality and relevance. Unfortunately, most people who use these studies are unfamiliar with the inner workings of research. They simply read the abstract and don't examine the methodology. Even a careful reading of an article's "methods" section, however, will not reveal the entire story. Much of the methodology is never explicitly stated. Journals have limited space, and researchers are assumed to have certain background knowledge and to work with an accepted set of basic precepts that need not be explicitly stated. Moreover, much of the methodology is grounded in the invisible and unstated politics and goals of researchers. Despite protests to the contrary, researchers are seldom totally objective - at least not when their career advancement is at stake. The ultimate goal of all researchers is a publishable paper. The old saw of "publish or perish" is no exaggeration. To be published, the paper must find a result that is statistically significant to at least the .05 level. To obtain such a result, the data variability must be small. To ensure low variability, the study must be run in simplified conditions with a highly homogeneous population. Lastly, warnings research does not receive much funding, so the research must usually be done on the cheap, ideally using paper-and-pencil methods and free subjects - hence the popularity of undergraduate university students. As we will see, all of these factors play a major role in reducing the ecological validity of the research. However, there are many more.

Subjects in controlled studies are not representative of real users.

Subjects in controlled studies differ from real people in the real world in many crucial characteristics. I am not talking about demographic differences, although it is true that many, if not most, controlled studies use college students.

Even some warnings researchers, however, have recognized the limitations of using undergraduates. One popular alternative is recruiting people who are walking through shopping malls, etc. This, of course, misses the point. While the subject population will be more diverse demographically, it is just as limited. Unlike people in real situations, research subjects are operating in an alerted, conscious mode and have no costs, goals, or context.

Research subjects are aware.

Humans act in two modes, controlled behavior and automated behavior (Shiffrin and Schneider, 1977). People engaged in routine, everyday behavior operate largely in an automatic mode, where they seldom make conscious decisions or consider environmental factors irrelevant to their goal.

In contrast to this normal behavior, subjects in controlled experiments are in a new and artificial situation where they are on alert and/or simply told where to attend. They operate in controlled mode, making deliberate choices based on conscious decisions. Hale and Glendon (1987) highlight the importance of the automatic/controlled dichotomy for understanding risk and warnings by noting that "the conscious pursuit of health and safety is usually a very minor concern of the individual, which is only incidental to the pursuit of other goals, which may at times be in conflict with safety. Only rarely are the conscious plans … brought to the forefront of consideration." (p. 8-1) Being in a controlled study brings these plans to the forefront of consideration, which is not like the real world.

Research subjects have no costs.

In the real world, compliance with a warning usually carries a cost: either increased effort (e.g., walking around on the pavement rather than cutting across the grass) and/or a less optimal outcome (e.g., using a less effective bug killer and not eliminating all the ants). One of the few solid findings in the warnings literature is that the cost of complying with the admonition is a key factor in effectiveness. Elsewhere, for example, I have described how costs defeated a warning to pedestrians against crossing midblock.

Subjects in a controlled experiment pay no cost when they say that they would comply with a warning. They are under no time pressure, no work schedule, no motivation to be more efficient, no benefit in optimizing the outcome, etc. In the real world, users are less willing to pay the attendant cost, so their behavior is likely to be less influenced by the warning. Rather, their behavior will be controlled by the contingencies, a topic I shall discuss at length later. As a result, there can be little doubt that mere ratings of intent are grossly inflated relative to the rate of behavioral compliance. It is easy to say that you are going to go on a diet when you aren't hungry.

However, a few of the better studies have put subjects in situations where there may be costs. Unsurprisingly, this decreases compliance, but it is unclear that such studies have added anything new, since the operant Law of Effect readily predicts this outcome without the need for any specific experiment. The problem is that the academic world is a collection of silos. Warnings researchers likely have no knowledge of operant learning and have never heard of the Law of Effect. In fact, the few robust warnings research findings would have been readily predictable from basic research. Still, this highlights the fact that studies where subjects have no costs are of little value.
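To make the contingency concrete, here is a toy expected-payoff sketch in Python. All of the numbers are hypothetical, and the model is merely the Law of Effect restated as arithmetic: a small but certain compliance cost can outweigh a rare, large harm, so ignoring the warning "pays."

    # Toy model of warning compliance as an expected-payoff comparison.
    # All values are hypothetical illustrations, not measured quantities.

    def expected_payoff(comply: bool, compliance_cost: float,
                        harm: float, p_harm: float) -> float:
        """Expected payoff of an action; harm occurs only when not complying."""
        if comply:
            return -compliance_cost          # certain, immediate cost
        return -p_harm * harm                # rare, delayed consequence

    # Crossing midblock vs. walking to the corner (assumed values):
    cost = 2.0        # minutes of extra walking
    harm = 10_000.0   # subjective badness of being struck
    p = 0.0001        # perceived per-crossing probability of harm

    print(expected_payoff(True, cost, harm, p))    # -2.0: comply
    print(expected_payoff(False, cost, harm, p))   # -1.0: ignore the warning

On these numbers, ignoring the warning has the better expected payoff - exactly the contingency that a lab subject, who pays no cost, never faces.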

Research subjects have no goals.

People do not act randomly. When using a product or facility, they are generally acting to achieve a goal. If the warning prohibits behavior required to achieve the goal, the user must either forgo the goal, find an alternative means of achieving it, or move to an alternate goal. These issues do not arise in most controlled studies because the subjects have no goals other than completing the experiment. Other controlled studies use simulations of real tasks, but these are little better.

Recent eye movement studies (see Land, 2006, for a review) demonstrate the narrow focus of people engaged in routine, goal-oriented behavior. People dynamically direct attention almost exclusively to task-related objects in a "just in time" manner and seldom notice other information. Moreover, these studies reveal that sensory aspects of objects play almost no role in guiding attention during task performance. Controlled studies that obsess over minutiae such as the color of the warning, the signal word, etc., are largely missing the point; people engaged in routine behavior are unlikely to notice a warning regardless of its content. For example, ATV drivers frequently fail to notice warnings against carrying passengers, although they are plastered all over the vehicle and throughout the instructions.

Research subjects have no context.

Behavior in the real world often occurs in physical and social contexts that are absent in controlled experiments. A few common contextual factors include time pressure and distraction. Social context is also important, as people may take their cues from the actions of others. The behavior of other people is likely to influence warning compliance - if other people cross the street against the light, drop litter, etc., then you are far more likely to behave in a similar manner.

Research subjects make comparative, not absolute judgments.

This is one of the most important, yet most overlooked, limitations of the experimental research literature. In most controlled studies, the subject makes a relative rather than an absolute judgment. The subject usually sees several different versions of a warning and then makes comparative judgments about their conspicuity, perceived effectiveness, intent to comply, etc. The experimenter's goal is to determine which value of a variable such as color, size, or content leads to greatest compliance.

In normal circumstances, the user sees only one warning and makes an absolute judgment. Comparative and absolute judgments can produce very different compliance results. Adams, Bochner, and Bilik (1998), for example, had subjects evaluate warnings under two conditions. Some subjects saw a set of warnings and rated each (a "within subjects" experimental design). In another condition, subjects evaluated the same warnings, but each subject saw only one of the set (an "across subjects" experimental design). Subjects viewing the set rated some warnings as being better than others. Ratings of subjects who had seen only one version showed no difference across warnings. This study demonstrates how the context provided by the comparison can produce a difference that is unlikely to appear in the real world of absolute judgment.

Keown (1983) further found that risk estimates differed in relative vs. absolute judgment. Subjects who rated five hypothetical drugs judged the risks higher than a second set of subjects who rated the risk of only a single drug. Here again, the context provided by the relative judgments produced results that would not likely be meaningful in the real world.

Another study (Ayres, Gross, Horst, Wood, Beyer, Acomb, and Bjetajac, 1990) hints at the same effect. They examined how well subjects could estimate the effectiveness of warnings whose ability to produce compliance was already known. Each subject saw only one of the six possible warnings (an across subjects design). Subjects both drastically overrated the levels of actual compliance and were unable to distinguish the effectiveness of different versions of the same warning.

These results are hardly surprising. It is well known in behavioral research that comparative judgments are more sensitive to small differences than are absolute judgments. Even if the differences produced by comparative judgments are real, they are less likely to predict real-world behavior. It is a general property of decision making that choice is relative to the set of alternatives presented.
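The sensitivity difference is easy to demonstrate with a small simulation. The Python sketch below (all parameters are hypothetical) gives two warnings nearly identical true ratings; pairing each rater's judgments in the within-subjects case cancels individual rating bias, while the across-subjects case leaves that bias in the noise.

    # Simulation of within- vs. across-subjects sensitivity (hypothetical values).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 40                                 # raters per condition (assumed)
    bias = rng.normal(0, 2.0, n)           # large individual differences in rating style

    # Within subjects: each rater judges both warnings; the bias cancels in the pairing.
    rating_a = bias + rng.normal(5.0, 0.5, n)     # warning A, true mean 5.0
    rating_b = bias + rng.normal(5.3, 0.5, n)     # warning B, true mean 5.3
    _, p_within = stats.ttest_rel(rating_a, rating_b)

    # Across subjects: separate groups each see one warning; the bias stays in the noise.
    group_a = rng.normal(0, 2.0, n) + rng.normal(5.0, 0.5, n)
    group_b = rng.normal(0, 2.0, n) + rng.normal(5.3, 0.5, n)
    _, p_across = stats.ttest_ind(group_a, group_b)

    print(f"within subjects  p = {p_within:.3f}")   # often below .05
    print(f"across subjects  p = {p_across:.3f}")   # usually well above .05

The same tiny difference is detectable only when the comparison context is supplied - precisely the context the real-world viewer never has.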

Measures of intended behavior correlate poorly with actual behavior.

This follows from the absence of contexts and costs. Much of the controlled research literature rests heavily on the assumption that verbal statements of intention reflect future behavior. This, of course, is the crucial link that would make controlled studies ecologically valid. If talk is really cheap, then a large portion of the warnings research literature is immediately rendered ecologically invalid - useless. There is much evidence that people are poor at predicting warning effectiveness. For example, Wagenaar (1992) found that users' predictions of their own future behavior vastly overestimate their actual compliance. People in general, and engineers in particular, were poor at predicting warning effectiveness (Frantz, Miller, and Main, 1993). Why, after all, should they be any good at predicting behavior?

My personal experience further leads me to doubt that the link between intention and behavior is strong; I must confess that the concordance between my New Year's resolutions and my actual behavior is, sadly, zero. I've already discussed the reasons, which are fairly obvious. First, good intentions often wither in the face of the costs. Intending to walk around the block to avoid ruining the grass is different from actually having to walk around the block. Second, behavior is highly contextual, meaning that it is determined by factors acting on the person at the moment of action. There may be time pressure, distraction, etc. There may be other people ignoring the warning, which is an important factor since social modeling affects most behaviors. Third, a person giving a response in a laboratory is disembodied from context and is not aware of, and cannot predict, the factors that will be operating at the moment of action. It may be a pleasant room temperature when the subject says that he will walk around the block, but it might be 10° and snowing when he actually has to do it. It is easy to give the socially acceptable answer when it costs nothing. Fourth, the person is alert and in controlled-behavior mode during the experiment, but may be in automatic mode when performing the action. Decisions in these two conditions arise from different levels of consciousness and possibly different parts of the brain.

Statistical significance is not always practical significance.

In science, the criterion that divides significant from non-significant is purely statistical. The data must reach the .05 level of significance, meaning that there is a less than 1 in 20 probability that the result was due to chance. (Said another way, if a journal contains twenty articles with results significant at the .05 level, then, on average, one of them is reporting an artifactual result.) Such a result might be strong enough to claim that some variable has a statistically significant effect on warning "effectiveness" and to be publishable, but the effect still may not be practically significant. For example, a study might find that adding a pictograph to a textual warning raises some measure, such as intent to comply, from 10 to 18 percent. With a large enough number of subjects, this might be a statistically significant effect. However, there are two problems in applying this result. First, an 18 percent compliance rate is still very low. The warning is largely ineffective and would not be significant in the legal sense; it would not meet the required "reasonable scientific certainty" criterion. Second, the 8 percent increase is small and would likely be swamped by more important variables in the real world. No one could plausibly say that an 8 percent change in probability would have altered the outcome. The real world simply has too much uncertainty.
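The arithmetic of the example is easy to check. The Python sketch below (group sizes are assumed for illustration) runs a standard chi-square test on the 10 versus 18 percent figures: the difference comes out "significant," yet the great majority of users still ignore the warning.

    # Statistical vs. practical significance (hypothetical group sizes).
    import numpy as np
    from scipy.stats import chi2_contingency

    n = 200                          # subjects per group (assumed)
    complied = np.array([20, 36])    # 10% of 200 (text only), 18% of 200 (pictograph)
    ignored = n - complied

    chi2, p, dof, expected = chi2_contingency(np.stack([complied, ignored]))
    print(f"p = {p:.3f}")                            # about .03: statistically significant
    print(f"still ignoring: {ignored[1] / n:.0%}")   # 82%: practically ineffective

Doubling the group sizes would shrink p further still, but it would not change the 82 percent of users who ignore the improved warning.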

Controlled studies often have strong "demand characteristics".

The structure of a study can influence the outcome by containing "demand characteristics," attributes that bias subjects toward particular responses. For example, subjects in controlled studies cannot simply say, "I don't know." They cannot quit because they are bored or tired (which they probably are if they are just fulfilling their Psychology 100 subject-pool requirement). They cannot switch to an alternative goal or behavior. One especially powerful demand characteristic is the tendency to give the socially acceptable or desirable answer. Some people are likely to say what they think is the correct answer in order to please the researcher, since it costs them nothing. Perhaps the most obvious demand characteristic is that subjects must pay attention to the warning. In the real world, much of our behavior occurs automatically and under minimal supervisory control, as already explained.

The research that reaches publication is a biased sample.

Results showing a failure of warnings are far less likely to be published than those showing that some variable enhances their effect. Most warning studies appear in publications dominated by academics. Academic journals do not like to publish negative results, so data that do not reach statistical significance are far less likely to appear in print. Moreover, university researchers must produce positive results and publications to gain tenure and advancement. (As the old academic adage says, "they can't read, but they can count!") If a faculty member's academic career depends on finding positive results in warning studies, s/he is going to look for them and push the belief that they are important for warning effectiveness.

Further, funding agencies don't like negative results, so it is in the researcher's best interest to avoid them. Politics also strongly affects research. No agency, for example, is going to fund a grant whose goal is to show some warning or warning variable to be useless. Imagine trying to get money to show that ANSI warning formats are effective. Now imagine trying to get money to show that they are ineffective. Funding agencies have agendas, and university researchers know where their bread is buttered.

Failure to obtain positive results on one grant also severely lowers the odds of being funded on the next application. As a result, university researchers either don't publish negative results or keep adding subjects until some small effect is obtained. In sum, there is strong pressure on university researchers to avoid negative results. (This may be why most negative results originate from non-academic researchers.) This is an essential point, since the Human Factors and Ergonomics Society Proceedings, perhaps the major source of research studies on warnings, is largely a collection of papers by university faculty and their graduate students. Moreover, much of this work is published by a small cadre of researchers who are academically related - one advised another as a graduate student, etc. - so much of the research represents a single academic viewpoint. The flood of positive warning results produced by academic researchers with undergraduate students stands in stark contrast to the real-world studies, which are much less likely to find that warnings have much effect. This is why the safety hierarchy is a fundamental design principle - warnings are highly unreliable.
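The selective filter is easy to simulate. The sketch below (all parameters assumed) runs many small hypothetical studies of a nearly useless warning variable and "publishes" only the positive, significant ones; the resulting literature reports an average effect several times larger than the true one.

    # Publication bias in miniature (all parameters hypothetical).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    true_effect = 0.1            # tiny true effect, in standard-deviation units
    n, n_studies = 30, 1000      # 30 subjects per group, 1000 simulated studies

    published = []
    for _ in range(n_studies):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(true_effect, 1.0, n)
        t, p = stats.ttest_ind(treated, control)
        if p < 0.05 and t > 0:                    # only positive "hits" reach print
            published.append(treated.mean() - control.mean())

    print(f"true effect:           {true_effect}")
    print(f"studies published:     {len(published)} of {n_studies}")
    print(f"mean published effect: {np.mean(published):.2f}")  # several times the truth

Nothing here requires misconduct; the journal's significance filter alone manufactures a literature of strong positive findings from a nearly null effect.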

Conclusion

The debate over warning compliance should focus more on the viewer than on the warning itself. Of course, the warning should be legible and intelligible. It should be brief and formatted for easy reading. It should also be explicit about the nature of the hazard and the method for avoiding harm. Beyond these minimal requirements, however, the key factor in warning design is a realistic understanding of user expectations and goals.

There is little evidence that following any specific format guideline, such as ANSI Z535, produces more effective warnings. The reason is that warning effectiveness depends primarily on the viewer's mental set and less on the warning itself. Further, the patina of scientific credibility masks major questions about the validity of controlled warnings research. The lack of validity is due to the design of controlled studies - they disembody people from real situations, expectations, goals, and perceptions of risk, so they fail to reflect normal viewer cognition and mental set, which are the prime determinants of whether a warning will be effective.

Finally, despite the limitations of controlled studies that specifically examine warnings, it is still possible to make reasonable assessments of warning adequacy and of user compliance. Rather than relying (completely) on the controlled warning research literature, however, such assessments should be based on well-documented basic science about the general properties of human perception, attention, and learning. See the Warning Checklist for more discussion of this point.