What is a No Reward Marker (NRM), and is it a useful tool or an awful mistake?
Should a good clicker trainer use an NRM, and, if so, when?
It’s out there, lurking. At times you feel it stalking just behind you. At last it springs as someone asks, “Why don’t you tell your dog it was wrong?”
The NRM debate has been reopened once more.
The debate arises in cycles, but next time you’ll be prepared for it, no matter how stealthily it creeps.
What is an NRM, anyway?
On the surface, an NRM is rather straightforward. At times, though, there is considerable debate regarding its true nature. The No Reward Marker is usually described as “conditioned extinction,” as its intention is to inform the learner that no reinforcement awaits down the path he is considering. One of the best-known examples of an NRM is the children’s game Hot & Cold, in which “getting warmer” guides the participant toward a goal object, while “colder” indicates that the participant should try another route.
At first glance, this looks like a good use of continuous feedback. However, a closer examination reveals that the “cold” feedback is really unnecessary. Savvy players start by spinning in place until they hear “hot”; they do not waste time wandering the room to see how many “cold” responses they can collect. In fact, the lack of a “hot” response is equivalent to a “cold” response, as anyone who has played a shaping game can attest. In the clicker trainer’s version of Hot & Cold, the feedback is click and no-click, respectively. In both the children’s version and the clicker version, the cold answer (“cold” or no click) adds no further information.
Testing NRMs in humans
At a Shedd Aquarium training workshop, Ken Ramirez led us through a variety of training games to develop skills in timing, cuing, chaining, and more. After several days, he gave us a new challenge: train a human subject to perform three simple cued behaviors using both a conditioned reinforcer and an NRM. Our task would not be complete unless the three behaviors could be performed successfully—and the subject could recognize and define our NRM stimulus.
The results were amazing. Even though we had been discussing NRMs so recently and the concept was fresh in our minds (unlike the minds of our usual animal subjects), only one learner out of fifteen guessed that the extra stimulus was meant as an NRM, a piece of useful data.
Meanwhile, every learner exhibited frustration and even occasional aggression (sometimes veiled as jokes and sometimes not). About half of the learners never completed the tasks in the allotted time, even though they had been highly successful in the other games.
In my own learner, I saw cue inversion (frustration and confusion about the NRM spread to other areas) and a general loss of enthusiasm. Even though I was making everything as plain and simple as possible, marking errors with the same precision as I would click correct responses and trying to follow each error with a chance for success, I could see her attitude souring.
Yet as a group we’d done well. Ours was the first session in all of Ken’s years of teaching in which no one had stormed out angrily during the NRM challenge.
This experience cemented my current opinion on the NRM. If humans who already understood the NRM concept felt this much confusion and frustration, why risk those feelings in learners who cannot discuss it with us?
Conditioned extinction or aversive?
There is far more to the NRM debate than this, however. Stand back, as this is where I'll step on some toes…
By the time an NRM has real meaning for the learner, it has become positive punishment.
An NRM may cue extinction, but in doing so it also signals a loss of opportunity: the chance of earning reinforcement has closed. If the subject changes his behavior to avoid the NRM (and that is the whole point of its use), then the NRM is by definition an aversive. It may be mild or severe, depending upon the learner’s mindset, but it is a stimulus the learner is actively working to avoid. Because the trainer introduces the NRM in response to the learner’s mistake (adding an aversive stimulus that makes the preceding behavior less likely), the NRM is positive punishment.
Is the NRM necessarily evil? Probably not, but it's not the completely neutral stimulus that many claim. The punishment continuum runs from fairly mild to extremely harsh, and it is the learner who interprets the severity of any given punisher. If a trainer wishes to avoid the use of positive punishment, he should be aware of all its forms, including the NRM.
Observe a contestant on a game show. When he answers a question and then hears the buzzer marking a wrong answer, does his body language indicate that the buzzer is a neutral stimulus, serving only as useful data? Certainly not! The disappointed contestant may exhibit slumping posture, frustrated displacement gestures, perhaps profanity—even if he does not lose points or money, only the opportunity to earn more of the same. For someone who really wants to be right, being wrong is quite aversive. (A learner who doesn’t care about being right is facing a motivation problem, not a data problem; an NRM won’t help and may even hinder the development of motivation.)
Some trainers use NRMs not only when shaping a new behavior but also to mark any mistake a learner makes, including a failure to respond properly to a cue (no response or an incorrect response). For example, if a trainer sends a dog to select a scented object from a collection and the dog retrieves the wrong one, the trainer might say “oops” as the dog picks up the incorrect object.
Superficially, this seems like relevant data, but it can undermine careful training. Positively trained cues are themselves tertiary reinforcers. An NRM after a failed cue breaks the contract of reinforcement by following a tertiary reinforcer with positive punishment (P+), and it creates a serious risk of poisoning the cue, rendering it useless in future chains.
(Note: If you find yourself using an NRM after a cue, review the cue. Why isn't it working? The issue is probably not the NRM at all!)
Many animals (and humans) exhibiting stress in challenging conditions are stressed not only by the tasks they face, but by the changing schedules of reinforcement and the increased chance of punishment. Is the dog really finding scent discrimination so difficult—or is the dog frustrated by the learning conditions?
Is this data necessary?
Proponents argue that NRMs are simply data to inform the learner. They say that it’s not fair to leave a dog guessing; it’s kinder to tell him what’s not working.
Why tell the dog that he wasn't successful? This question is usually asked in a more philosophical way, but I mean it very practically: if the dog needs an NRM to realize that he isn't being reinforced, the trainer has screwed up badly. Why doesn't the dog know already? Clicker training is pretty much yes/no. If training has been set up so that the dog can't tell whether he's been successful, and he needs supplemental information, then something is wrong! (See “Fixing Behaviors without an NRM” for more on this.)
Fixing Behaviors without an NRM
Training my dog Shakespeare the (admittedly silly) behavior of putting his head into a bucket for a shaping demo was going smoothly, until I inadvertently reinforced my paw-oriented dog for moving his paw as he dipped his head. Within seconds I had a dog convinced that I wanted his right paw in the bucket! While many trainers might have resorted to an NRM to discourage the paw behavior, I chose to repair the behavior using only good timing and careful placement of reinforcement.
You’ll see in the video how Shakespeare is frustrated by his low rate of success at first. Would his attitude have been improved if I’d told him that his behavior was incorrect? Would an NRM have helped him know exactly how to modify the behavior the way I wanted, or would he have associated the NRM with the bucket itself? To fix the behavior, I tightened up my timing and temporarily reduced criteria. With the resulting jump in rate of reinforcement, the learner was able to quickly grasp what I wanted and retain it. (Edits in the video are solely to save time as Shakespeare located and ate his treats.) You’ll see the superstitious paw movement persist and then fade under the weight of reinforcement for the desired behavior.
Note that it was far more tempting for Shakespeare to place his paw in the bucket as he approached it from a distance; the test for this behavior was his ability to approach from across the room and place his head in the bucket cleanly.
Can it ever be useful?
So is it always wrong to mark a behavior as non-reinforcing? Keep in mind that blanket generalizations are always wrong (irony intended!). Some informational cues could be called NRMs, because they signal the lack of potential for reinforcement: a red light, as opposed to the more common green light that cues an opportunity. My dogs have learned that if I say “shoo” while I’m at the computer, I’m not available to play, while at other times a nose poke might elicit attention. In this situation, “shoo” is a signal that future offered behaviors will not be reinforced. (Most pet owners will recognize that our pets know a host of these types of cues, mostly non-verbal.)
Most of the time, however, I see NRMs used as a crutch where the initial training was not clean and precise. This puts the burden of the trainer’s mistake on the learner, who didn’t receive adequate data in the first place and must now sort through additional cues, stimuli, and frustration. The vast majority of the time, the “need” for an NRM can be avoided through proper attention to training basics—good timing, appropriate criteria, and a high rate of reinforcement.
I think there is an application for NRMs in a situation where click/non-click is not clear to the subject, but these situations are rare and most trainers will not encounter them. This makes training an NRM “in case of need” a waste of effort. Spend your time training more cleanly in the first place and you’ll never need the NRM.
Alternatives to a punishing NRM
So what’s a trainer to do when a learner errs? There are several alternatives to the NRM as unintentional punisher. A time-out (usually the removal of the trainer’s attention and/or the opportunity to earn reinforcement) is negative punishment rather than positive punishment. A least-reinforcing stimulus (LRS, a complete lack of response from the trainer or environment) is true extinction, and generally the best response to an error. A trainer working at a good pace (15-20 reps per minute for a simple behavior) may pause only a second for an LRS and then move on to the next repetition, but that is enough to note the error and its (lack of) consequence. (I use an LRS at the 59-second mark in the video “Fixing Behaviors without an NRM.”)
Training without aversives
Even potentially useful tools can be harmful, especially if they are a crutch for the sloppy use of preferred tools. In Animal Training: Successful Animal Management Through Positive Reinforcement, Ken Ramirez wrote, “I frequently discourage [novice] trainers from ever conditioning a ‘no’ signal, because if there is not a signal for ‘not’ it cannot be overused.”
While writing this article, even as I argued against the use of NRMs, I found myself using more NRMs in my own training: having them on my mind made me more likely to use them, even though I knew better!
While it is true that many learners can work through NRMs, it is equally true that many cannot (and many who can, do better without them). Using NRMs is a difficult habit for a trainer to break; having the option can create the opportunity or even the need. As songwriter Jonathan Coulton noted, “We do what we must because we can.” To avoid aversives in training, be aware of them in all forms, and plan accordingly.