Citigroup and extreme overpayments
In two separate incidents in April, Citigroup employees entered wildly incorrect amounts into customer transactions. The incidents make for an interesting study in ergonomics.
Citigroup was in the news twice over the past month for extreme overpayments into customer accounts.
In the more widely covered story, Citigroup employees accidentally credited $81 trillion to a customer’s account.
Citigroup attributed the $81 trillion error to a combination of manual processes and a cumbersome back-up system. The payment, originally intended for an escrow account in Brazil, became stuck in the bank’s system due to a sanctions screen. Employees were instructed to use a rarely accessed interface that pre-populated transaction fields with 15 zeros, a design flaw that contributed to the mistake.
And apparently people were supposed to check for this, but the transaction got rubber-stamped and was only caught after the money hit the customer’s account.
The mistake occurred during a manual entry process when a payments employee failed to delete pre-filled zeros in a backup system’s transaction field, Entrepreneur reported. A second official assigned to review the entry passed the error undetected, which was flagged 90 minutes later by another employee monitoring account balances.
Don’t you feel at least a little bad for this person? Imagine being responsible for resolving this transaction. It can’t be processed normally because of a sanctions check. It’s your job to manually override the sanctions block. You have to do this with some almost-forgotten piece of software. You take out this ancient program and blow all of the dust off. You wipe off some dirt. An ancient unwieldy UI lies underneath, staring at you, daring you to plumb its arcane secrets. You carefully match all of the form fields in this application to information from the user’s account and the transaction in question. Finally you are done. You click “submit.” The transaction processes. Some random customer just became the richest person in the world for 90 minutes. “I didn’t know I was supposed to delete all of the zeroes!” you scream as the world starts spinning around you. The ancient program grows dark, waiting for its next victim.
Although this happened last April, it surfaced in the news last month. It was soon joined by a different — but similar — story.
Citigroup almost sent $6 billion to wealth account in copy-paste error, Bloomberg reports
Citigroup nearly credited about $6 billion to a customer's account in its wealth-management business by accident, Bloomberg News reported on Monday, citing people familiar with the matter.
The near-error occurred after a staffer handling the transfer copied and pasted the account number into a field for the dollar figure, which was detected on the next business day, the report added.
This mistake feels really relatable? Copy paste copy paste copy paste copy paste copy paste copy paste. Next form. Copy paste paste copy paste copy paste copy paste copy paste. Next form. Wait, why is everyone mad at me?
Laughing at these mistakes is one thing. But what if we were tasked with improving these systems so that this could not happen again? How would we even approach it? We can look through the lens of a field of study called “human factors” or “ergonomics.” This is the study of the physical and cognitive limitations of humans and of how to design systems that work with them. For example, air travel is one of the safest forms of travel. But it didn’t begin that way. Crash by crash, death by death, the world examined the mistakes made by doomed pilots and asked, “how can we change the system around them to make this less likely?”
Where do we begin? Let’s ask ourselves a really simple question: “how do accidents happen?” In ergonomics, there are a few established mental models of how errors occur. Some of them are old and discredited, and some reflect a more modern view. Here is a sampling of them:
The most natural human tendency is to believe the “chain-of-events” model, where a “bad apple” caused an error. Simply removing the bad link (typically the human) will fix everything. “This employee sent over a trillion dollars to a customer account. Fire them.” It turns out that this approach falls short for a few reasons. The most obvious is that someone will just make that same mistake again. It also has a bunch of insidious consequences for your organization. If people can get fired for making trivial mistakes, then people will just start underreporting their mistakes. Problems won’t receive visibility and won’t be addressed.
The next human tendency is to view catastrophes as basically overcoming every protection that was in place. This is called the “barrier” model, or sometimes the “Swiss cheese” model. Imagine a few slices of Swiss cheese stacked on top of each other. If you poke a random part of the stack, you will likely hit cheese. But sometimes all of the holes line up and you get all the way through. This is effectively the “accident” — you’ve gotten around every barrier. This model is appealing because it excels at describing the step-by-step events of an accident. “Adam dropped a banana peel in front of Joe’s apartment. Normally Joe looks down for any potential obstructions, but he was walking home with a giant box of murder hornets that obstructed his vision. The giant box was secured with enough tape to keep the murder hornets in. But nobody had ever tested whether this amount of tape could keep the box contents secured in the event of a pratfall. Joe turns the corner and does not see the banana, and proceeds to move forward…”
The barrier model is still not considered an acceptable model for explaining how accidents happen. If you believe that safeguards are the only approach to fixing problems, then you would likely issue a suite of recommendations that include extra safeguards. Mirrors should be installed in the hallways so that people can see the floor if they are carrying a large box. Maintenance should install extra cameras and have someone look out for slippery objects in the hall. There are documented proper ways to transport hornets; consider using them in the future. And this shows the problem with the barrier model! Nobody stops to ask themselves, “Why did Adam seemingly intentionally drop a banana in front of Joe’s apartment? Why did Joe have murder hornets in the first place??” And these are the salient questions for getting to the bottom of the accident!
And now we are getting into more interesting accident models. The first one is the “systems model.” In this model, the entire organization defines a complex system. The organization contains lots of moving parts that each might have their own objectives and goals. All of the resulting interactions might cause emergent behavior, i.e. behavior that is not designed but emerges from the properties of the system.
One interesting observation is that in an emergent system, everything might have gone “right” and the accident still happened. For example: imagine a pilot attempting to land in a storm. The system dictates that they are clear to land if there is no ice on the runway and the wind is under some threshold. Air traffic control clears the pilot for landing, the airplane gets caught in a microburst, and everyone dies. Nobody directly involved in the accident made a mistake, but the organization did not have the appropriate guidelines in place, and there are likely several systemic causes. Were there previous near-misses that went unheeded? Did the people making the recommendations have appropriate experience? Were they trying to keep the recommendations simple enough to understand, and did they trim excess rules that seemed unlikely to be needed?
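To make the “everything went right” point concrete, here is a toy sketch of such a rule set in Python. The thresholds and field names are invented for illustration; the point is that every written rule can be satisfied while the hazard that actually matters never appears in the rules at all.

MAX_WIND_KNOTS = 35  # hypothetical limit from the (fictional) landing guidelines

def cleared_to_land(conditions: dict) -> bool:
    """Apply the written guidelines: no runway ice, wind under the threshold."""
    return (not conditions["runway_ice"]
            and conditions["wind_knots"] < MAX_WIND_KNOTS)

# The night of the accident: every rule is satisfied...
conditions = {
    "runway_ice": False,
    "wind_knots": 22,
    "microburst_risk": True,  # ...but no rule ever looks at this field.
}

print(cleared_to_land(conditions))  # True: the rules "worked," and it was not enough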
But plenty of accidents still involve plain old mistakes. Let’s finally go back and start thinking about the bad Citigroup transfers again. First, Citigroup clearly tried to define a system. An employee had a clear process for overriding the hold and issuing the transaction. The transfer underwent a review by a second employee. If a bad transaction somehow gets approved, there are additional employees who monitor for anomalous transactions and can begin processes to reverse them. But let’s ask ourselves some questions from a systems perspective:
In this arcane program, is there any indication of how the program is interpreting the input? For example, if you typed
000000000000280
could you see that the program believes that the parsed value is in the trillions? (A sketch of this idea follows these questions.)
How often do employees need to use this program? If they are unfamiliar with it, how are they trained to use it?
How long had management been brushing aside complaints that the program was unacceptable?
If they had taken the concerns seriously, how easy would it have been to prioritize work on a rarely-used program based solely on the argument that it was easy to misuse?
For the transaction reviewer, did transactions issued from the arcane program appear differently than other transactions they reviewed? Did they clearly indicate the amount in the originating request, or would the reviewer need to look up the relevant sanctions ticket and determine what the value should have been?
For the transaction reviewer, how many hours in a row are they reviewing these requests? Is a very high percentage of them approved? Do people have quotas? Are people rubber stamping these reviews because they distract from their own work?
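On that first question: one cheap improvement is for the program to echo back its own interpretation of the amount, in words, before it accepts the submission. Here is a minimal sketch of the idea in Python; the parsing rule, the function names, and the confirmation prompt are all invented for illustration and have nothing to do with Citigroup’s actual software.

SCALE_WORDS = [(10**12, "trillion"), (10**9, "billion"), (10**6, "million")]

def parse_amount(raw: str) -> int:
    """Interpret the raw field contents as whole dollars (a made-up rule)."""
    digits = raw.strip().replace(",", "")
    if not digits.isdigit():
        raise ValueError(f"not a numeric amount: {raw!r}")
    return int(digits)

def describe(amount: int) -> str:
    """Render the parsed value the way a reviewer would say it out loud."""
    for scale, word in SCALE_WORDS:
        if amount >= scale:
            return f"${amount:,} (roughly {amount / scale:.1f} {word})"
    return f"${amount:,}"

def confirm_transfer(raw_amount: str) -> bool:
    """Show the interpretation and require an explicit yes before submitting."""
    answer = input(f"You are about to transfer {describe(parse_amount(raw_amount))}. Type 'yes' to continue: ")
    return answer.strip().lower() == "yes"

print(describe(parse_amount("280")))             # $280
print(describe(parse_amount("81000000000000")))  # $81,000,000,000,000 (roughly 81.0 trillion)

Whatever the arcane program’s real parsing rules were, a readback like this turns “delete the pre-filled zeros” from tribal knowledge into something the screen tells you before you can do any damage.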
Surely you can see where this is going. This isn’t a story about someone who used a program incorrectly. It’s a story about a company that puts employees in situations where they are very likely to issue incorrect transfer requests. The company just reacts with safeguards instead of attempting to improve its control over the system. It is also very likely that the safeguards differ in quality between theory and practice. You can even see echoes of this yourself — the wealth-transfer near-miss was likely produced by a similar system.
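For the copy-paste near-miss, the barrier-model response is easy to imagine: a dumb cross-field check that holds any transfer whose amount looks like the destination account number or blows past a review ceiling. A sketch, with completely invented numbers and thresholds, might look like the following. Note that it still leaves every systems question above unanswered.

def second_look_reasons(amount: int, account_number: str, ceiling: int = 1_000_000_000) -> list:
    """Return reasons a transfer should be held for review before it is queued."""
    reasons = []
    if str(amount) == account_number.replace("-", ""):
        reasons.append("amount is identical to the destination account number")
    if amount > ceiling:
        reasons.append(f"amount exceeds the ${ceiling:,} review ceiling")
    return reasons

# Invented numbers in the spirit of the reported near-miss:
# an account number pasted into the dollar field.
print(second_look_reasons(6_000_000_123, "6000000123"))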
There is one more model that I want to go over, and I will be brief because it is already past my bedtime. This is the “drift” model, which explains rare accidents: the idea that a system starts from a safe baseline and then gradually departs from it through complacency. For example, baggage screeners at airports need to examine many safe bags so that they can occasionally spot malicious goods. And I’m sure people bring every kind of object that toes the line between what is permissible and impermissible under the guidelines — which are themselves ever-changing. After examining hundreds of knife-looking objects that turn out to be other things, it could be very easy to miss an actual knife.
The solution is to inject some doubt into the system. What if you periodically passed contraband through the detectors and made it clear that job performance would partially hinge on catching all of it? What if you didn’t trust people to maintain performance over time and rotated them out every 30 minutes? Are the current technologies good enough to detect threats, or should we investigate new ones? You want to force the system to continually improve itself.
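If it helps to see that first idea in code, here is a trivial, entirely invented sketch of planting known test items into a screening stream and scoring how many get caught; a falling catch rate is the early warning that the system has drifted.

import random

def audit_screening(bags, screener, plant_rate=0.02, seed=0):
    """Randomly swap in planted test threats and score the screener's catch rate."""
    rng = random.Random(seed)
    caught = planted = 0
    for bag in bags:
        if rng.random() < plant_rate:
            bag = {"threat": True, "planted": True}  # a known test item
            planted += 1
            caught += bool(screener(bag))
        else:
            screener(bag)
    return caught, planted

def drifted_screener(bag):
    """A screener whose attention has drifted: waves everything through."""
    return False

caught, planted = audit_screening([{"threat": False}] * 5000, drifted_screener)
print(f"caught {caught} of {planted} planted test items")  # caught 0 of roughly 100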
If you want to read more about human factors/ergonomics, I really enjoyed The Field Guide to Understanding ‘Human Error’ by Sidney Dekker. This isn’t an affiliate link; I just think you might enjoy it if you enjoyed this post. It goes into much more depth explaining these models and how to think about designing systems where people make mistakes, complete with plenty of depressing anecdotes involving people dying in terrible accidents.