How to Build Robust Processes
In this excerpt from a podcast conversation between Keith Swenson of Fujitsu North America and Peter Schooff, Managing Editor of BPM.com, they discuss how to identify a robust process, how to tell when a process is not robust, and what to do about it.
PETER: So just to start it off, how would you define a robust process?
KEITH: It's probably better if I start by defining what it isn't. A lot of people think that a robust process is something that, when you start it, is completely opaque: it's fire and forget, it's 100% reliable, it does what it's supposed to do. The problem is that that kind of 100% reliability is only possible in isolated and idealized environments, and a business process has to deal with the realities of your organization, so it can't really run that way. So when I define a robust process, I define it as a process that, when it works, you have confidence that it did it right, and when it can't work, it tells you clearly what the problem is and allows for being restarted in some sort of controlled manner, so that you can eventually get to completion and to success.
PETER: That makes sense, and when you start out by saying what it isn't, that basically tells me there are a lot of non-robust processes out there. Can you give me some examples of non-robust processes that you've seen?
KEITH: Well, I can. The thing is that many of these business processes are designed by programmers, and programmers are used to working in more or less idealized environments. When I design a program to run on a particular system, I can code that program, test it, get all the bugs out, and make it run on that one system very, very reliably. When we work with distributed systems, it's a little bit different. One example comes from a customer I was working with recently, and as many of these things go, it was a legacy modernization project. This was for human resources, and they wanted to onboard employees effectively. They had six legacy systems that did various things, like allocating a badge for someone or setting up an account, and they didn't want to rip and replace all of those; they wanted to leverage them. So they built a master process: it started, you entered all the information about the person, and then it called off to these other systems. But the reason it wasn't reliable was that as you pass information from one system to another, there is a chance that one of those systems may fail. The design simply could not guarantee that everything worked right.
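As an illustration, here is a minimal sketch in Python of the kind of master process Keith describes; the system names and the simulated failure are hypothetical stand-ins for the six legacy systems:

```python
import random

# Hypothetical stand-ins for three of the six legacy systems; each call
# crosses a system boundary and can fail independently of the others.
def allocate_badge(employee):
    _maybe_fail("badge")

def create_account(employee):
    _maybe_fail("account")

def setup_payroll(employee):
    _maybe_fail("payroll")

def _maybe_fail(system):
    if random.random() < 0.1:  # simulate an occasional outage
        raise ConnectionError(f"{system} system did not respond")

def onboard_employee(employee):
    # The master process calls each legacy system in turn. If, say,
    # setup_payroll() fails after allocate_badge() has succeeded, the
    # employee is left half-onboarded and nothing in this design
    # guarantees recovery.
    allocate_badge(employee)
    create_account(employee)
    setup_payroll(employee)

onboard_employee({"name": "Alice"})
```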
PETER: Gotcha. Now, say you have a failure. What is the key to basically hurdling this failure, getting over it and fixing that process?
KEITH: Yes, it's important to remember that failures of this type are always going to happen, and you need to design around them so that you can recover. You need to expect them and to recover from them. When I see these highly fragile processes, I often think of a Rube Goldberg machine. Are you familiar with Rube Goldberg?
PETER: Sure, that's one of those extremely complex machines that does something basic like flipping a light switch, correct?
KEITH: That's right. They're very humorous, and they involve a bunch of very complicated steps; if every step completes correctly, you end up getting your successful result. What is humorous about that situation is that it is very easy to see how every little step could fail: if any little thing went just a bit wrong, the whole thing would break. Many of these business processes are designed the same way, either by business people without a lot of experience in the area or by programmers who are used to working in idealized environments, and they hand information off. System A passes the information to System B, which does some processing on it and passes it on to System C, and it looks a little bit like a Rube Goldberg machine. If everything works perfectly, it works. But if any little thing fails along the way, the whole thing stops, and the system may not be able to recover from that.
PETER: Right. So, how would distributed systems handle something like this? And can you also touch on microservices?
KEITH: Oh, those are two good questions. On the first one: because it's a distributed system, let me talk a little bit about reliability in a distributed environment. In a localized environment, we make use of transactions, a software engineering concept that allows you to guarantee consistency. If you start in a consistent state and you begin a transaction, your program can make some updates to the data, and either all of the updates will be done or none of them will be done. This is very, very important for making sure you go from one consistent state to another consistent state; you never end up with half of it written and half of it not. We understand transactions very well in the software engineering field, and there has been a lot of research into distributed transactions, where a number of systems are all involved in one transaction so that they all succeed or they all roll back. The problem is that as you include more and more systems, the cost goes up exponentially: the amount of memory needed, the amount of time needed, the amount of compute resources. As you add systems, you will reach a point where you simply cannot add another system to the transaction, because it becomes too expensive and your systems slow down far too much.
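To make the all-or-nothing idea concrete, here is a minimal sketch in Python using the standard sqlite3 module; the accounts table and the amounts are hypothetical:

```python
import sqlite3

conn = sqlite3.connect("example.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")
        # if anything raises in this block, neither update is kept: the
        # database moves from one consistent state to another, never halfway
except sqlite3.Error:
    pass  # state is unchanged; both updates were rolled back together
conn.close()
```

Either both updates land or neither does. A distributed transaction tries to extend that same guarantee across many systems, which is where the cost Keith describes starts to explode.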
PETER: That makes sense.
KEITH: You also have the problem that you can end up with deadlocks. As you have these open transactions spread through your organization, another transaction that comes along and needs some of those resources can be blocked, and that problem also grows exponentially. So what we actually do is make small islands of reliability. We make a small group of servers that is going to be reliable, and it provides a function such as ERP. Then you have another set of servers, and they will be reliable, and they provide a function such as payroll. Between these islands of reliability, programmers sometimes want to link the islands with a reliable messaging system. This is why I was invited to speak at the robustness in BPM seminar held in Monterey this year, about how you can make large systems reliable. I won't go into a lot of the details here, but the fact is that these reliable environments work very well on their own; if you attempt to bridge them with reliable messaging, however, it doesn't work. People who design business processes, or at least the system architects who implement those processes, need to understand that you can't just assume you have a reliable system.
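One way to read that advice in code is to stop pretending the link between islands is reliable: attempt the hop, and on failure record the work item so it can be restarted in a controlled way. This is a minimal sketch with hypothetical names, one possible pattern rather than Keith's specific implementation:

```python
# Sketch: a bridge between two islands of reliability that does not
# assume the link between them is reliable. All names are hypothetical.
def send_to_payroll_island(message):
    # stands in for a network hop that can, and sometimes will, fail
    raise TimeoutError("link between islands is down")

def bridge(message, failed_work):
    try:
        send_to_payroll_island(message)
        return True
    except TimeoutError as error:
        # tell someone clearly what went wrong, and keep the work item
        # around so the process can be restarted under control
        failed_work.append({"message": message, "error": str(error)})
        return False

failed_work = []
bridge({"employee": "Alice", "action": "enroll"}, failed_work)
print(failed_work)  # the failure is visible, not silently lost
```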
PETER: Right. So you're talking about basically creating some kind of feedback loop, correct?
KEITH: That is correct. That's where I see it happening. I worked on a different system, for a large international bank, where the system was handling requests quite easily. There were hundreds of people using the system, but every now and then it would get very, very slow. It would start to take up tremendous amounts of memory and lots of CPU time, and it would just escalate out of control. What was happening was that the programmers deep inside the system had decided that if a particular system call failed, it was probably because the other system was offline, so they would just try again. Instead of giving up, the code would keep retrying. Down deep in the code there were loops trying, trying, and trying again, and we actually figured out later that something was coded wrong: the call was never going to succeed, but it kept trying, so a transaction that normally took fifty milliseconds was now taking twenty, thirty, forty seconds. Instead of having to handle only one or two transactions at a time, the system was handling fifty or sixty; memory blew up, everything slowed down, and everybody was unhappy.

So the principle we took from this was: the moment you see a failure, throw an exception and stop processing. Do not continue processing when you hit an error. This is the concept of fail fast, and failing fast is critical for large systems to remain stable when they encounter an error. As soon as you hit an error, you throw an exception, you roll everything back to the beginning, and you record the fact that the error happened. You roll the transaction back so that you are in a consistent state, but you remember that the error happened and you surface it somewhere users can discover it and deal with it.
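Here is a minimal sketch of the two behaviors Keith contrasts, retry-forever versus fail fast; the remote call and the error types are hypothetical:

```python
class ProcessingError(Exception):
    pass

def call_remote_system(request):
    # hypothetical remote call that fails for a persistent reason,
    # like the miscoded call in Keith's story
    raise ConnectionError("downstream system rejected the request")

# Anti-pattern from the story: retry forever, deep inside the code.
# A fifty-millisecond transaction stretches into tens of seconds and
# concurrent work piles up until memory blows up.
def process_retry_forever(request):
    while True:
        try:
            return call_remote_system(request)
        except ConnectionError:
            continue  # never gives up, never surfaces the error

# Fail fast: stop on the first error, record it, and surface it so a
# user can deal with it once the transaction has been rolled back.
def process_fail_fast(request, error_log):
    try:
        return call_remote_system(request)
    except ConnectionError as error:
        error_log.append({"request": request, "error": str(error)})
        raise ProcessingError("processing stopped; see error log") from error

errors = []
try:
    process_fail_fast({"id": 42}, errors)
except ProcessingError:
    pass  # the failure is recorded in `errors` for a controlled restart
```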