Batch Jobs in Clustered Environments

As much as we'd prefer to do everything in real time, there's often a need for batch processing. Performance is the most common driver, but sometimes there are business reasons as well.

As an example, maybe your central application is some transactional web app, and there's some data cleanup you need to do at the end of the day - some user transactions, say, that are helpful to keep for a day, but can be purged at midnight. Assuming this is a reasonable candidate for a batch job (bear with me), the question then is, what is this batch job? What I mean is, is it external to the transactional system from which the data was created, or is it embedded within it?

It's probably obvious that, from a scalability and performance perspective, the best solution is the external one, since any resources (CPU, memory, etc.) consumed by the batch job are not subtracted from the core transactional system. (Though, yes, they both share the DB, so maybe not the best example!)

But is there ever a case where it makes sense to run batch jobs within a transactional system? Though an architectural purist would probably say "no", there can be a few advantages to this approach:

Simplicity: Depending on your organization/environment, procuring a machine and resources to run batch jobs can be a challenge, as can be the management of the scheduling and execution of these processes. Sometimes the easiest thing to do is just include the batch processes in the thing (i.e. deployable unit) you have. For big organizations (where bureaucracy and overhead are high), this is probably the path of least resistance.

Code Reuse: Oftentimes, the batch job needs to do some of the same types of "stuff" the transactional system does. For example, it may query the same tables, shove the results into the same domain model classes, and execute the same business logic. If the batch job lived outside the transactional system, avoiding code duplication would mean externalizing these code assets to some common library for both to leverage, and the work of extracting code into new libraries is non-trivial.

If you do decide that, in your case, these advantages of running batch jobs within a transactional system outweigh the drawbacks, then it's important to note a crucial snag: clustering. When the batch job is embedded in the transactional system, and there are multiple instances of the transactional system deployed in a cluster (for either scalability or reliability/availability), then some mechanism needs to be in place to ensure that the batch job gets kicked off on one and only one node in the cluster.

There are a few possible solutions to this problem, some more reliable/elegant than others. One naive option is to just configure one node in the cluster to be the one that runs the batch job (e.g. "if hostname='server1.acme.com' then run batch job"), but this has the obvious flaw that if this server is down, another node in the cluster wouldn't know that it needs to pick up the slack. JBoss has a more robust form of this solution with its HASingletonDeployer.
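
To make the naive option concrete, here's a minimal sketch of the hostname check. The class and method names are made up for illustration, and the hostname is just the example from above. Note that InetAddress.getHostName() may return a short name rather than a fully qualified one depending on how the machine is configured, so treat the comparison as an assumption about your environment.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper: gates the batch job on a single, hard-coded node.
final class HostnameGate {
    // Assumed designated node, taken from the example in the text.
    static final String BATCH_HOST = "server1.acme.com";

    // True only when the local host is the designated batch node.
    // Hostnames are case-insensitive, so compare accordingly.
    static boolean shouldRunBatch(String localHost, String designatedHost) {
        return designatedHost.equalsIgnoreCase(localHost);
    }

    // Looks up this machine's hostname as the JVM sees it.
    static String localHostName() throws UnknownHostException {
        return InetAddress.getLocalHost().getHostName();
    }
}
```

A scheduled task would call shouldRunBatch(HostnameGate.localHostName(), HostnameGate.BATCH_HOST) and return early when it's false. The flaw is visible right in the code: nothing here notices when server1 is down.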

Another option is for some external scheduler to kick off the batch job with a request (either via MQ or even HTTP) to the transactional system. This is workable, but has some downsides: the transactional system needs to expose endpoints (or consumers) to listen for requests to run the batch job, and the remote scheduler needs to be sophisticated enough to know that if the job wasn't picked up (or didn't run successfully) on one node, the request needs to be sent to another.
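
Here's a rough sketch of what the scheduler side of the HTTP variant might look like: try each node in turn and stop at the first one that accepts the request. Everything here (the class name, the idea of a per-node trigger URL, the timeouts) is hypothetical, and it uses the JDK's built-in HttpClient (Java 11+) rather than any particular scheduler product.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical scheduler-side helper: sends the "run batch" request to one node,
// falling back to the next node if the current one doesn't accept it.
final class FailoverTrigger {

    // Tries each node URL in order; returns the first URL whose trigger succeeded,
    // or an empty Optional if no node accepted the request.
    static Optional<String> triggerOnFirstAvailable(List<String> nodeUrls, Predicate<String> trigger) {
        for (String url : nodeUrls) {
            if (trigger.test(url)) {
                return Optional.of(url);
            }
        }
        return Optional.empty();
    }

    // One possible trigger: an empty POST that counts any 2xx response as success.
    static boolean postTrigger(String url) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        try {
            int status = client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
            return status >= 200 && status < 300;
        } catch (IOException e) {
            return false; // node unreachable or timed out: let the caller try the next one
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The scheduler would call something like FailoverTrigger.triggerOnFirstAvailable(nodes, FailoverTrigger::postTrigger). One caveat: if a node starts the job but its response gets lost, this will fire the job on a second node too, so the job itself should be safe to run twice.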

Probably the best solution for this problem I've found, for the Java side at least, is a library called ShedLock. It's designed specifically to solve this exact problem. All nodes in a cluster would be able to run the batch job, but they would use a central database to obtain a lock before running it, where only the node that has the lock runs it, and the rest skip it. The configuration is super easy, and rather than regurgitate already great documentation, just check out their GitHub.
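
To be clear, the following is not ShedLock's API. It's just a toy sketch of the lock-then-run idea described above, with an in-memory map standing in for the shared database table so it can run on its own. All the names are invented for illustration; in a real cluster the "take the lock if it's free or expired" step would be a single atomic statement against the shared database.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for a shared lock table: lock name -> instant until which the lock is held.
final class InMemoryLockTable {
    private final Map<String, Instant> lockedUntil = new ConcurrentHashMap<>();

    // Atomically takes the lock if nobody holds it or the previous hold has expired.
    boolean tryLock(String name, Instant now, Duration holdFor) {
        boolean[] acquired = {false};
        lockedUntil.compute(name, (key, until) -> {
            if (until == null || !until.isAfter(now)) {
                acquired[0] = true;
                return now.plus(holdFor); // we now hold the lock until this instant
            }
            return until; // someone else still holds it; leave the entry unchanged
        });
        return acquired[0];
    }
}

// What every node does when the schedule fires: run only if the lock was obtained.
final class LockedBatchJob {
    static boolean runIfLockAcquired(InMemoryLockTable locks, String jobName,
                                     Instant now, Duration holdFor, Runnable job) {
        if (!locks.tryLock(jobName, now, holdFor)) {
            return false; // another node got there first, so skip this run
        }
        job.run();
        return true;
    }
}
```

Every node calls runIfLockAcquired at midnight; the first one to get the lock runs the purge, and the rest return false and do nothing. Because the lock expires rather than being held forever, a node that dies mid-job doesn't block the next night's run.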

Anyway, this was a problem that we recently confronted, and since I didn't find a ton of good content out there I figured I'd write this out. Hope it helps someone!