I often use ShedLock for batch processing in clustered environments, and it works great. Recently, though, I hit an interesting situation where the behavior surprised me. I had a batch job configured like this:
```java
@SchedulerLock(name = "SomeBatchJob", lockAtMostFor = "5m", lockAtLeastFor = "1m")
```
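For context, that annotation sits on a Spring `@Scheduled` method. Here's a minimal sketch of the wiring, assuming Spring and ShedLock's Spring integration on the classpath; the class name, cron expression, and method body are hypothetical placeholders, not my actual job:

```java
import net.javacrumbs.shedlock.spring.annotation.SchedulerLock;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class SomeBatchJob {

    // ShedLock is supposed to ensure that only one node in the cluster
    // runs this at a time, holding the "SomeBatchJob" lock for at least
    // 1 minute and at most 5 minutes.
    @Scheduled(cron = "0 */5 * * * *") // hypothetical schedule
    @SchedulerLock(name = "SomeBatchJob", lockAtMostFor = "5m", lockAtLeastFor = "1m")
    public void run() {
        // batch processing goes here
    }
}
```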
Per the documentation, I set `lockAtMostFor` above what I thought was a reasonable upper bound on execution time. I had three nodes in the cluster, and released it into the wild. At some point I noticed that multiple hosts were processing the batch job at the same time, so I started looking at the data. At first, everything looked normal:
| Time | Host | Event |
|------------|--------|---------------------|
| 6:45:00 am | host 1 | batch job started |
| 6:45:50 am | host 1 | batch job completed |
| 6:46:15 am | host 2 | batch job started |
| 6:47:30 am | host 2 | batch job completed |
...but then at some point, I saw this:
| Time | Host | Event |
|------------|--------|---------------------|
| 6:53:01 am | host 1 | batch job started |
| 6:58:30 am | host 2 | batch job started |
| 6:59:05 am | host 1 | batch job completed |
| 6:59:23 am | host 3 | batch job started |
| 7:00:21 am | host 2 | batch job completed |
| 7:00:50 am | host 1 | batch job started |
What happened was that host 1 took longer than `lockAtMostFor` to finish, so its lock expired and host 2 acquired it and started processing, as expected. But when host 1 finally finished, its release cleared the lock that host 2 now held, and host 3 then acquired it and started processing concurrently with host 2. This continued for a while until execution times came back down.
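The mechanics can be modeled with a toy lock. This is a hypothetical, heavily simplified sketch of the failure mode, not ShedLock's actual implementation: the key detail is that `unlock` frees the lock unconditionally, without checking whether the caller still owns it.

```java
// Simplified model of a timed lock where release does not verify ownership.
public class StaleUnlockDemo {

    static long lockedUntil = 0; // epoch millis; a time in the past means "free"
    static String owner = null;

    static boolean tryLock(String host, long now, long lockAtMostForMs) {
        if (now < lockedUntil) {
            return false; // someone else holds a live lock
        }
        owner = host;
        lockedUntil = now + lockAtMostForMs;
        return true;
    }

    // The problematic release: no check that 'host' is still the owner.
    static void unlock(String host, long now) {
        lockedUntil = now; // lock becomes free immediately, whoever held it
    }

    public static void main(String[] args) {
        long m = 60_000; // one minute in millis

        tryLock("host1", 0, 5 * m);                   // host1 starts at t=0
        boolean h2Early = tryLock("host2", 4 * m, 5 * m); // blocked: lock still live
        boolean h2Late  = tryLock("host2", 6 * m, 5 * m); // lockAtMostFor expired: acquired
        unlock("host1", 7 * m);                       // host1 finishes late, clears host2's lock
        boolean h3 = tryLock("host3", 7 * m, 5 * m);  // acquired: now two nodes running

        System.out.println(h2Early + " " + h2Late + " " + h3); // false true true
    }
}
```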
Long story short, it's really important to set `lockAtMostFor` comfortably above your worst-case execution time, not just a typical one. Adding logging, alerts, or some other control mechanism to catch overlapping runs is also worth it, because there are cases where double-processing a batch would be very bad.
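One way to catch this is to record each run's start and end and flag any overlap. A minimal sketch of that idea follows; the class and method names are hypothetical, and in production the runs would live in a shared store (e.g. a database table) rather than an in-memory list, since overlaps happen across hosts:

```java
import java.util.ArrayList;
import java.util.List;

// Records completed runs and reports whether a run overlapped a prior one.
public class OverlapDetector {

    static final class Run {
        final String host;
        final long start, end;
        Run(String host, long start, long end) {
            this.host = host;
            this.start = start;
            this.end = end;
        }
    }

    static final List<Run> runs = new ArrayList<>();

    // Returns true if this run overlapped an already-recorded run --
    // the signal to log loudly or fire an alert.
    static boolean recordRun(String host, long start, long end) {
        boolean overlapped = runs.stream()
                .anyMatch(r -> start < r.end && r.start < end);
        runs.add(new Run(host, start, end));
        return overlapped;
    }

    public static void main(String[] args) {
        // Rough shape of the incident timeline (minutes, arbitrary origin)
        recordRun("host1", 0, 1);
        boolean a = recordRun("host2", 8, 14);
        boolean b = recordRun("host3", 12, 16); // overlaps host2's run
        System.out.println(a + " " + b); // false true
    }
}
```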