I often use ShedLock for batch processing in clustered environments. It works great. I hit an interesting situation recently, though, where the behavior surprised me. Basically, I had a batch job configured like this:
```java
@SchedulerLock(name = "SomeBatchJob", lockAtMostFor = "5m", lockAtLeastFor = "1m")
```
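For context, here is roughly the wiring around that annotation: a minimal sketch assuming the Spring integration with a JDBC-backed lock provider, where the schedule, class names, and method body are placeholders rather than my actual code.

```java
import net.javacrumbs.shedlock.core.LockProvider;
import net.javacrumbs.shedlock.provider.jdbctemplate.JdbcTemplateLockProvider;
import net.javacrumbs.shedlock.spring.annotation.EnableSchedulerLock;
import net.javacrumbs.shedlock.spring.annotation.SchedulerLock;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Configuration
@EnableScheduling
@EnableSchedulerLock(defaultLockAtMostFor = "10m")
class SchedulerConfig {

    // ShedLock needs a LockProvider; this one keeps its lock rows in the shared database.
    @Bean
    LockProvider lockProvider(JdbcTemplate jdbcTemplate) {
        return new JdbcTemplateLockProvider(jdbcTemplate);
    }
}

@Component
class SomeBatchJob {

    // The scheduler fires on every node, but only the node holding the lock runs the body.
    @Scheduled(cron = "0 */5 * * * *") // placeholder schedule
    @SchedulerLock(name = "SomeBatchJob", lockAtMostFor = "5m", lockAtLeastFor = "1m")
    public void run() {
        // ... batch processing ...
    }
}
```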
Per the documentation, I set lockAtMostFor above what I thought was a reasonable execution time. I had three nodes in the cluster and released it into the wild. At some point I noticed that multiple hosts were processing the batch job at the same time, so I started looking at the data. At first, everything looked normal:
| Date/Time | Host | Message |
|---|---|---|
| 6:45:00 am | host 1 | batch job started |
| 6:45:50 am | host 1 | batch job completed |
| 6:46:15 am | host 2 | batch job started |
| 6:47:30 am | host 2 | batch job completed |
...but then at some point, I saw this:
| Date/Time | Host | Message |
|---|---|---|
| 6:53:01 am | host 1 | batch job started |
| 6:58:30 am | host 2 | batch job started |
| 6:59:05 am | host 1 | batch job completed |
| 6:59:23 am | host 3 | batch job started |
| 7:00:21 am | host 2 | batch job completed |
| 7:00:50 am | host 1 | batch job started |
Basically, what happened was that host 1 took longer than lockAtMostFor to finish, so the lock expired and host 2 acquired it and started processing, as expected. However, when host 1 finally finished, it released the lock that host 2 now held, so host 3 picked it up and started processing concurrently with host 2. This cascade continued for a while, until execution times dropped back under the limit.
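To make the failure mode concrete, here is a deliberately simplified, in-memory model of a timestamp-based lock. This is not ShedLock's actual code, just a toy that reproduces the interleaving above when the release is keyed only on the lock name:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Toy model of a lock row with a lock_until timestamp. NOT ShedLock's implementation,
// just enough to show how a release keyed only on the lock name frees another host's lock.
class NaiveTimestampLock {

    private final Map<String, Instant> lockUntil = new HashMap<>();

    // A host may acquire the lock if no one holds it or the previous lock_until has passed
    // (i.e. the previous holder exceeded lockAtMostFor).
    synchronized boolean tryAcquire(String name, Duration lockAtMostFor, Instant now) {
        Instant until = lockUntil.get(name);
        if (until == null || !until.isAfter(now)) {
            lockUntil.put(name, now.plus(lockAtMostFor));
            return true;
        }
        return false;
    }

    // Release looks up the row by name only. If another host re-acquired the lock after our
    // lockAtMostFor expired, this still rewinds *their* lock_until, so a third host can jump in.
    synchronized void release(String name, Duration lockAtLeastFor, Instant acquiredAt, Instant now) {
        Instant earliestRelease = acquiredAt.plus(lockAtLeastFor);
        lockUntil.put(name, now.isAfter(earliestRelease) ? now : earliestRelease);
    }
}
```

In that model, host 1's release at 6:59:05 rewinds the lock_until that host 2 set when it acquired the lock at 6:58:30, which is why host 3 was able to start at 6:59:23 while host 2 was still running.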
Long story short, it's really important to set lockAtMostFor comfortably above the worst-case execution time, not just a typical one. It's also worth adding logging, alerts, or some other control mechanism to catch runs that exceed that limit, because there are cases where double-processing a batch would be very bad.
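As a cheap first step, something like the sketch below would at least surface the problem: a hypothetical guard that times the job and alerts when a run exceeds lockAtMostFor. The logger, cron expression, and duration constant are my additions, not anything ShedLock provides.

```java
import java.time.Duration;
import java.time.Instant;
import net.javacrumbs.shedlock.spring.annotation.SchedulerLock;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
class SomeBatchJob {

    private static final Logger log = LoggerFactory.getLogger(SomeBatchJob.class);

    // Keep this in sync with the lockAtMostFor value on the annotation below.
    private static final Duration LOCK_AT_MOST_FOR = Duration.ofMinutes(5);

    @Scheduled(cron = "0 */5 * * * *") // placeholder schedule
    @SchedulerLock(name = "SomeBatchJob", lockAtMostFor = "5m", lockAtLeastFor = "1m")
    public void run() {
        Instant start = Instant.now();
        try {
            // ... batch processing ...
        } finally {
            Duration elapsed = Duration.between(start, Instant.now());
            if (elapsed.compareTo(LOCK_AT_MOST_FOR) > 0) {
                // The lock may have expired mid-run, so another node could be processing too.
                log.error("Batch job took {} (lockAtMostFor is {}); possible concurrent execution",
                        elapsed, LOCK_AT_MOST_FOR);
            }
        }
    }
}
```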