Remote workers and idempotency

In Camunda there is a concept called External Tasks. See External Tasks allows new Use Cases with Camunda BPM or External Tasks in the docs. The basic idea is simple: Camunda does not actively call a service (which would be PUSH); instead, workers fetch work items queued for them (PULL). Whenever a worker finishes a work item, it reports completion back to Camunda.

Workers can use the Java API, but most often leverage the REST API, as this allows workers to run as their own processes. That in turn lets you scale the workers independently and implement them in whatever language you like. It also allows on-premise workers in your private network to access a cloud-hosted engine.
Whenever you are talking REST over the wire, you don’t have transactional guarantees. Why is this important? Let’s look at an example:
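
To make the PULL pattern concrete, here is a minimal sketch of a worker talking to the Camunda 7 REST API with Java’s built-in HttpClient. The base URL, worker id and topic name are made-up values for illustration, and a real worker would parse the JSON response and run in a loop; the sketch only shows the two calls that matter for the discussion below: fetchAndLock and complete.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SimpleWorker {

  // Hypothetical values, for illustration only
  static final String ENGINE = "http://localhost:8080/engine-rest";
  static final String WORKER_ID = "invoice-worker-1";

  public static void main(String[] args) throws Exception {
    HttpClient http = HttpClient.newHttpClient();

    // 1. PULL: fetch and lock at most one task for the topic "charge-credit-card",
    //    locking it exclusively for this worker for 30 seconds.
    String fetchBody = "{\"workerId\": \"" + WORKER_ID + "\", \"maxTasks\": 1,"
        + " \"topics\": [{\"topicName\": \"charge-credit-card\", \"lockDuration\": 30000}]}";
    HttpResponse<String> fetched = http.send(
        HttpRequest.newBuilder()
            .uri(URI.create(ENGINE + "/external-task/fetchAndLock"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(fetchBody))
            .build(),
        HttpResponse.BodyHandlers.ofString());
    System.out.println("Locked tasks: " + fetched.body());

    // 2. ...do the actual work here, then report completion back to Camunda.
    //    The task id would be parsed from the JSON response above.
    String taskId = "some-task-id"; // placeholder
    http.send(
        HttpRequest.newBuilder()
            .uri(URI.create(ENGINE + "/external-task/" + taskId + "/complete"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString("{\"workerId\": \"" + WORKER_ID + "\"}"))
            .build(),
        HttpResponse.BodyHandlers.ofString());
  }
}
```

Both of these are plain HTTP calls, and either one can fail on the network. That is exactly where the trouble starts.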

Failure scenarios of remote workers

Your worker fetches a task, which gets locked for it exclusively, but the response data gets lost in the network. Now there is a task locked on the Camunda side that will not be processed, as the worker never received it. This is not a big deal: you just have to wait for the lock timeout (the lock duration configured when fetching), and the task will then be handed over to the next worker.

Now let’s assume your worker got the task data, performed the work, and then calls the complete method, which fails due to network problems. Now you cannot tell whether the call came through and the task was completed in Camunda (and the workflow moved on), or whether the task was not completed on the Camunda side. So it is not a good idea to roll back the worker’s work, because if the task was marked as completed, the work would never be carried out. You can retry the call to Camunda, but you might face a longer network outage. In this case the best strategy is to ignore the problem: if the task was completed, everything is fine; if not, the worker will get the work again (after the lock timeout). But this means you have to make the service your worker calls idempotent, or add some logic in the worker for de-duplication.
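
Here is a sketch of this “ignore and rely on the lock timeout” strategy, combined with simple de-duplication on the worker side. All names (chargeCreditCard, completeTask, the in-memory set) are hypothetical; in practice the de-duplication record would live in a durable store.

```java
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CompletionHandling {

  // Placeholder for a durable de-duplication store (e.g. a database table with
  // a unique key); an in-memory set is only used to keep the sketch short.
  private final Set<String> processedBusinessKeys = ConcurrentHashMap.newKeySet();

  void handle(String taskId, String businessKey) {
    // De-duplication: if the work for this business key was already done,
    // a re-delivered task (e.g. after a lock timeout) must not do it again.
    if (processedBusinessKeys.add(businessKey)) {
      chargeCreditCard(businessKey); // the actual, ideally idempotent, service call
    }

    try {
      completeTask(taskId); // POST /external-task/{id}/complete
    } catch (IOException e) {
      // The call may or may not have reached Camunda. Do NOT roll back the work:
      // if the task was in fact completed, the workflow has moved on and the work
      // would otherwise never be carried out. Just log and move on; if the complete
      // did not go through, the task is handed out again after the lock timeout and
      // the de-duplication above keeps it from being executed twice.
      System.err.println("complete() failed for task " + taskId
          + ", relying on the lock timeout: " + e.getMessage());
    }
  }

  private void chargeCreditCard(String businessKey) { /* call the downstream service */ }

  private void completeTask(String taskId) throws IOException { /* REST call as shown above */ }
}
```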

A similar problem might arise when you start a new workflow instance. If this call fails, you cannot know whether the workflow was successfully kicked off or not. This time you have to make the workflow instantiation procedure idempotent. Currently there is no out-of-the-box feature for this in Camunda, so you have to take care of it yourself.

Typical strategies are:

  • Set the so-called businessKey in workflow instances and add a unique constraint on the businessKey column in the Camunda database. This is possible and you don’t lose support when doing it. When starting the same instance twice, the second instance will not be created due to the key violation; see the sketch after this list.
  • Add a check to a freshly instantiated workflow instance to see whether another instance is already running for the same data. Depending on the exact environment this might be very easy, or quite complex if you have to avoid race conditions.
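
Here is a minimal sketch of the businessKey approach referenced above, again using plain REST calls. The process definition key order-process and the index name are assumptions, and the exact table, column and NULL handling for the unique constraint depend on your Camunda version and database, so treat the SQL comment as a pointer rather than a recipe.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IdempotentStart {

  static final String ENGINE = "http://localhost:8080/engine-rest"; // hypothetical

  public static void main(String[] args) throws Exception {
    // Done once, directly on the Camunda database (schema details vary), e.g.:
    //   CREATE UNIQUE INDEX UQ_BUSINESS_KEY ON ACT_RU_EXECUTION(BUSINESS_KEY_);

    HttpClient http = HttpClient.newHttpClient();
    String orderId = "order-4711"; // a natural business key, e.g. the order id

    HttpResponse<String> response = http.send(
        HttpRequest.newBuilder()
            .uri(URI.create(ENGINE + "/process-definition/key/order-process/start"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString("{\"businessKey\": \"" + orderId + "\"}"))
            .build(),
        HttpResponse.BodyHandlers.ofString());

    if (response.statusCode() == 200) {
      System.out.println("Instance started: " + response.body());
    } else {
      // With the unique constraint in place, starting a second instance with the
      // same business key fails with a key violation, which surfaces here as an
      // error response. That is exactly what we want: no duplicate instance.
      System.out.println("Start rejected, instance probably already running: " + response.body());
    }
  }
}
```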

That’s it. I hope you are now aware of why your services need to be idempotent and how you should deal with network problems when calling the Camunda REST API. Think about an idempotency strategy when starting workflow instances. While all this might sound like a lot of things to take care of, it is just everyday life in distributed systems 🙂
