Recovering from machine failures

In a cloud environment, machines can fail at any time. Inferable handles machine failures transparently: the SDK sends periodic heartbeats, and when a machine stops responding, Inferable catches the affected operations and retries them on a healthy worker. This means you don’t have to worry about your services becoming unavailable because an individual machine fails.

If a machine fails to send any heartbeats within an interval (default 90 seconds):

  1. It is marked as unhealthy, and Inferable will not send any new requests to it.
  2. Any functions it had in progress are marked as failed, and Inferable will retry them on a healthy worker (see the sketch after this list).
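
A minimal sketch of this behaviour, using an illustrative in-memory model rather than Inferable’s actual implementation (all type and function names here are hypothetical; only the 90-second default comes from the docs above):

```typescript
// Conceptual model of heartbeat-based stall detection; these names are
// illustrative and not part of the Inferable SDK.

type MachineId = string;

interface FunctionCall {
  id: string;
  machineId: MachineId;
  status: "running" | "failed" | "completed";
}

const HEARTBEAT_TIMEOUT_MS = 90_000; // default 90-second interval

const lastHeartbeat = new Map<MachineId, number>();
const unhealthy = new Set<MachineId>();
const retryQueue: string[] = []; // call IDs to retry on a healthy worker

// Called whenever the SDK on a machine sends a heartbeat.
function recordHeartbeat(machineId: MachineId) {
  lastHeartbeat.set(machineId, Date.now());
  unhealthy.delete(machineId); // a recovered machine is healthy again
}

// Run periodically: marks stalled machines unhealthy and re-queues their work.
function detectStalls(inFlight: FunctionCall[]) {
  const now = Date.now();
  for (const [machineId, seenAt] of lastHeartbeat) {
    if (now - seenAt <= HEARTBEAT_TIMEOUT_MS) continue;

    // 1. Mark the machine as unhealthy; it receives no new requests.
    unhealthy.add(machineId);

    // 2. Mark its in-progress functions as failed and retry them elsewhere.
    for (const call of inFlight) {
      if (call.machineId === machineId && call.status === "running") {
        call.status = "failed";
        retryQueue.push(call.id);
      }
    }
  }
}
```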

If the machine comes back online, Inferable will mark it as healthy again and start sending new requests to it. However, it will disregard any results the machine returns for functions that were already marked as failed.
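
That stale-result check can be modelled along the same lines (again an illustrative sketch; the record shape and the handleResult function are hypothetical):

```typescript
// Conceptual model of discarding late results from a recovered machine.

interface CallRecord {
  machineId: string;
  status: "running" | "failed" | "completed";
}

const calls = new Map<string, CallRecord>();

function handleResult(callId: string, machineId: string, result: unknown) {
  const call = calls.get(callId);

  // Ignore results for calls that were already marked as failed (for example
  // because the machine stalled and the call was handed to another worker),
  // or that now belong to a different machine.
  if (!call || call.status !== "running" || call.machineId !== machineId) {
    return; // the late result is disregarded
  }

  call.status = "completed";
  // ... deliver `result` to the caller here
}
```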

However, it’s possible that the particular workload you’re executing is what makes the machine crash. To account for this, there’s a retry limit for any function call that results in a machine stall (default 0 retries, see retryCountOnStall). If the function fails more times than the retry limit allows, Inferable will mark it as permanently failed.
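
For illustration, here is roughly how that limit might be set when registering a function with the TypeScript SDK. The retryCountOnStall option is the one named above, but the surrounding registration shape (client.default.register, the schema and config fields, and the resizeImage function) is an assumption and may differ from your SDK version:

```typescript
import { Inferable } from "inferable";
import { z } from "zod";

// Assumed client setup; check your SDK version for the exact API.
const client = new Inferable({ apiSecret: process.env.INFERABLE_API_SECRET });

client.default.register({
  name: "resizeImage", // hypothetical example function
  schema: {
    input: z.object({ url: z.string() }),
  },
  func: async ({ url }: { url: string }) => {
    // A workload that might crash the machine (e.g. out-of-memory).
    return { resizedUrl: `${url}?w=1024` };
  },
  config: {
    // Allow up to 2 retries if the machine running this call stalls.
    // Default is 0, i.e. a stall fails the call immediately.
    retryCountOnStall: 2,
  },
});

// Start polling for work on this machine.
client.default.start();
```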