Your tools run on your own machines. The Inferable control plane sends instructions to those machines via the SDK.

In a cloud environment, machines can fail at any time. This means that at any given time:

  1. Your tool executions may be interrupted
  2. There might not be a healthy machine available to run your tool

Stalled Machines

The SDK on each of your machines periodically sends heartbeats to the Inferable control plane. If a machine fails to send a heartbeat within an interval (default 30 seconds), Inferable marks it as unhealthy, stops sending it new requests, and marks any functions that were running on it as failed.

If the machine comes back online, Inferable marks it as healthy again and resumes sending it new requests. However, it disregards any results the machine returns for functions that were already marked as failed.
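
The behaviour described above can be sketched as a small in-memory model. This is an illustration only, not Inferable's actual implementation: the HeartbeatTracker class, its method names, and the explicit timestamps are all hypothetical, and only the 30-second default comes from the text.

```typescript
// Hypothetical model of heartbeat tracking; not Inferable's real code.
type CallStatus = "running" | "failed" | "success";

interface TrackedCall {
  machineId: string;
  status: CallStatus;
}

class HeartbeatTracker {
  private lastSeen = new Map<string, number>();
  private calls = new Map<string, TrackedCall>();

  // intervalMs defaults to 30 seconds, matching the default in the text.
  constructor(private readonly intervalMs: number = 30000) {}

  // Record a heartbeat received from a machine at time `now` (milliseconds).
  heartbeat(machineId: string, now: number): void {
    this.lastSeen.set(machineId, now);
  }

  // A machine is healthy if its last heartbeat falls within the interval.
  isHealthy(machineId: string, now: number): boolean {
    const seen = this.lastSeen.get(machineId);
    return seen !== undefined && now - seen <= this.intervalMs;
  }

  // New requests are only sent to healthy machines.
  startCall(callId: string, machineId: string, now: number): boolean {
    if (!this.isHealthy(machineId, now)) return false;
    this.calls.set(callId, { machineId, status: "running" });
    return true;
  }

  // Mark every running call on an unhealthy machine as failed.
  sweep(now: number): void {
    this.calls.forEach((call) => {
      if (call.status === "running" && !this.isHealthy(call.machineId, now)) {
        call.status = "failed";
      }
    });
  }

  // Accept a result only while the call is still running;
  // results for calls already marked as failed are disregarded.
  submitResult(callId: string): boolean {
    const call = this.calls.get(callId);
    if (!call || call.status !== "running") return false;
    call.status = "success";
    return true;
  }

  statusOf(callId: string): CallStatus | undefined {
    return this.calls.get(callId)?.status;
  }
}
```

In this model, a machine that misses its heartbeat window has its running calls failed on the next sweep. If it later sends a heartbeat, isHealthy returns true again, but submitResult still rejects results for the calls that were already failed.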

Recovering Functions on Stalled Machines

If a function stalls because of a machine failure, Inferable will not retry it by default. This is because Inferable cannot know whether the function is safe to retry, or whether it should be retried at all.

To retry a function, set the retryCountOnStall option to a positive number. Inferable will then retry the function up to the specified number of times before marking it as failed.
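
As a rough sketch of the retry rule above, the decision could look like the following. Only the option name retryCountOnStall comes from the text. The StallConfig interface, the resolveStall function, and the assumption that an unset option behaves like 0 are hypothetical, and this is not the SDK's registration API.

```typescript
// Hypothetical config shape; only `retryCountOnStall` is named in the docs.
interface StallConfig {
  retryCountOnStall?: number; // assumed to behave like 0 (no retries) when unset
}

type StallOutcome = "retry" | "failed";

// Decide what happens to a stalled call, given how many
// stall retries have already been used for it.
function resolveStall(config: StallConfig, retriesUsed: number): StallOutcome {
  const allowed = config.retryCountOnStall ?? 0;
  return retriesUsed < allowed ? "retry" : "failed";
}

// Usage: a tool that opts in to two retries on stall.
const idempotentTool: StallConfig = { retryCountOnStall: 2 };
const firstStall = resolveStall(idempotentTool, 0); // "retry"
const thirdStall = resolveStall(idempotentTool, 2); // "failed"
```

Opting in per tool keeps the decision with the author, who knows whether the function is idempotent and therefore safe to run more than once.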

Recovering Stalled Functions on Healthy Machines

Sometimes a function stalls because of a bug in its code, or because of an underlying issue that doesn’t propagate to the function context. In these cases, Inferable has no way of determining whether the function has stalled or is genuinely long-running.

To account for this, each tool has a timeoutSeconds option. If a function runs for longer than the timeout, Inferable marks it as failed and retries it as long as there are retries left.
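
The timeout rule can be sketched in the same spirit. The option names timeoutSeconds and retryCountOnStall come from the text. Everything else here (the TimeoutConfig and InFlightCall types, the checkTimeout function, and the assumption that "retries left" is counted against retryCountOnStall) is hypothetical and shown only for illustration.

```typescript
// Hypothetical types; only the two option names appear in the docs.
interface TimeoutConfig {
  timeoutSeconds: number;
  retryCountOnStall?: number; // assumption: the retry budget for timeouts too
}

interface InFlightCall {
  startedAtMs: number; // when the current attempt started
  retriesUsed: number; // retries already consumed by this call
}

type TimeoutOutcome = "running" | "retry" | "failed";

// Check a call against its timeout at time `nowMs`.
// Within the timeout, the call is left alone. Past it, the attempt is
// treated as failed and retried only if retries remain.
function checkTimeout(
  config: TimeoutConfig,
  call: InFlightCall,
  nowMs: number
): TimeoutOutcome {
  const elapsedSeconds = (nowMs - call.startedAtMs) / 1000;
  if (elapsedSeconds <= config.timeoutSeconds) return "running";
  const allowed = config.retryCountOnStall ?? 0;
  return call.retriesUsed < allowed ? "retry" : "failed";
}
```

A genuinely long-running tool would need a timeoutSeconds value larger than its slowest expected run. Otherwise a check like this would treat a healthy call as stalled.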