A remote kernel in use may fail at any time because of hardware, network, or software problems. The failure is noticed the next time Parallel Computing Toolkit tries to send a command to the kernel or read a result from it. The error message Parallel::rdead notifies you of a failed remote kernel.
If the failed kernel had any processes assigned to it, these processes will be lost. If you are waiting for one of these processes with Wait, your program will never terminate because the process will never return.
Because Parallel Computing Toolkit keeps track of the commands submitted to remote kernels, it can reassign these commands to another available remote kernel if a remote kernel fails. Alternatively, it may simply terminate the waiting processes with the result $Failed, which indicates failure. The behavior is determined by the value of the variable $RecoveryMode.
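As a minimal sketch, the recovery mode could be selected before any work is queued. The sketch assumes that remote kernels have already been launched and that the mode names are given as strings; some versions may expect the symbols ReQueue and Abandon instead, so check the form your installation accepts.

```mathematica
(* Assumption: the mode is given as a string; your version may expect a symbol. *)
$RecoveryMode = "ReQueue";

(* Submit two independent processes to the remote kernels. *)
pid1 = Queue[FactorInteger[2^101 - 1]];
pid2 = Queue[FactorInteger[2^103 - 1]];

(* Collect the results. With the ReQueue mode, a process lost on a failed
   kernel is reassigned to another kernel instead of blocking Wait forever. *)
results = {Wait[pid1], Wait[pid2]}
```

The two computations here have no side effects and do not depend on each other, which is the situation in which reassigning a lost process is safe.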
The ReQueue recovery mode lets you finish a computation as long as at least one kernel remains usable. However, it may give wrong results if the remote computations produce side effects or if your computation depends on a certain number of available remote kernels. Side effects are usually present if you use virtual shared memory. There is also the possibility of a deadlock if a process on a failed kernel acquired, but never released, a shared resource.
You can use the Abandon recovery mode to implement your own failure recovery method.
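One possible shape for such a method is sketched below: the result of Wait is checked for $Failed and the command is resubmitted once. This is only an illustration under the same assumptions as before (kernels already launched, mode given as a string); the retry-once policy is a choice made for the example, not part of the toolkit.

```mathematica
(* Assumption: the mode is given as a string; your version may expect a symbol. *)
$RecoveryMode = "Abandon";

pid = Queue[FactorInteger[2^101 - 1]];
result = Wait[pid];

(* With the Abandon mode, a process lost on a failed kernel returns $Failed.
   Here the command is simply submitted once more; a real recovery method
   might first undo side effects or release shared resources. *)
If[result === $Failed,
  result = Wait[Queue[FactorInteger[2^101 - 1]]]
];

result
```

This check cannot distinguish a lost process from a computation that legitimately returns $Failed, so it fits best when the queued commands never produce that value themselves.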
Failure recovery affects only processes started with Queue[] and collected with Wait[]. Other parallel commands, such as ParallelEvaluate[], cannot handle a failed remote kernel and always return $Failed in such cases.