The previous watchdog requests required waiting for a reply, which could slow down the system under load and also lead to false errors on slower systems (e.g. the CI). Because of this, this commit replaces the watchdog requests with asynchronous heartbeat messages, which don't require replies. Instead, we record the last heartbeat timestamp whenever we receive a heartbeat message. We then periodically check the time since the last received heartbeat message and consider the connection closed if too much time has passed.
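The heartbeat bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the `HeartbeatTracker` type, its method names, and the explicit `now` parameters are assumptions chosen to keep the example deterministic.

```rust
use std::time::{Duration, Instant};

/// Hypothetical tracker for one connection (illustrative name, not a dora type).
pub struct HeartbeatTracker {
    /// Time at which the most recent heartbeat message was received.
    last_heartbeat: Instant,
    /// Maximum allowed silence before the connection counts as closed.
    timeout: Duration,
}

impl HeartbeatTracker {
    pub fn new(now: Instant, timeout: Duration) -> Self {
        Self { last_heartbeat: now, timeout }
    }

    /// Called whenever a heartbeat message arrives. No reply is sent.
    pub fn record_heartbeat(&mut self, now: Instant) {
        self.last_heartbeat = now;
    }

    /// Called periodically: true if too much time has passed since the
    /// last received heartbeat.
    pub fn is_connection_closed(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_heartbeat) > self.timeout
    }
}
```

In a real event loop, `now` would be `Instant::now()` and the check would run on a timer. The timeout value is a tuning choice that the text above does not specify.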
We synchronize the start of nodes (even across machines) to avoid missed messages. This led to deadlock issues when nodes exited before they initialized the connection to the dora daemon. The reason was that the other nodes were still waiting for the stopped node to become ready.
This commit fixes this issue by properly handling node exits that occur before the node sends its subscribe message. If a node exits early, it is removed from the pending list and added to a new 'exited before init' list. Once all nodes have either subscribed or exited, we answer the pending subscribe requests. If any node exited before subscribing, we send an error reply to the other nodes because a synchronized start is no longer possible.
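A rough sketch of this bookkeeping, assuming nodes are identified by plain strings. The `PendingNodes` and `StartReply` names are hypothetical and only illustrate the logic; they are not the daemon's actual types.

```rust
use std::collections::BTreeSet;

/// Reply for the subscribe requests that were held back (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum StartReply {
    /// Every node subscribed, so a synchronized start is possible.
    Ready,
    /// At least one node exited before subscribing.
    Error { exited_before_init: Vec<String> },
}

#[derive(Default)]
pub struct PendingNodes {
    /// Nodes that have neither subscribed nor exited yet.
    waiting: BTreeSet<String>,
    /// Nodes whose subscribe request is waiting for an answer.
    subscribed: BTreeSet<String>,
    /// Nodes that exited before sending their subscribe message.
    exited_before_init: BTreeSet<String>,
}

impl PendingNodes {
    pub fn new(nodes: impl IntoIterator<Item = String>) -> Self {
        Self { waiting: nodes.into_iter().collect(), ..Default::default() }
    }

    pub fn handle_subscribe(&mut self, node: &str) -> Option<StartReply> {
        if self.waiting.remove(node) {
            self.subscribed.insert(node.to_owned());
        }
        self.outcome()
    }

    pub fn handle_exit(&mut self, node: &str) -> Option<StartReply> {
        // Only exits of nodes that never subscribed are recorded here.
        if self.waiting.remove(node) {
            self.exited_before_init.insert(node.to_owned());
        }
        self.outcome()
    }

    /// Returns `None` while some node is still waiting. Afterwards it returns
    /// the reply; a real implementation would send that reply only once.
    fn outcome(&self) -> Option<StartReply> {
        if !self.waiting.is_empty() {
            return None;
        }
        if self.exited_before_init.is_empty() {
            Some(StartReply::Ready)
        } else {
            let exited = self.exited_before_init.iter().cloned().collect();
            Some(StartReply::Error { exited_before_init: exited })
        }
    }
}
```

Because an early exit moves the node out of the waiting set, the remaining nodes no longer block on it, which is the point of the deadlock fix.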
To make this logic work across different machines too, we add a `success` field to the `AllNodesReady` message that the daemon sends to the coordinator. The coordinator forwards this flag to other daemons so that they can act accordingly.
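As an illustration of the cross-machine case, the sketch below models the message with its `success` field. Only `AllNodesReady` and `success` come from the description above. The `dataflow_id` field, the `combined_success` helper, and the rule that the forwarded flag is the logical AND of all daemon reports are assumptions made for this example.

```rust
/// Message a daemon sends to the coordinator once its local nodes are done
/// subscribing (or have exited). `dataflow_id` is an assumed field.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AllNodesReady {
    pub dataflow_id: String,
    /// False if a local node exited before it subscribed.
    pub success: bool,
}

/// Assumed coordinator-side rule: the flag forwarded to the daemons is true
/// only if every daemon reported success.
pub fn combined_success(reports: &[AllNodesReady]) -> bool {
    reports.iter().all(|report| report.success)
}
```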
The coordinator is our control plane and should not be involved in data plane operations. This way, the dataflow can continue even if the coordinator fails.
Removes the separate `dora-runtime` binary. The runtime can now be started by passing `--run-dora-runtime` to `dora-daemon`. This change makes setup and deployment easier since it removes one executable that needs to be copied across machines.
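One way such a flag could select the mode inside a single executable is sketched below. The `Mode` enum and `mode_from_args` function are hypothetical, and the real binary may parse its arguments differently (for example through an argument-parsing crate).

```rust
/// Role the single `dora-daemon` executable should take (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Mode {
    Daemon,
    Runtime,
}

/// Chooses the mode from the raw command-line arguments.
pub fn mode_from_args<I>(args: I) -> Mode
where
    I: IntoIterator<Item = String>,
{
    // Skip the program name, then look for the runtime flag.
    if args.into_iter().skip(1).any(|arg| arg == "--run-dora-runtime") {
        Mode::Runtime
    } else {
        Mode::Daemon
    }
}
```

A `main` function could then call `mode_from_args(std::env::args())` and branch on the result.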
In #237, I grouped the YAML validation into a method of the descriptor.
This validation was copied from the `cli check` method. However, the
original `cli check` did not validate shell commands and did not accept
URLs as valid sources.
This pull request validates both kinds of source.
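A minimal sketch of what validating the different source kinds could look like. Everything here is an assumption for illustration: the `SourceKind` enum, the `"shell"` marker value, the `args` parameter, and the URL prefix check are not taken from the actual descriptor code.

```rust
use std::path::Path;

/// Kinds of node source distinguished by this sketch (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum SourceKind {
    Shell,
    Url,
    LocalPath,
}

/// Assumed marker value that selects a shell command as the source.
pub const SHELL_SOURCE: &str = "shell";

pub fn classify_source(source: &str) -> SourceKind {
    if source == SHELL_SOURCE {
        SourceKind::Shell
    } else if source.starts_with("http://") || source.starts_with("https://") {
        SourceKind::Url
    } else {
        SourceKind::LocalPath
    }
}

/// Validates a source. `args` is the assumed shell command line.
pub fn validate_source(source: &str, args: Option<&str>) -> Result<SourceKind, String> {
    match classify_source(source) {
        // A shell source needs a non-empty command to run.
        SourceKind::Shell => match args {
            Some(command) if !command.trim().is_empty() => Ok(SourceKind::Shell),
            _ => Err("shell source requires a non-empty command".to_string()),
        },
        // URLs are accepted as-is; nothing is downloaded during validation.
        SourceKind::Url => Ok(SourceKind::Url),
        // Local sources must exist on disk.
        SourceKind::LocalPath => {
            if Path::new(source).exists() {
                Ok(SourceKind::LocalPath)
            } else {
                Err(format!("source path `{source}` does not exist"))
            }
        }
    }
}
```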
As we are not currently using zenoh communication, it would be preferable
not to mention it in the dataflow graph, as some people might:
- A. Confuse it with our shared memory.
- B. Question why it is there.
- C. Question what zenoh is.
I think that we can support dora-rs without an external communication config,
as I can see many use cases in simulation.