The Assimilation system management suite implements a unique Reliable No News Is Good News (RNNIGN) protocol for communication between the central system (the CMA) and its agents (nanoprobes). The key idea is a way of interacting with our agents that does two apparently contradictory things at once: it is highly reliable, yet it takes a no-news-is-good-news approach. It’s a bit like a 911 (emergency call) service that waits for calls to come in telling us important things. That’s the “no-news-is-good-news” part. However, we’re unlike a real-life 911 service, because we will always get a call when something goes wrong or changes. That’s the reliable part. Our ability to do those two apparently contradictory things at once is what lets us scale. The key is to delegate the work to our agents in a fair way, so that no agent’s workload grows as we manage more and more systems. Perhaps the best thing about this is that we do it in a way that’s very simple.
Reliable No News Is Good News Protocol Properties
The remainder of this page is for those of you who care about the technical details. If you don’t, that’s all good; please feel free to go read something else. If you do care about technical details, you will probably be excited to learn more.
- Specifically designed for C2I (Command, Control, Intelligence) applications like those of the Assimilation system management suite. Not designed for bulk information transfer.
- Designed to be as quiet as possible on the network. It is expected that servers will have connections to the CMA which commonly exchange no packets with the CMA for months at a time. When needed, either end of the communication can begin communicating again with no special provisions.
- Designed for massive scale in terms of open logical connections to the central server.
- Built on top of UDP. Our central system (CMA) requires only one open socket to talk to any number of agents. RNNIGN incorporates acknowledgements, generation numbers and sequence numbers in order to build a reliable protocol on top of UDP.
- Acknowledgements are at user-level, and occur after it is known that information received has been acted on or will not be lost.
- Uses public key elliptic curve cryptography from libsodium. We have written a related blog post giving background about the cryptographic aspects of this protocol, and one giving more details about how we use cryptography in the RNNIGN protocol.
- Authentication of agents is through trust-on-first-use. The CMA is authenticated by distributing its public key to all agents. Nanoprobe key names are based on the system name and the md5 sum of their public key. CMA keys are named in a way that cannot clash with nanoprobe key names. Authorization of the CMA role is performed by looking at the name of the key used to sign received packets. Nanoprobes only have copies of the public key of the CMA.
- Provides compression using zlib. Compression is required because UDP limits packet size and our larger discovery payloads would not otherwise fit.
- Provides strong guarantees of data integrity using SHA256 – in addition to the libsodium cryptographic guarantees.
- Low-level data encoding is a binary type-length-value format. Our implementation has a 2-byte type field and a 3-byte length field. Having such a long length field may seem odd since UDP limits packets to fewer than 2^16 bytes. However, this length is the uncompressed size of a value. Since all our bulkiest fields are JSON, and JSON is highly compressible (on the order of 4 or 5 to 1), we may need to support fields whose uncompressed length is on the order of 300 kbytes. The layout and processing of these packets is discussed in more detail in an old blog article.
- The key payloads in most operations are encoded as JSON strings (we like JSON). As noted above, JSON is highly compressible.
- Our protocol includes replay attack prevention through the use of monotonically-increasing generation numbers. When a system restarts, the sequence number resets to zero, but the generation number increases. Generation numbers are created based on system time.
- Restarts of the protocol occur as communication occurs. This means that restarting the CMA does not require talking to every endpoint all at once – either when the CMA shuts down or when it comes back up. Connections are repaired when communication is needed. For a large number of endpoints with infrequent communication, this is a win, perhaps a big win. In effect, connections to the CMA persist through CMA reboots without packet loss.
- Discovery of agent death (system crash) occurs through our unique distributed O(1) heartbeat system. Each system sends heartbeats to and expects heartbeats from a fixed number (currently two) of neighbor systems, so the heartbeat work per system stays constant no matter how many systems are managed. This is described in more detail in the technical talk videos. This combines the communication and system monitoring functions in a way that’s very helpful to both.
- System crash (agent death) detection has no single point of failure (SPOF).
- Heartbeats are not encrypted or digitally authenticated. This simplifies key management.
- Although the roles of the CMA and nanoprobes are distinct, the protocol itself is role-agnostic – that is, it does not know or care about the respective behaviors and roles of the CMA versus the nanoprobe.
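
To make the type-length-value property above concrete, here is a minimal Python sketch of a frame with a 2-byte type field and a 3-byte length field. It is not the Assimilation implementation: the big-endian byte order, the function names `encode_tlv` and `decode_tlvs`, the type number `0x0010`, and the sample JSON are all assumptions made for illustration. The zlib step is applied to the whole encoded byte string purely to show why a JSON-heavy payload benefits from compression; how compression is actually framed on the wire is not covered here.

```python
import json
import struct
import zlib

TYPE_BYTES = 2      # from the text: 2-byte type field
LENGTH_BYTES = 3    # from the text: 3-byte length field
HEADER_BYTES = TYPE_BYTES + LENGTH_BYTES
MAX_VALUE_LEN = (1 << (8 * LENGTH_BYTES)) - 1   # 16,777,215 bytes


def encode_tlv(field_type: int, value: bytes) -> bytes:
    """Encode one field as type (2 bytes) + length (3 bytes) + value.

    Byte order is assumed to be big-endian (network order).
    """
    if not 0 <= field_type < (1 << (8 * TYPE_BYTES)):
        raise ValueError("type does not fit in 2 bytes")
    if len(value) > MAX_VALUE_LEN:
        raise ValueError("value does not fit in a 3-byte length")
    header = struct.pack(">H", field_type) + len(value).to_bytes(LENGTH_BYTES, "big")
    return header + value


def decode_tlvs(buf: bytes) -> list:
    """Decode a concatenation of TLV fields into (type, value) pairs."""
    fields = []
    offset = 0
    while offset < len(buf):
        if len(buf) - offset < HEADER_BYTES:
            raise ValueError("truncated TLV header")
        (field_type,) = struct.unpack_from(">H", buf, offset)
        length = int.from_bytes(buf[offset + TYPE_BYTES:offset + HEADER_BYTES], "big")
        start = offset + HEADER_BYTES
        end = start + length
        if end > len(buf):
            raise ValueError("truncated TLV value")
        fields.append((field_type, buf[start:end]))
        offset = end
    return fields


# Hypothetical JSON payload; repetitive JSON like this compresses well.
payload = json.dumps({"example": ["eth0"] * 200}).encode("utf-8")
frame = encode_tlv(0x0010, payload)
compressed = zlib.compress(frame)
restored = decode_tlvs(zlib.decompress(compressed))
```

The 3-byte length allows values up to 2^24 - 1 bytes before compression, which is comfortably above the roughly 300 kbyte uncompressed fields mentioned above.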
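
The replay-prevention property above (sequence numbers that reset to zero on restart, paired with a generation number that only increases) can be sketched as a small receiver-side filter. This is an illustration under stated assumptions, not the protocol's actual logic: the names `ReplayFilter`, `PeerState` and `new_generation` are hypothetical, the generation is assumed to be whole seconds of system time, and the sketch only requires sequence numbers to increase rather than modelling retransmission or in-order delivery.

```python
import time
from dataclasses import dataclass


@dataclass
class PeerState:
    generation: int = -1   # highest generation seen from this peer
    last_seq: int = -1     # highest sequence number seen in that generation


class ReplayFilter:
    """Hypothetical receiver-side check for (generation, sequence) pairs."""

    def __init__(self):
        self._peers = {}   # peer name -> PeerState

    def accept(self, peer: str, generation: int, seq: int) -> bool:
        state = self._peers.setdefault(peer, PeerState())
        if generation < state.generation:
            # Packet from before the peer's latest restart: treat as a replay.
            return False
        if generation > state.generation:
            # Peer restarted: adopt the new generation, sequence starts over.
            state.generation = generation
            state.last_seq = seq
            return True
        if seq <= state.last_seq:
            # Same generation, sequence already seen: duplicate or replay.
            return False
        state.last_seq = seq
        return True


def new_generation(now=None) -> int:
    """Assumed derivation: generation number taken from system time in seconds."""
    return int(time.time() if now is None else now)
```

Because the generation comes from the clock, a restarted system naturally announces a larger generation than any packet captured before the restart, so old packets with low sequence numbers are still rejected.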
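
The constant per-system heartbeat cost described above can be illustrated with a small sketch. It assumes the systems are arranged in a logical ring so that each one's two neighbors are simply the previous and next members; the ring arrangement, the function names `ring_neighbors` and `changed_members`, and the host names are assumptions made for illustration, not details taken from this page.

```python
from typing import List, Tuple


def ring_neighbors(ring: List[str], name: str) -> Tuple[str, ...]:
    """Return the systems that `name` exchanges heartbeats with.

    Each member gets at most two neighbors regardless of the ring size.
    """
    n = len(ring)
    i = ring.index(name)
    if n == 1:
        return ()                      # nobody else to monitor
    if n == 2:
        return (ring[1 - i],)          # the only other member
    return (ring[(i - 1) % n], ring[(i + 1) % n])


def changed_members(before: List[str], after: List[str]) -> List[str]:
    """Existing members whose neighbor set differs between two ring layouts."""
    common = [m for m in before if m in after]
    return sorted(
        m for m in common
        if set(ring_neighbors(before, m)) != set(ring_neighbors(after, m))
    )


ring = ["host-a", "host-b", "host-c", "host-d"]   # hypothetical host names
bigger = ring + ["host-e"]
touched = changed_members(ring, bigger)           # only the two systems next to the newcomer
```

In this sketch, adding a system changes the neighbor assignments of only the two systems adjacent to it, which is why the work per agent does not grow with the number of managed systems.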