In this article, I present a 90-second demo of the Assimilation software doing its discovery and monitoring – without any manual configuration at all.
What is the Assimilation Software?
The Assimilation Software is an extensible and extremely scalable discovery-driven automation engine that uses its discovery data to drive other operations, including monitoring and audits. Because we do excruciatingly detailed discovery, we have all the information needed to perform these other operations with little or no human input.
Mini architectural overview
We have a central system, which we call the Collective Management Authority (CMA); it stores its data in the Neo4j graph database. Every system we manage has an agent on it, which we call a nanoprobe. Nanoprobes are policy-free: the only things they do on their own are announce when they start and when they stop. Everything else they do is because the CMA has told them to do it. In this architecture there is one central CMA for everything, and there is a nanoprobe process (agent) on every machine in the data center – including the machine the CMA is running on. For this simple demo, we have one nanoprobe, which happens to be running on the same machine as the CMA.
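To make the graph model a bit more concrete, here is a minimal sketch of the kind of query you could run against the Neo4j database once discovery has populated it. The node labels, relationship name, and properties below are illustrative only – not necessarily our actual schema:

```python
# Minimal sketch: querying discovery data out of Neo4j.
# The labels (Drone, Service), the relationship (HOSTS), and the
# properties are illustrative only -- not necessarily our real schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

with driver.session() as session:
    result = session.run(
        "MATCH (d:Drone)-[:HOSTS]->(s:Service) "
        "RETURN d.designation AS host, s.name AS service"
    )
    for record in result:
        print(record["host"], "runs", record["service"])

driver.close()
```

Because everything discovered lands in one graph, questions like "what runs where?" become one-line queries instead of spreadsheet archaeology.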
The Demo
The key things to note about this demo are:
- We wipe out the database when we start
- Nothing was given any configuration information; everything was discovered – including where the CMA is running.
A detailed explanation of all the things that happened follows the demo.
To watch the demo from the beginning, press fast or slow at the bottom of the screen below.
What Happened In The Demo
Since a lot of things happened, and they can go by pretty fast, here’s what happened in the demo:
- I manually start the CMA – and tell it to completely wipe out the database before starting (I give it no configuration).
- I manually start the nanoprobe – without giving it any configuration at all.
- The nanoprobe discovers the CMA (central system) and registers itself with it (the nanoprobe does not know a priori where the CMA is).
- The CMA sends the nanoprobe configuration information and discovery requests for a variety of things, including clients and services running on the system
- The nanoprobe performs the discovery requests and returns the results to the CMA
- The CMA sends out discovery requests to get checksums of the network-facing binaries, libraries, and JARs discovered above – this is to check for tampering (a rough sketch of this step follows the list)
- The CMA sends out monitoring requests for all the services discovered above
- The nanoprobe starts monitoring and sends success messages back to the CMA
- The nanoprobe finishes checksums and sends those results back to the CMA
- I manually stop a service
- The nanoprobe's monitoring discovers the service is not running and sends a message to the CMA
- The CMA gets the message and sends notifications to the screen and to observers
- I manually start the same service
- The nanoprobe's monitoring discovers the service is working again and sends a message to the CMA
- The CMA gets the message and sends notifications to the screen and to observers
- I manually stop the nanoprobe – simulating a normal system shutdown
- The nanoprobe sends an "I'm shutting down" message to the CMA
- The CMA reports that the system was gracefully shut down and updates the database accordingly
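To give a feel for the checksum step mentioned above, here's a rough sketch of the idea: hash the network-facing binaries and report the results. The file list and hash algorithm below are illustrative only; the real nanoprobe runs its own discovery scripts:

```python
# Rough sketch of the tamper-check idea: hash the network-facing
# binaries and report the results. The file list and hash algorithm
# are illustrative; the real nanoprobe runs its own discovery scripts.
import hashlib
import json

def checksum(path, algorithm="sha256"):
    """Return the hex digest of a file, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# A few network-facing binaries one might watch for tampering.
binaries = ["/usr/sbin/sshd", "/usr/bin/ssh"]
report = {path: checksum(path) for path in binaries}
print(json.dumps(report, indent=2))
```

The point of the exercise is that the results go back to the CMA, where they can be compared against previously recorded values – so an unexpected change stands out.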
Questions for Readers
- What was your reaction to the demo?
- What else would you like to see demonstrated?
I got this question on Facebook; it seemed good to reproduce it here:
Avi Alkalay
But now a question related to the demo: if the CMA runs on another computer, how did the messages appear after you shut down the nanoprobes?
Alan Robertson
The message you saw came from the CMA. When a nanoprobe gets a SIGTERM, it sends a reliable message to the CMA announcing that it is shutting down. If the CMA has gone away, the nanoprobe will wait up to 30 seconds before giving up and shutting down anyway. Otherwise, the normal behavior is for the CMA to get the message and ACK it right away – as happened in the demo. The retransmission interval for un-ACKed packets is 2 seconds, if I recall correctly. In a normal data center with 3 or more systems, if a nanoprobe is killed with -9 or its system crashes, then its neighbors will report it dead. So the nice thing is that we can distinguish an untimely death from a deliberate shutdown.
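To illustrate the protocol, here's a toy sketch of the shutdown announcement with retransmission. The message format and port below are made up for illustration – the real nanoprobe implements its own reliable-UDP layer in C:

```python
# Toy sketch of the shutdown announcement: retransmit every 2 seconds,
# give up after 30. The message format and port are made up for
# illustration; the real nanoprobe has its own reliable-UDP layer in C.
import socket
import time

RETRANSMIT_INTERVAL = 2.0   # seconds between resends of un-ACKed packets
GIVE_UP_AFTER = 30.0        # stop waiting and shut down anyway

def announce_shutdown(cma_addr=("127.0.0.1", 1984)):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(RETRANSMIT_INTERVAL)
    deadline = time.monotonic() + GIVE_UP_AFTER
    while time.monotonic() < deadline:
        sock.sendto(b"SHUTDOWN", cma_addr)       # (re)send the announcement
        try:
            data, _ = sock.recvfrom(1024)
            if data == b"ACK":
                return True                      # CMA acknowledged it
        except socket.timeout:
            continue                             # no ACK yet; retransmit
    return False                                 # CMA gone; shut down anyway
```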
Apparently Avi didn't get my message about posting his questions here, so I'll post and answer his follow-up question here 😉
Avi wrote: "So the shutdown message came from the CMA, acknowledging the nanoprobe's shutdown, and the nanoprobe wrote it on the terminal right before finalizing its shutdown process. Or, in this case, was the CMA running on the same machine?"
My reply:
The message on the terminal was from the CMA, not the nanoprobe. The nanoprobe sent a reliable network message to the CMA, and the *CMA* wrote that message on the screen. For convenience in the demo, I ran the nanoprobe on the same machine, but it would make no difference if it were running on a different machine – except that I would have had to ssh there to start it. Thanks for asking!
Hi! Very interesting project!
I'd like to ask whether there is already some integration with CMs (Puppet / Chef / Salt / Ansible)?
I thought of a "dump" command for the CMA that would yield a skeleton configuration file for your systems. Or perhaps you could "snapshot" what you have (on the CMA), install the nanoprobe on a new machine, receive the autodiscovery data, and then dump the diff. This would nicely externalize the autodiscovery part of the widely used CMs, so if Assimilation is better at this, it could be leveraged by those projects (which already have a large user base). Win-win 🙂
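For example, something along these lines – purely hypothetical, assuming discovery snapshots could be exported as JSON:

```python
# Purely hypothetical: diff two discovery snapshots (exported as JSON)
# to see what a CM tool would need to converge on the new machine.
# The file names and the flat key/value layout are assumptions.
import json

def load_snapshot(path):
    with open(path) as f:
        return json.load(f)

def diff_snapshots(baseline, new):
    """Return keys added, removed, and changed between two flat snapshots."""
    added   = {k: new[k] for k in new.keys() - baseline.keys()}
    removed = {k: baseline[k] for k in baseline.keys() - new.keys()}
    changed = {k: (baseline[k], new[k])
               for k in baseline.keys() & new.keys()
               if baseline[k] != new[k]}
    return added, removed, changed

baseline = load_snapshot("cma_snapshot.json")   # before the new machine
new      = load_snapshot("new_machine.json")    # after autodiscovery
added, removed, changed = diff_snapshots(baseline, new)
print(json.dumps({"added": added, "removed": removed,
                  "changed": changed}, indent=2))
```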
For the demo, it would be nice to see a couple of things:
1 – Notification to the admin when a trigger becomes true. I guess the checksums are done routinely, so if one of them fails – say the OpenSSH server on a machine fails its checksum – it would be nice to be emailed / SMSed about it.
2 – Some sort of "tree" listing of the system, just like the tree command in *nix. Each "directory" would be a machine, and machines would be placed on the tree according to their IP range. For each machine / directory, list the services monitored. Or, inverting the index, each service as a node and the machines as leaves. I guess both would be useful, depending on what the admin needs to do at the moment.
3 – How about measuring "coverage" of the monitoring? Perhaps "known and monitored processes" / "total number of processes" on each machine – something like the calculation sketched below? This would be a very nice-to-have for defense. Consider this: we usually do not know everything we admin, so if I can at least see that serverXYZ is 10% monitored and server123 is 90% monitored, there is probably more low-hanging fruit at serverXYZ, and I'd focus there.
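Something like this back-of-the-envelope calculation is what I have in mind (hypothetical – it assumes we can get both lists for each machine):

```python
# Back-of-the-envelope coverage: monitored processes / total processes.
# Hypothetical -- assumes we can get both lists for each machine.
def coverage(monitored, all_processes):
    """Fraction of running processes that are known and monitored."""
    total = set(all_processes)
    if not total:
        return 0.0
    return len(set(monitored) & total) / len(total)

machines = {
    "serverXYZ": (["sshd"], ["sshd", "nginx", "mysqld", "cron"]),
    "server123": (["sshd", "nginx", "mysqld"], ["sshd", "nginx", "mysqld", "cron"]),
}
for name, (monitored, total) in machines.items():
    print(f"{name}: {coverage(monitored, total):.0%} monitored")
```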
Hmm. I guess this is turning into some sort of wish-list.
Cheers,
Renato.
Hi Renato,
Thanks for the comments!
Let me address them in order:
0) Your suggestion is a good one. Along similar lines, someone has thought about creating a Docker clone of the data center, or of a set of services, based on the real systems. Awesome for testing!
1) There is an event API you can use to be notified of anything – it's currently a fork/exec interface suitable for running a script of your choice (see the sketch below, after item 3).
2) There is a lot of need for a user interface. Organizing by IP address isn't one that had occurred to me; for many organizations that wouldn't be a very useful way of grouping systems. My thoughts had been more along the lines of grouping systems that work together to provide a particular service. Since we have client/server dependency information, we can do that – and create a graph of it all, organized by client/server relationships.
3) There is a canned query that will tell you all unmonitored services. If it doesn't sort by host, it should ;-). There's a command-line interface for running these canned queries.
We can also tell you which IP addresses belong to unknown systems – and lots of other really cool things.
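To make item 1 above concrete: since the event API fork/execs a program of your choice, a notification hook can be as simple as a small script. How the event details arrive in the sketch below is an assumption on my part – check the project documentation for the actual interface:

```python
#!/usr/bin/env python3
# Hypothetical event hook: the CMA fork/execs a program of your choice
# on each event. How the event details arrive here (argv? stdin JSON?)
# is an assumption -- check the project documentation for the real API.
import os
import smtplib
import sys
from email.message import EmailMessage

event_type = sys.argv[1] if len(sys.argv) > 1 else "unknown"
details = sys.stdin.read()          # assumed: event details on stdin

msg = EmailMessage()
msg["Subject"] = f"Assimilation event: {event_type}"
msg["From"] = "cma@example.com"
msg["To"] = os.environ.get("ADMIN_EMAIL", "admin@example.com")
msg.set_content(details)

with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
    smtp.send_message(msg)
```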
Of course, this is an open source project, so I highly recommend you join the project and contribute – at least by giving it a trial. The potential of this project goes far beyond what you've touched on here – but it's clear you get the idea ;-).