Darryl: What do you call this, love?
Sal: Ice cream.
Darryl: But yeah, it’s what you do with it.
Dale: How do you do it, Mum?
Sal: Scooped it out of the punnet!
The above is a quote from the iconic Australian movie The Castle. It is a rite-of-passage movie whose vernacular and lines have embedded themselves in everyday dialogue, and it is a movie I give to any friends immigrating to Australia! There is a passage in it that goes “It’s what you do with it”, and that got me thinking about logging, data, and telemetry, something I am passionate about.
There is such rich data within all areas of the data center. It comes from switches, routers, server infrastructure, network overlays, application interactions, and all the relationships between them. Telemetry from every aspect of the data center is rich, contextual, actionable, and, in recent years, very accessible. Yet when I speak to customers, their strategies for using it vary widely.
Some people are planning on logging. Others think they should get around to it but are more concerned with getting the next flash-bang widget. Then there is a group who do log EVERYTHING to a SIEM or log tool (Log Insight, Splunk, ELK stack). They log events from all devices, and generally the verbose output is fed to a centralised or off-premises solution like the aforementioned products. That is great. I then ask them the question “What are you doing with that information?” This is usually met with very blank looks or knee-jerk responses.
Like Smaug sitting upon his hoard of gold in The Lonely Mountain, it is going to waste. The real power of data from distributed systems, the telemetry about actual state and events occurring in real time, is being squandered. So how do you unlock the data and start using it?
Before you undertake centralised log management you should understand the baseline in your environment. If you choose to store logs on premises, there are considerations around storage, compute, archival, and consumption. Events per second (EPS) is another key on-premises metric: based on the number of devices and the events they generate each second, the ingest rate of the logging platform becomes a sizing exercise, and sharding, workers, and distributed ingestion all help here. Conversely, when storing logs in the cloud you are generally charged per GB of logs stored. Depending on your requirements, what you are doing, the infrastructure required, data sovereignty, and the amount of use, the decision to place the solution on premises versus off premises will come down to these factors.
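To make that sizing exercise concrete, here is a back-of-envelope sketch. Every figure in it (the device classes, counts, EPS rates, and event sizes) is a made-up assumption to illustrate the arithmetic, not a recommendation:

```python
# Rough log-volume sizing sketch. All figures are illustrative
# assumptions; substitute the baseline numbers from your environment.

SECONDS_PER_DAY = 86_400

# Hypothetical device classes:
# (device count, average EPS per device, average event size in bytes)
device_classes = {
    "firewalls":   (4,   500, 300),
    "switches":    (40,   20, 200),
    "hypervisors": (60,   50, 250),
}

total_eps = 0
daily_bytes = 0
for name, (count, eps, event_size) in device_classes.items():
    class_eps = count * eps
    total_eps += class_eps
    daily_bytes += class_eps * event_size * SECONDS_PER_DAY

daily_gb = daily_bytes / 1024**3
print(f"Aggregate ingest rate: {total_eps:,} EPS")
print(f"Raw volume per day:    {daily_gb:,.1f} GB")
print(f"Raw volume per year:   {daily_gb * 365 / 1024:,.1f} TB")
```

Numbers like these feed both conversations: whether your on-premises platform can ingest the aggregate EPS, and what a per-GB cloud bill would look like.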
Step One – Plan to Capture the Data
What are your desired outcomes for the logging data? What do you want to do with it? What are the business requirements? Assess the devices that need to be monitored: virtual and physical infrastructure, firewalls, routers, and applications are just some examples. Most platforms, be it VMware, Cisco, Juniper, Oracle, or SAP, allow you to ship logs to the FQDN or VIP of a logging platform.
If you have a targeted use case or know exactly what you are after, you can direct only what you need; a security focus might capture maximum connections, session teardowns, and SYN floods, for example. If you are unsure and have the capacity, logging everything allows you to use the data for much more.
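As a minimal sketch of both ideas, the snippet below ships application logs to a remote collector using Python’s standard syslog handler and, optionally, filters down to security-relevant events first. The hostname logs.example.com, the port, and the keyword list are all assumptions for illustration:

```python
import logging
import logging.handlers

# Ship logs to the FQDN/VIP of a logging platform. The hostname
# and port here are placeholders; use your collector's details.
handler = logging.handlers.SysLogHandler(address=("logs.example.com", 514))

class SecurityEventFilter(logging.Filter):
    """Optional: pass only security-relevant events to the collector.

    The keyword list is a made-up example of a targeted use case.
    Drop the filter entirely if you have the capacity to log everything.
    """
    KEYWORDS = ("syn flood", "session teardown", "max connections", "deny")

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage().lower()
        return any(keyword in message for keyword in self.KEYWORDS)

handler.addFilter(SecurityEventFilter())

logger = logging.getLogger("edge-firewall")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.warning("SYN flood detected on outside interface")  # forwarded
logger.info("Routine health check OK")                     # filtered out
```

Network devices achieve the same thing with their native syslog forwarding; the principle is identical: point everything at one well-known FQDN or VIP.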
Step Two – Consume and Use
Real-time logging can provide insight into what has occurred in recent history and give context to service degradation. Visualisation engines such as those in Cacti, Kibana, Splunk, or Log Insight graph on pre-defined parameters and attributes: historical events per second, trending occurrences over time, consumption, and more.
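As a small sketch of what sits underneath those graphs (the file name and timestamp format are assumptions), you can bucket raw log lines into per-minute event counts, which is exactly the kind of series a visualisation engine plots:

```python
from collections import Counter
from datetime import datetime

def events_per_minute(path: str) -> Counter:
    """Count events per minute from an ISO-timestamped log file.

    Assumes lines like: "2024-05-01T10:32:07 host proc: message".
    """
    buckets: Counter = Counter()
    with open(path) as log:
        for line in log:
            try:
                stamp = datetime.fromisoformat(line.split()[0])
            except (ValueError, IndexError):
                continue  # skip lines that don't parse
            # Truncate the timestamp to the minute to form the bucket.
            buckets[stamp.replace(second=0, microsecond=0)] += 1
    return buckets

if __name__ == "__main__":
    for minute, count in sorted(events_per_minute("syslog.txt").items()):
        print(f"{minute:%Y-%m-%d %H:%M}  {count} events")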
What you choose to match on can be monitored across the entire data center (thanks to distributed systems), at certain choke points in the network where ingress/egress occurs, or a combination of both. So where does the use of such data play its part?
Two immediate places come to the fore: operations and capacity planning. Operating an environment with a strong alerting and logging system will identify issues such as oversubscribed links, broken or flapping topologies, and failing hardware or software, to name a few, and can then trigger human or machine workflows to act on and potentially remediate the issue. The same data can subsequently be used to look at trends and events over time: where and why bandwidth upgrades are required, where consolidation can happen, and where re-architecture of applications, networks, or infrastructure may be needed.
An example
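To make the operations side concrete, here is a minimal sketch that watches a syslog stream for link-flap events and triggers a workflow once a threshold is crossed. The matching pattern, threshold, and the placeholder workflow function are all assumptions for illustration:

```python
import re
import sys
from collections import Counter

# Cisco-style link up/down syslog messages, e.g.
# "%LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to down".
# The pattern and threshold are illustrative; tune to your environment.
LINK_FLAP = re.compile(r"%LINK-3-UPDOWN.*Interface (\S+),")
FLAP_THRESHOLD = 5  # flaps per interface before we raise a workflow

def trigger_workflow(interface: str, count: int) -> None:
    # Placeholder: in practice this might call a ticketing or
    # automation API rather than just printing.
    print(f"ALERT: {interface} flapped {count} times, opening workflow")

def watch(stream) -> None:
    flaps: Counter = Counter()
    for line in stream:
        match = LINK_FLAP.search(line)
        if not match:
            continue
        interface = match.group(1)
        flaps[interface] += 1
        if flaps[interface] == FLAP_THRESHOLD:
            trigger_workflow(interface, flaps[interface])

if __name__ == "__main__":
    # Usage sketch: tail -F /var/log/network.log | python flap_watch.py
    watch(sys.stdin)
```

The same counts, kept over weeks rather than minutes, become capacity-planning input: an interface that flaps every day is an operations problem, while a link that trends towards saturation is a budget conversation.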
Darryl Kerrigan put it very aptly: “It’s what you do with it.” There is so much value in the data you collect; it is a matter of doing something useful with it.