Did you know that Splunk already has built-in health checks that can tell you if something important was misconfigured and alert you to important system or data outages? If you did, you are actually in the minority of users!
Most companies have at least some requirement for tracking when a server goes down – and if they do not, they should. Coming in Monday to find out half the virtualized indexers went into a saved state last Thursday does not make for a good start to the week.
This happens to the best of us: I recall back in my Splunk ES admin days being asked for some firewall logs during a suspected breach window only to discover Splunk had not been getting Checkpoint logs for over two weeks. In the middle of a breach investigation is not the time to tell an already pissed manager we don’t have the logs.
Step 1: Set Up the Splunk Distributed Monitoring Console
Luckily, you won’t have to worry about such a career-defining conversation if you follow some simple steps and take some proactive measures. The first step is setting up your Splunk distributed monitoring console (DMC). A DMC typically runs on a standalone server with modest CPU (as of this writing, 4 cores was the minimum), and it can be virtual. You can also add the DMC as a function of another Splunk instance, such as a deployment server or license master, if you’ve got the system resources available.
Any Splunk Enterprise instance (SH, IDX, CM, HF, DS, LM, etc.) that we want monitored simply needs to be added as a search peer, which is easily done via the UI. Splunk best practices say to add every Splunk Enterprise instance in your Splunk environment, even heavy forwarders, and in many cases some of your universal forwarders (UFs), especially if the UFs are on centralized Syslog servers.
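Once the peers are added, one quick way to confirm the DMC can actually see them is to query the distributed-peers REST endpoint from the DMC's search bar. This is only a minimal sketch: the endpoint is part of Splunk's REST API, but the fields it returns can vary by version, so treat the field names in the table command as assumptions to verify in your environment.

```spl
| rest splunk_server=local /services/search/distributed/peers
| table peerName status version
| sort peerName
```

Any peer whose status is not "Up" is worth a look before you start relying on the dashboards and alerts.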
Step 2: Start Monitoring Critical Metrics
Now for the fun part. On the Splunk homepage of your DMC, go to Settings > Monitoring Console. Here we can view which instances are down and, depending on which instances were added, see things like events per index, searches per user, average CPU percentage by SH, and even how many scheduled searches ran at 2 am last Tuesday. Also, simply click “Health Check” and then “Start”, and Splunk will launch a health check of 18 out-of-box items, such as whether your limits were set correctly, your scheduler skip ratio, or even the status of your KV store. These are good things to know, and the check is a good thing to run monthly at a minimum.
While these metrics are really cool and can keep a data science nerd busy full time, let us focus on the stuff that can get you fired, or at least some nasty-grams if you don’t do it.
There are nine out-of-box alerts the Monitoring Console can kick off for you if bad things happen. I would like to emphasize four of these, all of which are self-explanatory: Critical System Physical Memory Usage (>90% server memory usage), Near Critical Disk Usage (>80% disk capacity used), Missing Forwarders (UFs not sending data), and Search Peer Not Responding. The last two, a Splunk box being down or no longer sending data, are things any Splunk admin would like to know about before anyone else does. Luckily, Splunk has built these alerts in already, and you can configure them to send you an email as soon as the alert's scheduled search detects the problem. The other two matter just as much, since they are indicators the box may go down very soon if not addressed. I would recommend everyone enable at least these four alerts, with email actions going to the teams or individuals who have the access to address the issues immediately.
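To get a feel for what an alert like Near Critical Disk Usage is checking, here is a rough sketch of a disk-usage search you could run from the DMC. It is not the alert's actual search. It assumes the partitions-space REST endpoint returns capacity and free in the same units for each mount point, and the 80% threshold simply mirrors the figure above.

```spl
| rest /services/server/status/partitions-space
| eval pct_used = round((capacity - free) / capacity * 100, 1)
| where pct_used > 80
| table splunk_server mount_point capacity free pct_used
| sort - pct_used
```

Because the rest command is not restricted to the local server here, it should report on the DMC and on the search peers you added, which is one more reason to add every instance as a peer.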
Step 3: Creating Index-Specific Alerts
These are a start, but there is much more we can do. Personally, I would like to avoid the embarrassment of having to explain to the boss why I didn’t notice the firewall logs went down weeks ago. This is where index-specific alerts come into play. You can set your search cron schedules based on how quickly you want to know when something stops sending events.
Here is an example of a missing-data alert by host (in this case, Windows hosts):
| tstats latest(_time) as latest where index=windows by host index | where latest < relative_time(now(), "-1h") | eval latest=strftime(latest,"%m-%d-%Y %l:%M %p")
This search uses tstats to find the most recent event time for every host in the windows index across the alert's time range (say, the last 7 days), then keeps only the hosts whose latest event is older than the threshold in the where clause: one hour as written, or "-24h" if a day of silence is your cutoff. Any box that reported in during the time range but has not been seen since the threshold is a “Missing Host”. Set the alert to trigger when the number of results is greater than zero. The same pattern works for any data source: point it at the index that holds your Palo Alto (pan:traffic) or Cisco ASA (cisco:asa) logs, or at whatever index holds the hosts you want to know about.
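One caveat when adapting the search: pan:traffic and cisco:asa are normally sourcetype names rather than index names, so a variant that filters on sourcetype may fit better. The sketch below assumes a hypothetical index called firewall; swap in whatever index actually holds your firewall data.

```spl
| tstats latest(_time) as latest where index=firewall sourcetype="pan:traffic" by host sourcetype
| where latest < relative_time(now(), "-1h")
| eval latest=strftime(latest, "%m-%d-%Y %l:%M %p")
```

Since sourcetype is an indexed field, tstats can filter and group on it just as it does with host and index.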
If you don’t need extreme granularity, you can go more general by grouping a couple of indexes together under the same rule (with OR between your indexes in the search). Or you can get even more granular by narrowing the window, for example to traffic not seen in the last hour over the last day.
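As a sketch of the grouped approach, the search below covers two indexes in one rule and adds a column showing how long each host has been quiet. The linux index name is hypothetical, and the 24-hour threshold is just one reasonable choice.

```spl
| tstats latest(_time) as latest where index=windows OR index=linux by host index
| where latest < relative_time(now(), "-24h")
| eval hours_silent = round((now() - latest) / 3600, 1)
| eval latest = strftime(latest, "%m-%d-%Y %l:%M %p")
| sort - hours_silent
```

Note that hours_silent is calculated before latest is converted to a string, since the arithmetic needs the epoch value.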
If you’re really serious about being responsive, for deadly serious things like online banking portals or major WAN links where you’re dealing with millions of events per hour, you could even go down to 5 minutes over the last hour. I would highly recommend using this sparingly and only on the most critical data that requires maximum uptime. The best practice is to base the schedule on how quickly your response team can act on an outage: if you have a 48-hour response time, there is no need to run a search more frequently than two or three times a week.
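For the tight 5-minutes-over-the-last-hour case, a sketch might look like the following. The banking index name is hypothetical, and the time range is set inline with earliest and latest so the search only scans the last 60 minutes. To match the threshold, you would pair it with a cron schedule along the lines of */5 * * * *.

```spl
| tstats latest(_time) as last_seen where index=banking earliest=-60m latest=now by host
| where last_seen < relative_time(now(), "-5m")
| eval minutes_silent = round((now() - last_seen) / 60, 1)
| eval last_seen = strftime(last_seen, "%m-%d-%Y %l:%M %p")
```

The field is named last_seen here only to avoid confusion with the latest time modifier in the where clause of tstats.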
To avoid nasty surprises during an incident, add your servers to your DMC and set alerts for the following: when a Splunk box is down, when a box is about to go down, and when a data source stops sending events. Pair these with routine health checks of configurations, indexing status, resource and license usage, and skipped search ratio.
Do these things, and almost nothing will happen in your environment that you don’t already know about. Your powers will actually be so great, you can tell the firewall team which of their boxes went down and exactly at what time, so it can be their fault they were not in Splunk during the incident.