What is Summary Indexing?

Many people have heard of summary indexing, yet have never made use of it. One reason is that you may not realize you need a summary index until you actually need one. Summary indexes, as the name implies, allow for the storage of summarized data over time. A good use case is any query that summarizes or trends large amounts of data over a long period of time.

For example, with the onset of increased teleworking, my customer had a growing need to monitor the activity and resources used around teleworking (VPN activity, login/logoff, concurrent users, Zoom activity, etc.). I was required to track hourly metrics over a period of weeks. Even with good SPL structure and tuning, the query still failed to return results in a reasonable amount of time. This type of query is a great candidate for summary indexing.

There are two parts to this: 1) define a query that populates a summary index using smaller chunks of data, and 2) define a query that runs against the summary index to calculate your final results.

The summary indexing solution allows us to take these bite-size calculations of our data and store the results in a separate index. The precomputed results not only give us a head start on our calculations, but also leave far less data for the final query to search through.

What is a summary index?

Summary indexes are no different from other indexes. However, an advantage of using a separate index is that you can set its retention time independently of the source data. Suppose the source data lives in an index with 90 days of retention and uses a large amount of disk space; once summarized to a separate index, you can hold on to the key pieces of that data for a longer period of time while also saving disk space. By default, Splunk provides an index named “summary”; depending on your organization’s needs or structure, you may create additional summary indexes based on security requirements, retention, etc.

How is the data stored, and what are the licensing implications?

All events in a summary index use the sourcetype stash by default, and events with that sourcetype do not count against your license, no matter how many summary indexes your environment has allocated. However, licensing is impacted if you make use of the “collect” command and change the sourcetype to something other than ‘stash’.

Does this cost me in licensing?

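It depends on the sourcetype. Summary events written with the default stash sourcetype are not metered, while summary events written under any other sourcetype count against your license like regular indexed data.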
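
To make the stash-versus-other-sourcetype distinction concrete, here is a minimal sketch of a search that writes its results with “collect” under a custom sourcetype. The sourcetype name my_summary_counts and the one-hour time window are assumptions made for illustration only; they do not come from a real environment.

```spl
index=_internal earliest=-1h@h latest=@h
|bin _time span=10m
|stats count as count by _time, sourcetype
|collect index=summary sourcetype=my_summary_counts
```

Because the sourcetype is no longer stash, the events this search writes would be expected to count against the license. Omitting the sourcetype argument leaves collect on its default of stash. While experimenting, adding testmode=true to the collect command lets you preview the results without writing them to the index.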

The basic steps to make use of summary indexing

The process to implement summary indexing is fairly straightforward:

  • Identify the index you would like to utilize for summary indexing
  • Identify your report requirements (what data to report on and the frequency)
  • Create a scheduled saved search
    • Develop and test the query that you will use to populate the summary index
    • Schedule the query
  • Enable summary indexing
  • Develop and test a query used to view the summary index results

A Basic Example

For the purposes of this article, we will make use of a summary index in order to summarize data found within the _internal index.  We will implement summary indexing via the web interface. 

Step 1: Identify the index that should hold the summarized data
For this example, we will make use of the default ‘summary’ index.  

Step 2: Identify the report requirements
Before populating the summary index, you’ll need to know what the end-goal of this specific data is; identify what data is to be reported on, and for what time slices.

I’d like to report on the number of events that are indexed per sourcetype within the _internal index.  I’d like to get a report that has the ability to show the data on an hourly basis per day.

Step 3: Develop and test the index-populating query

An easy way to create a query to populate the summary index is to write a query similar to what you want to report on in the end: take into account all fields that you may want to include, summarize the data using the stats command, and test the output.

As mentioned above, I’d like to count the number of events indexed per hour within the _internal index by sourcetype:

index=_internal
|bin _time span=10m
|stats count as count by _time, sourcetype

As a result, we get a list of counts by sourcetype divided into 10-minute bins. Even though I will only run this report once per hour, I have decided to ‘bin’ the data into 10-minute chunks in case my requirements change in the future. Remember, you can always ‘bin’ the _time field in future queries to roll up data by larger time spans.

Note: In order to report on data using specific time slices, you must include _time in the by clause of your stats command. If you are working with data that does not include _time, you may create a _time field by appending the following to the end of your query: “|eval _time=now()”
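
As a rough illustration of the note above, the sketch below summarizes data that carries no timestamp of its own and stamps each result row with the time the search ran. The lookup file asset_inventory.csv and the fields category and asset_count are hypothetical names used only for this example.

```spl
|inputlookup asset_inventory.csv
|stats count as asset_count by category
|eval _time=now()
```

Each scheduled run would then produce one row per category with _time set to the run time, which gives later queries something to bin on.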

Step 4: Edit the query to utilize a summary-indexing command (an si-* command)

Now that we’ve tested the query, we need to format it to write to a summary index appropriately. To do so, we use one of the commands that Splunk provides specifically for summary indexing (sometimes referred to as the “si-*” commands). Each one is simply the ‘si’ version of another SPL command:

  • sichart (chart)
  • sitimechart (timechart)
  • sistats (stats)
  • sitop (top)
  • sirare (rare)

In our example, we replace ‘stats’ with ‘sistats’ as shown below:

index=_internal
|bin _time span=10m
|sistats count as count by _time, sourcetype

Step 5: Save, and Schedule the Query

Save your query as a saved search using the typical Splunk Web UI navigation:

Settings | Searches, reports, and alerts | New Report

From the reports listing, schedule the report to run once every hour

Actions | Edit | Edit Schedule
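
When a populating search runs once per hour, one common approach is to restrict each run to the previous whole hour so that consecutive runs neither overlap nor leave gaps. The article does not state which time range was used, so the earliest and latest values in this sketch are an assumption rather than the original configuration. The same window can also be set in the schedule's time range instead of inline.

```spl
index=_internal earliest=-1h@h latest=@h
|bin _time span=10m
|sistats count as count by _time, sourcetype
```

Snapping both ends to the hour (@h) means a run that starts a few seconds late still summarizes exactly the same window.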

Step 6: Enable Summary Indexing

Access the Summary Indexing Dialogue:

Actions | Edit Summary Indexing

Choose to enable summary indexing

In the dialogue box, we will select to store the data in the “summary” index as we determined in Step 1 above. (The only indexes shown in this dialogue box are those that your userid has write permissions to.)

Add fields

I have chosen to add a new field, “report”, which helps to distinguish this data within the summary index from the output of other queries. For this example its value is sourcetype_eventcount, which is the value the Step 7 query filters on. The name “report” is arbitrary; you may create your own fields as you choose.

Step 7:  Develop and test a query used to view the summary index results.

This is the final step: running the query that produces the final results. Assuming I’d like to count today’s number of events per hour, by sourcetype:

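earliest=-0d@d index=summary report=sourcetype_eventcount
|bin _time span=1h
|stats count as count by _time, orig_sourcetype

Because the index was populated with sistats, the reporting query uses the matching stats function (count) rather than summing a separate count field; Splunk computes the totals from the prestats fields stored with each summary event.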
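
The same summarized data can be rolled up to a larger time span without touching the populating search. The sketch below reports daily totals per sourcetype for the past week; the seven-day window and the one-day span are illustrative choices, and it assumes the summary index was populated by the sistats query from Step 4, which is why the reporting side uses stats count.

```spl
earliest=-7d@d index=summary report=sourcetype_eventcount
|bin _time span=1d
|stats count as count by _time, orig_sourcetype
```

Since the stored bins are 10 minutes wide, any span that is a multiple of 10 minutes should roll up cleanly in the same way.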

A few interesting items at this point:

Query Fields

  • report: this is the custom field that was added in Step 6 to identify the data. You may use it within the query to pull back the appropriate dataset.
  • orig_sourcetype: note that we are required to query on a new field, “orig_sourcetype”, in order to avoid conflicting with the sourcetype that has been assigned to the summarized data (stash).

Resulting Fields

  • source: the source of the data is set to the name of the saved search that was executed to populate the index.
  • search_name: also set to the name of the saved search that was executed to populate the index.
  • psrsvd* fields: these are special “prestats reserved” fields that Splunk adds when you use any of the si* commands. They are not usually referenced directly, but Splunk uses them when you run reporting commands such as chart, timechart, and stats against this data.

Further Reading

  • When setting up the populating query, make sure it includes a _time field so that you can make use of “stats latest()” functionality later. Try to include all fields and granularity that you may need in the future. The best practice, per Splunk, is to capture the finest granularity that still achieves your goal and performs acceptably.

About SP6

SP6 is a Splunk consulting firm focused on Splunk professional services including Splunk deployment, ongoing Splunk administration, and Splunk development. SP6 has a separate division that also offers Splunk recruitment and the placement of Splunk professionals into direct-hire (FTE) roles for those companies that may require assistance with acquiring their own full-time staff, given the challenge that currently exists in the market today.