
Splunk Archive to AWS S3: How to Add Amazon S3 Storage to Your Splunk Environment

When Splunk is deployed to Amazon Web Services (AWS), it can be configured to archive data to Amazon’s S3 storage, using Hadoop to broker both the data transfer and Splunk search queries. This archival storage is in addition to the standard Amazon Elastic Block Store volumes used for Splunk’s hot, warm, and cold buckets. S3 storage is typically cheaper than Elastic Block Store, at the cost of slower performance. This cheaper, slower storage is well suited to archiving data that is needed infrequently but must remain searchable from Splunk.

Tested Environment

The AWS environment used to set up an example of a Splunk archive to S3 was:

  • AWS Virtual Private Cloud (VPC) containing 3 availability zones (AZ)
  • AWS S3 bucket in AWS region US-EAST-1
  • Splunk indexer cluster with one indexer in each AZ
  • Splunk Virtual Machines (VM) created using AWS Linux with pre-installed Splunk from the AWS Marketplace
  • Splunk version 6.6.2
  • Apache Hadoop version 2.7.4
  • OpenJDK Java 8
  • openssl-devel.x86_64 package installed on all Splunk indexers

Splunk AWS S3 Configuration

The benefit of using AWS Linux is that it comes with the AWS command-line interface already installed.  On each of the AWS VMs hosting a Splunk indexer, make sure that the AWS command-line interface is configured to connect to the target S3 bucket.

At the shell prompt:

  1. Become the Splunk user (typically “splunk”)
  2. Enter: aws configure
  3. At the resulting prompt, enter the AWS access key for the target S3 bucket
  4. Enter the AWS secret key
  5. Accept the defaults presented for the remaining settings
  6. Test the configuration by entering: aws s3 ls <bucket_name>

It’s helpful to have a test document in the S3 bucket so the test listing produces a result.  This document can be removed when you’re done testing the connections from each indexer.
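Run on each indexer, the full sequence looks roughly like the following sketch (the bucket name is a placeholder for your own):

sudo su - splunk                    # become the Splunk user
aws configure                       # prompts for access key, secret key, default region, and output format
aws s3 ls s3://my-splunk-archive    # should list the test document placed in the bucket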

Hadoop Installation and Configuration

1. Hadoop installed in /opt (Splunk installs in /opt by default)

2. Hadoop owner and group set to match the Splunk owner and group (often splunk:splunk)

3. Hadoop working directory created at /opt/hadoop/working_dir (steps 1 through 3 are sketched in the example after step 7)

4. The following added to the Splunk user’s .bashrc:

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/jre
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export PATH=$HADOOP_HOME/bin:$PATH
#HADOOP VARIABLES END

5. Configuration added to: /opt/hadoop/etc/hadoop/core-site.xml

<configuration>
<property>
<name>fs.s3a.access.key</name>
<value>ENTER ACCESS KEY HERE</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>ENTER SECRET KEY HERE</value>
</property>
</configuration>

6. Configuration added to: /opt/hadoop/etc/hadoop/hadoop-env.sh

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*

7. Test the Hadoop connection to the S3 bucket as the Splunk user on each indexer at the shell prompt:

hadoop fs -ls s3a://<bucket_name>/
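For reference, steps 1 through 3 might look like the following on an AWS Linux host. This is a sketch only: the download mirror and the symlink are assumptions (the symlink is one way to reconcile the /opt/hadoop and /opt/hadoop-2.7.4 paths that appear in the provider configuration below).

cd /opt
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.4/hadoop-2.7.4.tar.gz
tar -xzf hadoop-2.7.4.tar.gz
ln -s /opt/hadoop-2.7.4 /opt/hadoop     # so /opt/hadoop points at the 2.7.4 install
mkdir -p /opt/hadoop/working_dir        # working directory used by the Splunk provider
chown -R splunk:splunk /opt/hadoop-2.7.4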

Splunk AWS S3 Provider Configuration

For non-clustered environments, the Splunk web interface allows configuration of the S3 provider at: Settings > Virtual indexes.  However, with an indexer cluster, you need to edit the indexes.conf file and deploy it to the indexers.
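One way to deploy the settings below in a clustered environment is through the cluster master’s configuration bundle; the app name used here (s3_archive) is illustrative:

# On the cluster master, place the provider and archive index settings in a bundle app
mkdir -p $SPLUNK_HOME/etc/master-apps/s3_archive/local
vi $SPLUNK_HOME/etc/master-apps/s3_archive/local/indexes.conf

# Validate the bundle, then push it to the peer indexers
$SPLUNK_HOME/bin/splunk validate cluster-bundle
$SPLUNK_HOME/bin/splunk apply cluster-bundle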

The Splunk AWS S3 provider was configured in indexes.conf as follows:

[provider:aditum-s3a]
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-hy2.jar
vix.env.HADOOP_HOME = /opt/hadoop
vix.env.HADOOP_TOOLS = $HADOOP_HOME/share/hadoop/tools/lib
vix.env.HUNK_THIRDPARTY_JARS = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-1.1.1.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-exec-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-metastore-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-serde-1.2.1.jar
vix.env.JAVA_HOME = /usr/lib/jvm/jre
vix.family = hadoop
vix.fs.default.name = s3a://aditum-archive/
vix.fs.s3a.access.key = AKIAI5U5ERONJR72G73A
vix.fs.s3a.secret.key = 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
vix.mapreduce.framework.name = yarn
vix.mode = stream
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.home.hdfs = /opt/hadoop/working_dir
vix.splunk.jars = $HADOOP_TOOLS/hadoop-aws-2.7.4.jar,$HADOOP_TOOLS/aws-java-sdk-1.7.4.jar,$HADOOP_TOOLS/jackson-databind-2.2.3.jar,$HADOOP_TOOLS/jackson-core-2.2.3.jar,$HADOOP_TOOLS/jackson-annotations-2.2.3.jar
vix.env.HADOOP_CLASSPATH = /opt/hadoop-2.7.4/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop-2.7.4/share/hadoop/yarn/lib/*:/opt/hadoop-2.7.4/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/tools/lib/*:/contrib/capacity-scheduler/*.jar

Splunk Archive Index Configuration

Each index to be archived needs a corresponding archive index whose name is the original index name with an “_archive” suffix.  Three archive indexes were created in this demo environment to show that multiple indexes can be archived to the same S3 bucket.  These configurations need to be deployed to both the indexers and the search heads.

The Splunk archive indexes were configured in indexes.conf as follows:

[main_archive]
vix.output.buckets.from.indexes = main
vix.output.buckets.older.than = 1800
vix.output.buckets.path = s3a://aditum-archive/main_archive
vix.provider = aditum-s3a
vix.unified.search.cutoff_sec = 7200

[net_ops_prod_cisco_asr_archive]
vix.output.buckets.from.indexes = net_ops_prod_cisco_asr
vix.output.buckets.older.than = 1800
vix.output.buckets.path = s3a://aditum-archive/net_ops_prod_cisco_asr_archive
vix.provider = aditum-s3a
vix.unified.search.cutoff_sec = 7200

[net_ops_prod_cisco_isr_archive]
vix.output.buckets.from.indexes = net_ops_prod_cisco_isr
vix.output.buckets.older.than = 1800
vix.output.buckets.path = s3a://aditum-archive/net_ops_prod_cisco_isr_archive
vix.provider = aditum-s3a
vix.unified.search.cutoff_sec = 7200

There will be some overlap between the data in the normal Splunk indexes and the archive indexes in the S3 bucket; the amount of overlap is determined by the settings above.  To learn more about how to adjust the overlap, see the Splunk documentation:

http://docs.splunk.com/Documentation/Splunk/latest/HadoopAnalytics/Configureandrununifiedsearch

The “older.than” and “search.cutoff_sec” values above are in seconds and were made very small (30 minutes and 2 hours, respectively) so the demo environment would move data to the S3 bucket quickly.  The values would be much larger in a production environment.
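For example, a production deployment might archive buckets after a week and send unified searches to the archive only beyond ten days; the values below are illustrative, not recommendations:

[main_archive]
vix.output.buckets.older.than = 604800
vix.unified.search.cutoff_sec = 864000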

Splunk Unified Search Configuration

Splunk’s unified search capability allows you to search both a normal index and its archive by specifying only the normal index name.  Splunk determines, based on the time span chosen, when to look in the archive index for data.  Some of the required configuration was added to the archive indexes in the previous section.  The remaining configuration for unified search is added to limits.conf on both the indexers and the search heads as follows:

[search]
allow_batch_mode = 1
allow_inexact_metasearch = 0
default_allow_queue = 1
disabled = 0
enable_cumulative_quota = 0
enable_datamodel_meval = 1
enable_history = 1
enable_memory_tracker = 0
force_saved_search_dispatch_as_user = 0
load_remote_bundles = 0
remote_timeline = 1
timeline_events_preview = 0
track_indextime_range = 1
truncate_report = 0
unified_search = 1
use_bloomfilter = 1
use_metadata_elimination = 1
write_multifile_results_out = 1

Using Splunk Unified Search

Once data has rolled through the hot, warm, and cold buckets, it is copied out to the archive index associated with the normal index.  The archived data can be viewed in a web browser by navigating to the target S3 bucket in the AWS console.  In this demo environment, one could use “index=main” as the search query and get data back from both the “main” and “main_archive” indexes, depending on the time period selected.
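For example, a search such as the following (the time range and the stats clause are illustrative) returns events from main_archive transparently once the selected time range extends past the unified search cutoff:

index=main earliest=-30d@d latest=now
| stats count by host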

Assistance with Archiving Splunk Data

For assistance with archiving Splunk data, or help with anything Splunk, contact SP6’s Splunk Professional Services consultants. Our certified Splunk Architects and Splunk Consultants manage successful Splunk deployments, environment upgrades and scaling, dashboard, search, and report creation, and Splunk Health Checks. SP6 also has a team of accomplished Splunk Developers that focus on building Splunk apps and technical add-ons.