When Splunk is deployed to Amazon Web Services (AWS), it can be configured to archive data to Amazon S3 storage, using Hadoop to broker both the data transfer and Splunk search queries. This archival storage is in addition to the standard Amazon Elastic Block Store (EBS) volumes used for Splunk’s hot, warm, and cold buckets. S3 is typically cheaper than EBS because it is slower, and that trade-off makes it well suited to archiving data that is needed infrequently but must remain searchable from Splunk.
The AWS environment used to set up an example of archiving Splunk data to S3 was:
- AWS Virtual Private Cloud (VPC) containing 3 availability zones (AZ)
- AWS S3 bucket in AWS region US-EAST-1
- Splunk indexer cluster with one indexer in each AZ
- Splunk virtual machines (VMs) created from the Amazon Linux image with pre-installed Splunk from the AWS Marketplace
- Splunk version 6.6.2
- Apache Hadoop version 2.7.4
- OpenJDK Java 8
- The openssl-devel.x86_64 package installed on all Splunk indexers
Splunk AWS S3 Configuration
The benefit of using Amazon Linux is that it comes with the AWS command-line interface already installed. On each of the VMs hosting a Splunk indexer, make sure the AWS command-line interface is configured to connect to the target S3 bucket.
At the shell prompt:
- Become the Splunk user (typically “splunk”)
- Enter: aws configure
- At the resulting prompt, enter the AWS access key to the target S3 bucket
- Enter the AWS secret key
- Accept the defaults presented for the remaining settings
- Test the configuration by entering: aws s3 ls <bucket_name>
It’s helpful to have a test document in the S3 bucket so the test listing produces a result. This document can be removed when you’re done testing the connections from each indexer.
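The steps above can be sketched as a shell session. The bucket name and key values here are placeholders, not the ones from this environment; `aws configure set` is used instead of the interactive `aws configure` prompt so the sketch can be scripted:

```shell
# Run as the Splunk user on each indexer.
# "my-splunk-archive" and the key values are placeholders -- substitute your own.
BUCKET="my-splunk-archive"
aws configure set aws_access_key_id "YOUR_ACCESS_KEY"
aws configure set aws_secret_access_key "YOUR_SECRET_KEY"

# Upload a throwaway test object so the listing below returns at least one line...
echo "connectivity test" > /tmp/s3-test.txt
aws s3 cp /tmp/s3-test.txt "s3://${BUCKET}/s3-test.txt"

# ...then confirm this indexer can list the bucket.
aws s3 ls "s3://${BUCKET}/"

# Remove the test object once every indexer has been verified.
aws s3 rm "s3://${BUCKET}/s3-test.txt"
```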
Hadoop Installation and Configuration
1. Install Hadoop in /opt (Splunk also installs in /opt by default)
2. Set the Hadoop owner and group to match Splunk’s (often this is splunk:splunk)
3. Create the Hadoop working directory: /opt/hadoop/working_dir
4. Add the following to the Splunk user’s .bashrc:
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/jre
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export PATH=$HADOOP_HOME/bin:$PATH
#HADOOP VARIABLES END
5. Configuration added to: /opt/hadoop/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>ENTER ACCESS KEY HERE</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>ENTER SECRET KEY HERE</value>
  </property>
</configuration>
6. Configuration added to: /opt/hadoop/etc/hadoop/hadoop-env.sh
7. As the Splunk user on each indexer, test the Hadoop connection to the S3 bucket at the shell prompt:
hadoop fs -ls s3a://<bucket_name>/
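Steps 1 through 3 above might look like the following shell session. The download mirror is an assumption, and the symlink is one way to reconcile the /opt/hadoop and /opt/hadoop-2.7.4 paths that both appear in the provider configuration later:

```shell
# Run as root on each indexer; paths follow this article's layout.
cd /opt

# Step 1: install Hadoop in /opt (mirror URL is an assumption -- use any Apache mirror).
curl -O https://archive.apache.org/dist/hadoop/common/hadoop-2.7.4/hadoop-2.7.4.tar.gz
tar -xzf hadoop-2.7.4.tar.gz

# Symlink so /opt/hadoop and the versioned path both resolve.
ln -s /opt/hadoop-2.7.4 /opt/hadoop

# Step 2: owner and group must match Splunk's.
chown -R splunk:splunk /opt/hadoop-2.7.4

# Step 3: create the working directory.
mkdir -p /opt/hadoop/working_dir
chown splunk:splunk /opt/hadoop/working_dir
```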
Splunk AWS S3 Provider Configuration
For non-clustered environments, the Splunk web interface allows configuration of the S3 provider at: Settings > Virtual indexes. However, with an indexer cluster, you need to edit the indexes.conf file and deploy it to the indexers.
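One way to push that indexes.conf to the peer indexers is through the cluster master’s configuration bundle; a sketch, assuming a default /opt/splunk install and the standard _cluster/local bundle location:

```shell
# On the cluster master, as the Splunk user.
SPLUNK_HOME=/opt/splunk   # adjust if Splunk is installed elsewhere

# Stage the edited indexes.conf in the configuration bundle.
cp indexes.conf $SPLUNK_HOME/etc/master-apps/_cluster/local/indexes.conf

# Validate the bundle, then distribute it to the peer nodes.
$SPLUNK_HOME/bin/splunk validate cluster-bundle
$SPLUNK_HOME/bin/splunk apply cluster-bundle --answer-yes
```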
The Splunk AWS S3 provider was configured in indexes.conf as follows:
[provider:aditum-s3a]
vix.command.arg.3 = $SPLUNK_HOME/bin/jars/SplunkMR-hy2.jar
vix.env.HADOOP_HOME = /opt/hadoop
vix.env.HADOOP_TOOLS = $HADOOP_HOME/share/hadoop/tools/lib
vix.env.HUNK_THIRDPARTY_JARS = $SPLUNK_HOME/bin/jars/thirdparty/common/avro-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/avro-mapred-1.7.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-compress-1.10.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/commons-io-2.4.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/libfb303-0.9.2.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/parquet-hive-bundle-1.6.0.jar,$SPLUNK_HOME/bin/jars/thirdparty/common/snappy-java-1.1.1.7.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-exec-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-metastore-1.2.1.jar,$SPLUNK_HOME/bin/jars/thirdparty/hive_1_2/hive-serde-1.2.1.jar
vix.env.JAVA_HOME = /usr/lib/jvm/jre
vix.family = hadoop
vix.fs.default.name = s3a://aditum-archive/
vix.fs.s3a.access.key = ENTER ACCESS KEY HERE
vix.fs.s3a.secret.key = ENTER SECRET KEY HERE
vix.mapreduce.framework.name = yarn
vix.mode = stream
vix.output.buckets.max.network.bandwidth = 0
vix.splunk.home.hdfs = /opt/hadoop/working_dir
vix.splunk.jars = $HADOOP_TOOLS/hadoop-aws-2.7.4.jar,$HADOOP_TOOLS/aws-java-sdk-1.7.4.jar,$HADOOP_TOOLS/jackson-databind-2.2.3.jar,$HADOOP_TOOLS/jackson-core-2.2.3.jar,$HADOOP_TOOLS/jackson-annotations-2.2.3.jar
vix.env.HADOOP_CLASSPATH = /opt/hadoop-2.7.4/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop-2.7.4/share/hadoop/yarn/lib/*:/opt/hadoop-2.7.4/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/tools/lib/*:/contrib/capacity-scheduler/*.jar
Splunk Archive Index Configuration
Each index to be archived needs a corresponding archive index, named after the index with an “_archive” suffix. Three archive indexes were created in this demo environment to show that multiple indexes can be archived to the same S3 bucket. These configurations need to be deployed to both the indexers and the search heads.
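The naming convention is mechanical enough to script; a small sketch using the three index names from this demo environment:

```shell
# Derive the archive index name for each index to be archived
# (the "_archive" suffix convention described above).
for idx in main net_ops_prod_cisco_asr net_ops_prod_cisco_isr; do
  echo "${idx} -> ${idx}_archive"
done
# First line printed: main -> main_archive
```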
The Splunk archive indexes were configured in indexes.conf as follows:
[main_archive]
vix.output.buckets.from.indexes = main
vix.output.buckets.older.than = 1800
vix.output.buckets.path = s3a://aditum-archive/main_archive
vix.provider = aditum-s3a
vix.unified.search.cutoff_sec = 7200

[net_ops_prod_cisco_asr_archive]
vix.output.buckets.from.indexes = net_ops_prod_cisco_asr
vix.output.buckets.older.than = 1800
vix.output.buckets.path = s3a://aditum-archive/net_ops_prod_cisco_asr_archive
vix.provider = aditum-s3a
vix.unified.search.cutoff_sec = 7200

[net_ops_prod_cisco_isr_archive]
vix.output.buckets.from.indexes = net_ops_prod_cisco_isr
vix.output.buckets.older.than = 1800
vix.output.buckets.path = s3a://aditum-archive/net_ops_prod_cisco_isr_archive
vix.provider = aditum-s3a
vix.unified.search.cutoff_sec = 7200
The settings above determine how much overlap there will be between the data in the normal Splunk indexes and the archive indexes in the S3 bucket. To learn more about tuning the overlap, see Splunk’s documentation on archiving indexes.
The “older.than” and “search.cutoff_sec” values above are in seconds and were made very small so the demo environment would move data to the S3 bucket quickly; in a production environment they would be much larger.
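As a concrete conversion, production-scale values might be computed like this (30 and 60 days are illustrative choices, not recommendations):

```shell
# Convert days to the seconds these settings expect (86400 seconds per day).
days_to_secs() { echo $(( $1 * 86400 )); }

echo "vix.output.buckets.older.than = $(days_to_secs 30)"   # archive buckets older than 30 days
echo "vix.unified.search.cutoff_sec = $(days_to_secs 60)"   # unified-search cutoff at 60 days
```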
Splunk Unified Search Configuration
Splunk’s unified search capability allows you to search both a normal index and its archive by specifying only the normal index name. Based on the time span of the search, Splunk knows when to look in the archive index for data. Some of the required settings were added to the archive indexes in the previous section; the remaining setting, unified_search = 1, is added to limits.conf on both indexers and search heads, shown here with the rest of the demo environment’s [search] stanza:
[search]
allow_batch_mode = 1
allow_inexact_metasearch = 0
default_allow_queue = 1
disabled = 0
enable_cumulative_quota = 0
enable_datamodel_meval = 1
enable_history = 1
enable_memory_tracker = 0
force_saved_search_dispatch_as_user = 0
load_remote_bundles = 0
remote_timeline = 1
timeline_events_preview = 0
track_indextime_range = 1
truncate_report = 0
unified_search = 1
use_bloomfilter = 1
use_metadata_elimination = 1
write_multifile_results_out = 1
Using Splunk Unified Search
Once data has rolled through the hot, warm, and cold buckets, it is copied out to the archive index associated with the normal index. The archived data can be viewed by navigating in the AWS console to the target S3 bucket. In this demo environment, a search for “index=main” returns data from both the “main” and “main_archive” indexes, depending on the time period selected.
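Besides the AWS console, the archived buckets can be inspected from the command line, using the bucket and archive-index prefix from this demo environment:

```shell
# List the archived bucket objects Splunk has copied into S3 for the main index.
aws s3 ls s3://aditum-archive/main_archive/ --recursive
```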
Assistance with Archiving Splunk Data
For assistance with archiving Splunk data, or help with anything Splunk, contact SP6’s Splunk Professional Services consultants. Our certified Splunk Architects and Splunk Consultants manage successful Splunk deployments, environment upgrades and scaling, dashboard, search, and report creation, and Splunk Health Checks. SP6 also has a team of accomplished Splunk Developers that focus on building Splunk apps and technical add-ons.