Has anyone had success sinking data from Flume to Splunk?
I've tried the Thrift and Avro Flume sinks, but they have issues: the formats are not a good fit for Splunk, and Flume keeps retrying events even after they have been delivered.
I'm looking into the Flume HTTP sink to Splunk's HEC (HTTP Event Collector), but I can't see how to set the HEC token in the header. Has anyone configured the HEC token in the header for the Flume HTTP sink?
I'm considering a file sink whose output is forwarded to Splunk, but I would like to avoid the temporary file if possible.
Advice?
I ended up using a rolling file sink and copying those files to a directory monitored by a Splunk forwarder. Not ideal or performant, but good enough for our use case.
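For anyone landing here with the same HEC question: HEC expects the token in an Authorization header of the form "Splunk <token>". The sketch below only shows what such a request looks like, built in Python without being sent. It does not show how to make the Flume HTTP sink emit that header. The host, port, token, and sourcetype are placeholders, not values from this thread.

```python
import json
import urllib.request


def build_hec_request(base_url, token, event, sourcetype="flume"):
    """Build (but do not send) a POST request for Splunk's HEC event endpoint."""
    payload = json.dumps({"event": event, "sourcetype": sourcetype}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/services/collector/event",
        data=payload,
        headers={
            # HEC authenticates with the literal scheme "Splunk" followed by the token.
            "Authorization": "Splunk " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder host and token, for illustration only.
req = build_hec_request(
    "https://splunk.example.com:8088",
    "00000000-0000-0000-0000-000000000000",
    "hello from flume",
)
# urllib.request.urlopen(req) would actually send it.
```

A small script like this could also post the lines of each rolled file to HEC instead of copying the files to a forwarder directory, if that turns out to be simpler in your setup.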
Related
Is it possible to read events as they land in an S3 source bucket with Apache Flink, process them, and sink the results to another S3 bucket? Is there a special connector for that, or do I have to use the read/save examples from the Apache Flink documentation?
How does checkpointing work in that case? Does Flink automatically keep track of what it has read from the S3 source bucket, or does that need custom code? Does Flink also guarantee exactly-once processing with an S3 source?
In Flink 1.11 the FileSystem SQL Connector is much improved; that will be an excellent solution for this use case.
With the DataStream API you can use FileProcessingMode.PROCESS_CONTINUOUSLY with readFile to monitor a bucket and ingest new files as they are atomically moved into it. Flink keeps track of the last-modified timestamp of the bucket, and ingests any children modified since that timestamp -- doing so in an exactly-once way (the read offsets into those files are included in checkpoints).
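To make the timestamp idea concrete, here is a plain Python illustration. It is not Flink code and Flink's actual implementation may differ. It remembers the newest modification time it has seen and, on each poll, returns only files modified after that. The class and helper names are made up for this sketch, and a local directory stands in for the bucket.

```python
import os
import tempfile


class DirectoryMonitor:
    """Remembers the newest modification time seen and reports newer files."""

    def __init__(self, path):
        self.path = path
        self.max_seen_mtime = float("-inf")

    def poll(self):
        new_files = []
        newest = self.max_seen_mtime
        for name in sorted(os.listdir(self.path)):
            mtime = os.path.getmtime(os.path.join(self.path, name))
            if mtime > self.max_seen_mtime:
                new_files.append(name)
                newest = max(newest, mtime)
        self.max_seen_mtime = newest
        return new_files


def touch(directory, name, mtime):
    """Create a file and force its modification time (keeps the demo deterministic)."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write(name)
    os.utime(path, (mtime, mtime))


with tempfile.TemporaryDirectory() as bucket:
    monitor = DirectoryMonitor(bucket)
    touch(bucket, "a.txt", 100)
    first = monitor.poll()    # a.txt is new
    touch(bucket, "b.txt", 200)
    second = monitor.poll()   # only b.txt is newer than the remembered timestamp
    touch(bucket, "late.txt", 150)
    third = monitor.poll()    # late.txt is older than the remembered timestamp
```

In this simplified scheme a file that shows up with an older timestamp is never picked up, which is one reason to have files appear in the monitored location atomically and with a fresh modification time.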
In Apache NiFi, using FetchS3Object to read from an S3 bucket, I see it can read all the objects in the bucket, including new ones as they are added. Is it possible:
To configure the processor to read only objects added from now onwards, not the ones already present?
How can I make it read a particular folder in the bucket?
NiFi seems great; it is just missing examples in the documentation for at least the popular processors.
A combination of ListS3 and FetchS3Object processors will do this:
ListS3 - to enumerate your S3 bucket and generate flowfiles referencing each object. You can configure the Prefix property to specify a particular folder in the bucket to enumerate only a subset. ListS3 keeps track of what it has read using NiFi's state feature, so it will generate new flowfiles as new objects are added to the bucket.
FetchS3Object - to read S3 objects into flowfile content. You can use the output of ListS3 by configuring the FetchS3Object's Bucket property to ${s3.bucket} and Object Key property to ${filename}.
Another approach would be to configure your S3 bucket to send SNS notifications and subscribe an SQS queue to that topic. NiFi would then read from the SQS queue to receive the notifications, filter objects of interest, and process them.
See Monitoring An S3 Bucket in Apache NiFi for more on this approach.
Use the GetSQS and FetchS3Object processors, and configure GetSQS to listen for notifications about newly added files. It's an event-driven approach: whenever a new file arrives, a notification is delivered through the SQS queue to NiFi.
See the link below for details:
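To show what the notification route involves, here is a rough Python sketch of the filtering step: pulling bucket and key out of an S3 event notification and keeping only newly created objects under a prefix. It assumes the standard S3 event JSON layout (Records, s3.bucket.name, s3.object.key) and, when SNS sits in between without raw message delivery, an envelope whose "Message" field holds the event as a JSON string. The function name, bucket, and prefix are made up for illustration. In NiFi the same logic would be done with processors rather than code.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_body, prefix=""):
    """Return (bucket, key) pairs for created objects whose key starts with prefix."""
    envelope = json.loads(sqs_body)
    # SNS wraps the S3 event as a JSON string under "Message"; a direct
    # S3-to-SQS notification carries the event at the top level.
    event = json.loads(envelope["Message"]) if "Message" in envelope else envelope
    results = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the notification.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(prefix):
            results.append((bucket, key))
    return results


# Hand-written sample event, wrapped the way SNS would deliver it to SQS.
s3_event = {
    "Records": [
        {"eventName": "ObjectCreated:Put",
         "s3": {"bucket": {"name": "my-bucket"}, "object": {"key": "incoming/report+1.csv"}}},
        {"eventName": "ObjectCreated:Put",
         "s3": {"bucket": {"name": "my-bucket"}, "object": {"key": "other/skip.csv"}}},
        {"eventName": "ObjectRemoved:Delete",
         "s3": {"bucket": {"name": "my-bucket"}, "object": {"key": "incoming/old.csv"}}},
    ]
}
sqs_body = json.dumps({"Type": "Notification", "Message": json.dumps(s3_event)})
matches = extract_s3_objects(sqs_body, prefix="incoming/")
```

Because only events that arrive after the notification is set up reach the queue, this route naturally skips objects that were already in the bucket.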
AWS-NIFI integration
I want to do sentiment analysis of Twitter data using Apache Hive and Flume. I have a Twitter account and have set up the conf file, but the problem is the format of the data: it does not load into Hive. Kindly help; I have been working on this for a month.
I assume you have already configured the Flume agent to fetch data from Twitter, and that your problem is the format of the data.
Apache Flume offers several sink types. Two of them fit your requirement:
HDFS Sink
Hive Sink
Using HDFS Sink:
Configure the Flume agent with TwitterSource and the HDFS Sink.
Provide your Twitter OAuth details (the keys) to the Flume agent.
Once the agent configuration is done, start the agent.
The agent will fetch the tweets from Twitter and store them in the HDFS path as JSON documents.
Once the data is available in HDFS, create a Hive external table that uses the JSON SerDe, with a LOCATION clause pointing at that path.
JSON SerDe Code link: https://github.com/cloudera/cdh-twitter-example/blob/master/hive-serdes/src/main/java/com/cloudera/hive/serde/JSONSerDe.java
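If the external table still returns errors or NULLs, it can help to check what Flume actually wrote before blaming the SerDe. The sketch below is a guess at a useful diagnostic, since the question does not show the agent configuration. It sniffs the first bytes of a file copied out of HDFS. Two causes I would look for: the HDFS sink's hdfs.fileType defaults to SequenceFile (DataStream writes the raw event body instead), and, as far as I know, the TwitterSource bundled with Flume emits Avro rather than JSON, unlike the one in the Cloudera example. The function name is made up for this sketch.

```python
import json

AVRO_MAGIC = b"Obj\x01"      # first bytes of an Avro container file
SEQUENCEFILE_MAGIC = b"SEQ"  # first bytes of a Hadoop SequenceFile


def sniff_format(data):
    """Classify raw file bytes as avro, sequencefile, json-lines, empty, or unknown."""
    if data.startswith(AVRO_MAGIC):
        return "avro"
    if data.startswith(SEQUENCEFILE_MAGIC):
        return "sequencefile"
    lines = [line for line in data.splitlines() if line.strip()]
    if not lines:
        return "empty"
    try:
        for line in lines:
            # A JSON SerDe expects one complete JSON object per line.
            if not isinstance(json.loads(line), dict):
                return "unknown"
    except ValueError:
        return "unknown"
    return "json-lines"


# Usage with a file fetched from HDFS to local disk (path is a placeholder):
# with open("FlumeData.1234567890", "rb") as handle:
#     print(sniff_format(handle.read()))
```

Only the "json-lines" result is something a JSON SerDe table can read directly. The other results point at the sink or source configuration rather than at Hive.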
Using Hive Sink:
Flume can write the data directly into a Hive table using the Hive Sink, so the agent needs to be configured as follows:
TwitterSource --> Channel --> Hive Sink
Configure the Flume agent with TwitterSource and the Hive Sink.
Provide your Twitter OAuth details (the keys) to the Flume agent.
Once the agent configuration is done, start the agent.
The agent will fetch the tweets from Twitter and store them in the Hive table. This uses the JSON SerDe.
Hive Sink has parameter called serializer to tell the type of SerDe.
Supported serializers: DELIMITED and JSON
So please configure your Flume agent using either of the solutions above.
See the documentation below for more details about the sink parameters (HDFS and Hive):
https://flume.apache.org/FlumeUserGuide.html
You can try adding this jar file
hive-serdes-1.0-SNAPSHOT.jar
You can follow the blog post below for a complete walkthrough of sentiment analysis using Hive.
https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/
It seems CloudBees writes logs only to a stream and not to a file. I need to save my logs. Can I use any option other than Papertrail to store and retrieve log files? Can I listen to some input stream and get a feed of the logs? Can I dump logs directly to Amazon S3?
As the filesystem isn't persistent, we don't provide file-based logging either. We don't provide a platform helper to store logs in S3, as Papertrail offers a comparable persistent solution with better performance and a dedicated service.
You can certainly use your favorite logging framework and custom extensions to store logs in S3 or elsewhere if you prefer that option.
I'm new to Apache Flume, and I want to know where Apache Flume logs its error messages and metadata information.
I searched the Apache Flume directory for captured error logs, but I didn't see any folder named log.
Could anyone help me with how to configure logging in Apache Flume?
Flume logs are in /var/log/flume-ng. This location is specified in the logging configuration file, /etc/flume-ng/conf/log4j.properties.
Dmitry is right, the log file location is specified in FLUME_HOME/conf/log4j.properties.
I just wanted to add that, in Apache Flume 1.5, the default log location is:
FLUME_HOME/logs/flume.log
The log file may not be generated in case Flume initialization failed - this usually means that Flume couldn't find Java, configuration files, etc.
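If you are unsure which location your install uses, you can read it out of log4j.properties. The Python sketch below assumes the file defines flume.log.dir and flume.log.file, which is what I recall from the stock configuration, so check your own file. It also assumes a relative directory resolves against whatever directory you pass in, here FLUME_HOME. The function names are made up, and a -D system property on the Flume command line would override these values.

```python
import os


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def flume_log_path(props, base_dir):
    """Resolve the log file path; property names are assumed, see the note above."""
    log_dir = props.get("flume.log.dir", "./logs")
    log_file = props.get("flume.log.file", "flume.log")
    # os.path.join keeps log_dir as-is when it is already absolute.
    return os.path.normpath(os.path.join(base_dir, log_dir, log_file))


sample = """
# excerpt in the style of a Flume log4j.properties
flume.root.logger=INFO,LOGFILE
flume.log.dir=./logs
flume.log.file=flume.log
"""
default_path = flume_log_path(parse_properties(sample), "/opt/flume")
packaged_path = flume_log_path({"flume.log.dir": "/var/log/flume-ng"}, "/opt/flume")
```

With the sample above, the first call resolves to a logs/flume.log file under the base directory, and the second to flume.log under /var/log/flume-ng, matching the two answers in this thread.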