I want to do sentiment analysis of Twitter data using Apache Hive and Flume. I have a Twitter account and I have set up the conf file, but the problem is the format of the data: it is not loading into Hive. Kindly help me; I have been working on this for a month.
It sounds like you are able to configure the Flume agent to fetch data from Twitter; your problem is the format of the data.
Apache Flume offers several sink types. Two of them are useful for your requirement:
HDFS Sink
Hive Sink
Using HDFS Sink:
Configure the Flume agent with TwitterSource and an HDFS sink (a sample configuration is sketched below).
Provide your Twitter OAuth details, i.e., the consumer and access keys, to the Flume agent.
Once the agent configuration is done, start it.
The agent will fetch the data (tweets) from Twitter and store it in the HDFS path as JSON documents.
Once the data is available in HDFS, create a Hive external table that uses the JSON SerDe, with a LOCATION clause pointing at that path.
JSON SerDe Code link: https://github.com/cloudera/cdh-twitter-example/blob/master/hive-serdes/src/main/java/com/cloudera/hive/serde/JSONSerDe.java
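For illustration, a minimal sketch of such an agent configuration (agent/component names, paths, and keys are placeholders; the source class here assumes the TwitterSource from the cdh-twitter-example project linked above, which emits raw JSON):

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Twitter source with your OAuth keys
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer-key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer-secret>
TwitterAgent.sources.Twitter.accessToken = <access-token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access-token-secret>

# Simple in-memory channel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000

# HDFS sink writing the raw JSON as text files
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

And a trimmed-down external table over that path using the JSON SerDe above (the column list is shortened; adjust it to the tweet fields you need):

CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  `user` STRUCT<screen_name:STRING, followers_count:INT>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets/';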
Using Hive Sink:
Flume also allows writing data directly into a Hive table using the Hive sink. In that case we need to configure the Flume agent as follows:
TwitterSource --> Channel --> Hive Sink
Configure the Flume agent with TwitterSource and a Hive sink (a sample sink configuration is sketched after this list).
Provide your Twitter OAuth details, i.e., the keys, to the Flume agent.
Once the agent configuration is done, start it.
The agent will fetch the data (tweets) from Twitter and store it in the Hive table, using a JSON SerDe.
The Hive sink has a parameter called serializer that tells it which SerDe to use.
Supported serializers: DELIMITED and JSON
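A minimal sketch of the sink portion of such an agent (metastore URI, database, and table names are placeholders; note that per the Flume User Guide the target table must already exist and be set up for Hive streaming, i.e., bucketed, stored as ORC, and transactional):

TwitterAgent.sinks = HiveSink
TwitterAgent.sinks.HiveSink.type = hive
TwitterAgent.sinks.HiveSink.channel = MemChannel
TwitterAgent.sinks.HiveSink.hive.metastore = thrift://<metastore-host>:9083
TwitterAgent.sinks.HiveSink.hive.database = default
TwitterAgent.sinks.HiveSink.hive.table = tweets
# JSON serializer maps JSON field names directly to Hive column names
TwitterAgent.sinks.HiveSink.serializer = JSON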
So please configure your Flume agent using either of the above approaches.
Please use the Flume User Guide for more details about the sink parameters (HDFS and Hive):
https://flume.apache.org/FlumeUserGuide.html
You can try adding this jar file: hive-serdes-1.0-SNAPSHOT.jar
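If you go the external-table route, the jar can be registered in your Hive session before creating or querying the table, for example (the path is a placeholder):

ADD JAR /path/to/hive-serdes-1.0-SNAPSHOT.jar;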
You can follow the blog below for a complete reference on performing sentiment analysis using Hive.
https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/
Is it possible to read events as they land in an S3 source bucket via Apache Flink, process them, and sink them back to some other S3 bucket? Is there a special connector for that, or do I have to use the available read/save examples mentioned in Apache Flink?
How does checkpointing happen in such a case? Does Flink keep track of what it has read from the S3 source bucket automatically, or does custom code need to be built? Does Flink also guarantee exactly-once processing in the S3 source case?
In Flink 1.11 the FileSystem SQL Connector is much improved; that will be an excellent solution for this use case.
With the DataStream API you can use FileProcessingMode.PROCESS_CONTINUOUSLY with readFile to monitor a bucket and ingest new files as they are atomically moved into it. Flink keeps track of the last-modified timestamp of the bucket, and ingests any children modified since that timestamp -- doing so in an exactly-once way (the read offsets into those files are included in checkpoints).
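As a rough sketch of that DataStream approach (bucket paths and intervals are placeholders, and it assumes an S3 filesystem plugin is available to the job):

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class S3ToS3Job {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // read positions are stored in these checkpoints

        String inputPath = "s3://source-bucket/incoming/"; // placeholder

        // Monitor the source prefix and ingest files as they appear, re-scanning every 60s
        DataStream<String> lines = env.readFile(
                new TextInputFormat(new Path(inputPath)),
                inputPath,
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                60_000L);

        // ... apply whatever processing you need on the stream here ...

        // Write the results back out to another bucket as rolling text files
        StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path("s3://target-bucket/output/"), // placeholder
                        new SimpleStringEncoder<String>("UTF-8"))
                .build();
        lines.addSink(sink);

        env.execute("s3-to-s3");
    }
}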
Has anyone had success sinking data from Flume to Splunk?
I've tried the Thrift and Avro Flume sinks, but they have issues: they aren't really great formats for Splunk, and Flume keeps retrying events over and over again after they've already been sunk.
I'm looking into the Flume HTTP sink to Splunk's HEC, but I can't see how to set the HEC token in the header. Has anyone configured the HEC token in the header for the Flume HTTP sink?
I'm considering just doing a file sink that is forwarded to Splunk, but I would like to avoid this temporary file if possible.
Advice?
I ended up just making a rolling file sink and copying those files over to a directory monitored by a Splunk forwarder. Not ideal or performant, but good enough for our use case.
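A rolling-file sink of that kind can be configured along these lines (agent, channel, and sink names and the directory are placeholders; sink.directory is the directory the Splunk forwarder then monitors):

agent.sinks = splunkFiles
agent.sinks.splunkFiles.type = file_roll
agent.sinks.splunkFiles.channel = memChannel
agent.sinks.splunkFiles.sink.directory = /var/spool/flume/splunk
agent.sinks.splunkFiles.sink.rollInterval = 30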
After finding out about Hive LLAP, I really want to use it.
I started an Azure HDInsight cluster with LLAP enabled. However, it doesn't seem to work any better than normal Hive. I have data stored in Parquet files; I only see ORC files mentioned in LLAP-related docs and talks.
Does it also support Parquet format?
Answering my own question.
We reached out to Azure support. Hive LLAP only works with the ORC file format (as of May 2017).
So with Parquet we either have to use Apache Impala (https://impala.incubator.apache.org) for fast interactive queries as an alternative to LLAP, or change the storage format to ORC.
Update: This is currently work in progress and will no longer be the case with the next release of HDP. As of HDP 3.0, LLAP will support caching for the Parquet file format. This update should flow into HDInsight shortly after the 3.0 release.
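If converting to ORC is the route taken, a simple CTAS is one option; a minimal sketch (table names are placeholders):

CREATE TABLE tweets_orc
STORED AS ORC
AS SELECT * FROM tweets_parquet;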
I have tried to connect to Amazon DynamoDB from Pentaho using a generic database connection, but I'm unable to connect. How can I query Amazon DynamoDB using Pentaho?
Assumption: no DynamoDBInput or DynamoDBQuery Kettle plugins are available.
My Suggested Solutions:
You can write your own Kettle plugin using the AWS DynamoDB Java SDK.
Bonus: Contribute it to the community :)
You can write a Java class that does what you need using the AWS DynamoDB SDK, build a jar file that has all the dependencies in it, and drop the jar into the Kettle lib directory.
Then use the "User Defined Java Class" step: create an instance of your class, pass the parameters from the stream as input to the DynamoDB query, and pass any output and error codes/messages back to the stream (a minimal sketch of such a class is shown below).
Both of these solutions are reusable. I have used the second solution to connect to other data sources and it works well.
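A minimal sketch of such a class, assuming the AWS SDK for Java v1 document API (the region, table, and key names are placeholders):

import com.amazonaws.regions.Regions;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;

// Hypothetical helper class to call from a "User Defined Java Class" step.
public class DynamoDbLookup {
    private final DynamoDB dynamoDB;

    public DynamoDbLookup() {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard()
                .withRegion(Regions.US_EAST_1) // placeholder region; credentials come from the default provider chain
                .build();
        this.dynamoDB = new DynamoDB(client);
    }

    // Fetch a single item by its hash key and return it as JSON (or null if not found).
    public String getItemAsJson(String tableName, String hashKeyName, String hashKeyValue) {
        Table table = dynamoDB.getTable(tableName);
        Item item = table.getItem(hashKeyName, hashKeyValue);
        return item == null ? null : item.toJSON();
    }
}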
Good luck.
How can I pull data from a Kinesis stream using a Pig script locally?
I noticed this example in the Amazon documentation, but I'm not sure how to import the Amazon Kinesis Pig libraries, and the example seems incomplete. Where do I set the credentials? Where can I get the jar for the Kinesis library... etc. Their example is from a Grunt shell, but how would I run it locally from my own machine?
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/kinesis-pig-generate-data.html
REGISTER ./lib/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE REGEX_EXTRACT org.apache.pig.piggybank.evaluation.string.RegexExtract();
raw_logs = load 'AccessLogStream' using com.amazon.emr.kinesis.pig.KinesisStreamLoader('kinesis.iteration.timeout=1') as (line:chararray);
DUMP raw_logs;