Amazon Kinesis from a Pig script - apache-pig

How can I pull data from a Kinesis stream using a Pig script locally?
I noticed this example in the Amazon documentation, but I'm not sure how to import the Amazon Kinesis Pig libraries, and the example seems incomplete. Where do I set the credentials, and where can I get the JAR for the Kinesis library, etc.? Their example is run from the Grunt shell, but how would I run it locally from my own machine?
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/kinesis-pig-generate-data.html
REGISTER ./lib/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE REGEX_EXTRACT org.apache.pig.piggybank.evaluation.string.RegexExtract();
-- Load raw records from the Kinesis stream 'AccessLogStream' as lines of text
raw_logs = load 'AccessLogStream' using com.amazon.emr.kinesis.pig.KinesisStreamLoader('kinesis.iteration.timeout=1') as (line:chararray);
DUMP raw_logs;

Related

Export data from QlikSense cloud to AWS S3 bucket

I am trying to create a pipeline to export QlikSense app data to an AWS S3 bucket, but I'm not sure if I can do it directly.
Two options I have tried:
1. Use the export API to export the data as a QVF to local disk, then connect to S3 via a Python script and push the data.
2. Store the data as a CSV using a QlikSense script locally and then push it to S3.
Basically, my ideal solution would be a single Python script that connects to QlikSense, reads the data, converts it to CSV, and exports it to S3.
Any ideas/code/approach would be helpful.
QlikSense is, in general, a tool for consuming and displaying data, but when you need to get data out,
STORE INTO vFile (text);
is an easy option. That way you get a CSV file saved to disk. If you are running the Python script on the same machine, it is easy, because you can use an EXECUTE statement to run a Python script that pushes the file to S3 using the boto3 library.
Otherwise, use a network drive instead and then have a cron job push the data from there to S3 via Python.
A last option is to map the S3 bucket on the QlikSense server; there are a few tools for that, such as TnTDrive.
So much depends on your infrastructure.

Flink Streaming AWS S3 read multiple files in parallel

I am new to Flink. My understanding is that the following API call
StreamExecutionEnvironment.getExecutionEnvironment().readFile(format, path)
will read the files in parallel for the given S3 bucket path.
We are storing log files in S3. The requirement is to serve multiple client requests that read from different timestamped folders.
For my use case, to serve multiple client requests, I am evaluating Flink: I want it to read from several different S3 file paths in parallel.
Is it possible to achieve this in a single Flink job? Any suggestions?
Documentation about Flink's S3 file system support can be found in the Flink documentation.
You can read from the different directories and use the union() operator to combine the records from all of them into one stream; a minimal sketch is shown after the snippet below.
It is also possible to read nested files by using something like (untested):
TextInputFormat format = new TextInputFormat(path);
Configuration config = new Configuration();
// Enable recursive enumeration so files in nested directories are picked up.
config.setBoolean("recursive.file.enumeration", true);
format.configure(config);
env.readFile(format, path);
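For the union() part, here is a minimal sketch of reading two folders and merging them into one stream; the bucket and folder names are hypothetical, and the cluster needs an S3 filesystem dependency (for example flink-s3-fs-hadoop) configured as described in the documentation above:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MultiFolderS3Read {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical timestamped folders requested by different clients.
        DataStream<String> clientA = env.readTextFile("s3://my-log-bucket/logs/2018-01-01/");
        DataStream<String> clientB = env.readTextFile("s3://my-log-bucket/logs/2018-01-02/");

        // union() merges the records from both folders into a single stream.
        DataStream<String> allLogs = clientA.union(clientB);

        allLogs.print();
        env.execute("read-multiple-s3-folders");
    }
}

Each file source is split across the environment's parallelism, so both folders can be consumed in parallel within a single job.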

Is there a way to specify a file extension for the files saved to S3 by Kinesis Firehose

I am setting up a Kinesis Firehose stream, and everything works well: the delimited files are getting created on S3. But I was wondering if there is a way to specify an extension for these files, since the consumer of the files requires them to be either .csv or .txt. Is there any way of doing this?
Thanks
You can create an S3 trigger to a Lambda function and rename the object from there.
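Here is a minimal sketch of that Lambda approach, assuming a Java function wired to the bucket's object-created notification; the handler class name and the .csv suffix are placeholders, and it relies on the AWS SDK for Java v1 plus the aws-lambda-java-events library:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Hypothetical handler invoked by an S3 "object created" notification on the
// Firehose delivery prefix; it re-keys each new object with a .csv extension.
public class AddCsvExtensionHandler implements RequestHandler<S3Event, String> {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    @Override
    public String handleRequest(S3Event event, Context context) {
        event.getRecords().forEach(record -> {
            String bucket = record.getS3().getBucket().getName();
            String key = record.getS3().getObject().getKey();

            // Skip objects that already have the extension so the copy created
            // below does not re-trigger this function in a loop.
            if (key.endsWith(".csv")) {
                return;
            }

            // S3 has no real rename: copy to a new key with the extension,
            // then delete the original object.
            s3.copyObject(bucket, key, bucket, key + ".csv");
            s3.deleteObject(bucket, key);
        });
        return "done";
    }
}

In practice you would also scope the S3 event notification to the Firehose delivery prefix so the function only fires for newly delivered Firehose objects.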
I was not able to get an extension for the files generated by Firehose, but I ended up using Data Pipeline to do this, via the ShellCommandActivity component, which lets you run shell commands on the files in Amazon S3 and write the resulting files to S3 or any other location you'd like.

Sentiment Analysis of Twitter Data

I want to do sentiment analysis of Twitter data using Apache Hive and Flume. I have a Twitter account and I have set up the configuration file, but the problem is the format of the data: it is not loading into Hive. Kindly help me; I have been working on this for a month.
It sounds like you are already able to configure the Flume agent to fetch the data from Twitter; your problem is the format of the data.
Apache Flume offers several sink types. Two of them are useful for your requirement:
HDFS Sink
Hive Sink
Using HDFS Sink:
Configure the Flume agent with TwitterSource and HDFS Sink.
Provide your Twitter OAuth details (keys) to the Flume agent.
Once the agent configuration is done, start it.
The agent will fetch the data (tweets) from Twitter and store it in an HDFS path as JSON documents.
Once the data is available in HDFS, create a Hive external table with a JSON SerDe and a LOCATION clause pointing at that path.
JSON SerDe Code link: https://github.com/cloudera/cdh-twitter-example/blob/master/hive-serdes/src/main/java/com/cloudera/hive/serde/JSONSerDe.java
Using Hive Sink:
Flume can write data directly into a Hive table using the Hive Sink, so configure the Flume agent as follows:
TwitterSource --> Channel --> Hive Sink
Configure the Flume agent with TwitterSource and Hive Sink.
Provide your Twitter OAuth details (keys) to the Flume agent.
Once the agent configuration is done, start it.
The agent will fetch the data (tweets) from Twitter and store it in the Hive table, using a JSON SerDe.
The Hive Sink has a parameter called serializer that specifies the type of SerDe; the supported serializers are DELIMITED and JSON.
So configure your Flume agent using either of the approaches above.
See the Flume User Guide for more details about the HDFS and Hive sink parameters:
https://flume.apache.org/FlumeUserGuide.html
You can also try adding this JAR file:
hive-serdes-1.0-SNAPSHOT.jar
You can follow the blog below for a complete reference on performing sentiment analysis using Hive.
https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/

Using data present in S3 inside EMR mappers

I need to access some data during the map stage. It is a static file from which I need to read some reference data.
I have uploaded the data file to S3.
How can I access that data while running my job in EMR?
If I just specify the file path as:
s3n://<bucket-name>/path
in the code, will that work?
Thanks
The s3n:// URL scheme is for Hadoop itself to read files from S3. If you want to read the S3 file in your map program, you either need to use a library that handles the S3 URL format, such as JetS3t (https://jets3t.s3.amazonaws.com/toolkit/toolkit.html), or access the S3 objects via HTTP.
A quick search for an example program brought up this link.
https://gist.github.com/lucastex/917988
You can also access the S3 object over HTTP or HTTPS. This may require making the object public or configuring additional security, but then you can read it using the URL classes supported natively by Java.
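For example, if the object is public (or you have a pre-signed URL), plain java.net.URL is enough; the bucket and key below are made up:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ReadS3ObjectOverHttp {
    public static void main(String[] args) throws Exception {
        // Hypothetical public object; private objects need a pre-signed URL or auth headers.
        URL url = new URL("https://my-bucket.s3.amazonaws.com/path/static-data.txt");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}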
Another good option is to use S3DistCp as a bootstrap step to copy the S3 file to HDFS before your map step starts: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
What I ended up doing:
1) Wrote a small script that copies my file from S3 to the cluster:
hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt $DESTINATION_DIR_ON_HOST
2) Created a bootstrap step for my EMR job that runs the script from 1).
This approach doesn't require making the S3 data public.
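As a sketch of how the copied file can then be consumed, the mapper can load it from the local path in setup(); the path and the tab-separated key/value format below are assumptions that must match whatever the bootstrap script actually wrote:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StaticLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Hypothetical local path; must match $DESTINATION_DIR_ON_HOST from the bootstrap script.
        try (BufferedReader reader = new BufferedReader(new FileReader("/mnt/static-data/file.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit each input line together with its (hypothetical) lookup value.
        String enriched = lookup.getOrDefault(value.toString().trim(), "UNKNOWN");
        context.write(value, new Text(enriched));
    }
}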