Adding new documents automatically to Amazon CloudSearch Domain

I have created a CloudSearch domain. Using the CLI, data is successfully uploaded to the domain. Data is copied from S3 to the CloudSearch domain using the command:
cs-import-documents -d searchdev3 --source s3://mybucket/html
I am wondering how data will be added to the search domain later, when a new file is added to the S3 bucket.
Can we do either of the following:
1) Create some kind of schedule that uploads new documents to the search domain, or
2) Automatically detect when a new file is added to S3 and upload it directly to the search domain?
Either of the above options seems feasible, but performing the upload manually every time does not seem like a good idea at all.

I'd use AWS Lambda, the event-processing service. It is pretty simple to set up an event stream based on S3 (see http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html).
Your Lambda would then submit a search document to CloudSearch based on the S3 event. For an example of submitting a document from a Lambda, see https://gist.github.com/fzakaria/4f93a8dbf483695fb7d5
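As a rough illustration, here is a minimal sketch of such a Lambda handler in Python with boto3; the document-service endpoint URL, document ID scheme, and field names are assumptions, not taken from the question:

import json
import urllib.parse

import boto3

s3 = boto3.client("s3")
# The document endpoint is specific to your search domain (hypothetical here).
cloudsearch = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-searchdev3-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
)

def handler(event, context):
    """Triggered by an S3 ObjectCreated event; pushes the new object into CloudSearch."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the newly added object from S3.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Build a CloudSearch "add" batch; the id and fields here are assumptions
        # about how you want to index the file.
        batch = [{
            "type": "add",
            "id": key.replace("/", "_"),
            "fields": {"content": body, "source": f"s3://{bucket}/{key}"},
        }]

        cloudsearch.upload_documents(
            documents=json.dumps(batch),
            contentType="application/json",
        )

You would hook this up by adding an S3 ObjectCreated event notification on the bucket that invokes the function, as described in the notification guide linked above.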

Related

How can we test the S3 bucket with the access keys (using the Amazon SDK)?

I am totally new to AWS. We have an S3 endpoint already created by our sysadmin and another S3 bucket (which I need to access files from). We are using the Amazon SDK (the Composer package aws/aws-sdk-php).
If the two Apache environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) are set for the S3 access keys, how can we easily test them without writing any code? Is there any frontend tool to check the connection?
I am trying to see whether the files in the S3 bucket have a particular name, and I plan to code this using PHP.

Is it possible for Vue.js to add data to a file in an AWS S3 bucket?

I am developing a simple application with Vue.js that combines a couple of strings into an array and gives it a new name. The problem now is how I can save the combined data somewhere so that I can access it in the future. I am hoping to save the data to AWS S3, although local storage is acceptable (such as the public folder in the Vue.js code structure).
You can't save data to anything local from a client application (browser-run), unless it's installed as a standalone app. What you need is the AWS S3 JavaScript SDK to upload directly into the bucket.

Event-driven Elastic Transcoder?

Is there a way to set up a transcoding pipeline on AWS such that it automatically transcodes any new files uploaded to a particular S3 bucket and places them in another bucket?
I know there is a REST API, and that in theory the uploader could also issue a REST request to the transcoder after it has uploaded the file, but for a variety of reasons, this isn't really an option.
This can now be accomplished using AWS Lambda.
Lambda basically allows you to trigger/run scripts based on events. You could easily create a Lambda function that runs as soon as a new file is uploaded to a designated S3 bucket. The function would then start a transcoding job for that newly uploaded video file.
This is literally one of the example use cases provided in the Lambda documentation.
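For illustration, here is a minimal sketch of that Lambda in Python with boto3; the pipeline ID, preset ID, and output key pattern are hypothetical placeholders for whatever your Elastic Transcoder pipeline uses:

import urllib.parse

import boto3

transcoder = boto3.client("elastictranscoder")

# Hypothetical values: replace with your own pipeline and preset IDs.
PIPELINE_ID = "1111111111111-abcde1"
PRESET_ID = "1351620000001-000010"

def handler(event, context):
    """Triggered by an S3 ObjectCreated event; starts a transcoding job for the new file."""
    for record in event["Records"]:
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        transcoder.create_job(
            PipelineId=PIPELINE_ID,
            Input={"Key": key},
            # The output bucket is configured on the pipeline itself;
            # the job only names the output key and preset.
            Output={"Key": f"transcoded/{key}", "PresetId": PRESET_ID},
        )

Because the pipeline defines both the input and output buckets, the Lambda only has to pass along the uploaded object's key and the preset to use.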

Using data present in S3 inside EMR mappers

I need to access some data during the map stage. It is a static file, from which I need to read some data.
I have uploaded the data file to S3.
How can I access that data while running my job in EMR?
If I just specify the file path as:
s3n://<bucket-name>/path
in the code, will that work?
Thanks
The s3n:// URL scheme is for Hadoop to read S3 files. If you want to read the S3 file in your map program, you either need to use a library that handles the s3:// URL format, such as JetS3t (https://jets3t.s3.amazonaws.com/toolkit/toolkit.html), or access the S3 object via HTTP.
A quick search for an example program brought up this link: https://gist.github.com/lucastex/917988
You can also access the S3 object over HTTP or HTTPS. This may require making the object public or configuring additional security. You can then fetch it using the HTTP URL classes supported natively by Java.
Another good option is to use S3DistCp as a bootstrap step to copy the S3 file to HDFS before your map step starts: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
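If your map code happened to be Python (for example a Hadoop Streaming job) rather than Java, a minimal sketch of the library approach, using boto3 in place of JetS3t, might look like this; the bucket and key names are hypothetical:

import sys

import boto3

# Hypothetical bucket/key for the static side file uploaded to S3.
SIDE_BUCKET = "mybucket"
SIDE_KEY = "path/lookup.txt"

def load_side_data():
    """Read the static file from S3 once, before processing any input records."""
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=SIDE_BUCKET, Key=SIDE_KEY)["Body"].read().decode("utf-8")
    return set(body.splitlines())

def main():
    lookup = load_side_data()
    # Standard streaming-mapper loop: emit only lines whose first field is in the lookup set.
    for line in sys.stdin:
        key = line.split("\t", 1)[0]
        if key in lookup:
            sys.stdout.write(line)

if __name__ == "__main__":
    main()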
What I ended up doing:
1) Wrote a small script that copies my file from S3 to the cluster:
hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt $DESTINATION_DIR_ON_HOST
2) Created a bootstrap step for my EMR job that runs the script from 1).
This approach doesn't require making the S3 data public.

Parallel Download from S3 to EC2

I was reading this blog entry about parallel uploads to S3 using boto. Near the end it suggests a few tools for downloading using multiple connections (axel, aria2, and lftp). How can I go about using these with S3? I don't know how to pass the authentication keys to Amazon to access the file. I can, however, make the file public temporarily, but this solution is not optimal.
Generate a signed URL using the AWS API and use that for your downloads. Only someone with the signed URL (which expires after the given timeout) can download the file.
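Since the blog post you mention uses boto, here is a minimal sketch with boto3; the bucket, key, and expiry are placeholders:

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key; the URL is valid for one hour.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "mybucket", "Key": "path/to/large-file.bin"},
    ExpiresIn=3600,
)
print(url)

You could then pass the printed URL to something like aria2c -x 16 -s 16 "<url>" or axel -n 16 "<url>" to open multiple connections; quote the URL, since presigned URLs contain & characters.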