How to code/implement AWS Batch? - amazon-s3

I am pretty new to AWS Batch, though I have prior experience with batch implementation.
I have gone through this link and understood how to configure a batch job on AWS.
I have to implement a simple batch job that reads an incoming pipe-separated file, extracts the data from it, performs some transformation on that data, and then saves each line as a separate file in S3.
But I didn't find any document or example showing the implementation part. All, or at least most, documents talk only about AWS Batch configuration.
Any pointers on the coding/implementation part? I will be using Java for the implementation.

This might help you. The code is in Python though!
awslabs/aws-batch-genomics

AWS Batch runs your Docker containers as jobs, so the implementation is not limited to particular languages.
For Java, you could copy your jars into the image in the Dockerfile and provide an ENTRYPOINT or CMD to start your code when the job is started in AWS Batch.
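As a concrete sketch of the job itself (shown in Python for brevity; the same structure carries over directly to a Java jar), here is the read/transform/split flow the question describes. The trim-fields transformation and the file names are placeholders for the real business logic, and the S3 upload is only indicated in a comment:

```python
import sys
from pathlib import Path


def transform(line: str) -> str:
    """Hypothetical transformation: trim whitespace around each
    pipe-separated field. Replace with the real business logic."""
    return "|".join(field.strip() for field in line.split("|"))


def process(text: str) -> list:
    """Split the input into records and transform each non-empty line."""
    return [transform(line) for line in text.splitlines() if line.strip()]


def main(input_path: str, out_dir: str) -> None:
    records = process(Path(input_path).read_text())
    for i, record in enumerate(records):
        # Inside the Batch container this write would instead be an S3
        # upload, e.g. with boto3:
        #   boto3.client("s3").put_object(Bucket=bucket,
        #                                 Key="out/record-%d.txt" % i,
        #                                 Body=record)
        Path(out_dir, "record-%d.txt" % i).write_text(record)


if __name__ == "__main__" and len(sys.argv) >= 3:
    main(sys.argv[1], sys.argv[2])
```

The image would then end with something like `CMD ["python", "batch_job.py", "input.psv", "/out"]` (or the equivalent `java -jar ...` invocation), which is what AWS Batch executes when the job starts.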

Related

FLINK: Is it possible in the same flink job to read data from kafka topic (file names) and then read files content from amazon s3?

I have a use case where I need to process data from files stored in S3 and write the processed data to local files.
The s3 files are constantly added to the bucket.
Each time a file is added to the bucket, the full path is published to a kafka topic.
I want to achieve the following in a single job:
1. Read the file names from Kafka (unbounded stream).
2. An evaluator that receives a file name, reads the content from S3 (second source) and creates a DataStream.
3. Process the DataStream (applying some logic to each row).
4. Sink to file.
I managed to implement the first, third and fourth parts of the design.
Is there a way to achieve this?
Thanks in advance.
I don't believe there's any straightforward way to do this.
To do everything in a single job, maybe you could convince the FileSource to use a custom FileEnumerator that gets the paths from Kafka.
A simpler alternative would be to launch a new (bounded) job for every file to be ingested. The file to be read could be passed in as a parameter.
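That simpler alternative can be sketched as a small driver that submits one bounded job per path via the Flink CLI (`flink run -d`). The job jar name and its `--input` flag are hypothetical placeholders; in practice the paths would come from a Kafka consumer, for which stdin stands in here to keep the sketch self-contained:

```python
import subprocess
import sys


def build_submit_cmd(jar_path: str, s3_path: str) -> list:
    """Build the CLI command submitting one bounded job for one file.
    The jar name and the --input flag are illustrative placeholders."""
    return ["flink", "run", "-d", jar_path, "--input", s3_path]


def main() -> None:
    # In practice this loop would poll a Kafka consumer for new paths;
    # stdin is used here only so the sketch has no external dependencies.
    for line in sys.stdin:
        path = line.strip()
        if path:
            subprocess.run(build_submit_cmd("process-file.jar", path),
                           check=True)
```

Each submitted job is bounded (it reads exactly one file), so it terminates on its own once the file is processed.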
This is possible to implement in general but, as David Anderson has already suggested, there is currently no straightforward way to do this with the vanilla Flink connectors.
Another approach would be to write the pipeline in Apache Beam, which already supports this and can use Flink as a runner (which is proof that it can be implemented with the existing primitives).
I think this is a legitimate use case that Flink should eventually support out of the box.

AWS Glue check file contents correctness

I have a project in AWS to insert data from some files, which will be in S3, into Redshift. The point is that the ETL has to be scheduled each day to find new files in S3 and then check whether those files are correct. However, this has to be done with custom code, as the files can have different formats depending on their kind, provider, etc.
I see that AWS Glue allows you to schedule, crawl and do the ETL. However, I'm lost on how one can write custom code for the ETL and parse the files to check their correctness before ending up running the COPY instruction from S3 to Redshift. Do you know if that can be done, and how?
Another issue is that if the correctness check passes, the system should upload the data from S3 to a web service via some API. But if it doesn't, the file should be left in an FTP/email instead. Here again, do you know if that can be done with AWS Glue as well, and how?
many thanks!
You can write your Glue/Spark code, upload it to S3 and create a Glue job referring to that script/library. Anything you want to write in Python can be done in Glue; it's just a wrapper around Spark, which in turn uses Python.
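Since the check is plain Python, the correctness logic can live in an ordinary function inside the Glue script. A minimal sketch, assuming a hypothetical per-provider schema expressed as an expected column count (the provider names and counts are made up):

```python
import csv
import io

# Hypothetical registry: provider name -> expected number of columns.
EXPECTED_COLUMNS = {"provider_a": 5, "provider_b": 8}


def validate(provider: str, text: str, delimiter: str = ",") -> list:
    """Return a list of error messages; an empty list means the file is OK."""
    expected = EXPECTED_COLUMNS.get(provider)
    if expected is None:
        return ["unknown provider: %s" % provider]
    errors = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for lineno, row in enumerate(reader, start=1):
        if len(row) != expected:
            errors.append("line %d: expected %d columns, got %d"
                          % (lineno, expected, len(row)))
    return errors
```

In the Glue job you would fetch the object (e.g. with boto3's `get_object`), run `validate`, and only issue the Redshift COPY or the API upload when the list comes back empty; otherwise route the file to the failure location.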

Does Informatica Powercenter provide API to access session logs

Question - Does Informatica PowerCenter provide an API to access session logs? I believe no, but wanted to ask the forum to be sure.
Objective - I want to extract session logs and process them through Logstash to perform reactive analytics periodically.
Alternative - The same could be solved using a Logstash input plugin for Informatica, but I did not find that either.
Usage - This will be used to determine common causes, analyze usage of cache at the session level, throughput, and any performance bottlenecks.
You can call Informatica Webservice's getSessionLog. Here's a sample blog post with details: http://www.kpipartners.com/blog/bid/157919/Accessing-Informatica-Web-Services-from-3rd-Party-Apps
I suppose that the correct answer is 'yes', since there is a command-line tool to convert log files to txt or even xml format.
The tool for session/workflow logs is called infacmd with the 'getsessionlog' argument. You can look it up in the help section of your PowerCenter clients or here:
https://kb.informatica.com/proddocs/Product%20Documentation/5/IN_101_CommandReference_en.pdf
That has always been enough for my needs.
But there is more to look into: when you run this command-line tool (which is really a BAT file), a java.exe does the bulk of the processing in a sub-process. The jar files used by that process could potentially be utilized by somebody else directly, but I don't know if that has been documented anywhere publicly.
Perhaps someone else knows the answer to that.
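Whichever route produces the log text (the getSessionLog web service or infacmd), feeding it into Logstash can be as simple as sending one JSON event per log to a TCP input. A sketch, assuming a Logstash pipeline configured with a `tcp` input and `json` codec on port 5000 (host and port are placeholders):

```python
import json
import socket


def to_event(session: str, log_text: str) -> str:
    """Wrap one session log as a single-line JSON event for Logstash."""
    return json.dumps({"session": session, "message": log_text})


def ship(session: str, log_text: str,
         host: str = "localhost", port: int = 5000) -> None:
    # Assumes Logstash is listening with:
    #   input { tcp { port => 5000 codec => json } }
    with socket.create_connection((host, port)) as conn:
        conn.sendall((to_event(session, log_text) + "\n").encode("utf-8"))
```

From there the usual Logstash filters can extract throughput, cache usage and error patterns from the message text.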

Extract data from MarkLogic 8.0.6 to AWS S3

I'm using MarkLogic 8.0.6 and we have JSON documents in it. I need to extract a lot of data from MarkLogic and store it in AWS S3. We tried to run mlcp locally and then upload the data to AWS S3, but it's very slow because it generates a lot of files.
Our MarkLogic platform is already connected to S3 to perform backups. Is there a way to extract a specific database to AWS S3?
It would be OK for me to have one big file with one JSON document per line.
Thanks,
Romain.
I don't know about getting it to s3, but you can use CORB2 to extract MarkLogic documents to one big file with one JSON document per line.
s3:// is a native file path type in MarkLogic, so you can also iterate through all your docs and export them with xdmp:save("s3://...").
If you want to make aggregates, you may want to marry this idea with Sam's suggestion of CORB2 to control the process and assist in grouping your whole database into multiple manageable aggregate documents, then use a post-back task to run xdmp:save.
Thanks guys for your answers. I did not know about CORB2; this is a great solution! But unfortunately, due to bad I/O, I'd prefer a solution that writes directly to S3.
I can use a basic ML query and dump to s3:// with the native connector, but I always face memory errors, even when launching with the "spawn" function to generate a background process.
Do you have any XQuery example to extract each document to S3 one by one without hitting memory errors?
Thanks
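One way to keep memory flat is to pull documents one at a time over MarkLogic's REST API instead of in a single server-side query, appending each JSON document as one line to a single output file (matching the "one document per line" goal). A sketch, assuming a REST app server on port 8000 with digest authentication in the default "public" realm; host, port, credentials and the source of the URI list are placeholders:

```python
import json
import urllib.request
from urllib.parse import urlencode


def document_url(host: str, port: int, uri: str) -> str:
    """Build the REST endpoint URL for fetching a single document."""
    return "http://%s:%d/v1/documents?%s" % (host, port,
                                             urlencode({"uri": uri}))


def dump_documents(uris, out_path, host="localhost", port=8000,
                   user="admin", password="admin"):
    """Fetch each document individually and append it as one line,
    so only one document is held in memory at a time."""
    auth = urllib.request.HTTPDigestAuthHandler()
    auth.add_password("public", "http://%s:%d/" % (host, port),
                      user, password)
    opener = urllib.request.build_opener(auth)
    with open(out_path, "a", encoding="utf-8") as out:
        for uri in uris:
            with opener.open(document_url(host, port, uri)) as resp:
                doc = json.loads(resp.read().decode("utf-8"))
            out.write(json.dumps(doc) + "\n")
```

The URI list itself could come from CORB2 or a paged cts:uris call, and the resulting aggregate file can then be pushed to S3 with the AWS CLI.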

Moving files >5 gig to AWS S3 using a Data Pipeline

We are experiencing problems with files produced by Java code that are written locally and then copied by the Data Pipeline to S3. The error mentions file size.
I would have thought that if a multipart upload is required, the Pipeline would figure that out. I wonder if there is a way of configuring the Pipeline so that it indeed uses multipart uploading. Otherwise the current Java code, which is agnostic about S3, either has to write directly to S3 or has to do what it used to do and then use multipart uploading -- in fact, I would think the code would just write directly to S3 and not worry about uploading.
Can anyone tell me if Pipelines can use multipart uploading? If not, is the correct approach to have the program write directly to S3, or to continue to write to local storage and then have a separate program, invoked within the same Pipeline, do the multipart upload?
The answer, based on AWS support, is that files over 5 GB indeed can't be uploaded directly to S3 this way. And there is currently no way for a Data Pipeline to say, "You are trying to upload a large file, so I will do something special to handle this." It simply fails.
This may change in the future.
Data Pipeline CopyActivity does not support files larger than 4 GB. http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html
This is below the 5 GB limit imposed by S3 for each single-part PUT.
You need to write your own script wrapping the AWS CLI or s3cmd (older). This script may be executed as a shell activity.
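A minimal sketch of such a wrapper, assuming the AWS CLI is installed on the Pipeline's resource. `aws s3 cp` switches to multipart upload automatically above a configurable threshold (`multipart_threshold` in the CLI's S3 transfer configuration), so files over 5 GB go through:

```python
import subprocess


def build_cmd(local_path: str, s3_uri: str) -> list:
    """Build the AWS CLI copy command; `aws s3 cp` handles multipart
    uploads automatically for large files."""
    return ["aws", "s3", "cp", local_path, s3_uri]


def upload(local_path: str, s3_uri: str) -> None:
    """Run the copy and raise if the CLI reports a failure."""
    subprocess.run(build_cmd(local_path, s3_uri), check=True)
```

Invoked from the shell activity as, e.g., `python upload.py /tmp/big.dat s3://bucket/big.dat`, this sidesteps the CopyActivity size limit entirely.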
Writing directly to S3 may be an issue, as S3 does not support append operations - unless you can somehow write multiple smaller objects to a folder.