I am working on automating a pipeline for processing of NGS data and am a bit confused about what services are available and/or appropriate for this task. My ideal workflow can be seen below:
Raw data comes off instrument and is stored in some S3 bucket
This dump is recognized by some service, could be airflow, cloudwatch, s3 invokes a lambda function, etc?
Whatever that trigger may be, it kicks off my nextflow workflow (runs my_workflow.nf) which processes the data and dumps it back to s3 for further downstream analysis
The question that I would really like help with is: What would be considered best practice or a suitable option for the bolded bullet point? What tool/service/utility could I use as a trigger to run a script?
Thanks in advance!
Related
I have a pipeline that listens to a Kafka topic that receives the s3 file-name & path. The pipeline has to read the file from S3 and do some transformation & aggregation.
I see the Flink has support to read the S3 file directly as source connector, but this use case is to read as part of the transformation stage.
I don't believe this is currently possible.
An alternative might be to keep a Flink session cluster running, and dynamically create and submit a new Flink SQL job running in batch mode to handle the ingestion of each file.
Another approach you might be tempted by would be to implement a RichFlatMapFunction that accepts the path as input, reads the file, and emits its records one by one. But this is likely to not work very well unless the files are rather small because Flink really doesn't like to have user functions that run for long periods of time.
I've never used AWS Glue however believe it will deliver what I want and am after some advice. I have a monthly CSV data upload that I push to S3 that has a staging Athena table (all strings) associated to it. I want Glue to perform a Create Table As (with all necessary convert/cast) against this dataset in Parquet format, and then move that dataset from one S3 bucket to another S3 bucket, so the primary Athena Table can access the data.
As stated, never used Glue before, and want a starter for 10, so I don't go down rabbit holes.
I currently perform all these steps manually, so want to understand how to use Glue to automate my manual tasks.
Yes, you can use AWS Glue ETL jobs to do exactly what you described. However, it doesn't perform CREATE TABLE AS SELECT queries, instead it does it with ETL jobs based on spark. Here is github repo that describes such process in quite detailed way and here is more of official AWS documentation on ETL programming based on AWS Glue service. After the initial setup, you can define some trigger events/scheduling to run your Glue ETL jobs automatically.
However, one thing to remember is cost of using AWS Glue services. Since it is based on execution time, sometimes it is not that trivial to forecast the final cost. For the workflow you described, performing CTAS queries with Athena would work just fine to transform your data and write it into a different s3 bucket. In this case you would know exactly price since it depends on the size of your data. Then you can use AWS API to do some manipulation with metadata catalog, so that new information would be accessible and in once place.
Since you are new to AWS Glue ETL jobs, I would suggest to stick with CTAS queries for simple tasks (although you can come up with quite complicated queries) and look into an open source project Apache Airflow for automation/scheduling and orchestration. This is the approach the I am using for tasks similar to yours. Airflow is easy to setup on both local and remote machines, has reach CLI and GUI for task monitoring, abstracts away all scheduling and retrying logic. It even has hooks to interact with AWS services. Hell, Airflow even provides you with a dedicated operator for sending queries to Athena. I wrote a little bit more about this approach here.
Trying to get a handle on what I would use to schedule and run jobs to move data into S3, run scripts on it and move it around s3 afterward.
My requirement is to be able to ingest from API's and also directly from databases. Some formats to ingest will be XML, and others could be flat files. The raw files need to be joined and transformed and turned into a format that graphs could be produced with.
What is AWS glue is like as an ETL tool? My specific question is can you see the finished pipelines showing the data sources and processing parts in a graphical view once they are created?
I have used Azure Data Factory - and it had a graphical UI to view and monitor the pipelines which I found quite useful. Just wondering if AWS glue has a similar thing.
If not - would Nifi on AWS S3 be a good way to do this?
Thanks
If you are looking for the best GUI, I would recommend NiFi. It is commonly used with S3 and has many connectors out of the box for other data sources. It becomes even more interesting if you want to do things outside of the AWS cloud.
That being said, I would think that Glue will also get the job done.
Running Data Factory when you have a heavy AWS footprint feels like an anti-pattern.
Full Disclosure: Have not worked with Glue/Data Factory and work for Cloudera, the driving force behind NiFi
I'm currently using AWS Glue to extract data from DB into s3, manipulate the data and save it back to Redshift/S3 or send via API to my client. AWS Glue GUI is not that good, you won't see a diagram of your flow and sometimes you will need to use other tools like step functions, airflow to orchestrate your job. Also, most of my jobs I have to use PySpark because AWS Glue methods are too limited.
Related to monitoring, you can see if there is an error, how many CPU and memory is been consumed by your job, s3 bytes read/written. If you want additional information you need to use logger or print to send it to the logs.
I am new to AWS so I needed some advice on how to correctly create background jobs. I've got some data (about 30GB) that I need to:
a) download from some other server; it is a set of zip archives with links within an RSS feed
b) decompress into S3
c) process each file or sometime group of decompressed files, perform transformations of data, and store it into SimpleDB/S3
d) repeat forever depending on RSS updates
Can someone suggest a basic architecture for proper solution on AWS?
Thanks.
Denis
I think you should run an EC2 instance to perform all the tasks you need and shut it down when done. This way you will pay only for the time EC2 runs. Depending on your architecture however you might need to run it all the times, small instances are very cheap however.
download from some other server; it is a set of zip archives with links within an RSS feed
You can use wget
decompress into S3
Try to use s3-tools (github.com/timkay/aws/raw/master/aws)
process each file or sometime group of decompressed files, perform transformations of data, and store it into SimpleDB/S3
Write your own bash script
repeat forever depending on RSS updates
One more bash script to check updates + run the script by Cron
First off, write some code that does a) through c). Test it, etc.
If you want to run the code periodically, it's a good candidate for using a background process workflow. Add the job to a queue; when it's deemed complete, remove it from the queue. Every hour or so add a new job to the queue meaning "go fetch the RSS updates and decompress them".
You can do it by hand using AWS Simple Queue Service or any other background job processing service / library. You'd set up a worker instance on EC2 or any other hosting solution that will poll the queue, execute the task, and poll again, forever.
It may be easier to use Amazon Simple Workflow Service, which seems to be intended for what you're trying to do (automated workflows). Note: I've never actually used it.
I think deploying your code on an Elasticbeanstalk Instance will do the job for you at scale. Because I see that you are processing a huge chunk of data here, and using a normal EC2 Instance might max out resources mostly memory. Also the AWS SQS idea of batching the processing will also work to optimize the process and effectively manage time outs on your server-side
I am implementing a Hadoop Map reduce job that needs to create output in multiple S3 objects.
Hadoop itself creates only a single output file (an S3 object) but I need to partition the output into multiple files.
How do I achieve this?
I did this by just writing the output directly from my reducer method to S3, using an S3 toolkit. Since I was running on EC2, this was quick and free.
In general, you want Hadoop to handle your input and output as much as possible, for cleaner mappers and reducers; and, of course, you want to write to S3 at the very end of your pipeline, to let Hadoop's code moving do it's job over HDFS.
In any case, I recommend doing all of your data partitioning, and writing entire output sets to S3 in a final reduce task, one set per S3 file. This puts as little writer logic in your code as possible. This paid off for me because I ended up with a minimal Hadoop S3 toolkit which I used for several task flows.
I needed to write to S3 in my reducer code because the S3/S3n filesystems weren't mature; they might work better now.
Do you also know the MultipleOutputFormat?
It's not related to S3, but in general it allows to write output to multiple files, implementing a given logic.