I wish to automate runs of SQL (DDLs and DMLs) against our AWS Redshift cluster, i.e. as soon as someone merges a SQL file into the S3 bucket, it should run in the configured environment, say dev, preprod & prod.
Is there any way I can do this?
My investigation suggests that AWS CodePipeline is one possible solution; however, I am not sure how I would connect to the Redshift database from CodePipeline.
Another way is using a Lambda function, but I believe it has a 5-minute limit, and some of the DDL/DML might take more than 5 minutes to run.
Regards,
Shay
There are a lot of choices out there, and which is best will depend on many factors, including your team's skill set and your budget. I'll let the community weigh in on all the possibilities.
I'd advise looking at the AWS serverless ecosystem to perform these functions. First off, the Lambda limit is now 15 minutes, but that really isn't the important part. The most important development is the Redshift Data API, which lets one Lambda start a query and other Lambdas check on its completion later. See: https://docs.aws.amazon.com/redshift/latest/mgmt/data-api.html
With the Redshift Data API for fire-and-forget access to Redshift, and Step Functions to orchestrate the Lambda functions, you can create a low-cost, lightweight infrastructure to perform all sorts of integrations and actions, including triggering other tools/services as needed. This is not the best approach in all cases, but Lambda-based solutions should not be excluded because of runtime limits.
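As a rough illustration of that pattern, here is a minimal sketch with boto3 and the Redshift Data API: one Lambda is triggered by the S3 put and submits the SQL, a second one (driven by a Step Functions wait/poll loop) checks on it later. The cluster identifier, database, and secret ARN are placeholders, not anything from your setup.

import boto3

rsd = boto3.client("redshift-data")
s3 = boto3.client("s3")

def submit_handler(event, context):
    # Triggered by the S3 put notification; read the merged SQL file.
    record = event["Records"][0]["s3"]
    obj = s3.get_object(Bucket=record["bucket"]["name"], Key=record["object"]["key"])
    sql = obj["Body"].read().decode("utf-8")
    resp = rsd.execute_statement(
        ClusterIdentifier="my-dev-cluster",              # placeholder
        Database="dev",                                  # placeholder
        SecretArn="arn:aws:secretsmanager:placeholder",  # placeholder credentials secret
        Sql=sql,
    )
    # Hand the statement id to Step Functions so a later state can poll it.
    return {"StatementId": resp["Id"]}

def poll_handler(event, context):
    # Called by a Step Functions wait loop until Status is FINISHED or FAILED.
    desc = rsd.describe_statement(Id=event["StatementId"])
    return {"Status": desc["Status"], "Error": desc.get("Error")}

Because the query runs inside Redshift, neither Lambda has to stay alive for the duration of a long DDL/DML statement.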
I have an application running on a Linux EC2 instance and I would like to set up the CloudWatch Agent.
I would like to know to what extent the CloudWatch Agent uses CPU/memory/disk in order to collect its information.
It's a minor concern, but I would still like to know whether the Agent will affect instance performance (is the impact minimal?).
Thanks in advance!
Golan
Anything that runs on a computer would impact performance. It is always a trade-off between running some code and the benefit that the code provides.
The Agent only collects data at regular intervals, so it should not have a large impact on the system.
I suggest you install CloudWatch Agent and measure the impact yourself.
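If you want to put a number on it yourself, a small script like the sketch below can sample the agent's CPU and memory alongside your application. It assumes psutil is installed and that the agent runs as the usual amazon-cloudwatch-agent process.

import time
import psutil

# Sample CPU and resident memory of the CloudWatch Agent process.
# The process name "amazon-cloudwatch-agent" is an assumption about the install.
agent_procs = [p for p in psutil.process_iter(["name"])
               if "amazon-cloudwatch-agent" in (p.info["name"] or "")]

for _ in range(10):                      # ten roughly one-minute samples
    for p in agent_procs:
        cpu = p.cpu_percent(interval=1)  # percent over a 1-second window
        rss_mb = p.memory_info().rss / (1024 * 1024)
        print(f"pid={p.pid} cpu={cpu:.1f}% rss={rss_mb:.1f} MB")
    time.sleep(60)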
I've never used AWS Glue, but I believe it will deliver what I want, and I am after some advice. I have a monthly CSV data upload that I push to S3, with a staging Athena table (all strings) associated with it. I want Glue to perform a Create Table As Select (with all the necessary conversions/casts) against this dataset, writing it out in Parquet format, and then move that dataset from one S3 bucket to another so the primary Athena table can access the data.
As stated, I've never used Glue before and just want a starter for ten so I don't go down rabbit holes.
I currently perform all these steps manually, so I want to understand how to use Glue to automate my manual tasks.
Yes, you can use AWS Glue ETL jobs to do exactly what you described. However, Glue doesn't run CREATE TABLE AS SELECT queries; instead, it achieves the same result with Spark-based ETL jobs. Here is a GitHub repo that describes such a process in quite a detailed way, and here is the official AWS documentation on ETL programming with the AWS Glue service. After the initial setup, you can define trigger events or a schedule to run your Glue ETL jobs automatically.
However, one thing to remember is the cost of using AWS Glue. Since pricing is based on execution time, it is sometimes not trivial to forecast the final cost. For the workflow you described, performing CTAS queries with Athena would work just fine to transform your data and write it into a different S3 bucket. In that case you would know the price exactly, since it depends on the amount of data scanned. Then you can use the AWS API to do some manipulation of the metadata catalog so that the new information is accessible and in one place.
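For instance, kicking off such a CTAS from code is only a couple of boto3 calls. In this rough sketch the database, table, and bucket names are made up for illustration:

import boto3

athena = boto3.client("athena")

# Hypothetical names: staging_db.monthly_raw is the all-strings staging table,
# and s3://my-curated-bucket/monthly/ is where the Parquet output should land.
ctas = """
CREATE TABLE staging_db.monthly_parquet
WITH (format = 'PARQUET',
      external_location = 's3://my-curated-bucket/monthly/')
AS SELECT CAST(id AS bigint)                 AS id,
          CAST(amount AS double)             AS amount,
          from_iso8601_timestamp(event_time) AS event_time
FROM staging_db.monthly_raw
"""

resp = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "staging_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])  # poll with get_query_execution() until it finishes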
Since you are new to AWS Glue ETL jobs, I would suggest sticking with CTAS queries for simple tasks (although you can build quite complicated queries) and looking into the open source project Apache Airflow for automation, scheduling, and orchestration. This is the approach I am using for tasks similar to yours. Airflow is easy to set up on both local and remote machines, has a rich CLI and GUI for task monitoring, and abstracts away all the scheduling and retry logic. It even has hooks to interact with AWS services, including a dedicated operator for sending queries to Athena (see the sketch below). I wrote a little bit more about this approach here.
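To give a flavour of that, here is a minimal DAG sketch. The exact import path of the Athena operator depends on your Airflow version (this one assumes the Amazon provider package), and the DAG id, schedule, query, database, and bucket locations are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.athena import AthenaOperator

# Sketch only: dag id, schedule, table and bucket names are placeholders.
with DAG(
    dag_id="monthly_csv_to_parquet",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    ctas_to_parquet = AthenaOperator(
        task_id="ctas_to_parquet",
        query=(
            "CREATE TABLE staging_db.monthly_parquet "
            "WITH (format = 'PARQUET', "
            "external_location = 's3://my-curated-bucket/monthly/') "
            "AS SELECT * FROM staging_db.monthly_raw"
        ),
        database="staging_db",
        output_location="s3://my-athena-results/",
        aws_conn_id="aws_default",
    )

Retries and scheduling then come from Airflow itself rather than from your own glue scripts.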
In GCP, I need to update a BigQuery table whenever a file (in multiple formats such as JSON or XML) gets uploaded into a bucket. I have two options but am not sure about the pros/cons of each. Can someone suggest which is the better solution and why?
Approach 1:
File uploaded to bucket --> trigger Cloud Function (which updates the BigQuery table) --> BigQuery
Approach 2:
File uploaded to bucket --> trigger Cloud Function (which triggers a Dataflow job) --> Dataflow --> BigQuery
In a production environment, which approach is better suited and why? If there are alternative approaches, please let me know.
This is quite a broad question, so I wouldn't be surprised if it gets voted to be closed. That said, I'd always go with #2 (GCS -> CF -> Dataflow -> BigQuery).
Remember, Cloud Functions have a maximum execution time. If you kick off a load job from the Cloud Function, you'll need to bake logic into it to poll and check the status (load jobs in BigQuery are async). If the job fails, you'll need to handle that. And what if it's still running when you hit the maximum execution time of your Cloud Function?
At least by using Dataflow, you don't have the problem of maximum execution times, and you can simply rerun your pipeline if it fails for some transient reason, e.g. network issues.
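For reference, the Cloud Function in option #2 can stay very small because it only hands the file off to Dataflow. This sketch assumes a GCS-triggered function and a Dataflow template you have already staged; the project, region, template path, and parameter names are placeholders:

from googleapiclient.discovery import build

def gcs_trigger(event, context):
    # Background Cloud Function triggered by an object finalized in the bucket.
    bucket, name = event["bucket"], event["name"]

    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    request = dataflow.projects().locations().templates().launch(
        projectId="my-project",                        # placeholder
        location="europe-west1",                       # placeholder
        gcsPath="gs://my-templates/file-to-bigquery",  # placeholder staged template
        body={
            "jobName": "load-" + name.replace("/", "-").replace(".", "-").lower(),
            "parameters": {"inputFile": f"gs://{bucket}/{name}"},
        },
    )
    response = request.execute()
    print("Launched Dataflow job", response["job"]["id"])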
We are developing an application which uses Spark & Hive to do static and ad hoc reporting. The static reports take a number of parameters and then run over a data set. We would like to make it easier to test the performance of these reports on a cluster.
Assume we have a test cluster running with a sufficient sample data set that developers can share. To speed up development time, what is the best way to deploy a Spark application to a Spark cluster (in standalone mode) from an IDE?
I'm thinking we would create an SBT task which would run the spark-submit script. Is there a better way?
Eventually this will feed into some automated performance testing which we plan to run as a twice-daily Jenkins job. If it's an SBT deploy task, it is easy to call from Jenkins. Is there a better way to do this?
I've found a project on GitHub; maybe you can get some inspiration from it.
Maybe just add a for loop for submitting jobs and increase the number of iterations to find the performance limit; I'm not sure whether that is the right approach.
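A crude version of that loop, assuming spark-submit is on the PATH and using a hypothetical assembly JAR, main class, and arguments, might look like:

import subprocess

# Submit the same report job repeatedly to probe the cluster's limits.
# Master URL, JAR path, main class and arguments are hypothetical.
for run in range(1, 11):
    result = subprocess.run(
        [
            "spark-submit",
            "--master", "spark://test-cluster:7077",
            "--class", "com.example.reports.StaticReport",
            "target/scala-2.12/reports-assembly.jar",
            "--report-date", "2024-01-01",
        ],
        capture_output=True,
        text=True,
    )
    print(f"run {run}: exit={result.returncode}")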
I have a large dataset residing in AWS S3. The data is typically transactional (like call records). I run a sequence of Hive queries that continuously apply aggregation and filtering conditions to produce a couple of final compact files (CSVs with a few million rows at most).
So far with Hive, I have had to run one query after another manually (as sometimes queries fail due to problems in AWS, etc.).
I have processed two months of data this way.
But for subsequent months, I want to be able to write a workflow which will execute the queries one by one and, should a query fail, rerun it. This can't be done just by running Hive queries from a bash script (my current approach, at least):
hive -f s3://mybucket/createAndPopulateTableA.sql
hive -f s3://mybucket/createAndPopulateTableB.sql (this might need Table A to be populated before it executes).
Alternatively, I have been looking at Cascading and wondering whether it might be the solution to my problem; it does have Lingual, which might fit the case. I'm not sure, though, how it fits into the AWS ecosystem.
Ideally there would be some Hive query workflow tool; that would be the best solution. Otherwise, what other options do I have in the Hadoop ecosystem?
Edited:
I am looking at Oozie now, though I am running into a lot of issues setting it up on EMR. :(
You can use AWS Data Pipeline:
AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available
You can configure it to run or retry actions when a script fails, and it supports Hive scripts: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-hiveactivity.html
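As a rough idea of what that looks like in code, here is a boto3 sketch that registers an EMR cluster and one HiveActivity per script, with the second activity depending on the first. The names, retry counts, and cluster settings are placeholders, and a real definition also needs a Schedule object plus the usual IAM roles:

import boto3

dp = boto3.client("datapipeline")

pipeline_id = dp.create_pipeline(name="monthly-hive-flow",
                                 uniqueId="monthly-hive-flow-1")["pipelineId"]

objects = [
    {"id": "EmrClusterForHive", "name": "EmrClusterForHive", "fields": [
        {"key": "type", "stringValue": "EmrCluster"},
        {"key": "terminateAfter", "stringValue": "4 Hours"},   # placeholder sizing
    ]},
    {"id": "PopulateTableA", "name": "PopulateTableA", "fields": [
        {"key": "type", "stringValue": "HiveActivity"},
        {"key": "scriptUri", "stringValue": "s3://mybucket/createAndPopulateTableA.sql"},
        {"key": "stage", "stringValue": "false"},
        {"key": "maximumRetries", "stringValue": "2"},         # rerun on failure
        {"key": "runsOn", "refValue": "EmrClusterForHive"},
    ]},
    {"id": "PopulateTableB", "name": "PopulateTableB", "fields": [
        {"key": "type", "stringValue": "HiveActivity"},
        {"key": "scriptUri", "stringValue": "s3://mybucket/createAndPopulateTableB.sql"},
        {"key": "stage", "stringValue": "false"},
        {"key": "maximumRetries", "stringValue": "2"},
        {"key": "dependsOn", "refValue": "PopulateTableA"},    # wait for Table A
        {"key": "runsOn", "refValue": "EmrClusterForHive"},
    ]},
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
dp.activate_pipeline(pipelineId=pipeline_id)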