I am thinking of building a workflow as follows:
I have an application that writes almost 1000 CSV files to a folder MY_DIRECTORY in the S3 bucket MY_BUCKET. I would now like to parse those files from the S3 bucket and load them into a MySQL database using Apache Airflow.
From reading several posts here (Airflow S3KeySensor - How to make it continue running and Airflow s3 connection using UI), I think it would be best to trigger my Airflow DAG from an AWS Lambda function that is called as soon as a file lands in the S3 folder.
Being new to Airflow and Lambda, I cannot figure out how to set up the Lambda function to trigger the Airflow DAG. If anyone could give some pointers, it would be really helpful. Thanks.
Create the DAG that you want to trigger, then take advantage of the experimental REST APIs offered by Airflow.
You can read about them here: https://airflow.apache.org/docs/stable/api.html
In particular you want to use the following endpoint:
POST /api/experimental/dags/<DAG_ID>/dag_runs
You pass the name of the DAG in the URL to trigger the right one. Moreover, you can explicitly pass the name of the file the DAG will have to process:
curl -X POST \
http://localhost:8080/api/experimental/dags/<DAG_ID>/dag_runs \
-H 'Cache-Control: no-cache' \
-H 'Content-Type: application/json' \
-d '{"conf":"{\"FILE_TO_PROCESS\":\"value\"}"}'
Then use a Hook within the DAG to read the file that you specified.
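Inside the DAG, a PythonOperator callable can then pull the file name out of the run conf and read it with S3Hook. A minimal sketch, assuming Airflow 1.x-style imports and that my_s3_conn is a connection you defined in the Airflow UI:
from airflow.hooks.S3_hook import S3Hook

def process_file(**context):
    # the value passed as conf when the run was triggered by the Lambda
    key = context["dag_run"].conf["FILE_TO_PROCESS"]
    hook = S3Hook(aws_conn_id="my_s3_conn")
    csv_content = hook.read_key(key, bucket_name="MY_BUCKET")
    # parse csv_content and load it into MySQL here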
I have a Lambda function on AWS which stores its logs in AWS CloudWatch. I want to store ALL of these logs in S3 using the CLI. My Linux server is already configured with the CLI and has all the necessary permissions to access AWS resources. I want the logs that are displayed on my AWS CloudWatch console to end up in an S3 bucket.
Once these logs are stored in some location on S3, I can easily export them to an SQL table in Redshift.
Any idea how to bring these logs to S3? Thanks for reading.
You can use boto3 in Lambda to export logs to S3: you need to write a Lambda function that subscribes to the CloudWatch log group and is triggered on CloudWatch log events.
AWS Doc:
http://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html#LambdaFunctionExample
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3Export.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3ExportTasksConsole.html
Example: https://medium.com/dnx-labs/exporting-cloudwatch-logs-automatically-to-s3-with-a-lambda-function-80e1f7ea0187
Your question does not specify whether you want to export the logs one time or on a regular basis, so here are two options for exporting CloudWatch logs to an S3 location:
Create an export task (one-time)
You can create a task with the command below:
aws logs create-export-task \
--profile {PROFILE_NAME} \
--task-name {TASK_NAME} \
--log-group-name {CW_LOG_GROUP_NAME} \
--from {START_TIME_IN_MILLS} \
--to {END_TIME_IN_MILLS} \
--destination {BUCKET_NAME} \
--destination-prefix {BUCKET_DESTINATION_PREFIX}
You can refer to the AWS documentation for create-export-task for more detail.
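If you would rather script this than call the CLI, the equivalent boto3 call looks roughly like the sketch below; every name and timestamp is a placeholder:
import boto3

logs = boto3.client("logs")

# timestamps are epoch milliseconds, matching --from/--to in the CLI command above
response = logs.create_export_task(
    taskName="my-export-task",
    logGroupName="/aws/lambda/my-function",
    fromTime=1609459200000,
    to=1612137600000,
    destination="my-log-bucket",
    destinationPrefix="exported-logs",
)
print(response["taskId"])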
A Lambda to write the logs to S3 (event-based, from a CloudWatch Logs subscription)
const zlib = require('zlib');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.lambdaHandler = async (event, context) => {
  // CloudWatch Logs subscriptions deliver the log events gzipped and base64-encoded
  const payload = zlib.gunzipSync(Buffer.from(event.awslogs.data, 'base64'));
  // write the decoded batch to an S3 object (bucket and key are placeholders)
  await s3.putObject({ Bucket: 'MY_LOG_BUCKET', Key: `logs/${Date.now()}.json`, Body: payload }).promise();
};
My recommendation would be to push the logs to an ELK stack or any equivalent logging system (Splunk, Loggly, etc.) for better analysis and visualization of the data.
I know that I can use aws cloudformation create-stack or aws cloudformation update-stack with the --template-url switch to point to an existing template placed in an S3 bucket.
I would like to use aws cloudformation deploy, the same command for both creating and updating a CloudFormation stack, with a template I have already placed in my S3 bucket. Is this possible with any combination of the options?
The following syntax works, but it first uploads the template to the S3 Bucket:
aws cloudformation deploy \
--stack-name my-stack \
--template-file my-stack-template.yaml \
--s3-bucket my-bucket \
--s3-prefix templates \
--profile my-profile \
--region us-east-1
It first uploads the template my-stack-template.yaml as something like 1a381d4c65d9a3233450e92588a708b38.template in my-bucket/templates, which I do not want. I would like to be able to deploy the stack with this method using the template already placed in the S3 bucket, without needing it to be on my local computer.
Sadly, there is no such way. The only case in which the template is not re-uploaded is when there are no changes to deploy. You have to use create-stack if you want to use pre-existing templates in S3.
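For completeness, a boto3 sketch of creating a stack directly from a template that already sits in S3 (stack name, bucket and key reuse the question's placeholders; for an existing stack you would call update_stack with the same arguments):
import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")

# TemplateURL points at the object in S3, so the file never has to exist locally
cloudformation.create_stack(
    StackName="my-stack",
    TemplateURL="https://s3.amazonaws.com/my-bucket/templates/my-stack-template.yaml",
)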
When I use gsutil to connect to my bucket on Google Cloud Storage, I usually use the following command:
gcloud auth activate-service-account --key-file="pathKeyFile"
What should I do if two scripts that are running on the same machine at the same time need two different Service Accounts?
I would like to use a command such as:
gsutil ls mybucket --key-file="mykeyspath"
I ask because if, while my script is running, another script changes the Service Account that is currently active, my script would no longer have permission to access the bucket.
You can do this with a BOTO file. You can create one as explained in the documentation.
Then you can specify which file to use when you run your gsutil command (here is an example on Linux):
# if you have several GSUTIL command to run
export BOTO_CONFIG=/path/to/.botoMyBucket
gsutil ls myBucket
# For only one command, you can define an env var inline like this
BOTO_CONFIG=/path/to/.botoMyBucket2 gsutil ls myBucket2
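If the two scripts happen to be Python, each one can point its own gsutil invocations at its own BOTO file without touching the other script's environment. A small sketch (paths and bucket name are placeholders):
import os
import subprocess

# set BOTO_CONFIG only for this script's child processes, not globally
env = dict(os.environ, BOTO_CONFIG="/path/to/.botoMyBucket")
subprocess.run(["gsutil", "ls", "gs://myBucket"], env=env, check=True)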
I am looking for a persistent key-value DB which can be accessed via HTTP. I need to use it for storing Postman test script data. I have heard of rocksdb and leveldb, but I am not sure whether they can be accessed via HTTP.
leveldb and rocksdb don't have a network component.
I created a small Python project that exposes a document-datastore-like API that you can query using REST. Have a look at it: https://github.com/amirouche/deuspy. It relies on leveldb for persistence.
There is a Python asyncio client, and it's very easy to create a client of your own.
To get started, you can simply do the following:
pip3 install deuspy
python3 -m deuspy.server
And then start querying.
Here is an example curl-based session:
$ curl -X GET http://localhost:9990
{}
$ curl -X POST --data '{"héllo": "world"}' http://localhost:9990
3252169150753703489
$ curl -X GET http://localhost:9990/3252169150753703489
{"h\u00e9llo": "world"}
You can also filter documents. Look at how the asyncio client is implemented.
Take a look at Webdis, which provides HTTP REST API access to the Redis key-value store. Redis has very good performance and scalability.
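As a quick illustration, Webdis maps Redis commands onto URL paths, so a Postman or Python test script can talk to it over plain HTTP. A sketch assuming Webdis is running on its default port 7379:
import requests

BASE = "http://localhost:7379"  # Webdis default port

# SET a key, then read it back; Webdis answers with JSON such as {"SET": [true, "OK"]}
print(requests.get(f"{BASE}/SET/mykey/myvalue").json())
print(requests.get(f"{BASE}/GET/mykey").json())  # {"GET": "myvalue"}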
I am running a Spark cluster on Amazon EMR. I am running the PageRank example programs on the cluster.
While running the programs on my local machine, I am able to see the output properly. But the same doesn't work on EMR. The S3 folder only shows empty files.
The commands I am using:
For starting the cluster:
aws emr create-cluster --name SparkCluster --ami-version 3.2 --instance-type m3.xlarge --instance-count 2 \
--ec2-attributes KeyName=sparkproj --applications Name=Hive \
--bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark \
--log-uri s3://sampleapp-amahajan/output/ \
--steps Name=SparkHistoryServer,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=s3://support.elasticmapreduce/spark/start-history-server
For adding the job:
aws emr add-steps --cluster-id j-9AWEFYP835GI --steps \
Name=PageRank,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--class,SparkPageRank,s3://sampleapp-amahajan/pagerank_2.10-1.0.jar,s3://sampleapp-amahajan/web-Google.txt,2],ActionOnFailure=CONTINUE
After a few unsuccessful attempts... I wrote the output of the job to a text file, and it is created successfully when I run on my local machine. But I am unable to find it when I SSH into the cluster. I tried FoxyProxy to view the logs for the instances, and nothing shows up there either.
Could you please let me know where I am going wrong?
Thanks!
How are you writing the text file locally? Generally, EMR jobs save their output to S3, so you could use something like outputRDD.saveAsTextFile("s3n://<MY_BUCKET>"). You could also save the output to HDFS, but storing the results in S3 is useful for "ephemeral" clusters, where you provision an EMR cluster, submit a job, and terminate it upon completion.
"While running the programs on my local machine, I am able to see the
output properly. But the same doesn't work on EMR. The S3 folder only
shows empty files"
For the benefit of newbies:
If you are printing output to the console, it will be displayed in local mode, but when you execute on an EMR cluster the reduce operations are performed on the worker nodes, and they can't write to the console of the Master/Driver node!
With a proper path, you should be able to write the results to S3.
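For example, in a PySpark version of the job, writing the final ranks to S3 instead of printing them would look something like the sketch below (the output path is a placeholder, and the toy RDD stands in for the real PageRank results):
from pyspark import SparkContext

sc = SparkContext(appName="PageRankOutputDemo")

# toy (page, rank) pairs standing in for the real PageRank output
ranks = sc.parallelize([("pageA", 1.0), ("pageB", 0.5)])

# saveAsTextFile writes the results to S3; printing on workers never reaches the driver console
ranks.map(lambda pr: "{}\t{}".format(pr[0], pr[1])) \
     .saveAsTextFile("s3n://sampleapp-amahajan/pagerank-output/")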