S3 to Redshift with NiFi - amazon-s3

I have read a bit about how to upload my S3 data to Redshift: the COPY command, Glue, etc.
My pipeline is almost entirely in NiFi, something like:
extract_data -> insert to S3 -> execute a Lambda process to transform or enrich the data using Athena, in 2 or 3 stages, into another S3 bucket (let's call it the processed bucket).
Now I want to continue this pipeline by loading the data from the processed bucket and inserting it into Redshift; I have an empty table created for this.
The idea is to load some tables incrementally, and for others to delete all the data loaded that day and reload it.
Can anyone give me a hint of where to start?
Thank you!

When data lands in your "processed bucket", you can fire a Lambda function that triggers a flow in Apache NiFi by calling an HTTP webhook. To expose such a webhook, you can use one of the following processors:
ListenHTTP
Starts an HTTP Server and listens on a given base path to transform
incoming requests into FlowFiles. The default URI of the Service will
be http://{hostname}:{port}/contentListener. Only HEAD and POST
requests are supported. GET, PUT, and DELETE will result in an error
and the HTTP response status code 405.
HandleHttpRequest
Starts an HTTP Server and listens for HTTP Requests. For each request,
creates a FlowFile and transfers to 'success'. This Processor is
designed to be used in conjunction with the HandleHttpResponse
Processor in order to create a Web Service
So the flow would be ListenHTTP -> FetchS3Object -> Process -> PutSQL (with a Redshift connection pool). The Lambda function would call POST my-nifi-instance.com:PORT/my-webhook (ListenHTTP only accepts HEAD and POST), such that ListenHTTP creates a FlowFile for the incoming request.
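A minimal sketch of such a Lambda on the Node.js runtime, assuming the processed bucket's event notification invokes it directly; the NiFi host, port and my-webhook base path are placeholders:

```typescript
// Hypothetical Lambda (Node.js 18+ runtime) that forwards S3 object references to NiFi's ListenHTTP.
// NIFI_WEBHOOK is an assumed environment variable, e.g. http://my-nifi-instance.com:8081/my-webhook
import type { S3Event } from "aws-lambda";

export const handler = async (event: S3Event): Promise<void> => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));

    // ListenHTTP only accepts HEAD and POST, so POST the object reference as the request body.
    const response = await fetch(process.env.NIFI_WEBHOOK!, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ bucket, key }),
    });

    if (!response.ok) {
      throw new Error(`NiFi webhook returned ${response.status} for ${bucket}/${key}`);
    }
  }
};
```

Downstream, FetchS3Object can pull the object using the bucket/key attributes from the FlowFile, and PutSQL can issue either an append-style Redshift COPY or a DELETE-then-COPY for the tables that are reloaded each day.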

Related

Test async data processing flows with Karate Labs

I'm looking for best practices or the recommended approach to test async code execution with Karate.
Our use cases are all pretty similar but a basic one is:
Client makes HTTP request to API
API accepts the request and creates a message which is added to a queue
API replies with ACCEPTED / 202
Worker picks up the message from the queue, processes it and updates the database
Eventually, after the work is finished, another endpoint delivers the updated data
How can I check with Karate that after processing has finished other endpoints return the correct result?
Concrete real life example:
Client requests a processing-intensive data export from the API, e.g. via HTTP POST /api/export
API creates a message with the information needed for the export and puts it on an AWS SQS queue
API replies with 202
Worker receives the message, creates the export, uploads the result as a ZIP to S3 and finally creates an entry in the database representing this export
Client can now query the list exports endpoint, e.g. via HTTP GET /api/exports
API returns 200 with the list of exports, including the newly created entry
Generally I have two ideas on how to approach this:
Use Karate's retry until on the endpoint that returns the list of exports
In the API response (step #3) return the message ID, use the HTTP API of SQS to poll until the message has been processed, and then query the list endpoint to check the result
Is either of those approaches recommended, or should I choose an entirely different solution?
The moment queuing comes into the picture, I would not recommend retry until. It would work if you are in a hurry, but if you are OK with writing a little bit of Java code, please read on. Note that this Java "glue code" needs to be written only once, and then the team responsible for writing the functional flows will be up and running.
I personally would prefer option (2) just because when a test fails, you will have a lot more diagnostic information and traces to look at.
Pretty sure you won't have a problem using AWS Java libs to do things such as polling SQS.
I think this example will answer all your questions: https://twitter.com/getkarate/status/1417023536082812935
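Purely to illustrate the polling idea behind option (2), here is a rough sketch using the AWS SDK for JavaScript v3 (the Java glue suggested above would follow the same shape); the queue URL and how the export ID appears in the message body are assumptions:

```typescript
import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const queueUrl = "https://sqs.eu-central-1.amazonaws.com/123456789012/export-events"; // placeholder

// Long-poll the queue until a message mentioning the export ID shows up, or the timeout expires.
export async function waitForExport(exportId: string, timeoutMs = 60_000): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const { Messages = [] } = await sqs.send(
      new ReceiveMessageCommand({ QueueUrl: queueUrl, WaitTimeSeconds: 20, MaxNumberOfMessages: 10 })
    );
    for (const message of Messages) {
      if (message.Body?.includes(exportId)) {
        // Found it: remove the message and let the test go on to call GET /api/exports.
        await sqs.send(new DeleteMessageCommand({ QueueUrl: queueUrl, ReceiptHandle: message.ReceiptHandle! }));
        return true;
      }
    }
  }
  return false;
}
```

This assumes the test has its own queue to listen on (for example, a second queue subscribed to the same notifications), so it does not steal messages from the worker.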

Upload to S3 which triggers Lambda, wait for Lambda to respond

So I have an architecture in which the user uploads their file through the front end, which then sends it to an S3 bucket, which in turn triggers a Lambda for validation and processing, which should send a response to the front end indicating a successful upload or a validation error.
I don't understand whether there is a way to implement this in JavaScript (or any other similar language).
In the normal scenario, the front end uploads to server 1 and waits for its response. Server 1 then tells the front end whether it was a success or a failure, and that is what the front end tells the user.
But in this case, the upload is done to S3 (which is incapable of taking a response from the Lambda and sending it back to the user), and the response is expected from the other component (the Lambda).
How can this be implemented? If the architecture is flawed, please do suggest improvements.
How does server 2 respond to the front end? Surely server 2 responds to server 1, which in turn responds to the front end. If these operations are synchronous then it is no different from a function call or an API call. If you want to make it asynchronous then you've got a different pattern to manage.

Purge cache in Cloudflare using API

What would be the best practice to refresh content that is already cached by Cloudflare?
We have a few APIs that generate JSON, and we cache them.
Once in a while the JSON needs to be updated, and what we do right now is purge it via the API.
https://api.cloudflare.com/client/v4/zones/dcbcd3e49376566e2a194827c689802d/purge_cache
Later on, when a user hits the page with the required JSON, it will be cached again.
But in our case we have 100+ JSON files that we purge at once, and we want to push the new content into Cloudflare's cache instead of waiting for users (to avoid a bad experience for them).
Right now I am considering pinging (via HTTP request) the needed JSON endpoints just after we have purged the cache.
My question is whether that is the right way, and whether Cloudflare already has some API to do what we need.
Thanks.
Currently, the purge API is the recommended way to invalidate cached content on-demand.
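For reference, a rough sketch of purging a batch of URLs via that API and immediately re-requesting them to warm the cache; the API token and URL list are placeholders, and purge-by-URL accepts up to 30 URLs per call, so larger sets need batching:

```typescript
// Sketch: purge specific URLs via the Cloudflare API, then re-request them to warm the edge cache.
const ZONE_ID = "dcbcd3e49376566e2a194827c689802d";
const API_TOKEN = process.env.CF_API_TOKEN!; // placeholder
const urls = ["https://example.com/api/report-1.json", "https://example.com/api/report-2.json"];

async function purgeAndWarm(files: string[]): Promise<void> {
  // Purge by URL ("files" accepts up to 30 URLs per call, so batch if you have 100+).
  const purge = await fetch(`https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/purge_cache`, {
    method: "POST",
    headers: { Authorization: `Bearer ${API_TOKEN}`, "Content-Type": "application/json" },
    body: JSON.stringify({ files }),
  });
  if (!purge.ok) throw new Error(`Purge failed: ${purge.status}`);

  // Re-request each URL so the next real user gets a cache hit instead of a miss.
  await Promise.all(files.map((url) => fetch(url, { headers: { Accept: "application/json" } })));
}

purgeAndWarm(urls).catch(console.error);
```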
Another approach for your scenario could be to look at Workers and Workers KV, and combine it with the Cloudflare API. You could have:
A Worker reading the JSON from the KV and returning it to the user.
When you have a new version of the JSON, you could use the API to create/update the JSON stored in the KV.
This setup can be very performant, since the Worker code in (1) runs in each Cloudflare data center and returns quickly to users. It is also important to note that KV is "eventually consistent" storage, so feasibility depends on your specific application.
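A minimal sketch of (1) as a module Worker; the KV binding name JSON_CACHE and the key scheme are assumptions (the KVNamespace type comes from @cloudflare/workers-types):

```typescript
// Worker that serves JSON straight from Workers KV.
export interface Env {
  JSON_CACHE: KVNamespace; // binding name is an assumption, configured in wrangler.toml
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const key = new URL(request.url).pathname; // e.g. /api/report-1.json
    const body = await env.JSON_CACHE.get(key);
    if (body === null) {
      return new Response("Not found", { status: 404 });
    }
    return new Response(body, { headers: { "Content-Type": "application/json" } });
  },
};
```

For (2), the new JSON can be written to the same key with env.JSON_CACHE.put(key, body) from another Worker, or from your backend through the Cloudflare KV REST API.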

How REST API handle continuous data update

I have a REST backend API, and the front end will call the API to get data.
I was wondering how a REST API handles continuous data updates. For example,
in Jenkins, if we execute a build job, we can see the continuous log output on the page until the job finishes. How does REST accomplish that?
Jenkins will just continue to send data. That's it. It simply carries on sending (at least that's what I'd presume it does). Normally the response contains a header field indicating how much data the response contains (Content-Length). But this field is not necessary. The server can omit it. In such a case the response body ends when the server closes the connection. See RFC 7230:
Otherwise, this is a response message without a declared message body length, so the message body length is determined by the number of octets received prior to the server closing the connection.
Another possibility would be to use the chunked transfer encoding. Then the server sends the body as a series of chunks, each preceded by its own size, and terminates the response by sending a zero-length last chunk.
WebSockets would be a third possibility.
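A minimal sketch of that streaming behaviour with a Node.js HTTP server (the endpoint and timing are made up); because no Content-Length is set, Node falls back to chunked transfer encoding and the client sees each line as it arrives:

```typescript
import { createServer } from "node:http";

// Hypothetical endpoint that streams log lines as they are produced.
// No Content-Length is set, so Node sends the body with Transfer-Encoding: chunked.
const server = createServer((req, res) => {
  res.writeHead(200, { "Content-Type": "text/plain" });
  let line = 0;
  const timer = setInterval(() => {
    res.write(`log line ${++line}\n`); // each write goes out as a chunk
    if (line === 10) {
      clearInterval(timer);
      res.end(); // the zero-length last chunk terminates the response
    }
  }, 1000);
  req.on("close", () => clearInterval(timer)); // stop if the client disconnects
});

server.listen(8080);
```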
I was searching for an answer myself and then the obvious solution struck me. To see what type of communication a service is using, you can simply inspect it from the browser side using Developer Tools.
In Google Chrome it will be F12 -> Network.
In the case of Jenkins, the front end sends AJAX requests to the backend for data:
every 5 seconds on the Dashboard page
every second during a Pipeline run (the Console Output page that you mentioned)
I have also checked the approach in AWS. When checking the status of instances (example: Initializing... , Booting...), it queries the backend every second. It seems to be a standard interval for its services.
Additional note:
When running an AWS Remote Console though, it first sends requests for the remote console instance status (the backend answers with { status: "BOOTING" }, etc.). After the backend returns the status as "RUNNING", it starts a WebSocket session between your browser and the AWS backend (you can notice it by applying the WS filter in Developer Tools).
At that point it is no longer a REST API but WebSockets, which is a different (stateful) protocol.
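A browser-side sketch of that polling pattern; the /api/console-output endpoint, the #console element and the 1-second interval are assumptions:

```typescript
// Ask the backend for fresh data on a fixed interval, as the Jenkins Console Output page does.
async function pollConsoleOutput(): Promise<void> {
  const res = await fetch("/api/console-output");
  if (res.ok) {
    document.querySelector("#console")!.textContent = await res.text();
  }
}

// Poll every second during the run.
const timer = setInterval(pollConsoleOutput, 1000);
// Call clearInterval(timer) once the build is reported as finished.
```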

How can I get notification about new S3 objects?

I have a scenario where we have many clients uploading to S3.
What is the best approach to knowing that there is a new file?
Is it realistic/a good idea for me to poll the bucket every few seconds?
UPDATE:
Since November 2014, S3 supports the following event notifications:
s3:ObjectCreated:Put – An object was created by an HTTP PUT operation.
s3:ObjectCreated:Post – An object was created by HTTP POST operation.
s3:ObjectCreated:Copy – An object was created by an S3 copy operation.
s3:ObjectCreated:CompleteMultipartUpload – An object was created by the completion of an S3 multi-part upload.
s3:ObjectCreated:* – An object was created by one of the event types listed above or by a similar object creation event added in the future.
s3:ReducedRedundancyObjectLost – An S3 object stored with Reduced Redundancy has been lost.
These notifications can be issued to Amazon SNS, SQS or Lambda. Check out the blog post that's linked in Alan's answer for more information on these new notifications.
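For example, wiring a bucket to an SQS queue with the AWS SDK for JavaScript v3 could look roughly like this (the bucket name and queue ARN are placeholders, and the queue's access policy must allow S3 to send messages to it):

```typescript
import { S3Client, PutBucketNotificationConfigurationCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Placeholder bucket and queue; the SQS queue policy must allow s3.amazonaws.com to SendMessage.
async function configureNotifications(): Promise<void> {
  await s3.send(
    new PutBucketNotificationConfigurationCommand({
      Bucket: "my-upload-bucket",
      NotificationConfiguration: {
        QueueConfigurations: [
          {
            QueueArn: "arn:aws:sqs:us-east-1:123456789012:new-object-queue",
            Events: ["s3:ObjectCreated:*"], // any of the creation events listed above
          },
        ],
      },
    })
  );
}

configureNotifications().catch(console.error);
```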
Original Answer:
Although Amazon S3 has a bucket notifications system in place, it does not support notifications for anything but the s3:ReducedRedundancyLostObject event (see the GET Bucket notification section in their API).
Currently the only way to check for new objects is to poll the bucket at a preset time interval or build your own notification logic in the upload clients (possibly based on Amazon SNS).
Push notifications are now built into S3:
http://aws.amazon.com/blogs/aws/s3-event-notification/
You can send notifications to SQS or SNS when an object is created via PUT or POST, or when a multi-part upload is finished.
Your best option nowadays is using the AWS Lambda service. You can write a Lambda using Node.js (JavaScript), Java or Python (probably more options will be added in time).
The Lambda service allows you to write functions that respond to events from S3, such as a file upload. It is cost-effective, scalable and easy to use.
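A minimal sketch of such a function on the Node.js runtime (what you do with each new object is up to you):

```typescript
import type { S3Event } from "aws-lambda";

// Invoked by S3 with a batch of object-created events for the configured bucket.
export const handler = async (event: S3Event): Promise<void> => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
    console.log(`New object: s3://${bucket}/${key} (${record.s3.object.size} bytes)`);
    // ...react to the new object here (index it, copy it, kick off processing, etc.)
  }
};
```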
You can implement a pub-sub mechanism relatively simply by using SNS, SQS and AWS Lambda. Please see the steps below. Whenever a new file is added to the bucket, a notification can be raised and acted upon (everything is automated).
Please see attached diagram explaining the basic pub-sub mechanism
Step 1
Simply configure the S3 bucket event notification to notify an SNS topic. You can do this from the S3 console (Properties tab)
Step 2
Make an SQS Queue subscribed to this topic. So whenever an object is uploaded to the S3 bucket a message will be added to the queue.
Step 3
Create an AWS Lambda function to read messages from the SQS queue. AWS Lambda supports SQS events as a trigger, so whenever a message appears in the SQS queue, Lambda is triggered and reads the message. Once a message is successfully processed, it is automatically deleted from the queue. Messages that Lambda can't process (erroneous messages) are not deleted and will pile up in the queue; to prevent this, using a Dead Letter Queue (DLQ) is a good idea.
In your Lambda function, add your logic to handle what to do when users upload files to the bucket (see the sketch after this step).
Note: DLQ is nothing more than a normal queue.
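A minimal sketch of that function on the Node.js runtime. Because the route here is S3 -> SNS -> SQS, each SQS record body is the SNS envelope (unless raw message delivery is enabled), so the S3 event is nested in its Message field; the handling logic is a placeholder:

```typescript
import type { SQSEvent } from "aws-lambda";

export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    // With S3 -> SNS -> SQS (raw message delivery off), the SQS body is the SNS envelope.
    const envelope = JSON.parse(record.body);
    const s3Event = JSON.parse(envelope.Message);

    for (const s3Record of s3Event.Records ?? []) {
      const bucket = s3Record.s3.bucket.name;
      const key = decodeURIComponent(s3Record.s3.object.key.replace(/\+/g, " "));
      console.log(`User uploaded s3://${bucket}/${key}`);
      // ...your handling logic goes here
    }
  }
  // If this handler throws, the failed messages return to the queue and, after the configured
  // number of attempts, land in the Dead Letter Queue.
};
```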
Step 4
Debugging and analyzing the process
Make use of AWS CloudWatch to log details. Each Lambda function writes its logs to a log group, which is a good place to check if something went wrong.