How to use Apache NiFi to query a REST API?

For a project I need to develop an ETL process (extract, transform, load) that reads data from a (legacy) tool which exposes its data via a REST API. This data needs to be stored in Amazon S3.
I would really like to try this with Apache NiFi, but I honestly have no clue yet how to connect to the REST API, or where/how to implement some business logic to 'talk the right protocol' with the source system. For example, I would like to keep track of what data has been written so far, so the load can resume where it left off.
So far I have been reading the NiFi documentation and I'm getting a better insight into what the tool provides/entails. However, it's not clear to me how I could implement the task within the NiFi architecture.
Hopefully someone can give me some guidance?
Thanks,
Paul

The InvokeHTTP processor can be used to query a REST API.
Here is a simple flow that:
- Queries the REST API at https://api.exchangeratesapi.io/latest every 10 minutes
- Sets the output file name (exchangerates_<ID>.json)
- Stores the query response in the output file on the local filesystem (under /tmp/data-out)
I exported the flow as a NiFi template and stored it in a gist. The template can be imported into a NiFi instance and run as is.
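For orientation, here is a rough standalone Python sketch of what that flow does: poll the endpoint every 10 minutes and drop each response under /tmp/data-out. The URL and directory come from the flow above; using a random UUID for <ID> in the file name is an assumption.

```python
# Rough standalone equivalent of the flow above, for orientation only.
import time
import uuid
from pathlib import Path

import requests  # assumes the requests package is installed

API_URL = "https://api.exchangeratesapi.io/latest"
OUT_DIR = Path("/tmp/data-out")

def poll_once():
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    # The flow names files exchangerates_<ID>.json; a random UUID stands in
    # for <ID> here.
    out_file = OUT_DIR / f"exchangerates_{uuid.uuid4()}.json"
    out_file.write_text(response.text)

if __name__ == "__main__":
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    while True:
        poll_once()
        time.sleep(600)  # 10 minutes, like the flow's schedule
```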

Related

Apache NiFi use in data science

I'm trying to create a web service dataflow using Apache NiFi. I've set up the HTTP request and response processors; however, I can't seem to figure out how to update the flowfile from the request processor with data from, say... another connection. Can someone please let me know how I can achieve this behaviour?
What is the use of Apache NiFi? Is it used in data science, or is it just a tool for working on some kind of data? What exactly does Apache NiFi do?
NiFi is a data orchestration and ETL/ELT tool. It's not meant for data science in the sense that data science is primarily about analytics; it's used by data engineers to process data, move data, and things like that. These are the tasks that tend to happen before data science work.
I can't seem to figure out how to update the flowfile from the request processor with data from say
Use InvokeHTTP. You can configure it to "always output response" and then you will have the response to work with. NiFi won't automagically merge the response with the data you send, so you would need to have the REST APIs you wrote return output that you find satisfactory on the NiFi side. That's a common use case with REST-based enrichment.
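As a rough illustration of that enrichment pattern (the endpoint and field names below are made up), the merge you end up doing on your side amounts to something like this:

```python
import requests  # assumes the requests package is available

def enrich(record, enrichment_url):
    """Call a REST endpoint and merge its response into the original record.

    Mirrors the pattern above: the API call only gives you the response,
    so combining it with the original data is your own responsibility.
    """
    response = requests.post(enrichment_url, json=record, timeout=30)
    response.raise_for_status()
    # Keep the original fields and attach whatever the API returned.
    return {**record, "enrichment": response.json()}

# Hypothetical usage:
# enriched = enrich({"id": 42, "name": "foo"}, "https://example.com/api/enrich")
```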

How to create a process in Dell Boomi that will get data from one database and then send the data to a SaaS

I would like to know how to create a process in Dell Boomi that meets the following criteria:
Read data directly from a production database table, then send the data to a SaaS (public internet) using a REST API.
Another process will read data from the SaaS (REST API) and then write it to another database table.
Please see the attached link for what I have done so far; I really don't know how to proceed. Hope you can help me out. Thank you. Boomi DB connector
You are actually off to a good start. For the first process (DB > SaaS) you need to:
- Ensure you have access to the DB. If your Atom is local this shouldn't be much of an issue, but if it is on the Boomi Cloud, then you need to enable access to this DB from the internet (not something I would recommend).
- Check what you need to read and define the Boomi Operation. From the image you have linked I can see that you are doing that, but without knowing what data you need and how it is structured, it is impossible to say whether you have defined everything correctly.
- Transform the data to the output system's format. Once you get the data from the DB, use the Map shape to map it to the Profile of the SaaS you are sending your data to.
- Send the data to the SaaS. You can use the HTTP Client connector to send data in JSON or XML (or any other format you like) to the SaaS REST API.
For the other process (SaaS > DB) the steps are practically the same, just in reverse order.
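Outside of Boomi, the same four steps map roughly to the following standalone sketch; the table, field names, and SaaS endpoint are made up for illustration, and your database driver and profile will differ:

```python
import json
import sqlite3      # stand-in for your production database driver
import requests     # assumes the requests package is installed

SAAS_URL = "https://example-saas.com/api/records"   # hypothetical endpoint

def run_once():
    # 1. Read from the production table (query and schema are illustrative).
    conn = sqlite3.connect("production.db")
    rows = conn.execute("SELECT id, name, amount FROM orders").fetchall()
    conn.close()

    for row_id, name, amount in rows:
        # 2./3. Map the DB record to the profile the SaaS expects
        #       (this is what the Map shape does in Boomi).
        payload = {"externalId": row_id, "customer": name, "total": amount}
        # 4. Send it to the SaaS REST API as JSON (the HTTP Client connector's job).
        resp = requests.post(SAAS_URL, data=json.dumps(payload),
                             headers={"Content-Type": "application/json"},
                             timeout=30)
        resp.raise_for_status()

if __name__ == "__main__":
    run_once()
```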

Using Google Cloud ecosystem vs building your own microservice architecture

Building in the Google Cloud ecosystem is really powerful. I really like how you can ingest files into Cloud Storage, then Dataflow enriches, transforms, and aggregates the data, and the results are finally stored in BigQuery or Cloud SQL.
I have a couple of questions to help me get a better understanding.
Suppose you are building a big data product using the Google services.
When a front-end web application (perhaps built in React) submits a file to Cloud Storage, it may take some time before it is completely processed. The client might want to view the status of the file in the pipeline, and then do something with the result on completion. How are front-end clients expected to know when a file has finished processing and is ready? Do they need to poll data from somewhere?
Suppose you currently have a microservice architecture in which each service does a different kind of processing: for example, one might parse a file, another might process messages. The services communicate using Kafka or RabbitMQ and store data in Postgres or S3.
If you adopt the Google services ecosystem, could you replace that microservice architecture with Cloud Storage, Dataflow, and Cloud SQL/Store?
Did you look at Cloud Pub/Sub (a topic subscription/publication service)?
Cloud Pub/Sub brings the scalability, flexibility, and reliability of enterprise message-oriented middleware to the cloud. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for secure and highly available communication between independently written applications.
I believe Pub/Sub can mostly substitute for Kafka or RabbitMQ in your case.
How are front-end clients expected to know when a file has finished processing and is ready? Do they need to poll data from somewhere?
For example, if you are using the Dataflow API to process the file, Cloud Dataflow can publish the progress and send the status to a topic. Your front end (App Engine) just needs to subscribe to that topic and receive updates.
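A minimal sketch of that pattern with the google-cloud-pubsub client library (project, topic, and subscription names are placeholders) might look like this:

```python
from google.cloud import pubsub_v1

PROJECT = "my-project"

# Pipeline side: publish a status update once a file has been processed.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, "file-status")
publisher.publish(topic_path, b"done", file_name="upload_123.csv")

# Front-end/backend side: subscribe and react to status messages.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT, "file-status-sub")

def on_message(message):
    print(message.attributes.get("file_name"), message.data.decode())
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=on_message)
# streaming_pull.result() would block here and keep receiving updates.
```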
1)
Dataflow does not offer inspection of intermediate results. If a front end wants more progress detail about an element being processed in a Dataflow pipeline, custom progress reporting will need to be built into the pipeline.
One idea is to write progress updates to a sink table, outputting elements to it at various points in the pipeline. I.e. have a BigQuery sink where you write rows like ["element_idX", "PHASE-1 DONE"]. A front end can then query for those results. (I would personally avoid overwriting old rows, but many approaches can work.)
You can do this by consuming the PCollection in both the new sink and your pipeline's next step.
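In the Beam Python SDK, that branching could be sketched as follows; the table names, schemas, the element's "id" field, and the parsing logic are all placeholders:

```python
import json
import apache_beam as beam

def parse_element(line):
    # Placeholder parsing; the real pipeline has its own logic and schema.
    return json.loads(line)

with beam.Pipeline() as p:
    elements = (p
                | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.json")
                | "Parse" >> beam.Map(parse_element))

    # Branch 1: progress rows the front end can query.
    (elements
     | "ToStatusRow" >> beam.Map(lambda e: {"element_id": e["id"],
                                            "status": "PHASE-1 DONE"})
     | "WriteStatus" >> beam.io.WriteToBigQuery(
           "my-project:pipeline.status",
           schema="element_id:STRING,status:STRING"))

    # Branch 2: the pipeline's actual next step, consuming the same PCollection.
    (elements
     | "NextStep" >> beam.Map(lambda e: {"element_id": e["id"],
                                         "payload": json.dumps(e)})
     | "WriteResults" >> beam.io.WriteToBigQuery(
           "my-project:pipeline.results",
           schema="element_id:STRING,payload:STRING"))
```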
2)
Is your microservice architecture using a "pipes and filters" pipeline-style approach? I.e. each service reads from a source (Kafka/RabbitMQ), writes data out, and then the next one consumes it?
Probably the best way to do this is to set up a few different Dataflow pipelines, output their results using a Pub/Sub or Kafka sink, and have the next pipeline consume that Pub/Sub topic. You may also wish to sink the results to another location like BigQuery/GCS, so that you can query them again if you need to.
There is also the option of using Cloud Functions instead of Dataflow, which have Pub/Sub and GCS triggers. A microservice system can be set up with several Cloud Functions.
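As an illustration of the Cloud Functions route, a background (1st-gen) function fired by a Cloud Storage upload can be as small as this; the function name, bucket, and processing body are placeholders:

```python
# main.py - sketch of a background Cloud Function with a GCS trigger.
# Deployed (1st gen) roughly like:
#   gcloud functions deploy on_file_uploaded --runtime python310 \
#       --trigger-resource my-bucket --trigger-event google.storage.object.finalize
def on_file_uploaded(event, context):
    """Runs whenever an object is finalized in the configured bucket."""
    print(f"Received {event['name']} from bucket {event['bucket']}")
    # ...parse/enrich the file here, or publish a Pub/Sub message so the
    # next stage (another function or a pipeline) picks it up.
```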

Calling API from PigLatin

I'm a complete newbie to PigLatin, but I'm looking to pull data from the MetOffice DataPoint API, e.g.:
http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/xml/350509?res=3hourly&key=abc123....
...into Hadoop.
My question is "Can this be undertaken using PigLatin (from within Pig View, in Ambari)"?
I've hunted around for how to format a GET request in the code, but without luck.
Am I barking up the wrong tree? Should I be looking to use a different service within the Hadoop framework to accomplish this?
It is a very bad idea to make calls to external services from inside map-reduce jobs. The reason is that when running on the cluster your jobs are very scalable, whereas the external system might not be. Modern resource managers like YARN make this situation even worse: when you swamp the external system with requests, your tasks on the cluster will mostly be sleeping, waiting for a reply from the server. The resource manager will see that the CPU is not being used by the tasks and will schedule more of your tasks to run, which will make even more requests to the external system, swamping it further. I've seen a modest 100-machine cluster put out 100K requests per second.
What you really want to do is either somehow get the bulk data from the web service, or set up a system with a queue and a small, controlled number of workers that pull from the external system at a set rate.
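The queue-plus-throttled-workers idea can be sketched outside Hadoop with nothing but the standard library; the URL list, worker count, and sink below are placeholders:

```python
import queue
import threading
import time
import urllib.request

WORKERS = 4                    # keep this small and deliberate
SECONDS_BETWEEN_CALLS = 1.0    # per-worker throttle

work = queue.Queue()

def store(payload: bytes) -> None:
    pass  # placeholder sink: write to HDFS / S3 / a local staging directory

def worker():
    while True:
        url = work.get()
        if url is None:        # sentinel: shut this worker down
            work.task_done()
            break
        with urllib.request.urlopen(url, timeout=30) as resp:
            store(resp.read())
        work.task_done()
        time.sleep(SECONDS_BETWEEN_CALLS)

threads = [threading.Thread(target=worker, daemon=True) for _ in range(WORKERS)]
for t in threads:
    t.start()

for url in ["https://example.com/api?page=1"]:   # your real URL list goes here
    work.put(url)
work.join()                    # wait for the real work to drain
for _ in threads:
    work.put(None)             # then stop the workers
```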
As for your original question, I don't think PigLatin provides such a facility, but it can easily be done with UDFs in either Python or Java. With Python you can use the excellent requests library, which will make your UDF about 6 lines of code. A Java UDF will be a little more verbose, but nothing terrible by Java standards.
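A sketch of such a Python UDF is below; the streaming_python registration and the pig_util import reflect Pig 0.12+ CPython UDF support, but the exact mechanics vary by Pig version, so treat this as an outline rather than a drop-in:

```python
# fetch_udf.py
# Registered from Pig Latin with something like:
#   REGISTER 'fetch_udf.py' USING streaming_python AS metoffice;
from pig_util import outputSchema  # shipped with Pig's streaming_python support
import requests

@outputSchema('body:chararray')
def fetch(url):
    # One HTTP GET per input tuple; note the earlier caveat about the load
    # this puts on the external service.
    return requests.get(url, timeout=30).text
```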
"Can this be undertaken using PigLatin (from within Pig View, in
Ambari)"?
No, by default Pig loads from HDFS storage, unless you write your own loader.
And I share the same point as Vlad: this is not a good idea. There are many other components used for data ingestion, but this is not a use case for Pig!

Collect and Display Hadoop MapReduce results in ASP.NET MVC?

Beginner questions. I read this article about Hadoop/MapReduce
http://www.amazedsaint.com/2012/06/analyzing-some-big-data-using-c-azure.html
I get the idea of Hadoop, and what map is and what reduce is.
The thing for me is, if my application sits on top of a Hadoop cluster:
1) Is there no need for a database anymore?
2) How do I get my data into Hadoop in the first place from my ASP.NET MVC application? Say it's Stack Overflow (which is coded in MVC). After I post this question, how can the question, along with the title, body, and tags, get into Hadoop?
3) The above article collects data about "namespaces" used on Stack Overflow and how many times they were used.
If this site, Stack Overflow, wants to display the result data from the MapReduce job in real time, how do you do that?
Sorry for the rookie questions. I'm just trying to get a clear picture here, one piece at a time.
1) That would depend on the application. Most likely you still need a database for user management, etc.
2) If you are using Amazon EMR, you'd place the inputs into S3 using the .NET API (or some other way) and get the results out the same way. You can also monitor your EMR account via the API; it's fairly straightforward.
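The answer refers to the AWS SDK for .NET; the same in/out idea, shown here with the AWS SDK for Python (boto3) and placeholder bucket and key names, is just an upload before the job and a download after it:

```python
import boto3  # AWS SDK for Python; the answer above uses the .NET SDK instead

s3 = boto3.client("s3")

# Put the job input where EMR can read it...
s3.upload_file("question_12345.json", "my-input-bucket", "questions/12345.json")

# ...and later pull the MapReduce output back for the web app to display.
s3.download_file("my-output-bucket", "results/part-00000", "local_results.txt")
```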
3) Hadoop is not really a real-time environment; it's more of a batch system. You could simulate real time by continuously processing incoming data, but it's still not true real-time.
I'd recommend taking a look at the Amazon EMR .NET docs and picking up a good book on Hadoop (such as Hadoop in Practice) to understand the stack and concepts, and one on Hive (such as Programming Hive).
Also, you can of course mix the environments for what they are best at; for example, use Azure Websites and SQL Azure for your .NET app and Amazon EMR for Hadoop/Hive. There's no need to park everything in one place, considering the cost models.
Hope this helps.