Apache NiFi use in data science - data-science

I'm trying to create a web service dataflow using Apache NiFi. I've set up the request and response HTTP processors, however I can't seem to figure out how to update the flowfile from the request processor with data from, say... another connection. Can someone please let me know how I can achieve this behaviour?
What is the use of Apache NiFi? Is it used in data science, or is it just a tool for working on some kind of data? What exactly does Apache NiFi do?

What is the use of Apache NiFi? Is it used in data science, or is it just a tool for working on some kind of data? What exactly does Apache NiFi do?
NiFi is a data orchestration and ETL/ELT tool. It's not meant for data science in the sense that data science is primarily about analytics. It's used by data engineers to process data, move data and things like that. These are the tasks that tend to happen prior to data science work.
I can't seem to figure out how to update the flowfile from the request processor with data from say
Use InvokeHTTP. You can configure it to "Always Output Response" and then you will have the response to work with. NiFi won't automagically merge the response with the data you send, so you would need to have the REST APIs you wrote return output that you find satisfactory on the NiFi side. That's a common use case with REST-based enrichment.
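To make that enrichment point concrete, here is a minimal sketch (not NiFi configuration, just an illustrative Python/Flask service with made-up endpoint and field names) of a REST API written so that its response already contains the original record plus the enrichment fields, which is what the InvokeHTTP response flowfile would then carry downstream:

```python
# Minimal sketch of an enrichment endpoint (Python/Flask), written so that the
# response already carries the original record plus the enrichment fields.
# The endpoint path and field names are hypothetical, not from the question.
from flask import Flask, request, jsonify

app = Flask(__name__)

def lookup_tier(record):
    # Placeholder for whatever lookup the real service would perform.
    return "gold" if record.get("total_spend", 0) > 1000 else "standard"

@app.route("/enrich", methods=["POST"])
def enrich():
    record = request.get_json(force=True)          # body posted by InvokeHTTP
    record["customer_tier"] = lookup_tier(record)  # add the enrichment field
    return jsonify(record)                         # echo original + new fields

if __name__ == "__main__":
    app.run(port=8080)
```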

Related

How to create a process in Dell Boomi that will get data from one Database and then will send data to a SaaS

I would like to know how to create a process in Dell Boomi that will meet the following criteria:
Read data directly from a production database table, then send the data to a SaaS (public internet) using a REST API.
Another process will read data from the SaaS (REST API) and then write it to another database table.
Please see the attached link showing what I have done so far; I really don't know how to proceed. Hope you can help me out. Thank you. Boomi DB connector
You are actually making a good start. For the first process (DB > SaaS) you need to do the following (a rough code sketch of the same steps follows this answer):
1. Ensure you have access to the DB - if your Atom is local then this shouldn't be much of an issue, but if it is on the Boomi Cloud, then you need to enable access to this DB from the internet (not something I would recommend).
2. Check what you need to read and define the Boomi Operation - from the image you have linked I can see that you are doing that, but not knowing what data you need and how it is structured, it is impossible to say whether you have defined it all correctly.
3. Transform the data to the output system's format - once you get the data from the DB, use the Map shape to map it to the Profile of the SaaS you are sending your data to.
4. Send the data to the SaaS - you can use the HttpClient connector to send data in JSON or XML (or any other format you like) to the SaaS REST API.
For the other process (SaaS > DB) the steps are practically the same but in reverse order.
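For readers who prefer to see the data movement spelled out, here is a rough, non-Boomi sketch of the same DB > SaaS steps in Python. The table, columns, and endpoint are made up purely for illustration; in Boomi each function below corresponds to a shape (DB connector/Operation, Map shape, HttpClient connector).

```python
# Rough, non-Boomi sketch of the DB > SaaS process above. Table, column and
# endpoint names are made up; in Boomi each function maps to a shape.
import json
import sqlite3                 # stand-in for the production database driver
import urllib.request

def read_rows(db_path):
    # Step 2: read data from the production table (DB connector + Operation).
    conn = sqlite3.connect(db_path)
    cur = conn.execute("SELECT id, name, price FROM products")
    return [dict(zip(("id", "name", "price"), row)) for row in cur]

def to_saas_profile(row):
    # Step 3: map the DB record to the SaaS profile (Map shape).
    return {"productId": row["id"], "title": row["name"], "unitPrice": row["price"]}

def send_to_saas(record, url):
    # Step 4: POST the mapped record as JSON (HttpClient connector).
    req = urllib.request.Request(
        url,
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    for row in read_rows("production.db"):
        send_to_saas(to_saas_profile(row), "https://example-saas.invalid/api/products")
```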

Calling API from PigLatin

Complete newbie to PigLatin, but looking to pull data from the MetOffice DataPoint API e.g.:
http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/xml/350509?res=3hourly&key=abc123....
...into Hadoop.
My question is "Can this be undertaken using PigLatin (from within Pig View, in Ambari)"?
I've hunted round for how to format a GET request into the code, but without luck.
Am I barking up the wrong tree? Should I be looking to use a different service within the Hadoop framework to accomplish this?
It is a very bad idea to make calls to external services from inside map-reduce jobs. The reason is that when running on the cluster your jobs are very scalable, whereas the external system might not be. Modern resource managers like YARN make this situation even worse: when you swamp the external system with requests, your tasks on the cluster will mostly be sleeping, waiting for a reply from the server. The resource manager will see that the CPU is not being used by the tasks and will schedule more of your tasks to run, which will make even more requests to the external system, swamping it even further. I've seen a modest 100-machine cluster putting out 100K requests per second.
What you really want to do is either get the bulk data from the web service somehow, or set up a system with a queue and a small, controlled number of workers that pull from the external system at a set rate.
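As a minimal sketch of that second option, assuming a fixed pool of workers and simple per-worker pacing (the worker count, delay, and URLs below are illustrative):

```python
# Sketch of a queue plus a small, fixed pool of workers that pull from the
# external service at a controlled rate, instead of calling it from every
# map task. Worker count, pacing and URLs are illustrative.
import queue
import threading
import time
import urllib.request

WORKERS = 4                   # small, fixed number of pullers
SECONDS_BETWEEN_CALLS = 1.0   # per-worker pacing

def worker(url_queue, results):
    while True:
        url = url_queue.get()
        if url is None:                       # sentinel: no more work
            url_queue.task_done()
            break
        with urllib.request.urlopen(url) as resp:
            results.append(resp.read())
        url_queue.task_done()
        time.sleep(SECONDS_BETWEEN_CALLS)     # keep the external system happy

def fetch_all(urls):
    url_queue, results = queue.Queue(), []
    threads = [threading.Thread(target=worker, args=(url_queue, results))
               for _ in range(WORKERS)]
    for t in threads:
        t.start()
    for u in urls:
        url_queue.put(u)
    for _ in threads:
        url_queue.put(None)
    url_queue.join()
    return results
```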
As for your original question, I don't think PigLatin provides such a service, but it could easily be done with UDFs in either Python or Java. With Python you can use the excellent requests library, which will make your UDF about 6 lines of code. A Java UDF will be a little more verbose, but nothing terrible by Java standards.
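For example, a CPython (streaming_python) Pig UDF along those lines might look like the sketch below; the function name and output schema are made up, and it assumes the requests library is available on the worker nodes.

```python
# metoffice_udf.py -- sketch of a CPython (streaming_python) Pig UDF that
# fetches a DataPoint forecast. Function name and output schema are made up,
# and the real key/location values come from the caller.
from pig_util import outputSchema
import requests

@outputSchema("forecast_xml:chararray")
def fetch_forecast(location_id, api_key):
    url = ("http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/xml/"
           "{0}?res=3hourly&key={1}".format(location_id, api_key))
    return requests.get(url, timeout=30).text
```

It could then be registered from the Pig script with something like REGISTER 'metoffice_udf.py' USING streaming_python AS metoffice; - keeping in mind the scalability caveat above.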
"Can this be undertaken using PigLatin (from within Pig View, in
Ambari)"?
No. By default Pig loads from HDFS storage, unless you write your own loader.
And I share the same view as #Vlad that this is not a good idea. There are many other components used for data ingestion, but this is not a use case for Pig!

How to use Apache Nifi to query a REST API?

For a project I need to develop an ETL (extract, transform, load) process that reads data from a (legacy) tool that exposes its data on a REST API. This data needs to be stored in Amazon S3.
I would really like to try this with Apache NiFi, but I honestly have no clue yet how I can connect to the REST API, or where/how I can implement some business logic to 'talk the right protocol' with the source system. For example, I would like to keep track of what data has been written so far so it can resume loading where it left off.
So far I have been reading the NiFi documentation and I'm getting a better insight into what the tool provides/entails. However, it's not clear to me how I could implement the task within the NiFi architecture.
Hopefully someone can give me some guidance?
Thanks,
Paul
The InvokeHTTP processor can be used to query a REST API.
Here is a simple flow that:
- Queries the REST API at https://api.exchangeratesapi.io/latest every 10 minutes
- Sets the output-file name (exchangerates_<ID>.json)
- Stores the query response in the output file on the local filesystem (under /tmp/data-out)
I exported the flow as a NiFi template and stored it in a gist. The template can be imported into a NiFi instance and run as is.
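For anyone who wants to see the moving parts outside NiFi, the flow described above is roughly equivalent to this stand-alone polling sketch (the URL and output directory come from the flow description; the 10-minute schedule is simplified to a sleep loop, and a UUID stands in for the <ID> as an assumption):

```python
# Rough stand-alone equivalent of the NiFi flow described above: query the
# exchange-rates API every 10 minutes and write each response to
# /tmp/data-out/exchangerates_<ID>.json. A sleep loop stands in for NiFi's
# processor scheduling, and a UUID stands in for <ID>.
import os
import time
import uuid
import urllib.request

API_URL = "https://api.exchangeratesapi.io/latest"
OUT_DIR = "/tmp/data-out"

def poll_once():
    with urllib.request.urlopen(API_URL) as resp:
        body = resp.read()
    os.makedirs(OUT_DIR, exist_ok=True)
    filename = "exchangerates_{0}.json".format(uuid.uuid4())
    with open(os.path.join(OUT_DIR, filename), "wb") as f:
        f.write(body)

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(600)   # 10 minutes, like the InvokeHTTP run schedule
```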

Testing a http-client

This project uses HTTP client libraries to poll an HTTP server for an XML file containing data gathered from hardware. Polling happens relatively fast. The data changes over time. Only one XML file is ever polled.
Is there a testing method/tool that can act as the HTTP server and feed the client an XML file based on the time it is polled?
Basically, what I'm trying to do is send XML data that may change on each poll. Each version of the data is pre-determined for testing.
An idea I've thought of is having a log rotator script cron'ed at the polling frequency to check out and replace each version of the data into /var/log/www and let Apache handle the rest. However, this does not tightly control which version will be served when it is polled, as network delay may cause files to be replaced before the data is served. Each version of the data must be served and no versions may be skipped.
Any solutions/thoughts/methods/ideas will be appreciated.
Thanks
If you are attempting to perform unit tests of specific functionality, I would suggest mocking the HTTP response and going from there. It is relatively easy to set up and then very easy to modify.
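As one possible sketch of that idea, a tiny in-process HTTP server can hand back a pre-determined sequence of XML payloads, one per poll, so that every version is served in order and none is skipped (the payloads and port are illustrative):

```python
# Sketch of a test HTTP server that serves a fixed sequence of XML payloads,
# one per request, so each pre-determined version is served in order and none
# is skipped. Payloads and port are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer

XML_VERSIONS = [
    b"<data><reading>1</reading></data>",
    b"<data><reading>2</reading></data>",
    b"<data><reading>3</reading></data>",
]

class SequencedXMLHandler(BaseHTTPRequestHandler):
    poll_count = 0  # shared across requests; fine for a single-threaded test server

    def do_GET(self):
        # Serve the next version; repeat the last one once the list is exhausted.
        index = min(SequencedXMLHandler.poll_count, len(XML_VERSIONS) - 1)
        SequencedXMLHandler.poll_count += 1
        body = XML_VERSIONS[index]
        self.send_response(200)
        self.send_header("Content-Type", "application/xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), SequencedXMLHandler).serve_forever()
```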

Using an ESB system to replicate data among databases

I work at a small supermarket chain (4 stores). Each store has its own local database, which contains information about each product, prices, and the transactions that have occurred at the store. In addition, each store needs to replicate this information back and forth with a central location.
Right now we are using something called SQLRemote, which is a feature of Sybase's SQL Anywhere database. It works, but sometimes fails and is difficult to manage. To its credit, SQLRemote actually wasn't designed for this type of scenario, so it could be said that we are using it incorrectly.
I was thinking that an ESB system such as Mule (or ChainBuilder, which seems easier to set up) might be a good alternative to SQLRemote. I understand that these systems can detect when changes occur in the database (i.e. when records are added, modified or deleted) and can be set up to deliver a message in a transaction.
Would this be a viable solution to my scenario?
Best regards,
Edgard
Yeah, I am sure Mule should be able to do this.
However, I work for a company which provides Fuse ESB, which builds on Apache projects such as Apache ServiceMix, Apache ActiveMQ, Apache Camel and Apache CXF.
We have a user story about a very big retailer in the US which uses Fuse ESB to integrate their stores, warehouses, and whatnot:
http://fusesource.com/collateral/17
Fuse ESB
http://fusesource.com/products/enterprise-servicemix/
Yes, Mule can support this scenario, though it might be overkill. There are targeted database replication solutions out there. The advantage of Mule would be its ability to handle failures and other scenarios where you need the workflow to adapt based on what is happening. This allows you to build a very robust solution.
Mule flows could be a very good choice to address this problem. It's a new feature of Mule 3 designed for orchestrating integrations like this.
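To make the pattern concrete independent of any particular ESB, the change-detection half of such a flow boils down to "find rows changed since the last watermark and deliver one message per change". The sketch below is plain Python with made-up table, column, and destination names, purely to illustrate what a Mule/ServiceMix flow would be orchestrating:

```python
# Conceptual sketch of the change-detection and message-delivery pattern an
# ESB flow would orchestrate for store-to-central replication. Table, column
# and destination names are made up; a real Mule/ServiceMix flow replaces
# all of this with connectors and transformers.
import json
import sqlite3   # stand-in for the store's local database

def changed_rows_since(conn, last_seen):
    # Detect records added or modified after the last replicated change.
    cur = conn.execute(
        "SELECT id, product, price, changed_at FROM transactions "
        "WHERE changed_at > ?",
        (last_seen,),
    )
    return [dict(zip(("id", "product", "price", "changed_at"), row)) for row in cur]

def publish(message):
    # Placeholder for delivering one message per change to the central
    # location (in an ESB this would be a JMS/ActiveMQ or HTTP endpoint).
    print(json.dumps(message))

def replicate_once(conn, last_seen):
    for row in changed_rows_since(conn, last_seen):
        publish(row)
        last_seen = max(last_seen, row["changed_at"])
    return last_seen   # new watermark to persist for the next run
```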