What is the recommended way to provide an API for Apache Spark application results


We have a huge data set stored on a Hadoop cluster. We need to analyze this data with Apache Spark and provide the results of the analysis to other applications via an API.
I have two ideas, but I cannot figure out which one is recommended.
The first option is to build one or more Spark applications that perform the analysis and store the results in another datastore (a relational DB or even HDFS), then develop a second application that reads the results from that datastore and provides an API for querying them.
The second option is to merge the two applications into one. This way I avoid the need for another datastore, but the application has to be up and running all the time.
What is the recommended approach in this situation? If there are other options, kindly list them.

It depends on how frequently users are going to hit the GET API. If clients want real-time results, go for the inline API (the second option); otherwise, use the first approach of storing the results in another datastore.
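As a rough illustration of the first approach, here is a minimal sketch in which a batch job writes precomputed results to a datastore and a separate read path serves them. SQLite stands in for the real datastore, a plain Python loop stands in for the Spark job, and the table and function names (results, run_batch_job, get_total) are made up for the example.

```python
import sqlite3


def run_batch_job(conn, events):
    """Stand-in for the Spark job: aggregate the input and persist the results."""
    totals = {}
    for user_id, amount in events:
        totals[user_id] = totals.get(user_id, 0) + amount
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (user_id TEXT PRIMARY KEY, total REAL)"
    )
    # Overwrite earlier results so that re-running the job is safe.
    conn.executemany(
        "INSERT OR REPLACE INTO results (user_id, total) VALUES (?, ?)",
        list(totals.items()),
    )
    conn.commit()


def get_total(conn, user_id):
    """Stand-in for the API handler: it only reads precomputed results."""
    row = conn.execute(
        "SELECT total FROM results WHERE user_id = ?", (user_id,)
    ).fetchone()
    return None if row is None else row[0]


conn = sqlite3.connect(":memory:")
run_batch_job(conn, [("alice", 10), ("bob", 5), ("alice", 2.5)])
print(get_total(conn, "alice"))  # 12.5
```

The API side never touches the analysis code, so it can stay up while the job runs on its own schedule; the price is that results are only as fresh as the last run.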

Related

API which provides data from Elasticsearch and not SQL

I have a system with large datasets that I want to search quickly, and Elasticsearch is suitable for that. The data resides in SQL and is synced to ES. There is an obvious small delay in this sync.
Some consumers of this data can work with slightly stale data. For example, there is an API for the UI that end users use to view the dataset, and a delay of 3-4 seconds is acceptable there. So an API handler that reads from ES is perfect here.
Then there are consumers of this data (bots) that want to work with real-time data. For almost the same requirements, should I create another API, just like the one for the UI consumer, that gets its data from SQL?
What is the usual best practice here? I'm assuming this is a very common use case.

You should probably stick to creating just a single API and use a query string parameter to decide which of the two data sources to use. This will result in less code to maintain.
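A minimal sketch of that idea, assuming a parameter named source that defaults to the search index. The parameter name, the URL path, and the two reader functions are hypothetical placeholders for the real Elasticsearch and SQL access code.

```python
from urllib.parse import parse_qs, urlparse


def search_index(term):
    """Hypothetical reader for the (slightly stale) search index."""
    return {"source": "es", "term": term}


def query_primary(term):
    """Hypothetical reader for the (real-time) SQL database."""
    return {"source": "sql", "term": term}


SOURCES = {"es": search_index, "sql": query_primary}


def handle_get(url):
    """One handler for both kinds of consumer; the query string picks the backend."""
    params = parse_qs(urlparse(url).query)
    term = params.get("q", [""])[0]
    source = params.get("source", ["es"])[0]  # default to the fast, stale source
    if source not in SOURCES:
        return 400, {"error": "unknown source: " + source}
    return 200, SOURCES[source](term)


print(handle_get("/items?q=abc"))
print(handle_get("/items?q=abc&source=sql"))
```

Both consumers share validation, serialization, and routing; only the lookup in SOURCES differs.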

In DataFactory, what is a good strategy to migrate data into Dynamics365 using Dynamics Web API?

I need to migrate data to Dynamics365 using DataFactory. The Dynamics365 connector is not enough for me, since one of the requirements is to update only those attributes that have been modified since the last migration, not the whole record. The other requirement is that sometimes we have to set values to null in the destination.
I believe I can do that by generating a different JSON payload for each record and migrating them using the Web API.
I thought of putting these calls in Azure Functions, but I believe they are not meant to be used like this, even though with the right pricing plan they can run with no time limit.
I think I'm doing it wrong and I can't figure out the right way.
Could you share your experience or point of view?

The correct way to interact with Dynamics 365 from another application is either directly through the Web API or through the C# SDK. In both scenarios, the best way to create or update multiple records (as far as I know) is the ExecuteMultipleRequest message, which lets you bundle updates, creates, and deletes and execute them in one request.
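The per-record JSON idea from the question could be sketched roughly as follows. This assumes the target endpoint updates only the properties present in the request body and clears a value when it receives null; the attribute names and the build_patch helper are illustrative, not part of any Dynamics API.

```python
import json


def build_patch(previous, current):
    """Return only the attributes that changed since the last migration.

    A value of None is kept in the patch so that it serializes to JSON null.
    """
    patch = {}
    for key, value in current.items():
        if key not in previous or previous[key] != value:
            patch[key] = value
    # Attributes that vanished from the source are cleared explicitly.
    for key in previous:
        if key not in current:
            patch[key] = None
    return patch


# Snapshot from the last migration versus the current source record.
previous = {"name": "Contoso", "telephone1": "555-0100", "fax": "555-0101"}
current = {"name": "Contoso", "telephone1": "555-0199", "fax": None}

body = json.dumps(build_patch(previous, current), sort_keys=True)
print(body)  # {"fax": null, "telephone1": "555-0199"}
```

Unchanged attributes such as name never appear in the payload, which covers both requirements at once: partial updates and explicit nulls.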

Database for live mobile tracking

I'm developing an app that tracks a mobile device live, and I need some advice. The application must send the location to a web service that in turn records the received data in a database.
What would be, in your opinion, the best way to store the location values?
I'm new to big data and I'm afraid that simple SQL requests won't be able to do the work properly. I imagine that with a lot of users, each sending a request every second, I'll have issues with the database.
Any advice? Thank you very much.

I think you could look into the geospatial queries in MongoDB, if you choose to go ahead with MongoDB.
The design of the database would depend on the nature of your queries (essentially the read and write patterns).
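To give an idea of what such a geospatial setup might look like, here is a small sketch that builds a location document with a GeoJSON point and a $near filter as plain Python dictionaries. The field names (device_id, recorded_at, position) are assumptions for the example; with a real deployment, the filter would be passed to a driver's find call, and the position field would need a 2dsphere index.

```python
from datetime import datetime, timezone


def location_document(device_id, lon, lat, when=None):
    """Build one location record; GeoJSON points are [longitude, latitude]."""
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        raise ValueError("coordinates out of range")
    return {
        "device_id": device_id,
        "recorded_at": when or datetime.now(timezone.utc),
        "position": {"type": "Point", "coordinates": [lon, lat]},
    }


def near_query(lon, lat, max_meters):
    """Filter for documents whose position lies within max_meters of a point."""
    return {
        "position": {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$maxDistance": max_meters,
            }
        }
    }


doc = location_document("device-42", 2.35, 48.85)
query = near_query(2.35, 48.85, 500)
print(doc["position"])
print(query)
```

Whether one document per update is the right shape depends on the read pattern: "latest position per device" and "full history of a trip" usually call for different layouts.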

Working at Cintric, we landed on using Elasticsearch. We process billions of location points in real time and provide advanced analytics to our users.
We started with MongoDB and ran into a lot of trouble, eventually leading to a painful migration.
Our stack currently has mobile devices dump location updates into AWS Kinesis; these are then processed by AWS Lambda handlers and dumped into Elasticsearch. We're able to serve, process, and store 300 million requests/month for only a few hundred dollars/month. Analytics for our dashboard add additional cost, but for your needs I would highly recommend checking out your options on AWS.

What kind of server for operational transform operations?

I am hoping to use the Diff-Match-Patch algorithms available from Google as part of the Google MobWrite real-time collaborative text editor protocol, in order to embed a real-time collaborative text editor in my program.
I was wondering what the most efficient way of storing "global" copies of each document that users are editing might be. I would like each document to be stored on a server that is not local to any user, and each time a user performs an "operation" (delete, insert, paste, cut), the diff between their copy and the server's is computed and patched, and so on. If you know the Google MobWrite protocol, you probably understand what I am saying.
Should the server's text be stored as a file that is changed, inside an SQL database as a long string, or something else? Should I be using WebSockets to communicate with the server? I am honestly kind of an amateur when it comes to this, but I am generally a fast learner. Does anyone have any tips or resources I could follow? Thanks a lot.

This would be a big project to tackle from scratch, so I suggest you use one of the many open source projects in this area. For example, Etherpad:
https://code.google.com/p/etherpad/

MobWrite uses the Differential Synchronization technique, which is totally different from the Operational Transformation technique.
Differential Synchronization assumes a communication cycle that always starts from the client (the browser), which means you can't use WebSockets to push diffs from the server directly. The browser needs to poll the server frequently for updates (say, every 2 seconds); otherwise your shadow copies will get out of sync.
For storing the shadow copies while the user is active, you can use whatever you want, but it's better to use an in-memory DB (such as Redis), since you need fast access to compute the diffs and apply the patches. When the user leaves the session, you no longer need their copy. If you need persistence in your app, you should persist only the server copy, not the shadow copies (shadow copies are only used to find the diffs); for that you can use MySQL or whatever you like.
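The client-initiated cycle and the split between the server copy and the per-session shadow copies can be sketched as below. This is a simplified illustration, not the MobWrite implementation: Python's difflib replaces Diff-Match-Patch, a dictionary replaces Redis, and the fuzzy patch step is replaced by a stricter rule that drops any client edit overlapping a concurrent server-side change. All class and function names are invented for the example.

```python
import difflib


def make_diff(old, new):
    """Edits turning old into new, as (start, end, replacement) on old's indices."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def apply_diff(text, edits):
    """Apply non-overlapping edits right to left so earlier indices stay valid."""
    for start, end, replacement in sorted(edits, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


class Server:
    def __init__(self, text):
        self.text = text    # the server copy: the only thing worth persisting
        self.shadows = {}   # per-session shadow copies, kept in memory only

    def join(self, session):
        self.shadows[session] = self.text
        return self.text

    def leave(self, session):
        self.shadows.pop(session, None)  # the shadow is useless after the session

    def sync(self, session, client_edits):
        old_shadow = self.shadows[session]
        # What other sessions changed since this session last synced.
        server_edits = make_diff(old_shadow, self.text)
        # Best effort: keep client edits that do not collide with those changes.
        accepted = [c for c in client_edits
                    if all(c[1] <= s[0] or s[1] <= c[0] for s in server_edits)]
        self.text = apply_diff(old_shadow, server_edits + accepted)
        client_view = apply_diff(old_shadow, client_edits)
        reply = make_diff(client_view, self.text)
        self.shadows[session] = self.text
        return reply


class Client:
    def __init__(self, server, session):
        self.server, self.session = server, session
        self.text = self.shadow = server.join(session)

    def sync(self):
        """One polling cycle; assumes no local typing while the request runs."""
        edits = make_diff(self.shadow, self.text)
        reply = self.server.sync(self.session, edits)
        self.text = self.shadow = apply_diff(self.text, reply)


server = Server("hello world")
a, b = Client(server, "a"), Client(server, "b")
a.text = "hello brave world"
b.text = "hello world!"
for client in (a, b, a):
    client.sync()
print(server.text)
```

Every exchange is started by a client, which is why each client only learns about the other's edit on its next poll; after the three cycles above, both clients and the server hold the same text.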
But for the Operational Transformation technique, there are some nice libraries out there.
NodeJS:
ShareJS (sharejs.org): supports all operations for JSON.
RacerJS: a synchronization model built on top of ShareJS.
DerbyJS: a complete framework that uses RacerJS as its model.
OpenCoweb (opencoweb.org): the server is either Java or Python, and the client is built with Dojo.

Where should we calculate fields?

I'm currently working on a Silverlight / MS SQL project where the Entity Framework has not been implemented, and I would like to know the best practice for dealing with calculated fields in this situation.
Considering that some external systems might also consume my data directly from the DB or through a web service, here are the 3 options I can see right now.
1) Force any external system to consume data through a web service and create all the calculated fields in the objects only.
2) Create the calculated fields in a DB view and resync your object with the server each time a value needs to be calculated.
3) Replicate the calculation rules in both the object and the database view.
Any other suggestions would also be welcome.

I would recommend following two principles: data decoupling and minimal duplication of functionality. Both suggest putting your calculations in one place only and serving them already calculated. So I would implement the calculations in the DB and serve them via a web service.
However, you have to consider your particular case. For example, if the calculations are VERY heavy, you could delegate them to the client to spare server resources. This could even be the reason you are using Silverlight. I am in a similar situation on a project, and I found that the best compromise is to push raw data to the client and have it do the heavy computations.

Having a single best practice or approach for this kind of problem is difficult: as circumstances change, what was formerly a good approach might start to seem less useful. That said, where possible I would do anything data-related at the DB level, including calculated fields. This way you know that no matter where you look at the data from, you will see the same results. So your web service, SQL reporting, and anything else that needs to look at or receive data will see the same result.
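As a small illustration of keeping the calculation at the DB level, the sketch below defines a calculated field once in a view and has a stand-in service function read from it. SQLite is used only so the example is self-contained (the question is about MS SQL), and the table, view, and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_lines (
        id INTEGER PRIMARY KEY,
        quantity INTEGER,
        unit_price REAL
    );
    -- The calculation rule lives in exactly one place: the view.
    CREATE VIEW order_lines_with_total AS
        SELECT id, quantity, unit_price, quantity * unit_price AS line_total
        FROM order_lines;
    INSERT INTO order_lines (quantity, unit_price) VALUES (3, 2.5), (2, 10.0);
""")


def fetch_order_lines(connection):
    """What a web service handler might return: rows with line_total precomputed."""
    columns = ("id", "quantity", "line_total")
    cursor = connection.execute(
        "SELECT id, quantity, line_total FROM order_lines_with_total ORDER BY id"
    )
    return [dict(zip(columns, row)) for row in cursor]


print(fetch_order_lines(conn))
```

A reporting tool that queries the view directly and a client that goes through the service both see the same line_total, because neither reimplements the rule.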

