Is there a way to leverage Hadoop tools to mange parallel REST API calls to external sources? - api

I am writing software that creates a large graph database. The software needs to access dozens of different REST APIs with millions of total requests. The data will then be processed by the Hadoop cluster. Each of these APIs have rate limits that vary by requests/second, per window, per day and per user (typically via OAuth).
Does anyone have any suggestions on how I might use either a Map function or other Hadoop-ecosystem tool to manage these queries? The goal would to be to leverage the parallel processing in Hadoop.
Because of the varied rate limits, it often makes sense to switch to a different API query while waiting for the first limit to reset. An example would be one API call that creates nodes in the graph and another that enriches the data for that node. I could have the system go out and enrich the data for the new nodes while waiting for the first API limit to reset.
I have tried using SQS queuing on EC2 to manage the various API limits and states (creating a queue for each API call), but have found it to be ridiculously slow.
Any ideas?

It looks like the best option for my scenario will be using Storm, or specifically the Trident abstraction. It gives me the greatest flexibility for both workload management but process management as well

Related

Deploy REST API over multiple servers world-wide, but stay in sync

I've built a REST API with a pretty decent latency. Each request is answered in ~100 ms with a thousand requests per second. This is however with a relatively low physical distance to the data center. The users of this API would, however, be spread all over the globe. From the US for example (to a data center in Germany), the response time for a single request is ~400 ms under no load.
What would be the best approach to deploying this API? I suspect multiple servers at different locations, with a load balancer in front. How would I ensure that the MySQL database would stay in sync between the servers in that case?
With multiple servers and a load balancer, the costs rise exponentially, which is something I can hopefully afford in the future, but not at the moment.
I'd love to hear your ideas!
Afaik. for big projects people use event sourcing with an event storage and microservices and message queues between them or a basic solution is polling the event storage through a simple REST API something like send me the latest events since the last event I received. If you can accept eventual consistency, then I think this approach can work pretty well. It makes writing somewhat slower, but reading can be very fast with it. No need to sync the MySQL databases directly, you just pull the latest events and use a projection to update the local MySQL database. So the event storage is the single source of truth.

Is there a way to get Splunk Data to BigQuery?

I have some app data which is currently stored in Splunk. But i am looking for a way where I can input the Splunk data directly to BigQuery. My target is to analyze the app data on BigQuery and perhaps create Data Studio dashboards based on the BigQuery.
I know there are a lot of third party connectors that can help me with this, but I am looking for a solution where I can use features from Splunk or BigQuery to conncet both of them together and not rely on third party connectors.
Based on your comment indicating that you're interested in resources to egress data from Splunk into BigQuery with custom software, I would suggest using either tool's REST API on either side.
You don't indicate whether this is a one-time or a recurring asking - that may impact where you want the software to run that performs this operation. If it's a one-time thing and you've got a fair internet connection yourself, you may just want to write a console application from your own machine to perform the migration. If it's a recurring operation, you might instead look at any of the various "serverless" hosting options out there (e.g. Azure Functions, Google Cloud Functions, or AWS Lambda). In addition to development experience, note that you may have to pay an egress bandwidth cost for each on top of normal service charges.
Beyond that, you need to decide whether it makes more sense to do a bulk export from Splunk out to some external file that you load into Google Drive and then import into Big Query. But maybe it makes more sense to download the records as paged data via HTTPS so you can perform some ETL operation on top of it (e.g. replace nulls with empty strings, update Datetime types to match Google's exacting standards, etc.). If you go this route, it looks as though this is the documentation you'd use from Splunk and you can either use Google's newer, and higher-performance Storage Write API to receive the data or their legacy streaming API to ingest into BigQuery. Either option supports SDKs across varied languages (e.g. C#, Go, Ruby, Node.js, Python, etc.), though only the legacy streaming API supports plain HTTP REST calls.
Beyond that, don't forget your OAuth2 concerns to authenticate on either side of the operation, though this is typically abstracted away by the various SDKs offered by either party, and less of something you'd have to deal with the ins and outs of.

PubSub topic with binary data to BigQuery

I'm expected to have thousands of sensors sending telemetry data at 10FPS with around 1KB of binary data per frame, using IOT Core, meaning I'll get it via PubSub. I'd like to get that data to BigQuery, and no processing is needed.
As Dataflow don't have a template capable of dealing with binary data, and working with it seems a bit cumbersome, I'd like to try to avoid it and go full serverless.
Question is, what's my best alternative?
I've thought about Cloud Run service running an express app to accept the data from PubSub, and using global variable to accumulate around 500 rows in ram, then dump it using BigQuery's insert() method (NodeJS client).
How reasonable is that? Will I gain something from accumulation, or should I just insert to bigquery every single incoming row?
Streaming Ingestion
If your requirement is to analyze high volumes of continuously arriving data with near-real-time dashboards and queries, streaming inserts would be a good choice. The quotas and limits for streaming inserts can be found here.
Since you are using the Node.js client library, use the BigQuery legacy streaming API's insert() method as you have already mentioned. The insert() method streams one row at a time irrespective of accumulation of rows.
For new projects, the BigQuery Storage Write API is recommended as it is cheaper and has an enriched feature set than the legacy API does. The BigQuery Storage Write API only supports Java, Python and Go(in preview) client libraries currently.
Batch Ingestion
If your requirement is to load large, bounded data sets that don’t have to be processed in real-time, prefer batch loading. BigQuery batch load jobs are free. You only pay for storing and querying the data but not for loading the data. Refer to quotas and limits for batch load jobs here. Some more key points on batch loading jobs have been quoted from this article.
Load performance is best effort
Since the compute used for loading data is made available from a shared pool at no cost to the user,
BigQuery does not make guarantees on performance and available
capacity of this shared pool. This is governed by the fair scheduler
allocating resources among load jobs that may be competing with loads
from other users or projects. Quotas for load jobs are in place to
minimize the impact.
Load jobs do not consume query capacity
Slots used for querying data are distinct from the slots used for ingestion. Hence, data
ingestion does not impact query performance.
ACID semantics
For data loaded through the bq load command, queries will either reflect the presence of all or none of the data .
Queries never scan partial data.

Calling API from PigLatin

Complete newbie to PigLatin, but looking to pull data from the MetOffice DataPoint API e.g.:
http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/xml/350509?res=3hourly&key=abc123....
...into Hadoop.
My question is "Can this be undertaken using PigLatin (from within Pig View, in Ambari)"?
I've hunted round for how to format a GET request into the code, but without luck.
Am I barking up the wrong tree? Should I be looking to use a different service within the Hadoop framework to accomplish this?
It is very bad idea to make calls to external services from inside of map-reduce jobs. The reason being that when running on the cluster your jobs are very scalable whereas the external system might not be so. Modern resource managers like YARN make this situation even worse, when you swamp external system with the requests your tasks on the cluster will be mostly sleeping waiting for reply from the server. The resource manager will see that CPU is not being used by tasks and will schedule more of your tasks to run which will make even more requests to the external system, swamping it with the requests even more. I've seen modest 100 machine cluster putting out 100K requests per second.
What you really want to do is to either somehow get the bulk data from the web service or setup a system with a queue and few controlled number of workers that will pull from the external system at set rate.
As for your original question, I don't think PigLatin provides such service, but it could be easily done with UDFs either Python or Java. With Python you can use excellent requests library, which will make your UDF be about 6 lines of code. Java UDF will be little bit more verbose, but nothing terrible by Java standards.
"Can this be undertaken using PigLatin (from within Pig View, in
Ambari)"?
No, by default Pig load from HDFS storage, unless you write your own loader.
And i share same point with #Vlad, that this is not a good idea, you have many other other components used for data ingestion, but this not a use case of Pig !

Need Design & Implementation inputs on Cassandra based use case

I am planning to store high-volume order transaction records from a commerce website to a repository (Have to use cassandra here, that is our DB). Let us call this component commerceOrderRecorderService.
Second part of the problem is - I want to process these orders and push to other downstream systems. This component can be called batchCommerceOrderProcessor.
commerceOrderRecorderService & batchCommerceOrderProcessor both will run on a java platform.
I need suggestion on design of these components. Especially the below:
commerceOrderRecorderService
What is he best way to design the columns, considering performance and scalability? Should I store the entire order (complex entity) as a single JSON object. There is no search requirement on the order attributes. We can at least wait until they are processed by the batch processor. Consider - that a single order can contain many sub-items - at the time of processing each of which can be fulfilled differently. Designing columns for such data structure may be an overkill
What should be the key, given that data volumes would be high. 10 transactions per second let's say during peak. Any libraries or best practices for creating such transactional data in cassandra? Can TTL also be used effectively?
batchCommerceOrderProcessor
How should the rows be retrieved for processing?
How to ensure that a multi-threded implementation of the batch processor ( and potentially would be running on multiple nodes as well ) will have row level isolation. That is no two instance would read and process the same row at the same time. No duplicate processing.
How to purge the data after a certain period of time, while being friendly to cassandra processes like compaction.
Appreciate design inputs, code samples and pointers to libraries. Thanks.
Depending on the overall requirements of your system, it could be feasible to employ the architecture composed of:
Cassandra to store the orders, analytics and what have you.
Message queue - your commerce order recorder service would simple enqueue new order to the transactional and persistent queue and return. Scalability and performance should not be an issue here as you can easily achieve thousands of transactions per second with a single queue server. You may have a look at RabbitMQ as one of available choices.
Stream processing framework - you could read a stream of messages from the queue in a scalable fashion using streaming frameworks such as Twitter Storm. You could implement in Java than 3 simple pipelined processes in Storm:
a) Spout process that dequeues next order from the queue and pass it to
the second process
b) Second process called Bolt that inserts each next order to Cassandra and pass it to the third bolt
c) Third Bolt process that pushes the order to other downstream systems.
Such an architecture offers high-performance, scalability, and near real-time, low latency data processing. It takes into account that Cassandra is very strong in high-speed data writes, but not so strong in reading sequential list of records. We use Storm+Cassandra combination in our InnoQuant MOCA platform and handle 25.000 tx/second and more depending on hardware.
Finally, you should consider if such an architecture is not an overkill for your scenario. Nowadays, you can easily achieve 10 tx/second with nearly any single-box database.
This example may help a little. It loads a lot of transactions using the jmxbulkloader and then batches the results into files of a certain size to be transported else where. It multi-threaded but within the same process.
https://github.com/PatrickCallaghan/datastax-bulkloader-writer-example
Hope it helps. BTW it uses the latest cassandra 2.0.5.