Dynamic scheduler on GCP - API

Does GCP have a job scheduling service like Azure Scheduler, where jobs can be scheduled and managed dynamically via API?
The Google App Engine Cron service is configured in a static file, and it seems their answer to this is to use it to poke a roll-your-own service backed by Pub/Sub and a data store. I'm looking for Quartz-like functionality, consumable by App Engine, that can be managed and invoked via API, as opposed to managing a cluster, queue, and compute instance/VM deployment of Quartz (or the like) or rolling a custom solution. It should support 50 million jobs per day, with retry/recoverability and dynamic per-tenant scheduling.
This is the cheapest and easiest way I can imagine building a solution today on top of an existing App Engine-based project.

As you observed, currently there is no such API/service directly available on GCP. There is an open feature request (on GAE) for it.
But, also as you observed, it is possible to build and use a custom solution, just like the one you proposed.
Depending on the context, even simpler solutions are possible. For a GAE context, check out, for example, How to schedule repeated jobs or tasks from user parameters in Google App Engine?
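For illustration, the dynamic per-tenant scheduling being discussed can be approximated with App Engine push task queues by enqueueing each job with an explicit ETA. A minimal sketch using the GAE Java Task Queue API, assuming a hypothetical tenant-jobs queue (with its retry policy defined in queue.xml) and a /tasks/run-job worker handler:

```java
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskHandle;
import com.google.appengine.api.taskqueue.TaskOptions;

public class TenantJobScheduler {

  /**
   * Schedules a one-off job for a tenant by enqueueing a push task with an explicit ETA.
   * Retries and recoverability come from the queue's retry policy (configured in queue.xml).
   */
  public static TaskHandle scheduleJob(String tenantId, String payload, long runAtMillis) {
    Queue queue = QueueFactory.getQueue("tenant-jobs");   // hypothetical queue name
    return queue.add(TaskOptions.Builder
        .withUrl("/tasks/run-job")                        // App Engine handler that executes the job
        .method(TaskOptions.Method.POST)
        .param("tenant", tenantId)
        .param("payload", payload)
        .etaMillis(runAtMillis));                         // absolute time the task becomes eligible to run
  }
}
```

At tens of millions of tasks per day you would still need to check queue throughput limits and possibly shard across multiple queues, so treat this as a starting point rather than a drop-in answer.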

Related

spring batch API to dynamically create tasks and steps

Our application allows users to dynamically assemble/construct Tasks and Steps (remote calls to various services). Once the user has defined/constructed a Task with various Steps, the system will allow the Task to be run manually or scheduled.
To implement the above functionality, what would it take to use Spring Batch to programmatically create a given Task and its constituent Steps? I'm assuming that such tasks can be scheduled with the help of Quartz, etc.
I understand that Spring XD and Spring Cloud Watch incorporate similar features -- some pointers to the relevant XD codebase/examples (or any other project's codebase) that would help with my task would be appreciated.
What considerations should I keep in mind ? Any gotchas ?
Currently not using any cloud platform but the next version will be deployed to a cloud platform.
Thanks very much.
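As a reference point, here is a minimal sketch of what programmatic assembly looks like with Spring Batch's builder APIs, assuming each user-defined step wraps one remote call in a Tasklet (the class name and the step map are hypothetical; the factories and JobLauncher come from your Spring Batch configuration):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.job.builder.SimpleJobBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.core.step.tasklet.Tasklet;

public class DynamicTaskAssembler {

  private final JobBuilderFactory jobs;
  private final StepBuilderFactory steps;
  private final JobLauncher jobLauncher;

  public DynamicTaskAssembler(JobBuilderFactory jobs, StepBuilderFactory steps,
      JobLauncher jobLauncher) {
    this.jobs = jobs;
    this.steps = steps;
    this.jobLauncher = jobLauncher;
  }

  /**
   * Builds a Job at runtime from an ordered map of step name -> tasklet,
   * where each tasklet wraps one user-defined remote call.
   */
  public Job assemble(String taskName, LinkedHashMap<String, Tasklet> stepTasklets) {
    SimpleJobBuilder builder = null;
    for (Map.Entry<String, Tasklet> entry : stepTasklets.entrySet()) {
      Step step = steps.get(entry.getKey()).tasklet(entry.getValue()).build();
      builder = (builder == null) ? jobs.get(taskName).start(step) : builder.next(step);
    }
    return builder.build();
  }

  /** Runs the assembled job immediately; a Quartz trigger could call this on a schedule. */
  public void runNow(Job job) throws Exception {
    jobLauncher.run(job, new JobParametersBuilder()
        .addLong("launchedAt", System.currentTimeMillis()) // unique params -> new JobInstance per run
        .toJobParameters());
  }
}
```

One gotcha to keep in mind: jobs assembled this way still need a JobRepository behind the factories, and each launch needs unique JobParameters if you want re-runs to be treated as new instances.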

Using Google Cloud ecosystem vs building your own microservice architecture

Building in the Google Cloud ecosystem is really powerful. I really like how you can ingest files into Cloud Storage, then Dataflow enriches, transforms, and aggregates the data, which is finally stored in BigQuery or Cloud SQL.
I have a couple of questions to help me have a better understanding.
If you are to build a big data product using the Google services.
When a front-end web application (perhaps built in React) submits a file to Cloud Storage, it may take some time before the file is completely processed. The client might want to view the status of the file in the pipeline, and then do something with the result on completion. How are front-end clients expected to know when a file has finished processing and is ready? Do they need to poll data from somewhere?
Suppose you currently have a microservice architecture in which each service does a different kind of processing: for example, one might parse a file, another might process messages. The services communicate using Kafka or RabbitMQ and store data in Postgres or S3.
If you adopt the Google services ecosystem, could you replace that microservice architecture with Cloud Storage, Dataflow, and Cloud SQL/Datastore?
Did you look at Cloud Pub/Sub (the topic subscription/publication service)?
Cloud Pub/Sub brings the scalability, flexibility, and reliability of enterprise message-oriented middleware to the cloud. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for secure and highly available communication between independently written applications.
I believe Pub/Sub can mostly substitute for Kafka or RabbitMQ in your case.
How are front-end clients expected to know when a file has finished processing and is ready? Do they need to poll data from somewhere?
For example, if you are using the Dataflow API to process the file, Cloud Dataflow can publish its progress and send the status to a topic. Your front end (App Engine) just needs to subscribe to that topic and receive updates.
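A minimal sketch of that topic-based status flow with the Google Cloud Pub/Sub Java client (the project, topic, and subscription names are hypothetical): the processing side publishes a status message per file, and the serving side subscribes instead of polling.

```java
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class FileStatusUpdates {
  // Hypothetical names -- substitute your own project, topic, and subscription.
  static final TopicName TOPIC = TopicName.of("my-project", "file-status");
  static final ProjectSubscriptionName SUB =
      ProjectSubscriptionName.of("my-project", "file-status-frontend");

  /** Called by the processing side to report progress for a file. */
  public static void publishStatus(String fileId, String status) throws Exception {
    Publisher publisher = Publisher.newBuilder(TOPIC).build();
    try {
      PubsubMessage msg = PubsubMessage.newBuilder()
          .putAttributes("fileId", fileId)
          .setData(ByteString.copyFromUtf8(status))
          .build();
      publisher.publish(msg).get(); // wait until the publish is acknowledged
    } finally {
      publisher.shutdown();
    }
  }

  /** Called by the serving side to receive status updates instead of polling. */
  public static Subscriber listen() {
    MessageReceiver receiver = (message, consumer) -> {
      System.out.printf("file %s -> %s%n",
          message.getAttributesOrDefault("fileId", "?"), message.getData().toStringUtf8());
      consumer.ack();
    };
    Subscriber subscriber = Subscriber.newBuilder(SUB, receiver).build();
    subscriber.startAsync().awaitRunning();
    return subscriber;
  }
}
```

Depending on the runtime, a push subscription delivering to an HTTP endpoint may fit App Engine better than the pull-style Subscriber shown here.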
1)
Dataflow does not offer inspection of intermediate results. If a frontend wants more detailed progress on an element being processed in a Dataflow pipeline, custom progress reporting will need to be built into the pipeline.
One idea is to write progress updates to a sink table, outputting records to it at various points in the pipeline. I.e. have a BigQuery sink where you write rows like ["element_idX", "PHASE-1 DONE"]. A frontend can then query for those results. (I would avoid overwriting old rows personally, but many approaches can work.)
You can do this by consuming the PCollection in both the new sink and your pipeline's next step.
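A minimal sketch of that branching pattern with the Apache Beam Java SDK; the progress table name and the "PHASE-1 DONE" label are placeholders:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

public class ProgressReporting {

  /**
   * Branches the pipeline: one branch appends a progress row per element to a
   * (pre-created) BigQuery table, while the same PCollection is returned so the
   * pipeline's next step can consume it unchanged.
   */
  public static PCollection<String> reportPhase1Done(PCollection<String> elementIds) {
    elementIds
        .apply("ToProgressRow", MapElements.via(new SimpleFunction<String, TableRow>() {
          @Override public TableRow apply(String elementId) {
            return new TableRow().set("element_id", elementId).set("status", "PHASE-1 DONE");
          }
        }))
        .apply("WriteProgress", BigQueryIO.writeTableRows()
            .to("my-project:pipeline_meta.progress")        // hypothetical project:dataset.table
            .withWriteDisposition(WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(CreateDisposition.CREATE_NEVER));

    return elementIds; // the frontend queries the progress table; this branch feeds the next step
  }
}
```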
2)
Is your microservice architecture using a "pipes and filters" style approach, i.e. each service reads from a source (Kafka/RabbitMQ), writes data out, and the next service consumes it?
Probably the best way to do this is to set up a few different Dataflow pipelines, output their results using a Pub/Sub or Kafka sink, and have the next pipeline consume that Pub/Sub topic. You may also wish to sink the results to another location like BigQuery/GCS, so that you can query them again if you need to.
There is also the option to use Cloud Functions instead of Dataflow, which have Pub/Sub and GCS triggers. A microservice system can be set up with several Cloud Functions.
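As a sketch of one such "filter", here is a background Cloud Function (Java Functions Framework) triggered when an object is finalized in a GCS bucket; the GcsEvent POJO carries the standard bucket/name fields of the storage event, and the hand-off to the next stage (e.g. publishing to Pub/Sub) is left as a stub:

```java
import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;
import java.util.logging.Logger;

/** One stage of a pipes-and-filters setup, triggered by google.storage.object.finalize events. */
public class ParseUploadedFile implements BackgroundFunction<ParseUploadedFile.GcsEvent> {
  private static final Logger log = Logger.getLogger(ParseUploadedFile.class.getName());

  @Override
  public void accept(GcsEvent event, Context context) {
    log.info("Parsing gs://" + event.bucket + "/" + event.name);
    // ... parse the file here, then publish a message for the next stage to consume.
  }

  /** Subset of the GCS event payload; field names match the event's JSON attributes. */
  public static class GcsEvent {
    public String bucket;
    public String name;
  }
}
```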

Automating scale-up of streaming units - Stream Analytics job

We would like to automate scaling up the streaming units for certain Stream Analytics jobs when 'SU utilization' is high. Is it possible to achieve this using PowerShell? Thanks.
Firstly, as Pete M said, you can call the REST API to create or update a transformation within a job.
Besides that, the Azure Stream Analytics cmdlet New-AzureRmStreamAnalyticsTransformation can be used to update a transformation within a job.
It depends on what you mean by "automate". You can update a transformation via the API from a scheduled job, including the streaming unit allocation. I'm not sure if you can do this via the PowerShell object model, but you can always make a REST call:
https://learn.microsoft.com/en-us/rest/api/streamanalytics/stream-analytics-transformation
If you mean you want to use PowerShell to create and configure a job that automatically scales on its own, unfortunately that isn't possible today regardless of how you create the job. ASA doesn't support elastic scaling. You have to do it "manually", either by hand or via some manner of scheduled WebJob or similar.
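For completeness, a sketch of making that REST call from a scheduled job with Java's built-in HTTP client; the resource identifiers are placeholders, the bearer token must come from Azure AD, and the api-version should be taken from the documentation linked above:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ScaleStreamingUnits {

  /** Patches the job's transformation to a new streaming-unit count via the management API. */
  public static void setStreamingUnits(String bearerToken, String subscriptionId,
      String resourceGroup, String jobName, String transformationName,
      int streamingUnits) throws Exception {
    String url = String.format(
        "https://management.azure.com/subscriptions/%s/resourceGroups/%s"
            + "/providers/Microsoft.StreamAnalytics/streamingjobs/%s/transformations/%s"
            + "?api-version=2016-03-01",                  // verify the current api-version
        subscriptionId, resourceGroup, jobName, transformationName);

    String body = "{\"properties\": {\"streamingUnits\": " + streamingUnits + "}}";

    HttpRequest request = HttpRequest.newBuilder(URI.create(url))
        .header("Authorization", "Bearer " + bearerToken)
        .header("Content-Type", "application/json")
        .method("PATCH", HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```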
It is three years later now, but I think you can use App Insights to automatically create an alert rule based on percent utilization. Is it an absolute must that you use PowerShell? If so, there is an Azure Automation script on GitHub:
https://github.com/Azure/azure-stream-analytics/blob/master/Autoscale/StepScaleUp.ps1

How to proceed with query automation using Import.io

I've successfully created a query with the Extractor tool found in Import.io. It does exactly what I want it to do; however, I now need to run this once or twice a day. Is the purpose of Import.io as an API to allow me to build logic such as data storage and scheduled tasks (running queries multiple times a day) in my own application, or are there ways to schedule queries and make use of long-term storage of my results entirely within the Import.io service?
I'm happy to create a Laravel or Rails app to make requests to the API and store the information elsewhere, but if I'm reinventing the wheel by doing so and they provide the means to address this, then that is a true time saver.
Thanks for using the new forum! Yes, we have moved this over to Stack Overflow to maximise the community atmosphere.
At the moment, Import does not have the ability to schedule crawls. However, this is something we are going to roll out in the near future.
For the moment, there is the ability to set a cron job to run at times you specify.
Another solution, if you are using the free version, is to use a CI tool like Travis or Jenkins to schedule your API scripts.
You can query the extractors live, so you don't need to run them manually every time. This will consume one of the requests from your limit.
The endpoint you can use is:
https://extraction.import.io/query/extractor/extractor_id?_apikey=apikey&url=url
Unfortunately the script will not be a very simple one, since most websites present very different response structures to Import.io, and as you may already know, the premium version of the tool now provides scheduling capabilities.
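As a starting point for such a script, a minimal sketch of calling the live-query endpoint above from Java (the extractor ID, API key, and target URL are your own values; the JSON that comes back will vary per site as noted):

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class ImportIoLiveQuery {

  /** Runs an extractor live against a target URL and returns the raw JSON response. */
  public static String runExtractor(String extractorId, String apiKey, String targetUrl)
      throws Exception {
    String endpoint = "https://extraction.import.io/query/extractor/" + extractorId
        + "?_apikey=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8)
        + "&url=" + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);

    HttpResponse<String> response = HttpClient.newHttpClient().send(
        HttpRequest.newBuilder(URI.create(endpoint)).GET().build(),
        HttpResponse.BodyHandlers.ofString());
    return response.body(); // each call counts against your request limit
  }
}
```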

Is there a way to leverage Hadoop tools to manage parallel REST API calls to external sources?

I am writing software that creates a large graph database. The software needs to access dozens of different REST APIs with millions of total requests. The data will then be processed by the Hadoop cluster. Each of these APIs have rate limits that vary by requests/second, per window, per day and per user (typically via OAuth).
Does anyone have any suggestions on how I might use either a Map function or another Hadoop-ecosystem tool to manage these queries? The goal would be to leverage the parallel processing in Hadoop.
Because of the varied rate limits, it often makes sense to switch to a different API query while waiting for the first limit to reset. An example would be one API call that creates nodes in the graph and another that enriches the data for that node. I could have the system go out and enrich the data for the new nodes while waiting for the first API limit to reset.
I have tried using SQS queuing on EC2 to manage the various API limits and states (creating a queue for each API call), but have found it to be ridiculously slow.
Any ideas?
It looks like the best option for my scenario will be using Storm, or specifically the Trident abstraction. It gives me the greatest flexibility for both workload management and process management.
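A minimal sketch of that shape with Trident: a stream of (api, request) tuples fanned out across workers, each call gated by a per-API rate limiter (the spout contents and the limiter are hypothetical stubs):

```java
import org.apache.storm.generated.StormTopology;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class ApiFetchTopology {

  /** Calls one REST API per tuple; the per-API rate limiting is left as a stub. */
  public static class RateLimitedFetch extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
      String api = tuple.getStringByField("api");
      String request = tuple.getStringByField("request");
      // acquire a permit for `api` here (e.g. a per-API token bucket), then issue the HTTP call
      String response = "stub-response for " + api + ": " + request;
      collector.emit(new Values(response));
    }
  }

  public static StormTopology build() {
    // Hypothetical source of (api, request) pairs; in practice this would be a queue-backed spout.
    FixedBatchSpout spout = new FixedBatchSpout(new Fields("api", "request"), 10,
        new Values("graph-api", "createNode:42"),
        new Values("enrich-api", "enrichNode:42"));

    TridentTopology topology = new TridentTopology();
    topology.newStream("api-requests", spout)
        .shuffle()
        .each(new Fields("api", "request"), new RateLimitedFetch(), new Fields("response"))
        .parallelismHint(16); // fan the HTTP calls out across workers
    return topology.build();
  }
}
```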