Automating scale-up of streaming units - Stream Analytics job - azure-stream-analytics

We would like to automate scaling up the streaming units for certain Stream Analytics jobs when 'SU utilization' is high. Is it possible to achieve this using PowerShell? Thanks.

Firstly, as Pete M said, we can call the REST API to create or update a transformation within a job.
In addition, the Azure Stream Analytics cmdlet New-AzureRmStreamAnalyticsTransformation can be used to update a transformation within a job.

It depends on what you mean by "automate". You can update a transformation via the API from a scheduled job, including the streaming unit allocation. I'm not sure if you can do this via the PowerShell object model, but you can always make a REST call:
https://learn.microsoft.com/en-us/rest/api/streamanalytics/stream-analytics-transformation
If you mean you want to use PowerShell to create and configure a job that automatically scales on its own, unfortunately that isn't possible today, regardless of how you create the job. ASA doesn't support elastic scaling. You have to do it "manually", either by hand or via some kind of scheduled WebJob or similar.
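If you go the REST route, a rough Python sketch of such a call might look like the following. The resource names, API version, and the transformation name "Transformation" are placeholders/assumptions, and you would need to obtain an Azure AD bearer token separately (for example via a service principal):

    import requests

    # Placeholders -- substitute your own subscription, resource group, job name,
    # transformation name, API version, and a valid Azure AD bearer token.
    SUBSCRIPTION = "<subscription-id>"
    RESOURCE_GROUP = "<resource-group>"
    JOB_NAME = "<job-name>"
    TRANSFORMATION = "Transformation"   # many jobs use this default name; verify yours
    API_VERSION = "2020-03-01"          # check the REST docs for the current version
    TOKEN = "<bearer-token>"

    url = (
        f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
        f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.StreamAnalytics"
        f"/streamingjobs/{JOB_NAME}/transformations/{TRANSFORMATION}"
    )

    # PATCH only the streaming-unit count; other transformation properties are left alone.
    resp = requests.patch(
        url,
        params={"api-version": API_VERSION},
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"properties": {"streamingUnits": 12}},
    )
    resp.raise_for_status()

Run from an Azure Automation runbook, a WebJob, or any cron-like host, this is the kind of "manual" scaling the answer above refers to.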

It is three years later now, but I think you can use App Insights to automatically create an alert rule based on percent utilization. Is it an absolute MUST that you use PowerShell? If so, there is an Azure Automation script on GitHub:
https://github.com/Azure/azure-stream-analytics/blob/master/Autoscale/StepScaleUp.ps1

Related

Dynamic scheduler on GCP

Does GCP have a job scheduling service like Azure Scheduler, where jobs can be scheduled and managed dynamically via API?
The Google Cron service is configured in a static file, and it seems like their answer to this is to use that to poke a roll-your-own service backed by Pub/Sub and a data store. I'm looking for Quartz-like functionality, consumable by App Engine, which can be managed and invoked via an API, as opposed to managing a cluster, queue, and Compute Engine instance/VM deployment of Quartz (or the like) or rolling a custom solution. It should support 50 million jobs per day with retry/recoverability and per-tenant dynamic scheduling capabilities.
This is the cheapest and easiest way I can imagine building a solution today on top of an existing App Engine-based project:
As you observed, currently there is no such API/service directly available on GCP. There is an open feature request (on GAE) for it.
But, also as you observed, it is possible to build and use a custom solution, just like the one you proposed.
Depending on the context, even simpler solutions are possible. For a GAE context see, for example, "How to schedule repeated jobs or tasks from user parameters in Google App Engine?".
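As a rough illustration of the "cron pokes an endpoint that fans out via Pub/Sub" pattern mentioned in the question, a minimal Python sketch follows. The Flask handler, project/topic names, and payload shape are assumptions for illustration, not part of the original answers:

    # Minimal sketch: an HTTP handler that a cron entry can hit on a schedule,
    # which then publishes per-tenant work items to a Pub/Sub topic.
    # Assumes Flask and google-cloud-pubsub are installed; names are placeholders.
    import json

    from flask import Flask
    from google.cloud import pubsub_v1

    app = Flask(__name__)
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "scheduled-jobs")  # placeholders

    @app.route("/tasks/dispatch")
    def dispatch():
        # In a real system the due jobs would come from a data store, per tenant.
        due_jobs = [{"tenant": "acme", "job_id": 42}]
        for job in due_jobs:
            publisher.publish(topic_path, json.dumps(job).encode("utf-8"))
        return f"dispatched {len(due_jobs)} jobs", 200

Workers subscribed to the topic then handle retries and recoverability, which is where the bulk of the custom work lives.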

Loading data to BigQuery using the Python API vs bq load

I've got a new requirement. A GCS bucket has around 130+ files, and these files need to be loaded into different tables in BigQuery on a daily basis.
After researching, I found two options:
1) Use the "bq load" command to load (shell script / Python script)
2) Use the Python API to load the data into BigQuery
Which option is best? If I go with the Python API, I need to use App Engine to schedule it.
Is there any better option than this?
Thanks,
However you do it, you'll be creating load jobs. So from the BigQuery side of things, it doesn't really matter which option you choose.
As far as scheduling goes, you do have some options on Google Cloud Platform:
- App Engine standard environment cron service. See this example for using it to reliably schedule tasks via Pub/Sub.
- Your operating system's cron or systemd timers on a Compute Engine instance.
- A cron job on a Kubernetes cluster, using Container Engine.
There are a few differences:
a) bq load:
- You can have some issues using special characters as delimiters, like ^ and |.
- You don't need a service account (you can use a user account).
- You can't use it in Google Cloud Functions.
b) API:
- You don't have the special-character trouble.
- You can use it in Google Cloud Functions.
- And if you create a Python script, you can schedule it with Scheduled Tasks (on Windows).
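For reference, a minimal sketch of option b) with the google-cloud-bigquery client; the table ID, GCS URI, and CSV settings below are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()  # uses application default credentials

    # Placeholders -- replace with your own table and GCS object.
    table_id = "my-project.my_dataset.my_table"
    uri = "gs://my-bucket/exports/file_001.csv"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,          # or supply an explicit schema
        field_delimiter="|",      # unusual delimiters are fine through the API
    )

    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # wait for the load job to finish
    print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")

Any of the schedulers listed above (App Engine cron, cron/systemd on a VM, or a Kubernetes cron job) can run a script like this once a day.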

Calling API from PigLatin

I'm a complete newbie to Pig Latin, but I'm looking to pull data from the Met Office DataPoint API, e.g.:
http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/xml/350509?res=3hourly&key=abc123....
...into Hadoop.
My question is "Can this be undertaken using PigLatin (from within Pig View, in Ambari)"?
I've hunted around for how to format a GET request in the code, but without luck.
Am I barking up the wrong tree? Should I be looking to use a different service within the Hadoop framework to accomplish this?
It is a very bad idea to make calls to external services from inside map-reduce jobs. The reason is that when running on the cluster your jobs are very scalable, whereas the external system might not be. Modern resource managers like YARN make this situation even worse: when you swamp the external system with requests, your tasks on the cluster will be mostly sleeping, waiting for a reply from the server. The resource manager will see that the CPU is not being used by the tasks and will schedule more of your tasks to run, which will make even more requests to the external system, swamping it even further. I've seen a modest 100-machine cluster put out 100K requests per second.
What you really want to do is either somehow get the bulk data from the web service, or set up a system with a queue and a small, controlled number of workers that pull from the external system at a set rate.
As for your original question, I don't think Pig Latin provides such a service, but it can easily be done with a UDF in either Python or Java. With Python you can use the excellent requests library, which will make your UDF about 6 lines of code. A Java UDF will be a bit more verbose, but nothing terrible by Java standards.
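To give a sense of scale, a streaming-python UDF along those lines might look like this sketch. It assumes the requests library is installed on every worker node and deliberately ignores the rate-limiting caveats above:

    # fetch_udf.py -- rough sketch of a CPython (streaming_python) Pig UDF that
    # fetches the body of a URL. No retries or throttling; see the caveats above
    # about hammering an external service from a cluster.
    from pig_util import outputSchema
    import requests

    @outputSchema("body:chararray")
    def fetch(url):
        return requests.get(url, timeout=30).text

In the Pig script it would then be registered with something like REGISTER 'fetch_udf.py' USING streaming_python AS metoffice; and called as metoffice.fetch(url).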
"Can this be undertaken using PigLatin (from within Pig View, in
Ambari)"?
No, by default Pig load from HDFS storage, unless you write your own loader.
And i share same point with #Vlad, that this is not a good idea, you have many other other components used for data ingestion, but this not a use case of Pig !

How to use Apache NiFi to query a REST API?

For a project I need to develop an ETL (extract, transform, load) process that reads data from a (legacy) tool that exposes its data via a REST API. This data needs to be stored in Amazon S3.
I would really like to try this with Apache NiFi, but I honestly have no clue yet how I can connect to the REST API, and where/how I can implement some business logic to 'talk the right protocol' with the source system. For example, I'd like to keep track of what data has been written so far, so it can resume loading where it left off.
So far I have been reading the NiFi documentation and I'm getting better insight into what the tool provides/entails. However, it's not clear to me how I could implement the task within the NiFi architecture.
I hope someone can give me some guidance.
Thanks,
Paul
The InvokeHTTP processor can be used to query a REST API.
Here is a simple flow that:
- queries the REST API at https://api.exchangeratesapi.io/latest every 10 minutes
- sets the output file name (exchangerates_<ID>.json)
- stores the query response in the output file on the local filesystem (under /tmp/data-out)
I exported the flow as a NiFi template and stored it in a gist. The template can be imported into a NiFi instance and run as is.
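If it helps to see the same steps outside NiFi, the flow above boils down to roughly this Python sketch. The URL, filename pattern, and output directory are taken from the flow description; the 10-minute schedule is left to whatever runs the script:

    # Rough equivalent of the NiFi flow described above: fetch the rates,
    # build a unique filename, and write the JSON under /tmp/data-out.
    import os
    import uuid

    import requests

    url = "https://api.exchangeratesapi.io/latest"
    out_dir = "/tmp/data-out"
    os.makedirs(out_dir, exist_ok=True)

    response = requests.get(url, timeout=30)
    response.raise_for_status()

    filename = f"exchangerates_{uuid.uuid4()}.json"  # stands in for NiFi's flowfile ID
    with open(os.path.join(out_dir, filename), "w") as f:
        f.write(response.text)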

How to proceed with query automation using Import.io

I've successfully created a query with the Extractor tool found in Import.io. It does exactly what I want it to do; however, I now need to run this once or twice a day. Is the purpose of the Import.io API to allow me to build logic such as data storage and scheduled tasks (running queries multiple times a day) in my own application, or are there ways to schedule queries and make use of long-term storage of my results completely within the Import.io service?
I'm happy to create a Laravel or Rails app to make requests to the API and store the information elsewhere, but if I'm reinventing the wheel by doing so and they provide the means to address this, then that is a true time saver.
Thanks for using the new forum! Yes, we have moved this over to Stack Overflow to maximise the community atmosphere.
At the moment, Import does not have the ability to schedule crawls. However, this is something we are going to roll out in the near future.
For the moment, there is the ability to set up a cron job to run when you specify.
Another solution, if you are using the free version, is to use a CI tool like Travis or Jenkins to schedule your API scripts.
You can query the extractors live, so you don't need to run them manually every time. This will consume one of the requests from your limit.
The endpoint you can use is:
https://extraction.import.io/query/extractor/extractor_id?_apikey=apikey&url=url
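Called from Python, that live-query endpoint would look roughly like this; the extractor ID, API key, and target URL are placeholders:

    import requests

    # Placeholders -- substitute your own extractor ID, API key, and target URL.
    EXTRACTOR_ID = "<extractor-id>"
    API_KEY = "<api-key>"
    TARGET_URL = "https://example.com/page-to-extract"

    resp = requests.get(
        f"https://extraction.import.io/query/extractor/{EXTRACTOR_ID}",
        params={"_apikey": API_KEY, "url": TARGET_URL},
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json())  # extraction results as JSON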
Unfortunately, the script will not be a very simple one, since most websites have very different response structures in import.io, and, as you may already know, the premium version of the tool now provides scheduling capabilities.