I am trying to set up a Stream Analytics job that accepts input from an Event Hub, processes the input via an ML model, and sends the output to, for example, a Power BI dashboard.
I deployed an ONNX model on an ACI (Azure Container Instances) instance following the documentation here: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-and-where . This seems to work: I get the automatically generated swagger definition and can call the service via REST.
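For reference, the kind of REST call I'm making looks roughly like this (the scoring URI and payload shape below are placeholders, not my actual deployment):

    import json
    import requests

    # Scoring URI of the ACI web service (placeholder; the real value comes
    # from the deployed service's details page / swagger definition).
    scoring_uri = "http://<aci-service>.<region>.azurecontainer.io/score"

    # Payload shape depends on the ONNX model's expected input schema
    # (this 4-feature vector is just an illustration).
    sample = {"data": [[0.1, 0.2, 0.3, 0.4]]}

    response = requests.post(
        scoring_uri,
        data=json.dumps(sample),
        headers={"Content-Type": "application/json"},
    )
    print(response.status_code, response.json())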
How can I connect to my ML deployment from within the Stream Analytics query? There is the "Functions" setting under "Job Topology" on the "Stream Analytics job" page, but I cannot figure out how to add it there. This ( https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-machine-learning-integration-tutorial ) suggests that it is possible, although it uses Azure Machine Learning Studio (as opposed to Azure Machine Learning without "Studio"; I'm quite new to Azure and don't know whether this distinction matters, but I find it a bit confusing).
There is an ongoing limited preview that you can sign up for to get access to this functionality. You will then be able to use the ONNX model you have deployed on ACI in your Stream Analytics job. We expect to roll out this functionality more broadly in the coming weeks :)
I'd like to be able to check the encoding of an input file in the flow of my pipeline. Any idea how to do that with one of the activities provided by Azure Data Factory?
Thanks for the tips
It's actually not supported out of the box by any of the activities at this time, but you can do it using other services that have ADF connectors, such as Azure Functions. You will need to write the encoding-detection algorithm yourself and host it in an Azure Function (of course, other services like Azure Batch, Notebooks, etc. could be used instead).
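As a rough illustration of what that function could look like, here is an HTTP-triggered Azure Function in Python that guesses the encoding of the bytes posted to it using the chardet library (the use of chardet and the function shape are just one possible approach):

    import logging

    import azure.functions as func
    import chardet  # third-party library that guesses encodings from byte patterns


    def main(req: func.HttpRequest) -> func.HttpResponse:
        # Expect the raw file bytes (or just its first few KB) in the request body.
        body = req.get_body()
        if not body:
            return func.HttpResponse("No content provided", status_code=400)

        # The first 64 KB is usually enough for a reliable guess.
        result = chardet.detect(body[:65536])
        logging.info("Detected encoding: %s (confidence %.2f)",
                     result["encoding"], result["confidence"])

        return func.HttpResponse(result["encoding"] or "unknown", status_code=200)

ADF could then pass the file (or just its first bytes) to this function via the Azure Function or Web activity and branch on the returned encoding.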
That said, it would be really useful to add this information to the Get Metadata activity (I just posted the idea to https://feedback.azure.com/forums/270578-data-factory/suggestions/37452187-add-encoding-into-the-get-a-file-s-metadata-activi)
Building in the Google Cloud ecosystem is really powerful. I really like how you can ingest files into Cloud Storage, have Dataflow enrich, transform, and aggregate the data, and finally store it in BigQuery or Cloud SQL.
I have a couple of questions to help me have a better understanding.
Suppose you are building a big data product using Google's services.
When a front-end web application (perhaps built in React) submits a file to Cloud Storage, it may take some time before it is completely processed. The client might want to view the status of the file in the pipeline, and then do something with the result on completion. How are front-end clients expected to know when a file has been completely processed and is ready? Do they need to poll data from somewhere?
Suppose you currently have a microservice architecture in which each service does a different kind of processing: for example, one might parse a file and another might process messages. The services communicate using Kafka or RabbitMQ and store data in Postgres or S3.
If you adopt the Google Cloud ecosystem, could you replace that microservice architecture with Cloud Storage, Dataflow, and Cloud SQL/Datastore?
Did you look at Cloud Pub/Sub (Google's topic publication/subscription service)?
Cloud Pub/Sub brings the scalability, flexibility, and reliability of enterprise message-oriented middleware to the cloud. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for secure and highly available communication between independently written applications.
I believe Pub/Sub can mostly substitute for Kafka or RabbitMQ in your case.
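For illustration, publishing a message with the Python Pub/Sub client looks roughly like this (the project and topic names are placeholders):

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    # Placeholder project and topic names; replace with your own.
    topic_path = publisher.topic_path("my-project", "file-events")

    # The payload must be bytes; extra keyword arguments become message attributes.
    future = publisher.publish(
        topic_path,
        b"gs://my-bucket/input.csv uploaded",
        source="frontend",
    )
    print("Published message id:", future.result())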
How are front-end clients expected to know when a file has been completely processed and is ready? Do they need to poll data from somewhere?
For example, if you are using Dataflow to process the file, your pipeline can publish progress and status messages to a Pub/Sub topic. Your front end (e.g. on App Engine) just needs to subscribe to that topic to receive the updates.
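A minimal subscriber sketch with the Python client (again, the names are placeholders):

    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    # Placeholder project and subscription names; replace with your own.
    subscription_path = subscriber.subscription_path("my-project", "file-status-sub")

    def callback(message):
        # e.g. b'{"file": "input.csv", "status": "PHASE-1 DONE"}'
        print("Status update:", message.data.decode("utf-8"))
        message.ack()

    streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
    try:
        # Block and process status updates; a web front end would instead push
        # these to the browser (e.g. via websockets) or store them for polling.
        streaming_pull.result(timeout=60)
    except Exception:
        streaming_pull.cancel()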
1)
Dataflow does not offer inspection of intermediate results. If a frontend wants progress information about an element being processed in a Dataflow pipeline, custom progress reporting needs to be built into the pipeline.
One idea is to write progress updates to a sink table, outputting records to it at various points in the pipeline. I.e. have a BigQuery sink where you write rows like ["element_idX", "PHASE-1 DONE"]. A frontend can then query for those results. (Personally I would avoid overwriting old rows, but many approaches can work.)
You can do this by consuming the PCollection in both the new sink and your pipeline's next step, as in the sketch below.
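A rough Apache Beam (Python) sketch of that fan-out pattern; the table name, schema, and trivial transforms here are illustrative only:

    import apache_beam as beam

    def process_element(line):
        # Placeholder for the real transformation of one element.
        return line

    with beam.Pipeline() as p:
        elements = p | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.csv")

        phase1 = elements | "Phase1" >> beam.Map(process_element)

        # Fan out: the same PCollection feeds both the progress sink
        # and the next processing step.
        (phase1
         | "ToProgressRow" >> beam.Map(
               lambda e: {"element_id": e[:20], "status": "PHASE-1 DONE"})
         | "WriteProgress" >> beam.io.WriteToBigQuery(
               "my-project:my_dataset.pipeline_progress",
               schema="element_id:STRING,status:STRING"))

        phase2 = phase1 | "Phase2" >> beam.Map(process_element)
        # ... write phase2 / final results to their own sink ...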
2)
Is your microservice architecture using a "pipes and filters" style approach, i.e. each service reads from a source (Kafka/RabbitMQ), writes data out, and the next one consumes it?
Probably the best approach is to set up a few different Dataflow pipelines, output their results to a Pub/Sub or Kafka sink, and have the next pipeline consume that sink. You may also wish to write the results to another location like BigQuery/GCS, so that you can query them again if you need to.
There is also the option to use Cloud Functions instead of Dataflow, which have Pub/Sub and GCS triggers. A microservice system can be set up with several Cloud Functions.
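For illustration, a background Cloud Function (Python) triggered by a GCS upload could hand work off to the next stage via Pub/Sub; the topic name and the parsing step below are placeholders:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    # Placeholder topic for the next stage of the pipeline.
    next_stage_topic = publisher.topic_path("my-project", "parsed-files")

    def on_file_uploaded(event, context):
        """Background Cloud Function triggered when an object is finalized in GCS."""
        bucket = event["bucket"]
        name = event["name"]
        print(f"Processing gs://{bucket}/{name}")

        # ... parse / validate the file here ...

        # Hand the result off to the next service via Pub/Sub.
        publisher.publish(next_stage_topic, name.encode("utf-8"), bucket=bucket)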
I need to setup a scheduled daily job that pulls data using a REST API call and then inserts that data into BigQuery.
I traditionally have done these types of tasks using Node.js running on Heroku. My current boss wants me to achieve this using the Google Cloud Platform.
What are some ways to achieve this using Google Cloud Platform?
A few options on GCP:
Spin up a GCE instance and use cron (a little old school, but it will work).
Use Google App Engine and schedule your job(s) that way.
Unfortunately, Google Cloud Functions don't yet support schedulers. Otherwise, that would be perfect.
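Whichever option ends up hosting the schedule, the job body itself could be a short Python script along these lines (the API URL and table name are placeholders, and it assumes the API returns a JSON array of flat records matching the table schema):

    import requests
    from google.cloud import bigquery

    # Placeholder endpoint and table; replace with your own.
    API_URL = "https://api.example.com/daily-report"
    TABLE_ID = "my-project.my_dataset.daily_report"

    def run_daily_job():
        # Pull the data from the REST API.
        rows = requests.get(API_URL, timeout=60).json()

        # Stream the records into BigQuery.
        client = bigquery.Client()
        errors = client.insert_rows_json(TABLE_ID, rows)
        if errors:
            raise RuntimeError(f"BigQuery insert failed: {errors}")

    if __name__ == "__main__":
        run_daily_job()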
Does GCP have a job scheduling service like Azure Scheduler, where jobs can be scheduled and managed dynamically via API?
The Google cron service is configured in a static file, and it seems like their answer to this is to use it to poke a roll-your-own service backed by Pub/Sub and a data store. I'm looking for Quartz-like functionality, consumable from App Engine, that can be managed and invoked via API, as opposed to managing a cluster, queue, and compute instance/VM deployment of Quartz (or the like) or rolling a custom solution. It should support 50 million jobs per day with retry/recoverability and dynamic per-tenant scheduling.
This is the cheapest and easiest way I can imagine building a solution today on top of an existing App Engine-based project:
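For example (a rough Python sketch of the cron + Pub/Sub + data store idea; the "Job" Datastore kind, the "jobs-to-run" topic, and the field names are all illustrative, not an established API):

    import datetime

    from google.cloud import datastore, pubsub_v1

    ds = datastore.Client()
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "jobs-to-run")

    def dispatch_due_jobs():
        """Handler poked by the static cron entry, e.g. every minute."""
        now = datetime.datetime.utcnow()

        # Fetch jobs whose next run time has passed.
        query = ds.query(kind="Job")
        query.add_filter("next_run", "<=", now)
        for job in query.fetch():
            # Publish the job to Pub/Sub; a worker pool consumes and executes it,
            # which gives retry/recoverability via acknowledgements.
            publisher.publish(topic_path, str(job.key.id).encode("utf-8"))

            # Reschedule according to the job's own interval (dynamic per tenant).
            job["next_run"] = now + datetime.timedelta(seconds=job["interval_seconds"])
            ds.put(job)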
As you observed, currently there is no such API/service directly available on GCP. There is an open feature request (on GAE) for it.
But, also as you observed, it is possible to build and use a custom solution, just like the one you proposed.
Depending on the context, even simpler solutions are possible. For a GAE context, check out, for example, "How to schedule repeated jobs or tasks from user parameters in Google App Engine?".
Is there a recommended way in the Azure ecosystem to join the JSON messages sent by two or more separate devices at approximately the same time, in order to run them through, for example, an Azure ML web service?
The goal is to run real-time analysis on data coming from multiple devices.
Thank you
Edit:
Perhaps I should have phrased my question better. I am currently using Azure Stream Analytics to pass the data sent from a device to Azure ML, which works fine (following learn.microsoft.com/en-us/azure/iot-hub/…). Now I want to do the same thing, but with multiple devices that each send part of the information that Azure ML needs.
I think what you are looking for is Azure Stream Analytics, which allows you to join inputs over windows of time.
This article shows how to integrate ASA with Machine Learning.
And you can easily set the input of an ASA job to an IoT Hub.