Is there any way I can get some information about an orchestration job in Matillion, for example when it was first created or who created it, using the API?
The only thing I can find is data about the job itself, like components and variables.
Goal:
I have two SQL Server databases (DB-A and DB-B) located on two different servers in the same network.
DB-A has a table T1, and I want to copy data from DB-A's table T1 (source) to DB-B's table T2 (destination). This DB sync should take place any time a record in T1 is added, updated, or deleted.
Please note: all DB-to-DB data sync options are out of consideration; I must use a MuleSoft API for this job.
Background:
I am new to MuleSoft and its products. I am told the MuleSoft platform can help with building and managing APIs.
I explored the web for MuleSoft offerings; there are many articles (linked below) suggesting that MuleSoft itself can read from one DB table and write to another DB table (using DB connectors etc.).
Questions:
Is it possible for MuleSoft itself to get this data sync job done without us writing our own API invoker or API consumer (to trigger the API from one end, or to receive data from the API on the other end and write to the DB table)?
What are the key steps to get this data transfer working? Any reference that shows a step-by-step journey to achieve this goal would be a huge help.
Links:
https://help.mulesoft.com/s/question/0D52T00004mXXGDSA4/copy-data-from-one-oracle-table-to-another-oracle-table
https://help.mulesoft.com/s/question/0D52T00004mXStnSAG/select-insert-data-from-one-database-to-another
https://help.mulesoft.com/s/question/0D72T000003rpJJSAY/detail
First let's clarify the terminology, since the question mixes several concepts in a confusing way. MuleSoft is a company with several products that may apply here. A "MuleSoft API" should really mean an API created by MuleSoft; since you are clearly talking about APIs created by you or your organization, that would be an incorrect description. What you are really talking about are Mule applications, which are applications deployed and executed in a Mule runtime. Mule applications may implement your APIs, or they may implement integrations. After all, Mule was originally an ESB product used to integrate other systems, before REST APIs were a thing. You may deploy Mule applications to Anypoint Platform, specifically to the CloudHub component of the platform, or to an on-prem instance of the Mule runtime.
In any case, a Mule application is perfectly capable of implementing APIs, integrations, or both. There is no need for it to implement an API or call another API if that is not what you want. You do need to trigger the flow somehow: by reading directly from the database to find new rows, with a scheduler that executes a query at a given time, with an HTTP request, or even by having an API listen for requests that trigger the flow.
As an example, the application can use the <db:listener> source of the Database connector to start the flow by fetching rows. You need to take care of the watermark column configuration so that only new rows are detected. See the documentation at https://docs.mulesoft.com/db-connector/1.13/database-documentation#listener for details.
Alternatively you can trigger the flow in another way and just use a select operation.
After that, use DataWeave to transform the records as needed, then use insert or update operations to write them to the destination.
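To make the shape of that flow concrete, here is a rough conceptual sketch in plain Python (this is not Mule code; the DSNs, table names, and columns are hypothetical) of the watermark-based select, transform, and upsert logic that the Database connector and DataWeave would handle inside the Mule application:

```python
import pyodbc
from pathlib import Path

WATERMARK_FILE = Path("t1_watermark.txt")  # hypothetical place to persist the watermark
last_watermark = WATERMARK_FILE.read_text().strip() if WATERMARK_FILE.exists() else "1900-01-01"

src = pyodbc.connect("DSN=DB-A")  # hypothetical ODBC DSNs for the two servers
dst = pyodbc.connect("DSN=DB-B")

# Select only rows changed since the last run (the watermark idea behind <db:listener>).
rows = src.execute(
    "SELECT id, name, updated_at FROM T1 WHERE updated_at > ?", last_watermark
).fetchall()

for r in rows:
    # 'Transform' step: reshape/rename fields here, as DataWeave would in Mule.
    dst.execute(
        "MERGE T2 AS t USING (SELECT ? AS id, ? AS name) AS s ON t.id = s.id "
        "WHEN MATCHED THEN UPDATE SET t.name = s.name "
        "WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name);",
        r.id, r.name,
    )
dst.commit()

# Persist the new watermark for the next run. Note that deletes would need
# additional handling (e.g. soft-delete flags or change tracking on T1).
if rows:
    WATERMARK_FILE.write_text(str(max(r.updated_at for r in rows)))
```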
There are examples in the documentation that can help you get started. If you are not familiar with Mule, you should start by reading the documentation and doing some training until you understand the concepts.
Building in the Google Cloud ecosystem is really powerful. I really like how you can ingest files into Cloud Storage, then Dataflow enriches, transforms, and aggregates the data, and the results are finally stored in BigQuery or Cloud SQL.
I have a couple of questions to help me have a better understanding.
If you were to build a big data product using Google Cloud services:
When a front-end web application (perhaps built in React) submits a file to Cloud Storage, it may take some time before it is completely processed. The client might want to view the status of the file in the pipeline, and then do something with the result on completion. How are front-end clients expected to know when a file has finished processing and is ready? Do they need to poll data from somewhere?
Suppose you currently have a microservice architecture in which each service does a different kind of processing. For example, one might parse a file and another might process messages. The services communicate using Kafka or RabbitMQ and store data in Postgres or S3.
If you adopt the Google services ecosystem, could you replace that microservice architecture with Cloud Storage, Dataflow, and Cloud SQL/Store?
Did you look at Cloud Pub/Sub (a topic publish/subscribe service)?
Cloud Pub/Sub brings the scalability, flexibility, and reliability of enterprise message-oriented middleware to the cloud. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for secure and highly available communication between independently written applications.
I believe Pub/Sub can mostly substitute for Kafka or RabbitMQ in your case.
How are front-end clients expected to know when a file has finished processing and is ready? Do they need to poll data from somewhere?
For example, if you are using the Dataflow API to process the file, Cloud Dataflow can publish the progress and send the status to a topic. Your front end (App Engine) just needs to subscribe to that topic and receive updates.
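A minimal sketch of that pattern with the google-cloud-pubsub client library (project, topic, and subscription names are placeholders, not anything from your setup): the processing side publishes status messages, and a backend serving the frontend subscribes to them instead of polling.

```python
from google.cloud import pubsub_v1

PROJECT = "my-project"  # placeholder project ID

# Processing side: publish a status message when a file reaches a milestone.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, "file-status")
publisher.publish(topic_path, b"", file_name="upload-123.csv", status="COMPLETED")

# Frontend-facing backend: subscribe and push updates to the UI (e.g. via websockets).
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT, "file-status-frontend")

def on_message(message):
    # Relay the status to the client however your app does it; here we just print it.
    print(message.attributes["file_name"], message.attributes["status"])
    message.ack()

# Starts a background streaming pull; keep the process alive to keep receiving updates.
streaming_pull_future = subscriber.subscribe(subscription_path, callback=on_message)
```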
1)
Dataflow does not offer inspection of intermediate results. If a frontend wants more progress about an element being processed in a Dataflow pipeline, custom progress reporting needs to be built into the pipeline.
One idea is to write progress updates to a sink table, outputting records to it at various stages of the pipeline. I.e. have a BigQuery sink where you write rows like ["element_idX", "PHASE-1 DONE"]. Then a frontend can query for those results. (I would personally avoid overwriting old rows, but many approaches can work.)
You can do this by consuming the PCollection in both the new sink and your pipeline's next step, as sketched below.
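A short Apache Beam (Python SDK) sketch of that branching, assuming a hypothetical project/dataset/table and trivial stand-in transforms: the same PCollection feeds both a progress-reporting BigQuery write and the next processing step.

```python
import apache_beam as beam

# Hypothetical progress table for illustration only.
PROGRESS_TABLE = "my-project:monitoring.pipeline_progress"

def phase1(element):
    # Stand-in for your real phase-1 processing.
    return dict(element, phase1_done=True)

def phase2(element):
    # Stand-in for your real phase-2 processing.
    return dict(element, phase2_done=True)

with beam.Pipeline() as p:
    after_phase1 = (
        p
        | "Create" >> beam.Create([{"element_id": "element_idX"}])
        | "Phase1" >> beam.Map(phase1)
    )

    # Branch 1: append one progress row per element so a frontend can query status.
    _ = (
        after_phase1
        | "ToProgressRow" >> beam.Map(
            lambda e: {"element_id": e["element_id"], "status": "PHASE-1 DONE"})
        | "WriteProgress" >> beam.io.WriteToBigQuery(
            PROGRESS_TABLE,
            schema="element_id:STRING,status:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )

    # Branch 2: the same PCollection continues into the next processing step.
    _ = after_phase1 | "Phase2" >> beam.Map(phase2)
```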
2)
Is your microservice architecture using a "pipes and filters" pipeline-style approach? I.e. each service reads from a source (Kafka/RabbitMQ), writes data out, and then the next one consumes it?
Probably the best way to do this is to set up a few different Dataflow pipelines, output their results to a Pub/Sub or Kafka sink, and have the next pipeline consume that Pub/Sub topic. You may also wish to sink the results to another location like BigQuery/GCS, so that you can query them again if you need to.
There is also the option to use Cloud Functions instead of Dataflow, which have Pub/Sub and GCS triggers. A microservice system can be set up with several Cloud Functions.
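As a rough sketch of that option, here is what a GCS-triggered Cloud Function might look like in Python (a 1st-gen background function deployed with a google.storage.object.finalize trigger; the project and topic names are assumptions), handing off to the next stage via Pub/Sub:

```python
from google.cloud import pubsub_v1

# Hypothetical project and topic used to chain to the next service.
publisher = pubsub_v1.PublisherClient()
TOPIC = publisher.topic_path("my-project", "file-parsed")

def handle_gcs_event(event, context):
    """Background function triggered when a new object lands in the bucket."""
    bucket = event["bucket"]
    name = event["name"]

    # ... parse / process the uploaded file here ...

    # Notify the next stage (another function, a Dataflow pipeline, or any Pub/Sub consumer).
    publisher.publish(TOPIC, data=name.encode("utf-8"), bucket=bucket)
```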
Trying to get a handle on what I would use to schedule and run jobs that move data into S3, run scripts on it, and move it around S3 afterward.
My requirement is to be able to ingest from APIs and also directly from databases. Some formats to ingest will be XML, and others could be flat files. The raw files need to be joined, transformed, and turned into a format that graphs can be produced from.
What is AWS Glue like as an ETL tool? My specific question is: can you see the finished pipelines, showing the data sources and processing parts, in a graphical view once they are created?
I have used Azure Data Factory, which had a graphical UI to view and monitor the pipelines that I found quite useful. Just wondering if AWS Glue has a similar thing.
If not, would NiFi on AWS S3 be a good way to do this?
Thanks
If you are looking for the best GUI, I would recommend NiFi. It is commonly used with S3 and has many connectors out of the box for other data sources. It becomes even more interesting if you want to do things outside of the AWS cloud.
That being said, I would think that Glue will also get the job done.
Running Data Factory when you have a heavy AWS footprint feels like an anti-pattern.
Full disclosure: I have not worked with Glue/Data Factory, and I work for Cloudera, the driving force behind NiFi.
I'm currently using AWS Glue to extract data from a DB into S3, manipulate the data, and save it back to Redshift/S3 or send it via API to my client. The AWS Glue GUI is not that good: you won't see a diagram of your flow, and sometimes you will need other tools like Step Functions or Airflow to orchestrate your jobs. Also, for most of my jobs I have to use PySpark because the built-in AWS Glue methods are too limited.
As for monitoring, you can see whether there was an error, how much CPU and memory your job consumed, and the S3 bytes read/written. If you want additional information you need to use a logger or print statements to send it to the logs.
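To give an idea of what "dropping down to PySpark" looks like, here is a minimal Glue job script sketch (the catalog database, table, and S3 path are hypothetical and would come from your own crawlers/buckets):

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog (hypothetical database/table, e.g. populated by a crawler).
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="source_table")

# Drop down to plain PySpark when Glue's built-in transforms are too limited.
df = source.toDF().filter("amount > 0")

# Write the result to S3 (hypothetical bucket/prefix) as Parquet.
df.write.mode("overwrite").parquet("s3://my-bucket/processed/")

job.commit()
```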
We would like to automate scaling up the streaming units for certain Stream Analytics jobs when 'SU utilization' is high. Is it possible to achieve this using PowerShell? Thanks.
Firstly, as Pete M said, we can call the REST API to create or update a transformation within a job.
Besides that, the Azure Stream Analytics cmdlet New-AzureRmStreamAnalyticsTransformation can be used to update a transformation within a job.
It depends on what you mean by "automate". You can update a transformation via the API from a scheduled job, including the streaming unit allocation. I'm not sure if you can do this via the PowerShell object model, but you can always make a REST call:
https://learn.microsoft.com/en-us/rest/api/streamanalytics/stream-analytics-transformation
If you mean you want to use PowerShell to create and configure a job that automatically scales on its own, unfortunately that isn't possible today regardless of how you create the job. ASA doesn't support elastic scaling. You have to do it "manually", either by hand or with some kind of scheduled WebJob or similar.
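As a rough sketch of the scheduled-job approach (in Python with the requests library rather than PowerShell): the subscription, resource group, job and transformation names, the api-version, and the token acquisition below are all placeholders/assumptions, so check the REST reference linked above for the exact contract before relying on it.

```python
import requests

SUBSCRIPTION = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
JOB_NAME = "<stream-analytics-job>"
TRANSFORMATION = "Transformation"       # assumption: the default transformation name
API_VERSION = "2020-03-01"              # assumption: use the api-version from the REST docs
TOKEN = "<bearer-token-from-azure-ad>"  # e.g. obtained via the azure-identity library

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.StreamAnalytics"
    f"/streamingjobs/{JOB_NAME}/transformations/{TRANSFORMATION}"
    f"?api-version={API_VERSION}"
)

# Patch only the streaming unit count; the transformation query itself is left untouched.
resp = requests.patch(
    url,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"properties": {"streamingUnits": 12}},
)
resp.raise_for_status()
```

Run something like this from a scheduled WebJob, Azure Automation runbook, or any cron-style scheduler when your utilization metric crosses a threshold.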
It is three years later now, but I think you can use App Insights to automatically create an alert rule based on percent utilization. Is it an absolute MUST that you use powershell? If so, there is an Azure Automation Script on Github:
https://github.com/Azure/azure-stream-analytics/blob/master/Autoscale/StepScaleUp.ps1
For a project I need to develop an ETL (extract, transform, load) process that reads data from a (legacy) tool that exposes its data on a REST API. This data needs to be stored in Amazon S3.
I would really like to try this with Apache NiFi, but I honestly have no clue yet how I can connect to the REST API, or where/how I can implement some business logic to 'talk the right protocol' with the source system. For example, I would like to keep track of what data has been written so far, so it can resume loading where it left off.
So far I have been reading the NiFi documentation and I'm getting better insight into what the tool provides/entails. However, it's not clear to me how I could implement this task within the NiFi architecture.
Hopefully someone can give me some guidance?
Thanks,
Paul
The InvokeHTTP processor can be used to query a REST API.
Here is a simple flow that
Queries the REST API at https://api.exchangeratesapi.io/latest every 10 minutes
Sets the output-file name (exchangerates_<ID>.json)
Stores the query response in the output file on the local filesystem (under /tmp/data-out)
I exported the flow as a NiFi template and stored it in a gist. The template can be imported into a NiFi instance and run as is.
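If it helps to see the logic outside NiFi first, here is a plain-Python sketch of what the template does (poll the REST endpoint on a schedule and store each response as a uniquely named JSON file); the URL, output directory, and file-name pattern mirror the example flow, and this is only an illustration of the logic, not a replacement for the NiFi processors.

```python
import json
import time
import uuid
from pathlib import Path
from urllib.request import urlopen

URL = "https://api.exchangeratesapi.io/latest"
OUT_DIR = Path("/tmp/data-out")
INTERVAL_SECONDS = 600  # the flow runs every 10 minutes

OUT_DIR.mkdir(parents=True, exist_ok=True)

while True:
    # Corresponds to InvokeHTTP: query the REST API.
    with urlopen(URL) as resp:
        payload = json.load(resp)
    # Corresponds to setting the output file name and storing the response locally.
    out_file = OUT_DIR / f"exchangerates_{uuid.uuid4()}.json"
    out_file.write_text(json.dumps(payload))
    time.sleep(INTERVAL_SECONDS)
```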