Our application allows the users to dynamically assemble/construct Tasks and steps ( Remote calls to various services). Once the user has defined/constructed a Task with various Steps, the system will allow the Task to be run manually or scheduled.
To implement above functionality, what would it take to use spring batch for programmatically create a given Task and its constituent Steps ? I'm assuming that such tasks can be scheduled with the help of Quartz, etc.
I understand that spring XD and spring cloud watch incorporate similar features -- some pointers to the relevant XD codebase/examples(or any other project codebase) which would help in my task would help.
What considerations should I keep in mind ? Any gotchas ?
Currently not using any cloud platform but the next version will be deployed to a cloud platform.
Thanks very much.
Related
My organization moves data for customers between systems, these integrations are in BizTalk and are done by file, sometimes to/from APIs. More and more customers are switching to APIs so we are facing more and more API to API integrations.
I'm mostly a backend developer but have been tasked with finding out how we can find a more generic pattern or system to make these integrations, we are talking close to a thousand of integrations.
But not thousands of different APIs, many customers use the same sort of systems.
What I want is a solution that:
Fetches data from the source api
Transforms the data to the format for the target api
Sends the data to the target api
Another requirement is that it should be possible to set a schedule when these jobs should run.
This is easily done in BizTalk but as mentioned there will be thousands of integrations and if we need to change something in one of the steps it will be a lot of work.
My vision is something that holds interfaces to all APIs that we communicate with and also contains the scheduled jobs we want to be run between them. Preferrably with logging/tracking.
There must be something out there that does this?
Suggestions?
NOTE: No cloud-based solutions since they are not allowed in our organization.
You can easily implement this using temporal.io open source project. You can code your integrations using a general-purpose programming language. Temporal ensures that the integration runs to completion in the presence of all sorts of intermittent failures. Scheduling is also supported out of the box.
Disclaimer: I'm a founder of the Temporal project.
We are developing a Web API using .Net Core. To perform background tasks we have used Hosted Services.
System has been hosted in AWS Beantalk Environment with the Load Balancer. So based on the load Beanstalk creates/remove new instances of the system.
Our problem is,
Since background services also runs inside the API, When load balancer increases the instances, number of background services also get increased and there is a possibility to execute same task multiple times. Ideally there should be only one instance of background services.
One way to tackle this is to stop executing background services when in a load balanced environment and have a dedicated non-load balanced single instance environment for background services only.
That is a bit ugly solution. So,
1) Is there a better solution for this?
2) Is there a way to identify the primary instance while in a load balanced environment? If so I can conditionally register Hosted services.
Any help is really appreciated.
Thanks
I am facing the same scenario and thinking of a way to implement a custom service architecture that can run normally on all of the instance but to take advantage of pub/sub broker and distributed memory service so those small services will contact each other and coordinate what's to be done. It's complicated to develop yes but a very robust solution IMO.
You'll "have to" use a distributed "lock" system. You'll have to use, for example, a distributed memory cache who put a lock when someone (a node of your cluster) is working on background. If another node is trying to do the same job, he'll be locked by the first lock if the work isn't done yet.
What i mean, if all your nodes doesn't have a "sync handler" you can't handle this kind of situation. It could be SQL app lock, distributed memory cache or other things ..
There is something called Mutex but even that won't control this in multi-instance environment. However, there are ways to control it to some level (may be even 100%). One way would be to keep a tracker in the database. e.g. if the job has to run daily, before starting your job in the background service you might wanna query the database if there is any entry for today, if not then you will insert an entry and start your job.
I'm trying to give useful information but I am far from being a data engineer.
I am currently using the python library pandas to execute a long series of transformation to my data which has a lot of inputs (currently CSV and excel files). The outputs are several excel files. I would like to be able to execute scheduled monitored batch jobs with parallel computation (I mean not as sequential as what I'm doing with pandas), once a month.
I don't really know Beam or Airflow, I quickly read through the docs and it seems that both can achieve that. Which one should I use ?
The other answers are quite technical and hard to understand. I was in your position before so I'll explain in simple terms.
Airflow can do anything. It has BashOperator and PythonOperator which means it can run any bash script or any Python script.
It is a way to organize (setup complicated data pipeline DAGs), schedule, monitor, trigger re-runs of data pipelines, in a easy-to-view and use UI.
Also, it is easy to setup and everything is in familiar Python code.
Doing pipelines in an organized manner (i.e using Airflow) means you don't waste time debugging a mess of data processing (cron) scripts all over the place.
Nowadays (roughly year 2020 onwards), we call it an orchestration tool.
Apache Beam is a wrapper for the many data processing frameworks (Spark, Flink etc.) out there.
The intent is so you just learn Beam and can run on multiple backends (Beam runners).
If you are familiar with Keras and TensorFlow/Theano/Torch, the relationship between Keras and its backends is similar to the relationship between Beam and its data processing backends.
Google Cloud Platform's Cloud Dataflow is one backend for running Beam on.
They call it the Dataflow runner.
GCP's offering, Cloud Composer, is a managed Airflow implementation as a service, running in a Kubernetes cluster in Google Kubernetes Engine (GKE).
So you can either:
manual Airflow implementation, doing data processing on the instance itself (if your data is small (or your instance is powerful enough), you can process data on the machine running Airflow. This is why many are confused if Airflow can process data or not)
manual Airflow implementation calling Beam jobs
Cloud Composer (managed Airflow as a service) calling jobs in Cloud Dataflow
Cloud Composer running data processing containers in Composer's Kubernetes cluster environment itself, using Airflow's KubernetesPodOperator (KPO)
Cloud Composer running data processing containers in Composer's Kubernetes cluster environment with Airflow's KPO, but this time in a better isolated fashion by creating a new node-pool and specifying that the KPO pods are to be run in the new node-pool
My personal experience:
Airflow is lightweight and not difficult to learn (easy to implement), you should use it for your data pipelines whenever possible.
Also, since many companies are looking for experience using Airflow, if you're looking to be a data engineer you should probably learn it
Also, managed Airflow (I've only used GCP's Composer so far) is much more convenient than running Airflow yourself, and managing the airflow webserver and scheduler processes.
Apache Airflow and Apache Beam look quite similar on the surface. Both of them allow you to organise a set of steps that process your data and both ensure the steps run in the right order and have their dependencies satisfied. Both allow you to visualise the steps and dependencies as a directed acyclic graph (DAG) in a GUI.
But when you dig a bit deeper there are big differences in what they do and the programming models they support.
Airflow is a task management system. The nodes of the DAG are tasks and Airflow makes sure to run them in the proper order, making sure one task only starts once its dependency tasks have finished. Dependent tasks don't run at the same time but only one after another. Independent tasks can run concurrently.
Beam is a dataflow engine. The nodes of the DAG form a (possibly branching) pipeline. All the nodes in the DAG are active at the same time, and they pass data elements from one to the next, each doing some processing on it.
The two have some overlapping use cases but there are a lot of things only one of the two can do well.
Airflow manages tasks, which depend on one another. While this dependency can consist of one task passing data to the next one, that is not a requirement. In fact Airflow doesn't even care what the tasks do, it just needs to start them and see if they finished or failed. If tasks need to pass data to one another you need to co-ordinate that yourself, telling each task where to read and write its data, e.g. a local file path or a web service somewhere. Tasks can consist of Python code but they can also be any external program or a web service call.
In Beam, your step definitions are tightly integrated with the engine. You define the steps in a supported programming language and they run inside a Beam process. Handling the computation in an external process would be difficult if possible at all*, and is certainly not the way Beam is supposed to be used. Your steps only need to worry about the computation they're performing, not about storing or transferring the data. Transferring the data between different steps is handled entirely by the framework.
In Airflow, if your tasks process data, a single task invocation typically does some transformation on the entire dataset. In Beam, the data processing is part of the core interfaces so it can't really do anything else. An invocation of a Beam step typically handles a single or a few data elements and not the full dataset. Because of this Beam also supports unbounded length datasets, which is not something Airflow can natively cope with.
Another difference is that Airflow is a framework by itself, but Beam is actually an abstraction layer. Beam pipelines can run on Apache Spark, Apache Flink, Google Cloud Dataflow and others. All of these support a more or less similar programming model. Google has also cloudified Airflow into a service as Google Cloud Compose by the way.
*Apache Spark's support for Python is actually implemented by running a full Python interpreter in a subprocess, but this is implemented at the framework level.
Apache Airflow is not a data processing engine.
Airflow is a platform to programmatically author, schedule, and
monitor workflows.
Cloud Dataflow is a fully-managed service on Google Cloud that can be used for data processing. You can write your Dataflow code and then use Airflow to schedule and monitor Dataflow job. Airflow also allows you to retry your job if it fails (number of retries is configurable). You can also configure in Airflow if you want to send alerts on Slack or email, if your Dataflow pipeline fails.
I am doing the same as you with airflow, and I've got very good results. I am not very sure about the following: Beam is machine learning focused and airflow is for anything you want.
Finally you can create a hive with kubernetes +airflow.
Does GCP have a job scheduling service like Azure Scheduler, where jobs can be scheduled and managed dynamically via API?
Google Cron service is set in a static file and it seems like their answer to this is to use that to poke a roll your own service backed with PubSub and a data store. Looking for Quartz-like functionality, consumable by APP engine, which can be managed and invoked via API as opposed to managing a cluster, queue, and compute instance/VM deployment of Quartz (or the like) or rolling a custom solution. Should support 50 million simultaneous jobs per day with retry / recoverability and dynamic scheduling per tenant capabilities.
This is the cheapest and easiest way I can imagine building a solution today on top of an existing AppEngine based project:
As you observed, currently there is no such API/service directly available on GCP. There is an open feature request (on GAE) for it.
But, also as you observed, it is possible to build and use a custom solution, just like the one you proposed.
Depending on the context even simpler solutions are possible. For a GAE context check out, for example, How to schedule repeated jobs or tasks from user parameters in Google App Engine?.
I've successfully created a query with the Extractor tool found in Import.io. It does exactly what I want it to do, however I need to now run this once or twice a day. Is the purpose of Import.io as an API to allow me to build logic such as data storage and schedules tasks (running queries multiple times a day) with my own application or are there ways to scheduled queries and make use of long-term storage of my results completely within the Import.io service?
I'm happy to create a Laravel or Rails app to make requests to the API and store the information elsewhere but if I'm reinventing the wheel by doing so and they provides the means to address this then that is a true time saver.
Thanks for using the new forum! Yes, we have moved this over to Stack Overflow to maximise the community atmosphere.
At the moment, Import does not have the ability to schedule crawls. However, this is something we are going to roll out in the near future.
For the moment, there is the ability to set a Cron job to run when you specify.
Another solution if you are using the free version is to use a CI tool like travis or jenkins to schedule your API scripts.
You can query live the extractors so you don't need to make them run manually every time. This will consume one of your requests from your limit.
The endpoint you can use is:
https://extraction.import.io/query/extractor/extractor_id?_apikey=apikey&url=url
Unfortunately the script will not be a very simple one since most websites have very different respond structures towards import.io and as you may already know, the premium version of the tool provides now with scheduling capabilities.