How to catch dataflow serialization errors in direct pipeline tests - serialization

I am working on implementing a set of DoFn and transforms in dataflow for some of which I need to use 3rd party libraries which are not serializable.
I found a workaround to overcome the serialization issue for these with making the variable transient and initializing it lazily.
However, since my tests are running in direct pipeline (locally), I can only catch these cases if I run my pipeline on cloud.
Is there any recommended way to test dataflow pipelines locally in order to catch these kinds of exceptions/errors?

Related

How to test my pipeline before changing it?

When changing the pipelines for my company I often see the pipeline breaking under some specific condition that we did not anticipate. We use yaml files to describe the pipelines (Azure Devops)
We have multiple scenarios, such as:
Pipelines are run by automatic triggers, by other pipelines and manually
Pipelines share the same templates
There are IF conditions for some jobs/steps based on parameters (user input)
In the end, I keep thinking of testing all scenarios before merging changes, we could create scripts to do that. But it's unfeasible to actually RUN all scenarios because it would take forever, so I wonder how to test it without running it. Is it possible? Do you have any ideas?
Thanks!
I already tried the Preview endpoints from Azure REST api, which is good, but it only validates the input, such as variables and parameters. We also needed to make sure which steps are running and the variables being set in those
As far as I know (I am still new to our ADO solutions), we have to fully run/schedule the pipeline to see that it runs and then wait a day or so for the scheduler to complete the auto executions. At this point I do have some failing pipelines for a couple of days that I need o fix.
I do get emails when a pipeline fails like this in the json that holds the metadata to create a job:
"settings": {
"name": "pipelineName",
"email_notifications": {
"on_failure": [
"myEmail#email.com"
],
"no_alert_for_skipped_runs": true
},
Theres an equivalent extension that can be added in this link, but I have not done it this way and cannot verify if it works.
Azure Pipelines: Notification on Job failure
https://marketplace.visualstudio.com/items?itemName=rvo.SendEmailTask
I am not sure what actions your pipeline does but if there are jobs being scheduled there on external computes like Databricks, there should be a email alert system you can use to detect failures.
Other than that if you had multiple environments (dev, qa, prod) you could test in non production environment.
Or if you have a dedicated storage location that is only for testing a pipeline, use that for the first few days and then reschedule the pipeline in the real location after testing it completes a few runs.

spring batch API to dynamically create tasks and steps

Our application allows the users to dynamically assemble/construct Tasks and steps ( Remote calls to various services). Once the user has defined/constructed a Task with various Steps, the system will allow the Task to be run manually or scheduled.
To implement above functionality, what would it take to use spring batch for programmatically create a given Task and its constituent Steps ? I'm assuming that such tasks can be scheduled with the help of Quartz, etc.
I understand that spring XD and spring cloud watch incorporate similar features -- some pointers to the relevant XD codebase/examples(or any other project codebase) which would help in my task would help.
What considerations should I keep in mind ? Any gotchas ?
Currently not using any cloud platform but the next version will be deployed to a cloud platform.
Thanks very much.

Apache Airflow or Apache Beam for data processing and job scheduling

I'm trying to give useful information but I am far from being a data engineer.
I am currently using the python library pandas to execute a long series of transformation to my data which has a lot of inputs (currently CSV and excel files). The outputs are several excel files. I would like to be able to execute scheduled monitored batch jobs with parallel computation (I mean not as sequential as what I'm doing with pandas), once a month.
I don't really know Beam or Airflow, I quickly read through the docs and it seems that both can achieve that. Which one should I use ?
The other answers are quite technical and hard to understand. I was in your position before so I'll explain in simple terms.
Airflow can do anything. It has BashOperator and PythonOperator which means it can run any bash script or any Python script.
It is a way to organize (setup complicated data pipeline DAGs), schedule, monitor, trigger re-runs of data pipelines, in a easy-to-view and use UI.
Also, it is easy to setup and everything is in familiar Python code.
Doing pipelines in an organized manner (i.e using Airflow) means you don't waste time debugging a mess of data processing (cron) scripts all over the place.
Nowadays (roughly year 2020 onwards), we call it an orchestration tool.
Apache Beam is a wrapper for the many data processing frameworks (Spark, Flink etc.) out there.
The intent is so you just learn Beam and can run on multiple backends (Beam runners).
If you are familiar with Keras and TensorFlow/Theano/Torch, the relationship between Keras and its backends is similar to the relationship between Beam and its data processing backends.
Google Cloud Platform's Cloud Dataflow is one backend for running Beam on.
They call it the Dataflow runner.
GCP's offering, Cloud Composer, is a managed Airflow implementation as a service, running in a Kubernetes cluster in Google Kubernetes Engine (GKE).
So you can either:
manual Airflow implementation, doing data processing on the instance itself (if your data is small (or your instance is powerful enough), you can process data on the machine running Airflow. This is why many are confused if Airflow can process data or not)
manual Airflow implementation calling Beam jobs
Cloud Composer (managed Airflow as a service) calling jobs in Cloud Dataflow
Cloud Composer running data processing containers in Composer's Kubernetes cluster environment itself, using Airflow's KubernetesPodOperator (KPO)
Cloud Composer running data processing containers in Composer's Kubernetes cluster environment with Airflow's KPO, but this time in a better isolated fashion by creating a new node-pool and specifying that the KPO pods are to be run in the new node-pool
My personal experience:
Airflow is lightweight and not difficult to learn (easy to implement), you should use it for your data pipelines whenever possible.
Also, since many companies are looking for experience using Airflow, if you're looking to be a data engineer you should probably learn it
Also, managed Airflow (I've only used GCP's Composer so far) is much more convenient than running Airflow yourself, and managing the airflow webserver and scheduler processes.
Apache Airflow and Apache Beam look quite similar on the surface. Both of them allow you to organise a set of steps that process your data and both ensure the steps run in the right order and have their dependencies satisfied. Both allow you to visualise the steps and dependencies as a directed acyclic graph (DAG) in a GUI.
But when you dig a bit deeper there are big differences in what they do and the programming models they support.
Airflow is a task management system. The nodes of the DAG are tasks and Airflow makes sure to run them in the proper order, making sure one task only starts once its dependency tasks have finished. Dependent tasks don't run at the same time but only one after another. Independent tasks can run concurrently.
Beam is a dataflow engine. The nodes of the DAG form a (possibly branching) pipeline. All the nodes in the DAG are active at the same time, and they pass data elements from one to the next, each doing some processing on it.
The two have some overlapping use cases but there are a lot of things only one of the two can do well.
Airflow manages tasks, which depend on one another. While this dependency can consist of one task passing data to the next one, that is not a requirement. In fact Airflow doesn't even care what the tasks do, it just needs to start them and see if they finished or failed. If tasks need to pass data to one another you need to co-ordinate that yourself, telling each task where to read and write its data, e.g. a local file path or a web service somewhere. Tasks can consist of Python code but they can also be any external program or a web service call.
In Beam, your step definitions are tightly integrated with the engine. You define the steps in a supported programming language and they run inside a Beam process. Handling the computation in an external process would be difficult if possible at all*, and is certainly not the way Beam is supposed to be used. Your steps only need to worry about the computation they're performing, not about storing or transferring the data. Transferring the data between different steps is handled entirely by the framework.
In Airflow, if your tasks process data, a single task invocation typically does some transformation on the entire dataset. In Beam, the data processing is part of the core interfaces so it can't really do anything else. An invocation of a Beam step typically handles a single or a few data elements and not the full dataset. Because of this Beam also supports unbounded length datasets, which is not something Airflow can natively cope with.
Another difference is that Airflow is a framework by itself, but Beam is actually an abstraction layer. Beam pipelines can run on Apache Spark, Apache Flink, Google Cloud Dataflow and others. All of these support a more or less similar programming model. Google has also cloudified Airflow into a service as Google Cloud Compose by the way.
*Apache Spark's support for Python is actually implemented by running a full Python interpreter in a subprocess, but this is implemented at the framework level.
Apache Airflow is not a data processing engine.
Airflow is a platform to programmatically author, schedule, and
monitor workflows.
Cloud Dataflow is a fully-managed service on Google Cloud that can be used for data processing. You can write your Dataflow code and then use Airflow to schedule and monitor Dataflow job. Airflow also allows you to retry your job if it fails (number of retries is configurable). You can also configure in Airflow if you want to send alerts on Slack or email, if your Dataflow pipeline fails.
I am doing the same as you with airflow, and I've got very good results. I am not very sure about the following: Beam is machine learning focused and airflow is for anything you want.
Finally you can create a hive with kubernetes +airflow.

Calling API from PigLatin

Complete newbie to PigLatin, but looking to pull data from the MetOffice DataPoint API e.g.:
http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/xml/350509?res=3hourly&key=abc123....
...into Hadoop.
My question is "Can this be undertaken using PigLatin (from within Pig View, in Ambari)"?
I've hunted round for how to format a GET request into the code, but without luck.
Am I barking up the wrong tree? Should I be looking to use a different service within the Hadoop framework to accomplish this?
It is very bad idea to make calls to external services from inside of map-reduce jobs. The reason being that when running on the cluster your jobs are very scalable whereas the external system might not be so. Modern resource managers like YARN make this situation even worse, when you swamp external system with the requests your tasks on the cluster will be mostly sleeping waiting for reply from the server. The resource manager will see that CPU is not being used by tasks and will schedule more of your tasks to run which will make even more requests to the external system, swamping it with the requests even more. I've seen modest 100 machine cluster putting out 100K requests per second.
What you really want to do is to either somehow get the bulk data from the web service or setup a system with a queue and few controlled number of workers that will pull from the external system at set rate.
As for your original question, I don't think PigLatin provides such service, but it could be easily done with UDFs either Python or Java. With Python you can use excellent requests library, which will make your UDF be about 6 lines of code. Java UDF will be little bit more verbose, but nothing terrible by Java standards.
"Can this be undertaken using PigLatin (from within Pig View, in
Ambari)"?
No, by default Pig load from HDFS storage, unless you write your own loader.
And i share same point with #Vlad, that this is not a good idea, you have many other other components used for data ingestion, but this not a use case of Pig !

SQLite/Fluent NHibernate integration test harness initialization not repeatable after large data session

In one of my main data integration test harnesses I create and use Fluent NHibernate's SingleConnectionSessionSourceForSQLiteInMemoryTesting, to get a fresh session for each test. After each test, I close the connection, session, and session factory, and throw out the nested StructureMap container they came from. This works for almost any simple data integration test I can think of, including ones that utilize Fluent NHib's PersistenceSpecification object.
The trouble starts when I test the application's lengthy database bootstrapping process, which creates and saves thousands of domain objects. It's not that the first setup and tear-down of the test harness fails, in fact, the test harness successfully bootstraps the in-memory database as the application would bootstrap the real database in the production environment. The problem occurs when the database is bootstrapped a second time on a new in-memory database, with a new session and session factory.
The error is:
NHibernate.StaleStateException : Unexpected row count: 0; expected: 1
The row count is indeed Unexpected, the row that the application under test is looking for should be in the session. You see, it's not that any data from the last integration test is sticking around, it's that for some reason the session just stops working mid-database-boostrap. And I've looked everywhere for a place I might be holding on to an old session and I can't find one.
I've searched through the code for static singleton objects, but there are none anywhere near the code in question. I have a couple StructureMap InstanceScope singleton's but they are getting thrown out with each nested container that is lost after every test teardown.
I've tried every possible variation on disposing and closing every object involved with each test teardown and it still fails on this lengthy database bootstrap. I've even messed around with current_session_context_class to no avail. But non-bootstrap related database tests appear to work fine. I'm starting to run out of options and I may have to surrender lengthy database integration tests in favor of WatiN-based acceptance tests.
Can anyone give me any clue about how I can figure out why some of my SingleConnectionSessionSourceForSQLiteInMemoryTesting aren't repeatable?
Any advice at all, about how to make an NHibernate SqlLite database integration test harness repeatable for large data sessions?
Here is how we do it http://www.gears4.net/blog/archive/14/nhibernate-integration-testing
Hope it helps
I was able to solve this problem by not using an in memory database, and instead saving a hard copy file after initialization, once per test suite run. Then instead of reinitializing the database after each test, the file-based SqlLite database is copied and that new copy is used for testing.
Essentially, I only setup the initial database data once and save that database off to the side and it is copied for each test. There is a definite possibility that the problem could be on my end, but I suspect that there is an issue with large in-memory SqlLite databases. So I recommend using the file-mode of the database, if you are running into trouble with a large in-memory sqllite database.