Which one to choose: Apache Oozie or Apache Airflow? Need a comparison

I am new to job schedulers and was looking for one to run jobs on a big data cluster. I was quite confused by the available choices, and found Oozie to have many limitations compared to already established schedulers such as TWS, Autosys, etc.
Need some comparison points on Oozie vs. Airflow.
Appreciate your help.

In my experience, Airflow is the best data pipeline orchestrator available right now. It is best suited for managing complex, long-running workflows, and its UI and modularity are excellent.
Airflow
+ Python code for DAGs (minimal example below)
+ Has connectors for every major service/cloud provider
+ More versatile
+ Advanced metrics
+ Better UI and API
+ Capable of creating extremely complex workflows
+ Jinja Templating
+ Can be used as an Orchestrator for the Tensorflow Extended ecosystem
= Can be parallelized
= Native Connections to HDFS, HIVE, PIG etc..
= Graph as DAG
Oozie
- Java or XML for DAGs
- hard to build complex pipelines
- smaller, less active community
- worse WEB GUI
- Java API
= Can be parallelized
= Native Connections to HDFS, HIVE, PIG etc..
= Graph as DAG
As you can see, Airflow is an easier-to-use (especially in a large, heterogeneous team), more versatile, and more powerful option than Oozie.
As I said: go with Airflow.
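To illustrate the "Python code for DAGs" and Jinja templating points, a minimal Airflow DAG might look like the sketch below (assuming Airflow 2.x; the DAG id, task names and the echoed command are placeholders, not from the original question):

```python
# A minimal, illustrative Airflow DAG (assumes Airflow 2.x); the dag_id, task names
# and the echoed command are placeholders, not taken from the original question.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract(**context):
    # Placeholder task: a real pipeline might pull data from HDFS, an API, etc.
    print(f"extracting data for {context['ds']}")


with DAG(
    dag_id="example_etl",              # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)

    # Jinja templating: {{ ds }} is rendered to the run's logical date at execution time.
    load_task = BashOperator(
        task_id="load",
        bash_command="echo 'loading partition dt={{ ds }}'",
    )

    extract_task >> load_task  # the >> operator defines the edges of the DAG
```

The equivalent Oozie workflow would require an XML workflow definition plus a job.properties file, which is where the "Java or XML for DAGs" point comes from.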


Kubernetes + TF Serving: how to serve hundreds of ML models without keeping hundreds of idle pods running?

I have hundreds of models, split by categories, projects, etc. Some of the models are heavily used while others are used infrequently.
How can I trigger a scale-up operation only when needed (for the models that are not frequently used), instead of running hundreds of pods serving hundreds of models while most of them are not being used? That is a huge waste of computing resources.
What you are trying to do is scale deployments to zero when they are not used.
K8s does not provide such functionality out of the box.
You can achieve it using Knative Pod Autoscaler.
Knative is probably the most mature solution available at the moment of writing this answer.
There are also some more experimental solutions, like osiris or zero-pod-autoscaler, that you may find interesting and that may be a good fit for your use case.

Developing and deploying jobs in Apache Flink

We started to develop some jobs with Flink. Our current dev/deployment process looks like this:
1. develop code in local IDE and compile
2. upload jar-file to server (via UI)
3. register new job
However, it turns out that the generated jar file is ~70 MB and the upload process takes a few minutes. What is the best way to speed up development (e.g., using an on-server IDE)?
One solution is to use a version control system: after committing and pushing your changes, you can build the jar on the server itself. You could write a simple script for this (see the sketch below).
The other solution, which takes more time and effort, is to set up a CI/CD pipeline that automates the entire process and minimises manual work.
Also, try not to use a fat jar, to keep the jar size down if you have to scp it to the cluster.
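If you go the scripted route, a rough Python sketch that automates steps 2 and 3 via the Flink JobManager REST API (POST /jars/upload followed by POST /jars/&lt;jar-id&gt;/run) could look like this; the JobManager address, jar path and entry class are hypothetical placeholders:

```python
# A rough sketch of automating jar upload and job submission with the Flink
# JobManager REST API. The JobManager address, jar path and entry class below
# are hypothetical placeholders.
import requests

FLINK = "http://flink-jobmanager:8081"  # assumed JobManager REST endpoint

# Upload the jar; the multipart form field must be named "jarfile".
with open("target/my-job.jar", "rb") as jar:
    upload = requests.post(f"{FLINK}/jars/upload", files={"jarfile": jar})
upload.raise_for_status()
jar_id = upload.json()["filename"].split("/")[-1]

# Register and run the uploaded jar as a new job.
run = requests.post(
    f"{FLINK}/jars/{jar_id}/run",
    json={"entryClass": "com.example.MyJob", "parallelism": 2},
)
run.raise_for_status()
print("submitted job:", run.json()["jobid"])
```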
First off, uploading 70 MB shouldn't take minutes nowadays. You might want to check your network configuration. Of course, if your internet connection is poor, there is not much you can do about it.
In general, I'd try to avoid cluster test runs as much as possible. It's slow and hard to debug. It should only ever be used for performance tests or right before releasing into production.
All logic should be unit tested. The complete job should be integration tested, and ideally you'd also have an end-to-end test. I recommend using a Docker-based approach for external systems, e.g. Testcontainers for Kafka, so that you are able to run all tests from your IDE.
Going onto the test cluster should then be a rare thing. If you find an issue that has not been covered by your automated tests, add a new test case and solve it locally, so that there is a high probability it will also be solved on the test cluster.
edit (addressing comment):
If it's much easier for you to write a Flink job to generate data, then that sounds like a viable option. I'm just afraid that you would also need to test that job...
It sounds more like you want an end-to-end setup where you run several Flink jobs in succession and compare the final results. That's a common setup for complex pipelines consisting of several Flink jobs. However, it's rather hard to maintain and may have unclear ownership if it involves components from several teams. I prefer to solve this by having plenty of metrics (how many records are produced, how many invalid records are filtered in each pipeline, ...) and specific validation jobs that just assess the quality of the (intermediate) output (potentially involving humans).
So, from my experience, either you can test the logic in some local job setup, or the system is so big that you spend much more time setting up and maintaining the test environment than actually writing production code. In the latter case, I'd rather trust and invest in the monitoring and quality-assurance capabilities of (pre-)production, which you need anyway.
If you really just want to test one Flink job with another, you can also run the Flink job in Testcontainers, so technically it's not an alternative but an addition.

BigQuery replaced most of my Spark jobs, am I missing something?

I've been developing Spark jobs for some years using on-premise clusters and our team recently moved to the Google Cloud Platform allowing us to leverage the power of BigQuery and such.
The thing is, I now often find myself writing processing steps in SQL rather than in PySpark, since it is:
easier to reason about (less verbose)
easier to maintain (SQL vs scala/python code)
easy to run in the GUI if needed
fast without having to really reason about partitioning, caching and so on...
In the end, I only use Spark when I've got something to do that I can't express using SQL.
To be clear, my workflow is often like :
preprocessing (previously in Spark, now in SQL)
feature engineering (previously in Spark, now mainly in SQL)
machine learning model and predictions (Spark ML)
Am I missing something?
Is there any con to using BigQuery this way instead of Spark?
Thanks
A con I can see is the additional time the Hadoop cluster needs to spin up and finish the job. By making a direct request to BigQuery, this extra time can be avoided.
If your tasks need parallel processing, I would recommend using Spark, but if your app is mainly used to access BQ, you might want to use the BQ client libraries and separate your current tasks:
BigQuery client libraries. They are optimized for connecting to BQ. There is a quickstart available, and you can use different programming languages such as Python or Java, among others (a minimal Python sketch follows after this list).
Spark jobs. If you still need to perform transformations in Spark and read the data from BQ, you can use the Dataproc BQ connector. While this connector is installed on Dataproc by default, you can also install it on-premises so that you can continue running your SparkML jobs with BQ data. Just in case it helps, you might also want to consider GCP services such as AutoML, BQ ML, and AI Platform Notebooks; they are specialized services for machine learning and AI.
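As a minimal sketch of the client-library route, assuming the google-cloud-bigquery package and default application credentials; the project, dataset and query are hypothetical:

```python
# A minimal sketch of querying BigQuery directly with the Python client library
# (google-cloud-bigquery); project, dataset and table names are hypothetical,
# and default application credentials are assumed to be configured.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

query = """
    SELECT user_id, COUNT(*) AS events
    FROM `my-project.my_dataset.events`
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.user_id, row.events)
```

For the Spark side, the BQ connector mentioned above lets you read the same table into a DataFrame with spark.read.format("bigquery").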
I'm using PySpark (on GCP Dataproc) and BigQuery, and we have jobs in both. I will summarize my view of the pros and cons of one system against the other, while admitting that your environment could be different, so something I consider a pro might not be one for you.
Pros of Spark:
better testing of the code: it is simpler to build unit tests and run them with mocked data and classes than to try doing this with BigQuery (see the sketch below)
it's possible to use SQL (SparkSQL) for operations and even combine operations over different data sources (DB, files, BQ)
we have JSON files in a format that BigQuery cannot parse (even though the files are valid JSON)
it is more natural to implement complicated logic for some cases, for example traversing arrays in nested fields and other complex calculations
better custom monitoring is possible; when we need to check specific metrics in the pipeline, it is easier to emit the related metrics (StatsD, etc.)
more natural for CI/CD processes
Pros of BigQuery (all with a note: if all data is available):
simplicity of SQL, when all data is available in a convenient format
DBAs who are not familiar with Python/Scala can still contribute (because they know SQL)
awesome infrastructure behind the scenes, very performant
With both approaches it's possible to quickly check the result in a GUI; for example, a Jupyter notebook lets you run PySpark interactively. I can't add notes about the ML-related traits, though.
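To illustrate the testing point in the Spark pros above, here is a hedged sketch of a pytest-style unit test that runs a transformation on a local SparkSession with mocked data; the function and column names are made up for this example:

```python
# A hedged illustration of the "better testing" point: a pytest-style unit test
# that runs a transformation on a local SparkSession with mocked data.
# The function add_total() and the column names are made up for this example.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F


def add_total(df):
    # Transformation under test: derive a "total" column from two input columns.
    return df.withColumn("total", F.col("price") * F.col("quantity"))


def test_add_total():
    spark = SparkSession.builder.master("local[1]").appName("unit-test").getOrCreate()
    try:
        input_df = spark.createDataFrame(
            [("a", 2.0, 3), ("b", 1.5, 4)],
            ["item", "price", "quantity"],
        )
        result = {r["item"]: r["total"] for r in add_total(input_df).collect()}
        assert result == {"a": 6.0, "b": 6.0}
    finally:
        spark.stop()
```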

Optimal data streaming and processing solution for enormous datasets into tf.data.Dataset

Context:
My text input pipeline currently consists of two main parts:
I. A complex text preprocessing and exporting of tf.SequenceExamples to tfrecords (custom tokenization, vocabulary creation, statistics calculation, normalization and many more over the full dataset as well as per each individual example). That is done once for each data configuration.
II. A tf.Dataset (TFRecords) pipeline that does quite a bit of processing during training, too (string_split into characters, table lookups, bucketing, conditional filtering, etc.; sketched below).
Original Dataset is present across multiple locations (BigQuery, GCS, RDS, ...).
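For illustration, a minimal sketch of the kind of operations part II involves (character split, table lookup, filtering, bucketing) might look like the following, assuming TF 2.6+ for Dataset.bucket_by_sequence_length; the feature name, vocabulary file, paths and bucket sizes are placeholders rather than the real configuration:

```python
# An illustrative sketch of the kind of tf.data pipeline described in part II
# (assumes TF 2.6+). The feature name, vocabulary file, paths and bucket
# boundaries are placeholders, not the actual setup.
import tensorflow as tf

vocab = tf.lookup.StaticVocabularyTable(
    tf.lookup.TextFileInitializer(
        "gs://my-bucket/vocab.txt",  # hypothetical character vocabulary
        tf.string, tf.lookup.TextFileIndex.WHOLE_LINE,
        tf.int64, tf.lookup.TextFileIndex.LINE_NUMBER),
    num_oov_buckets=1)


def parse(serialized):
    # Parse one tf.SequenceExample with a single text sequence feature.
    _, seq = tf.io.parse_single_sequence_example(
        serialized,
        sequence_features={"tokens": tf.io.FixedLenSequenceFeature([], tf.string)})
    chars = tf.strings.unicode_split(seq["tokens"], "UTF-8")  # split into characters
    return vocab.lookup(chars.values)                         # table lookup


dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob("gs://my-bucket/tfrecords/*.tfrecord"))
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .filter(lambda ids: tf.size(ids) > 0)        # conditional filtering
    .bucket_by_sequence_length(                  # length-based bucketing + padding
        element_length_func=lambda ids: tf.size(ids),
        bucket_boundaries=[32, 128],
        bucket_batch_sizes=[64, 32, 16])
    .prefetch(tf.data.AUTOTUNE))
```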
Problem:
The problem is that as the production dataset grows rapidly (several terabytes), it is not feasible to recreate the tfrecords files for each possible data configuration (part I has a lot of hyperparameters), as each would reach an enormous size of hundreds of terabytes. Not to mention that tf.Dataset reading speed surprisingly slows down as tf.SequenceExamples or tfrecords grow in size.
There are quite a few possible solutions:
Apache Beam + Cloud DataFlow + feed_dict;
tf.Transform;
Apache Beam + Cloud DataFlow + tf.Dataset.from_generator;
tensorflow/ecosystem + Hadoop or Spark
tf.contrib.cloud.BigQueryReader
but none of them seems to fully fulfill my requirements:
Streaming and processing data on the fly from BigQuery, GCS, RDS, ... as in part I.
Sending data (protos?) directly to tf.Dataset in one way or another to be used in part II.
Fast and reliable for both training and inference.
(optional) Being able to pre-calculate some full pass statistics over the selected part of the data.
EDIT: Python 3 support would be just wonderful.
What is the most suitable choice for the tf.data.Dataset pipeline? What are the best practices in this case?
Thanks in advance!
I recommend orchestrating the whole use case using Cloud Composer (the GCP-managed integration of Airflow).
Airflow provides operators that let you orchestrate a pipeline with a script.
In your case you can use the dataflow_operator to have the dataflow job spin up when you have enough data to process.
To get the data from BigQuery you can use the bigquery_operator.
Furthermore, you can use the Python operator or the Bash operator to monitor and pre-calculate statistics (a rough DAG sketch follows below).
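A rough sketch of such a Composer DAG, using the Airflow 1.10-era contrib operators named above (module paths and some parameters differ in newer Airflow releases); the SQL, table names, bucket and Beam pipeline file are hypothetical placeholders:

```python
# A rough sketch of a Composer DAG that exports a data slice from BigQuery and
# then runs an Apache Beam preprocessing pipeline on Dataflow. Uses the
# Airflow 1.10-era contrib operators; all names and paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

with DAG(
    dag_id="tf_preprocessing",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
) as dag:
    # Materialize the selected slice of data into a staging table.
    export_slice = BigQueryOperator(
        task_id="export_slice",
        sql="SELECT * FROM `my-project.raw.events` WHERE _PARTITIONDATE = '{{ ds }}'",
        destination_dataset_table="my-project.staging.events",
        write_disposition="WRITE_TRUNCATE",
        use_legacy_sql=False,
    )

    # Run the Beam preprocessing pipeline on Dataflow, writing TFRecords to GCS.
    preprocess = DataFlowPythonOperator(
        task_id="preprocess",
        py_file="gs://my-bucket/pipelines/preprocess.py",
        options={
            "input": "my-project.staging.events",
            "output": "gs://my-bucket/tfrecords/",
        },
    )

    export_slice >> preprocess
```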

Setting up environment for Hadoop datawarehousing (Hive)

I am new to Hadoop and trying to learn it from the data warehousing and analytics angle.
Can someone advise me on how to set up my practice machines, especially with regard to:
1. Number of machines/nodes required to start learning
2. Is it advisable to set it up on Windows?
3. What software needs to be installed
4. Availability of test/sample data
Also I would like to get advice on the best way to perform BI actions with Hive.
Thank you.
I would suggest downloading the Cloudera VM if you are more interested in the Hadoop machinery. Another way to jump-start immediately is to use Amazon EMR (Elastic MapReduce). There is an option to create an interactive Hive cluster there and start playing with datasets stored in S3.
Regarding the number of nodes: it depends on your goals. If you are interested in getting a feel for Hadoop performance, try at least 4-6 nodes.
Both approaches listed above are good if you do not have access to an organization's internal Hadoop/Hive cluster. Even in that case, I would suggest trying them first to gain some hands-on experience before using a shared environment.