How are Apache Pig UDFs distributed to data nodes? - apache-pig

There is plenty of documentation about how to write Pig UDFs in the various languages, but I haven't found anything on how they are distributed to the data nodes.
Is this done automatically when the Pig script is invoked? If it makes any difference, I'd be writing the UDF in Java.

Let me make it clearer. Suppose we write a UDF and Pig is running in HDFS mode. The UDF, which initially resides on the local (client) side, is carried to the cluster according to Hadoop's internal architecture. The UDF's work is then performed by a task tracker, and it is the job tracker's duty to assign the UDF to a task tracker that is near the data node where the input file resides.
Note: It's always the job tracker (a master daemon, typically run alongside the name node) that decides which task tracker should execute the UDFs.
If the input file is in the local file system (local mode), then the UDFs are executed in the local JVM.

The fact is that Apache Pig works in two modes:
1) local mode
2) hdfs mode
To answer your question, which concerns Pig running in hdfs mode: we only need to make sure that the input file we are loading is present in HDFS (on the data nodes). As for the UDF, it is simply a function that is used to process the input file, just like Pig Latin itself. We write both the UDFs and the Pig Latin on the client-side node, so everything related to them is stored on the client machine.
Above all, we have to configure Pig so that the client can interact with HDFS to produce the required result.
Hope this helps

Related

Read S3 file based on the path that comes in Kafka - Apache Flink

I have a pipeline that listens to a Kafka topic that receives an S3 file name and path. The pipeline has to read the file from S3 and do some transformation and aggregation.
I see that Flink supports reading S3 files directly as a source connector, but in this use case the file has to be read as part of the transformation stage.
I don't believe this is currently possible.
An alternative might be to keep a Flink session cluster running, and dynamically create and submit a new Flink SQL job running in batch mode to handle the ingestion of each file.
Another approach you might be tempted by would be to implement a RichFlatMapFunction that accepts the path as input, reads the file, and emits its records one by one. But this is likely to not work very well unless the files are rather small because Flink really doesn't like to have user functions that run for long periods of time.

Apache Airflow or Apache Beam for data processing and job scheduling

I'm trying to give useful information but I am far from being a data engineer.
I am currently using the Python library pandas to execute a long series of transformations on my data, which has a lot of inputs (currently CSV and Excel files). The outputs are several Excel files. I would like to be able to execute scheduled, monitored batch jobs with parallel computation (I mean not as sequential as what I'm doing with pandas), once a month.
I don't really know Beam or Airflow. I quickly read through the docs and it seems that both can achieve that. Which one should I use?
The other answers are quite technical and hard to understand. I was in your position before so I'll explain in simple terms.
Airflow can do anything. It has BashOperator and PythonOperator which means it can run any bash script or any Python script.
It is a way to organize (set up complicated data pipeline DAGs), schedule, monitor, and trigger re-runs of data pipelines, in an easy-to-view and easy-to-use UI.
Also, it is easy to set up and everything is in familiar Python code.
Doing pipelines in an organized manner (i.e using Airflow) means you don't waste time debugging a mess of data processing (cron) scripts all over the place.
Nowadays (roughly year 2020 onwards), we call it an orchestration tool.
Apache Beam is a wrapper for the many data processing frameworks (Spark, Flink etc.) out there.
The intent is so you just learn Beam and can run on multiple backends (Beam runners).
If you are familiar with Keras and TensorFlow/Theano/Torch, the relationship between Keras and its backends is similar to the relationship between Beam and its data processing backends.
Google Cloud Platform's Cloud Dataflow is one backend for running Beam on.
They call it the Dataflow runner.
GCP's offering, Cloud Composer, is a managed Airflow implementation as a service, running in a Kubernetes cluster in Google Kubernetes Engine (GKE).
So your options include:
1) a manual Airflow implementation, doing data processing on the instance itself (if your data is small or your instance is powerful enough, you can process data on the machine running Airflow; this is why many people are confused about whether Airflow can process data)
2) a manual Airflow implementation calling Beam jobs
3) Cloud Composer (managed Airflow as a service) calling jobs in Cloud Dataflow
4) Cloud Composer running data processing containers in Composer's Kubernetes cluster environment itself, using Airflow's KubernetesPodOperator (KPO)
5) Cloud Composer running data processing containers in Composer's Kubernetes cluster environment with Airflow's KPO, but this time in a better-isolated fashion, by creating a new node pool and specifying that the KPO pods are to run in that node pool
My personal experience:
Airflow is lightweight and not difficult to learn (easy to implement); you should use it for your data pipelines whenever possible.
Also, since many companies are looking for experience using Airflow, if you're looking to be a data engineer you should probably learn it.
Also, managed Airflow (I've only used GCP's Composer so far) is much more convenient than running Airflow yourself and managing the Airflow webserver and scheduler processes.
Apache Airflow and Apache Beam look quite similar on the surface. Both of them allow you to organise a set of steps that process your data and both ensure the steps run in the right order and have their dependencies satisfied. Both allow you to visualise the steps and dependencies as a directed acyclic graph (DAG) in a GUI.
But when you dig a bit deeper there are big differences in what they do and the programming models they support.
Airflow is a task management system. The nodes of the DAG are tasks and Airflow makes sure to run them in the proper order, making sure one task only starts once its dependency tasks have finished. Dependent tasks don't run at the same time but only one after another. Independent tasks can run concurrently.
Beam is a dataflow engine. The nodes of the DAG form a (possibly branching) pipeline. All the nodes in the DAG are active at the same time, and they pass data elements from one to the next, each doing some processing on it.
The two have some overlapping use cases but there are a lot of things only one of the two can do well.
Airflow manages tasks, which depend on one another. While this dependency can consist of one task passing data to the next one, that is not a requirement. In fact Airflow doesn't even care what the tasks do, it just needs to start them and see if they finished or failed. If tasks need to pass data to one another you need to co-ordinate that yourself, telling each task where to read and write its data, e.g. a local file path or a web service somewhere. Tasks can consist of Python code but they can also be any external program or a web service call.
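To make the task model concrete, here is a minimal sketch in plain Python. It uses only the standard library, not Airflow's API; the `run_in_dependency_order` helper and the task names are made up for illustration. The point is that the scheduler only needs the dependency graph and a way to start each task:

```python
from graphlib import TopologicalSorter


def run_in_dependency_order(tasks, dependencies):
    """Run each task only after all of its upstream tasks have finished.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of upstream task names
    Returns the task names in the order they were run.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    order = []
    while sorter.is_active():
        # Everything returned by get_ready() has no unfinished upstream
        # tasks, so a real scheduler could start these concurrently.
        for name in sorter.get_ready():
            tasks[name]()  # the scheduler does not care what the task does
            order.append(name)
            sorter.done(name)
    return order


# Hypothetical tasks: each would read and write its own files or services;
# the scheduler never sees the data itself.
log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "clean": lambda: log.append("clean"),
    "report": lambda: log.append("report"),
}
dependencies = {"clean": {"extract"}, "report": {"clean"}}
order = run_in_dependency_order(tasks, dependencies)
```

Here the tasks form a simple chain, so they run strictly one after another. Airflow adds scheduling, retries, monitoring and a UI on top of this basic idea.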
In Beam, your step definitions are tightly integrated with the engine. You define the steps in a supported programming language and they run inside a Beam process. Handling the computation in an external process would be difficult if possible at all*, and is certainly not the way Beam is supposed to be used. Your steps only need to worry about the computation they're performing, not about storing or transferring the data. Transferring the data between different steps is handled entirely by the framework.
In Airflow, if your tasks process data, a single task invocation typically does some transformation on the entire dataset. In Beam, the data processing is part of the core interfaces so it can't really do anything else. An invocation of a Beam step typically handles a single or a few data elements and not the full dataset. Because of this Beam also supports unbounded length datasets, which is not something Airflow can natively cope with.
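By contrast, the element-at-a-time model can be sketched with plain Python generators. This is not Beam's API, and a real runner would distribute and parallelize the steps, which this single-threaded sketch does not; the step names and sample records are invented for the example. Each step sees one element at a time and never the whole dataset:

```python
import itertools


def parse(records):
    """Turn 'name,value' strings into (name, int) pairs, one at a time."""
    for record in records:
        name, value = record.split(",")
        yield name, int(value)


def keep_positive(pairs):
    """Pass through only the pairs whose value is greater than zero."""
    for name, value in pairs:
        if value > 0:
            yield name, value


# Bounded input: elements flow through both steps one by one.
source = ["a,1", "b,-2", "c,3"]
result = list(keep_positive(parse(source)))

# Unbounded input: the same steps work because no step ever needs
# the full dataset. islice stops the otherwise endless stream.
unbounded = (f"k,{n}" for n in itertools.count(1))
first_three = list(itertools.islice(keep_positive(parse(unbounded)), 3))
```

The steps only describe the computation on each element; moving elements from one step to the next is left to the surrounding machinery (here, the generator protocol; in Beam, the runner).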
Another difference is that Airflow is a framework by itself, but Beam is actually an abstraction layer. Beam pipelines can run on Apache Spark, Apache Flink, Google Cloud Dataflow and others. All of these support a more or less similar programming model. By the way, Google has also cloudified Airflow into a service, Google Cloud Composer.
*Apache Spark's support for Python is actually implemented by running a full Python interpreter in a subprocess, but this is implemented at the framework level.
Apache Airflow is not a data processing engine.
Airflow is a platform to programmatically author, schedule, and monitor workflows.
Cloud Dataflow is a fully managed service on Google Cloud that can be used for data processing. You can write your Dataflow code and then use Airflow to schedule and monitor the Dataflow job. Airflow also allows you to retry your job if it fails (the number of retries is configurable). You can also configure Airflow to send alerts on Slack or by email if your Dataflow pipeline fails.
I am doing the same as you with Airflow, and I've got very good results. I am not very sure about the following: Beam is machine learning focused and Airflow is for anything you want.
Finally, you can create a hive with Kubernetes + Airflow.

File structure of Apache Beam DynamicDestinations write to BigQuery

I am using DynamicDestinations (from BigQueryIO) to export data from one Cassandra table to multiple Google BigQuery tables. The process consists of several steps including writing prepared data to Google Cloud Storage (as files in JSON format) and then loading the files to BQ via load jobs.
The problem is that the export process ended with an out-of-memory error at the last step (loading files from Google Storage to BQ). But the prepared files with all of the data remain in GCS. There are 3 directories in the BigQueryWriteTemp location:
And there are a lot of files with non-obvious names:
The question is: what is the storage structure of the files? How can I match the files with the tables (table names) they were prepared for? How can I use the files to continue the export process from the load jobs step? Can I use some piece of Beam code for that?
These files, if you're using Beam 2.3.0 or earlier, contain JSON data to be imported into BigQuery using its load job API. However:
This is an implementation detail that you cannot rely on in general. It is very likely to change in future versions of Beam (JSON is horribly inefficient).
It is not possible to match these files with the tables they are intended for - that was stored in the internal state of the pipeline that has failed.
There is also no way to know how much data was written to these files and how much wasn't. The files may contain only partial data: maybe your pipeline failed before creating some of the files, or after some of them were already loaded into BigQuery and deleted.
Basically, you'll need to rerun the pipeline and fix the OOM issue so that it succeeds.
For debugging OOM issues, I suggest using a heap dump. Dataflow can write heap dumps to GCS using --dumpHeapOnOOM --saveHeapDumpsToGcsPath=gs://my_bucket/. You can examine these dumps using any Java memory profiler, such as Eclipse MAT or YourKit. You can also post your code as a separate SO question and ask for advice reducing its memory usage.

Calling API from PigLatin

Complete newbie to PigLatin, but looking to pull data from the MetOffice DataPoint API e.g.:
http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/xml/350509?res=3hourly&key=abc123....
...into Hadoop.
My question is "Can this be undertaken using PigLatin (from within Pig View, in Ambari)"?
I've hunted round for how to format a GET request into the code, but without luck.
Am I barking up the wrong tree? Should I be looking to use a different service within the Hadoop framework to accomplish this?
It is a very bad idea to make calls to external services from inside MapReduce jobs. The reason is that when running on a cluster your jobs are very scalable, whereas the external system might not be. Modern resource managers like YARN make this situation even worse: when you swamp the external system with requests, your tasks on the cluster will mostly be sleeping, waiting for replies from the server. The resource manager will see that the CPU is not being used by the tasks and will schedule more of your tasks to run, which will make even more requests to the external system and swamp it even further. I've seen a modest 100-machine cluster putting out 100K requests per second.
What you really want to do is either somehow get the bulk data from the web service, or set up a system with a queue and a small, controlled number of workers that pull from the external system at a set rate.
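The queue-plus-workers idea might look roughly like the sketch below. It uses only the Python standard library; `fetch_all`, `fake_fetch`, the worker count, the interval and the extra site IDs are all illustrative assumptions, and the fake fetch function stands in for the real HTTP call:

```python
import queue
import threading
import time


def fetch_all(items, fetch, num_workers=2, min_interval=0.01):
    """Call fetch(item) for every item using a fixed pool of workers.

    num_workers caps the concurrency and min_interval is the pause each
    worker takes after a call, so the external system sees at most
    roughly num_workers / min_interval requests per second.
    """
    work = queue.Queue()
    for item in items:
        work.put(item)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is finished
            value = fetch(item)
            with lock:
                results[item] = value
            time.sleep(min_interval)  # crude per-worker rate limit

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Hypothetical stand-in for the real request to the external service.
def fake_fetch(site_id):
    return f"forecast for {site_id}"


results = fetch_all(["350509", "350510", "350511"], fake_fetch)
```

Unlike tasks on the cluster, the number of workers here never grows on its own, so the load on the external system stays bounded no matter how much work is queued.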
As for your original question, I don't think PigLatin provides such a service, but it could easily be done with UDFs in either Python or Java. With Python you can use the excellent requests library, which will make your UDF about 6 lines of code. A Java UDF will be a little more verbose, but nothing terrible by Java standards.
"Can this be undertaken using PigLatin (from within Pig View, in
Ambari)"?
No, by default Pig loads from HDFS storage, unless you write your own loader.
I agree with @Vlad that this is not a good idea. There are many other components designed for data ingestion, and this is not a use case for Pig!

Running multiple Kettle transformation on single JVM

We want to use pan.sh to execute multiple Kettle transformations. After exploring the script, I found that it internally calls the spoon.sh script, which runs in PDI. The problem is that every time a new transformation starts, it creates a separate JVM for its execution (invoked via a .bat file). However, I want to group them to use a single JVM, to overcome the memory constraints that the multiple JVMs are putting on the batch server.
Could somebody guide me on how I can achieve this, or share the documentation/resources with me?
Thanks for the good work.
Use Carte. This is exactly what it is for. You can start up a server (on the local box if you like) and then submit your jobs to it. One JVM, one heap, shared resources.
The benefit is scalability: when your box becomes too busy, just add another one, also running Carte, and start sending some of the jobs to that other server.
There's an old but still current blog here:
http://diethardsteiner.blogspot.co.uk/2011/01/pentaho-data-integration-remote.html
There is also documentation on the Pentaho website.
Starting the server is as simple as:
carte.sh <hostname> <port>
There is also a status page, which you can use to query your Carte servers, so if you have a cluster of servers, you can pick a quiet one to send your job to.