Using Oozie to schedule a Hive job that contains a Python UDF

Is it possible to schedule an Oozie job with a Hive action that itself contains a user defined function written in Python? If so, then how does one supply the Python function to Oozie?
Many thanks
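For context, Python "UDFs" in Hive are typically streaming scripts: the script is registered with ADD FILE and invoked through a SELECT TRANSFORM(...) USING 'python your_script.py' clause, and in an Oozie Hive action the script can be shipped next to the query with a <file> element in the workflow. A minimal sketch of such a script (the name upper_udf.py and the column layout are just illustrative assumptions):
# upper_udf.py -- toy streaming script used as a Hive "UDF" via TRANSFORM.
# Reads tab-separated rows from stdin and writes transformed rows to stdout.
import sys

for line in sys.stdin:
    cols = line.rstrip("\n").split("\t")
    cols[0] = cols[0].upper()          # example transformation on the first column
    print("\t".join(cols))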

Related

Saving output from parsing a JSON file and passing it to BigQueryInsertJobOperator

I need some advice on solving this requirement for auditing purposes. I am using Airflow (Cloud Composer) with a Dataflow Java operator job that writes a JSON file with status and error message details into the Airflow data folder after job completion. I want to extract the status and error message from that JSON file via some operator and then pass them to the next pipeline job, BigQueryInsertJobOperator, which calls the stored proc with the status and error message as input parameters, and finally they get written into a BQ dataset table.
Thanks
You need to use XCom and Jinja templating. When you return metadata from an operator, the data is stored in XCom, and you can retrieve it using Jinja templating or Python code in a PythonOperator (or Python code in your custom operator).
These are two very good articles from Marc Lamberti (who also has really nice courses on Airflow) describing how templating and Jinja can be leveraged in Airflow: https://marclamberti.com/blog/templates-macros-apache-airflow/ and this one describes XCom: https://marclamberti.com/blog/airflow-xcom/
By combining the two you can get what you want.
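As a rough illustration (not the poster's actual DAG; the file path, task id, and stored-procedure call are placeholder assumptions), the pattern looks something like this:
# Sketch: one task parses the JSON status file and returns it (pushed to XCom);
# a downstream templated field pulls it back with Jinja.
import json
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def parse_status(**_):
    # Placeholder path to the file written by the Dataflow job.
    with open("/home/airflow/gcs/data/dataflow_result.json") as f:
        result = json.load(f)
    # The returned dict is automatically stored in XCom.
    return {"status": result["status"], "error": result.get("error_message", "")}

with DAG("audit_example", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    parse = PythonOperator(task_id="parse_status", python_callable=parse_status)
    # A downstream BigQueryInsertJobOperator can then reference the XCom in a
    # templated field (e.g. the query inside its configuration), for example:
    #   CALL my_ds.audit_proc('{{ ti.xcom_pull(task_ids="parse_status")["status"] }}',
    #                         '{{ ti.xcom_pull(task_ids="parse_status")["error"] }}')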

How to monitor Databricks jobs using the CLI or the Databricks API to get information about all jobs

I want to monitor the status of the jobs to see whether they are running over time or have failed. If you have a script or any reference, then please help me with this. Thanks.
You can use the databricks runs list command to list all the job runs. This will list all runs and their current status: RUNNING/FAILED/SUCCESS/TERMINATED.
If you want to see whether a job is running over time, you would then have to use the databricks runs get --run-id command to get the metadata for that run. This returns JSON from which you can parse out the start_time and end_time.
# Lists job runs.
databricks runs list
# Gets the metadata about a run in json form
databricks runs get --run-id 1234
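If you want to script the check, a small sketch along these lines (the run id and one-hour threshold are just examples) can parse that JSON:
# Sketch: call "databricks runs get", parse the JSON, and flag a long-running run.
import json
import subprocess

RUN_ID = "1234"                                   # example run id
out = subprocess.check_output(["databricks", "runs", "get", "--run-id", RUN_ID])
run = json.loads(out)
start_ms = run["start_time"]                      # epoch milliseconds
end_ms = run.get("end_time") or 0                 # 0 / missing while the run is active
if end_ms and end_ms - start_ms > 60 * 60 * 1000:
    print(f"Run {RUN_ID} took longer than one hour")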
Hope this helps get you on track!

Oozie solution to execute a query and get results back from SQL & Hive

I am trying to solve the problem below using Oozie. Any suggestions about a solution are much appreciated.
Background: I had developed code to import data from a SQL database using Oozie (Sqoop import), did some transformation, and loaded the data into Hive. Now I need to do a count check between SQL and Hive for reconciliation.
Is there any way I can do that using Oozie?
I am thinking about executing the SQL query using "sqoop eval" and the Hive query using a "hive action" from Oozie, but I am wondering how we can get the results back to Oozie / capture the results after the query execution.
Once the results are available I need to do the reconciliation in a subsequent action.
I implemented it using a py-spark action, executing the sqoop eval and Hive DataFrame counts there. It's working fine.
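A rough sketch of that kind of py-spark reconciliation (the JDBC URL, credentials, and table names are placeholders; here the source count is read over JDBC rather than through sqoop eval):
# reconcile_counts.py -- compare a source RDBMS row count with a Hive table row count.
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Source-side count via JDBC (placeholder connection details).
source_count = (spark.read.format("jdbc")
                .option("url", "jdbc:mysql://dbhost:3306/src_db")
                .option("dbtable", "(SELECT COUNT(*) AS cnt FROM src_table) t")
                .option("user", "user")
                .option("password", "password")
                .load()
                .collect()[0]["cnt"])

# Hive-side count (placeholder table name).
hive_count = spark.sql("SELECT COUNT(*) AS cnt FROM target_db.target_table").collect()[0]["cnt"]

if source_count != hive_count:
    # A non-zero exit makes the Oozie spark action fail, which the workflow can
    # route to a kill node.
    sys.exit(1)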

Read Hive table in Oozie MapReduce action

I have a workflow in which an Oozie MapReduce action is supposed to read data from a Hive table and give it to the appropriate mapper. I have not been able to find the corresponding settings/properties for the workflow.
The data of the Hive table is stored in an HDFS folder, so why don't you read it from there directly?
CREATE TABLE t01 (line STRING) STORED AS TEXTFILE LOCATION 'pathToStoredFile';

Validate a Sqoop import that uses QUERY and WHERE clauses

I am operationalizing a data import process that takes data from an existing database and partitions it within an HDFS scheme. By default, the job is split into four map processes, and right now I have the job configured to run on a daily interval through Apache Oozie.
Since Oozie is DAG-oriented, is there the capacity to create a validation step within the Oozie workflow such that it would:
Run a Hive query on the newly imported data to return the row count
Run a SQL query to return the row count in the original data source
Compare the two values
If they do not match, return FAIL and kill the job; if they match, return TRUE and OK
I understand there is a validate option within Sqoop, but it is my understanding that since I am not running this against a single table, it is not applicable (each of my Sqoop imports is partitioned by a specific date).
Is this possible?
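For what it's worth, a minimal sketch of the comparison step, assuming earlier workflow actions have each captured their row count into a one-line text file that a later action can read (the file names are placeholders):
# compare_counts.py -- sketch of a reconciliation check run from an Oozie action.
import sys

with open("hive_count.txt") as f:      # assumed output of the Hive count step
    hive_count = int(f.read().strip())
with open("source_count.txt") as f:    # assumed output of the source-DB count step
    source_count = int(f.read().strip())

if hive_count != source_count:
    print(f"Row count mismatch: hive={hive_count} source={source_count}")
    sys.exit(1)  # non-zero exit fails the action, so the workflow can transition to a kill node
print("Row counts match")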