Oozie solution to execute a query and get results back from SQL & Hive

I am trying to solve the problem below using Oozie. Any suggestions about a solution are much appreciated.
Background: I have developed code to import data from a SQL database (Oozie - Sqoop import), apply some transformations, and load the data into Hive. Now I need to do a count check between SQL and Hive for reconciliation.
Is there any way I can do that using Oozie?
I am thinking about executing the SQL query using "sqoop eval" and the Hive query using a "hive action" from Oozie, but I am wondering how we can get the results back to Oozie / capture the results after the query execution.
Once the results are available, I need to do the reconciliation in a subsequent action.

I implemented it using a PySpark action, executing sqoop eval for the source-side count and a Hive DataFrame count on the Hive side. It's working fine.
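For anyone looking for the shape of that solution, here is a minimal sketch, assuming an Oozie spark action; the JDBC URL, credentials, table names, and the parsing of sqoop eval's console output are placeholders to adapt:

import subprocess
from pyspark.sql import SparkSession

# Minimal sketch of the reconciliation, meant to run inside an Oozie spark action.
# The JDBC URL, credentials, and table names are placeholders.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1) Source-side count: sqoop eval runs the query directly on the SQL database.
out = subprocess.check_output([
    "sqoop", "eval",
    "--connect", "jdbc:sqlserver://dbhost:1433;database=mydb",
    "--username", "etl_user", "--password-file", "/user/etl/.pw",
    "--query", "SELECT COUNT(*) FROM dbo.mytable",
]).decode("utf-8")

# sqoop eval prints the result as an ASCII table; grab the first purely numeric
# token (fragile -- adjust the parsing to the output you actually see).
sql_count = int(next(tok for tok in out.split() if tok.isdigit()))

# 2) Hive-side count through the DataFrame API.
hive_count = spark.table("mydb.mytable").count()

# 3) Reconcile: raising makes spark-submit exit non-zero, so the Oozie action
# takes its error transition and the workflow can react to the mismatch.
if sql_count != hive_count:
    raise ValueError("Count mismatch: SQL=%d, Hive=%d" % (sql_count, hive_count))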

Related

Presto Trino - Execute from SQL file

Beginner here. I'm using the CLI with Presto/Trino (not sure of the right term); it looks to be a command line, and we are using Hive.
I can run SELECT statements and create tables. Now I'm trying to run multiple queries at once, so I created a SQL file and uploaded it to the Hive folder structure. I think I can execute all of them at once instead of going one by one.
How do I initiate the process of running the SQL queries from the file? I tried
--execute file user/hivefile.sql > result
and I'm getting nowhere.
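If this is the Presto/Trino CLI, --execute takes a literal SQL string; to run all the statements in a file you want the -f/--file flag instead. A minimal sketch, wrapped in Python only for illustration (server, catalog, schema, and path are placeholders); the same command works typed directly into your shell:

import subprocess

# Minimal sketch: the Presto/Trino CLI executes every statement in a file,
# in order, via -f/--file. Server, catalog, schema, and path are placeholders.
result = subprocess.run(
    ["trino",                      # older installs ship the binary as "presto"
     "--server", "http://coordinator:8080",
     "--catalog", "hive",
     "--schema", "default",
     "-f", "hivefile.sql"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)

One thing to watch: the CLI reads the file from the local filesystem of the machine you run it on, so uploading the .sql file into the Hive warehouse directory on HDFS will not make it visible to -f.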

AWS Glue: Is it possible to pull only specific data from a database?

I need to transform a fairly big database table with AWS Glue to CSV. However, I only need the newest table rows from the past 24 hours. There is a column which specifies the creation date of the row. Is it possible to transform just these rows, without copying the whole table into the CSV file? I am using a Python script with Spark.
Thank you very much in advance!
There are some Built-in Transforms in AWS Glue which are used to process your data. These transforms can be called from ETL scripts.
Please refer to the link below:
https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
You haven't mentioned the type of database that you are trying to connect to. In any case, for JDBC connections Spark has a query option, with which you can issue the usual SQL query to get only the rows you need.
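As a rough sketch of that second suggestion (the connection details and the created_at column are hypothetical; adjust them to your table), Spark's JDBC query option pushes the filter down so only the last 24 hours of rows leave the database:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-export").getOrCreate()

# Hypothetical connection details and column name (created_at), for illustration.
# The query option (Spark 2.4+) runs server-side, so only matching rows are read.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/mydb")
      .option("user", "etl_user")
      .option("password", "secret")
      .option("query",
              "SELECT * FROM mytable "
              "WHERE created_at >= NOW() - INTERVAL 1 DAY")
      .load())

df.write.mode("overwrite").csv("s3://my-bucket/export/", header=True)

On Spark versions before 2.4, the same effect is available by passing a parenthesized subquery to the dbtable option instead.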

Weird issue with Hive 0.12 in BigInsights 3.0

I have this simple query which works fine in Hive 0.8 in IBM BigInsights 2.0:
SELECT * FROM patient WHERE hr > 50 LIMIT 5
However, when I run this query using Hive 0.12 in BigInsights 3.0, it runs forever and returns no results.
Actually, the scenario is the same for the following query and many others:
INSERT OVERWRITE DIRECTORY '/Hospitals/dir' SELECT p.patient_id FROM patient1 p WHERE p.readingdate='2014-07-17'
If I exclude the WHERE part, then everything is fine in both versions.
Any idea what might be wrong with Hive 0.12 or BigInsights 3.0 when a WHERE clause is included in the query?
When you use a WHERE clause in the Hive query, Hive will run a map-reduce job to return the results. That's why the query usually takes longer: without the WHERE clause, Hive can simply return the content of the file that represents the table in HDFS.
You should check the status of the map-reduce job that is triggered by your query to find out if an error happened. You can do that by going to the Application Status tab in the BigInsights web console and clicking on Jobs, or by going to the job tracker web interface. If you see any failed tasks for that job, check the logs of the particular task to find out what error occurred. After fixing the problem, run the query again.

Hive query in Oozie coordinator

I am running 10 Hive scripts using an Oozie coordinator. One of the scripts is getting stuck in the reduce stage at the same percentage, without any error. The scripts are simple insert statements, and when I tested them on the command line they worked just fine. How do I debug this?
It was a data skew issue; 80% of the data was mapped to a single key. Once we upgraded to Hive 0.10, the skew join optimization resolved the issue.
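For anyone hitting the same wall, a minimal sketch of turning that optimization on when re-running such a script; the property names are Hive's standard skew-join settings, while the threshold value and script path are placeholders:

import subprocess

# Minimal sketch: re-run the stuck script with Hive's skew-join optimization on.
# Property names are standard Hive settings; threshold and script path are
# placeholders to adapt.
subprocess.run([
    "hive",
    "--hiveconf", "hive.optimize.skewjoin=true",  # handle skewed keys in a follow-up job
    "--hiveconf", "hive.skewjoin.key=100000",     # rows per key before a key counts as skewed
    "-f", "/scripts/stuck_insert.hql",
], check=True)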

Pig step execution details

I am a newbie to Pig.
I have written a small script in Pig, where I first load the data from two different tables and then right outer join them; later I also have another join for two different sets of data. It works fine. But I want to see the steps of execution, like in which step my data is loaded, so that I can note the time needed for loading, and later the details of the join step, like how much time it takes to join that many records.
Basically, I want to know which part of my Pig script is taking longer to run, so that I can further optimize my Pig script.
Is there any way we could println within the script and find out which steps have executed or have started executing?
Through the jobtracker details link I could not get much info; I could just see that a mapper is running and a reducer is running, but ideally I would like to see which part of the script a mapper is running for, and I could not find that.
For example, for a Hive job run we can see in the jobtracker details link which step is currently being executed.
Any information will be really helpful.
Thanks in advance.
I'd suggest you have a look at the following:
Pig's Progress Notification Listener
Penny: this is a monitoring tool, but I'm afraid it hasn't been updated in the recent past (e.g., it won't compile for Pig 0.12.0 unless you make some code changes)
Twitter's Ambrose project: https://github.com/twitter/ambrose
On the other hand, after executing the script you can see detailed statistics about the execution time of each alias (see: Job Stats (time in seconds)).
Have a look at the EXPLAIN operator. This doesn't give you real-time stats as your code is executing, but it should give you enough information about the MapReduce plan your script generates that you'll be able to match up the MR jobs with the steps in your script.
Also, while your script is running you can inspect the configuration of the Hadoop job. Look at the variables "pig.alias" and "pig.job.feature". These tell you, respectively, which of your aliases (tables/relations) is involved in that job and what Pig operations are being used (e.g., HASH_JOIN for a JOIN step, SAMPLER or ORDER BY for an ORDER BY step, and so on). This information is also available in the job stats that are output to the console upon completion.