Is there a way to pass multiple values of the same variable into a Hive job in Hue?

I have a Hive query in Hue with one input variable, a string (for example a date like '20160117').
I'd like to execute this Hive query in Hue and pass it multiple values for that single variable.
Is it possible? If yes, how would you guys do it?

Oozie runs Directed Acyclic Graphs (DAGs), and "acyclic" comes down to no loops, ever. But of course there are workarounds.
So, if you must run the same HQL script exactly N times with a different parameter value...
either copy/paste the Hive Action N times, in a chain, with a different param value (quick and dirty)
or build a Sub-Workflow with just the Hive action and call it N times, in a chain, with a different param value
On the other hand, if you must dynamically adapt the number and the values of the executions, then you must work out the "loop" logic outside of Oozie proper...
for instance, start with a Shell action that creates an empty HQL file, appends N queries in a loop, then uploads the file to HDFS; next, a Hive action executes the HQL script as-is (quick and dirty, but not ideal for exception handling; see the sketch after this list)
or develop a Java program that connects to HiveServer2 via JDBC, submits a PreparedStatement with 1 bind variable, and executes the statement N times in a loop with different values of the variable.
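A minimal sketch of that Shell-action approach, assuming the parameter values arrive as command-line arguments; the table, columns and paths are illustrative:

#!/bin/bash
# Build one HQL file containing N queries, then upload it to HDFS
# so a follow-up Hive action can execute it as-is.
# Usage (names illustrative): gen_hql.sh 20160117 20160118 20160119
HQL=/tmp/generated_$$.hql
> "$HQL"                                   # start from an empty file
for dt in "$@"; do
  # append one query per parameter value
  echo "INSERT INTO TABLE report SELECT * FROM events WHERE ds='${dt}';" >> "$HQL"
done
hdfs dfs -put -f "$HQL" /user/oozie/generated.hql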
And maybe, someday, Hive will support some kind of procedural language similar to PL/SQL, T-SQL, PL/pgSQL, etc., and you will be able to pass a comma-separated list of values and process it inside of Hive.

Related

How do I repeatedly run a Hive query using each line of a multi line input as the parameter?

Using Hue, I've got a Hive query that will take an input (e.g. an ID number) and return a record based on that. I need to handle multiple numbers to look up in one go (in serial or in parallel) and collate the results (i.e. list the records for each, one after the other), so the input might be:
1234567890
45345353
32423422
1323122
etc...
I've got access to Hue (which I'm supposed to use), Hive, Oozie and Beeline. How do I:
1.) extract the number for each line
2.) repeatedly call my HiveQL query passing in each number in turn
3.) supply the total output to the user in one go
I don't know Python, if that's relevant, but could attempt a shell script.
I'm guessing one way might be to get the multi-line user input via Oozie (can it prompt a user for input?), then pass that to a shell script which extracts the number from each line and uses beeline to repeatedly run my Hive query with the next number as the parameter?
Thanks
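
A minimal sketch of the shell-plus-beeline route suggested in the question, assuming the numbers sit one per line in a file called ids.txt, the query lives in lookup.hql and references its input as ${id}, and the JDBC URL points at your HiveServer2 (all names are illustrative):

#!/bin/bash
# Run lookup.hql once per ID and collate every result into one file.
JDBC_URL="jdbc:hive2://hiveserver:10000/default"
> all_results.txt
while read -r id; do
  [ -z "$id" ] && continue                  # skip blank lines
  beeline -u "$JDBC_URL" --silent=true \
          --hivevar id="$id" -f lookup.hql >> all_results.txt
done < ids.txt

Note that Oozie is a batch scheduler and cannot prompt a user for input, so the ID list has to be prepared up front, e.g. uploaded to HDFS and passed to the workflow as a parameter.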

Variable values stored outside of SSIS

This is an SSIS question for advanced programmers. I have a SQL table that holds clientid, clientname, Filename, Ftplocationfolderpath, filelocationfolderpath.
This table holds a unique record for each of my clients. As my client list grows I add a new row to the table for that client.
My question is this: can I use the values in my SQL table and somehow reference each of them in my SSIS package variables based on client id?
The reason for the SQL table is that sometimes we get requests to change the delivery location or file name of a file we send externally. We would like to be able to change those things on the fly in the SQL table instead of having to export the package, change it manually, and re-import it each time. Each client has its own SSIS package.
Let me know if this is feasible. I'd appreciate any insight.
Yes, it is possible. There are two ways to approach this, and it depends on how the job runs: you are either running for a single client per job run, or for multiple clients per job run.
Either way, you will use the Execute SQL Task to retrieve data from the database and assign it to your variables.
You are running for a single client. This is fairly straightforward. In the Result Set, select the option for Single Row, map the single row's result to the package variables, and go about your processing.
You are running for multiple clients. In the Result Set, select Full Result Set and assign the result to a single package variable of type Object - give it a meaningful name like ObjectRs. You will then add a Foreach Loop container and configure its enumerator:
Type: Foreach ADO Enumerator
ADO object source variable: Select the ObjectRs.
Enumerator Mode: Rows in all the tables (ADO.NET dataset only)
In Variable mappings, map all of the columns in their sequential order to the package variables. This effectively transforms the package into a series of single transactions that are looped.
Yes.
I assume that you run your package once per client or use some loop.
At the beginning of the "per client" code, read all required values from the database into SSIS variables, then use these variables to define what you need. You should not hardcode client-specific information in the package.

Sequential Teradata Queries

I have a collection of SQL queries that need to run in a specific order using Teradata. How can this be done?
I've considered writing an application in some other language (like Python or C++) to sequentially call each query, but am unsure how to get live data from Teradata into it. I also want to keep the queries as separate SQL files (as they are currently).
The goal is to minimize the need for human interaction, i.e. I want to hit "Run" and let it take care of the rest.
BTEQ scripts are your go-to solution.
Have each query, or at least each logical block of several statements, in a single BTEQ script.
Then create a script that calls BTEQ with the needed settings, i.e. the TD logon command, and have this script called in a batch with parameters like this:
start /wait C:\Teradata\BTEQ.bat Script_1.txt
start /wait C:\Teradata\BTEQ.bat Script_2.txt
start /wait C:\Teradata\BTEQ.bat Script_3.txt
pause
Then you can create several batch files, split into logical blocks, and have them executed at will or on a schedule.
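For reference, each Script_N.txt above is a plain BTEQ script. A minimal sketch of what Script_1.txt might contain (the logon string, database and query are placeholders):

.LOGON tdpid/username,password

/* first logical block of SQL */
SELECT COUNT(*) FROM my_database.my_table;

/* abort with a non-zero return code if the last statement failed */
.IF ERRORCODE <> 0 THEN .QUIT 1

.LOGOFF
.QUIT 0

A non-zero .QUIT code surfaces as BTEQ's exit code, which the calling batch file can check via ERRORLEVEL if you want the chain to stop on failure.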

Hive query that will fail if table is empty

I have a series of workflows in Oozie that periodically fail silently by simply not filling the target table. The failures are a result of, among other things, a changed input, like a non-ASCII character or a double escape sneaking into the data, that kind of thing. However, the job actually finishes successfully. I would like the jobs to fail if the table does not fill. Is there any easy way to do this directly in Oozie, or with a simple Hive query that will fail on an empty table?
Oozie doesn't fail the action because it sees that the Hive query executed successfully; it doesn't care about anything else.
A workaround for your case:
a Hive action that loads the table
another Hive action that checks the count of the table and captures the output
a decision node that kills the workflow if the captured count is 0
Hope this workaround helps.
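
Alternatively, to make the failure happen inside Hive itself: the built-in assert_true() UDF (available since Hive 0.8) throws an exception when its argument is false, so a guard step like the following sketch fails the action whenever the table is empty (the JDBC URL and table name are placeholders):

#!/bin/bash
# Guard step: fail the job when the target table is empty.
# assert_true() throws when the condition is false, so beeline
# exits non-zero and the Oozie action fails with it.
beeline -u "jdbc:hive2://hiveserver:10000/default" \
        -e "SELECT assert_true(count(*) > 0) FROM target_table;"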

SSIS Script Component - only to change variables

I have a series of tasks that are very similar:
SELECT a,b FROM c
Lookup in another table and change value in column b.
Save the new value back to c and, if there is no match, send the result on to an error table.
That part is pretty straight forward and illustrated here:
Source ==> Lookup =match=> SQL Update command
=No match=> SQL Save Error command
(Hope you understand what I mean - but it works!)
I now have to repeat this a number of times, where my source-sql changes. So what I want to do is to insert a Script Component in front of the Source and set my User::Sql variable like:
Variables.Sql = "SELECT d, e FROM f"
All of the above is contained in a Data Flow. Once I have created one, I can copy it, change only the Sql variable in the script, and it should all work.
My problem is: when I insert the Script Component it asks me if it is a Source, a Destination, or a Transformation script. And by only setting the variable it does not produce any rows for output and cannot connect to my Source.
Anyone know how to make that work?
(I have simplified the above. I actually want to update multiple variables and use those in my Source, Lookup and Error update as well - so it is not simpler just to change the SQL script in the initial Source! But being able to do the above, I will be able to achieve what I want :-))
You should set your variable containing the SQL query in the control flow, before you execute the dataflow.
Then you need to use that variable in an expression in your data flow. You can parameterize the query used in the Lookup or any other part of your data flow.
If your data flows really always have the same structure, you could even generate a list of queries and call your data flow task in a loop, avoiding duplication of the same tasks.