How do I repeatedly run a Hive query using each line of a multi-line input as the parameter? - hive

Using Hue, I've got a Hive query that will take an input (e.g. an ID number) and return a record based on it. I need to handle multiple numbers to look up in one go (in serial or parallel) and collate the results (i.e. list the records for each, one after the other), so the input might be:
1234567890
45345353
32423422
1323122
etc...
I've got access to Hue (which I'm supposed to use), Hive, Oozie and Beeline. How do I:
1.) extract the number from each line
2.) repeatedly call my HiveQL query, passing in each number in turn
3.) supply the total output to the user in one go
I don't know Python, if that's relevant, but could attempt a shell script.
I'm guessing one way might be to get the multi-line user input via Oozie (can it prompt a user for input?), then pass that to a shell script which extracts the number from each line and uses beeline to repeatedly run my Hive query with the next number as the parameter?
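Roughly, something like this shell sketch is what I have in mind (untested; the file names, the JDBC URL and the hivevar name are placeholders I made up):
#!/bin/bash
# Read one ID per line from ids.txt and run the Hive query via Beeline,
# collecting all the results into a single output file.
# Assumes lookup.hql references its parameter as ${hivevar:id}.
JDBC_URL="jdbc:hive2://hiveserver:10000/default"   # placeholder connection string
> all_results.txt
while read -r id; do
  beeline -u "$JDBC_URL" --hivevar id="$id" -f lookup.hql >> all_results.txt
done < ids.txt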
Thanks

Related

JMeter - Execute SSH Commands in parallel

I need to simulate the below:
1. SSH (only once)
2. Execute a command on all the rows in a CSV file at once.
The number of rows in the CSV file is dynamic. If there are 10, the command needs to be executed over all 10 rows in parallel.
I'm not sure about using the SSH Command Sampler here, since the SSH connection and the command are entered in the same sampler. How do I separate these, i.e. SSH only once and then execute the commands in parallel? Which JMeter components do I use here?
Note: increasing the number of threads is not an efficient option. While doing this, many sessions get created, which in turn hangs the terminal. This option works fine for up to 10 users; I'm not sure if there's a limit on the number of sessions.
Thanks for your support.
Regards,
Ajith
Why do you think that increasing the number of threads is not an efficient option?
I would suggest moving the "SSH (only once)" part to a setUp Thread Group and putting the "execute a command on all the rows in a CSV file at once" bit under the normal Thread Group.
If the number of rows in the CSV file is dynamic, you can make the number of threads dynamic as well using the __groovy() function, like:
${__groovy(new File('/path/to/your/file.csv').readLines().size,)}
If you want to execute all 10 requests (or however many lines there are) at exactly the same moment, you can add a Synchronizing Timer.

Sequential Teradata Queries

I have a collection of SQL queries that need to run in a specific order using Teradata. How can this be done?
I've considered writing an application in another language (like Python or C++) to sequentially call each query, but am unsure how to get live data from Teradata there. I also want to keep the queries as separate SQL files (as they are currently).
The goal is to minimize the need for human interaction, i.e. I want to hit "Run" and let it take care of the rest.
BTEQ scripts are your go-to solution.
Have each query, or at least each logical block of several statements, in a single BTEQ script.
Then create a script that will call BTEQ with the needed settings, i.e. the TD logon command, and have this script called in a batch file with parameters like this:
start /wait C:\Teradata\BTEQ.bat Script_1.txt
start /wait C:\Teradata\BTEQ.bat Script_2.txt
start /wait C:\Teradata\BTEQ.bat Script_3.txt
pause
Then you can create several batch files, split into logical blocks, and have them executed at will or on a schedule.
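For reference, a minimal sketch of what the wrapper itself might contain, written here as a Unix shell equivalent since the exact contents of BTEQ.bat aren't shown (the logon details and server name are placeholders):
#!/bin/bash
# run_bteq.sh (hypothetical): log on, run the BTEQ script passed as $1, quit.
bteq <<EOF
.LOGON tdserver/myuser,mypassword
.RUN FILE = $1
.QUIT
EOF
Invoked as, e.g., ./run_bteq.sh Script_1.txt.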

Is there a way to pass multiple values of the same variable into a Hive job in Hue?

I have a Hive query in Hue with one input variable, a string (for example a date like '20160117').
I'd like to execute this Hive query in Hue and pass it multiple values for that single variable.
Is it possible? If yes, how would you guys do it?
Oozie runs Directed Acyclic Graphs (DAGs), and "acyclic" comes down to no loops, ever. But of course there are workarounds.
So, if you must run the same HQL script exactly N times with a different parameter value...
either copy/paste the Hive Action N times, in a chain, with a different param value (quick and dirty)
or build a Sub-Workflow with just the Hive action and call it N times, in a chain, with a different param value
On the other hand, if you must dynamically adapt the number and values of the executions, then you must work out the "loop" logic outside of Oozie proper...
for instance, start with a Shell action that creates an empty HQL file, then adds N queries in a loop, then uploads the file to HDFS; next, a Hive action that executes the HQL script as-is (quick and dirty, but not ideal for exception handling; see the sketch after this list)
or develop a Java program that connects to HiveServer2 via JDBC, submits a PreparedStatement with 1 bind variable, and executes the statement N times in a loop with different values of the variable.
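As an illustration of the first option, a minimal shell sketch (untested; the table name, column name and HDFS path are assumptions):
#!/bin/bash
# Build one query per ID, then upload the generated HQL script to HDFS
# so a follow-up Hive action can execute it as-is.
HQL=/tmp/generated.hql
> "$HQL"
while read -r id; do
  echo "SELECT * FROM my_table WHERE id = ${id};" >> "$HQL"
done < ids.txt
hdfs dfs -put -f "$HQL" /user/oozie/scripts/generated.hql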
And maybe, someday, Hive will support some kind of procedural language similar to PL/SQL, T-SQL, PL/pgSQL, etc., and you will be able to pass a comma-separated list of values and process it inside Hive.

Write results of SQL query to multiple files based on field value

My team uses a query that generates a text file over 500MB in size.
The query is executed from a Korn Shell script on an AIX server connecting to DB2.
The results are ordered and grouped by a specific field.
My question: is it possible, using SQL, to write the rows for each value of this field to their own text file?
For example: All rows with field VENDORID = 1 would go to 1.txt, VENDORID = 2 to 2.txt, etc.
The field in question currently has 1000+ distinct values, so I would expect the same number of text files.
Here is an alternative approach that gets each file directly from the database.
You can use the DB2 export command to generate each file. Something like this should be able to create one file:
db2 export to 1.txt of DEL select * from table where vendorid = 1
I would use a shell script or something like Perl to automate the execution of such a command for each value.
Depending on how fancy you want to get, you could just hardcode the set of vendorids, or you could first get the list of distinct vendorids from the table and use that.
This method might scale a bit better than extracting one huge text file first.
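A minimal Korn shell sketch of that automation (untested; assumes an already-connected DB2 CLP session and a table named MYTABLE):
#!/bin/ksh
# Export one delimited file per distinct VENDORID value.
# db2 -x suppresses column headings so the loop reads bare values.
db2 -x "select distinct vendorid from mytable" | while read -r vid; do
  [ -n "$vid" ] || continue   # skip blank lines in the CLP output
  db2 "export to ${vid}.txt of DEL select * from mytable where vendorid = ${vid}"
done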

Pentaho Data Integration: How to select output of SQL query as a filename for Microsoft Excel Input

I have files abc.xlsx, 1234.xlsx, and xyz.xlsx in some folder. My requirement is to develop a transformation where the Microsoft Excel Input step in PDI (Pentaho Data Integration) should pick a file based only on the output of a SQL query. If the query output is abc.xlsx, the Microsoft Excel Input step should pick up abc.xlsx for further processing. How do I achieve this? Would really appreciate your help. Thanks.
Transformations in Kettle run asynchronously, so you're probably going to need a job for this.
Files to create
Create a transformation that performs the SQL query you're looking for and populates a variable based on the result
Create a transformation that pulls data from the Excel file, using the variable populated as the filename
Create a job that executes the first transformation, then steps into the second transformation
Jobs run sequentially, so it will execute the first transformation, perform the query, get the result, and set a variable. Variables need to be set and retrieved in different transformations because of their asynchronous nature; this is the reason for the second transformation. The job won't step into the second transformation until the first one is done running (and therefore not until the variable is populated).
This all assumes you only want to run the transformation once, expecting a single result from the query. If you want to loop it, pulling data from a set, then the setup is a little different.
The Excel input step has an "Accept filenames from previous step" option. You can have a Table input step build the full path of the file you want to read (or build it later from the base directory and the short filename), pass the filename to the Excel input, tick that box, and specify the step and the field you want to use for the filename.