Built-in Hive UDF for identity

I have a requirement where I need a function that takes an input and returns it as output without performing any operation in Hive, just like a placeholder.
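As far as I know, Hive has no built-in identity UDF (selecting the column itself is the usual "identity"), but a pass-through UDF is trivial to write. A minimal sketch using Hive's classic UDF API, with placeholder package and class names:

package com.example.hive;  // placeholder package

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class IdentityUDF extends UDF {
    // Pass-through: returns the input string unchanged, performing no operation on it.
    public Text evaluate(Text input) {
        return input;
    }
}

Build it into a jar, register it with CREATE FUNCTION as usual, and identity(col) then behaves as a no-op placeholder. For non-string types you would add overloaded evaluate methods, or switch to the GenericUDF API.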

Related

How to set a default value for empty column data in a copy activity from a CSV file using Azure Data Factory v2

I have multiple CSV files and multiple tables.
The table name is the file name, and the column names come from the first row of the CSV file.
Now I want empty strings to be mapped to a default value in the sink table.
Consider my scenario,
employee:
id int, name varchar, is_active bit NULL
employee.csv:
id|name|is_active
1|raja|
Now, when I try to copy the CSV data to the PostgreSQL table, it throws an error.
The expected result is a default value wherever the CSV field is empty.
You can use NULLIF in PostgreSQL:
NULLIF(argument_1,argument_2);
The NULLIF function returns NULL if argument_1 equals argument_2; otherwise it returns argument_1.
This way you can turn the empty value into NULL (and, combined with COALESCE, replace it with a default).
If your error is related to a type mismatch, consider casting the column first.
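For example, assuming the CSV rows first land in a hypothetical staging table staging_employee, and that is_active maps to a boolean with false as the desired default (both assumptions about your target schema):

-- '' becomes NULL via NULLIF; NULL then falls back to false via COALESCE
INSERT INTO employee (id, name, is_active)
SELECT id,
       name,
       COALESCE(NULLIF(is_active, '')::boolean, false)
FROM staging_employee;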
Thanks!
To reproduce the issue, I set up the following, and the copy succeeded:
Source Dataset: employee.csv from Azure Blob Storage
Sink Dataset: here I used Azure SQL DB as the sink because of some limitations on my side, but the setup for PostgreSQL is almost identical.
Copy Activity Settings:
Under the copy activity's mapping settings there is a type-conversion option; import the schema there, or add the mappings dynamically.
An alternative is to use a Data Flow: if you have multiple fields to handle, use the derived column transformation to generate new columns in your data flow or to modify existing fields.
For more details, refer to Derived column transformation in mapping data flow.
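For this scenario, a derived column expression might look like the following (the column name comes from the question; defaulting to 'false' is just an assumption):

iif(isNull(is_active) || is_active == '', 'false', is_active)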
You can also refer to this Microsoft Q&A post for more insights: Copy Task failure because of conversion failure

How do I repeatedly run a Hive query using each line of a multi-line input as the parameter?

Using Hue, I've got a Hive query that takes an input (e.g. an ID number) and returns a record based on it. I need to look up multiple numbers in one go (in serial or parallel) and collate the results (i.e. list the records for each, one after the other), so the input might be:
1234567890
45345353
32423422
1323122
etc...
I've got access to Hue (which I'm supposed to use), Hive, Oozie and Beeline. How do I:
1.) extract the number from each line
2.) repeatedly call my HiveQL query passing in each number in turn
3.) supply the total output to the user in one go
I don't know Python if that's relevant but could attempt a shell script.
I'm guessing one way might be to get the multi-line user input via Oozie (can it prompt a user for input?), then pass that to a shell script which extracts the number from each line and uses beeline to repeatedly run my Hive query with the next number as the parameter?
Thanks
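That shell-script idea can work. A minimal sketch, assuming the IDs sit one per line in a file ids.txt, and treating the connection URL, table, and column names as placeholders:

#!/bin/sh
# Run the same parameterized query once per line of input,
# collating all output into a single results file.
while read -r id; do
  beeline -u "jdbc:hive2://myhost:10000/default" \
          --hivevar id="$id" \
          -e "SELECT * FROM my_table WHERE id = \${id};"
done < ids.txt > results.txt

Each iteration substitutes the current number via --hivevar, and the single redirect gives the user all the output in one go. Oozie cannot prompt a user mid-run; workflow parameters are supplied at submission time, so the input file would typically be uploaded to HDFS beforehand and the script run as a Shell action.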

Using elemMatch in Hive with a JSON field

I am using Hive for JSON storage, and I have created a table with only one string column containing the whole JSON document. I have tested the get_json_object function that Hive offers, but I am not able to write a query that iterates over all the subdocuments in a list and finds a value in a specific field.
In MongoDB, this problem can be solved by using $elemMatch as the documentation says.
Is there any way to do something like this in Hive?
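There is no direct equivalent of $elemMatch as far as I know, but for simple membership tests you can approximate it with get_json_object's array wildcard. A sketch, assuming a hypothetical table json_docs with a string column doc holding JSON shaped like {"items":[{"type":"a"},{"type":"b"}]}:

SELECT doc
FROM json_docs
-- '$.items[*].type' returns a JSON array of every type value, e.g. ["a","b"];
-- the LIKE then checks whether "a" appears among them.
WHERE get_json_object(doc, '$.items[*].type') LIKE '%"a"%';

For anything more involved (matching several fields of the same subdocument at once, as $elemMatch does), a JSON SerDe or a custom UDF is usually the cleaner route.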

How to register a Hive UDF before Spark reads a table via JDBC

I created a UDF in Hive, for example:
create function mydb.level as 'com.my.udf.level' using jar
'hdfs://hadoop01:8020/user/hive/udf_jars/dbtools-1.0-SNAPSHOT.jar';
Now I want to read data from the Hive table using Spark, like this:
spark.read().jdbc(myurl, "(select level(id) from my_tbl)t", prop);
It failed.
How can I use level() with the JDBC API?
When you register a Hive UDF as a permanent function, it is tied to the database in which it was created, and you need to qualify it with that database name when you call it. So in your case you need to call the UDF as follows:
spark.read().jdbc(myurl, "(select mydb.level(id) from my_tbl)t", prop);

Is there a way to pass multiple values of the same variable into a Hive job in Hue?

I have a Hive query in Hue with one input variable, a string (for example a date like '20160117').
I'd like to execute this Hive query in Hue and pass it multiple values for that single variable.
Is it possible? If yes, how would you guys do it?
Oozie runs Directed Acyclic Graphs (DAGs), and "acyclic" comes down to: no loops, ever. But of course there are workarounds.
So, if you must run the same HQL script exactly N times with a different parameter value...
either copy/paste the Hive Action N times, in a chain, with a different param value (quick and dirty)
or build a Sub-Workflow with just the Hive action and call it N times, in a chain, with a different param value
On the other hand, if you must adapt dynamically the number and the value of executions, then you must work out the "loop" logic outside of Oozie proper...
for instance, start with a Shell action that creates an empty HQL file, then adds N queries in a loop, then uploads the file to HDFS; next, a Hive action that executes the HQL script as-is (quick and dirty, but not ideal for exception handling)
or develop a Java program that connects to HiveServer2 via JDBC, submits a PreparedStatement with 1 bind variable, and executes the statement N times in a loop with different values of the variable.
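A minimal sketch of that last option (the URL, credentials, query, and parameter values are all placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HiveLoop {
    public static void main(String[] args) throws Exception {
        String[] dates = {"20160117", "20160118", "20160119"}; // the N parameter values
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://myhost:10000/default", "user", "")) {
            // One statement, one bind variable, executed once per value.
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT * FROM my_table WHERE ds = ?");
            for (String d : dates) {
                ps.setString(1, d);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1)); // collate all results together
                    }
                }
            }
        }
    }
}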
And maybe, someday, Hive will support some kind of procedural language similar to PL/SQL, T-SQL, PgSQL, etc., and you will be able to pass a comma-separated list of values and process it inside Hive.
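In the meantime, for simple filters you can sometimes avoid the loop altogether by passing the whole comma-separated list as a single Hive variable and testing membership with split and array_contains (table, column, and variable names are placeholders):

-- invoked e.g. as: hive --hivevar dates=20160117,20160118 -f query.hql
SELECT *
FROM my_table
WHERE array_contains(split('${dates}', ','), ds);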