I am new to Hive and I am working on a project where I need to create a few UDFs for data wrangling. During my research, I came across two syntaxes for creating a UDF from added jars:
CREATE FUNCTION country AS 'com.hiveudf.employeereview.Country';
CREATE TEMPORARY FUNCTION country AS 'com.hiveudf.employeereview.Country';
I am not able to find any difference between the above two ways. Can someone explain it to me or guide me to the right material?
The main difference between CREATE FUNCTION and CREATE TEMPORARY FUNCTION is this:
In Hive 0.13 or later, functions can be registered to the metastore, so they can be referenced in a query without having to create a temporary function each session.
If we use CREATE TEMPORARY FUNCTION, we will have to recreate the function every time we start a new session.
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/ReloadFunction
CREATE TEMPORARY FUNCTION creates a new function which you can use in Hive queries for as long as the session lasts. This is faster, as we don't need to register the function to the metastore.
Whereas CREATE FUNCTION is permanent: the function is registered to the metastore, so it can be referenced in a query without having to create a temporary function each session.
When to use:
Create intermediate, one-off helper functions with TEMPORARY; anything that needs to be reused across sessions (including by permanent functions) should itself be created permanently.
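For concreteness, a hedged sketch of both registrations using the class from the question; the jar paths are assumptions for illustration:

-- Temporary: gone when the session ends; every new session must repeat both statements.
ADD JAR /tmp/employeereview.jar;
CREATE TEMPORARY FUNCTION country AS 'com.hiveudf.employeereview.Country';

-- Permanent: recorded in the metastore and visible to all future sessions.
-- The jar should live somewhere every session can reach, e.g. HDFS.
CREATE FUNCTION country AS 'com.hiveudf.employeereview.Country'
USING JAR 'hdfs:///user/hive/udf_jars/employeereview.jar';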
I cached a table in Databricks (SQL Notebook) using
CACHE TABLE work_details AS SELECT (....)
The problem is that I can only access the cached table if I am in the same notebook. I want to use the table in a different notebook (on the same cluster), but it throws the error Table or view not found.
Is there any workaround for this?
EDIT:
Note: I cannot use views here because the cached table has a lot of rows and is further joined with different tables to create the final required table. If I use views instead of the cached table, the time taken to create the final table increases, which I want to avoid.
Ques: Why can't I cache the table again in the new notebook?
Ans: This is the solution I am using right now, but I need a workaround where I can use this table across multiple notebooks without having to cache it again and again, while keeping the same performance.
You typically create a temp view when creating a cached table:
• A temp view is available only within the context of the notebook that created it and is a common way of sharing data
• A global temp view is available to all notebooks running on that Databricks cluster
Workaround:
Create Global Temp View which will be accessible on all Notebooks running on that Cluster.
%sql
CREATE GLOBAL TEMP VIEW <global-view-name> AS SELECT ...;
To access the global temp view, use the query below (note the mandatory global_temp schema prefix):
%sql
select * from global_temp.<global-view-name>;
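Since the edit rules out plain views for performance reasons, one hedged option: cache the global temp view itself, so other notebooks on the same cluster read the in-memory copy instead of re-running the query. The table and column names below are placeholders, not from the original post.

%sql
-- Notebook A (run once per cluster): define the view and pin it in memory.
CREATE GLOBAL TEMP VIEW work_details AS
SELECT order_id, customer_id, amount
FROM orders
WHERE amount > 0;

CACHE TABLE global_temp.work_details;

-- Notebook B (same cluster): reads the cached copy.
SELECT * FROM global_temp.work_details;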
My Hive version is 1.2.0.
I am doing Hive-HBase integration where my HBase table is already present.
While creating the Hive table, I was checking whether I can use a few of Hive's built-in date functions as candidates for virtual/derived columns, something like this:
create external table `Hive_Test`(
*existing hbase columns*,
*new_column* AS to_date(from_unixtime(unix_timestamp(*existing_column*,'yyyy/MM/dd HH:mm:ss')...
)CLUSTERED BY (..) SORTED BY (new_column) INTO n BUCKETS
..
WITH SERDEPROPERTIES(
"hbase.columns.mapping" = ':key,cf:*,:timestamp',
..
)
If there is any other way to use built-in functions in CREATE TABLE, please let me know.
Thanks.
With reference to Hive Computed Column, I think you are trying to define derived-column logic at table-creation time, which is not possible in Hive.
You can refer to this article for Apache Hive Derived Column Support and Alternative
A better way is to create a view on top of the non-native table created for the Hive-HBase integration; in the view you can apply almost any kind of mapping that fits your business needs.
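A hedged sketch of that view-based approach; the table layout, column names, and HBase table name are assumptions for illustration:

-- Base table maps the HBase columns as-is; no derived columns here.
CREATE EXTERNAL TABLE hbase_events (
  rowkey STRING,
  event_time STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:event_time")
TBLPROPERTIES ("hbase.table.name" = "events");

-- The derived column lives in a view, where built-in functions are fine.
CREATE VIEW hbase_events_v AS
SELECT rowkey,
       event_time,
       to_date(from_unixtime(unix_timestamp(event_time, 'yyyy/MM/dd HH:mm:ss'))) AS event_date
FROM hbase_events;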
I created a UDF in Hive, for example:
create function mydb.level as 'com.my.udf.level' using jar
'hdfs://hadoop01:8020/user/hive/udf_jars/dbtools-1.0-SNAPSHOT.jar';
Now I want to read data from the Hive table using Spark, like this:
spark.read().jdbc(myurl, "(select level(id) from my_tbl)t", prop);
It failed.
How can I use level() through the JDBC API?
When you register a Hive UDF as a permanent function, it is tied to the database in which it was created, and you need to qualify it with that database name when you call it. So in your case you need to call the UDF as follows:
spark.read().jdbc(myurl, "(select mydb.level(id) from my_tbl)t", prop);
Is there a way to locate the hdfs/local path of a particular UDF's jar/class file?
When I run "show functions" I am able to see this UDF, but I want to find out its location.
Hive has two types of functions: permanent (which includes the built-ins) and temporary.
Permanent: built-in functions are part of hive-exec*.jar, under the package org.apache.hadoop.hive.ql.udf; the jar itself sits at $HIVE_HOME/lib/hive-exec*.jar.
Temporary: these functions are added manually, so you will find the details in your Hive HQL file, e.g.
ADD JAR xyz.jar;
CREATE TEMPORARY FUNCTION temp AS 'com.example.hive.udf.Temp';
Since Hive 0.13, a UDF can also be added permanently as a plugin.
To find out details about a Hive function:
DESCRIBE FUNCTION EXTENDED function_name;
DESCRIBE FUNCTION function_name;
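If the UDF was registered permanently with USING JAR, the metastore also records the jar's URI. A hedged sketch querying the metastore database directly; FUNCS and FUNC_RU are the standard metastore schema tables, but verify them against your metastore version, and the function name is a placeholder:

-- Run against the metastore RDBMS (e.g. MySQL), not inside Hive.
SELECT f.FUNC_NAME, f.CLASS_NAME, r.RESOURCE_URI
FROM FUNCS f
JOIN FUNC_RU r ON r.FUNC_ID = f.FUNC_ID
WHERE f.FUNC_NAME = 'my_udf';  -- hypothetical function name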
Is there a way to create a temporary table using the RMySQL package? If so, what is the correct way to do it? In particular, I am trying to write a data frame from my R session to the temporary table. I have several processes running in parallel and I don't want to worry about name conflicts; that's why I want to make the tables temporary, so they are only visible to each individual session. The solution should somehow involve dbWriteTable and not dbSendQuery("create temporary table tbl;").
NOTE: I found some suggestions on the net to create the temporary table manually using dbSendQuery(con, "create temporary table x (x int)") and then simply overwrite it with dbWriteTable(). This does not work.
Depending on your MySQL account restrictions, can you not do
dbSendQuery(con, "create temporary table x (x int);")
dbSendQuery(con, "drop temporary table x;")
etc..
For this type of job, I would avoid reinventing the wheel and use
https://code.google.com/p/sqldf/
By default it targets SQLite, but it also works with MySQL (which I have never tried). This package is rock-solid and well documented.
This is actually a known issue in RMySQL. Your best bet might be to write the data to a temporary file and then construct your own LOAD DATA LOCAL INFILE statement, using RMySQL::mysqlWriteTable as a guide.
For bonus points, if you can patch RMySQL::mysqlWriteTable to work with temp files, send a pull request to the GitHub repo.
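In SQL terms, the manual route suggested above looks roughly like this; the file path and table layout are assumptions for illustration:

-- Session-scoped: invisible to other connections, dropped on disconnect.
CREATE TEMPORARY TABLE tbl (
  id INT,
  val DOUBLE
);

-- Bulk-load the data frame that R previously wrote to a local CSV.
LOAD DATA LOCAL INFILE '/tmp/tbl.csv'
INTO TABLE tbl
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the header row

Both statements must be issued over the same connection (e.g. via dbSendQuery on the same con), since a MySQL temporary table is visible only to the connection that created it.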