I created a UDF in Hive, for example:
create function mydb.level as 'com.my.udf.level' using jar
'hdfs://hadoop01:8020/user/hive/udf_jars/dbtools-1.0-SNAPSHOT.jar';
Now I want to read data from a Hive table using Spark, like this:
spark.read().jdbc(myurl, "(select level(id) from my_tbl)t", prop);
It fails.
How can I use level() with the JDBC API?
When you register a Hive UDF as a permanent function, it is tied to the database in which it was created, so you also need to indicate the database when you call the UDF. In your case, call it as follows:
spark.read().jdbc(myurl, "(select mydb.level(id) from my_tbl)t", prop);
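As a sanity check, the subquery you pass to jdbc() is sent to HiveServer2 and has to be valid HiveQL on its own, so it can help to first run the database-qualified call directly in beeline:
select mydb.level(id) from my_tbl;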
I am new to Hive and I am working on a project where I need to create a few UDFs for data wrangling. During my research, I came across two syntaxes for creating a UDF from added jars:
CREATE FUNCTION country AS 'com.hiveudf.employeereview.Country';
CREATE TEMPORARY FUNCTION country AS 'com.hiveudf.employeereview.Country';
I am not able to find any difference between the two. Can someone explain it to me or point me to the right material?
The main difference between CREATE FUNCTION and CREATE TEMPORARY FUNCTION is this:
In Hive 0.13 or later, functions can be registered to the metastore, so they can be referenced in a query without having to create a temporary function each session.
If we use CREATE TEMPORARY FUNCTION, we have to recreate the function every time we start a new session.
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/ReloadFunction
CREATE TEMPORARY FUNCTION creates a new function that you can use in Hive queries only for as long as the session lasts. This is faster, as we don't need to register the function in the metastore.
CREATE FUNCTION, on the other hand, acts permanently. These functions are registered in the metastore and can be referenced in a query without having to be recreated each session.
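For example, a permanent function can be created once, together with the jar that contains it, and then called from any later session without re-registration (the database name, jar path, and table/column names below are illustrative):
CREATE FUNCTION mydb.country AS 'com.hiveudf.employeereview.Country'
USING JAR 'hdfs:///user/hive/udf_jars/employeereview.jar';
-- in a brand-new session, no re-registration needed:
SELECT mydb.country(review_text) FROM employee_reviews;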
When to use:
Intermediate functions that only need to compute something within the current session can be created as TEMPORARY; their results can later be used by permanent functions.
Reference
My Hive version is 1.2.0.
I am doing Hive-HBase integration, where my HBase table is already present.
While creating the Hive table, I was checking whether I can use a few of Hive's built-in date functions for virtual/derived columns, something like this:
create external table `Hive_Test`(
*existing hbase columns*,
*new_column* AS to_date(from_unixtime(unix_timestamp(*existing_column*,'yyyy/MM/dd HH:mm:ss')...
) CLUSTERED BY (..) SORTED BY (new_column) INTO n BUCKETS
..
WITH SERDEPROPERTIES(
'hbase.columns.mapping' = ':key,cf:*,:timestamp',
..
)
If there is any other way to use the built-in functions in CREATE TABLE, please let me know.
Thanks.
With reference to Hive Computed Column: I think you are trying to define derivation logic while creating the table, which is not possible in Hive.
You can refer to this article on Apache Hive derived column support and alternatives.
A better way is to create a view on top of the non-native table created for the Hive-HBase integration; in the view you can do almost any kind of mapping that fits your business needs.
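As an illustration, a minimal sketch of such a view, reusing the placeholder table, column, and timestamp format from the question:
CREATE VIEW hive_test_view AS
SELECT t.*,
       to_date(from_unixtime(unix_timestamp(existing_column, 'yyyy/MM/dd HH:mm:ss'))) AS new_column
FROM Hive_Test t;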
I'm new to NiFi, so could you help me understand this platform and its capabilities?
Would I be able to use a NiFi flow to create a new table in Hive and move data into it weekly from a Teradata database, in the way I've outlined below?
How would I go about it? I'm not sure whether I'm building a sensible flow.
Would the following flow suffice: QueryDatabaseTable (configure a connection pooling service for Teradata, define the new table name, and schedule the ingestion) --> PutHiveStreaming (to create and load the table defined earlier)?
And then how do I pull the Teradata schema into the new table?
If you want to create the new Hive table as part of the ingestion process:
Method 1:
The ConvertAvroToOrc processor adds a hive.ddl attribute (an external-table DDL statement) to the flowfile; you can execute that attribute with the PutHiveQL processor to create the table in Hive.
If you want to create a transactional table instead, you need to modify the hive.ddl attribute.
Refer to this link for more details.
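For illustration, the DDL carried in the hive.ddl attribute is roughly of the form below (table name, columns, and bucket count are placeholders; a LOCATION clause is typically appended so the external table points at the directory the ORC files were written to), and the second statement shows the kind of change needed for a transactional table:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (id INT, name STRING) STORED AS ORC
-- transactional variant: managed, bucketed, ORC, transactional=true
CREATE TABLE IF NOT EXISTS my_table_txn (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true')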
If you want to pull only the delta records from the source, you can use the ListDatabaseTables (lists all tables from the source database) + GenerateTableFetch (keeps incremental state) processors.
Method 2:
The QueryDatabaseTable processor produces flowfiles in Avro format; you can then use the ExtractAvroMetadata processor to extract the Avro schema and, with a small script, build a new attribute containing the DDL you require (i.e. for a managed/external/transactional table).
I am using Spark 1.4, with HiveContext used to connect to Hive. I did the following:
val hx = new HiveContext(sc)
import hx.implicits._
hx.sql("select * from tab").show
// it is fine, result was shown as expected
Then I inserted a few records into tab from the beeline console, and ran:
hx.refreshTable("tab")
hx.sql("select * from tab").show
// still old records, no newly inserted records
My question is: why didn't the HiveContext retrieve the newly inserted records?
hiveContext.refreshTable(tableName: String) refreshes only the metadata of the table (not the actual data).
Notes from the official documentation (credits: https://spark.apache.org):
refreshTable(tableName: String): Unit
Invalidate and refresh all the cached metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache.
To retrieve the newly inserted records: uncache first and cache again, using uncacheTable(String tableName) and cacheTable(String tableName).
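A minimal sketch of that sequence, using the SQL forms of the same cache operations (each statement can be submitted through hx.sql, assuming the table is named tab as in the question and was cached earlier):
UNCACHE TABLE tab
CACHE TABLE tab
SELECT * FROM tab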
If the target table is partitioned, you need to insert with the PARTITION clause. If you leave out the partition, the data will not be visible.
INSERT OVERWRITE TABLE tablename1 PARTITION (partcol1=val1, partcol2=val2...) SELECT col1,col2,.... FROM tablename2
In a slightly different case, I had an RDD coming from a Spark SQL statement via HiveContext. The solution that worked for me, after some experimenting, was to regenerate the RDD itself.
It does not matter whether you are using the Spark SQL DDL or sending SQL statements directly via hiveContext.sql.
I have seen people use a "count trick" to force recomputation of a dataset, but at least in my attempts I could not get the new data to show up that way.
In any case, caching, refreshing, and friends did not work for me; if somebody has a proper pattern here, please share.
I've been trying to store CSV data into a database table using a Pig script.
But instead of inserting the data into a database table, I ended up creating a new file in the metastore.
Can someone please let me know whether it is possible to insert data into a database table with a Pig script, and if so, what that script might look like?
You can take a look at DBStorage, but be sure to register the JDBC jar in your Pig script and to declare the storage UDF.
The documentation for the storage UDF is here:
http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/DBStorage.html
You can also store a relation (here called data) into an existing Hive table with HCatStorer:
STORE data INTO 'tablename' USING org.apache.hcatalog.pig.HCatStorer();
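Note that HCatStorer writes into a Hive table that already exists and requires Pig to be started with HCatalog support (pig -useHCatalog), so the target table has to be created in Hive first; a minimal sketch, with illustrative database, table, and column names:
CREATE TABLE mydb.csv_data (
  col1 STRING,
  col2 INT
);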