I want to generate a dynamic table name in HQL that runs using Beeline.
In DB2, I can implement this requirement using ||.
For example, using the year to generate a table name: 'as400.trxfintrx_' || year(current date). How can I implement this in Hive's HQL?
If I understand it correctly, you want the table name to be parameterized.
For that you can use Hive variables:
create table dbName.table1_${hivevar:yearMonthDate}
(
c1 int,
c2 int
)
stored as orc
tblproperties('orc.compress'='ZLIB');
$ hive -f test_create_table.hql --hivevar yearMonthDate=20190215
OK
Time taken: 1.149 seconds
$ hive
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> use dbname;
OK
Time taken: 0.726 seconds
hive> desc table1_20190215;
OK
c1 int
c2 int
Time taken: 0.302 seconds, Fetched: 2 row(s)
You can refer to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution
From the Beeline terminal, you cannot define a function to compute a parameter value and then use it in your queries.
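You can, however, compute the value outside and pass it to Beeline on the command line, exactly as with the hive CLI above; a minimal sketch, with a placeholder JDBC URL for your environment:
$ beeline -u "jdbc:hive2://your-hive-server:10000/default" --hivevar yearMonthDate=20190215 -f test_create_table.hql
Inside the script, the table name is still written as dbName.table1_${hivevar:yearMonthDate}.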
Hope this helps
Related
I run a particular query every week that creates a weekly table of the required data.
The table names are in the format as mentioned below
db_name.subscriptions_wk29 -- a table created for week 29 data
db_name.subscriptions_wk30 -- a table created for week 30 data
db_name.subscriptions_wk31 -- a table created for week 31 data
Since this is a repetitive task, I want to schedule this query so that it automatically runs every Monday to get the previous week's data.
The problem I am facing is that I don't know how to change my table name dynamically as I run my query every week.
So when I run my query next time it should automatically create a table named db_name.subscriptions_wk32. I can get the value 32 from weekofyear('2019-08-05') but don't know how to put it in the table name.
Currently I write it as below
CREATE TABLE db_name.subscriptions_wk30 AS -- a hardcoded name
SELECT *
FROM ..........
What I want is
CREATE TABLE db_name.subscriptions_wkCAST(weekofyear('2019-08-05') AS varchar) -- a dynamic name
SELECT *
FROM ..........
Which will result into
CREATE TABLE db_name.subscriptions_wk32
SELECT *
FROM ..........
P.S. I am using Hive/Hue as the RDBMS.
It is not possible to calculate the table name in the query itself, but it is possible to pass a parameter to the script.
You can calculate the parameter in a shell and execute the script from the shell:
#You can provide a date:
varDate=2019-08-05
#And calculate the week number from it
weeknumber=$(date --date=${varDate} +%V)
echo "${weeknumber}"
#returns 32
#Or calculate the week number for the current date
weeknumber=`date +%V`
#Or calculate the week number for 7 days ago (the previous week)
weeknumber="$(date -d "7 days ago" +"%V")"
#And call hive script like this:
hive -e "CREATE TABLE db_name.subscriptions_wk${weeknumber} -- parametrized name suffix
SELECT *
FROM ...
"
Or you can use the --hivevar parameter on the hive command line to call a script file (-f option); suppose weeknumber is already calculated as before:
hive --hivevar weeknumber="$weeknumber" -f script_file_name
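The script file then references the variable with the ${hivevar:...} syntax. A minimal sketch of what script_file_name could contain (the source table is a hypothetical stand-in for the "FROM ..." in the question):
CREATE TABLE db_name.subscriptions_wk${hivevar:weeknumber} AS
SELECT *
FROM db_name.subscriptions_staging;  -- hypothetical source table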
You can use the shell script below to create a dynamic weekly table and schedule it to run every Monday using the Oozie scheduler or as a cron job.
#!/bin/bash
echo "Executing the hive query - get current week and store it in shell variable"
#current_week=$(hive -e "select weekofyear(current_date);")
#echo $current_week
previous_week=$(hive -e "select weekofyear(date_sub(current_date, 7));")
echo $previous_week
hive --hiveconf dbname=test_dev_db --hiveconf weekname=$previous_week -f hdfs://xxx.host.com:8020/user/xxx/dev/hadoop/hivescripts/createweektable.hql
echo "Executing the hive query - ends"
hive (test_dev_db)> desc test_dev_db.subscriptions_wk31;
OK
user_id int
country string
last_modified_date date
Time taken: 0.345 seconds, Fetched: 3 row(s)
Update:
This is how you can reference your shell variables in your HQL script.
CREATE TABLE ${hiveconf:dbname}.subscriptions_wk${hiveconf:weekname}
row format delimited
fields terminated by '|'
STORED AS ORC
AS select * from test_dev_db.test_data;
Don't do this! Having multiple parallel tables with the same structure is a really bad idea.
Instead, have a single table db_name.subscriptions and add a column that specifies the week -- perhaps the first Monday or last Sunday of the week.
Then, instead of creating separate tables, just insert rows for each week.
You will find advantages to having a single table:
The database will not be cluttered with lots of tables with similar names.
SQL statements that run on the report can run on any week by changing the where clause (which can be parameterized) rather than changing the from clause (which cannot be).
It is easy to write queries that look at changes over time.
It is easy to see what weeks are available by querying the table.
And making the weeks partitions of the same table is very useful if each week produces a non-trivial number of rows.
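A minimal sketch of that single partitioned table (column names are illustrative, not from the original post):
-- one table, with the week as a partition column
CREATE TABLE db_name.subscriptions (
user_id int,
country string
)
PARTITIONED BY (week_no int)
STORED AS ORC;
-- every Monday, add the previous week's rows to the right partition instead of creating a new table
INSERT INTO TABLE db_name.subscriptions PARTITION (week_no = 32)
SELECT user_id, country
FROM db_name.subscriptions_staging;  -- hypothetical source table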
I have a table with a partition column of type int that I want to convert to string. However, I can't figure out how to do this.
The table description is:
Col1 timestamp
Col2 string
Col3 string
Col4 string
Part_col int
# Partition information
# col_name data_type comment
Part_col int
The partitions I have created are Part_col=0, Part_col=1, ..., Part_col=23
I want to change them to Part_col='0' etc
I run this command in hive:
set hive.exec.dynamic.partitions = true;
Alter table tbl_name partition (Part_col=0) Part_col Part_col string;
I have also tried using "partition (Part_col)" to change all partitions at once.
I get the error "Invalid column reference Part_col"
I am using the example from https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types for conversion of decimal columns but can't figure out what dec_column_name represents.
Thanks
A bit of digging revealed that there was a Hive JIRA for a command to do exactly this, updating a partition column's data type (https://issues.apache.org/jira/browse/HIVE-3672):
alter table {table_name} partition column ({column_name} {column_type});
According to the JIRA, the command was implemented, but apparently it was never documented on the Hive wiki.
I used it on my Hive 0.14 system and it worked as expected.
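Applied to the table from the question, it would presumably be:
alter table tbl_name partition column (Part_col string);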
I think you should redefine the table's schema so that your partition value is no longer an integer but a string type.
What I recommend you do is:
Make your table external (in case you defined it as a non-external table). That way you can drop the table without removing the data in the directories.
Drop the table.
Create the table again with the new schema (partition value as a string).
Physically (in the folder structure), the steps above will not make any difference to the structure you already had. The difference will be in the Hive metastore, specifically in the "virtual column" created when you make partitions.
Also, instead of writing queries like part_col = 1, you will now be able to write queries like part_col = '1'.
Try this and tell me how it goes.
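A rough HQL sketch of those steps, assuming the column names from the question, a placeholder data location, and an extra MSCK REPAIR at the end to re-register the existing partition directories:
-- 1. make the table external so DROP leaves the files in place
ALTER TABLE tbl_name SET TBLPROPERTIES ('EXTERNAL'='TRUE');
-- 2. drop the old definition
DROP TABLE tbl_name;
-- 3. recreate it with the partition column as a string
CREATE EXTERNAL TABLE tbl_name (
Col1 timestamp,
Col2 string,
Col3 string,
Col4 string
)
PARTITIONED BY (Part_col string)
LOCATION '/path/to/tbl_name';  -- placeholder path
-- 4. re-register the existing partition directories
MSCK REPAIR TABLE tbl_name;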
I'm trying to fetch the last modified timestamp of a table in Hive.
Please use the below command:
show TBLPROPERTIES table_name ('transient_lastDdlTime');
Get the transient_lastDdlTime from your Hive table:
SHOW CREATE TABLE table_name;
Then copy and paste the transient_lastDdlTime value into the query below to get it as a timestamp:
SELECT CAST(from_unixtime(your_transient_lastDdlTime_value) AS timestamp);
With the help of the above answers, I have created a simple solution for future developers.
time_column=`beeline --hivevar db=hiveDatabase --hivevar tab=hiveTable --silent=true --showHeader=false --outputformat=tsv2 -e 'show create table ${db}.${tab}' | egrep 'transient_lastDdlTime'`
time_value=`echo $time_column | sed 's/[|,)]//g' | awk -F '=' '{print $2}' | sed "s/'//g"`
tran_date=`date -d #$time_value +'%Y-%m-%d %H:%M:%S'`
echo $tran_date
I used a beeline alias. Make sure you set up the alias properly before invoking the above script. If no alias is used, replace beeline above with the complete beeline command (with the JDBC connection string). Leave a question in the comments if anything is unclear.
There is already an answer here for how to see the last modified date of a Hive table. I am just sharing how to check the last modified date of a Hive table partition.
Connect to the Hive cluster to run Hive queries. In most cases, you can simply connect by running the command: hive
DESCRIBE FORMATTED <database>.<table_name> PARTITION(<partition_column>=<partition_value>);
In the response you will see something like this: transient_lastDdlTime 1631640957
SELECT CAST(from_unixtime(1631640957) AS timestamp);
You may get the timestamp by executing
describe formatted table_name
You can execute the command below and convert the transient_lastDdlTime value in its output from a Unix timestamp to a date. It will give the last modified timestamp for the table.
show create table TABLE_NAME;
If you are using MySQL as the metastore database, use the following:
select TABLE_NAME, UPDATE_TIME, TABLE_SCHEMA from TABLES where TABLE_SCHEMA = 'employees';
A kind of edge case: when saving a parquet table in Spark SQL with partitioning,
// schema definition
final StructType schema = DataTypes.createStructType(Arrays.asList(
DataTypes.createStructField("time", DataTypes.StringType, true),
DataTypes.createStructField("accountId", DataTypes.StringType, true),
...
DataFrame df = hiveContext.read().schema(schema).json(stringJavaRDD);
df.coalesce(1)
.write()
.mode(SaveMode.Append)
.format("parquet")
.partitionBy("year")
.saveAsTable("tblclick8partitioned");
Spark warns:
Persisting partitioned data source relation into Hive metastore in
Spark SQL specific format, which is NOT compatible with Hive
In Hive:
hive> describe tblclick8partitioned;
OK
col array<string> from deserializer
Time taken: 0.04 seconds, Fetched: 1 row(s)
Obviously the schema is not correct; however, if I use saveAsTable in Spark SQL without partitioning, the table can be queried without a problem.
The question is: how can I make a parquet table in Spark SQL compatible with Hive, including the partition info?
That's because DataFrame.saveAsTable creates RDD partitions but not Hive partitions. The workaround is to create the table via HQL before calling DataFrame.saveAsTable. An example from SPARK-14927 looks like this:
hc.sql("create external table tmp.partitiontest1(val string) partitioned by (year int)")
Seq(2012 -> "a", 2013 -> "b", 2014 -> "c").toDF("year", "val")
.write
.partitionBy("year")
.mode(SaveMode.Append)
.saveAsTable("tmp.partitiontest1")
A solution is to create the table with Hive and then save the data with ...partitionBy("year").insertInto("default.mytable").
In my experience, creating the table in Hive and then using ...partitionBy("year").saveAsTable("default.mytable") did not work. This is with Spark 1.6.2.
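For reference, the Hive-side DDL that the create-then-insertInto approach relies on would look roughly like this (names mirror the SPARK-14927 example above):
create external table default.mytable (val string)
partitioned by (year int)
stored as parquet;
With the table defined this way, the insertInto call above writes into partitions that the Hive metastore already knows about, instead of the Spark-specific layout.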
If I write a Hive SQL statement like
ALTER TABLE tbl_name ADD PARTITION (dt=20131023) LOCATION 'hdfs://path/to/tbl_name/dt=20131023';
how can I query this partition's location later? I found there is some data in the location, but I can't query it with Hive SQL like
SELECT data FROM tbl_name where dt=20131023;
Do a describe on the partition instead of the full table.
This will show the linked location if it's an external table.
describe formatted tbl_name partition (dt='20131023')
show table extended like 'tbl_name' partition (dt='20131023');
Show Tables/Partitions Extended
SHOW TABLE EXTENDED will list information for all tables matching the given regular expression. Users cannot use regular expression for table name if a partition specification is present. This command's output includes basic table information and file system information like totalNumberFiles, totalFileSize, maxFileSize, minFileSize, lastAccessTime, and lastUpdateTime. If partition is present, it will output the given partition's file system information instead of table's file system information.
If you have multiple nested partitions, the syntax is:
describe formatted table_name partition (day=123,hour=2);
If you want to know the location of files you're reading, use
SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE FROM <table> WHERE <part_name> = '<part_key>'
Then you get
hdfs:///user/hive/warehouse/<db>/<table>/<part_name>=<part_key>/000000_0.snappy, 0
hdfs:///user/hive/warehouse/<db>/<table>/<part_name>=<part_key>/000000_1.snappy, 0
This is the format of the command I use to get the exact HDFS location of a specific partition in a specific table:
show table extended like flight_context_fused_record partition(date_key='20181013', partition_id='P-DUK2nESsv', custom_partition_1='ZMP');
In the command above, the partition spec consists of three separate fields. Your example may have more or fewer.
See results below. Notice the "location:" field shows the HDFS folder location.
hive (nva_test)> show table extended like flight_context_fused_record partition(date_key='20181013', partition_id='P-DUK2nESsv', custom_partition_1='ZMP');
OK
tableName:flight_context_fused_record
owner:nva-prod
location:hdfs://hdp1-ha/tmp/vfisher/cms-context-acquisition-2019-06-13/FlightContextFusedRecord/2018/10/13/ZMP/P-DUK2nESsv
inputformat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
outputformat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
columns:struct columns { string primary_key, string facility, string position, i32 dalr_channel, i64 start_time_unix_millis, i64 end_time_unix_millis, string foreign_key_to_audio_segment, struct<on_frequency_flight_list:list<struct<acid:string,ac_type:string>>,transfer_list:list<struct<primary_key:string,acid:string,data_id:string,ac_type:string,from_facility:string,from_position:string,transition_time:i64,transition_time_start:i64,transtition_time_end:i64,to_facility:string,to_position:string,source:string,source_info:string,source_time:i64,confidence:double,confidence_description:string,uuid:string>>,source_list:list<string>,domain:string,domains:list<string>> flight_context}
partitioned:true
partitionColumns:struct partition_columns { i32 date_key, string partition_id, string custom_partition_1}
totalNumberFiles:1
totalFileSize:247075687
maxFileSize:247075687
minFileSize:247075687
lastAccessTime:1561122938361
lastUpdateTime:1561071155639
The generic form of the command (taking out my specific values and putting in argument specifiers) looks like this:
show table extended like <your table name here> partition(<your partition spec here>);
you can simply do this:
DESC FORMATTED tablename PARTITION (yr_no='y2019');
OR
DESC EXTENDED tablename PARTITION (yr_no='y2019');
You can get the location of the Hive partitions on HDFS by running any of the following Hive commands.
DESCRIBE FORMATTED tbl_name PARTITION(dt=20131023);
SHOW TABLE EXTENDED LIKE tbl_name PARTITION(dt=20131023);
Alternatively, you can also get it by running the HDFS list command:
hdfs dfs -ls <your Hive store location>/<tablename>
Link: Hive show or list all partitions
Thanks,
NNK
You can get this info via the Hive Metastore Thrift protocol, e.g. with the hmsclient library:
Hive cli:
hive> create table test_table_with_partitions(f1 string, f2 int) partitioned by (dt string);
OK
Time taken: 0.127 seconds
hive> alter table test_table_with_partitions add partition(dt=20210504) partition(dt=20210505);
OK
Time taken: 0.152 seconds
Python cli:
>>> from hmsclient import hmsclient
>>> client = hmsclient.HMSClient(host='metastore-host', port=9083)  # host/port are placeholders for your Metastore
>>> with client as c:
...     partition = c.get_partition_by_name(db_name='default',
...                                         tbl_name='test_table_with_partitions',
...                                         part_name='dt=20210504')
...
>>> partition.sd.location
'hdfs://hdfs.master.host:8020/user/hive/warehouse/test_table_with_partitions/dt=20210504'