I have created an external table using Hive. My
hive> desc <table_name>;
shows the following output:
OK
transactiontype string
transactionid int
sourcenumber int
destnumber int
amount int
assumedfield1 int
transactionstatus string
assumedfield2 int
assumedfield3 int
transactiondate date
customerid int
# Partition Information
# col_name data_type comment
transactiondate date
customerid int
Time taken: 0.094 seconds, Fetched: 17 row(s)
But when I execute the following command:
hive> show partitions <dbname.tablename>;
OK
Time taken: 0.11 seconds
No partitions are shown. What might be the problem? When I look at hive.log, the data in the table seems to be partitioned properly according to the 'transactiondate' and 'customerid' fields. Also, what is the maximum number of partitions a single node should have? I have set 1000 partitions.
2015-06-15 10:33:44,713 INFO [LocalJobRunner Map Task Executor #0]: exec.FileSinkOperator (FileSinkOperator.java:createBucketForFileIdx(593)) - Writing to temp file: FS hdfs://localhost:54310/home/deepak/mobile_money_jan.txt/.hive-staging_hive_2015-06-15_10-30-53_308_5507019849041735537-1/_task_tmp.-ext-10002/transactiondate=2015-01-16/customerid=34560544/_tmp.000002_0
I am running hive on a single node hadoop cluster.
Try adding the partitions manually:
hive> ALTER TABLE db.table ADD IF NOT EXISTS
    > PARTITION (datadate='2017-01-01')
    > LOCATION 'hdfs_location/datadate=2017-01-01';
Hi. Whenever we create an external table, its location is recorded in the Hive metastore, so that change is reflected there. But the partition information remains unchanged: partitions are not registered in the metastore automatically, so we need to add them manually:
ALTER TABLE "your-table" ADD PARTITION(transactiondate='datevalue',customerid='id-value');
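If many partition directories already exist on HDFS, an alternative (assuming the directory layout follows the partcol=value naming convention, as your log line suggests) is to let Hive discover all of them at once:
hive> MSCK REPAIR TABLE your_table;
Here your_table is a placeholder for the actual table name; this only works when the folder names match the partition columns, e.g. transactiondate=2015-01-16/customerid=34560544.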
I have a text file with Snappy compression, partitioned by the field 'process_time' (the result of a Flume job). Example: hdfs://data/mytable/process_time=25-04-2019
This is my script for creating the table:
CREATE EXTERNAL TABLE mytable
(
...
)
PARTITIONED BY (process_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/mytable/'
TBLPROPERTIES("textfile.compress"="snappy");
Queries against this table always return 0 rows (but I know there is some data). Any help?
Thanks!
As you are creating an external table on top of an HDFS directory, you need to run one of the following commands to add the partitions to the Hive table.
If partitions are added to the HDFS directory directly (instead of via insert queries), Hive doesn't know about them, so we need to run either MSCK REPAIR or ADD PARTITION to register the newly added partitions with the Hive table.
To add all partitions to hive table:
hive> msck repair table <db_name>.<table_name>;
(or)
To manually add each partition to hive table:
hive> alter table <db_name>.<table_name> add partition(process_time="25-04-2019")
location '/data/mytable/process_time=25-04-2019';
Is it possible to create n external tables pointing to a single HDFS path using Hive? If yes, what are the advantages and limitations?
It is possible to create many tables (both managed and external at the same time) on top of the same location in HDFS.
Creating tables with exactly the same schema on top of the same data is not useful at all, but you can create tables with a different number of columns, or with differently parsed columns (using a RegexSerDe, for example), so these tables can have different schemas. You can also grant different permissions on these tables in Hive. A table can also be created on top of a sub-folder of some other table's folder, in which case it will contain a subset of the data; better to use partitions in a single table for that.
The drawback is that it is confusing: you can overwrite the same data through more than one table, and you may drop the data accidentally, thinking it belongs only to a table you no longer need.
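As a sketch of the "differently parsed columns" case, a second table over the same location could use a RegexSerDe; the table name and regex here are made up for illustration:
hive> CREATE EXTERNAL TABLE t_regex (id STRING)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    > WITH SERDEPROPERTIES ("input.regex" = "(\\d+)")
    > LOCATION 'hdfs://myhdp/user/hive/warehouse/my.db/t';
Each line that matches the regex populates the column from capture group 1; lines that don't match come back as NULL.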
And here are a few tests:
Create table with INT column:
create table T(id int);
OK
Time taken: 1.033 seconds
Check location and other properties:
hive> describe formatted T;
OK
# col_name data_type comment
id int
# Detailed Table Information
Database: my
Owner: myuser
CreateTime: Fri Jan 04 04:45:03 PST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://myhdp/user/hive/warehouse/my.db/t
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1546605903
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.134 seconds, Fetched: 26 row(s)
Create second table on top of the same location but with STRING column:
hive> create table T2(id string) location 'hdfs://myhdp/user/hive/warehouse/my.db/t';
OK
Time taken: 0.029 seconds
Insert data:
hive> insert into table T values(1);
OK
Time taken: 33.266 seconds
Check data:
hive> select * from T;
OK
1
Time taken: 3.314 seconds, Fetched: 1 row(s)
Insert into second table:
hive> insert into table T2 values( 'A');
OK
Time taken: 23.959 seconds
Check data:
hive> select * from T2;
OK
1
A
Time taken: 0.073 seconds, Fetched: 2 row(s)
Select from first table:
hive> select * from T;
OK
1
NULL
Time taken: 0.079 seconds, Fetched: 2 row(s)
The string was selected as NULL because this table is defined as having an INT column.
And now insert STRING into first table (INT column):
insert into table T values( 'A');
OK
Time taken: 84.336 seconds
Surprise, it is not failing!
What was inserted?
hive> select * from T2;
OK
1
A
NULL
Time taken: 0.067 seconds, Fetched: 3 row(s)
NULL was inserted because, during the previous insert, the string was converted to int, which resulted in NULL.
Now let's try to drop one table and select from another one:
hive> drop table T;
OK
Time taken: 4.996 seconds
hive> select * from T2;
OK
Time taken: 6.978 seconds
It returned 0 rows because the first table was MANAGED, so dropping it also removed the common location.
THE END: the data is removed. Do we still need the T2 table with no data in it?
drop table T2;
OK
The second table is removed; as you can see, it was metadata only. This table was also managed, so DROP TABLE should remove the location with its data as well, but there was already nothing left to remove in HDFS; only the metadata was removed.
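One way to avoid this kind of accidental data loss (a sketch, with an illustrative table name) is to declare tables that share a location as EXTERNAL, so that DROP TABLE removes only the metadata and leaves the files in place:
hive> CREATE EXTERNAL TABLE t_ext (id INT)
    > LOCATION 'hdfs://myhdp/user/hive/warehouse/my.db/t';
hive> DROP TABLE t_ext;   -- metadata only; the files under the location remain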
I have a hive table 'videotracking_playevent' which uses the following partition format (all strings): source/createyear/createmonth/createday.
Example: source=home/createyear=2016/createmonth=9/createday=1
I'm trying to update the partition values of createmonth and createday to consistently use double digits instead.
Example: source=home/createyear=2016/createmonth=09/createday=01
I've tried the following query:
ALTER TABLE videotracking_playevent PARTITION (
source='home',
createyear='2015',
createmonth='11',
createday='1'
) RENAME TO PARTITION (
source='home',
createyear='2015',
createmonth='11',
createday='01'
);
However that returns the following, non-descriptive error from hive: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. null
I've confirmed that this partition exists, and I think I'm using the correct syntax. My hive version is Hive 1.1.0
Any ideas what I might be doing wrong?
There was an issue with renaming partitions in older versions of Hive, which might be the problem in your case too; see the JIRA link below for details.
If you are using an older version of Hive, you need to set the two properties below before executing the rename partition command.
set fs.hdfs.impl.disable.cache=false;
set fs.file.impl.disable.cache=false;
Now run the query after setting these properties:
hive> set fs.hdfs.impl.disable.cache=false;
hive> set fs.file.impl.disable.cache=false;
hive> ALTER TABLE partition_test PARTITION (year='2016',day='1') RENAME TO PARTITION (year='2016',day='01');
OK
Time taken: 0.28 seconds
hive> show partitions partition_test;
OK
year=2016/day=01
Time taken: 0.091 seconds, Fetched: 1 row(s)
hive>
This issue is fixed in the latest Hive versions. In my case the Hive version is 1.2.1 and it works without setting those properties. Please see the example below.
Create a partitioned table.
hive> create table partition_test(
> name string,
> age int)
> partitioned by (year string, day string);
OK
Time taken: 5.35 seconds
hive>
Now add the partition and check the newly added partition.
hive> alter table partition_test ADD PARTITION (year='2016', day='1');
OK
Time taken: 0.137 seconds
hive>
hive> show partitions partition_test;
OK
year=2016/day=1
Time taken: 0.169 seconds, Fetched: 1 row(s)
hive>
Rename the partition using RENAME TO PARTITION command and check it.
hive> ALTER TABLE partition_test PARTITION (year='2016',day='1') RENAME TO PARTITION (year='2016',day='01');
OK
Time taken: 0.28 seconds
hive> show partitions partition_test;
OK
year=2016/day=01
Time taken: 0.091 seconds, Fetched: 1 row(s)
hive>
Hope it helps you.
Rename lets you change the value of a partition column. One of its use cases is normalizing a legacy partition column value to conform to its type. In this case, type conversion and normalization are not enabled for the column values in the old partition_spec, even with the property hive.typecheck.on.insert set to true (the default), which allows you to specify any legacy data in the form of strings in the old partition_spec.
Open bug: https://issues.apache.org/jira/browse/HIVE-10362
You can create a copy of the table without the partitioning, update the partition column values there, and then recreate the original table with partitioning:
CREATE TABLE table_name PARTITIONED BY (table_column) AS
SELECT *
FROM source_table;
That worked for me.
I have an external partitioned table named employee with partitions (year, month, day). Every day a new file arrives and sits at that day's location; for example, today's file will be at 2016/10/13.
TABLE SCHEMA:
create External table employee(EMPID Int,FirstName String,.....)
partitioned by (year string,month string,day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' LOCATION '/.../emp';
So every day we need to run the following command, which works fine:
ALTER TABLE employee ADD IF NOT EXISTS PARTITION (year=2016,month=10,day=14) LOCATION '/.../emp/2016/10/14';
But because we don't want to execute the ALTER TABLE command above manually every day, we tried the command below, and it throws the following error:
hive> MSCK REPAIR TABLE employee;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Note:
hive> MSCK TABLE employee; -- this shows that a partition has not been added to the table
OK
Partitions not in metastore: employee:2016/10/14
Time taken: 1.066 seconds, Fetched: 1 row(s)
Please help me, as I am stuck with this. Do we have any workaround for this type of situation?
I found a workaround for my problem: if the table's static partition path is named like 'year=2016/month=10/day=13', then we can use the commands below, and it works...
set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;
I have created a Hive table, loading data from another table. When I execute the query, it starts but doesn't produce any results.
CREATE TABLE fact_orders1 (order_number String, created timestamp, last_upd timestamp)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;
OK
Time taken: 0.188 seconds
INSERT OVERWRITE TABLE fact_orders1 SELECT * FROM fact_orders;
Query ID = hadoop_20151230051654_78edfb70-4d41-4fa7-9110-fa9a98d5405d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1451392201160_0007, Tracking URL = http://localhost:8088/proxy/application_1451392201160_0007/
Kill Command = /home/hadoop/hadoop-2.6.1/bin/hadoop job -kill job_1451392201160_0007
You have no output from the query because there is no data stored in the table. I assume you use the default warehouse under /user/hive/warehouse, so what you need to do is:
LOAD DATA INPATH '/path/on/hdfs/to/data' OVERWRITE INTO TABLE fact_orders1;
(Drop the LOCAL keyword when the path is on HDFS; LOAD DATA LOCAL INPATH expects a path on the local filesystem.)
That should work.
Also edit your table creation query, adding a LOCATION clause:
CREATE TABLE fact_orders1 (order_number String, created timestamp, last_upd timestamp)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
LOCATION '/user/hive/warehouse/fact_orders1';
If you want to use the data outside the Hive warehouse, you need to use external tables.
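A minimal sketch of such an external table, reusing the columns above with a hypothetical HDFS path:
hive> CREATE EXTERNAL TABLE fact_orders1_ext (order_number string, created timestamp, last_upd timestamp)
    > STORED AS ORC
    > LOCATION '/data/fact_orders1';
Dropping this table later would remove only its metadata; the ORC files under /data/fact_orders1 would stay on HDFS.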