S3 folder structure for hive partitioned tables without "=" in it - amazon-s3

I have an existing S3 folder structure like this,
s3://mydata/{country}/{date}/
{country} could be any of 30 different countries
{date} could be any date since 20150101
How can I read this in Hive, treating {country} as a partition and {date} as a sub-partition?

You can use the Hive DDL statement ALTER TABLE ADD PARTITION
ALTER TABLE mydata
ADD PARTITION (country='south-africa', date='20191024')
LOCATION 's3://mydata/south-africa/20191024/';
You can script this with a shell script, passing each statement to Hive like hive -e 'ALTER TABLE $TABLE ADD PARTITION $PARTITION_SPEC LOCATION $PARTITION_LOCATION'.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions
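As a minimal sketch of that scripting approach, assuming the AWS CLI is available and the bucket contains only the {country}/{date}/ folders described in the question (the table name mydata and the bucket layout are taken from above; everything else is illustrative):
#!/bin/bash
# List every country/ and date/ prefix in the bucket and register each
# pair as a partition. `date` is backtick-quoted because it is a reserved
# word in recent Hive versions.
TABLE=mydata
for c in $(aws s3 ls s3://mydata/ | awk '{print $2}'); do
  country=${c%/}
  for d in $(aws s3 ls "s3://mydata/${country}/" | awk '{print $2}'); do
    dt=${d%/}
    hive -e "ALTER TABLE ${TABLE} ADD IF NOT EXISTS PARTITION (country='${country}', \`date\`='${dt}') LOCATION 's3://mydata/${country}/${dt}/';"
  done
done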

Related

Alter Hive partition column name NOT changed in HDFS

When I rename a partition of a partitioned table (named partitioned_table), the corresponding directory in HDFS does not change. However, dropping and moving partitions are reflected in HDFS, and the renamed partition does show up in "show partitions partitioned_table".
Hive version is 4.0.0-alpha-2.
I use the statement below to alter the partition name.
ALTER TABLE table_name PARTITION
(partition_column = partition_col_value,
partition_column = partition_col_value)
RENAME TO PARTITION (partition_column = partition_col_value,
partition_column = partition_col_value);
Why doesn't the corresponding directory in HDFS change when renaming a partition in Hive, and how can I change it?
When you alter a partition, it only affects the Hive metastore; it never affects the data in HDFS. For that, you need to explicitly insert data into the Hive table at that partition, or issue an hdfs mv command and then run an MSCK REPAIR TABLE query to fix the metadata.
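A minimal sketch of the second option, assuming the table lives under /user/hive/warehouse/partitioned_table and that the old and new partition directory names are purely illustrative:
# Rename the partition directory by hand, then re-sync the metastore.
hdfs dfs -mv /user/hive/warehouse/partitioned_table/part_col=old /user/hive/warehouse/partitioned_table/part_col=new
# On Hive 3.0+ (the question mentions 4.0.0-alpha-2), SYNC PARTITIONS also
# drops the metastore entry that now points at the old, missing directory.
hive -e "MSCK REPAIR TABLE partitioned_table SYNC PARTITIONS;"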

How to create table over partitioned data

I have text files with snappy compression, partitioned by the field 'process_time' (the result of a Flume job). Example: hdfs://data/mytable/process_time=25-04-2019
This is my script for creating the table:
CREATE EXTERNAL TABLE mytable
(
...
)
PARTITIONED BY (process_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/mytable/'
TBLPROPERTIES("textfile.compress"="snappy");
Queries against this table always return 0 rows (but I know that there is some data). Any help?
Thanks!
As you are creating an external table on top of an HDFS directory, you need to run one of the commands below to add the partitions to the Hive table.
If a partition is added to HDFS directly (instead of through insert queries), Hive doesn't know about the newly added partitions, so you need to run either MSCK REPAIR or ADD PARTITION to register them with the Hive table.
To add all partitions to hive table:
hive> msck repair table <db_name>.<table_name>;
(or)
To manually add each partition to hive table:
hive> alter table <db_name>.<table_name> add partition(process_time="25-04-2019")
location '/data/mytable/process_time=25-04-2019';

Hive table not recognising partition

My Hive table is partitioned on the column 'job_id'. When I dump data into the HDFS location of the table, it creates a partition directory named 'JOB_ID', and my Hive table is not recognizing it.
I have tried the msck repair table command, but that didn't help either.
For external Hive tables you need to add the new partition manually, as follows:
ALTER TABLE table_name ADD PARTITION (job_id='927') location 'hdfs://some_location/job_id=927'
I found out that the partition directory name should always be in lowercase letters.
Here is the link:
https://medium.com/a-muggles-pensieve/hive-partition-column-name-camelcase-bad-idea-bc203d6e65da
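As a hypothetical sketch (the path and the value 927 follow the ALTER TABLE example above), the upper-case directory can also be renamed to the lower-case form Hive expects and the metastore re-synced:
# Rename the directory written with the upper-case column name, then let
# MSCK REPAIR pick up the now correctly named partition.
hdfs dfs -mv hdfs://some_location/JOB_ID=927 hdfs://some_location/job_id=927
hive -e "MSCK REPAIR TABLE table_name;"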

dropping hive partition dynamically

I have a Hive table with daily partitions, something like below (which includes future dates' partitions as well):
20160901
20160902
........
........
........
20160931
20161001
20161002
I want to pass one date, say yesterday's date 20160922, and dynamically drop all partitions that are >= 20160922 (today is 20160923, but I want to drop from 20160922 onward).
How can I drop all these partitions dynamically?
You cannot do this in Hive directly, as it does not support dynamic SQL.
You can work around this with a shell script (or any script) that generates a file of drop-partition statements like the ones below.
alter table partition_t drop if exists partition (y=20160922 );
alter table partition_t drop if exists partition (y=20160921 );
alter table partition_t drop if exists partition (y=20160920 );
...
Then run hive -v -f ./file.sh
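A minimal sketch of that workaround, assuming GNU date is available and an arbitrary upper bound on the date range (the table and partition column names follow the answer above):
#!/bin/bash
# Generate one DROP PARTITION statement per date from the cutoff onward
# and run them in a single Hive session.
CUTOFF=20160922
END=20161002            # assumed last partition; adjust to your table
> drop_partitions.hql
d="$CUTOFF"
while [ "$d" -le "$END" ]; do
  echo "alter table partition_t drop if exists partition (y=${d});" >> drop_partitions.hql
  d=$(date -d "${d} + 1 day" +%Y%m%d)
done
hive -v -f drop_partitions.hql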
alter table partition_t drop if exists partition
Before inserting data into the table, perform the steps below.
1) Go to the HDFS folder of that table and delete all the folders inside the table directory using shell commands: hadoop fs -rm -r <>
2) Run MSCK REPAIR TABLE to update the metadata about partitions.
The two steps above will delete all the existing partitions.
Now insert your new data.
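A short sketch of those two steps, reusing the table name partition_t from the earlier answer and an illustrative warehouse path:
# 1) Remove every partition directory under the table's location.
hadoop fs -rm -r /user/hive/warehouse/partition_t/y=*
# 2) Run MSCK REPAIR to update the partition metadata, as the answer suggests.
hive -e "MSCK REPAIR TABLE partition_t;"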
You can drop partitions by giving a range filter. For reference, see this answer: https://stackoverflow.com/a/48422251/3132181
So your code could look like this:
ALTER TABLE mytable DROP PARTITION (datehour >= '20160922');

External table does not return the data in its folder

I have created an external table in Hive at this location:
CREATE EXTERNAL TABLE tb
(
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/data';
The data is present in the folder, but when I query the table, it returns nothing. The table is structured so that it matches the structure of the data.
SELECT * FROM tb LIMIT 3;
Is there a kind of permission issue with Hive tables: do specific users have permissions to query some tables?
Do you know some solutions or workarounds?
You have created your table as a partitioned table based on the column datehour, but you are putting your data directly in /user/cloudera/data. Hive will look for data in /user/cloudera/data/datehour=(some int value). Since it is an external table, Hive will not update the metastore automatically; you need to run an ALTER statement to update it.
So here are the steps for external tables with partitions:
1.) In your external location /user/cloudera/data, create a directory datehour=0909201401
OR
Load data using: LOAD DATA [LOCAL] INPATH '/path/to/data/file' INTO TABLE tb PARTITION (datehour=0909201401)
2.) After creating your table, run an ALTER statement:
ALTER TABLE tb ADD PARTITION (datehour=0909201401);
Hope it helps...!!!
When we create an EXTERNAL TABLE with PARTITION, we have to ALTER the EXTERNAL TABLE with the data location for that given partition. However, it need not be the same path as we specify while creating the EXTERNAL TABLE.
hive> ALTER TABLE tb ADD PARTITION (datehour=0909201401)
    > LOCATION '/user/cloudera/data/somedatafor_datehour';
When we specify LOCATION '/user/cloudera/data' (though it's optional) while creating an EXTERNAL TABLE, we can take advantage of repair operations on that table. So when we want to copy files into that directory through some process like ETL, we can sync up the partitions with the EXTERNAL TABLE instead of writing an ALTER TABLE statement to create each new partition.
If we already know the directory structure of the partition that Hive would create, we can simply place the data file in that location, like '/user/cloudera/data/datehour=0909201401/data.txt', and run the statement shown below:
hive> MSCK REPAIR TABLE tb;
The above statement will sync the partitions into the Hive metastore for the table "tb".