ALTER TABLE table ADD IF NOT EXISTS PARTITION (state = '34' , city = '123') is not adding the partition in temp/local folder - apache-spark-sql

I am working on Hive table partitioning and am using the Spark client to trigger the requests.
I have created the table and inserted data with a partition. When I execute a select statement I can see the data, but when I add the partition
spark.sql("ALTER TABLE temp_table6 ADD IF NOT EXISTS PARTITION (state = '34' , city = '123')")
a second time, from then on I am not able to get the data.
Since the Spark client looks for the partitioned folder in the temp location temp/temp_table6, Spark throws an exception like the one below:
py4j.protocol.Py4JJavaError: An error occurred while calling o93.showString.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/temp_table6/state=34/city=123
From the second time onwards, the partitioned data is not created under the temp folder.
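For reference, the setup described above can be reproduced with statements along these lines (a sketch only; the question doesn't show the schema, so the columns here are illustrative):
-- Hypothetical schema; only the table name and partition spec come from the question.
CREATE TABLE temp_table6 (id INT, name STRING)
PARTITIONED BY (state STRING, city STRING);
-- Writing a row creates the state=34/city=123 directory under the table location.
INSERT INTO TABLE temp_table6 PARTITION (state = '34', city = '123')
VALUES (1, 'test');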

Related

How to modify CTAS query to append query results to table based on if new partition doesn't exist? - Athena

I have a query that I want to execute daily, partitioned by the date it's executed. The results of this query should be appended to the same table.
My idea was to have something similar to the CREATE TABLE IF NOT EXISTS command that adds the data as a new partition to the existing table each day if the partition doesn't already exist, but I can't figure out how to integrate this into my query.
My query:
CREATE TABLE IF NOT EXISTS db_name.table_name
WITH (
external_location = 's3://my-query-results-location/',
format = 'PARQUET',
parquet_compression = 'SNAPPY',
partitioned_by = ARRAY['date_executed'])
AS
SELECT
{columns_that_I_am_selecting_here_including_'date_executed'}
What this does is create a new table on the first day it's executed, but nothing happens on subsequent days, I'm assuming because CREATE TABLE IF NOT EXISTS detects that the table already exists and doesn't proceed with the rest of the logic.
Is there a way to modify my query to create a table for the first day executed and append the results by a new partition for each subsequent day?
I'm quite sure ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION would not apply to my use case here as I'm running a CTAS query.
You can simply use INSERT INTO existing_table SELECT....
Presumably your table is already partitioned, so include that partition column in the SELECT and Amazon Athena will automatically put the data in the correct directory.
For example, you might include the column like this: SELECT ... CURRENT_DATE AS date_executed
See: INSERT INTO - Amazon Athena
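For example, a minimal sketch of the daily insert, where the source table and the non-partition columns (col_a, col_b) are hypothetical and the partition column comes last, matching the partitioned_by order:
-- col_a, col_b, and source_table are illustrative; date_executed must be last.
INSERT INTO db_name.table_name
SELECT
    col_a,
    col_b,
    CURRENT_DATE AS date_executed
FROM source_table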

Hive external table is unable to read already partitioned hdfs directory

I have a map reduce job that already writes out records to HDFS using the Hive partition naming convention, e.g.:
/user/test/generated/code=1/channel=A
/user/test/generated/code=1/channel=B
After I create an external table, it does not see the partition.
create external table test_1 (id string, name string)
partitioned by (code string, channel string)
STORED AS PARQUET
LOCATION '/user/test/generated'
Even with the alter command
alter table test_1 ADD PARTITION (code = '1', channel = 'A')
it does not see the partition or the records, because
select * from test_1 limit 1
produces 0 results.
If I use an empty location when I create the external table, and then use
load data inpath ...
then it works. But the issue is that there are too many partitions for load data inpath to be practical.
Is there a way to make hive recognize the partition automatically (without doing insert query)?
Using MSCK, it seems to work, but I had to exit the Hive session and connect again.
MSCK REPAIR TABLE test_1

hive query is not working properly

I have created a Hive table and am loading data into it from another table. When I execute the query it starts but doesn't produce any results.
CREATE TABLE fact_orders1 (order_number String, created timestamp, last_upd timestamp)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;
OK Time taken: 0.188 seconds
INSERT OVERWRITE TABLE fact_orders1 SELECT * FROM fact_orders;
Query ID = hadoop_20151230051654_78edfb70-4d41-4fa7-9110-fa9a98d5405d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1451392201160_0007, Tracking URL = http://localhost:8088/proxy/application_1451392201160_0007/
Kill Command = /home/hadoop/hadoop-2.6.1/bin/hadoop job -kill job_1451392201160_0007
You get no output from the query because there is no data stored in the table. I assume you use the default metastore under /user/hive/warehouse, so what you need to do is:
LOAD DATA INPATH '/path/on/hdfs/to/data' OVERWRITE INTO TABLE fact_orders1;
That should work.
Also edit your table creation query, adding a LOCATION clause:
CREATE TABLE fact_orders1 (order_number String, created timestamp, last_upd timestamp)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
LOCATION '/user/hive/warehouse/fact_orders1';
If you want to use the data outside the Hive metastore, you need to use external tables.
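For instance, a sketch of an external variant of the same table (the table name and location here are assumptions, not part of the original answer):
-- Sketch only: an external table keeps its data files when the table is dropped.
CREATE EXTERNAL TABLE fact_orders1_ext (order_number String, created timestamp, last_upd timestamp)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
LOCATION '/user/hadoop/external/fact_orders1';
Dropping an external table removes only the metadata; the files under LOCATION are left in place.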

External table does not return the data in its folder

I have created an external table in Hive at this location:
CREATE EXTERNAL TABLE tb
(
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/data';
The data is present in the folder, but when I query the table it returns nothing. The table's structure matches the structure of the data.
SELECT * FROM tb LIMIT 3;
Is there a kind of permission issue with Hive tables: do specific users have permissions to query some tables?
Do you know some solutions or workarounds?
You have created your table as a partitioned table based on the column datehour, but you are putting your data directly in /user/cloudera/data. Hive will look for data in /user/cloudera/data/datehour=(some int value). Since it is an external table, Hive will not update the metastore automatically; you need to run an alter statement to update it.
So here are the steps for external tables with partition:
1.) In your external location /user/cloudera/data, create a directory datehour=0909201401
OR
Load data using: LOAD DATA [LOCAL] INPATH '/path/to/data/file' INTO TABLE tb PARTITION (datehour=0909201401)
2.) After creating your table, run an alter statement:
ALTER TABLE tb ADD PARTITION (datehour=0909201401)
Hope it helps...!!!
When we create an EXTERNAL TABLE with PARTITION, we have to ALTER the EXTERNAL TABLE with the data location for that given partition. However, it need not be the same path as we specify while creating the EXTERNAL TABLE.
hive> ALTER TABLE tb ADD PARTITION (datehour=0909201401)
    > LOCATION '/user/cloudera/data/somedatafor_datehour';
When we specify LOCATION '/user/cloudera/data' (though it's optional) while creating an EXTERNAL TABLE, we can take advantage of repair operations on that table. So when we want to copy files through some process like ETL into that directory, we can sync up the partitions with the EXTERNAL TABLE instead of writing an ALTER TABLE statement for each new partition.
If we already know the directory structure of the partition that HIVE would create, we can simply place the data file in that location like '/user/cloudera/data/datehour=0909201401/data.txt' and run the statement as shown below:
hive> MSCK REPAIR TABLE tb;
The above statement will sync up the partitions with the Hive metastore for the table "tb".

How to Update/Drop a Hive Partition?

After adding a partition to an external table in Hive, how can I update/drop it?
You can update a Hive partition by, for example:
ALTER TABLE logs PARTITION(year = 2012, month = 12, day = 18)
SET LOCATION 'hdfs://user/darcy/logs/2012/12/18';
This command does not move the old data, nor does it delete the old data. It simply sets the partition to the new location.
To drop a partition, you can do
ALTER TABLE logs DROP IF EXISTS PARTITION(year = 2012, month = 12, day = 18);
In addition, you can drop multiple partitions in one statement (see Dropping multiple partitions in Impala/Hive).
Extract from above link:
hive> alter table t drop if exists partition (p=1),partition (p=2),partition(p=3);
Dropped the partition p=1
Dropped the partition p=2
Dropped the partition p=3
OK
EDIT 1:
Also, you can drop partitions in bulk using a comparison operator (>, <, <>), for example:
ALTER TABLE t
DROP PARTITION (PART_COL > 1);
ALTER TABLE table_name DROP PARTITION (partition_col = 'value');
You can either copy files into the folder where the external partition is located, or use an
INSERT OVERWRITE TABLE tablename1 PARTITION (partcol1=val1, partcol2=val2...)...
statement.
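For example, a sketch with concrete names (the staging table and its columns are hypothetical; the logs table is from the answer above):
-- Overwrite a single partition of logs from a hypothetical staging table.
INSERT OVERWRITE TABLE logs PARTITION (year = 2012, month = 12, day = 18)
SELECT message, host FROM staging_logs;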
You may also need to make the database containing the table active:
use [dbname]
otherwise you may get an error (even if you specify the database, i.e. dbname.table):
FAILED Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter partition. Unable to alter partitions because table or database does not exist.
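A minimal sketch (the database and table names here are hypothetical):
-- Make the database active first, then run the partition DDL.
USE my_db;
ALTER TABLE my_table DROP IF EXISTS PARTITION (year = 2012);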