Inserting a pyspark dataframe to an existing partitioned hive table


I have a hive table which is partitioned by column inserttime.
I have a pyspark dataframe which has the same columns as the table except for the partition column.
The following works well when the table is not partitioned:
df.insertInto('tablename',overwrite=True)
But I cannot figure out how to insert into a particular partition from pyspark.
I tried the following:
df.insertInto('tablename',overwrite=True,partition(inserttime='20170818-0831'))
but it failed with
SyntaxError: non-keyword arg after keyword arg
I am using pyspark 1.6.

One option is:
df.registerTempTable('tab_name')
hiveContext.sql("insert overwrite table target_tab partition(insert_time=value) select * from tab_name ")
Another option is to add the static partition value as the last column of the dataframe and use insertInto() in dynamic partition mode.
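The first option can be wrapped in a small helper. This is only a minimal sketch under stated assumptions: build_partition_insert is a hypothetical name, the table and column names are taken from the question, and the partition column is assumed to be string-typed. The helper only builds the statement text; on pyspark 1.6 the result would be passed to hiveContext.sql(...) after registering the dataframe as a temp table.

```python
def build_partition_insert(target_table, temp_table, partition_col, partition_value):
    """Build an INSERT OVERWRITE statement for a single static partition.

    All arguments are placeholders supplied by the caller. The value is
    written as a string literal, which assumes a string-typed partition column.
    """
    # Refuse values that would break out of the quoted literal.
    if "'" in partition_value:
        raise ValueError("partition value must not contain a single quote")
    return (
        "INSERT OVERWRITE TABLE {target} PARTITION ({col}='{val}') "
        "SELECT * FROM {temp}"
    ).format(target=target_table, col=partition_col, val=partition_value, temp=temp_table)


statement = build_partition_insert("tablename", "tab_name", "inserttime", "20170818-0831")
print(statement)

# Assumed usage on pyspark 1.6, where hiveContext is an existing HiveContext:
#   df.registerTempTable("tab_name")
#   hiveContext.sql(statement)
```

Because the select is select *, the dataframe columns need to be in the same order as the non-partition columns of the target table.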

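You can use df.write.mode("overwrite").partitionBy("inserttime").saveAsTable("TableName"), provided the dataframe contains the inserttime column. Be aware that overwrite mode here applies to the table as a whole, not just to one partition.
Or you can overwrite the data in the partition directory itself. In pyspark the mode is passed as a string (SaveMode.Overwrite is the Scala/Java form), and the partition directory name carries no quotes:
df.write.mode("overwrite").save("location/inserttime=20170818-0831")
Hope this helps.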

Related

Is conversion of int to double of a column valid in presto?

I am trying to change the data type of a column from int to double by using the alter command:
ALTER TABLE schema_name.table_name CHANGE COLUMN col1 col1 double CASCADE;
Now, if I run a select query over the table on presto:
select * from schema_name.table_name where partition_column = '2022-12-01'
I get the error:
schema_name.table_name is declared as type double, but the Parquet file (hdfs://ns-platinum-prod-phx/secure/user/hive/warehouse/db_name.db/table_name/partition_column=2022-12-01/000002_0) declares the column as type INT32
However, if I run the query on Hive, it provides me the output.
I tried digging into this by creating a copy of the source table and deleting the partition from HDFS. However, I ran into the same problem again. Is there any other way to resolve this, given that the table contains a huge amount of data?

You cannot change the data type of the Hive table this way: the ALTER only changes the table definition, while the Parquet files already created in HDFS for older partitions are not updated and still declare the column as INT32.
The only fix is to create a new table with the new type and load the data into it from the older table.

Can we add column to an existing table in AWS Athena using SQL query?

I have a table in AWS Athena which contains 2 records. Is there a SQL query with which a new column can be added to the table?

You can find more information about adding columns to a table in the Athena documentation.
Alternatively, you can use CTAS (CREATE TABLE AS SELECT).
For example, given a table created with
CREATE EXTERNAL TABLE sample_test(
id string)
LOCATION
's3://bucket/path'
and you can create another table from sample_test with the query
CREATE TABLE new_test
AS
SELECT *, 'new' AS new_col FROM sample_test
You can use any valid query after AS.

This is mainly for future readers who, like me, struggled to get this working for a Hive table with Avro data and who want to update the schema of the existing table rather than create a new one. Using 'add columns' alone works for CSV, but not for Hive + Avro. For Hive + Avro, the solution for appending columns at the end, before the partition columns, is available at this link. There are a couple of things to note: we need to pass the full schema to the literal attribute, not just the changes; and (not sure why) we had to alter the Hive table in all three ways, in this order: 1. add the columns using add columns, 2. set tblproperties, and 3. set serdeproperties. Hopefully this helps someone.
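The three-step order described above can be sketched as a helper that generates the statements. This is an illustration only, not the linked solution: avro_alter_statements is a hypothetical name, the example table and fields are made up, and it assumes that "the literal attribute" refers to the avro.schema.literal property of Hive's Avro SerDe.

```python
import json


def avro_alter_statements(table, new_columns, full_schema):
    """Return the three ALTER TABLE statements in the order given in the answer.

    new_columns: list of (name, hive_type) pairs to append.
    full_schema: the complete Avro schema as a dict, including the new fields.
    """
    # The full schema is serialized, not just the added fields.
    schema_literal = json.dumps(full_schema, separators=(",", ":"))
    if "'" in schema_literal:
        raise ValueError("schema contains a single quote; escape it before use")
    column_list = ", ".join(
        "{} {}".format(name, hive_type) for name, hive_type in new_columns
    )
    return [
        "ALTER TABLE {} ADD COLUMNS ({})".format(table, column_list),
        "ALTER TABLE {} SET TBLPROPERTIES ('avro.schema.literal'='{}')".format(
            table, schema_literal
        ),
        "ALTER TABLE {} SET SERDEPROPERTIES ('avro.schema.literal'='{}')".format(
            table, schema_literal
        ),
    ]


# Hypothetical table "events" gaining one nullable string column.
example_schema = {
    "type": "record",
    "name": "events",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "new_col", "type": ["null", "string"], "default": None},
    ],
}
statements = avro_alter_statements("events", [("new_col", "string")], example_schema)
for statement in statements:
    print(statement + ";")
```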

Is there a way to alter column in a hive table that is stored as ORC?

There is already a question on Hive in general (Is there a way to alter column type in hive table?). The answer to that question states that it is possible to change the schema with the alter table change command.
However, is this also possible if the file is stored as ORC?

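You can load the ORC file into pyspark and rewrite it with the new type:
Load the data into a dataframe:
df = spark.read.format("orc").load("<path-of-file-in-hdfs>")
Create a view over the dataframe (createOrReplaceTempView returns None, so its result is not assigned):
df.createOrReplaceTempView('Table')
Create a new dataframe with the manipulated column. The other columns are listed explicitly, because select * would add a second column named third_column; first_column and second_column are placeholder names:
df3 = spark.sql("select first_column, second_column, cast(third_column as float) as third_column from Table")
Save the dataframe to hdfs:
df3.write.format("orc").save("<hdfs-path-where-file-needs-to-be-saved>")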

I ran tests on an ORC table. It is possible to convert a string column to a float column.
ALTER TABLE test_orc CHANGE third_column third_column float;
would convert a column called third_column that is marked as a string column to a float column. It is also possible to change the name of a column.
Sidenote: I was curious if other alterations on ORC might create problems. I ran into an exception when I tried to reorder columns.
ALTER TABLE test_orc CHANGE third_column third_column float AFTER first_column;
The exception is: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Reordering columns is not supported for table default.test_orc. SerDe may be incompatible.

SemanticException Partition spec {col=null} contains non-partition columns

I am trying to create dynamic partitions in hive using following code.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
create external table if not exists report_ipsummary_hourwise(
ip_address string,imp_date string,imp_hour bigint,geo_country string)
PARTITIONED BY (imp_date_P string,imp_hour_P string,geo_coutry_P string)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://abc';
insert overwrite table report_ipsummary_hourwise PARTITION (imp_date_P,imp_hour_P,geo_country_P)
SELECT ip_address,imp_date,imp_hour,geo_country,
imp_date as imp_date_P,
imp_hour as imp_hour_P,
geo_country as geo_country_P
FROM report_ipsummary_hourwise_Temp;
Where report_ipsummary_hourwise_Temp table contains following columns,
ip_address,imp_date,imp_hour,geo_country.
I am getting this error
SemanticException Partition spec {imp_hour_p=null, imp_date_p=null, geo_country_p=null} contains non-partition columns.
Can anybody suggest why this error occurs?

Your insert statement uses the column geo_country_P, but the partition column in the target table is named geo_coutry_P. The table definition is missing an "n" in "country".
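A mismatch like this is easy to miss by eye. As a rough sketch (check_partition_spec is a hypothetical helper, not part of Hive), the names in the PARTITION clause can be compared against the names in the table's PARTITIONED BY clause before running the insert:

```python
def check_partition_spec(spec_columns, table_partition_columns):
    """Classify each column of a PARTITION (...) clause against the table definition.

    Returns a dict mapping each spec column to "ok", "case differs", or "unknown".
    """
    declared = set(table_partition_columns)
    declared_lower = {name.lower() for name in table_partition_columns}
    result = {}
    for column in spec_columns:
        if column in declared:
            result[column] = "ok"
        elif column.lower() in declared_lower:
            # Same name apart from upper/lower case.
            result[column] = "case differs"
        else:
            # No declared partition column with this name, e.g. a typo.
            result[column] = "unknown"
    return result


report = check_partition_spec(
    ["imp_date_P", "imp_hour_P", "geo_country_P"],  # PARTITION clause of the insert
    ["imp_date_P", "imp_hour_P", "geo_coutry_P"],   # PARTITIONED BY of the create table
)
print(report)
```

With the names from the question, geo_country_P is reported as "unknown", which points at the misspelled column in the table definition.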

I was facing the same error. In my case it was caused by extra characters present in the file.
The best solution is to remove all the blank characters and re-insert the data if you want.

It could also be HIVE-14032 (https://issues.apache.org/jira/browse/HIVE-14032):
INSERT OVERWRITE command failed with case sensitive partition key names
There is a bug in Hive which makes partition column names case-sensitive.
For me, the fix was that the column names in the table and in the PARTITIONED BY clause of the table definition both had to be lower-case. (They can both be upper-case too; because of HIVE-14032 the case just has to match.)

The error suggests that, while copying the result files to HDFS, the job could not recognize the partition location. I suspect the table is partitioned by (imp_date_P, imp_hour_P, geo_country_P), whereas the job is trying to write to imp_hour_p=null, imp_date_p=null, geo_country_p=null, which does not match. Try checking the HDFS location. I would also suggest not duplicating the same data as both a regular column and a partition column.

insert overwrite table report_ipsummary_hourwise PARTITION (imp_date_P,imp_hour_P,geo_country_P)
SELECT ip_address,imp_date,imp_hour,geo_country,
imp_date as imp_date_P,
imp_hour as imp_hour_P,
geo_country as geo_country_P
FROM report_ipsummary_hourwise_Temp;
The partition fields (imp_date_P, imp_hour_P, geo_country_P) should have the same names as the partition columns defined in the report_ipsummary_hourwise table.

How can I create a partitioned table 'like' an unpartitioned table with Hive HQL?

I've got a table with two weeks worth of entries, and I would like to copy those entries into a table partitioned by date (creating it if it does not exist).
I'm writing a luigi task to do this, and I would love for it to be independent of the table schema--i.e. I wouldn't have to specify column names and types, and it would CREATE TABLE IF NOT EXISTS when necessary.
I was hoping I could use:
CREATE TABLE IF NOT EXISTS test_part
COMMENT 'This is a test table to see if partitioning works in this case'
PARTITIONED BY (event_date string)
AS select *, '2014-12-15' from source_db.source_table
where event_at <'2014-12-16' and event_at >='2014-12-15';
But this of course fails with: FAILED: SemanticException [Error 10068]: CREATE-TABLE-AS-SELECT does not support partitioning in the target table
I tried again with "like" with basically the same results. Is there a way to do this that I am missing? It doesn't have to be atomic. Multiple sequential commands are fine.

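Do not use create table as select here.
Create the partitioned table first, taking the column names and types from describe source_table, and then run an insert into table test_part partition (event_date='2014-12-15') that selects the rows from the source table.
Doing it in two steps works better.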
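Since the question asks for something independent of the table schema, the two steps can be generated from a column list. The following is a minimal sketch under assumptions: partitioned_copy_statements is a hypothetical helper, the columns event_id and event_at with their types are made up for illustration, and in a real task the list would be parsed from the output of describe source_table.

```python
def partitioned_copy_statements(source, target, columns, partition_col,
                                partition_value, where):
    """Return [create, insert] statements for copying rows into one partition.

    columns: list of (name, type) pairs describing the source table.
    The partition column is assumed to be a string, as in the question.
    """
    column_defs = ", ".join("{} {}".format(name, col_type) for name, col_type in columns)
    column_names = ", ".join(name for name, _ in columns)
    create = (
        "CREATE TABLE IF NOT EXISTS {target} ({defs}) "
        "PARTITIONED BY ({pcol} string)"
    ).format(target=target, defs=column_defs, pcol=partition_col)
    insert = (
        "INSERT INTO TABLE {target} PARTITION ({pcol}='{pval}') "
        "SELECT {names} FROM {source} WHERE {where}"
    ).format(target=target, pcol=partition_col, pval=partition_value,
             names=column_names, source=source, where=where)
    return [create, insert]


create_stmt, insert_stmt = partitioned_copy_statements(
    source="source_db.source_table",
    target="test_part",
    columns=[("event_id", "bigint"), ("event_at", "string")],  # hypothetical schema
    partition_col="event_date",
    partition_value="2014-12-15",
    where="event_at >= '2014-12-15' AND event_at < '2014-12-16'",
)
print(create_stmt + ";")
print(insert_stmt + ";")
```

Running the create statement with IF NOT EXISTS before every insert keeps the task repeatable, at the cost of not being atomic, which the question says is acceptable.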
