I have created an external table partitioned on two columns, 'country' and 'state', stored as SEQUENCEFILE.
I am now trying to load the data into the table using the following command in Impala run via Hue editor -
load data inpath '/usr/temp/input.txt'
into table partitioned_user
partition (country = 'US', state = 'CA');
I am getting the following error -
AnalysisException: Partition key value may result in loss of precision. Would need to cast ''US'' to 'VARCHAR(64)' for partition column: country
What am I doing wrong? The table I am inserting into has the columns first_name, last_name, country and state, all of type VARCHAR(64).
The file input.txt contains the data only for the first two columns. Where am I going wrong?
Impala does not automatically convert from a larger type to a smaller one. You must CAST() the partition values to VARCHAR(64) to avoid this exception:
partition (country = cast('US' as VARCHAR(64)), state = cast('CA' as VARCHAR(64)))
Or use STRING datatype in table DDL instead.
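For example, the full statement from the question with the suggested casts applied would look like this (a minimal sketch reusing the path and table name above):
load data inpath '/usr/temp/input.txt'
into table partitioned_user
partition (country = cast('US' as VARCHAR(64)),
           state = cast('CA' as VARCHAR(64)));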
I am trying to change the data type of a column from int to double by using the alter command:
ALTER TABLE schema_name.table_name CHANGE COLUMN col1 col1 double CASCADE;
Now, if I run a select query over the table on presto:
select * from schema_name.table_name where partition_column = '2022-12-01'
I get the error:
schema_name.table_name is declared as type double, but the Parquet file (hdfs://ns-platinum-prod-phx/secure/user/hive/warehouse/db_name.db/table_name/partition_column=2022-12-01/000002_0) declares the column as type INT32
However, if I run the query on Hive, it provides me the output.
I tried digging into this by creating a copy of the source table and deleting the partition from HDFS, but I ran into the same problem again. Is there any other way to resolve this, given that the table contains a huge amount of data?
You cannot simply change the data type of the Hive table, because the Parquet files already written to HDFS for older partitions won't get updated.
The only fix is to create a new table and load the data into it from the old table, so that the files are rewritten with the new type.
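A minimal HiveQL sketch of that approach, assuming for illustration a table with only col1 and the partition column from the question (a real table would list all of its columns); the INSERT ... SELECT rewrites the Parquet files so they match the new schema:
-- hypothetical replacement table with the corrected column type
CREATE TABLE schema_name.table_name_new (col1 DOUBLE)
PARTITIONED BY (partition_column STRING)
STORED AS PARQUET;

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- rewriting the data produces Parquet files that declare col1 as DOUBLE
INSERT OVERWRITE TABLE schema_name.table_name_new PARTITION (partition_column)
SELECT CAST(col1 AS DOUBLE), partition_column
FROM schema_name.table_name;
-- once verified, drop the old table and rename the new one in its place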
So I'm trying to run the following simple query on redshift spectrum:
select * from company.vehicles where vehicle_id is not null
and it returns 0 rows (all of the rows in the table appear as NULL). However, when I run the same query on Athena it works fine and returns results. I tried MSCK REPAIR, but both Athena and Redshift use the same metastore, so it shouldn't matter.
I also don't see any errors.
The format of the files is orc.
The create table query is:
CREATE EXTERNAL TABLE `vehicles`(
`vehicle_id` bigint,
`parent_id` bigint,
`client_id` bigint,
`assets_group` int,
`drivers_group` int)
PARTITIONED BY (
`dt` string,
`datacenter` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://company-rt-data/metadata/out/vehicles/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'classification'='orc',
'compressionType'='none')
Any idea?
How did you create your external table?
For Spectrum, you have to explicitly set the parameters that control what should be treated as NULL.
Add 'serialization.null.format'='' to the TABLE PROPERTIES of your external table in Spectrum, so that empty-string values are treated as NULL:
CREATE EXTERNAL TABLE external_schema.your_table_name(
)
row format delimited
fields terminated by ','
stored as textfile
LOCATION [filelocation]
TABLE PROPERTIES('numRows'='100', 'skip.header.line.count'='1','serialization.null.format'='');
Alternatively, you can set up SERDEPROPERTIES while creating the external table, which will automatically recognize NULL values.
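A hedged sketch of that variant, assuming for illustration a single bigint column, a LazySimpleSerDe text layout and a placeholder S3 location:
CREATE EXTERNAL TABLE external_schema.your_table_name (
  vehicle_id bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.null.format' = '')
STORED AS TEXTFILE
LOCATION 's3://your-bucket/your-prefix/';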
Eventually it turned out to be a bug in Redshift. In order to fix it, we needed to run the following command:
ALTER TABLE table_name SET TABLE PROPERTIES ('orc.schema.resolution'='position');
I had a similar problem and found this solution.
In my case I had external tables that were created with Athena pointing to an S3 bucket that contained heavily nested JSON data. To access them with Redshift I ran SET json_serialization_enable TO true; before my queries to make the nested JSON columns queryable. This led to some columns being NULL when the JSON exceeded a size limit; see here:
If the serialization overflows the maximum VARCHAR size of 65535, the cell is set to NULL.
To solve this issue I used Amazon Redshift Spectrum instead of serialization: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html.
I have a table with some records under one user and another, empty table under a different user. I want to migrate the data of that table from one user to the other, but I get the error ORA-01722 because a data type in the target table is slightly mismatched. What should I do to resolve this problem without changing the data type?
In both tables, which belong to different users, only one column has a mismatched data type: LOTFRONTAGE. In the source table its data type is VARCHAR2 and in the target table it is NUMBER.
How can I identify which column has the data type mismatch?
When I insert the data using this SQL query:
insert into md.house(ID,MSSUBCLASS,MSZONING,
CAST(LOTFRONTAGE AS VARCHAR2(15)),LOTAREA,LOTSHAPE,LOTCONFIG,
NEIGHBORHOOD,CONDITION1,BLDGTYPE,OVERALLQUAL,
YEARBUILT,ROOFSTYLE,EXTERIOR1ST,MASVNRAREA)
select ID,MSSUBCLASS,MSZONING,LOTFRONTAGE,
LOTAREA,LOTSHAPE,LOTCONFIG,NEIGHBORHOOD,CONDITION1,
BLDGTYPE,OVERALLQUAL,YEARBUILT,ROOFSTYLE,
EXTERIOR1ST,MASVNRAREA from SYS.HOUSE_DATA;
Then I got an error:
ORA-00917: comma missing
You could try this:
INSERT INTO 2ndTable (ID,...,LOTFRONTAGE,....MASVNAREA)
SELECT ID,...,to_number(LOTFRONTAGE),....MASVNAREA
FROM 1stTable;
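If some LOTFRONTAGE values are not numeric, TO_NUMBER will itself raise ORA-01722; here is a hedged sketch that maps such values to NULL instead (the column list is shortened to ID and LOTFRONTAGE for brevity, and the numeric pattern is an assumption about what counts as a valid number here):
INSERT INTO md.house (ID, LOTFRONTAGE)
SELECT ID,
       CASE
         WHEN REGEXP_LIKE(TRIM(LOTFRONTAGE), '^[0-9]+(\.[0-9]+)?$')
           THEN TO_NUMBER(LOTFRONTAGE)
         ELSE NULL
       END
FROM SYS.HOUSE_DATA;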
There is already a question about Hive in general (Is there a way to alter column type in hive table?). The answer to that question states that it is possible to change the schema with the ALTER TABLE ... CHANGE command.
However, is this also possible if the file is stored as ORC?
You can load the orc file into pyspark:
Load data into a dataframe:
df = spark.read.format("orc").load("<path-of-file-in-hdfs>")
Cast the column to the new type, producing a new dataframe:
df3 = df.withColumn("third_column", df["third_column"].cast("float"))
Save the dataframe to hdfs:
df3.write.format("orc").save("<hdfs-path-where-file-needs-to-be-saved>")
I ran tests on an ORC table. It is possible to convert a string column to a float column.
ALTER TABLE test_orc CHANGE third_column third_column float;
would convert a column called third_column that is marked as a string column to a float column. It is also possible to change the name of a column.
Sidenote: I was curious if other alterations on ORC might create problems. I ran into an exception when I tried to reorder columns.
ALTER TABLE test_orc CHANGE third_column third_column float AFTER first_column;
The exception is: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Reordering columns is not supported for table default.test_orc. SerDe may be incompatible.
I am trying to create dynamic partitions in Hive using the following code.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
create external table if not exists report_ipsummary_hourwise(
ip_address string,imp_date string,imp_hour bigint,geo_country string)
PARTITIONED BY (imp_date_P string,imp_hour_P string,geo_coutry_P string)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://abc';
insert overwrite table report_ipsummary_hourwise PARTITION (imp_date_P,imp_hour_P,geo_country_P)
SELECT ip_address,imp_date,imp_hour,geo_country,
imp_date as imp_date_P,
imp_hour as imp_hour_P,
geo_country as geo_country_P
FROM report_ipsummary_hourwise_Temp;
where the report_ipsummary_hourwise_Temp table contains the following columns:
ip_address, imp_date, imp_hour, geo_country.
I am getting this error
SemanticException Partition spec {imp_hour_p=null, imp_date_p=null,
geo_country_p=null} contains non-partition columns.
Can anybody suggest why this error occurs?
Your INSERT SQL has the column geo_country_P, but the target table's partition column is named geo_coutry_P; it is missing an 'n' in 'country'.
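In other words, either fix the spelling in the table DDL or make the INSERT use the name the table actually has; a minimal sketch of the latter:
insert overwrite table report_ipsummary_hourwise PARTITION (imp_date_P,imp_hour_P,geo_coutry_P)
SELECT ip_address,imp_date,imp_hour,geo_country,
imp_date as imp_date_P,
imp_hour as imp_hour_P,
geo_country as geo_coutry_P
FROM report_ipsummary_hourwise_Temp;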
I was facing the same error. In my case it was caused by extra characters present in the file.
The best solution is to remove all the blank characters and reinsert the data.
It could also be https://issues.apache.org/jira/browse/HIVE-14032
INSERT OVERWRITE command failed with case sensitive partition key names
There is a bug in Hive which makes partition column names case-sensitive.
For me the fix was that the column names in the PARTITION clause of the INSERT and in the PARTITIONED BY clause of the table definition both had to be lower-case. (They can both be upper-case too; because of the Hive bug HIVE-14032, the case just has to match.)
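A hedged sketch of what that looks like, with the PARTITIONED BY clause and the dynamic-partition columns both written in lower case (names taken from the question):
create external table if not exists report_ipsummary_hourwise(
ip_address string,imp_date string,imp_hour bigint,geo_country string)
PARTITIONED BY (imp_date_p string,imp_hour_p string,geo_country_p string)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://abc';

insert overwrite table report_ipsummary_hourwise partition (imp_date_p,imp_hour_p,geo_country_p)
SELECT ip_address,imp_date,imp_hour,geo_country,
imp_date as imp_date_p,
imp_hour as imp_hour_p,
geo_country as geo_country_p
FROM report_ipsummary_hourwise_Temp;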
The error says that while copying the files from the result to HDFS, the job could not recognize the partition location. What I suspect is that the table is partitioned by (imp_date_P, imp_hour_P, geo_country_P), whereas the job is trying to copy to imp_hour_p=null, imp_date_p=null, geo_country_p=null, which doesn't match. Try checking the HDFS location. The other point I can suggest is not to duplicate the column names between the regular columns and the partition columns:
insert overwrite table report_ipsummary_hourwise PARTITION (imp_date_P,imp_hour_P,geo_country_P)
SELECT ip_address,imp_date,imp_hour,geo_country,
imp_date as imp_date_P,
imp_hour as imp_hour_P,
geo_country as geo_country_P
FROM report_ipsummary_hourwise_Temp;
The aliased partition fields (imp_date_P, imp_hour_P, geo_country_P) should have the same names as the partition columns defined on the report_ipsummary_hourwise table.