Is conversion of a column from int to double valid in Presto? - sql

I am trying to change the data type of a column from int to double by using the alter command:
ALTER TABLE schema_name.table_name CHANGE COLUMN col1 col1 double CASCADE;
Now, if I run a select query over the table in Presto:
select * from schema_name.table_name where partition_column = '2022-12-01'
I get the error:
"schema_name.table_name is declared as type double, but the Parquet file (hdfs://ns-platinum-prod-phx/secure/user/hive/warehouse/db_name.db/table_name/partition_column=2022-12-01/000002_0) declares the column as type INT32"
However, if I run the same query in Hive, it returns the output.
I tried digging into this by creating a copy of the source table and deleting the partition from HDFS, but I ran into the same problem again. Is there any other way to resolve this, given that the table contains a huge amount of data?

You cannot change the data type of the Hive column this way, because the Parquet files already written to HDFS for the older partitions won't get rewritten: the metastore now says double, but the files still declare INT32.
The only fix is to create a new table with the desired type and load the data into it from the old table.
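A minimal sketch of that rebuild, assuming a partitioned Parquet table like the one in the question (the _new table name and col2 are hypothetical; col1 and partition_column are from the question):
-- allow the old partitions to be recreated dynamically
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- new table with the desired type
CREATE TABLE schema_name.table_name_new (
  col1 DOUBLE,
  col2 STRING
)
PARTITIONED BY (partition_column STRING)
STORED AS PARQUET;
-- reload: the new Parquet files are written with col1 as DOUBLE
INSERT OVERWRITE TABLE schema_name.table_name_new PARTITION (partition_column)
SELECT CAST(col1 AS DOUBLE), col2, partition_column
FROM schema_name.table_name;
Once the data is verified, the old table can be dropped and the new one renamed into its place with ALTER TABLE ... RENAME TO.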

Related

Alter column datatype in Hive table with cascade not flowing in Parquet partitions

I'm trying to alter the column data type from int to bigint in Hive as below
ALTER TABLE <TABLE_NAME> CHANGE COLUMN <COLUMN_NAME> <COLUMN_NAME> BIGINT CASCADE
The Hive metastore is updated successfully, but the Parquet file schema is not, so querying the data in PySpark throws an error:
parquet column cannot be converted in file expected int32 found int64
Some forums suggested recreating the data, but the data size is in TBs and that takes a huge amount of time, and I need to do this for around 10 tables.
Is there any way to change the column type in the Parquet files as well?

How to import daily csv data into table with generated columns postgres

I'm new to PostgreSQL and am looking for some guidance and best practices.
I have created a table by importing data from a csv file. I then altered the table by creating multiple generated columns like this:
ALTER TABLE master
ADD office VARCHAR(50)
GENERATED ALWAYS AS (CASE WHEN LEFT(location,4)='Chic' THEN 'CHI'
ELSE LEFT(location,strpos(location,'_')-1) END) STORED;
But when I try to import new data into the table I get the following error:
ERROR: column "office" is a generated column
DETAIL: Generated columns cannot be used in COPY.
My goal is to be able to import new data each day to the table and have the generated columns automatically populate in order to transform the data as I would like. How can I do so?
CREATE TEMP TABLE master (location VARCHAR);
ALTER TABLE master
ADD office VARCHAR
GENERATED ALWAYS AS (
CASE
WHEN LEFT(location, 4) = 'Chic' THEN 'CHI'
ELSE LEFT(location, strpos(location, '_') - 1)
END
) STORED;
--INSERT INTO master (location) VALUES ('Chicago');
--INSERT INTO master (location) VALUES ('New_York');
COPY master (location) FROM $$d:\cities.csv$$ CSV;
SELECT * FROM master;
Is this the structure and the behaviour you are expecting? If not, please provide more details regarding your table structure, your importable data and your importing commands.
Also, maybe when you try to import the csv file, the columns are not linked properly, or maybe the delimiter is not properly set. Try to specify each column in the exact order in which they appear in your csv file.
https://www.postgresql.org/docs/12/sql-copy.html
Note: d:\cities.csv contains:
Chicago
New_York
EDIT:
If column positions are mixed up between the table and the csv, the following operation may come in handy:
1. create temporary table tmp (csv_column1 <data_type>, csv_column_2 <data_type>, ...); (including ALL csv columns)
2. copy tmp from '/path/to/file.csv';
3. insert into master (location, other_info, ...) select csv_column_3 as location, csv_column_7 as other_info, ... from tmp;
Importing data using an intermediate table may slow things down a little, but gives you a lot of flexibility.
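As a concrete sketch of that staging-table approach, assuming (hypothetically) that the csv holds other_info first and location second, and that your real master table has both columns:
-- staging table mirrors the csv layout exactly
create temporary table tmp (csv_other_info varchar(50), csv_location varchar(50));
copy tmp from '/path/to/file.csv' csv;
-- remap columns into the target table; the generated office column
-- is not listed, so PostgreSQL fills it in automatically
insert into master (location, other_info)
select csv_location, csv_other_info from tmp;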
I was getting the same error when importing to PG from a csv - I found that even though my column was generated, I still had to have it in the imported data, just left it empty. Worked fine when the column name was in there and mapped to my DB col name.

incompatible Parquet schema for column "ex: x is of type String" Column type: STRING, Parquet schema

I have an existing external table, called for example YYYYYY, that contains n columns and is loaded daily with extract_date as the partition column.
We got a request from the business to add a few more columns to the existing table. To implement this, we did the following.
DROP existing partitions from Hive
alter table xxxx.yyyyyy add columns (
`c10` string COMMENT '',
`b` string COMMENT '',
`c11` string COMMENT '',
`c12` string COMMENT '',
`c13` string COMMENT '',
`c14` string COMMENT '',
`c15` string COMMENT '') CASCADE;
alter table xxxx.yyyyyyy change `c8` `c8` string COMMENT '' after `c7` CASCADE;
After I did the above 2 steps, I went to Hive and did MSCK REPAIR TABLE xxxx.yyyyyy;
Partitions were added (there are partitions going back to 2018) along with my new fields.
Before the changes I was able to query the data from both Impala and Hive, but after executing the ALTER commands I get the error below.
> select * from xxxx.yyyyyyy where extract_date like '2019%';
Query: select * from XXXXX.YYYYYYY where extract_date like '2019%'
Query submitted at: 2020-05-09 11:57:10 (Coordinator: ...)
ERROR: ... incompatible Parquet schema for column 'xxxx.yyyyyyy.c9'. Column type: STRING, Parquet schema:
optional fixed_len_byte_array a_auth [i:12 d:1 r:0]
Whereas in Hive I am able to browse the data with no issues. So I have an issue only in Impala.
Troubleshooting steps:
Created a new table without the additional columns, pointed its external path to a new location, and copied the previously created partitions to the new path.
MSCK REPAIR TABLE TABLE NAME;
A select query then works in both Impala and Hive.
Added the additional fields to the newly created table with the ALTER commands, then did the following:
MSCK REPAIR TABLE TABLE NAME;
In Impala:
REFRESH TABLE TABLE NAME;
INVALIDATE METADATA TABLE NAME;
This time the select query worked in Hive, but in Impala I got the above-mentioned error.
Can someone guide me on why this is happening and how to fix this issue?
Impala Shell v2.12.0-cdh5.16.2
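For reference, a sketch of the metadata-refresh statements referred to above, in their usual Hive and Impala syntax (the db and table names are placeholders):
-- in Hive: re-register the partition directories
MSCK REPAIR TABLE xxxx.yyyyyy;
-- in impala-shell: pick up new files and the changed schema
REFRESH xxxx.yyyyyy;
INVALIDATE METADATA xxxx.yyyyyy;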

Creation of a partitioned external table with hive: no data available

I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive but I cannot see any data:
Since it's an external table, data insertion should be done automatically, right?
Your file should be in this sequence:
int,string
Here your file contents are in the sequence below:
string, int
Change your file to the below:
86,"2016-08-20"
78,"2016-08-21"
It should work.
Also, it is not recommended to use keywords as column names (date).
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder containing a csv file that matches the structure of the table, the data will be loaded automatically and you can use select queries to see it.
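For example, with a hypothetical file name, the layout on HDFS would look like:
/flumania/google_analytics/date_string=2016-09-06/data.csv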
Solved!

Alter the data type of a column in MonetDB

How can I alter the type of a column in an existing table in MonetDB? According to the documentation the code should be something like
ALTER TABLE <tablename> ALTER COLUMN <columnname> SET ...
but then I am basically lost, because I do not know which SQL standard MonetDB follows here, and I get a syntax error. If this statement is not possible, I would be grateful for a workaround that is not too slow for large tables (on the order of 10^9 records).
Note: I ran into this problem while doing some bulk data imports from csv files into a table in my database. One of the columns is of type INT, but the values in the file at some point exceed the INT limit of 2^31-1 (yes, the table is big), so the transaction aborts. After I found out the reason for this failure, I wanted to change the column type to BIGINT, but all versions of the SQL code I tried failed.
This is currently not supported. However, there is a workaround:
Example table, where we want to change the type of column b from integer to double:
create table a(b integer);
insert into a values(42);
Create a temporary column:
alter table a add column b2 double;
Set the data in the temporary column to the original data:
update a set b2=b;
Remove the original column:
alter table a drop column b;
Re-create the original column with the new type:
alter table a add column b double;
Move the data from the temporary column to the new column:
update a set b=b2;
Drop the temporary column:
alter table a drop column b2;
Profit.
Note that this will change the ordering of the columns if the table has more than one. However, this is only a cosmetic issue.
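Applied to the INT-to-BIGINT case from the note above, the same sequence would look like this (table and column names are hypothetical):
alter table imports add column val_big bigint;
update imports set val_big = val;
alter table imports drop column val;
alter table imports add column val bigint;
update imports set val = val_big;
alter table imports drop column val_big;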