How to drop columns from a table in Impala

I have a table with more than 2,000 columns, but I need to drop only one column.
Is there any efficient way to do this in Impala?
I'm trying it this way:
alter table proceso.prueba drop subsegm
select * from proceso.prueba
but I got this error on the "select":
'hdfs://nameservice1/user/hive/warehouse/proceso.db/prueba/914a7dd4a8462ff1-860a4c1d00000011_978927331_data.1.parq'
has an incompatible Parquet schema for column 'proceso.prueba.nfi_meses_antiguedad_bco'.
Column type: INT, Parquet schema: optional byte_array subsegm [i:4 d:1 r:0]
What am I doing wrong?
Thanks for the help.

This error occurs when the schema defined for the table (the data types of its columns, in this case) conflicts with the schema stored in the table's Parquet files.
To fix this, you can check the following:
1) Run SHOW CREATE TABLE proceso.prueba and list the columns.
2) Run parquet-tools meta hdfs://nameservice1/user/hive/warehouse/proceso.db/prueba/914a7dd4a8462ff1-860a4c1d00000011_978927331_data.1.parq to see the file metadata, which includes the column details.
3) Compare the results from #1 and #2 to check whether the number of columns matches and whether the data type of the column subsegm (in the #2 result) is what it is supposed to be (per the #1 result).
Modifying the table's data types, and its structure if needed, to the correct values should fix the issue.
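If the comparison confirms that the Parquet files still carry the dropped column, one possible follow-up fix, sketched below, combines the name-based Parquet resolution mentioned further down this page with a rewrite of the data; prueba_rebuilt is a hypothetical table name and this is only one way to do it:
-- Impala matches Parquet columns to table columns by position by default, so a
-- dropped column shifts every later column and the types stop lining up.
-- Resolving by name lets the old files be read, and the CTAS writes new files
-- that match the new column list.
SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
CREATE TABLE proceso.prueba_rebuilt STORED AS PARQUET AS
SELECT * FROM proceso.prueba;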
Hope that helps!

Related

Presto failed: com.facebook.presto.spi.type.VarcharType

I created a table with three columns: id, name, and position.
Then I stored the data in S3 in ORC format using Spark.
When I query select * from person, it returns everything.
But when I query from Presto, I get this error:
Query 20180919_151814_00019_33f5d failed: com.facebook.presto.spi.type.VarcharType
I have found the answer to the problem: when I stored the data in S3, the file contained one more column than was defined in the Hive table metastore.
So when Presto tried to query the data, it found a varchar where it expected an integer.
This can also happen if one record has a type different from what is defined in the metastore.
I had to delete my data and import it again without that extra, unneeded column.
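To illustrate the mismatch described above, here is a hypothetical reconstruction of the table definition (not the poster's actual DDL; the types and location are assumptions):
-- The Hive metastore only knows about three columns...
CREATE EXTERNAL TABLE person (
  id INT,
  name VARCHAR(100),
  position VARCHAR(100)
)
STORED AS ORC
LOCATION 's3://some-bucket/person/';   -- hypothetical location
-- ...but the ORC files written by Spark carried a fourth column. Presto's Hive
-- connector maps ORC columns by position by default, so a varchar value ended up
-- where an integer was expected.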

Import csv to impala

So for my previous homework, we were asked to import a CSV file with no column names into Impala, where we explicitly give the name and type of each column while creating the table. However, now we have a CSV file that does include the column names. In this case, do we still need to write down the name and type of each column even though they are provided in the data?
Yes, you still have to create an external table and define the column names and types. But you have to pass the following option right at the end of the CREATE TABLE statement:
tblproperties ("skip.header.line.count"="1");
-- Once the table property is set, queries skip the specified number of lines
-- at the beginning of each text data file. Therefore, all the files in the table
-- should follow the same convention for header lines.
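Putting it together, a minimal sketch of such a table; the column names, types, and HDFS location are made up for illustration:
-- External Impala/Hive table over comma-separated text files that carry a header row.
CREATE EXTERNAL TABLE my_csv_table (
  id INT,
  name STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/my_csv_data'
TBLPROPERTIES ("skip.header.line.count"="1");   -- skip the header line in each file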

How to add a column in the middle of an ORC partitioned Hive table and still be able to query old partitioned files with the new structure

Currently I have a partitioned ORC "managed" Hive table (wrongly created as internal at first) in prod with at least 100 days' worth of data partitioned by year, month, day (~16 GB of data).
This table has roughly 160 columns. Now my requirement is to add a column in the middle of this table and still be able to query the older data (partitioned files). It is fine if the newly added column shows null for the old data.
What have I done so far?
1) First, convert the table to external using the statement below, to preserve the data files before dropping:
alter table <table_name> SET TBLPROPERTIES('EXTERNAL'='TRUE');
2) Drop and recreate the table with the new column in the middle, and then alter the table to add the partitions back.
However, I am unable to read the table after the recreation. I get this error message:
[Simba][HiveJDBCDriver](500312) Error in fetching data rows: *org.apache.hive.service.cli.HiveSQLException:java.io.IOException: java.io.IOException: ORC does not support type conversion from file type array<string> (87) to reader type int (87):33:32;
Is there any other way to accomplish this?
There is no need to drop and recreate the table. Simply use the following statement:
ALTER TABLE default.test_table ADD COLUMNS (column1 STRING, column2 STRING) CASCADE;
ALTER TABLE ... ADD COLUMNS with the CASCADE option changes the columns in the table's metadata and cascades the same change to all the partition metadata.
PS - This will add the new columns after the existing columns but before the partition columns. Unfortunately, ORC does not support adding columns in the middle as of now.
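A quick usage sketch of the behaviour described above; the partition values and the selected columns are hypothetical:
-- After the CASCADE, partitions written before the change simply return NULL for
-- the new columns, so the old data stays queryable.
SELECT year, month, day, column1
FROM default.test_table
WHERE year = 2019
LIMIT 10;   -- column1 is NULL for rows in partitions written before the ALTER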
Hope that helps!

hive doesn't change parquet schema

I have a problem with ALTER TABLE: it changes the table schema but not the Parquet schema.
For example, I have a Parquet table with these columns:
column1 (string), column2 (string), column3 (string), column4 (string), column5 (bigint)
Now I try to change the table's schema with
ALTER TABLE name_table DROP COLUMN column3;
With DESCRIBE TABLE I can see that column3 is not there anymore.
Now I try to execute select * from the table, but I receive an error like this:
"data.0.parq' has an incompatible type with the table schema for column column4. Expected type: INT64. Actual type: BYTE_ARRAY"
The values of the deleted column are still present in the Parquet file, which has 5 columns instead of 4 (as in the table schema).
Is this a bug? How can I change the Parquet file's schema using Hive?
This is not a bug. When you drop a column, that just updates the definition in the Hive metastore, which holds only the information about the table; the underlying files on HDFS remain unchanged. Since the Parquet metadata is embedded in the files, the files have no idea that the table definition has changed.
Hence you see this issue.
The solution is described here. If you want to add a column (or columns) to a Parquet table and stay compatible with both Impala and Hive, you need to add the column(s) at the end.
If you alter the table and change column names or drop a column, that table will no longer be compatible with Impala.
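A sketch of the compatible change described above, reusing the table name from the question; the new column name is hypothetical:
-- Adding a column at the end keeps the table readable by both Hive and Impala;
-- old Parquet files simply return NULL for the new column.
ALTER TABLE name_table ADD COLUMNS (column6 STRING);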
I had the same error after adding a column to a Hive table.
The solution is to set the query option below in each session:
set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
If you're using the Cloudera distribution, you can set it permanently in Cloudera Manager => Impala Configuration => Impala Daemon Query Options Advanced Configuration Snippet (Safety Valve),
with the config value PARQUET_FALLBACK_SCHEMA_RESOLUTION=name.

Bigquery: invalid: Illegal Schema update

I tried to append data from a query to a bigquery table.
Job ID job_i9DOuqwZw4ZR2d509kOMaEUVm1Y
Error: Job failed while writing to Bigquery. invalid: Illegal Schema update. Cannot add fields (field: debug_data) at null
If I copy and paste the query executed in the above job, run it in the web console, and choose the same destination table to append to, it works.
The job you listed is trying to append query results to a table. That query has a field named 'debug_data'. The table you're appending to does not have that field. This behavior is by design, in order to prevent people from accidentally modifying the schema of their tables.
You can run a tables.update() or tables.patch() operation to modify the table schema to add this column (see an example using bq here: Bigquery add columns to table schema), and then you'll be able to run this query successfully.
Alternatively, you could use truncate instead of append as the write disposition in your query job; this would overwrite the table, and in doing so allow schema changes.
See this post for how to have BigQuery automatically add new fields to a schema while doing an append.
The code in Python is:
job_config.schema_update_options = ['ALLOW_FIELD_ADDITION']
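BigQuery's standard SQL DDL also offers a direct way to add the missing field before appending; a minimal sketch, where the dataset/table names and the STRING type for debug_data are assumptions:
-- Add the new field to the destination table, then re-run the append job.
ALTER TABLE my_dataset.my_table
ADD COLUMN debug_data STRING;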