Reading a hive table in pyspark after altering the schema


I added a column to a Hive table:
ALTER TABLE table_name ADD COLUMNS (new_col string);
But when I read the table using PySpark (2.1), I still see the old schema. How do I load the table with the updated schema?

Related

Can we add column to an existing table in AWS Athena using SQL query?

I have a table in AWS Athena that contains 2 records. Is there a SQL query that can add a new column to the table?

You can find more information about adding columns to a table in the Athena documentation.
Alternatively, you can use CTAS (CREATE TABLE AS SELECT).
For example, suppose you have a table defined as
CREATE EXTERNAL TABLE sample_test(
id string)
LOCATION
's3://bucket/path'
Then you can create another table from sample_test with the query
CREATE TABLE new_test
AS
SELECT *, 'new' AS new_col FROM sample_test
Any valid SELECT query can follow AS.

This is mainly for future readers like me who struggled to get this working for a Hive table with Avro data and who want to update the schema of the existing table rather than create a new one. ADD COLUMNS on its own works for CSV, but not for Hive + Avro. For Hive + Avro, a solution for appending columns at the end (before the partition columns) is available at this link. There are a couple of things to note, however. First, we need to pass the full schema to the literal attribute, not just the changes. Second (not sure why), we had to alter the Hive table in all three ways, in this order:
1. add the columns using ADD COLUMNS
2. set TBLPROPERTIES
3. set SERDEPROPERTIES
Hopefully it helps someone.
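The three-step order described above can be sketched as a small Python helper that only builds the statements as strings. This is an illustration, not the exact commands from the linked solution: it assumes the "literal attribute" means the 'avro.schema.literal' property and that the same full schema is set in both TBLPROPERTIES and SERDEPROPERTIES. The table name and schema in the usage note are hypothetical.

```python
import json


def avro_alter_statements(table, full_schema, new_columns):
    """Build the three ALTER TABLE statements in the order described above.

    table        -- Hive table name
    full_schema  -- the complete Avro schema as a dict, not just the new fields
    new_columns  -- list of (name, hive_type) pairs being appended
    """
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in new_columns)
    # Compact JSON uses double quotes only, so it can sit inside a
    # single-quoted HiveQL string as long as no value contains a single quote.
    literal = json.dumps(full_schema, separators=(",", ":"))
    if "'" in literal:
        raise ValueError("schema contains a single quote; escape it first")
    return [
        # 1. add the columns
        f"ALTER TABLE {table} ADD COLUMNS ({cols})",
        # 2. table properties (assumed property name: avro.schema.literal)
        f"ALTER TABLE {table} SET TBLPROPERTIES ('avro.schema.literal'='{literal}')",
        # 3. serde properties (same assumption)
        f"ALTER TABLE {table} SET SERDEPROPERTIES ('avro.schema.literal'='{literal}')",
    ]
```

For example, calling avro_alter_statements("sample_avro", schema, [("new_col", "string")]) with the full record schema (old fields plus new_col) returns the three statements, which can then be run one at a time in the Hive shell.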

How to create table over partitioned data

I have text files with Snappy compression, partitioned by the field 'process_time' (the result of a Flume job). Example: hdfs://data/mytable/process_time=25-04-2019
This is my script to create the table:
CREATE EXTERNAL TABLE mytable
(
...
)
PARTITIONED BY (process_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/mytable/'
TBLPROPERTIES("textfile.compress"="snappy");
Queries against this table always return 0 rows (but I know that the data is there). Any help?
Thanks!

Because you are creating an external table on top of an existing HDFS directory, you need to register its partitions with Hive using one of the commands below.
If a partition is added to HDFS directly (instead of through INSERT queries), Hive doesn't know about it, so we need to run either MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION to add it to the metastore.
To add all partitions to the Hive table:
hive> msck repair table <db_name>.<table_name>;
(or)
To manually add each partition to hive table:
hive> alter table <db_name>.<table_name> add partition(process_time="25-04-2019")
location '/data/mytable/process_time=25-04-2019';
For more details refer to this link.
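When there are many partition directories and you prefer explicit statements over msck, the manual command above can be generated per directory. The following is a minimal Python sketch that only builds the statement text; the function name is made up for illustration, and it assumes the last path component has the form column=value.

```python
def add_partition_statement(table, location):
    """Return an ALTER TABLE ... ADD PARTITION statement for one directory.

    Assumes the last path component looks like 'column=value'.
    """
    leaf = location.rstrip("/").rsplit("/", 1)[-1]
    column, sep, value = leaf.partition("=")
    if not sep or not column or not value:
        raise ValueError(f"not a partition directory: {location!r}")
    # IF NOT EXISTS makes the statement safe to re-run.
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({column}='{value}') LOCATION '{location}'"
    )
```

Calling add_partition_statement("mytable", "/data/mytable/process_time=25-04-2019") yields one statement for the example partition from the question.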

Hive table not recognising partition

My Hive table is partitioned by the column 'job_id'. When I dump data into the HDFS location of the table, a partition directory named 'JOB_ID' is created, and my Hive table does not recognize it.
I have tried the msck repair table command, but that didn't help either.

For external Hive tables, you need to add each new partition manually, as follows:
ALTER TABLE table_name ADD PARTITION (job_id='927') location 'hdfs://some_location/job_id=927'

I found out that the partition column name in the directory path should always be in lowercase letters.
Here is the link:
https://medium.com/a-muggles-pensieve/hive-partition-column-name-camelcase-bad-idea-bc203d6e65da
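To spot this problem before running msck, a quick check over the directory names can help. This is a small Python sketch under the assumption that you already have the list of directory names under the table location (for example, copied from an HDFS listing); the function name is made up for illustration.

```python
def mismatched_partition_dirs(dir_names, partition_column):
    """Return directory names whose key matches the partition column
    only when case is ignored (e.g. 'JOB_ID=927' for column 'job_id')."""
    expected = partition_column.lower()
    bad = []
    for name in dir_names:
        key, sep, _value = name.partition("=")
        # Skip entries that are not 'key=value' directories at all.
        if sep and key.lower() == expected and key != expected:
            bad.append(name)
    return bad
```

Any name returned by this function would need to be renamed to the lowercase form, or added with an explicit ALTER TABLE ... ADD PARTITION ... LOCATION statement.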

External table does not return the data in its folder

I have created an external table in Hive at this location:
CREATE EXTERNAL TABLE tb
(
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/data';
The data is present in the folder, but when I query the table, it returns nothing. The table definition matches the structure of the data.
SELECT * FROM tb LIMIT 3;
Is this some kind of permission issue with Hive tables? Do specific users need permissions to query certain tables?
Do you know of any solutions or workarounds?

You created the table as partitioned on the column datehour, but you are putting your data directly in /user/cloudera/data. Hive looks for the data in /user/cloudera/data/datehour=(some int value). Because the table is external, Hive does not update the metastore automatically; you need to run an ALTER statement to register each partition.
So here are the steps for external tables with partitions:
1.) In your external location /user/cloudera/data, create a directory datehour=0909201401 and put the data files in it
OR
load the data using: LOAD DATA [LOCAL] INPATH '/path/to/data/file' INTO TABLE tb PARTITION (datehour=0909201401);
2.) After creating your table, run an ALTER statement:
ALTER TABLE tb ADD PARTITION (datehour=0909201401);
Hope it helps!

When we create an EXTERNAL TABLE with a PARTITION clause, we have to ALTER the table to register the data location of each partition. That location need not be under the path we specified when creating the EXTERNAL TABLE.
hive> ALTER TABLE tb ADD PARTITION (datehour=0909201401)
    > LOCATION '/user/cloudera/data/somedatafor_datehour';
When we specify LOCATION '/user/cloudera/data' (though it's optional) while creating an EXTERNAL TABLE, we can take advantage of repair operations on that table. When some process such as an ETL job copies files into that directory, we can sync the partitions with the EXTERNAL TABLE instead of writing an ALTER TABLE statement for each new partition.
If we already know the directory structure that Hive would create for the partition, we can simply place the data file in that location, like '/user/cloudera/data/datehour=0909201401/data.txt', and run the statement shown below:
hive> MSCK REPAIR TABLE tb;
The above statement syncs the partitions found on disk into the Hive metastore for the table "tb".
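To see why a file placed directly in the table folder is not picked up, here is a small Python sketch of the layout rule described above. It is a simplified model of the default column=value directory convention, not Hive's actual implementation, and the function name is made up for illustration.

```python
def discoverable_by_msck(table_location, file_path, partition_columns):
    """Check whether a file sits in <location>/<col1>=<v1>/.../<file>,
    i.e. the layout a repair operation is expected to recognise."""
    base = table_location.rstrip("/") + "/"
    if not file_path.startswith(base):
        return False
    # Directory components between the table location and the file name.
    dirs = file_path[len(base):].split("/")[:-1]
    if len(dirs) != len(partition_columns):
        return False
    return all(
        d.startswith(col + "=") and len(d) > len(col) + 1
        for d, col in zip(dirs, partition_columns)
    )
```

With the table from the question, '/user/cloudera/data/datehour=0909201401/data.txt' passes the check, while '/user/cloudera/data/data.txt' does not, which matches the empty result the asker saw.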

Alter the schema of Hive table

I want to alter a table created in Hive that is mapped to HBase fields. I recently added a few more columns to HBase and would like to add those fields to Hive as well.
To create the table I used:
CREATE EXTERNAL TABLE test1(rowKey STRING,a STRING,b STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES
('hbase.columns.mapping' = ':key,cf:address,cf:name')
TBLPROPERTIES ('hbase.table.name' = 'test');
Now I want to add one more column to the Hive table test1, mapped to HBase, but I can't find any way to do this. Please help. Thanks.

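Because you are using an external table, the easiest way is to drop it and create it again with the extra column and an extended 'hbase.columns.mapping'. Dropping an external table removes only the Hive metadata; the data in HBase is not touched.
drop table test1;
and
create external table test1 {...};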
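When re-creating the table, the Hive column list and the 'hbase.columns.mapping' string have to stay in step, one mapping entry per column. A minimal Python sketch that builds both from a single list is shown below; the function name, the new column c and the qualifier cf:phone are hypothetical and only illustrate the idea.

```python
def hbase_table_ddl(hive_table, hbase_table, columns):
    """Build the CREATE EXTERNAL TABLE statement for an HBase-backed table.

    columns -- list of (hive_name, hive_type, hbase_mapping) triples, so the
               column list and the mapping always have the same length.
    """
    col_defs = ", ".join(f"{name} {ctype}" for name, ctype, _ in columns)
    # Entries are joined with commas only, without surrounding spaces.
    mapping = ",".join(m for _, _, m in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table}({col_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
        "WITH SERDEPROPERTIES\n"
        f"('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}');"
    )
```

Passing the three original columns from the question plus ("c", "STRING", "cf:phone") produces a statement whose mapping is ':key,cf:address,cf:name,cf:phone', ready to run after the drop.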
