How to rename AWS Athena columns with parquet file source? - amazon-s3

I have data loaded in my S3 bucket folder as multiple parquet files.
After loading them into Athena I can query the data successfully.
What are the ways to rename Athena table columns for a Parquet file source and still be able to see the data under the renamed columns when querying?
Note: I tried the edit-schema option; the column gets renamed, but after querying you will not see any data under that column.

As far as I know, there is no way to create a table whose column names differ from what they are called in the files. The table can have fewer or extra columns, but only the columns whose names match those in the files will be queryable.
You can, however, create a view with other names, for example:
CREATE OR REPLACE VIEW a_view AS
SELECT
  a AS b,
  b AS c
FROM the_table
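Querying the view then returns the data under the new names. A minimal usage sketch against the hypothetical a_view above:
-- Columns b and c expose the data stored under a and b in the underlying files
SELECT b, c
FROM a_view
LIMIT 10;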

Related

How to recover old column value from parquet file after renaming?

Description
I have a managed, partitioned Hive table table_a with data stored in Amazon S3 in Parquet format. I renamed column col_old to col_new, and I lost all the data of col_old because of the way Parquet files work (by default Hive resolves Parquet columns by name, so the renamed column no longer matches the name stored in the files).
Question
Is there any way to recover values of col_old? (I still have the old parquet data files.)
Here are a few things I tried:
Created a new table with the old files and renamed col_new to col_old.
Created a new table with the old files and added col_old.
Created a table with the new column and ran: insert into new_table select * from old_table
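For reference, a minimal sketch of the first attempt, with a hypothetical column list, partition key, and S3 path (the real DDL depends on the original table_a schema). Declaring the original name col_old lets Hive/Athena match it by name against the untouched old Parquet files:
-- External table over the old Parquet files, using the original column name
CREATE EXTERNAL TABLE table_a_recovery (
  id BIGINT,
  col_old STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/old/table_a/';

-- Register the existing partitions, then check whether col_old has data
MSCK REPAIR TABLE table_a_recovery;
SELECT col_old FROM table_a_recovery LIMIT 10;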

How to Add a Column to BigQuery External Table

I have a bunch of data stored in partitioned ORC files in Google Cloud Storage. My bucket looks something like the following:
my_bucket
- folder_of_orc_files
- - partition1=abc
- - - file1.orc
- - - file2.orc
I have an external table defined in BigQuery that points to the data above that was created like this:
CREATE OR REPLACE EXTERNAL TABLE my_dataset.my_external_table
WITH PARTITION COLUMNS (
  partition1 STRING
)
OPTIONS (
  uris = ['gs://my_bucket/folder_of_orc_files/*'],
  format = 'ORC',
  hive_partition_uri_prefix = 'gs://my_bucket/folder_of_orc_files'
);
Those files currently have columns "Column A", "Column B" and "Column C".
Now I need to add "Column D". So I add a file3.orc that contains "Column D". In reality, of course, I have a metric ton of files and would rather not have to recreate all of the old ones. I need a way to have the external table see "Column D" with NULLS for the old entries and with the proper values from the new files.
Out of the gate, the external table does not see "Column D". So I dropped the table and re-created it. It still only has columns A, B, and C and completely ignores "Column D". With ORC files you cannot specify a schema - it is read automatically. The ALTER TABLE command does not work for external tables.
The only thing I have found (short of reloading all of the data from scratch from my Spark jobs) is that I could move all of my data in to temporary tables and then re-write them out - which is again quite a large task (and expensive) when you have a huge amount of data. Anyone know of any other way to achieve this? Thanks!
When you ask BigQuery to perform schema autodetection, it simply takes a sample of lines (for the CSV or JSON formats). For binary files, such as the ORC format, I guess it samples a few files, or just the first one it gets.
Anyway, after the schema definition, the schema is never updated automatically. If your format changes, you have to update it manually.
Indeed, you can't do it with an ALTER TABLE statement, but you can do it in the UI or with the bq CLI (or the API/client libraries if you prefer). You have the documentation here.
So, not impossible, but not automatic!

How Can I create a Hive Table on top of a Parquet File

I am facing an issue creating a Hive table on top of a Parquet file. Can someone help me with this? I have read many articles and followed the guidelines, but I am not able to load a Parquet file into a Hive table.
According "Using Parquet Tables in Hive" it is often useful to create the table as an external table pointing to the location where the files will be created, if a table will be populated with data files generated outside of Hive.
hive> create external table parquet_table_name (<yourParquetDataStructure>)
STORED AS PARQUET
LOCATION '/<yourPath>/<yourParquetFile>';
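As a filled-in example of that template, with a hypothetical two-column schema and HDFS directory (the column names and types must match what the Parquet files actually contain):
hive> CREATE EXTERNAL TABLE parquet_table_name (id BIGINT, name STRING)
      STORED AS PARQUET
      LOCATION '/user/hive/warehouse/parquet_data';
hive> SELECT id, name FROM parquet_table_name LIMIT 10;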

How does Hive create a table from a file present in HDFS?

I am new to HDFS and Hive. I got some introduction to both after reading some books and documentation. I have a question regarding the creation of a table in Hive for a file that is present in HDFS.
I have a file with 300 fields in HDFS. I want to create a table accessing this file in HDFS, but I only want to make use of, say, 30 fields from this file.
My questions are
1. Does hive create a separate file directory?
2. Do I have to create hive table first and import data from HDFS?
3. Since I want to create a table with 30 columns out of 300 columns, Does hive create a file with only those 30 columns?
4. Do I have to create a separate file with 30 columns and import into HDFS and then create hive table pointing to HDFS directory?
My questions are
Does hive create a separate file directory?
YES, if you create a Hive table (managed or external) and load the data using the LOAD command.
NO, if you create an external table and point it at the existing file.
Do I have to create hive table first and import data from HDFS?
Not necessarily; you can create a Hive external table and point it at the existing file.
Since I want to create a table with 30 columns out of 300 columns, Does hive create a file with only those 30 columns?
You can do it easily using HiveQL. Follow the steps below (note: this is not the only approach):
Create an external table with the 300 columns and point it at the existing file.
Create another Hive table with the desired 30 columns and insert data into this new table from the 300-column table using "insert into table30col select ... from table300col". Note: Hive will create the file with only those 30 columns during this insert operation.
Do I have to create a separate file with 30 columns and import into HDFS and then create hive table pointing to HDFS directory?
Yes, this can be an alternative.
I personally like the solution mentioned in question 3, as I don't have to recreate the file and I can do all of it in Hadoop without depending on some other system.
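A minimal sketch of that two-table approach, with hypothetical table names, an abbreviated column list, a made-up HDFS path, and assuming a comma-delimited text file (the real table would declare all 300 columns):
-- Staging table: all 300 columns, pointing at the directory of the existing file
CREATE EXTERNAL TABLE table300col (
  col1 STRING,
  col2 STRING,
  -- ... remaining columns up to col300 ...
  col300 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/existing_file_dir';

-- Target table: only the 30 columns that are actually needed
CREATE TABLE table30col (
  col1 STRING,
  col2 STRING
  -- ... remaining 28 selected columns ...
);

-- Hive writes new files containing only these 30 columns
INSERT INTO TABLE table30col
SELECT
  col1,
  col2
  -- ... the other selected columns ...
FROM table300col;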
You have several options. One is to have Hive simply point to the existing file, i.e. create an external HIVE table:
CREATE EXTERNAL TABLE ... LOCATION '<your existing hdfs file>';
This table in Hive will, obviously, match your existing table exactly. You must declare all 300 columns. There will be no data duplication; there is only one file, and Hive simply references the already existing file.
A second option would be to either IMPORT or LOAD the data into a Hive table. This would copy the data into a Hive table and let Hive control the location. But it is important to understand that neither IMPORT nor LOAD transforms the data, so the resulting table will have exactly the same structure, layout, and storage as your original file.
Another option, which I would recommend, is to create a specific Hive table and then import the data into it, using a tool like Sqoop or going through an intermediate staging table created by one of the methods above (preferably an external reference, to avoid an extra copy). Create the desired table, create the external reference staging table, insert the data into the target using INSERT ... SELECT, then drop the staging table. I recommend this because it lets you control not only the table structure/schema (i.e. keep only the needed 30 columns) but also, importantly, the storage. Hive has a highly performant columnar storage format, namely ORC, and you should strive to use it, because it will give you a tremendous query performance boost.
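Reusing the hypothetical table300col staging table from the sketch above, the recommended variant just stores the target as ORC and drops the staging table afterwards (again a sketch, not the answerer's exact DDL):
-- ORC-backed target with only the needed columns
CREATE TABLE table30col_orc (
  col1 STRING,
  col2 STRING
  -- ... remaining 28 selected columns ...
)
STORED AS ORC;

INSERT INTO TABLE table30col_orc
SELECT
  col1,
  col2
  -- ... the other selected columns ...
FROM table300col;

-- The external staging table is only a reference; dropping it leaves
-- the original HDFS file in place
DROP TABLE table300col;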

Hive Extended table

When we create a table using
CREATE EXTERNAL TABLE employee (name STRING, salary FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/emp';
In the /emp directory there are 2 emp files.
So when we run select * from employee, it gets the data from both files and displays it.
What will happen when there are other files in that directory with a different kind of record, whose columns do not match the employee table? Will it try to load all the files when we run "select * from employee"?
1. Can we specify the specific file name that we want to load?
2. Can we create other tables with the same location?
Thanks
Prashant
It will load all the files in the /emp directory even if they don't match the table.
For your first question: you can use the Regex SerDe. If your data matches the regex, then it loads into the table.
regex for access log in hive serde
https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
Other options: I am pointing to some links below; they describe some approaches.
when creating an external table in hive can I point the location to specific files in a direcotry?
https://issues.apache.org/jira/browse/HIVE-951
For your second question: yes, we can create other tables with the same location.
Here are your answers:
1. If the data in a file doesn't match the table format, Hive doesn't throw an error. It tries to read the data as best it can. If data for some columns is missing, it will put NULL for them.
2. No, we cannot specify a file name for any table to read data from. Hive will consider all the files under the table directory.
3. Yes, we can create other tables with the same location.
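As a small illustration of that last point (hypothetical schema, same /emp location as above), a second external table can be layered over the very same directory, and dropping either external table later does not delete the files:
-- Second external table over the same /emp directory, exposing the
-- same delimited files under a different name and a narrower schema
CREATE EXTERNAL TABLE employee_names (name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/emp';

SELECT name FROM employee_names LIMIT 5;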