Trying to copy data from Impala Parquet table to a non-parquet table - sql

I am moving data around within Impala, not my design, and I have lost some data. I need to copy the data from the parquet tables back to their original non-parquet tables. Originally, the developers had done this with a simple one liner in a script. Since I don't know anything about databases and especially about Impala I was hoping you could help me out. This is the one line that is used to translate to a parquet table that I need to be reversed.
impalaShell -i <ipaddr> use db INVALIDATE METADATA <text_table>;
CREATE TABLE <parquet_table> LIKE <text_table> STORED AS PARQUET TABLE;
INSERT OVERWRITE <parquet_table> SELECT * FROM <text_table>;
Thanks.

Have you tried simply doing
CREATE TABLE <text_table>
AS
SELECT *
FROM <parquet_table>
Per the Cloudera documentation, this should be possible.
NOTE: Ensure that your does not exist or use a table name that does not already exist so that you do not accidentally overwrite other data.

Related

Add new partition-scheme to existing table in athena with SQL code

Is it even possible to add a partition to an existing table in Athena that currently is without partitions? If so, please also write syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web-search, and variants exist for SQL server, or adding a partition to an already partitioned table. However, I personally could not find a case where one could successfully add a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on folder structure in S3. Unlike standard RDBMS that are loading the data into their disks or memory, Athena is based on scanning data in S3. This is how you enjoy the scale and low cost of the service.
What it means is that you have to have your data in different folders in a meaningful structure such as year=2019, year=2020, and make sure that the data for each year is all and only in that folder.
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that will copy the data and create a new table that can be optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example).

Is there a way to merge ORC files in HDFS without using ALTER TABLE CONCATENATE command?

This is my first week with Hive and HDFS, so please bear with me.
Almost all the ways I saw so far to merge multiple ORC files suggest using ALTER TABLE with CONCATENATE command.
But I need to merge multiple ORC files of the same table without having to ALTER the table. Another option is to create a copy of the existing table and then use ALTER TABLE on that so that my original table remains unchanged. But I can't do that as well because space and data redundancy reasons.
The thing I'm trying to achieve (ideally) is: I need to transport these ORCs as one file per table into a cloud environment. So, is there a way that I can merge the ORCs on-the-go during the transfer process into cloud? Can this be achieved with/without Hive, maybe directly in HDFS?
Two possible methods other than ALTER TABLE CONCATENATE:
Try to configure merge task, see details here: https://stackoverflow.com/a/45266244/2700344
Alternatively you can force single reducer. This method is quite applicable for not too big files. You can overwrite the same table with ORDER BY, this will force single reducer on the last ORDER BY stage. This will work slow or even fail with big files because all the data will be passed through single reducer:
INSERT OVERWRITE TABLE
SELECT * FROM TABLE
ORDER BY some_col; --this will force single reducer
As a side effect you will get better packed ORC file with efficient index on columns listed in order by.

Schema Evolution in Parquet Hive table

I have a lot of data in a Parquet based Hive table (Hive version 0.10). I have to add a few new columns to the table. I want the new columns to have data going forward. If the value is NULL for already loaded data, that is fine with me.
If I add the new columns and not update the old Parquet files, it gives an error and it looks strange as I am adding String columns only.
Error getting row data with exception java.lang.UnsupportedOperationException: Cannot inspect java.util.ArrayList
Can you please tell me how to add new fields to Parquet Hive without affecting the already existing data in the table ?
I use Hive version 0.10.
Thanks.
1)
Hive starting with version 0.13 has parquet schema evoultion built in.
https://issues.apache.org/jira/browse/HIVE-6456
https://github.com/Parquet/parquet-mr/pull/297
ps. Notice that out-of-the-box support for schema evolution might take a toll on performance. For example, Spark has a knob to turn parquet schema evolution on and off. After one of the recent Spark releases, it's now off by default because of performance hit (epscially when there are a lot of parquet files). Not sure if Hive 0.13+ has such a setting too.
2)
Also wanted to suggest to try creating views in Hive on top of such parquet tables where you expect often schema changes, and use views everywhere but not tables directly.
For example, if you have two tables - A and B with compatible schemas, but table B has two more columns, you could workaround this by
CREATE VIEW view_1 AS
SELECT col1,col2,col3,null as col4,null as col5 FROM tableA
UNION ALL
SELECT col1,col2,col3,col4,col5 FROM tableB
;
So you don't actually have to recreate any tables like #miljanm has suggested, you can just recreate the view. It'll help with the agility of your projects.
Create a new table with the two new columns. Insert data by issuing:
insert into new_table select old_table.col1, old_table.col2,...,null,null from old_table;
The last two nulls are for the two new columns. That's it.
If you have too many columns, it may be easier for you to write a program that reads the old files and writes the new ones.
Hive 0.10 does not have support for schema evolution in parquet as far as I know. Hive 0.13 does have it, so you may try to upgrade hive.

Hive: create table and write it locally at the same time

Is it possible in hive to create a table and have it saved locally at the same time?
When I get data for my analyses, I usually create temporary tables to track eventual
mistakes in the queries/scripts. Some of these are just temporary tables, while others contain the data that I actually need for my analyses.
What I do usually is using hive -e "select * from db.table" > filename.tsv to get the data locally; however when the tables are big this can take quite some time.
I was wondering if there is some way in my script to create the table and save it locally at the same time. Probably this is not possible, but I thought it is worth asking.
Honestly doing it the way you are is the best way out of the two possible ways but it is worth noting you can preform a similar task in an .hql file for automation.
Using syntax like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp' select * from table;
You can run a query and store it somewhere in the local directory (as long as there is enough space and correct privileges)
A disadvantage to this is that with a pipe you get the data stored nicely as '|' delimitation and new line separated, but this method will store the values in the hive default '^b' I think.
A work around is to do something like this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/temp'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select books from table;
But this is only in Hive 0.11 or higher

remove source file from Hive table

When I load a (csv)-file to a hive table I can load without overwriting, thus adding the new file to the table.
Internally the file is just copied to the correct folder in HDFS
(e.g. user/warehouse/dbname/tablName/datafile1.csv). And probably some metadata is updated.
After a few loads I want to remove the contents of a specific file from the table.
I am sure I cannot simply delete the file because of the metadata that needs to be adjusted as well. There must be some kind of build-in function for this.
How do I do that?
Why do you need that?I mean Hive was developed to serve as a warehouse where you put lots n lots n lots of data and not to delete data every now and then. Such a need seems to be a poorly thought out schema or a poor use of Hive, at least to me.
And if you really have these kind of needs why don't you create partitioned tables? If you need to delete some specific data just delete that particular partition using either TRUNCATE or ALTER.
TRUNCATE TABLE table_name [PARTITION partition_spec];
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION partition_spec,...
if this feature is needed more than just once in a while you can use MapR's distribution while allows this kind of operations with no problem (even via NFS). otherwise, if you don't have partition I think you'll have to create and new table using CTAS filterring the data in the bad file or just copy the good files back to os with "hadoop fs -copyToLocal" and move them back to hdfs into new table