How to create a bucketed ORC transactional table in Hive that is modeled after a non-transactional table - hive

Suppose I have a non-transactional table in Hive named 'ccm'. It has hundreds of columns and one partition field.
I know how to create a copy with "CREATE TABLE abc LIKE ccm", but I would like abc to be bucketed, stored as ORC, and have transaction support turned on via TBLPROPERTIES.
I do not want to mention all the columns in ccm when I compose the HQL.
Can I do this?

This answer may have the correct way to proceed in your case, and it also explains some limitations of the method used.
Create hive table using "as select" or "like" and also specify delimiter
So, from the example, you should add the missing parts:
CLUSTERED BY (...) INTO n BUCKETS
TBLPROPERTIES ("transactional"="true")
I have some doubts that you can achieve exactly your expected result, but I would consider it a step forward.
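For concreteness, here is a minimal sketch of that CTAS route, which avoids listing the columns of ccm. The bucket column some_col and the bucket count are placeholders, and it assumes your Hive version accepts CLUSTERED BY in a CTAS; note also that CTAS cannot create a partitioned table, so ccm's partition field becomes an ordinary column in abc (one of the limitations alluded to above):
CREATE TABLE abc
CLUSTERED BY (some_col) INTO 8 BUCKETS  -- placeholder bucket column and count
STORED AS ORC
TBLPROPERTIES ("transactional"="true")
AS SELECT * FROM ccm;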

Related

Bigquery - Updating GCS path of Bigquery external tables

I have a requirement where I might have to update the Bigquery External tables on a periodic basis.
The GCS location has a timestamp folder for every incremental run, and I would like to point the external table's path at the latest timestamp folder.
The only way I see is dropping the table and creating it again, pointing it at the latest folder. But is there any other way to update it without dropping the table?
As suggested by @Samuel, you can use the SQL statement CREATE OR REPLACE EXTERNAL TABLE for your requirement. Scheduled queries support DML and DDL statements, which can be used to create the new tables. You can use the below-mentioned parameter to create the table according to your schedule:
My_database_name.my_table_name.my_results_{run_date}
For more information you can refer to this documentation.
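For what it's worth, a minimal sketch of what such a scheduled DDL statement could look like; the dataset, table name, format, and GCS path (including the timestamp folder) are placeholders for your own values:
-- Re-points the external table at the newest timestamp folder without dropping it
CREATE OR REPLACE EXTERNAL TABLE my_dataset.my_external_table
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/exports/20230101T000000/*']
);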

Add new partition-scheme to existing table in athena with SQL code

Is it even possible to add a partition to an existing table in Athena that currently is without partitions? If so, please also write syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web-search, and variants exist for SQL server, or adding a partition to an already partitioned table. However, I personally could not find a case where one could successfully add a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on folder structure in S3. Unlike standard RDBMS that are loading the data into their disks or memory, Athena is based on scanning data in S3. This is how you enjoy the scale and low cost of the service.
What it means is that you have to have your data in different folders in a meaningful structure such as year=2019, year=2020, and make sure that the data for each year is all and only in that folder.
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that will copy the data and create a new table that can be optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example).
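A hedged sketch of such a CTAS; the new table name, columns, S3 location, and the year partition column are placeholders for your own schema:
CREATE TABLE table1_partitioned
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://your-bucket/table1_partitioned/',
  partitioned_by = ARRAY['year']
) AS
SELECT col1, col2, year  -- partition columns must come last in the SELECT list
FROM table1;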

Is there a way to merge ORC files in HDFS without using ALTER TABLE CONCATENATE command?

This is my first week with Hive and HDFS, so please bear with me.
Almost all the approaches I have seen so far for merging multiple ORC files suggest using ALTER TABLE with the CONCATENATE command.
But I need to merge multiple ORC files of the same table without having to ALTER the table. Another option would be to create a copy of the existing table and then run ALTER TABLE on that copy so that my original table remains unchanged. But I can't do that either, for space and data-redundancy reasons.
The thing I'm trying to achieve (ideally) is: I need to transport these ORCs as one file per table into a cloud environment. So, is there a way I can merge the ORCs on the fly during the transfer to the cloud? Can this be achieved with or without Hive, maybe directly in HDFS?
Two possible methods other than ALTER TABLE CONCATENATE:
Try to configure the merge task; see details here: https://stackoverflow.com/a/45266244/2700344
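For reference, a sketch of the merge-task settings described in that answer (values are examples, tune them to your cluster; use hive.merge.mapredfiles instead of hive.merge.tezfiles if you run on MapReduce):
SET hive.merge.mapfiles=true;                 -- merge small files after map-only jobs
SET hive.merge.tezfiles=true;                 -- merge small files at the end of a Tez job
SET hive.merge.smallfiles.avgsize=128000000;  -- trigger the merge when avg file size is below ~128 MB
SET hive.merge.size.per.task=256000000;       -- target size of the merged files
SET hive.merge.orcfile.stripe.level=true;     -- fast, stripe-level concatenation for ORC
-- then rewrite the table so the merge step runs, e.g.:
INSERT OVERWRITE TABLE your_table SELECT * FROM your_table;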
Alternatively, you can force a single reducer. This method is only practical when the files are not too big. You can overwrite the same table with an ORDER BY; this forces a single reducer on the final ORDER BY stage. It will run slowly, or even fail, with big files, because all the data is passed through a single reducer:
INSERT OVERWRITE TABLE your_table
SELECT * FROM your_table
ORDER BY some_col; -- this will force a single reducer
As a side effect, you will get a better-packed ORC file with an efficient index on the columns listed in the ORDER BY.

Create Partition table in Big Query

Can anyone please suggest how to create a partitioned table in BigQuery?
Example: Suppose I have log data in Google Cloud Storage for the year 2016. I stored all the data in one bucket, organized into year, month, and date folders. Here I want to create a table partitioned by date.
Thanks in Advance
Documentation for partitioned tables is here:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables
In this case, you'd create a partitioned table and populate the partitions with the data. You can run a query job that reads from GCS (and filters data for the specific date) and writes to the corresponding partition of a table. For example, to load data for May 1st, 2016, you'd specify the destination_table as table$20160501.
Currently, you'll have to run several query jobs to achieve this process. Please note that you'll be charged for each query job based on bytes processed.
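A hedged example of one such query job using the bq CLI; mydataset, mytable, staging_logs, and event_ts are placeholder names (staging_logs standing in for a table, or a federated table over the GCS data), and the $20160501 decorator targets the May 1st, 2016 partition:
# Writes one day's rows into the matching partition of the destination table
bq query \
  --use_legacy_sql=false \
  --replace \
  --destination_table='mydataset.mytable$20160501' \
  'SELECT * FROM mydataset.staging_logs WHERE DATE(event_ts) = "2016-05-01"'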
Please see this post for some more details:
Migrating from non-partitioned to Partitioned tables
There are two options:
Option 1
You can load each daily file into its own table, named YourLogs_YYYYMMDD.
See details on how to Load Data from Cloud Storage
After the tables are created, you can access them either using Table wildcard functions (Legacy SQL) or using a Wildcard Table (Standard SQL). See also Querying Multiple Tables Using a Wildcard Table for more examples.
Option 2
You can create a Date-Partitioned Table (just one table, YourLogs), but you will still need to load each daily file into its respective partition; see Creating and Updating Date-Partitioned Tables.
After the table is loaded, you can easily Query Date-Partitioned Tables.
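A hedged sketch of that option with the bq CLI; the dataset, table, bucket path, schema file, and dates are placeholders:
# Create one ingestion-time, day-partitioned table
bq mk --time_partitioning_type=DAY mydataset.YourLogs
# Load each daily file into its partition via the $YYYYMMDD decorator
bq load --source_format=CSV 'mydataset.YourLogs$20160501' \
  gs://your-bucket/logs/2016/05/01/*.csv ./logs_schema.json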
Having partitions on an External Table is not allowed for now. There is a Feature Request for it:
https://issuetracker.google.com/issues/62993684
(please vote for it if you're interested in it!)
Google says that they are considering it.

Trying to copy data from Impala Parquet table to a non-parquet table

I am moving data around within Impala (not my design), and I have lost some data. I need to copy the data from the Parquet tables back to their original non-Parquet tables. Originally, the developers did this with a simple one-liner in a script. Since I don't know anything about databases, and especially not about Impala, I was hoping you could help me out. This is the one-liner used to translate to a Parquet table, which I need reversed.
impalaShell -i <ipaddr> use db INVALIDATE METADATA <text_table>;
CREATE TABLE <parquet_table> LIKE <text_table> STORED AS PARQUET TABLE;
INSERT OVERWRITE <parquet_table> SELECT * FROM <text_table>;
Thanks.
Have you tried simply doing
CREATE TABLE <text_table>
AS
SELECT *
FROM <parquet_table>
Per the Cloudera documentation, this should be possible.
NOTE: Ensure that your <text_table> does not already exist, or use a table name that does not already exist, so that you do not accidentally overwrite other data.
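If the original text tables still exist and only lost data, a hedged alternative that mirrors the pattern in the question is to overwrite them in place, reusing the same placeholders:
-- Replaces the contents of the existing non-Parquet table with the rows from the Parquet copy
INSERT OVERWRITE <text_table> SELECT * FROM <parquet_table>;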