Can I use incremental, time travel, and snapshot queries with hudi only using spark-sql?

I'm trying to do incremental, snapshot, and time travel queries using spark-sql with hudi, but the only way that I can find to do this is creating a DataFrame with spark.read and then creating a temp view. Is there any way to accomplish this with only using spark-sql? This is in an EMR cluster, version 6.6.0.

Yes you can.
For time travel, you need spark 3.2+:
select * from <table name> timestamp as of '20221113183401100' where id = 1;
select * from <table name> timestamp as of '2022-11-13 18:34:01.100' where id = 1;
select * from <table name> timestamp as of '2022-11-13' where id = 1;
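The first two queries above appear to name the same instant in two notations. As a small sketch, the conversion between them could look like the following; it assumes the 17-digit form is year, month, day, hour, minute, second, and milliseconds concatenated (inferred from the example pair), and `to_hudi_instant` is a hypothetical helper name, not part of Hudi.

```python
from datetime import datetime


def to_hudi_instant(text):
    """Convert 'YYYY-MM-DD HH:MM:SS.mmm' into a 17-digit instant string.

    Assumption: the compact form is yyyyMMddHHmmss followed by milliseconds,
    as suggested by the two equivalent timestamps in the answer.
    """
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
    # strftime has no millisecond code, so the last three digits
    # are derived from the parsed microseconds.
    millis = parsed.microsecond // 1000
    return parsed.strftime("%Y%m%d%H%M%S") + f"{millis:03d}"


if __name__ == "__main__":
    print(to_hudi_instant("2022-11-13 18:34:01.100"))
```

The resulting string can be pasted into the `timestamp as of` clause in place of the dashed form.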
For incremental queries, you need to sync your MoR table to Hive. This will create two views for your table: <table name>_rt for the real-time view and <table name>_ro for the read-optimized view.
set hoodie.<table name>.consume.mode=INCREMENTAL;
set hoodie.<table name>.consume.max.commits=3;
set hoodie.<table name>.consume.start.timestamp=20221113183401;
select * from <table name>_rt where `_hoodie_commit_time`>'20221113183401';
And for snapshot queries, it is still under development (issue), but you can use the incremental queries syntax to achieve it, by using the timestamp you want to query as start timestamp and setting max commits to 1:
set hoodie.<table name>.consume.mode=INCREMENTAL;
set hoodie.<table name>.consume.max.commits=1;
set hoodie.<table name>.consume.start.timestamp=20221113183401;
select * from <table name>_rt where `_hoodie_commit_time`='20221113183401';
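Since the settings embed the table name, it can be handy to generate them rather than type them by hand. The following is only a sketch that reproduces the incremental statements from this answer as strings; `hudi_incremental_statements` and the table name `trips` are hypothetical, and the statements still have to be run in a spark-sql session against a Hive-synced table.

```python
def hudi_incremental_statements(table, start_ts, max_commits):
    """Return the spark-sql statements for an incremental read of <table>_rt."""
    # The three session settings mirror the ones shown in the answer.
    prefix = f"hoodie.{table}.consume"
    return [
        f"set {prefix}.mode=INCREMENTAL;",
        f"set {prefix}.max.commits={max_commits};",
        f"set {prefix}.start.timestamp={start_ts};",
        # Filter on the commit-time metadata column, as in the answer.
        f"select * from {table}_rt where `_hoodie_commit_time`>'{start_ts}';",
    ]


if __name__ == "__main__":
    # 'trips' is a made-up table name used only for illustration.
    for stmt in hudi_incremental_statements("trips", "20221113183401", 3):
        print(stmt)
```

For the single-commit variant described above, pass `max_commits=1` and change the comparison in the last statement to `=`.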

Related

I need to make backup from partitioned table (Hive)

I need to make backup data from partitioned table which has over 500 partitions.
My table has partitioning by date_part like "date_part = 20221101" or "date_part = 20221102" etc.
I need to take 30 partitions from 20221101 to 20221130 and make copy to another new backup-table.
If I do something like this:
create table <backup_table> as
select * from <data_table> where date_part between 20221101 and 20221130
I get a non-partitioned <backup_table> as the output. I don't know whether that is a good approach, but I guess a partitioned <backup_table> would be better.
If I try:
create table <backup_table> like <data_table>;
insert overwrite table <backup_table> partition (`date_part`)
select * from <data_table> where date_part between 20221101 and 20221130;
I get an error like "need to specify partition columns...".
If I go another way:
create table <backup_table> like <data_table>;
insert overwrite table <backup_table> partition (`date_part`)
select field1, field2...,
date_part
from <data_table> where date_part between 20221101 and 20221130;
I get other errors like "error running query" or "...nonstrict mode..." or something else.
I've tried a lot of Hive settings but it still doesn't work :(
That's why I need your help to do it correctly.
Enable dynamic partitioning and copy the data:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.mapred.mode = nonstrict;
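The answer lists only the settings. As a rough sketch of how they combine with the copy itself, the helper below assembles a full Hive script as text; `hive_partition_backup_script` is a hypothetical name, and the table and column names come from the question. It relies on two Hive behaviours: CREATE TABLE ... LIKE copies the partition definition, and SELECT * on a partitioned table returns the partition column last, which is where a dynamic-partition insert expects it.

```python
def hive_partition_backup_script(source, backup, partition_col, first, last):
    """Build a Hive script that copies a range of partitions into a backup table."""
    return "\n".join([
        # Settings taken from the answer above.
        "SET hive.exec.dynamic.partition = true;",
        "SET hive.exec.dynamic.partition.mode = nonstrict;",
        "SET hive.mapred.mode = nonstrict;",
        # LIKE copies the table definition, including the partition column.
        f"CREATE TABLE IF NOT EXISTS {backup} LIKE {source};",
        # No partition value is given, so Hive takes it from the
        # last column of each selected row.
        f"INSERT OVERWRITE TABLE {backup} PARTITION ({partition_col})",
        f"SELECT * FROM {source}",
        f"WHERE {partition_col} BETWEEN {first} AND {last};",
    ])


if __name__ == "__main__":
    print(hive_partition_backup_script(
        "data_table", "backup_table", "date_part", 20221101, 20221130))
```

If the columns are listed explicitly instead of using `SELECT *`, the partition column has to stay last in the list.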

How to generate a script to create all tables with different schema

I have about 200 tables in a schema.
I need to replicate these tables in a new backup schema with an automatic procedure.
I would like to create a procedure to dynamically recreate all the Tables in a Schema (potentially dynamic number of tables and columns) on a different schema.
I can cycle all the tables and create the SELECT * INTO dbo_b.TABLE FROM dbo.TABLE statement, but I get the error:
Column 'AMBIENTE' has a data type that cannot participate in a columnstore index.
I created a view that simply SELECT * FROM TABLE, and tried to perform the SELECT * INTO dbo_b.TABLE from dbo.VIEW but I got the same issue.
It works only if I create the dbo_b.Table and INSERT INTO it: so I would need to generate a script to automatically cycle all the tables in my schema and generate a script to create the tables in the new schema.
It's not a one time job, it should run every day so I cannot do it manually.
It seems we have the same issue.
You can try looping over all the tables and creating each one in the new schema like this:
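-- XXXX = source schema, YYYY = table name, ZZZZ = target schema
-- Drop the previous copy in the target schema, if there is one
IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.TABLES
           WHERE TABLE_NAME = 'YYYY' AND TABLE_SCHEMA = 'ZZZZ')
    DROP TABLE [ZZZZ].[YYYY]
-- Recreate it as a heap so that no columnstore index is built
CREATE TABLE [ZZZZ].[YYYY]
WITH ( DISTRIBUTION = ROUND_ROBIN
     , HEAP ) AS
( SELECT * FROM XXXX.YYYY )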
Let me know. BR
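To make the loop concrete, here is a minimal sketch that generates one such batch per table as text. `ctas_backup_script` is a hypothetical helper; the table names are assumed to come from elsewhere (for example a query on INFORMATION_SCHEMA.TABLES), and the existence check looks at the target schema, which is an assumption about what the drop is meant to guard.

```python
def ctas_backup_script(tables, source_schema, target_schema):
    """Return one drop-and-recreate batch per table name, joined into a script."""
    batches = []
    for table in tables:
        batches.append(
            # Drop the old copy in the target schema only if it exists.
            f"IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.TABLES\n"
            f"           WHERE TABLE_NAME = '{table}'"
            f" AND TABLE_SCHEMA = '{target_schema}')\n"
            f"    DROP TABLE [{target_schema}].[{table}];\n"
            # Recreate it as a heap, as in the answer above.
            f"CREATE TABLE [{target_schema}].[{table}]\n"
            f"WITH (DISTRIBUTION = ROUND_ROBIN, HEAP) AS\n"
            f"SELECT * FROM [{source_schema}].[{table}];"
        )
    return "\n\n".join(batches)


if __name__ == "__main__":
    # ORDERS and ITEMS are made-up table names used only for illustration.
    print(ctas_backup_script(["ORDERS", "ITEMS"], "dbo", "dbo_b"))
```

The generated text can then be executed by whatever scheduler runs the daily job.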

Listing all the partitions from BigQuery partitioned table with require_partition_filter

I am trying to find a way to list the partitions of a table created with require_partition_filter = true however I am not able to find the way yet.
This is table creation script
CREATE TABLE mydataset.partitionedtable_partitiontime
(
x INT64
)
PARTITION BY DATE(_PARTITIONTIME)
OPTIONS(
require_partition_filter = true
);
Some test rows
INSERT INTO mydataset.partitionedtable_partitiontime (_PARTITIONTIME, x) SELECT TIMESTAMP("2017-05-01"), 10;
INSERT INTO mydataset.partitionedtable_partitiontime (_PARTITIONTIME, x) SELECT TIMESTAMP("2017-04-01"), 20;
INSERT INTO mydataset.partitionedtable_partitiontime (_PARTITIONTIME, x) SELECT TIMESTAMP("2017-03-01"), 30;
As expected, if I try the following query to get the partitions, I get an error because I need to use a filter on the partitioning column
SELECT _PARTITIONTIME as pt, FORMAT_TIMESTAMP("%Y%m%d", _PARTITIONTIME) as partition_id
FROM `mydataset.partitionedtable_partitiontime`
GROUP BY _PARTITIONTIME
ORDER BY _PARTITIONTIME
Error
Cannot query over table 'mydataset.partitionedtable_partitiontime' without a filter over column(s) '_PARTITION_LOAD_TIME', '_PARTITIONDATE', '_PARTITIONTIME' that can be used for partition elimination
Any ideas on how to list the partitions?
EDIT: I know that it is possible to add the filter, but I am looking for a solution like "SHOW PARTITIONS TABLENAME" of Hive to list all the partitions (which are essentially metadata)
Thanks!
Here is the way to do it:
SELECT * FROM `mydataset.partitionedtable_partitiontime$__PARTITIONS_SUMMARY__`
The bigquery.jobs.create permission is required.
EDIT: It is now possible to get this information using Standard SQL:
SELECT * FROM `myproject.mydataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'partitionedtable'
As mentioned by hlagos, you can get this data by querying the _PARTITIONTIME pseudo column if you are using Standard SQL, or the __PARTITIONS_SUMMARY__ meta table if you are using Legacy SQL.
You can take a look at this GCP documentation, which contains detailed information about using partitioned table metadata.

BigQuery Equivalent of "CREATE TABLE my_table (LIKE your_table)"

I want to create a table whose schema is exactly the same as another table's. In other SQL engines, I think I was able to use "CREATE TABLE my_table (LIKE your_table)" or some variation.
I couldn't find the equivalent in BigQuery yet. Is this possible in some fashion?
Use this form:
CREATE TABLE dataset.new_table AS
SELECT *
FROM dataset.existing_table
LIMIT 0
This creates a new table with the same schema as the old one, and there is no cost due to the LIMIT 0.
Note that this does not preserve partitioning, table description, etc., however. Another option is to use the CLI (or API), making a copy of the table and then overwriting its contents, e.g.:
$ bq cp dataset.existing_table dataset.new_table
$ bq query --use_legacy_sql --replace --destination_table=dataset.new_table \
"SELECT * FROM dataset.new_table LIMIT 0;"
Now the new table has the same structure and attributes as the original did.
To create a partitioned and/or clustered table the syntax would be:
CREATE TABLE project.dataset.clustered_table
PARTITION BY DATE(created_time)
CLUSTER BY
account_id
AS SELECT * FROM project.dataset.example_table LIMIT 0

Insert overwrite in Hive

I am trying to use Insert overwrite in Hive. Basically I would like to insert overwrite not the complete partition but only a few records in the partition. I am not finding any solution to do it (Insert overwrite in destination table based on a filter on non partition column also).
Is there any way I can achieve it?
Hive is not a regular RDBMS. If you want to update records, simply use INSERT OVERWRITE TABLE Table_Name: prepare the changed data in a temporary table or a WITH clause, then insert overwrite from it. With table partitioning this is safe.
QUERY[HIVE]:
WITH TEMP_TABLE AS (SELECT * FROM SOURCE_TABLE_NAME) INSERT OVERWRITE TABLE TARGET_TABLE_NAME SELECT * FROM TEMP_TABLE
Hive is not an RDBMS. What you are trying to achieve with Hive is not recommended. Hive is better suited for batch processing over very large sets of immutable data.
However, from what I could deduce, you are trying to update an existing record in your table. To do so, enable ACID support on the table that needs to be updated and your update queries will start working.
UPDATE <TABLE>
SET <COL1>='Value1',
    <COL2>='Value2'
WHERE <Some Condition That Only Evaluates To The Rows You Need Updated>