BigQuery Write Truncate with a partitioned table causes loss of partition information? - google-bigquery

We have recently partitioned most of our tables in BigQuery using the following method:
Run a Dataflow pipeline which reads a table and writes the data to a new partitioned table.
Copy the partitioned table back to the original table using a copy job with write truncate set.
Once complete, the original table contains the data from the newly created partitioned table; however, the original table is still not partitioned. So I tried the copy again, this time deleting the original table first, and it all worked.
The problem is that it takes 20 minutes to copy our partitioned table, which would cause downtime for our production application. So is there any way of doing a write truncate copy in which a partitioned table replaces a non-partitioned table, without causing any downtime? Or will we need to delete the table first in order to replace it?

Sorry but you cannot change a non-partitioned table to partitioned, or vice versa. You will have to delete and re-create the table.
Couple of workarounds I can think of:
Keep both tables while you're migrating your queries to the partitioned table. After all queries are migrated, delete the original, non-partitioned table.
If you are using standard SQL, you can replace the original table with a view on top of the partitioned table. Deleting the original table and replacing it with a view should be very quick, and partition pruning should still work through the view, so you're only charged for the queried partitions. Partition pruning might not work with legacy SQL.
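A minimal sketch of the view workaround in standard SQL. The dataset and table names (mydataset.orders, mydataset.orders_partitioned) are hypothetical, and this assumes DDL statements are available to you; the view could equally be created through the UI or the bq tool:

-- After deleting the original non-partitioned table, recreate its name
-- as a thin view over the new partitioned table so existing queries keep working.
CREATE VIEW `mydataset.orders` AS
SELECT *
FROM `mydataset.orders_partitioned`;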

Related

How to handle source table row deletions in SQL ETL?

I usually follow this strategy to load fact tables via ETL (a rough sketch follows the list):
Truncate the staging table
Insert new rows (those added after the previous ETL run) into the fact table and changed rows into the staging table
Perform updates on the fact table based upon the data in the staging table
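Here is a rough sketch of that pattern; the table names (fact_sales, stg_sales_changes, src_sales) and the change-detection predicates are hypothetical:

-- 1. Truncate the staging table
TRUNCATE TABLE stg_sales_changes;

-- 2a. Insert brand-new source rows straight into the fact table
INSERT INTO fact_sales (sale_id, amount, last_modified)
SELECT s.sale_id, s.amount, s.last_modified
FROM src_sales s
WHERE NOT EXISTS (SELECT 1 FROM fact_sales f WHERE f.sale_id = s.sale_id);

-- 2b. Insert changed rows into the staging table
INSERT INTO stg_sales_changes (sale_id, amount, last_modified)
SELECT s.sale_id, s.amount, s.last_modified
FROM src_sales s
JOIN fact_sales f ON f.sale_id = s.sale_id
WHERE s.last_modified > f.last_modified;

-- 3. Apply the staged changes to the fact table
UPDATE fact_sales
SET amount = (SELECT c.amount FROM stg_sales_changes c
              WHERE c.sale_id = fact_sales.sale_id),
    last_modified = (SELECT c.last_modified FROM stg_sales_changes c
                     WHERE c.sale_id = fact_sales.sale_id)
WHERE EXISTS (SELECT 1 FROM stg_sales_changes c
              WHERE c.sale_id = fact_sales.sale_id);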
The challenge I am facing is that rows can be deleted from the source table. How do I handle these deletions in the ETL? I want the deleted rows to be removed from the fact table as well. I cannot use a merge between the source OLTP table and the target data warehouse table, because that puts additional load on every ETL run.
Note: the source table has a last-modified date column, but it does not help here because the record disappears from the source table upon deletion.

Drop empty Impala partitions

Impala external table partitions still show up in the stats with a row count of 0 after deleting the data in HDFS, recovering partitions (ALTER TABLE table RECOVER PARTITIONS), refreshing (REFRESH table), and invalidating the metadata.
Dropping the partitions one by one works, but there are tens of partitions that should be removed, which would be quite tedious.
Dropping and recreating the table would also be an option, but that way all the statistics would be dropped together with the table.
Are there any other options in Impala to get this done?
Found a workaround through Hive.
Issuing MSCK REPAIR TABLE tablename SYNC PARTITIONS and then refreshing the table in Impala makes the empty partitions disappear.
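A minimal sketch of that workaround (tablename is a placeholder; the first statement runs in Hive, the second in Impala):

-- In Hive: drop partition metadata whose HDFS directories no longer exist
MSCK REPAIR TABLE tablename SYNC PARTITIONS;

-- In Impala: pick up the metadata change
REFRESH tablename;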

Merge, Partition and Remote Database - Performance Tuning Oracle

I want to tune my merge query, which inserts into and updates a table in Oracle based on a source table in SQL Server. The table is around 120 million rows, and normally around 120k records are inserted/updated daily. The merge takes around 1.5 hours to run. It uses a nested loop and the primary key index to perform the insert and update.
There is no record-update date in the source table to use, so all records are compared.
MERGE INTO abc tgt
USING
(
  SELECT ref_id, a, b, c
  FROM sourcetable@sqlserver_remote
) src
ON (tgt.ref_id = src.ref_id)
WHEN MATCHED THEN
  UPDATE SET
    .......
  WHERE
    DECODE(tgt.a, src.a, 1, 0) = 0
    OR ......
WHEN NOT MATCHED THEN
  INSERT (....) VALUES (.....);
COMMIT;
Since the table is huge and growing every day, I partitioned the table in DEV based on ref_id (10 groups) and created a local index on ref_id.
Now it uses a hash join and a full table scan, and it runs longer than the existing process.
When I changed from a local to a global index on ref_id, it uses nested loops but still takes longer to run than the existing process.
Is there a way to performance-tune the process?
Thanks...
I'd be wary of joining/merging huge tables over a database link. I'd try to copy over the complete source table (for instance with a non-atomic materialized view, possibly compressed, possibly sorted, and certainly only the columns you'll need). After gathering statistics, I'd merge the target table with the local copy. Afterwards, the local copy can be truncated.
I wouldn't be surprised if partitioning speeds up the merge from the local copy to your target table.
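A rough sketch of that approach, assuming a scratch table src_copy holding only the needed columns (ref_id, a, b, c) and the database link name sqlserver_remote; the materialized-view variant mentioned above would replace the truncate-and-insert step:

-- Refresh a local, trimmed-down copy of the remote source table.
TRUNCATE TABLE src_copy;

INSERT /*+ APPEND */ INTO src_copy (ref_id, a, b, c)
SELECT ref_id, a, b, c
FROM sourcetable@sqlserver_remote;
COMMIT;

-- Gather statistics so the optimizer can choose a sensible plan for the merge.
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'SRC_COPY');

-- Merge locally; no database-link access inside the MERGE itself.
MERGE INTO abc tgt
USING (SELECT ref_id, a, b, c FROM src_copy) src
ON (tgt.ref_id = src.ref_id)
WHEN MATCHED THEN
  UPDATE SET tgt.a = src.a, tgt.b = src.b, tgt.c = src.c
  WHERE DECODE(tgt.a, src.a, 1, 0) = 0
     OR DECODE(tgt.b, src.b, 1, 0) = 0
     OR DECODE(tgt.c, src.c, 1, 0) = 0
WHEN NOT MATCHED THEN
  INSERT (ref_id, a, b, c) VALUES (src.ref_id, src.a, src.b, src.c);
COMMIT;

-- The local copy can be truncated again afterwards if space matters.
TRUNCATE TABLE src_copy;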

Populating a table from a view in Oracle with "locked" truncate/populate

I would like to populate a table from a (potentially large) view on a scheduled basis.
My process would be (roughly sketched after the list):
Disable indexes on table
Truncate table
Copy data from view to table
Enable indexes on table
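For reference, a rough sketch of those steps in Oracle syntax, with hypothetical names (my_table, my_view, my_table_idx):

-- 1. Disable indexes on the table
ALTER INDEX my_table_idx UNUSABLE;

-- 2. Truncate the table (DDL, so it commits implicitly in Oracle)
TRUNCATE TABLE my_table;

-- 3. Copy data from the view into the table
INSERT /*+ APPEND */ INTO my_table
SELECT * FROM my_view;
COMMIT;

-- 4. Re-enable (rebuild) the indexes
ALTER INDEX my_table_idx REBUILD;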
In SQL Server, I can wrap the process in a transaction such that when I truncate the table a schema modification lock will be held until I commit. This effectively means that no other process can insert/update/whatever until the entire process is complete.
However I am aware that in Oracle the truncate table statement is considered DDL and will thus issue an implicit commit.
So my question is how can I mimic the behaviour of SQL Server here? I don't want any other process trying to insert/update/whatever whilst I am truncating and (re)populating the table. I would also prefer my other process to be unaware of any locks.
Thanks in advance.
Make your table a partitioned table with a single partition and local indexes only. Then whenever you need to refresh:
Copy data from view into a new temporary table
CREATE TABLE tmp AS SELECT ... FROM some_view;
Exchange the partition with the temporary table:
ALTER TABLE some_table
EXCHANGE PARTITION part WITH TABLE tmp
WITHOUT VALIDATION;
The table is only locked for the duration of the partition exchange, which, without validation and global index update, should be instant.
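A minimal sketch of the one-time setup this relies on, with hypothetical names (some_table, some_view, part, and an id column used only as a dummy partitioning key); INCLUDING INDEXES is added on the assumption that tmp carries an index matching the local one:

-- One-time setup: a range-partitioned table with a single catch-all partition
-- and local indexes only (a global index would be invalidated by the exchange).
CREATE TABLE some_table (
  id  NUMBER,
  val VARCHAR2(100)
)
PARTITION BY RANGE (id) (
  PARTITION part VALUES LESS THAN (MAXVALUE)
);

CREATE INDEX some_table_idx ON some_table (id) LOCAL;

-- Refresh cycle: build the staging copy, then swap it in near-instantly.
CREATE TABLE tmp AS SELECT * FROM some_view;
CREATE INDEX tmp_idx ON tmp (id);

ALTER TABLE some_table
  EXCHANGE PARTITION part WITH TABLE tmp
  INCLUDING INDEXES
  WITHOUT VALIDATION;

-- tmp now holds the previous contents and can be dropped or reused.
DROP TABLE tmp;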

Time associated with dropping a table

Does the time it takes to drop a table in SQL reflect the quantity of data within the table?
Let's say we are dropping two otherwise identical tables, one with 100,000 rows and one with 1,000 rows.
Is this MySQL? The ext3 Linux filesystem is known to be slow at dropping large tables. The time to drop has more to do with the on-disk size of the table than with the number of rows in it.
http://www.mysqlperformanceblog.com/2009/06/16/slow-drop-table/
It will definitely depend on the database server, on the specific storage parameters in that server, and on whether the table contains LOBs. For MOST scenarios, it's very fast.
In Oracle, dropping a table is very quick and unlikely to matter compared to other operations you would be performing (creating the table and populating it).
It would not be idiomatic in Oracle to create and drop tables often enough for this to be any sort of performance factor. You would instead consider using global temporary tables.
From http://download.oracle.com/docs/cd/B28359_01/server.111/b28318/schema.htm
In addition to permanent tables, Oracle Database can create temporary tables to hold session-private data that exists only for the duration of a transaction or session.
The CREATE GLOBAL TEMPORARY TABLE statement creates a temporary table that can be transaction-specific or session-specific. For transaction-specific temporary tables, data exists for the duration of the transaction. For session-specific temporary tables, data exists for the duration of the session. Data in a temporary table is private to the session. Each session can only see and modify its own data. DML locks are not acquired on the data of the temporary tables. The LOCK statement has no effect on a temporary table, because each session has its own private data.
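To make the suggestion concrete, a minimal sketch of a global temporary table; the table and column names are hypothetical, and the ON COMMIT clause chooses between transaction-specific and session-specific behaviour:

-- Transaction-specific: rows vanish automatically at COMMIT/ROLLBACK.
CREATE GLOBAL TEMPORARY TABLE staging_rows (
  id  NUMBER,
  val VARCHAR2(100)
) ON COMMIT DELETE ROWS;

-- Session-specific variant: rows survive until the session ends.
-- CREATE GLOBAL TEMPORARY TABLE staging_rows (...) ON COMMIT PRESERVE ROWS;

INSERT INTO staging_rows (id, val) VALUES (1, 'scratch data');
-- ... work with the session-private data ...
COMMIT;  -- the rows are gone, but the table definition remains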