I have an existing table partitioned by ingestion time (_PARTITIONTIME). That table got corrupted and I had to fix it by retrieving data from another table. The problem is that I now have nearly 95% of my data sharing the same _PARTITIONTIME (the date of the fix).
My data has a timestamp field, my_timestamp, that could be used for correct partitioning.
But the other constraint I have is that there are multiple external connectors querying this table using the _PARTITIONTIME field, and I want to avoid updating these queries.
I would like to alter the table such that, for every row currently in the table, _PARTITIONTIME takes the value of the my_timestamp field, and for rows appended later, the ingestion time is used.
Is it possible to do such a thing?
I finally found the solution, which was actually pretty simple, but I post it here anyway in case someone else tries something over-complicated for such a simple use case.
UPDATE `MY_TABLE`
SET _PARTITIONTIME = my_timestamp
WHERE DATE(_PARTITIONTIME) = [THE_FIX_DATE]
For some reason I expected the _PARTITIONTIME to be immutable but it is not.
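One caveat worth checking: daily partitions are day-aligned, so if my_timestamp carries a time-of-day component the update may be rejected. A hedged variant of the same statement that truncates the value to the partition boundary first:
UPDATE `MY_TABLE`
-- Truncate to midnight so the value lines up with a daily partition boundary.
SET _PARTITIONTIME = TIMESTAMP_TRUNC(my_timestamp, DAY)
WHERE DATE(_PARTITIONTIME) = [THE_FIX_DATE]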
Problem statement:
I need to insert/update a few columns in a BigQuery table that is partitioned by date. So basically I need to make the necessary changes for each partition date (partitioned by day).
(It's the sessions table that is created automatically by linking the GA view to BQ, so I haven't set up the partitioning manually; it's automatically taken care of by Google.)
query reference from google_docs
My query:
I also tried the below:
Can anyone help me here? Sorry, I am a bit naive with BQ.
You are trying to insert into a wildcard table, a meta-table that is actually composed of multiple tables. A wildcard table is read-only and cannot be inserted into.
As Hua said, ga_sessions_* is not a partitioned table, but represents many tables, each with a different suffix.
You probably want to do this then:
INSERT INTO `p.d.ga_sessions_20191125` (visitNumber, visitId)
SELECT 1, 1574
I was searching for best practices for creating partitions by date using amazon-redshift-spectrum, but the examples show the problem being solved by partitioning the table by one date only. What should I do if I have more than one date field?
E.g., mobile events with user_install_date and event_date.
How performant is it to partition your S3 data like this:
installdate=2015-01-01/eventdate=2017-01-01
installdate=2015-01-01/eventdate=2017-01-02
installdate=2015-01-01/eventdate=2017-01-03
Will it kill my SELECT performance? What is the best strategy in this case?
If your data were partitioned in the above manner, then a query that merely had eventdate in the WHERE clause (without installdate) would be less efficient.
It would still need to look through every installdate directory, but it could skip over eventdate directories that do not match the predicate.
Put the less-used parameter second.
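For reference, a minimal sketch of a two-key Spectrum layout (schema, bucket, and column names are assumptions):
-- External table partitioned by both dates; types and names are hypothetical.
CREATE EXTERNAL TABLE spectrum.mobile_events (
    user_id    VARCHAR(64),
    event_name VARCHAR(128)
)
PARTITIONED BY (installdate DATE, eventdate DATE)
STORED AS PARQUET
LOCATION 's3://my-bucket/mobile_events/';

-- Register one partition per (installdate, eventdate) combination.
ALTER TABLE spectrum.mobile_events
ADD PARTITION (installdate = '2015-01-01', eventdate = '2017-01-01')
LOCATION 's3://my-bucket/mobile_events/installdate=2015-01-01/eventdate=2017-01-01/';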
Our use case for BigQuery is a little unique. I want to start using Date-Partitioned Tables but our data is very much eventual. It doesn't get inserted when it occurs, but eventually when it's provided to the server. At times this can be days or even months before any data is inserted. Thus, the _PARTITION_LOAD_TIME attribute is useless to us.
My question: is there a way I can specify the column that would act like the _PARTITION_LOAD_TIME attribute and still have the benefits of a date-partitioned table? If I could emulate this manually and have BigQuery update accordingly, I could start using date-partitioned tables.
Anyone have a good solution here?
You don't need to create your own column.
The _PARTITIONTIME pseudo column will still work for you!
The only thing you need to do is insert/load each batch of data into its respective partition by referencing not just the table name but the table with a partition decorator, like yourtable$20160718.
This way you can load data into the partition it belongs to.
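For example, with the bq CLI (dataset, table, and file names are made up):
# Load a batch directly into the 2016-07-18 partition via the $ decorator.
bq load --source_format=CSV 'mydataset.yourtable$20160718' gs://my-bucket/batch.csv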
I'm converting data from one schema to another. Each table in the source schema has a 'status' column (default NULL). When a record has been converted, I update the status column to 1. Afterwards, I can report on the # of records that are (not) converted.
While the conversion routines are still under development, I'd like to be able to quickly reset all values for status to NULL again.
An UPDATE statement on the tables is too slow (there are too many records). Does anyone know a fast alternative way to accomplish this?
The fastest way to reset a column would be to SET UNUSED the column, then add a column with the same name and datatype.
This will be the fastest way since both operations will not touch the actual table (only dictionary update).
As in Nivas' answer, the actual ordering of the columns will change (the reset column becomes the last column). If your code relies on the ordering of the columns (it should not!), you can create a view that has the columns in the right order (rename the table, create a view with the same name as the old table, revoke grants from the base table, add grants to the view).
The SET UNUSED method will not reclaim the space used by the column (whereas dropping the column will free space in each block).
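A minimal sketch of the two statements (table name and datatype are assumptions):
-- Mark the old column unused: a dictionary-only operation, effectively instant.
ALTER TABLE my_table SET UNUSED (status);
-- Re-add a fresh column with the same name; existing rows read it as NULL.
ALTER TABLE my_table ADD (status NUMBER);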
If the column is nullable (since default is NULL, I think this is the case), drop and add the column again?
While the conversion routines are still under development, I'd like to be able to quickly reset all values for status to NULL again.
If you are in development why do you need 70 million records? Why not develop against a subset of the data?
Have you tried using flashback table?
For example:
-- Capture the current SCN before making changes.
select current_scn from v$database;
-- 5607722
-- do a bunch of work
-- Row movement must be enabled before flashing back (ORA-08189 otherwise).
alter table TABLE_NAME enable row movement;
flashback table TABLE_NAME to scn 5607722;
What this does is ensure that the table you are working on is IDENTICAL each time you run your tests. Of course, you need to ensure you have sufficient UNDO to hold your changes.
Hm, maybe add an index to the status column.
Or alternately, add a new table with only the primary key in it. Then insert into that table when a record is converted, and TRUNCATE that table to reset...
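A sketch of that approach (table and column names are hypothetical):
-- Track converted rows in a key-only side table instead of a status column.
CREATE TABLE converted_records (id NUMBER PRIMARY KEY);

-- When a record is converted:
INSERT INTO converted_records (id) VALUES (:converted_id);

-- Resetting becomes a near-instant TRUNCATE instead of a massive UPDATE.
TRUNCATE TABLE converted_records;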
I like some of the other answers, but I just read in a tuning book that for several reasons it's often quicker to recreate the table than to do massive updates on the table. In this case, it seems ideal, since you would be writing the CREATE TABLE X AS SELECT with hopefully very few columns.
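A hedged sketch of that rebuild (the column list and names are assumptions):
-- Rebuild the table with status already NULL; a direct-path CTAS generates
-- far less undo than updating every row in place.
CREATE TABLE my_table_reset AS
SELECT id, payload, CAST(NULL AS NUMBER) AS status
FROM my_table;

-- Swap the tables (re-create indexes, constraints, and grants as needed).
RENAME my_table TO my_table_old;
RENAME my_table_reset TO my_table;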
Oracle 10g.
We have a large table partitioned by a varchar2 column (if it were up to me, it wouldn't be this column, but it is) with each partition having a single value, e.g. PARTITION "PARTITION1" VALUES ('C').
We also have NLS_COMP = LINGUISTIC.
Partition pruning doesn't happen when filtering on a value in that column.
SELECT * from table1 where column_partitioned_by = 'C'
That does a full table scan on all partitions and not only the relevant one.
According to the docs here, "The NLS_COMP parameter does not affect comparison behavior for partitioned tables."
If I issue:
ALTER SESSION SET NLS_COMP = BINARY
And then:
SELECT * from table1 where column_partitioned_by = 'C'
it does correctly prune the partitions down. (I'm basing the prune/not prune off of the plans generated)
Is there anything, short of hardcoding partition names into the from clause, that would work here?
Additionally, changing the partition definition is out as well. I'm in the minority on my team in even seeing this as a problem. Before I got there, the previous team decided to "solve" it by sending all application SQL queries through a string find-and-replace that adds hardcoded partition names to the FROM clause, and by having somebody manually update partition names in stored procedures as needed... but it will break one day, and it will break hard. I'm trying to find the least invasive approach, but I'm afraid there may not be one.
Preferably, it would be a solution that changes only the queries themselves and not the underlying DB structure. Like I said, such a solution simply may not exist...
Some solutions to prototype:
The CAST function. You can partition by an expression; the downside is your application would have to provide a similar expression.
Partition on NLSSORT(column_partitioned_by, 'NLS_SORT=BINARY'). Again, application changes required.
Converting column_partitioned_by to a numeric value, possibly using a code table to transform between the two. You'd have to include a join to that table throughout the application, though.
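If the application can wrap individual statements, the session-level toggle from the question is the only route that needs no DDL at all; a minimal sketch:
-- Switch to binary comparison just for the pruning-sensitive query,
-- then restore linguistic comparison for the rest of the session.
ALTER SESSION SET NLS_COMP = BINARY;
SELECT * from table1 where column_partitioned_by = 'C';
ALTER SESSION SET NLS_COMP = LINGUISTIC;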