In time-partitioned BigQuery tables, when is data written to __UNPARTITIONED__? What are the effects? - google-bigquery

I ran into some freak undocumented behavior of time-partitioned bigquery tables:
I created a time-partitioned table in BigQuery and inserted data.
I was able to insert normally - data was written to today's partition (I was also able to explicitly specify a partition and write into it)
After some tests with new data, I deleted today's partition in order to have clean data (CLI):
bq --project_id=my-project rm v1.mytable$20160613
I then checked whether it's empty:
select count(*) from [v1.mytable]
Result 270 instead of 0
I tried deleting again and rerunning the query - same result.
So I queried
select count(*) from [v1.mytable$20160613]
Result 0
I also checked a couple of previous dates on which I may have inserted data, but all were 0.
Finally I ran
SELECT partition_id from [v1.mytable$__PARTITIONS_SUMMARY__];
and the result was
__UNPARTITIONED__, 20160609, 20160613
and all the data was in fact in __UNPARTITIONED__.
My questions:
When is data written to this special partition instead of the daily partition, and how can I avoid this?
Are there other effects, apart from losing the ability to address specific dates (in queries, when deleting data, etc.)? Should I handle this case specially?

While data is in the streaming buffer, it remains in the __UNPARTITIONED__ partition. To address this partition in a query, you can use the value NULL for the _PARTITIONTIME pseudo column.
SELECT ... FROM mydataset.mypartitioned_table WHERE _PARTITIONTIME IS NULL
To delete data for a given partition, we suggest doing a write truncate to it with a query that returns an empty result. For example:
bq query --destination_table=mydataset.mypartitionedtable\$20160121 --replace 'SELECT 1 as field1, "one" as field2 FROM (SELECT 1 as field1, "one" as field2) WHERE FALSE'
Note that the partition will still be around (if you do a SELECT * from table$__PARTITIONS_SUMMARY__), but it will have 0 rows.
$ bq query 'SELECT COUNT(*) from [mydataset.mypartitionedtable$20160121]'
+-----+
| f0_ |
+-----+
|   0 |
+-----+

This is a temporary state: when I queried an hour later, the records all belonged to today's partition.
The effect is thus similar to a delay in the data write: a query issued immediately after the insert may not see the most recent data in the correct partition, but eventually it will be there.
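If you want to watch the buffered rows drain into their daily partitions, a standard SQL query along these lines (a sketch using the table name from the question) groups the row count per partition, where a NULL _PARTITIONTIME means the row is still in the streaming buffer:
#standardSQL
SELECT
  FORMAT_TIMESTAMP('%Y%m%d', _PARTITIONTIME) AS partition_id,  -- NULL = still in the streaming buffer
  COUNT(*) AS row_count
FROM `my-project.v1.mytable`
GROUP BY partition_id
ORDER BY partition_id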

Related

Bigquery Schedule query to load data to a particular partition

I am using the BigQuery scheduled query functionality to run a query every 30 mins.
My destination table is a partitioned table and the partitioning column is 'event_date'.
The scheduled query I am using copies today's data from source_table -> Dest_table
(like select * from source_table where event_date = CURRENT_DATE())
every 30 mins,
but I would like it to write-truncate the existing partition without write-truncating the whole table (since I don't want to duplicate today's data every 30 mins).
Currently, when I schedule this query with partition_field set to event_date and write_truncate, it truncates the whole table and this causes the previous data to be lost. Is there something else that I am missing?
Instead of specifying a destination table, you may use MERGE to truncate only one partition.
It is unfortunately more expensive, since you also pay for deleting the data from dest_table (the insert is still free).
MERGE dest_table t
USING (
  -- only today's rows from the source, matching the scheduled query in the question
  SELECT * FROM source_table WHERE event_date = CURRENT_DATE()
) s
ON FALSE
WHEN NOT MATCHED BY SOURCE AND event_date = CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW

reduce the amount of data scanned by Athena when using aggregate functions

The query below scans 100 MB of data.
select * from table where column1 = 'val' and partition_id = '20190309';
However, the query below scans 15 GB of data (there are over 90 partitions):
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
How can I optimize the second query to scan the same amount of data as the first?
There are two problems here: the efficiency of the scalar subquery select max(partition_id) from table, and the one @PiotrFindeisen pointed out around dynamic filtering.
The first problem is that queries over the partition keys of a Hive table are a lot more complex than they appear. Most folks would think that if you want the max value of a partition key, you can simply execute a query over the partition keys, but that doesn't work because Hive allows partitions to be empty (and it also allows non-empty files that contain no rows). Specifically, the scalar subquery select max(partition_id) from table requires Trino (formerly PrestoSQL) to find the max partition containing at least one row. The ideal solution would be to have perfect stats in Hive, but short of that the engine would need custom logic for Hive that opens files in the partitions until it finds a non-empty one.
If you are sure that your warehouse does not contain empty partitions (or if you are ok with the implications of that), you can replace the scalar subquery with one over the hidden "table$partitions" table:
select *
from table
where column1 = 'val' and
partition_id = (select max(partition_id) from "table$partitions");
The second problem is the one @PiotrFindeisen pointed out, and has to do with the way that queries are planned and executed. Most people would look at the above query, see that the engine should obviously figure out the value of select max(partition_id) from "table$partitions" during planning, inline that into the plan, and then continue with optimization. Unfortunately, that is a pretty complex decision to make generically, so the engine instead simply models this as a broadcast join, where one part of the execution figures out that value and broadcasts it to the rest of the workers. The problem is that the rest of the execution has no way to add this new information into the existing processing, so it simply scans all of the data and then filters out the values you are trying to skip. There is a project in progress to add this dynamic filtering, but it is not complete yet.
This means the best you can do today is to run two separate queries: one to get the max partition_id, and a second one with the value inlined.
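As a sketch of that two-step pattern, using the table and column names from the question (the literal in the second query is whatever value the first query returned):
-- step 1: cheap lookup of the newest partition from the hidden $partitions table
SELECT max(partition_id) FROM "table$partitions";
-- step 2: re-run the real query with that value inlined as a constant
SELECT * FROM table WHERE column1 = 'val' AND partition_id = '20190309';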
BTW, the hidden "$partitions" table was added in Presto 0.199, and we fixed some minor bugs in 0.201. I'm not sure which version Athena is based on, but I believe it is pretty far out of date (the current release at the time I'm writing this answer is 309).
EDIT: Presto removed the __internal_partitions__ table in their 0.193 release, so I'd suggest not using the solution defined in the Slow aggregation queries for partition keys section below in any production systems, since Athena 'transparently' updates Presto versions. I ended up just going with the naive SELECT max(partition_date) ... query, but also using the same lookback trick outlined in the Lack of Dynamic Filtering section. It's about 3x slower than using the __internal_partitions__ table, but at least it won't break when Athena decides to update their Presto version.
----- Original Post -----
So I've come up with a fairly hacky way to accomplish this for date-based partitions on large datasets, for when you only need to look back over a few partitions' worth of data for a match on the max. However, please note that I'm not 100% sure how brittle the use of the information_schema.__internal_partitions__ table is.
As @Dain noted above, there are really two issues. The first is how slow an aggregation like the max(partition_date) query is, and the second is Presto's lack of support for dynamic filtering.
Slow aggregation queries for partition keys
To solve the first issue, I'm using the information_schema.__internal_partitions__ table which allows me to get quick aggregations on the partitions of a table without scanning the data inside the files. (Note that partition_value, partition_key, and partition_number in the below queries are all column names of the __internal_partitions__ table and not related to your table's columns)
If you only have a single partition key for your table, you can do something like:
SELECT max(partition_value) FROM information_schema.__internal_partitions__
WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
But if you have multiple partition keys, you'll need something more like this:
SELECT max(partition_date) as latest_partition_date from (
  SELECT
    max(case when partition_key = 'partition_date' then partition_value end) as partition_date,
    max(case when partition_key = 'another_partition_key' then partition_value end) as another_partition_key
  FROM information_schema.__internal_partitions__
  WHERE table_schema = 'DATABASE_NAME' AND table_name = 'TABLE_NAME'
  GROUP BY partition_number
)
WHERE
  -- ... Filter down by values for e.g. another_partition_key
These queries should run fairly quickly (mine run in about 1-2 seconds) without scanning through the actual data in the files, but again, I'm not sure if there are any gotchas with using this approach.
Lack of Dynamic Filtering
I'm able to mitigate the worst effects of the second problem for my specific use case because I expect there to always be a partition within a finite amount of time back from the current date (e.g. I can guarantee any data-production or partition-loading issues will be remedied within 3 days). It turns out that Athena does do some pre-processing when using Presto's datetime functions, so this does not have the same kinds of issues with dynamic filtering as using a subquery.
So you can change your query to limit how far back it will look for the actual max using the datetime functions, so that the amount of data scanned is limited.
SELECT * FROM "DATABASE_NAME"."TABLE_NAME"
WHERE partition_date >= cast(date '2019-06-25' - interval '3' day as varchar) -- Will only scan partitions from 3 days before '2019-06-25'
AND partition_date = (
-- Insert the partition aggregation query from above here
)
I don't know if it is still relevant, but I just found out:
Instead of:
select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);
Use:
select a.* from table a
inner join (select max(partition_id) max_id from table) b on a.partition_id=b.max_id
where column1 = 'val';
I think it has something to do with optimizations of joins to use partitions.

How can I avoid and/or clean duplicated rows in BigQuery?

How should I import data into BigQuery on a daily basis when I have potential duplicate rows?
Here is a bit of context. I'm updating data on a daily basis from a spreadsheet to BigQuery. I'm using Google Apps Script with a simple WRITE_APPEND method.
Sometimes I'm importing data I've already imported the day before. So I'm wondering how I can avoid this?
Can I build a SQL query to clean duplicate rows from my table every day? Or is it possible to detect duplicates even before importing them (with some specific option in my job definition, for example)?
Thanks!
Step 1: Have a sheet with data to be imported
Step 2: Set up your spreadsheet as a federated data source in BigQuery.
Step 3: Use DML to load data into an existing table
(requires #standardSql)
#standardSQL
INSERT INTO `fh-bigquery.tt.test_import_native` (id, data)
SELECT *
FROM `fh-bigquery.tt.test_import_sheet`
WHERE id NOT IN (
SELECT id
FROM `fh-bigquery.tt.test_import_native`
)
WHERE id NOT IN (...) ensures that only rows with new ids are loaded into the table.
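One caveat, as a side note not from the original answer: if the destination table can contain NULL ids, NOT IN will filter out every row. A NOT EXISTS variant avoids that edge case (a sketch assuming the sheet has exactly the columns id and data):
#standardSQL
INSERT INTO `fh-bigquery.tt.test_import_native` (id, data)
SELECT s.id, s.data
FROM `fh-bigquery.tt.test_import_sheet` s
WHERE NOT EXISTS (
  SELECT 1
  FROM `fh-bigquery.tt.test_import_native` n
  WHERE n.id = s.id
)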
As far as I know, the answer provided by Felipe Hoffa is the most effective way to avoid duplicate rows, since BigQuery does not normalize data when loading it. The reason is that BigQuery performs best with denormalized data [1]. To better understand this, I'd recommend having a look at this SO thread.
I would also suggest using a SQL aggregate or analytic function to clean up the duplicate rows in a BigQuery table, as in Felipe Hoffa's or Jordan Tigani's answers to this SO question.
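As a minimal sketch of that aggregate-function approach (not the exact query from those answers; the table name is reused from above and id is assumed to be the deduplication key), which keeps one arbitrary row per id and can be written back over the table with a destination table plus WRITE_TRUNCATE:
#standardSQL
SELECT k.*
FROM (
  -- keep one full row (as a struct) per id
  SELECT ARRAY_AGG(t LIMIT 1)[OFFSET(0)] AS k
  FROM `fh-bigquery.tt.test_import_native` AS t
  GROUP BY id
)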
If you have a large partitioned table and only want to remove duplicates in a given range, without scanning (cost saving) and replacing the whole table, use the MERGE SQL below:
-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table
-- -------------------------------------------
-- -- To de-duplicate rows in a given range of a partitioned table, using surrogate_key as the unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `your_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
  SELECT k.*
  FROM (
    SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
    FROM `your_project`.`data_set`.`the_table` AS original_data
    WHERE stamp BETWEEN dt_start AND dt_end
    GROUP BY surrogate_key
  )
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
  AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in the partition range
  THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a

Alternatives to UPDATE statement Oracle 11g

I'm currently using Oracle 11g and let's say I have a table with the following columns (more or less)
Table1
ID varchar(64)
Status int(1)
Transaction_date date
tons of other columns
This table has about 1 billion rows. I want to update the status column with a specific WHERE clause, let's say
where transaction_date = somedatehere
What other alternatives can I use rather than just the normal UPDATE statement?
Currently what I'm trying is to use CTAS or INSERT INTO SELECT to get the rows that I want to update and put them in another table, while using AS COLUMN_NAME so the values are already updated in the new/temporary table. It looks something like this:
INSERT INTO TABLE1_TEMPORARY (
ID,
STATUS,
TRANSACTION_DATE,
TONS_OF_OTHER_COLUMNS)
SELECT
ID,
3 AS STATUS,
TRANSACTION_DATE,
TONS_OF_OTHER_COLUMNS
FROM TABLE1
WHERE
TRANSACTION_DATE = SOMEDATE
So far everything seems to work faster than the normal UPDATE statement. The problem now is that I also want the remaining data from the original table, the rows which I do not need to update but which do need to be included in my updated table/list.
What I tried first was to DELETE from the original table using the same WHERE clause, so that in theory everything left in that table would be the data I do not need to update, leaving me with the two tables:
TABLE1 --which now contains the rows that I did not need to update
TABLE1_TEMPORARY --which contains the data I updated
But the DELETE statement is itself too slow, or as slow as the original UPDATE statement, so skipping the delete brings me to this point:
TABLE1 --which contains BOTH the data that I want to update and do not want to update
TABLE1_TEMPORARY --which contains the data I updated
What other alternatives can I use to get the data that is the opposite of my WHERE clause? (Note that the WHERE clause in this example has been simplified, so I'm not looking for an answer using NOT EXISTS/NOT IN/NOT EQUALS; those clauses are also slower than positive clauses.)
I have ruled out deletion by partition since the data I need to update and not update can exist in different partitions, as well as TRUNCATE since I'm not updating all of the data, just part of it.
Is there some kind of JOIN statement I use with my TABLE1 and TABLE1_TEMPORARY in order to filter out the data that does not need to be updated?
I would also like to achieve this with as little REDO/UNDO/logging as possible.
Thanks in advance.
I'm assuming this is not a one-time operation, but you are trying to design for a repeatable procedure.
Partition/subpartition the table in a way so the rows touched are not totally spread over all partitions but confined to a few partitions.
Ensure your transactions wouldn't use these partitions for now.
For each partition/subpartition you would normally UPDATE, perform a CTAS of all the rows (even the rows which stay the same go to TABLE1_TEMPORARY). Then EXCHANGE PARTITION and rebuild the index partitions.
At the end, rebuild global indexes.
If you don't have Oracle Enterprise Edition, you would need to either CTAS the entire billion rows (followed by ALTER TABLE RENAME instead of ALTER TABLE EXCHANGE PARTITION) or prepare some kind of "poor man's partitioning" using a view (SELECT UNION ALL SELECT UNION ALL SELECT etc.) and a bunch of tables.
There is some chance that this mess would actually be faster than UPDATE.
I'm not saying that this is elegant or optimal, I'm saying that this is the canonical way of speeding up large UPDATE operations in Oracle.
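A rough sketch of one per-partition pass, assuming TABLE1 is partitioned by TRANSACTION_DATE and has a partition P_20150101 covering the date being updated (the partition name, the date literal, and the NOLOGGING choice are illustrative, not from the question):
-- 1) CTAS the whole partition, applying the "update" on the fly
CREATE TABLE TABLE1_TEMPORARY NOLOGGING AS
SELECT ID,
       CASE WHEN TRANSACTION_DATE = DATE '2015-01-01' THEN 3 ELSE STATUS END AS STATUS,
       TRANSACTION_DATE
       -- ... tons of other columns, in the same order as TABLE1
FROM   TABLE1 PARTITION (P_20150101);

-- 2) Swap the rebuilt table in as the partition (a fast metadata operation, minimal redo/undo)
ALTER TABLE TABLE1
  EXCHANGE PARTITION P_20150101 WITH TABLE TABLE1_TEMPORARY
  INCLUDING INDEXES WITHOUT VALIDATION;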
How about keeping the UPDATE on the same table, but breaking it into multiple small chunks?
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 0000000 and 0999999
COMMIT
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 1000000 and 1999999
COMMIT
UPDATE .. WHERE transaction_date = somedatehere AND id BETWEEN 2000000 and 2999999
COMMIT
This could help if the total workload is potentially manageable but doing it all in one chunk is the problem. This approach breaks it into modest-sized pieces.
Doing it this way could, for example, let other apps keep running and give other workloads a look-in, and would avoid needing a single humongous transaction in the logfile.

insertId equivalent for bq command line

I'm running some tests to avoid duplicates during insert. I have noticed that rows[].insertId could help avoid duplicates, but it seems the bq command line has no such parameter. I have tried with --undefok, but with no effect.
bq --apilog= --show_build_data insert --insert_id=201603210850 --template_suffix=_20160520 --dataset_id=mydataset --undefok=insert_id MYTEMPLATE.table myjson.json
Am I missing something?
AFAIK the insertId is only taken into account for streaming inserts, not load jobs.
And it's not a command-line switch; it's a value set on each row being ingested.
https://cloud.google.com/bigquery/streaming-data-into-bigquery#before_you_begin
Manually removing duplicates
You can use the following manual process to ensure that no duplicate rows exist after you are done streaming.
1) Add the insertID as a column in your table schema and include the insertID value in the data for each row.
2) After streaming has stopped, perform the following query to check for duplicates:
SELECT max(count) FROM (
  SELECT <id_column>, count(*) as count
  FROM <table>
  GROUP BY <id_column>)
If the result is greater than 1, duplicates exist.
3) To remove duplicates, perform the following query. You should specify a destination table, allow large results, and disable result flattening.
SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY <id_column>) row_number,
  FROM <table>
)
WHERE row_number = 1
Notes about the duplicate removal query:
The safer strategy for the duplicate removal query is to target a new table. Alternatively, you can target the source table with write disposition WRITE_TRUNCATE.
The duplicate removal query adds a row_number column with the value 1 to the end of the table schema. You can select by specific column names to omit this column.
For querying live data with duplicates removed, you can also create a view over your table using the duplicate removal query. Be aware that query costs against the view will be calculated based on the columns selected in your view, which can result in a large number of bytes scanned.
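For example, a standard SQL version of such a view might look like this (the dataset, table, and key column names are placeholders):
#standardSQL
CREATE VIEW `mydataset.mytable_deduped` AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id_column) AS rn
  FROM `mydataset.mytable`
)
WHERE rn = 1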