BigQuery: Check for duplicates during streaming

We have data generated by our devices installed on clients' sites. Duplicate records exist by design, which means we cannot eliminate them in the data-generating phase. We are now looking into avoiding duplicates while streaming into BigQuery (rather than cleaning the data later with a table copy and delete). That is to say: for every ready-to-be-streamed record, we first check whether it is already in BigQuery; if not, we stream it in; if it already exists, we skip it.
But here's the concern (quoted from https://developers.google.com/bigquery/streaming-data-into-bigquery):
Data availability
The first time a streaming insert occurs, the streamed data is inaccessible for a warm-up period of up to two minutes. After the warm-up period, all streamed data added during and after the warm-up period is immediately queryable. After several hours of inactivity, the warm-up period will occur again during the next insert.
Data can take up to 90 minutes to become available for copy and export operations.
Our data will go into different BigQuery tables (the table name is dynamically generated from the data's date_time). What does "the first time a streaming insert occurs" mean? Is it per table?
Does the above doc mean that we cannot rely on query results to check for duplicates while streaming?

If you provide an insertId, BigQuery will automatically do the de-duplication for you, as long as the duplicates are within the de-duplication window. The official docs don't say how long the de-duplication window is, but it is generally from 5 minutes to 90 minutes (if you write data to a table very quickly, it will be closer to 5 than 90; if data trickles in, it will last longer in the de-duplication buffers).
Regarding "the first time a streaming insert occurs", this is per table. If you have a new table and start streaming to it, it may take a few minutes for that data to be available for querying. Once you've started streaming, however, new data will be available immediately.

Related

How long does a BigQuery streaming buffer persist?

I am migrating data from a SQL Server database system to BigQuery at the moment, and I have encountered a problem when trying to delete records from a BigQuery table with an active streaming buffer. Can you confirm how long a streaming buffer persists before it is removed, so that the delete operation can run against the table? This has caused unnecessary inconvenience during development.
According to the official documentation:
Data can take up to 90 minutes to become available for copy and export operations. Also, when streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column. To see whether data is available for copy and export, check the tables.get response for a section named streamingBuffer. If that section is absent, your data should be available for copy or export, and should have a non-null value for the _PARTITIONTIME pseudo column. Additionally, the streamingBuffer.oldestEntryTime field can be leveraged to identify the age of records in the streaming buffer.
Streaming into partitioned tables
When data is streamed, records with timestamps between 7 days in the past and 3 days in the future are placed in the streaming buffer and then extracted to the corresponding partitions. Data outside of this window (but within the allowed range of 1 year in the past and 6 months in the future) is placed in the streaming buffer and then extracted to the UNPARTITIONED partition. When there is enough unpartitioned data, it is loaded into the corresponding partitions.
We worked around this situation by delaying the delete requests, or by running them only once every 24 hours. You could also script the delete query to take streamingBuffer.oldestEntryTime as a parameter and attempt to delete anything older than that.
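As a sketch of that check with the Python client (the table name and the load_time column are assumptions for illustration):

    from google.cloud import bigquery

    client = bigquery.Client()
    table = client.get_table("my-project.my_dataset.my_table")  # hypothetical

    buf = table.streaming_buffer
    if buf is None:
        print("No streaming buffer; the table is safe for DML, copy and export.")
    else:
        # Delete only rows that have already been extracted from the buffer,
        # i.e. strictly older than the oldest entry still sitting in it.
        job = client.query(
            "DELETE FROM `my-project.my_dataset.my_table` "
            "WHERE load_time < @oldest",
            job_config=bigquery.QueryJobConfig(
                query_parameters=[
                    bigquery.ScalarQueryParameter(
                        "oldest", "TIMESTAMP", buf.oldest_entry_time
                    )
                ]
            ),
        )
        job.result()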
https://cloud.google.com/bigquery/streaming-data-into-bigquery

Populating a fact table with different sequence times

I am using the following query to populate my fact table:
    SELECT sh.isbn_l, sh.id_c, sh.id_s, sh.data, sh.quantity, b.price
    FROM Book AS b
    INNER JOIN Sales AS sh
      ON b.isbn = sh.isbn_l
The main thing is that I want to load the table from a specific time to a specific time. So if I load today, I will get all the records from the last time I loaded up until now.
And if I load it again the day after tomorrow, I will get the data from today after the load time up until the day after tomorrow.
What I mean is: NO DUPLICATED ROWS OR DATA. What should I do?
Any ideas, please?
Thank you in advance.
Streams (and maybe Tasks) are your friend here.
A Snowflake Stream records the delta of change data capture (CDC) information for a table (such as a staging table), including inserts and other DML changes. A stream allows querying and consuming a set of changes to a table, at the row level, between two transactional points of time.
In a continuous data pipeline, table streams record when staging tables and any downstream tables are populated with data from business applications using continuous data loading and are ready for further processing using SQL statements.
Snowflake Tasks may optionally use table streams to provide a convenient way to continuously process new or changed data. A task can transform new or changed rows that a stream surfaces. Each time a task is scheduled to run, it can verify whether a stream contains change data for a table (using SYSTEM$STREAM_HAS_DATA) and either consume the change data or skip the current run if no change data exists.
Users can define a simple tree-like structure of tasks that executes consecutive SQL statements to process data and move it to various destination tables.
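Mapped onto your fact table, a rough sketch via the Python connector (connection details, object names, and the 15-minute schedule are all assumptions):

    import snowflake.connector  # pip install snowflake-connector-python

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="my_password",
        warehouse="my_wh", database="my_db", schema="public",
    )
    cur = conn.cursor()

    # Record row-level changes (inserts and other DML) on the staging table.
    cur.execute("CREATE OR REPLACE STREAM sales_stream ON TABLE sales_staging")

    # Run every 15 minutes, but only when the stream has new rows. Consuming
    # the stream in a DML statement advances its offset, so each row is
    # processed exactly once -- no duplicated rows in the fact table.
    cur.execute("""
        CREATE OR REPLACE TASK load_fact_sales
          WAREHOUSE = my_wh
          SCHEDULE = '15 MINUTE'
          WHEN SYSTEM$STREAM_HAS_DATA('SALES_STREAM')
        AS
          INSERT INTO fact_sales (isbn_l, id_c, id_s, data, quantity, price)
          SELECT sh.isbn_l, sh.id_c, sh.id_s, sh.data, sh.quantity, b.price
          FROM sales_stream AS sh
          INNER JOIN book AS b ON b.isbn = sh.isbn_l
    """)
    cur.execute("ALTER TASK load_fact_sales RESUME")  # tasks are created suspended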
https://docs.snowflake.com/en/user-guide/data-pipelines-intro.html

BigQuery streamed data is not in table

I've got an ETL process which streams data from a mongo cluster to BigQuery. This runs via cron on a weekly basis, and manually when needed. I have a separate dataset for each of our customers, with the table structures being identical across them.
I just ran the process, only to find that while all of my data chunks returned a "success" response ({"kind": "bigquery#tableDataInsertAllResponse"}) from the insertAll API, the table is empty for one specific dataset.
I had seen this happen a few times before, but was never able to reproduce it. I've now run the process twice more with the same results. I know my code works, because the other datasets are populated properly.
There's no 'streaming buffer' in the table details, and running a count(*) query returns 0. I've even tried removing cached results from the query to force freshness, but nothing helps.
Edit: ten minutes after my data stream (I keep timestamped logs), partial data now appears in the table; however, after another 40 minutes, it doesn't look like any new data is flowing in.
Is anyone else experiencing hiccups in streaming service?
It might be worth mentioning that part of my process is to copy the existing table to a backup table, remove the original table, and recreate it with the latest schema. Could this be affecting the inserts in some specific edge cases?
This is probably what is happening to you: BigQuery table truncation before streaming not working
If you delete or create a table, you must wait at least 2 minutes before starting to stream data into it.
Since you mentioned that all the other tables work correctly and only the table that goes through the delete-and-recreate process is not saving data, this probably explains what you are observing.
To fix the issue, either wait a bit longer after the delete and create operations before streaming data, or change the upload strategy (for example, save the data to a CSV file and load it into the table with a job insert method instead).
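A defensive sketch of the first option (the 150-second pause, table name, and schema are assumptions, not documented values):

    import time
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.customer_data"  # hypothetical

    # Recreate the table with the latest schema.
    client.delete_table(table_id, not_found_ok=True)
    client.create_table(bigquery.Table(table_id, schema=[
        bigquery.SchemaField("id", "STRING"),
        bigquery.SchemaField("payload", "STRING"),
    ]))

    # Streaming to a freshly (re)created table can silently drop rows for
    # the first couple of minutes, so pause before the first insert.
    time.sleep(150)

    errors = client.insert_rows_json(table_id, [{"id": "1", "payload": "hello"}])
    if errors:
        print("insert errors:", errors)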

BigQuery range decorator duplicate issue

We are facing issues with BigQuery range decorators on a streaming table: the range decorator queries return duplicate data.
My case:
My BQ table regularly receives data from customer events through streaming inserts. Another job periodically fetches time-bound data from the table using a range decorator and sends it to Dataflow jobs, like this:
The first time, I fetched all the data from the table using

    SELECT * FROM [project_id:alpha.user_action@1450287482158]

When I ran this query I got 91 records.
After 15 minutes, I ran another query based on the last interval:

    SELECT * FROM [alpha.user_action@1450287482159-1450291802380]

This also gave the same 91 records.
However, when I ran the first query again to cross-check,

    SELECT * FROM [project_id:alpha.user_action@1450287482158]

it returned no data.
Any help on this?
First off, have you tried using streaming Dataflow? That might be a better fit (though your logic is not expressible as a query). Streaming Dataflow also supports tee-ing your writes, so you can keep both raw data and aggregate results.
On to your question:
Unfortunately this is a collision of two concepts that were built concurrently and somewhat independently, thus resulting in ill-defined interactions.
Time range table decorators were designed/built in a world where only load jobs existed. As such, blocks of data are atomically committed to a table at a single point in time. Time range decorators work quite well with this, as there are clear boundaries of inclusion/exclusion, and the relationship is stable.
Streaming Ingestion + realtime query is somewhat counter to the "load job" world. BigQuery is buffering your data for some period of time, making it available for analysis, and then periodically flushing the buffers onto the table using the traditional storage means. While the data is buffered, we have "infinite" time granularity. However, once we flush the buffer onto the table, that infinite granularity is compressed into a single time, which is currently the flush time.
Thus, using time range decorators on streaming tables can unfortunately result in some unexpected behaviors, as the same data may appear in two non-overlapping time windows (once while it is buffered, and once when it is flushed).
Our recommendation, if you're trying to achieve windowed queries over recent data, is to do the following (a sketch appears after the list):
1. Include a system timestamp in your data.
2. For the table decorator timestamps, include some buffer around the actual window to account for clock skew between your clock and Google's, and for late arrivals from retries. This buffer should extend both before and after your target window.
3. Modify your query to apply your actual time window.
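For instance, a sketch of steps 1-3 with the Python client (the event_time column, the window bounds, and the 5-minute pad are assumptions):

    from datetime import datetime, timedelta, timezone
    from google.cloud import bigquery

    client = bigquery.Client()

    window_start = datetime(2015, 12, 16, 17, 0, tzinfo=timezone.utc)
    window_end = window_start + timedelta(minutes=15)
    pad = timedelta(minutes=5)  # allowance for clock skew and late retries

    # Millisecond epoch bounds for the (padded) legacy range decorator.
    dec_lo = int((window_start - pad).timestamp() * 1000)
    dec_hi = int((window_end + pad).timestamp() * 1000)
    start_s = window_start.strftime("%Y-%m-%d %H:%M:%S")
    end_s = window_end.strftime("%Y-%m-%d %H:%M:%S")

    # The decorator over-fetches; the WHERE clause re-applies the exact
    # window using the system timestamp recorded in the data itself.
    query = (
        "SELECT * FROM [project_id:alpha.user_action@%d-%d] "
        "WHERE event_time >= TIMESTAMP('%s') AND event_time < TIMESTAMP('%s')"
        % (dec_lo, dec_hi, start_s, end_s)
    )
    job = client.query(query, job_config=bigquery.QueryJobConfig(use_legacy_sql=True))
    rows = list(job.result())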
It should be noted that depending on your actual usage purpose, this may not address your problems. If you can give more details, there might be a way to achieve what you need.
Sorry for the inconvenience.

BigQuery: How long should we wait to stream data after a schema update?

I added three columns to my existing tables (changed the table schema). I can see them in the web UI.
But when I stream data in, I get "no such field" errors. I printed out the JSON content of the data and found that it actually matches the schema.
How long should I wait for the changed schema to become visible to streaming? Any rule of thumb?
The cache invalidation frequency is currently set to 8 hours for streaming schemas. So if you change your schema, it may take up to 8 hours to be able to stream data to that table.
This is set so high in order to deal with very high rate of inserts without overwhelming our metadata servers.
We've got an internal bug to lower the cache time to a few minutes or less.
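In the meantime, a workaround sketch (table and field names invented) is to retry rows that fail with "no such field" until the streaming backend picks up the new schema:

    import time
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.events"  # hypothetical
    row = {"existing_field": "x", "new_field": "y"}  # new_field just added

    # The streaming path caches table schemas, so "no such field" errors are
    # expected for a while after adding columns. Back off and retry.
    delay = 60
    for attempt in range(20):
        errors = client.insert_rows_json(table_id, [row])
        if not errors:
            break
        print("insert failed (attempt %d), retrying in %ds: %s"
              % (attempt + 1, delay, errors))
        time.sleep(delay)
        delay = min(delay * 2, 3600)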