BigQuery data transfer duration from intraday table to daily table - google-bigquery

I am using Firebase Analytics and BigQuery with an average of 50-60 GB of daily data.
For the most recent daily table, a query gives a different result than it did yesterday, even though the query conditions are exactly the same, including the target date.
I just found that there is a 1-2 day gap between the table's creation date and its last modified date.
I assume the difference between the query results is because of this (the query is calculating over a different data volume, maybe).
Does this date gap mean a single daily table needs at least 2 days to be fully loaded from the intraday table?
Thanks in advance.
[Screenshot: BigQuery table info]

In the documentation we can find the following information:
After you link a project to BigQuery, the first daily export of
events creates a corresponding dataset in the associated BigQuery
project. Then, each day, raw event data for each linked app populates
a new daily table in the associated dataset, and raw event data is
streamed into a separate intraday BigQuery table in real-time.
It seems that the intraday table is loaded into the main daily table each day, so if you want to access the data in real time you'll have to query the separate intraday table.
If this information doesn't help you, please provide some extra information so I can help you more efficiently.
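For illustration, here is a minimal sketch of querying the intraday table directly for up-to-the-minute data; the project ID, dataset name (analytics_123456789) and date suffix are placeholders you would replace with your own export dataset:

-- Real-time data lives in the intraday table until the daily table is finalized.
SELECT event_name, COUNT(*) AS event_count
FROM `my_project.analytics_123456789.events_intraday_20240102`
GROUP BY event_name
ORDER BY event_count DESC;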

Related

BigQuery events_intraday_ tables are being generated daily but no daily events_ table is created

I've been running the GA4 to BigQuery Streaming export for almost a month now, because the number of daily events is bigger than the daily export limit (2.7 million events vs. the 1 million limit).
Google docs (https://support.google.com/firebase/answer/7029846?hl=en):
If the Streaming export option is enabled, a table named events_intraday_YYYYMMDD is created. This table is populated continuously as events are recorded throughout the day. This table is deleted at the end of each day once events_YYYYMMDD is complete.
According to the docs I should have events_YYYYMMDD tables for previous days and events_intraday_YYYYMMDD table for current day. But that's not the case - all I'm stuck with are events_intraday_YYYYMMDD tables for previous days.
Am I missing something or not reading the docs correctly?
Should I or shouldn't I expect the events_YYYYMMDD tables to be automatically created and filled?
If that's the case then I guess I have to take care of doing this backup by myself?
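If you do end up snapshotting the intraday tables yourself, a minimal sketch could look like the following; the project, dataset and date are hypothetical, and in practice you would wrap this in a scheduled query that substitutes yesterday's date:

-- Materialize yesterday's intraday export into a backup table (names are placeholders).
CREATE TABLE IF NOT EXISTS `my_project.analytics_123456789.events_backup_20240101` AS
SELECT *
FROM `my_project.analytics_123456789.events_intraday_20240101`;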

To create a Hive history fact from a daily fact

I have a fact table which runs daily and does not store the previous day's data. I want to create a fact on top of this which will store the previous day's data along with the daily data.
Best Regards,
Santosh
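A minimal HiveQL sketch of one possible approach, assuming a source table daily_fact with hypothetical columns (order_id, amount): append each daily load into a history fact partitioned by load date.

-- Allow dynamic partition inserts (Hive settings).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- History fact keyed by the date each snapshot was loaded (columns are illustrative).
CREATE TABLE IF NOT EXISTS history_fact (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (load_dt STRING)
STORED AS ORC;

-- Append today's copy of the daily fact; the last column feeds the partition.
INSERT INTO TABLE history_fact PARTITION (load_dt)
SELECT
  order_id,
  amount,
  CAST(CURRENT_DATE AS STRING) AS load_dt
FROM daily_fact;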

Data goes missing when tables go from intraday to regular tables

I am using Firebase and BigQuery to make a dashboard. I found a discrepancy once the data was transferred from the "intraday table" to the "regular events table".
I've been saving the intraday table for the last three days so I can compare the values once the data is transferred to the regular events table. I found that something goes wrong during the transfer: some of the rows are missing after the data is moved to the regular table.
Does anyone know what needs to be done here?
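One way to pin down where rows disappear, assuming you kept a copy of the intraday table before it was converted (events_intraday_20240101_copy is a hypothetical name, as are the project and dataset), is to compare per-event counts against the finished daily table:

SELECT
  COALESCE(d.event_name, i.event_name) AS event_name,
  IFNULL(d.daily_count, 0)    AS daily_count,
  IFNULL(i.intraday_count, 0) AS intraday_count,
  IFNULL(i.intraday_count, 0) - IFNULL(d.daily_count, 0) AS rows_lost
FROM (
  SELECT event_name, COUNT(*) AS daily_count
  FROM `my_project.analytics_123456789.events_20240101`
  GROUP BY event_name
) d
FULL OUTER JOIN (
  SELECT event_name, COUNT(*) AS intraday_count
  FROM `my_project.analytics_123456789.events_intraday_20240101_copy`
  GROUP BY event_name
) i
ON d.event_name = i.event_name
ORDER BY rows_lost DESC;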

How to check if tables have refreshed in BigQuery or not?

Currently I have around 1000 tables, of which I need to track around 500 across various BigQuery datasets, and generate a report or dashboard so that we can monitor and act promptly if a table is not refreshed.
Could someone please tell me how I can do that with minimal usage of BigQuery slots.
I think you should be able to query the last modification time as shown here:
https://cloud.google.com/bigquery/docs/dataset-metadata
You could then add a table with the max allowed time interval for a table to be updated and include that table in the query to create your own alerts.
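As a rough sketch of that idea, using the dataset metadata described in the linked page: the thresholds table (monitoring.freshness_thresholds, with columns table_id and max_staleness_hours) is a hypothetical table you would maintain yourself, and the project/dataset names are placeholders.

-- Flag tables whose last modification is older than their allowed staleness.
SELECT
  t.table_id,
  TIMESTAMP_MILLIS(t.last_modified_time) AS last_modified,
  th.max_staleness_hours
FROM `my_project.my_dataset.__TABLES__` AS t
JOIN `my_project.monitoring.freshness_thresholds` AS th
  USING (table_id)
WHERE TIMESTAMP_MILLIS(t.last_modified_time)
      < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL th.max_staleness_hours HOUR);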
drftr
There is a preview feature, INFORMATION_SCHEMA.PARTITIONS, that gives you the LAST_MODIFIED_TIME per table in a dataset:
SELECT *
FROM yourDataset.INFORMATION_SCHEMA.PARTITIONS;

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a timestamp. However, our source data files are too big to edit before we add them to Cloud Storage (4+ GB of textual data per file). As far as I know there is no way to append a timestamp column to each row before bringing them into BigQuery, right?
We are thus toying with the idea of creating daily tables for each source, but we don't know how this will work when we have real-time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting them in Cloud Storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all, fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query).

If you want to get a little bit trickier, you could use the __DATASET__ pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the __DATASET__ pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
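The copy-with-timestamp step described above, rewritten as a sketch in today's standard SQL (the original answer predates it); my_dataset.temp_table and my_dataset.final_table are placeholder names:

-- Copy the staging table and stamp every row with the load time.
CREATE OR REPLACE TABLE `my_dataset.final_table` AS
SELECT
  t.*,
  CURRENT_TIMESTAMP() AS load_ts
FROM `my_dataset.temp_table` AS t;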
Another alternative to consider is the BigQuery streaming API (more info here). This lets you insert single rows or groups of rows into a table just by posting them directly to BigQuery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).
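For example, a legacy SQL sketch of a time range decorator that scans only data added in the last 7 days (604,800,000 ms); the table name is a placeholder:

-- Relative range decorator: from 7 days ago to now.
SELECT COUNT(*) AS recent_rows
FROM [my_dataset.my_table@-604800000-]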