I am using Firebase and BigQuery to build a dashboard. I found a discrepancy once the data was transferred from the intraday table to the regular events table.
I've been saving the intraday table for the last three days so I can compare the values after the data is transferred to the regular events table. It looks like something goes wrong during the transfer: some of the rows are dropped when the events data is moved to the regular table.
Does anyone know what needs to be done here?
Related
I have set up Google Analytics GA4 to send both streaming and daily data to BigQuery for a test project. When I first did this, the events_intraday_ table was converted to events_ table each day.
As of sometime this week, it stopped working. In the morning, I can see yesterday's intraday table, and when I send some data, I see that today's is created. Sometime during the day, yesterday's table will disappear but no matching events_ table shows up.
Possibly related: I removed some old events_ tables last week as I was figuring out what I wanted to be stored. Is it possible deleting them somehow triggered a constant delete?
All the tables have expiration dates set to NEVER.
Is there a setting/log file/anything I can look at to understand what went wrong? And any ideas how to fix it?
I am using Firebase Analytics and BigQuery with an average of 50-60 GB of daily data.
For the most recent daily table, a query gives a different result than it did yesterday, even though the query conditions are exactly the same, including the target date.
I just found that there is a 1-2 day gap between the table's creation date and its last modified date.
I assume the difference between the query results is because of this (the queries are running over different data volumes, maybe).
Does this date gap mean a single daily table needs at least 2 days to be fully loaded from the intraday table?
Thanks in advance.
BigQuery table info
In the documentation we can find the following information:
After you link a project to BigQuery, the first daily export of events creates a corresponding dataset in the associated BigQuery project. Then, each day, raw event data for each linked app populates a new daily table in the associated dataset, and raw event data is streamed into a separate intraday BigQuery table in real-time.
It seems that the intraday table is loaded into the main table each day, and if you want to access this data in real time you'll have to use this separate intraday table.
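For example, a quick way to compare the two is to query them together. Below is a minimal sketch (the project and dataset IDs are placeholders), relying on the fact that the events_* wildcard also matches the events_intraday_ table:

SELECT
  _TABLE_SUFFIX AS source_table,  -- e.g. '20240101' or 'intraday_20240102'
  COUNT(*) AS event_count
FROM `my_project.analytics_123456789.events_*`
GROUP BY source_table
ORDER BY source_table;

Comparing the counts per source_table is an easy way to spot rows that were dropped when the intraday data was converted to a daily table.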
If this information doesn't help you, please provide some extra information so I can help you more efficiently.
Google BigQuery has no primary key or unique constraints.
We cannot use traditional SQL options such as INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE, so how do you prevent duplicate records from being inserted into Google BigQuery?
If I have to call delete first (based on the unique key in my own system) and then insert to prevent duplicate records being inserted into BigQuery, wouldn't that be too inefficient? I would assume that insert is the cheapest operation: no query, just append data. If I have to call delete for each insert, it will be too inefficient and cost us extra money.
What advice and suggestions do you have based on your experience?
It would be nice if BigQuery had a primary key, but that might conflict with the algorithms/data structures that BigQuery is based on.
So let's clear up some facts first.
BigQuery is a managed data warehouse suitable for large datasets, and it's complementary to a traditional database, rather than a replacement.
Up until early 2020 there was a maximum of only 96 DML (UPDATE, DELETE) operations on a table per day. That low limit forced you to think of BQ as a data lake. The limit has since been removed, but it demonstrates that the early design of the system was oriented around "append-only".
So, on BigQuery, you actually let all data in, and favor an append-only design. That means that by design you have a database that holds a new row for every update. Hence if you want to use the latest data, you need to pick the last row and use that.
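One common way to do that "pick the last row" step is a window function over your own unique key. This is just a minimal sketch with made-up table and column names (user_id as the key, updated_at as the version timestamp):

SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id       -- the unique key from your own system
      ORDER BY updated_at DESC   -- newest version of the row first
    ) AS rn
  FROM `my_project.my_dataset.users`
)
WHERE rn = 1;                    -- keep only the latest version of each row

You can also save this query as a view so consumers always see the deduplicated, latest-state data while the underlying table stays append-only.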
We actually leverage insights from every new update we add to the same row. For example, we can detect how long it took the end user to choose his/her country in the signup flow. Because we have a dropdown of countries, it took some time until he/she scrolled to the right country, and the metrics show this: we ended up with two rows in BQ, one before the country was selected and one after, and based on the time between them we were able to optimize the process. Now our country drop-down lists the 5 most recent/frequent countries first, so those users no longer need to scroll to pick a country; it's faster.
"Bulk Delete and Insert" is the approach I am using to avoid the duplicated records. And Google's own "Youtube BigQuery Transfer Services" is using "Bulk Delete and Insert" too.
"Youtube BigQuery Transfer Services" push daily reports to the same set of report tables every day. Each record has a column "date".
When we run Youtube Bigquery Transfer backfill (ask youtube bigquery transfer to push the reports for certain dates again.) Youtube BigQury Transfer services will first, delete the full dataset for that date in the report tables and then insert the full dataset of that date back to the report tables again.
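As a rough illustration of that backfill pattern (the table and column names below are made up, not the actual transfer service tables):

-- 1. Delete everything already stored for the date being backfilled.
DELETE FROM `my_project.reports.channel_daily_report`
WHERE report_date = DATE '2024-01-15';

-- 2. Insert the full dataset for that date again.
INSERT INTO `my_project.reports.channel_daily_report` (report_date, video_id, views)
SELECT report_date, video_id, views
FROM `my_project.staging.channel_daily_report_20240115`;

Because the delete and the insert both cover the whole date, running the backfill twice never produces duplicates.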
Another approach is to drop the results table first (if it already exists), then re-create it and load the results into it again. I have used this approach a lot. Every day, my process saves its results in results tables in the daily dataset. If I rerun the process for a given day, my script checks whether the results tables for that day exist. If a table exists for that day, it deletes it, re-creates a fresh new table, and loads the process results into the newly created table.
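In today's Standard SQL that drop-and-recreate step can be done in a single statement. A minimal sketch with placeholder names:

-- Drops the existing results table for the day (if any) and rebuilds it
-- from the day's process output in one atomic statement.
CREATE OR REPLACE TABLE `my_project.daily_results.results_20240115` AS
SELECT *
FROM `my_project.staging.process_output`
WHERE run_date = DATE '2024-01-15';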
BigQuery now doesn't have DML limits.
https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery
I am trying to copy a BigQuery table using the API from one table to the other in the same dataset.
While copying big tables seems to work just fine, when copying small tables with a limited number of rows (1-10) I noticed that the destination table comes out empty (created, but with 0 rows).
I get the same results using the API and the BigQuery management console.
The issue is reproducible with any table in any dataset I have. It looks like either a bug or designed behavior.
I could not find any minimum-row requirement in the docs... am I missing something?
EDIT:
Screenshots
Original table: video_content_events with 2 rows
Copy table: copy111 with 0 rows
How are you populating the small tables? Are you perchance using streaming insert (bq insert from the command line tool, tabledata.insertAll method)? If so, per the documentation, data can take up to 90 minutes to be copyable/exportable:
https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability
I won't get super detailed, but the reason is that our copy and export operations are optimized to work on materialized files. Data within our streaming buffers is stored in a completely different system, and thus isn't picked up until the buffers are flushed into the traditional storage mechanism. That said, we are working on removing the copy/export delay.
If you aren't using streaming insert to populate the table, then definitely contact support/file a bug here.
There is no minimum record count required to copy a table within the same dataset or to a different dataset. This applies to both the API and the BigQuery UI. I just replicated your scenario by creating a new table with just 2 records, and I was able to successfully copy the table to another table using the UI.
Attaching screenshot
I tried to copy to a timestamp-partitioned table. I messed up the timestamp and used 1000 x the current timestamp, which I guess is beyond BigQuery's maximum partition range. Despite the copy job succeeding, no data was actually loaded into the destination table.
I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a timestamp. However, our source data files are too big to edit before we add them to Cloud Storage (4+ GB of textual data per file). As far as I know there is no way to append a timestamp column to each row before bringing them into BigQuery, right?
We are thus toying with the idea of creating daily tables for each source, but we don't know how this will work when we have real-time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting them in Cloud Storage... does that mean that the entire source file should have the same timestamp? If so, you could import into a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp, for example: SELECT <all your fields>, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the dataset.__TABLES__ pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the __TABLES__ pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__TABLES__]
WHERE table_id = 'wikipedia'
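And for the first suggestion (copying the table while adding a timestamp), here is a sketch in the same legacy SQL dialect; field1, field2 and the table name are placeholders for your own schema. Set a destination table and enable allow_large_results when you run it:

SELECT
  field1,
  field2,
  CURRENT_TIMESTAMP() AS load_time
FROM [my_project:my_dataset.temp_table]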
Another alternative to consider is the BigQuery streaming API (more info here). This lets you insert single rows or groups of rows into a table just by posting them directly to BigQuery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).
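For reference, a relative range decorator in legacy SQL looks like this (the table name is a placeholder); the decorator below limits the scan to data added in the last hour (3,600,000 ms), and it only works for data less than 7 days old:

SELECT COUNT(*) AS recent_rows
FROM [my_project:my_dataset.my_table@-3600000-]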