On 13.07 I see only 204,047 rows in my BigQuery export, whereas there are usually no fewer than 5 million rows. In the GA4 interface there is no such drop: 5.9 million events are displayed. These days the tables have also been appearing in BQ much later than usual. What can be done in this situation?
Data for 13.07
Data for 14.07
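A minimal sketch (the project and dataset IDs, and the exact dates, are placeholders) of the per-day count used to compare the BigQuery export against the numbers GA4 shows:

```python
# Count exported events per daily GA4 export table around the affected date.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id

query = """
SELECT _TABLE_SUFFIX AS day, COUNT(*) AS events
FROM `my-project.analytics_123456789.events_*`   -- hypothetical GA4 export dataset
WHERE _TABLE_SUFFIX BETWEEN '20230712' AND '20230714'
GROUP BY day
ORDER BY day
"""

for row in client.query(query).result():
    print(row.day, row.events)
```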
I have a use case where I need structured data (30 million records, 20 columns) in a Tableau-compatible source, and we want the Tableau charts to refresh within 1 second. Note: the 30 million records are already aggregated, so we cannot reduce the record count further.
I was thinking of using a Hive table and creating a Presto connection to build a Tableau extract, but when I tried that I saw a latency of about 5-10 seconds.
Can someone please help me pick a better source (and perhaps a live Tableau connection) so the data refreshes faster?
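For the extract route, here is a minimal sketch of building a .hyper file directly with the Tableau Hyper API; the DataFrame and column names are placeholders standing in for the 30 million aggregated rows and 20 columns:

```python
import pandas as pd
from tableauhyperapi import (Connection, CreateMode, HyperProcess, Inserter,
                             SqlType, TableDefinition, TableName, Telemetry)

# Placeholder for the aggregated data (in practice, pulled once from Hive/Presto).
df = pd.DataFrame({"dimension": ["a", "b"], "measure": [1.0, 2.0]})

extract_table = TableDefinition(
    table_name=TableName("Extract", "Extract"),
    columns=[
        TableDefinition.Column("dimension", SqlType.text()),
        TableDefinition.Column("measure", SqlType.double()),
        # ... the remaining aggregated columns
    ],
)

with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(endpoint=hyper.endpoint,
                    database="aggregates.hyper",
                    create_mode=CreateMode.CREATE_AND_REPLACE) as connection:
        connection.catalog.create_schema(schema=extract_table.table_name.schema_name)
        connection.catalog.create_table(table_definition=extract_table)
        # Bulk-insert the aggregated rows into the extract file.
        with Inserter(connection, extract_table) as inserter:
            inserter.add_rows(df.itertuples(index=False, name=None))
            inserter.execute()
```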
I have a BigQuery table with the following properties:
Table size: 1.64 TB
Number of rows: 9,883,491,153
The data is put there using streaming inserts (in batches of 500 rows each).
According to the Google Cloud Pricing Calculator, the cost for these inserts so far should be roughly $86.
But in reality, it turns out to be around $482.
The explanation is in the pricing docs:
Streaming inserts (tabledata.insertAll): $0.010 per 200 MB (You are charged for rows that are successfully inserted. Individual rows are calculated using a 1 KB minimum size.)
So, in the case of my table, each row is just 182 bytes, but I have to pay for the full 1,024 bytes per row, resulting in ~562% of the originally (incorrectly) estimated cost.
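A back-of-the-envelope check of these numbers, assuming the "200 MB" in the pricing table is metered as 200 × 2^20 bytes (which reproduces the observed figures almost exactly):

```python
# Streaming-insert cost: $0.010 per 200 MB, with a 1 KB minimum per row.
rows = 9_883_491_153
rate_per_byte = 0.010 / (200 * 2**20)

estimated = rows * 182 * rate_per_byte    # billed at the actual row size
actual = rows * 1024 * rate_per_byte      # billed at the 1 KB per-row minimum

print(f"${estimated:,.0f}  ${actual:,.0f}  {100 * actual / estimated:.1f}%")
# -> roughly $86  $483  562.6%
```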
Is there a canonical (and of course legal) way to improve the situation, i.e. reduce the cost? Something like inserting into a temp table with just a single array-of-struct column, packing multiple logical rows into each physical row, and then regularly split-moving them into the actual target table?
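A minimal sketch of that packing idea (all table and field names are hypothetical): each streamed row carries an ARRAY of STRUCTs holding, say, 50 logical rows, so the 1 KB minimum is paid once per 50 rows, and a periodic job flattens the buffer into the real table (cleanup/deduplication of the buffer is omitted here):

```python
from google.cloud import bigquery

client = bigquery.Client()

# 1) Stream packed rows into a buffer table whose schema is
#    packed ARRAY<STRUCT<id INT64, payload STRING>>.
logical_rows = [{"id": i, "payload": f"value-{i}"} for i in range(50)]
errors = client.insert_rows_json(
    "my-project.my_dataset.buffer",
    [{"packed": logical_rows}],
)
assert not errors, errors

# 2) Periodically flatten the buffer into the actual target table.
client.query("""
    INSERT INTO `my-project.my_dataset.target` (id, payload)
    SELECT p.id, p.payload
    FROM `my-project.my_dataset.buffer` AS b, UNNEST(b.packed) AS p
""").result()
```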
I can suggest these options:
Use the BigQuery Storage Write API. You can stream records into BigQuery so they are available for reads as soon as they are written, or batch a large number of records and commit them in a single atomic operation.
Some advantages are:
Lower cost, because you get 2 TB per month free.
It supports exactly-once semantics through the use of stream offsets.
If the table schema changes while a client is streaming, the Storage Write API notifies the client.
Here is more information about the BigQuery Storage Write API.
As another option, you could use Beam/Dataflow to batch the records going into BigQuery and write them with BigQueryIO using its batch (file loads) write method.
You can see more information here.
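A minimal Beam sketch of that batch-load approach (the table, schema, and bucket are placeholders); FILE_LOADS stages the rows in GCS and commits them with load jobs, which are not billed, instead of streaming them:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

rows = [{"id": 1, "payload": "a"}, {"id": 2, "payload": "b"}]  # placeholder input

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "CreateRows" >> beam.Create(rows)
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            table="my-project:my_dataset.target",
            schema="id:INTEGER,payload:STRING",
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
            custom_gcs_temp_location="gs://my-bucket/tmp",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```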
We have a simple ADF copy activity taking data from an Azure SQL staging DB into an Azure SQL data warehouse. They are in the same environment, but they are two different databases.
Because we sometimes have a huge delta load, we have set a batch size of 25,000 records.
But we have observed that when we copy in batches, the copy activity becomes progressively slower: the first batch runs at up to 20 MB/s, but as the iterations go on, the speed drops to a few hundred bytes per second.
I would appreciate it if anyone could provide some insight.
Some screenshots are attached below for reference:
When ADF has copied 1 million rows out of 3.18 million rows
When ADF had copied 1.425 million rows
ADF at around 2 million rows mark
ADF at around 2.8 million rows
ADF at 2.875 million rows
I'm using BigQuery for ~5 billion rows that can be partitioned on ~1 million keys.
Since our queries are usually by partition key, is it possible to create ~1 million tables (1 table / key) to limit the total number of bytes processed?
We also need to query all of the data together at times, which is easy to do by putting it all in one table, but I'm hoping to use the same platform for both partitioned analysis and bulk analytics.
That might work, but partitioning your table this finely is highly discouraged. You might be better off partitioning your data into a smaller number of tables, say 10 or 100, and querying just the one(s) you need.
What do I mean by discouraged? First, each of those million tables will be charged a minimum of 10 MB for storage, so you'll be charged for 9 TB of storage when you likely have far less data than that. Second, you'll likely hit rate limits when you try to create that many tables. Third, managing a million tables is very tricky; the BigQuery UI will likely not be much help. Fourth, you'll make the engineers on the BigQuery team exceedingly grumpy, and they'll start trying to figure out whether we need to raise the minimum size for tables.
Also, if you do want to sometimes query all of your data, partitioning this finely is likely going to make things difficult for you, unless you are willing to store your data multiple times. You can only reference 1000 tables in a query, and each one you reference causes you to take a performance hit.
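A minimal sketch of the coarser sharding suggested above (all names are hypothetical): hash each partition key into one of 100 shard tables, so a single-key query scans roughly 1% of the data, while a wildcard table still lets you query all shards together for the bulk case:

```python
import zlib
from google.cloud import bigquery

NUM_SHARDS = 100

def shard_table(key: str) -> str:
    """Map a partition key to one of the 100 shard tables."""
    return f"my-project.my_dataset.events_{zlib.crc32(key.encode()) % NUM_SHARDS:02d}"

client = bigquery.Client()

# Query only the shard that holds a particular key...
single = client.query(
    f"SELECT * FROM `{shard_table('some-key')}` WHERE key = 'some-key'"
).result()

# ...or all shards at once via a wildcard table for the bulk-analytics case.
bulk = client.query(
    "SELECT key, COUNT(*) AS n FROM `my-project.my_dataset.events_*` GROUP BY key"
).result()
```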
I have a table with 1.6 billion rows. I have been running a query that groups by a field with over 5 million unique values, sorts by the sum of another integer column in descending order, and finally returns only the top 10. After more than an hour, that query is still stuck in the running state.
I created this big table using "bq cp -a". The source tables were themselves built with "bq cp" from 1,000 smaller tables, and each of those tables was loaded from over 12 compressed CSV load files.
I searched related questions and found "Google BigQuery is running queries slowly", which mentions slowness caused by fragmentation from many small ingestions. Does my ingestion approach count as loading "too small data bits", causing that kind of fragmentation?
Is it possible that 5 million unique values is too many and that this is the root cause of the slow response?
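For reference, a sketch of the shape of the query described above (table and column names are hypothetical), run through the BigQuery Python client:

```python
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT group_key, SUM(metric) AS total
FROM `my-project.my_dataset.big_table`   -- ~1.6B rows, ~5M distinct group_key values
GROUP BY group_key
ORDER BY total DESC
LIMIT 10
"""

job = client.query(query)
for row in job.result():
    print(row.group_key, row.total)
print("job id:", job.job_id)   # handy when reporting a slow job, as requested below
```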
We had a latency spike yesterday and a smaller one today. Can you give the project ID and the job IDs of the query jobs that took longer than you expected?