The official documentation says that if one table is small enough, a broadcast join can be used (which is faster than a shuffle join): https://cloud.google.com/bigquery/docs/best-practices-performance-compute#optimize_your_join_patterns
However, it does not indicate the size limit for the small table. Some (old) links on the internet say that a broadcast join is no longer used if the small table is larger than 8 MB. That number seems quite small to me (compared with Hive, where the small table can be several hundred MB).
Does somebody know this limit?
The exact value is not published, and it is not fixed (it depends on the type of JOIN and on the shape of the query plan), but it is on the order of hundreds of MB.
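If you want to see whether your candidate small table is anywhere near that range, one option is to check its size in the dataset's table metadata. A minimal sketch, assuming a dataset named mydataset and a table named dim_small (both placeholder names):

SELECT
  table_id,
  ROUND(size_bytes / (1024 * 1024), 1) AS size_mb  -- physical size in MB
FROM `mydataset.__TABLES__`
WHERE table_id = 'dim_small'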
I'm setting up a crude data warehouse for my company and I've successfully pulled contact, company, deal and association data from our CRM into BigQuery, but when I join these together into a master table for analysis via our BI platform, I continually get the error:
Query exceeded resource limits. This query used 22602 CPU seconds but would charge only 40M Analysis bytes. This exceeds the ratio supported by the on-demand pricing model. Please consider moving this workload to the flat-rate reservation pricing model, which does not have this limit. 22602 CPU seconds were used, and this query must use less than 10200 CPU seconds.
As such, I'm looking to optimise my query. I've already removed all GROUP BY and ORDER BY clauses, and have tried using WHERE clauses to do additional filtering, but this seems illogical to me as it would add processing demands.
My current query is:
SELECT
coy.company_id,
cont.contact_id,
deals.deal_id,
{another 52 fields}
FROM `{contacts}` AS cont
LEFT JOIN `{assoc-contact}` AS ac
ON cont.contact_id = ac.to_id
LEFT JOIN `{companies}` AS coy
ON CAST(ac.from_id AS int64) = coy.company_id
LEFT JOIN `{assoc-deal}` AS ad
ON coy.company_id = CAST(ad.from_id AS int64)
LEFT JOIN `{deals}` AS deals
ON ad.to_id = deals.deal_id;
FYI, {assoc-contact} and {assoc-deal} are both separate views I created from the associations table to make it easier to associate those tables with the companies table.
It should also be noted that this query has occasionally run successfully, so I know it does work; it just fails about 90% of the time because the query is so big.
TL;DR
Check your join keys. 99% of the time the cause of the problem is a combinatorial explosion.
I can't know for sure since I don't have access to the data in the underlying tables, but I will give a general resolution method which, in my experience, has always found the root cause.
Long Answer
Investigation method
Say you are joining two tables
SELECT
cols
FROM L
JOIN R ON L.c1 = R.c1 AND L.c2 = R.c2
and you run into this error. The first thing you should do is check for duplicates in both tables.
SELECT
c1, c2, COUNT(1) as nb
FROM L
GROUP BY c1, c2
ORDER BY nb DESC
And the same thing for each table involved in a join.
I bet you will find that your join keys are duplicated. BigQuery is very scalable, so in my experience this error happens when you have a join key that repeats more than 100,000 times in both tables. That means that after your join, you will have 100,000^2 = 10 billion rows!
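If you do find duplicated keys and the duplicates are genuinely redundant for your purpose (an assumption you should verify), one common fix is to reduce one side to a single row per key before joining. A sketch, with other_col standing in for whatever extra columns you need from R:

SELECT
  cols
FROM L
JOIN (
  -- Collapse R to one row per join key before the join.
  SELECT c1, c2, ANY_VALUE(other_col) AS other_col
  FROM R
  GROUP BY c1, c2
) AS R_dedup
ON L.c1 = R_dedup.c1 AND L.c2 = R_dedup.c2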
Why BigQuery gives this error
In my experience, this error message means that your query does too much computation relative to the size of its inputs.
No wonder you're getting this if you end up with 10 billion rows after joining tables with a few million rows each.
BigQuery's on-demand pricing model is based on the amount of data read from your tables. This means that people could try to abuse it by, say, running CPU-intensive computations while reading small datasets. To give an extreme example, imagine someone writes a JavaScript UDF to mine bitcoin and runs it on BigQuery:
SELECT MINE_BITCOIN_UDF()
The query will be billed $0 because it doesn't read anything, but will consume hours of Google's CPU. Of course they had to do something about this.
So this ratio exists to make sure that users don't do anything sketchy by using hours of CPU while processing a few MB of input.
Other MPP platforms with a different pricing model (e.g. Azure Synapse, which charges for the bytes processed, not the bytes read like BigQuery does) would perhaps have run the query without complaining, and then billed you 10 TB for reading that 40 MB table.
P.S.: Sorry for the late and long answer; it's probably too late for the person who asked, but hopefully it will help whoever runs into this error.
I need help optimizing the query below. Please advise.
Db : Redshift
Sort key:
order: install_ts
order_item: install_ts
suborder: install_ts
suborder_item: install_ts
Dist key: not set
Query (extracting only selected columns from the tables below, not all):
select *, rank() OVER (PARTITION BY oo.id,
oi.id
ORDER BY i.suborder_id DESC) AS suborder_rank
FROM order_item oi
LEFT JOIN "order" oo ON oo.id=oi.order_id
LEFT JOIN sub_order_item i ON i.order_item_id = oi.id
LEFT JOIN sub_order s ON i.suborder_id = s.id
WHERE
(
(oo.update_ts between '2021-04-13 18:29:59' and '2021-04-14 18:29:59' AND oo.create_ts>=current_date-500)
OR
oo.create_ts between '2021-04-13 18:29:59' and '2021-04-14 18:29:59'
OR
oi.create_ts between '2021-04-13 18:29:59' and '2021-04-14 18:29:59'
OR
(oi.update_ts between '2021-04-13 18:29:59' and '2021-04-14 18:29:59' AND oi.create_ts>=current_date-500)
)
Without knowing your data in more detail and knowing the full query load you want to optimize, it will be difficult to get things exactly right. However, I see a number of things here that are POTENTIAL areas for significant performance issues. I do this work for many clients, and while the optimization methods in Redshift differ from other databases, there are a number of things you can do.
First off, you need to remember that Redshift is a networked cluster of computers acting as a coherent database. This allows it to scale to very large database sizes and achieve horizontally scalable compute performance. It also means that there are network cables in play in the middle of your query.
This leads to the issue I see most often - massive amounts of data being redistributed across those network cables during the query. In your case you have not assigned a distribution key, which will lead to random distribution of your data. But when you perform those joins based on "id", all the rows with the same id need to be moved to one slice (compute element) of the Redshift cluster. Basically all your data is being transferred around your cluster, and this takes time. Assigning a common distribution key that is also a common join (or group by) column, such as id, will greatly reduce this network traffic.
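As a sketch of what that could look like for the first join (treat the exact key choice as something to validate against your full workload; "order" needs quoting because it is a reserved word):

-- Distribute both tables on the column they are joined on, so matching rows
-- land on the same slice and don't need to be redistributed at query time.
ALTER TABLE "order" ALTER DISTKEY id;
ALTER TABLE order_item ALTER DISTKEY order_id;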
The next most common issue is scanning (or writing) too much data from (or to) disk. The disks on Redshift are fast, but the data is often massive and still moves through cables (faster than the network, but it still takes time). Redshift helps with this by storing metadata alongside your data on disk, and when it can, it will skip reading unneeded data based on your WHERE clauses. In your case you have WHERE clauses that are reducing the needed rows (I hope), which is good. But you are filtering on columns that are not the sort key. If those columns correlate well with the sort key this may not be a problem, but if they do not, you could be reading more data than you need. You will want to look at the metadata for these tables to see how well your sort keys are working for you. Also, Redshift only reads the columns you reference in your query, so if you don't ask for a column, its data is not read from disk. Having "*" in your SELECT reads all the columns - if you don't need them all, think about listing only the needed columns.
If your query works on data that doesn't fit in memory during execution, the extra data will need to spill to disk. This creates writes to disk for the spilled data and then reads to bring it back in, and it can happen multiple times during the query. I don't know if this is happening in your case, but you will want to check whether you are spilling during your query.
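One way to check for spilling is the svl_query_summary system view; a sketch, with 123456 standing in for your actual query id:

-- Steps with is_diskbased = 't' wrote intermediate results to disk.
SELECT query, seg, step, rows, workmem, is_diskbased
FROM svl_query_summary
WHERE query = 123456
ORDER BY seg, step;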
The thing that impacts everything else is issues in the query itself, and these take a number of forms. If your joins are not properly qualified, the amount of data can explode, making the two issues above much worse. Window functions can be expensive, especially when the partition differs from the distribution key, but also when the amount of data being operated on is larger than needed. There are many others, but these are two possible areas that look like they may apply to your query. I just cannot tell from the info provided.
I'd start by looking at the explain plan and the actual execution information from the system tables. Big numbers for rows, cost, or time will lead you to the likely areas that need optimization. Like I said, I do this work for customers all the time and can help here as time allows, but much more information will be needed to understand what is really impacting your query.
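As a starting point, something like the following (a sketch over a trimmed version of your joins); in the plan, large row estimates and DS_DIST_BOTH or DS_BCAST_INNER steps point at redistribution problems:

EXPLAIN
SELECT oi.id, oo.id, i.suborder_id
FROM order_item oi
LEFT JOIN "order" oo ON oo.id = oi.order_id
LEFT JOIN sub_order_item i ON i.order_item_id = oi.id;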
It looks like LIMIT would have no effect on the amount of processed/queried data (if you trust the UI).
SELECT
* --count(*)
FROM
`bigquery-public-data.github_repos.commits`
-- LIMIT 20
How can I limit the amount of queried data to a minimum (even though one whole partition would probably always be needed),
without using "preview" or similar, and
without knowing the partitioning / clustering of the data?
How can I check the real approximate amount before executing a query?
The execution details state that only 163,514 rows were queried as input (not 244,928,379 rows).
If you want to limit the amount of data BigQuery uses for a query, you have these two options:
Table Partitioning
BigQuery can partition data either by a DATE/DATETIME/TIMESTAMP column you provide or by ingestion time (which is good if you have regular updates on a table).
In order to do this, you must specify the partition strategy in the DDL:
CREATE TABLE mydataset.mytable (foo INT64, txdate DATE)
PARTITION BY txdate
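Queries then only scan the partitions matched by a filter on the partitioning column, which is what actually reduces the bytes billed. A minimal sketch against the table above, with an illustrative date range:

SELECT foo
FROM mydataset.mytable
WHERE txdate BETWEEN '2021-12-01' AND '2021-12-07'  -- only these partitions are read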
Wildcard tables (sharding - splitting the data into multiple tables)
This works when your data holds information about different domains (geographical, customer type, etc.) or sources.
Instead of having one big table, you can create 'subtables' or 'shards' with a similar schema (usually people use the same one). For instance, dataset.tablename_eur for European data and dataset.tablename_jap for data from Japan.
You can query one of those tables directly: select col1, col2 from dataset.tablename_eur; or query all of them at once: select col1, col2 from `dataset.tablename_*`
Wildcard tables can also be partitioned by date.
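With wildcard tables you can also restrict which shards are read using the _TABLE_SUFFIX pseudo-column; a sketch using the shard names above:

SELECT col1, col2
FROM `dataset.tablename_*`
WHERE _TABLE_SUFFIX = 'eur'  -- only the European shard is scanned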
You pay for the volume of data loaded into the workers. Of course, if you do nothing in your request and only ask for the first 20 results, the query stops earlier and not all the data is processed, but it is still loaded. And you will pay for that!
Have a look at this; I ran a similar request.
Now, let's go to the logs
The total bytes billed is ~800 MB
So you have to think differently when you work with BigQuery: it's an analytics database, not designed to serve small requests (it is too slow to start; the latency is at least 500 ms due to worker warm-up).
My table contains 3M+ rows, and only 10% of them were processed.
And you pay for the reservation and the load cost (moving data has a cost, and reserving slots also has a cost).
That's why there are a lot of tips for saving money on Google BigQuery. Some examples by a former BigQuery Developer Advocate:
As of December 2021, I notice that SELECT * with LIMIT will not scan the whole table and you pay only for a small number of rows; obviously, if you add ORDER BY, it will scan everything.
I've been doing some load testing of AWS Redshift for a new application, and I noticed that it has a column limit of 1600 per table. Worse, queries slow down as the number of columns increases in a table.
What doesn't make any sense here is that Redshift is supposed to be a column-store database, and there shouldn't in theory be an I/O hit from columns that are not selected in a particular where clause.
More specifically, when TableName is 1600 columns, I found that the below query is substantially slower than if TableName were, say, 1000 columns and the same number of rows. As the number of columns decreases, performance improves.
SELECT COUNT(1) FROM TableName
WHERE ColumnName LIKE '%foo%'
My three questions are:
What's the deal? Why does Redshift have this limitation if it claims to be a column store?
Any suggestions for working around this limitation? Joining multiple smaller tables seems to eventually approach the performance of a single table. I haven't tried pivoting the data.
Does anyone have a suggestion for a fast, real-time performance, horizontally scalable column-store database that doesn't have the above limitations? All we're doing is count queries with simple where restrictions against approximately 10M (rows) x 2500 (columns) data.
I can't explain precisely why it slows down so much but I can verify that we've experienced the same thing.
I think part of the issue is that Redshift stores a minimum of 1 MB per column per slice. Having a lot of columns creates a lot of disk seek activity and I/O overhead.
1 MB blocks are problematic because most of that space will be empty, but it will still be read off the disk.
Having lots of blocks means that column data will not be located as close together so Redshift has to do a lot more work to find them.
Also (it just occurred to me), I suspect that Redshift's MVCC controls add a lot of overhead. It tries to ensure you get a consistent read while your query is executing, and presumably that requires making a note of all the blocks for tables in your query, even blocks for columns that are not used. Why is an implicit table lock being released prior to end of transaction in RedShift?
FWIW, our columns were virtually all BOOLEAN and we've had very good results from compacting them (bit masking) into INT/BIGINTs and accessing the values using the bit-wise functions. One example table went from 1400 cols (~200GB) to ~60 cols (~25GB) and the query times improved more than 10x (30-40 down to 1-2 secs).
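As a rough sketch of the bit-masking idea (the table, column, and flag names here are made up): each former BOOLEAN column gets a bit position in an INT/BIGINT, and you test it with the bitwise AND operator.

-- flags packs former BOOLEAN columns: bit 0 = is_active, bit 1 = is_verified, ...
SELECT COUNT(1)
FROM packed_table
WHERE (flags & 2) <> 0;  -- rows where bit 1 (is_verified) is set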
I'm writing a background job to automatically process A/B test data in BigQuery, and I'm finding that I'm hitting "Resources exceeded during query execution" when doing large GROUP EACH BY statements. I saw from Resources Exceeded during query execution that reducing the number of groups can make queries succeed, so I split up my data into smaller pieces, but I'm still hitting errors (although less frequently). It would be nice to get a better intuition about what actually causes this error. In particular:
Does "resources exceeded" always mean that a shard ran out of memory, or could it also mean that the task ran out of time?
What's the right way to approximate the memory usage and the total memory I have available? Am I correct in assuming each shard tracks about 1/n of the groups and keeps the group key and all aggregates for each group, or is there another way that I should be thinking about it?
How is the number of shards determined? In particular, do I get fewer shards/resources if I'm querying over a smaller dataset?
The problematic query looks like this (in practice, it's used as a subquery, and the outer query aggregates the results):
SELECT
alternative,
snapshot_time,
SUM(column_1),
...
SUM(column_139)
FROM
my_table
CROSS JOIN
[table containing 24 unix timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time
GROUP EACH BY alternative, user_id, snapshot_time
(Here's an example failed job: 124072386181:job_XF6MksqoItHNX94Z6FaKpuktGh4 )
I realize this query may be asking for trouble, but in this case the table is only 22 MB and the query results in under a million groups, and it's still failing with "resources exceeded". Reducing the number of timestamps processed at once fixes the error, but I'm worried that I'll eventually hit a data scale large enough that this approach as a whole will stop working.
As you've guessed, BigQuery chooses a number of parallel workers (shards) for GROUP EACH and JOIN EACH queries based on the size of the tables being operated upon. It is a rough heuristic, but in practice, it works pretty well.
What is interesting about your query is that the GROUP EACH is being done over a larger table than the original table because of the expansion in the CROSS JOIN. Because of this, we choose a number of shards that is too small for your query.
To answer your specific questions:
Resources exceeded almost always means that a worker ran out of memory. This could be a shard or a mixer, in Dremel terms (mixers are the nodes in the computation tree that aggregate results. GROUP EACH BY pushes aggregation down to the shards, which are the leaves of the computation tree).
There isn't a good way to approximate the amount of resources available. This changes over time, with the goal that more of your queries should just work.
The number of shards is determined by the total bytes processed in the query. As you've noticed, this heuristic doesn't work well with joins that expand the underlying data sets. That said, there is active work underway to be smarter about how we pick the number of shards. To give you an idea of scale, your query got scheduled on only 20 shards, which is a tiny fraction of what a larger table would get.
As a workaround, you could save the intermediate result of the CROSS JOIN as a table and run the GROUP EACH BY over that temporary table. That should let BigQuery use the expanded size when picking the number of shards. (If that doesn't work, please let me know; it is possible that we need to tweak our assignment thresholds.)
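A sketch of that two-step approach, in the same legacy-SQL style as the original query (the intermediate table name is a placeholder; set it as the destination table of the first query):

-- Step 1: materialize the expanded CROSS JOIN, e.g. into [mydataset.expanded]
SELECT
  alternative, user_id, snapshot_time, column_1, ..., column_139
FROM
  my_table
CROSS JOIN
  [table containing 24 unix timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time

-- Step 2: aggregate over the materialized table, whose size now reflects the expansion
SELECT
  alternative,
  snapshot_time,
  SUM(column_1),
  ...
  SUM(column_139)
FROM [mydataset.expanded]
GROUP EACH BY alternative, user_id, snapshot_time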