Will LIMIT affect the number of slots that get used? - google-bigquery

We recently moved to BigQuery's flat-rate billing model. This means we purchase a finite amount of slots and all BigQuery queries in our organisation will make use of those slots.
I am wondering whether the number of slots used by a query is affected by the use of LIMIT. To put it another way, will this query:
select *
from project.dataset.table
use more slots than this one?
select *
from project.dataset.table
limit 10
?

Related

BigQuery in Google Cloud, limiting search of views and cost

I'm a little new to BQ. I'm running a very simple query on a view to get a quick look at the data, but when I add, say, LIMIT 100 to see just the first 100 rows, I don't get a reduction in the data required and hence the cost. If I simply want to do this, what inexpensive options do I have to look at the data?
For example:
select * from table
uses exactly the same projected data as
select * from table limit 100
Is there no simplification under the hood? Is BQ scanning all rows and then taking the top 100?
BigQuery charges based on the data queried, and unfortunately LIMIT does not reduce the volume of data queried.
The following can help:
using the table preview in the console (this is free, if I recall correctly), but it does not work on views or some types of attached tables
reducing the number of columns that are queried
if your data is partitioned, you can query a specific partition (see the sketch below) - https://cloud.google.com/bigquery/docs/querying-partitioned-tables
There is information from Google on this page https://cloud.google.com/bigquery/docs/best-practices-performance-input
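As a rough illustration of the partitioned-table option above (table and column names here are hypothetical), combining a smaller column list with a filter on the partitioning column means only the matching partition's bytes are billed:
SELECT order_id, amount
FROM mydataset.orders                 -- assumed to be partitioned on order_date
WHERE order_date = DATE '2021-06-01'  -- partition filter: only this partition is scanned
LIMIT 100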

GCP BigQuery - LIMIT but full table read - How to limit queried data to a minimum

It looks like LIMIT would have no effect on the amount of processed/queried data (if you trust the UI).
SELECT
* --count(*)
FROM
`bigquery-public-data.github_repos.commits`
-- LIMIT 20
How can I limit the amount of queried data to a minimum (even though one whole partition would probably always be needed),
without using "preview" or similar,
and without knowing the partitioning / clustering of the data?
How can I check the real approximate amount before executing a query?
In the execution details it is stated that only 163514 rows have been queried as input (not 244928379 rows).
If you want to limit the amount of data BQ uses for a query, you have these two options:
Table Partitioning
BigQuery can partition data either by a DATE/DATETIME/TIMESTAMP column you provide or by ingestion time (which is good if you have regular updates on a table).
In order to do this, you must specify the partition strategy in the DDL:
CREATE TABLE mydataset.mytable (foo INT64, txdate DATE)
PARTITION BY txdate
Wildcard tables (like sharding): splitting the data into multiple tables
This works when your data holds information about different domains (geographical, customer type, etc.) or sources.
Instead of having one big table, you can create 'subtables' or 'shards' with a similar schema (usually people use the same one). For instance, dataset.tablename_eur for European data and dataset.tablename_jap for data from Japan.
You can query one of those tables directly, select col1, col2 from dataset.tablename_eur; or all of them with a wildcard, select col1, col2 from `dataset.tablename_*`.
Wildcard tables can also be partitioned by date; a short sketch follows.
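A minimal sketch of the wildcard approach, assuming hypothetical shard names such as dataset.tablename_eur and dataset.tablename_jap; the _TABLE_SUFFIX pseudo-column restricts the query to the shards you actually need, so only those are scanned and billed:
SELECT col1, col2
FROM `dataset.tablename_*`
WHERE _TABLE_SUFFIX IN ('eur', 'jap')  -- only the eur and jap shards are read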
You pay for the volume of data loaded into the workers. Of course, if you do nothing in your request and only ask for the first 20 results, the query stops earlier and not all the data is processed, but it is at least loaded. And you will pay for this!
Have a look at this; I ran a similar request.
Now, let's go to the logs.
The total bytes billed is ~800MB.
So you have to think differently when you work with BigQuery: it's an analytics database and not designed to perform small requests (too slow to start; the latency is at least 500ms due to worker warm-up).
My table contains 3M+ rows, and only 10% of them have been processed.
And you pay for the reservation and the load cost (moving data has a cost, and reserving slots also has a cost).
That's why there are a lot of tips to save money on Google BigQuery. Some examples by a former BigQuery Developer Advocate:
As of December 2021, I notice that SELECT * with a LIMIT will not scan the whole table and you pay only for a small number of rows; obviously, if you add an ORDER BY, it will scan everything.
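A quick sketch of that contrast (table and column names are hypothetical); compare the bytes processed/billed shown in the job details for each statement:
SELECT * FROM mydataset.big_table LIMIT 20;                      -- may stop early and bill only a few rows
SELECT * FROM mydataset.big_table ORDER BY created_at LIMIT 20;  -- the sort forces a full table scan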

Is there a way to increase allotted memory for queries in BigQuery?

I have a large table (about 59 million rows, 7.1 GB), already ordered as I want, and I want to query this table and get a row_number() for each row of the table.
Unfortunately I get the error
Resources exceeded during query execution: The query could not be executed in the allotted memory.
Is there a way to increase the allotted memory in BigQuery?
Here is my query. I don't see how I can simplify it, but if you have any advice I'll take it:
SELECT
row_number() over() as rowNumber,
game,
app_version,
event_date,
user_pseudo_id,
event_name,
event_timestamp,
country,
platform
FROM
`mediation_time_BASE`
Here is the complete error message :
Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 146% of limit. Top memory consumer(s): analytic OVER() clauses: 98% other/unattributed: 2%
Edit:
The query here represents a list of event starts and ends, and I need to link each start event with its end, so I followed this tip: https://www.interfacett.com/blogs/how-to-use-values-from-previous-or-next-rows-in-a-query-in-sql-server/
For that I need the rows to have a row_number() so I can split this subquery in two (event starts on one hand and event ends on the other), join them, and then have one row per event with the start and end of that event, as follows (where subquery represents the query with the row_number()):
SELECT
(case when lead(inter.rowNumber) OVER(ORDER BY inter.rowNumber) - inter.rownumber =1
then lead(inter.rowNumber) OVER(ORDER BY inter.rowNumber)
else inter.rownumber end) as rowNumber,
min(inter_success.rowNumber) as rowNumber_success,
inter.game,
inter.app_version,
inter.event_date,
inter.user_pseudo_id,
inter.event_timestamp as event_start,
min(inter_success.event_timestamp) as event_end,
inter_success.event_name as results
FROM
(SELECT * FROM `subquery` where event_name = 'interstitial_fetch') as inter INNER JOIN
(SELECT * FROM `subquery` where event_name = 'interstitial_fetch_success') as inter_success
ON inter.rowNumber < inter_success.rowNumber and inter.game= inter_success.game and inter.app_version = inter_success.app_version and inter.user_pseudo_id = inter_success.user_pseudo_id
GROUP BY inter.rowNumber,inter.game,inter.app_version,inter.event_date,inter.user_pseudo_id,inter.event_timestamp,inter_success.event_name
This works fine with a smaller dataset, but doesn't for 59 million rows...
TL;DR: You don't need to increase the memory for BigQuery.
In order to answer that you need to understand how BigQuery works. BigQuery relies on executor machines called slots. These slots are all similar in type and have a limited amount of memory.
Now, many of the operations split the data between multiple slots (like GROUP BY), each slot performs a reduction on a portion of the data and sends the result upwards in the execution tree.
Some operations must be performed on a single machine (like SORT and OVER()); see here.
When your data overflows the slot's memory, you experience the described error. Hence, what you would really need is to change the slot type to a machine with more memory. That's unfortunately not possible. You will have to follow the query best practices to avoid single-slot operations on too much data.
One thing that may help you is to compute the OVER() with a PARTITION BY clause, so each partition is sent to a different machine (see the sketch below and this example). Another thing that usually helps is to move to Standard SQL if you haven't done that yet.
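A minimal sketch of that suggestion applied to the query above, assuming row numbers only need to be comparable within each game/app_version/user_pseudo_id group (which matches the join keys used in the follow-up query), so each partition can be processed on a separate slot:
SELECT
  ROW_NUMBER() OVER (PARTITION BY game, app_version, user_pseudo_id
                     ORDER BY event_timestamp) AS rowNumber,
  game,
  app_version,
  event_date,
  user_pseudo_id,
  event_name,
  event_timestamp,
  country,
  platform
FROM
  `mediation_time_BASE`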
As per the official documentation, you need to request an increase in the slots for your reservation...
Maximum concurrent slots per project for on-demand pricing — 2,000
The default number of slots for on-demand queries is shared among all queries in a single project. As a rule, if you're processing less than 100 GB of queries at once, you're unlikely to be using all 2,000 slots.
To check how many slots you're using, see Monitoring BigQuery using Stackdriver. If you need more than 2,000 slots, contact your sales representative to discuss whether flat-rate pricing meets your needs.
Refer to this for slots; the process for requesting more memory is described here.
To increase the BigQuery slots in the project, you may have to contact Google Cloud support or buy reservations.
I assume you were using a WITH clause for the subquery, which in turn runs out of memory. My proposed solution is to create an expiring table that will expire automatically in a few days, using the option
OPTIONS(expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 5 DAY))
With this approach, I imagine inserting the 59 million rows of query results into an expiring table will use far fewer slots. Replace your subsequent subquery with the expiring table name (a sketch follows below).
To avoid being billed for the expiring table, you may delete it after all the dependent queries are executed.
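A sketch of that suggestion, with a hypothetical dataset and table name; the CREATE TABLE ... AS SELECT materialises the numbered rows once, and the table is removed automatically after 5 days. (The analytic OVER() itself is still subject to the memory considerations discussed in the other answer.)
CREATE TABLE mydataset.numbered_events
OPTIONS(expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 5 DAY))
AS
SELECT
  ROW_NUMBER() OVER() AS rowNumber,
  game, app_version, event_date, user_pseudo_id,
  event_name, event_timestamp, country, platform
FROM
  `mediation_time_BASE`;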

Is there a limit to queries using Bigquery's library and api?

I want to know if there is any limit when making queries against my data already loaded in BigQuery.
For example, if I want to query BigQuery from a web application or from a "web service", what is my limit of selects, updates and deletes?
The documentation tells me this:
Concurrent rate limit for interactive queries under on-demand pricing: 50 concurrent queries. Queries that return cached results, or queries configured using the dryRun property, do not count against this limit.
Daily query size limit: unlimited by default, but you may specify limits using custom quotas.
But I cannot work out whether I have a limit on the number of queries per day, and if so, what that limit is.
There is a limit to the number of slots you can allocate for queries at a particular time.
Some nuggets:
Slot: represents one unit of computational capacity.
Query: Uses as many slots as required so the query runs optimally (Currently: MAX 50 slots for On Demand Price) [A]
Project: The slots used per project is based on the number of queries that run at the same time (Currently: MAX 2000 slots for On Demand Price)
[A] This is all under the hood without user intervention. BigQuery makes an assessment of the query to calculate the number of slots required.
So if you do the math, worst case, if all your queries use 50 slots, you will not find any side effect until you have more than 40 queries running concurrently. Even in those situations, the queries will just be in the queue for a while and will start running after some queries are done executing.
Slots become more worrisome when you are time-sensitive about getting your data and your queries run in an environment where:
A lot of queries are running at the same time.
Most of the queries running at the same time usually take a long time to execute even on an empty load.
The best way to understand whether these limits will impact you is by monitoring the current activity within your project. BigQuery advises you to monitor your slots with Stackdriver.
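If your project has INFORMATION_SCHEMA access, a rough way to eyeball recent slot consumption is a query like the sketch below (the region qualifier is an assumption and may need changing for your location):
SELECT
  job_id,
  total_slot_ms / NULLIF(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND), 0) AS approx_avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY approx_avg_slots DESC
LIMIT 20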
Update: Bigquery addresses the problem of query prioritization in one of their blog posts - Truth 5: BigQuery supports query prioritization

How to show a sample of the data in BigQuery?

Let us suppose I have a 1TB dataset in BigQuery, and I want to be able to view the data in a columnar view, limiting to 1000 results. Here are a few of the queries I might use:
1. SELECT * FROM mytable LIMIT 1000
2. SELECT first_name, last_name FROM mytable LIMIT 1000
3. SELECT last_name, first_name FROM mytable LIMIT 1000
4. SELECT * FROM mytable ORDER BY first_name LIMIT 1000
If I ran these four queries I would be charged ~$20 ($5/TB; pretend * = first_name, last_name). This seems like a very high amount to pay just to sample the data -- is there another way to query this data to view a limited view of the data, like the above?
This seems like a very high amount to pay just to sample the data -- is there another way to query this data?
If your data is dynamic, meaning it is updated daily or in some other regular way, you can use Table Decorators.
For example
SELECT * FROM [mytable@-3600000--1800000] LIMIT 1000
(legacy SQL) will query only the data inserted within that recent window (here, between one hour and 30 minutes ago), thus lowering the cost a lot!
Another option is to use day-partitioned tables so you can query only a specific day's worth of data (see the sketch below).
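For instance, with an ingestion-time (daily) partitioned table, a filter on the _PARTITIONDATE pseudo-column restricts the scan to a single day's partition (table name hypothetical):
SELECT *
FROM mydataset.mytable
WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)  -- only yesterday's partition is scanned
LIMIT 1000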
Is there a way to export a subset of the data instead of doing a query?
Yes. You can use the tabledata.list API to list data page by page in your original table and insert it into a new [sampled] table using whatever sampling logic you need. Note: this API is free, as it doesn't actually use the BigQuery query engine per se, but rather reads from the underlying storage, so you can be reasonably wild :o)
Of course, you need to implement this in the client of your choice.
I assume you are accessing BQ through the online query interface (https://bigquery.cloud.google.com/table . . . ).
Click on the table in the data set. Go down to where it says "Table Details" in bold letters, beneath the "Run Query" icon.
In the second row below that is an option for "Preview". This will show you some data and it's free.
We have a sample table that's generated every day at work which I find extremely useful for many tasks. It's as simple as:
SELECT * FROM mytable WHERE RAND() < 0.01
The table is hierarchical, and this sampling is set to reproduce the whole structure; so queries can be tested/replicated in exactly the same form and then swapped over to the big table if needed. The 1% sample applies to the top level of the hierarchy (meaning you don't have to wonder whether you are getting valid results from branches).
For us, there is enough data that sums and ratios are generally very representative. The only kind of data that poses a significant problem is relatively rare events, which means counts of unique elements can't be relied on.
And of course, after the single daily charge for making this table, the billing goes from dollars to cents!
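A sketch of how such a daily sample table could be materialised (dataset and table names are hypothetical); the full scan is paid once per day, and every exploratory query afterwards runs against roughly 1% of the data:
CREATE OR REPLACE TABLE mydataset.mytable_sample AS
SELECT *
FROM mydataset.mytable
WHERE RAND() < 0.01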