BigQuery in Google Cloud, limiting search of views and cost - sql

I'm fairly new to BQ. I'm running a very simple query against a view to get a quick look at the data, but when I add, say, LIMIT 100 to see just the first 100 rows, the amount of data processed, and hence the cost, does not go down. What is an inexpensive way to get a quick look at the data?
For example:
select * from table
uses exactly the same projected data as
select * from table limit 100
Is there no optimization under the hood? Is BQ scanning all rows and then taking the top 100?

BigQuery charges are based on the amount of data queried and, unfortunately, LIMIT does not reduce the volume of data queried.
The following can help:
* using the table preview in the console (this is free, if I recall correctly), although it does not work on views or some types of attached tables
* reducing the number of columns that are queried
* if your data is partitioned, querying a specific partition - https://cloud.google.com/bigquery/docs/querying-partitioned-tables
Google has more information on this page: https://cloud.google.com/bigquery/docs/best-practices-performance-input
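As a rough illustration of the points above, here is a toy Python model of columnar billing, where the bytes read depend on which columns are selected and not on LIMIT. The column names and sizes are made up for the example, and the model only mirrors the behaviour described in this answer; it is not how BigQuery computes anything internally.

```python
# Toy model: bytes billed depend on the columns read, not on the row limit.
# The per-column sizes below are hypothetical example values.
COLUMN_BYTES = {
    "id": 8_000_000,
    "name": 40_000_000,
    "payload": 900_000_000,
}


def bytes_scanned(columns, limit=None):
    """Return the bytes read for a query selecting `columns`.

    `limit` is accepted only to show that it never enters the calculation.
    """
    if columns == ["*"]:
        columns = list(COLUMN_BYTES)
    return sum(COLUMN_BYTES[c] for c in columns)


full = bytes_scanned(["*"])                        # select * from table
limited = bytes_scanned(["*"], limit=100)          # select * from table limit 100
narrow = bytes_scanned(["id", "name"], limit=100)  # fewer columns

print(full, limited, narrow)  # 948000000 948000000 48000000
```

In this model only the narrower column list brings the number down, which matches the advice to select fewer columns.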

Related

GCP BigQuery - LIMIT but full table read - How to limit queried data to a minimum

It looks like LIMIT would have no effect on the amount of processed/queried data (if you trust the UI).
SELECT
* --count(*)
FROM
`bigquery-public-data.github_repos.commits`
-- LIMIT 20
How can I limit the amount of queried data to a minimum (even though one whole partition would probably always be needed)
* without using "preview" or similar
* without knowing the partitioning / clustering of the data
How can I check the real approximate amount before executing a query?
The execution details state that only 163514 rows were read as input (not 244928379 rows).
If you want to limit the amount of data BQ uses for a query, you have these two options:
Table Partitioning
BigQuery can partition data using either a Date/Datetime/Timestamp column you provide or by insert date (which is good if you have regular updates on a table).
In order to do this, you must specify the partition strategy in the DDL:
CREATE TABLE mydataset.mytable (foo INT64, txdate DATE)
PARTITION BY txdate
Wildcard tables (like sharding: splitting the data into multiple tables)
This works when your data holds information about different domains (geographical, customer type, etc.) or sources.
Instead of having one big table, you can create 'subtables' or 'shards' with a similar schema (usually people use the same one). For instance, dataset.tablename.eur for European data and dataset.tablename.jap for data from Japan.
You can query one of those tables directly, select col1,col2... from dataset.tablename.customer_eur; or all of the tables at once, select col1,col2 from `dataset.tablename.*`
Wildcard tables can also be partitioned by date.
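To make the effect of partitioning concrete, here is a small Python sketch of partition pruning. It is only a toy model under assumed numbers: the partition dates and sizes are hypothetical, and the filter stands in for a WHERE clause on the partitioning column (txdate in the DDL above).

```python
from datetime import date

# Hypothetical partition sizes in bytes, keyed by the partitioning column value.
PARTITION_BYTES = {
    date(2021, 1, 1): 500_000_000,
    date(2021, 1, 2): 520_000_000,
    date(2021, 1, 3): 480_000_000,
}


def bytes_scanned(partition_filter=None):
    """Sum the sizes of the partitions a query has to read.

    With no filter every partition is read; with a filter on the
    partitioning column only the matching partitions are read.
    """
    return sum(
        size
        for day, size in PARTITION_BYTES.items()
        if partition_filter is None or partition_filter(day)
    )


everything = bytes_scanned()                             # no filter on txdate
one_day = bytes_scanned(lambda d: d == date(2021, 1, 2)) # WHERE txdate = '2021-01-02'

print(everything, one_day)  # 1500000000 520000000
```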
You pay for the volume of data loaded into the workers. Of course, since your query does no processing and asks only for the first 20 results, it stops early and not all of the data is processed, but it is at least loaded. And you pay for that!
Have a look at this. I ran a similar query.
Now, let's go to the logs.
The total bytes billed is ~800 MB.
So you have to think differently when you work with BigQuery: it is an analytics database and is not designed for small requests (it is too slow to start; the latency is at least 500 ms due to worker warm-up).
My table contains 3M+ rows, and only 10% of them were processed.
And you pay for the reservation and the load cost (moving data has a cost, and reserving slots has a cost too).
That's why there are a lot of tips for saving money on Google BigQuery. Some examples by a former BigQuery Dev Advocate
As of December 2021, I notice that a select * with LIMIT does not scan the whole table, and you pay only for a small number of rows. Obviously, if you add ORDER BY, it will scan everything.
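On the question of checking the amount before execution: one option is a dry run, which reports the bytes a query would process without running it. Below is a minimal Python sketch, assuming the google-cloud-bigquery client library is installed and default credentials are configured. The helper names are mine, and the reported figure is an estimate, so the bytes actually billed may differ.

```python
def human_bytes(n):
    """Format a byte count using binary units, e.g. 1536 -> '1.5 KiB'."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(n)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


def dry_run_bytes(sql):
    """Ask BigQuery how many bytes `sql` would process, without running it."""
    # Imported lazily so that human_bytes works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes default project and credentials
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)  # returns immediately for a dry run
    return job.total_bytes_processed


# Example call (needs credentials, so it is left commented out):
# sql = "SELECT * FROM `bigquery-public-data.github_repos.commits` LIMIT 20"
# print(human_bytes(dry_run_bytes(sql)))
```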

Understanding data scanned when querying ORC with Presto/Athena

I have a large amount of data in ORC files in AWS S3. The data in ORC files is sorted by uuid. I create an AWS Athena (Presto) table on top of them and run the following experiment.
First, I retrieve the first row to see how much data gets scanned:
select * from my_table limit 1
This query reports 18 MB of data being scanned.
I record the uuid from the row returned from the first query and run the following query:
select * from my_table where uuid=<FIRST_ROW_UUID> limit 1
This query reports 8.5 GB of data being scanned.
By design, both queries return the same result but the second query scans 500 times more data!
Any ideas why this is happening? Is this something inherent to ORC design or is it specific to how Presto interacts with S3?
[EDIT after ilya-kisil's response]
Let's change the last query to only select the uuid column:
select uuid from my_table where uuid=<FIRST_ROW_UUID> limit 1
For this query, the amount of data scanned drops to about 600 MB! This means that the bulk of the 8.5 GB scanned in the second query is attributed to gathering values from all columns for the record found and not to finding this record.
Given that all values in the record add up to no more than 1 MB, scanning almost 8 GB of data to put these values together seems extremely excessive. This seems like some idiosyncrasy of ORC or columnar formats in general and I am wondering if there are standard practices, e.g. ORC properties, that help reduce this overhead?
Well, this is fairly simple. The first query picks an arbitrary record from your data. On top of that, it is not guaranteed that you have read the very first record, since ORC files are splittable and can be processed in parallel. The second query, on the other hand, looks for a specific record.
Here is an analogy. Let's assume you have 100 coins with a UUID and some other info imprinted on their backs. All of them are face up on a table, so you can't see their UUIDs.
select * from my_table limit 1
This query is like flipping some random coin, looking at what is written on its back, and putting it back on the table face up. Next, someone comes and shuffles all of the coins.
select * from my_table where uuid=<FIRST_ROW_UUID> limit 1
This query is like you wanting to look at the information written on the back of a specific coin. It is unlikely that you would flip the correct coin with your first try. So you would need to "scan" more coins (data).
One of the common ways to reduce the size of scanned data is to partition your data, i.e. put it into separate "folders" (not files) in your S3 bucket. The "folder" names can then be used as virtual columns within your table definition, i.e. additional metadata for your table. Have a look at this post, which goes into more detail on how to optimise queries in Athena.
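The "folder" layout usually follows the Hive convention of key=value path segments. Here is a small Python sketch of what that looks like and of how a filter on the partition columns narrows down the prefixes that need to be read. The bucket, table, and column names (year, month) are hypothetical, and the pruning function is only a model of the idea.

```python
def partition_prefix(table_root, **partitions):
    """Build a Hive-style partition prefix: root/key1=value1/key2=value2/."""
    parts = [f"{key}={value}" for key, value in partitions.items()]
    return "/".join([table_root.rstrip("/"), *parts]) + "/"


def matching_prefixes(prefixes, **wanted):
    """Keep only the prefixes whose partition values match the filter."""
    needles = [f"/{key}={value}/" for key, value in wanted.items()]
    return [p for p in prefixes if all(n in p for n in needles)]


root = "s3://my-bucket/my_table"  # hypothetical location
prefixes = [
    partition_prefix(root, year=2020, month="01"),
    partition_prefix(root, year=2020, month="02"),
    partition_prefix(root, year=2021, month="01"),
]

# A filter such as WHERE year = 2020 AND month = '01' touches one "folder".
print(matching_prefixes(prefixes, year=2020, month="01"))
# ['s3://my-bucket/my_table/year=2020/month=01/']
```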

Bigquery Tier 1 exceeded for partitioned table but not for by day tables

We have two tables in bigquery: one is large (a couple billion rows) and on a 'table-per-day' basis, the other is date partitioned, has the exact same schema but a small subset of the rows (~100 million rows).
I wanted to run a (standard-sql) query with a subselect in form of a join (same when subselect is in the where clause) on the small partitioned dataset. I was not able to run it because tier 1 was exceeded.
If I run the same query on the big dataset (that contains the data I need and a lot of other data) it runs fine.
I do not understand the reason for this.
Is it because:
1. Partitioned tables need more resources to query
2. BigQuery has some internal rules that the ratio of data processed to resources needed must meet a certain threshold, i.e. I was not paying enough when I queried the small dataset given the amount of resources I needed.
If 1. is true, we could simply put the small dataset on a 'table-per-day' basis as well in order to solve the issue. But before we do that, we would like to know whether it is really going to solve our problem.
Details on the queries:
Big dataset
queries 11 GB, runs 50 secs, Job Id remilon-study:bquijob_2adced71_15bf9a59b0a
Small dataset
Job Id remilon-study:bquijob_5b09f646_15bf9acd941
I'm an engineer on BigQuery and I just took a look at your jobs but it looks like your second query has an additional filter with a nested clause that your first query does not. It is likely that that extra processing is making your query exceed your tier. I would recommend running the queries in the BigQuery UI and looking at the Explanation tab to see how the queries differ in the query plan.
If you try running the exact same query (modifying only the partition syntax) for both tables and still get the same error I would recommend filing a bug.

How to show a sample of the data in BigQuery?

Let us suppose I have a 1TB dataset in BigQuery, and I want to be able to view the data in a columnar view, limiting to 1000 results. Here are a few of the queries I might use:
1. SELECT * FROM mytable LIMIT 1000
2. SELECT first_name, last_name FROM mytable LIMIT 1000
3. SELECT last_name, first_name FROM mytable LIMIT 1000
4. SELECT * FROM mytable ORDER BY first_name LIMIT 1000
If I ran these four queries I would be charged ~$20 ($5/TB, pretend * = first_name, last_name). This seems like a very high amount to pay to just sample the data -- is there another way to query this data to get a limited view of it, like the above?
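The arithmetic behind the ~$20 figure, written out as a short Python sketch. It uses only the numbers from the question and its simplifying assumption that every query scans the full 1 TB.

```python
PRICE_PER_TB_USD = 5.0  # price quoted in the question
TABLE_SIZE_TB = 1.0     # dataset size from the question

queries = [
    "SELECT * FROM mytable LIMIT 1000",
    "SELECT first_name, last_name FROM mytable LIMIT 1000",
    "SELECT last_name, first_name FROM mytable LIMIT 1000",
    "SELECT * FROM mytable ORDER BY first_name LIMIT 1000",
]

# Assumption from the question: each query is billed for the whole table.
cost_per_query = TABLE_SIZE_TB * PRICE_PER_TB_USD
total_cost = len(queries) * cost_per_query

print(cost_per_query, total_cost)  # 5.0 20.0
```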
If your data is dynamic, meaning it is updated daily or in some other way, you can use Table Decorators (legacy SQL only)
For example
SELECT * FROM [mytable@-3600000--1800000] LIMIT 1000
will query only the data inserted between one hour and half an hour ago, thus lowering the cost a lot!
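The offsets in a range decorator are milliseconds relative to now, and in legacy SQL the decorator is introduced with @ inside the bracketed table name. Here is a small Python sketch that builds such a string from time spans; the helper name is hypothetical and the function only formats text, it does not talk to BigQuery.

```python
from datetime import timedelta


def range_decorator(table, start_ago, end_ago=None):
    """Build a legacy-SQL table reference with a relative range decorator.

    `start_ago` and `end_ago` are timedeltas measured back from now.
    When `end_ago` is omitted the range is left open, i.e. it runs up to now.
    """
    start_ms = int(start_ago.total_seconds() * 1000)
    suffix = f"@-{start_ms}-"
    if end_ago is not None:
        end_ms = int(end_ago.total_seconds() * 1000)
        suffix += f"-{end_ms}"
    return f"[{table}{suffix}]"


# Data added between one hour and half an hour ago.
window = range_decorator("mytable", timedelta(hours=1), timedelta(minutes=30))
# Data added during the last hour.
last_hour = range_decorator("mytable", timedelta(hours=1))

print(window)     # [mytable@-3600000--1800000]
print(last_hour)  # [mytable@-3600000-]
```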
Another option is to use day-partitioned tables, so you can query only a specific day's worth of data
Is there a way to export a subset of the data instead of doing a query?
Yes. You can use the Tabledata.list API to page through the data in your original table and insert it into a new [sampled] table using whatever sampling logic you need. Note: this API is free, as it doesn't actually use the BigQuery query engine but rather reads from the underlying storage, so you can be reasonably wild :o)
Of course, you need to implement this in the client of your choice.
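As one possible client-side implementation, here is a minimal Python sketch, assuming the google-cloud-bigquery library, whose Client.list_rows pages through table data rather than running a query. Taking every n-th row is just one simple sampling choice of mine, and the function names and the table id in the comment are placeholders.

```python
from itertools import islice


def every_nth(rows, n):
    """Yield every n-th item of `rows` (a simple systematic sample)."""
    return islice(rows, 0, None, n)


def sample_table(table_id, n=100, page_size=10_000):
    """Page through a table and keep every n-th row."""
    # Imported lazily so that every_nth works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes default project and credentials
    rows = client.list_rows(table_id, page_size=page_size)
    return every_nth(rows, n)


# Example call (needs credentials, so it is left commented out):
# for row in sample_table("my-project.mydataset.mytable", n=1000):
#     print(dict(row.items()))
```

Note that this still reads through the whole table page by page, so it can be slow on large tables even though no query is run.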
I assume you are accessing BQ through the online query interface (https://bigquery.cloud.google.com/table . . . ).
Click on the table in the data set. Go down to where it says "Table Details" in bold letters, beneath the "Run Query" icon.
In the second row below that is an option for "Preview". This will show you some data and it's free.
We have a sample table that's generated every day at work which I find extremely useful for many tasks. It's as simple as:
SELECT * FROM mytable WHERE RAND() < 0.01
The table is hierarchical, and this sampling is set to reproduce the whole structure; so queries can be tested/replicated in exactly the same form and then swapped over to the big table if needed. The 1% sample applies to the top level of the hierarchy (meaning you don't have to wonder whether you are getting valid results from branches).
For us, there is enough data that sums and ratios are generally very representative. The only kind of data that poses a significant problem is relatively rare events, which means counts of unique elements can't be relied on.
And of course, after the single daily charge for making this table, the billing goes from dollars to cents!
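To see why sums hold up well on a 1% sample like this, here is a small Python simulation of the same RAND() < 0.01 idea on made-up data. The population is just the integers 0 to 99,999, and the seed is fixed only to make the run repeatable.

```python
import random


def bernoulli_sample(rows, rate, seed=None):
    """Keep each row independently with probability `rate`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < rate]


RATE = 0.01
population = list(range(100_000))  # stand-in for the big table
sample = bernoulli_sample(population, RATE, seed=42)

# Scale the sample sum back up to estimate the sum over the full table.
estimated_sum = sum(sample) / RATE
true_sum = sum(population)
relative_error = abs(estimated_sum - true_sum) / true_sum

print(len(sample), round(relative_error, 3))
```

The sample size lands near 1,000 rows and the scaled sum is typically within a few percent of the true value. A value that occurs only a handful of times in the population will often be missing from the sample altogether, which is the rare-event problem mentioned above.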

Bigquery: Does huge amount of tables in a dataset impact performance?

I am currently using BigQuery to store user information and compute aggregate results against huge log data. Since modifying the data is not possible, I am planning to work around this by storing each user record in a separate table. I understand BigQuery supports querying multiple tables, so I can still get all of the information. My questions are:
* As the number of users grows, will performance deteriorate compared to storing all the users in a single table?
* Are there any limitations on the number of tables per dataset in BigQuery?
Thanks in advance
From what I know, there is no hard limit on the number of tables in a dataset.
At the same time, the native BQ UI shows only the first 10,000 tables in a dataset.
Other limits to consider (just a few to mention):
* Daily update limit: 1,000 updates per table per day;
* A query (including referenced views) can reference up to 1,000 tables and no more;
* Each additional table involved in a query (with hundreds and hundreds of tables) has a considerable impact on performance;
* Even if each table is small, it will still be charged at the minimum of 10 MB (even if it is just a few KB).
Not knowing your exact scenario, I can't make a specific recommendation, but at least you've got answers to the items in your question.
Overall, the idea of having a table per user doesn't sound good to me.
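The 10 MB minimum adds up quickly with a table per user. Here is a rough Python sketch of the billed volume, assuming 1,000 user tables of 5 KB each (made-up sizes) and taking 10 MB as 10 * 1024 * 1024 bytes.

```python
MIN_BILLED_BYTES = 10 * 1024**2  # 10 MB minimum per referenced table (from the list above)


def billed_bytes(table_sizes):
    """Bytes billed when every referenced table is charged at least the minimum."""
    return sum(max(size, MIN_BILLED_BYTES) for size in table_sizes)


user_table_size = 5 * 1024  # hypothetical: 5 KB per user table
num_users = 1_000           # also the per-query table limit quoted above

per_user_tables = billed_bytes([user_table_size] * num_users)
single_table = billed_bytes([user_table_size * num_users])

print(per_user_tables, single_table)  # 10485760000 10485760
```

With these numbers the same 5 MB of data is billed as about 10 GB when spread over 1,000 tables, against 10 MB when kept in one table.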