To avoid extra charges when using the BigQuery Magnitude Simba JDBC driver, I'm looking for an implementation of the BQ API tabledata.list method in JDBC. Is there one?
This behavior is expected in BigQuery. The Controlling costs in BigQuery guide states:
Applying a LIMIT clause to a SELECT * query does not affect the amount of data read. You are billed for reading all bytes in the entire table, and the query counts against your free tier quota.
The guide also describes best practices for reducing costs, such as using SELECT * EXCEPT to reduce the number of columns, or using a WHERE clause against a partitioned table (see the sketch below).
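As a rough sketch only - the table and column names here (mydataset.sales, payload, sale_date) are hypothetical - the query reads every column except one wide one, and only the partitions covered by the date filter:

SELECT * EXCEPT (payload)
FROM mydataset.sales
WHERE sale_date BETWEEN DATE '2019-01-01' AND DATE '2019-01-31'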
The closest thing to what you are asking for is to Sample data using preview options:
In the API, use tabledata.list to retrieve table data from a specified set of rows.
You can find some examples in the List tables or Preview table data samples.
I'm a little new to BQ. I'm running a very simple query over a view to get a quick look at the data, but when I add, say, LIMIT 100 to see just the first 100 rows, I don't get any reduction in the data read and hence the cost. If I simply want to do this, what inexpensive options do I have for looking at the data?
For example:
select * from table
uses exactly the same projected data as
select * from table limit 100
Is there no simplification under the hood? Is BQ scanning all rows and then taking the top 100?
BigQuery charges based on the data queried, and unfortunately LIMIT does not reduce the volume of data queried.
The following can help:
using the table preview in the console (this is free, if I recall correctly), though it does not work on views or some types of attached tables
reducing the number of columns that are queried
if your data is partitioned, you can query a specific partition (see the sketch below) - https://cloud.google.com/bigquery/docs/querying-partitioned-tables
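A minimal sketch, assuming a hypothetical ingestion-time partitioned table mydataset.events; the filter on the _PARTITIONTIME pseudo-column restricts the scan (and the bill) to a single partition:

SELECT field1, field2
FROM mydataset.events
WHERE _PARTITIONTIME = TIMESTAMP '2021-01-01'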
There is more information from Google on this page: https://cloud.google.com/bigquery/docs/best-practices-performance-input
I have a use case: designing storage for 30 TB of text files as part of deploying a data pipeline on Google Cloud. My input data is in CSV format, and I want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which of the options below would be better for this use case?
Using Cloud Storage for storage and linking permanent tables in BigQuery for querying, or using Cloud Bigtable for storage and installing HBase shell on Compute Engine to query the Bigtable data.
Based on my analysis below for this specific use case, I see that Cloud Storage can be queried through BigQuery. Also, Bigtable supports CSV imports and querying. The BigQuery quotas documentation also mentions a maximum size per load job of 15 TB across all input files for CSV, JSON, and Avro, which I assume means I could run multiple load jobs if I'm loading more than 15 TB.
https://cloud.google.com/bigquery/external-data-cloud-storage#temporary-tables
https://cloud.google.com/community/tutorials/cbt-import-csv
https://cloud.google.com/bigquery/quotas
So, does that mean I can use BigQuery for the above use case?
The short answer is yes.
I wrote about this in:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
And when loading, cluster your tables for massive cost improvements on the most common queries:
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
In summary:
BigQuery can read CSVs and other files straight from GCS.
You can define a view that parses those CSVs in any way you might prefer, all within SQL.
You can run a CREATE TABLE statement to materialize the CSVs into BigQuery native tables for better performance and costs (see the sketch after this list).
Instead of CREATE TABLE you can do imports via the API; those are free (as opposed to the cost of the query behind CREATE TABLE).
15 TB can be handled easily by BigQuery.
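A minimal sketch of that flow in standard SQL DDL - all the names here (my-bucket, mydataset, the raw_csv/parsed/native tables and their columns) are hypothetical, and the CSV options depend on your files:

-- external table reading CSVs straight from GCS (nothing stored in BigQuery)
CREATE EXTERNAL TABLE mydataset.raw_csv (
  id STRING,
  day STRING,
  value STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/data/*.csv'],
  skip_leading_rows = 1
);

-- view that parses the raw strings, all within SQL
CREATE OR REPLACE VIEW mydataset.parsed AS
SELECT
  CAST(id AS INT64) AS id,
  PARSE_DATE('%Y-%m-%d', day) AS day,
  CAST(value AS FLOAT64) AS value
FROM mydataset.raw_csv;

-- materialize into a partitioned, clustered native table for better performance and cost
CREATE TABLE mydataset.native
PARTITION BY day
CLUSTER BY id AS
SELECT * FROM mydataset.parsed;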
Is there a query I can run to determine how much the queries against each table are costing us? For instance, the result of this query would include at least something like:
dataset.table1   236 TB processed
dataset.table2   56 GB processed
dataset.table3   24 kB processed
etc.
Also, is there a way to know which specific queries are costing us the most?
Thanks!
Let's talk first about the data and the respective data points needed to build such a query!
Take a look at the Job resource.
Here you have a few useful properties:
configuration.query.query - BigQuery SQL query to execute.
statistics.query.referencedTables - Referenced tables for the job.
statistics.query.totalBytesBilled - Total bytes billed for the job.
statistics.query.totalBytesProcessed - Total bytes processed for the job.
statistics.query.billingTier - Billing tier for the job.
Having the above data points would allow you to write a relatively simple query to answer your cost-per-query and cost-per-table questions!
So, now - how to get this data available?
You can collect your jobs using the Jobs.list API and then loop through all the available jobs, retrieving the respective stats via the Jobs.get API - of course dumping the retrieved data into a BigQuery table. Then you can enjoy the analysis!
Or you can use BigQuery's audit logs to track access and cost details (as described in the docs) and export them back to BigQuery for analysis.
The former option (Jobs.list and then Jobs.get in a loop) gives you the ability to get your jobs' info even if you don't have audit logs enabled yet, because the Jobs.get API returns information about a specific job for a six-month period after its creation - so plenty of data for analysis! Once the stats are in a table, a query like the sketch below can break the cost down per table.
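For illustration only - a minimal sketch assuming you dumped the job stats into a hypothetical table mydataset.job_stats with columns query (STRING), referenced_tables (ARRAY<STRING>, holding names like 'dataset.table1'), and total_bytes_billed (INT64); note that a query touching several tables is counted in full against each of them:

SELECT
  t AS table_name,
  SUM(total_bytes_billed) AS bytes_billed
FROM mydataset.job_stats,
  UNNEST(referenced_tables) AS t
GROUP BY table_name
ORDER BY bytes_billed DESC

Grouping by query instead of table_name answers the second question - which specific queries cost the most.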
In my understanding, it is currently not possible to get processed bytes per table.
In my understanding it would be a great feature: it would let you identify and optimize costs, and give you a better way to understand the effectiveness of partitioning and clustering changes. Currently it is only possible to get the total processed bytes for a query and to see which tables were referenced. But there is no easy query - in fact no query at all - that makes it possible to analyze this cost at the table level, which is more granular than the query level.
I know from this question that one can do random sampling with RAND():
SELECT * FROM [table] WHERE RAND() < percentage
But this would require a full table scan and incur the equivalent cost. I'm wondering if there are more efficient ways?
I'm experimenting with the tabledata.list API but I get java.net.SocketTimeoutException: Read timed out when the start index is very large (i.e. > 10000000). Is this operation not O(1)?
// fetch a single row at a given offset via tabledata.list
bigquery
  .tabledata()
  .list(tableRef.getProjectId, tableRef.getDatasetId, tableRef.getTableId)
  .setStartIndex(index) // the timeout shows up when this index is very large
  .setMaxResults(1L)
  .execute()
I would recommend paging through tabledata.list with pageToken and collecting sample rows from each page. This should scale much better.
Another (totally different) option I see is the use of Table Decorators.
You can programmatically generate, in a loop, a random time (for a snapshot decorator) or time frame (for a range decorator) and query only those portions of the data, extracting just what you need (see the sketch below).
Note the limitation: this only allows you to sample data that is less than 7 days old.
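A minimal sketch in legacy SQL (decorators are only available there), with hypothetical project, dataset, table and column names; the range decorator below uses relative millisecond offsets and reads only the data added between two hours and one hour ago:

SELECT field1, field2
FROM [myproject:mydataset.mytable@-7200000--3600000]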
tabledata.list is not especially performant for arbitrary lookups in a table, especially as you look later and later into the table. It is not really designed for efficient retrieval of an entire table; it's more for looking at the first few pages of data in a table.
If you want to run some operation over all the data in your table, but not run a query, you should probably use an extract job to GCS instead, and sample rows from the output files.
In traditional data modeling, I create hourly and daily rollup tables to reduce data storage and improve query response time. However, attempting to create similar rollup tables in BigQuery easily runs into the "Response too large to return" error. What is the recommended method for creating rollup tables with BigQuery? I need to reduce the data in order to reduce the cost of storage and queries.
Thx!
A recently announced BigQuery feature allows large results!
Now you can specify a flag and a destination table, and results of arbitrary size will be stored in the designated table (see the sketch below).
https://developers.google.com/bigquery/docs/queries#largequeryresults
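As a sketch only - the source table mydataset.events and its event_time and value columns are hypothetical - a daily rollup is just an ordinary aggregation query; the large-results flag and the destination table (say, mydataset.events_daily) are set on the job rather than in the SQL:

SELECT
  DATE(event_time) AS day,
  COUNT(*) AS row_count,
  SUM(value) AS total_value
FROM mydataset.events
GROUP BY day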
It sounds like you are appending all of your data to a single table, then want to create smaller tables to query over ... is that correct?
One option would be to load your data in hourly slices, then create the daily and 'all' tables by performing table copy operations with write_disposition=WRITE_APPEND. Alternatively, you can use multiple tables in your queries. For example: select foo from table20130101,table20130102,table20130103. (Note this does not do a join; it does a UNION ALL. It is a quirk of the BigQuery query syntax.)
If it would be difficult to change the layout of your tables: there isn't currently support for larger query result sizes, but it is one of our most requested features and we have it at a high priority.
Also, creating smaller tables won't necessarily improve query performance, since BigQuery processes queries in parallel to the extent possible. It won't reduce storage costs unless you only store part of the table. It will, of course, reduce the cost of a query, since running queries against larger tables is more expensive.
If you describe your scenario a bit more I may be able to offer more concrete advice.