They say there are no stupid questions, but this might be an exception.
I understand that BigQuery, being a columnar database, does a full scan of any column referenced in a query.
I also understand that query results can be cached or a named table can be created with the results of a query.
However, I also see tabledata.list() in the documentation, and I'm unsure how it fits in with query costs. Once a table is created from a query, am I free to access that table without cost through the API?
Let's say, for example, that I run a query grouped by UserID, and I then want to present the results of that query to individual users based on that ID. As far as I understand, there are two obvious ways of getting the appropriate row out.
I can write another query over the destination table with a WHERE userID=xxx clause
I can use the tabledata.list() endpoint to get all the (potentially paginated) data and get the appropriate row myself in my code
Situation 1 would incur a query cost, while situation 2 would not. Am I getting this right?
The tabledata.list API is free, as it does not use the BigQuery engine at all,
so you are right about both 1 and 2.
I want to find which tables/columns in Redshift remain unused in the database in order to do a clean-up.
I have been trying to parse the queries from the stl_query table, but it turns out this is quite a complex task, and I haven't found any library I can use for it.
Does anyone know if this is somehow possible?
Thank you!
The column question is a tricky one. For table-use information I'd look at stl_scan, which records info about every table scan step performed by the system. Each of these is date-stamped, so you will know when the table was "used". Just remember that the system logging tables are pruned periodically and the data only goes back a few days, so you may need a process that captures table use daily to build an extended history.
I pondered the column question some more. One thought is that query IDs are also provided in stl_scan, and these could help in identifying the columns used from the query text. For every query ID that scans table_A, search the query text for each column name of the table. It wouldn't be perfect, but it's a start.
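For the table side, something like the following could be a rough starting point; the join is only a sketch of the idea, not a polished clean-up tool:

```sql
-- Hedged sketch: tables with no scan recorded in stl_scan over the retained
-- log window (remember the system tables only keep a few days of history).
SELECT ti."schema", ti."table", ti.table_id
FROM svv_table_info ti
LEFT JOIN (SELECT DISTINCT tbl FROM stl_scan) s
       ON s.tbl = ti.table_id
WHERE s.tbl IS NULL
ORDER BY ti."schema", ti."table";
```

Run it while the retained history still covers a representative workload, or snapshot stl_scan into your own history table daily and run the check against that instead.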
Could someone please explain the correct answer to the following multiple-choice question?
You have a query that filters a BigQuery table using a WHERE clause on timestamp and ID columns. By using bq query --dry_run you learn that the query triggers a full scan of the table, even though the filter on timestamp and ID selects a tiny fraction of the overall data. You want to reduce the amount of data scanned by BigQuery with minimal changes to existing SQL queries. What should you do?
Create a separate table for each ID.
Use the LIMIT keyword to reduce the number of rows returned.
Recreate the table with a partitioning column and clustering column.
Use the bq query --maximum_bytes_billed flag to restrict the number of bytes billed.
As far as I know, the only ways to limit the number of bytes read by BigQuery are to remove column references entirely, to remove table references, or to use partitioning (and, in some cases, clustering).
One of the challenges when starting to use BigQuery is that a query like this:
select *
from t
limit 1;
can be really, really expensive.
However, a query like this:
select sum(x)
from t;
on the same table can be quite cheap.
To answer the question, you should learn more about how BigQuery bills for usage.
Assuming these are the only four possible answers, the answer is almost certainly "Recreate the table with a partitioning column and clustering column."
Let's eliminate the others:
Use the LIMIT keyword to reduce the number of rows returned.
This isn't going to help at all, since the LIMIT is only applied after a full table scan has already happened, so you'll still be billed the same, despite the limit.
Create a separate table for each ID.
This doesn't seem likely to help: in addition to being an organizational mess, you'd have to query every table to find all the right timestamps, and you'd process the same amount of data as before (but with a lot more work).
Use the bq query --maximum_bytes_billed flag to restrict the number of bytes billed.
You could do this, but the query would simply fail whenever it would bill more than that maximum, so you wouldn't get your results; it doesn't reduce the amount of data a query that does run has to scan.
So why partitioning and clustering?
BigQuery (on-demand) billing is based on the columns that you select, and the amount of data that you read in those columns. So you want to do everything you can to reduce the amount of data processed.
Depending on the exact query, partitioning by the timestamp allows you to only scan the data for the relevant days. This can obviously be a huge savings compared to an entire table scan.
Clustering allows you to put commonly used data together within a table by sorting on the clustering column, so that BigQuery can skip irrelevant data based on the filter (WHERE clause). Thus, you scan less data and reduce your cost. There is a similar benefit for aggregation of data.
This of course all assumes you have a good understanding of the queries you are actually making and which columns make sense to cluster on.
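To make that concrete, here is a hedged sketch of recreating the table with partitioning and clustering; the dataset, table, and column names are invented:

```sql
-- Hedged sketch (BigQuery standard SQL); all names are invented.
CREATE TABLE mydataset.events_new
PARTITION BY DATE(event_timestamp)
CLUSTER BY id AS
SELECT * FROM mydataset.events;

-- Existing queries barely change: keep filtering on the timestamp and id
-- columns, and only the matching partitions/blocks are scanned.
SELECT *
FROM mydataset.events_new
WHERE event_timestamp >= TIMESTAMP('2021-01-01')
  AND event_timestamp <  TIMESTAMP('2021-01-02')
  AND id = 'abc-123';
```

A one-time copy like this keeps the SQL of existing queries almost unchanged; only the table name changes, plus an explicit filter on the partitioning column where one isn't already present.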
I have a table with 260 columns in SQL Server. When we run "SELECT COUNT(*) FROM table" it takes almost 5-6 to get the count. The table contains close to 90-100 million records with 260 columns, where more than 50% of the columns contain NULLs. Apart from that, users can also build dynamic SQL queries against the table from the UI, so searching 90-100 million records takes a long time to return results. Is there a way to improve the find functionality on a SQL table where the filter criteria can be anything? Can anyone suggest the fastest way to get aggregate data on 25 GB of data? The UI should not hang or time out.
Investigate horizontal partitioning. This will really only help query performance if you can force users to put the partitioning key into the predicates.
Try vertical partitioning, where you split the one 260-column table into several tables with fewer columns. Put all the values which are commonly required together into one table. Queries will then only reference the table(s) that contain the required columns. This gives you more rows per page, i.e. fewer pages per query.
You have a high fraction of NULLs. Sparse columns may help, but calculate your percentages as they can hurt if inappropriate. There's an SO question on this.
Filtered indexes and filtered statistics may be useful if the DB often runs similar queries; a sketch of a filtered index follows below.
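A minimal sketch of the filtered-index idea, with invented table and column names:

```sql
-- Hedged sketch (T-SQL); table and column names are invented.
-- The index covers only the non-NULL slice of a mostly-NULL column,
-- so queries that filter on it touch a much smaller structure.
CREATE NONCLUSTERED INDEX IX_BigTable_Status_NotNull
    ON dbo.BigTable (Status)
    INCLUDE (CreatedDate, CustomerID)
    WHERE Status IS NOT NULL;
```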
As the guys state in the comments, you need to analyse a few of the queries and see which indexes would help you the most. If your queries do a lot of text searching, you could use the full-text search feature of MSSQL Server. Here you will find a nice reference with good examples.
Things that came to mind:
[SQL Server 2012+] If you are using SQL Server 2012, you can use the new Columnstore Indexes.
[SQL Server 2005+] If you are filtering a text column, you can use Full-Text Search
If you have some function that you apply frequently to some column (like SOUNDEX of a column, for example), you could create a PERSISTED computed column so that the value does not have to be computed every time (see the sketch after this list).
Use temp tables (indexed ones will be much better) to reduce the number of rows to work on.
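To illustrate the columnstore and persisted-computed-column points, here is a hedged T-SQL sketch; all table and column names are invented:

```sql
-- Hedged sketch (T-SQL); all names are invented.

-- SQL Server 2012+: a nonclustered columnstore index speeds up scans and
-- aggregates over a few columns of a very wide table
-- (note the 2012 version makes the table read-only while it exists).
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_BigTable
    ON dbo.BigTable (CustomerID, OrderDate, Amount);

-- A persisted computed column stores a frequently used expression once
-- instead of recomputing it on every query, and it can be indexed.
ALTER TABLE dbo.BigTable
    ADD CustomerSoundex AS SOUNDEX(CustomerName) PERSISTED;

CREATE INDEX IX_BigTable_CustomerSoundex
    ON dbo.BigTable (CustomerSoundex);
```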
@Twelfth's comment is very good:
"I think you need to create an ETL process and start changing this into a fact table with dimensions."
Changing my comment into an answer...
You are moving from a transaction world where these 90-100 million records are recorded and into a data warehousing scenario where you are now trying to slice, dice, and analyze the information you have. Not an easy solution, but odds are you're hitting the limits of what your current system can scale to.
In a past job, I had several (6) data fields belonging to each record that were pretty much free text and randomly populated depending on where the data was generated (they were search queries, and people were entering what they would basically type into Google). With 6 fields like this, I created a dim_text table that took each entry in any of these 6 fields and replaced it with an integer. This left me with a table of two columns, text_id and text. Any time a user searched for a specific entry in any of these 6 columns, I would search my dim_text table, which was optimized (indexed) for this sort of query, to return the integer matching the text they wanted. I would then take that integer and search for all occurrences of it across the 6 fields instead. Searching one table that is highly optimized for this type of free-text lookup and then querying the main table for instances of the integer is far quicker than searching 6 free-text fields directly.
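A minimal sketch of that dim_text idea (all names are invented, and it is simplified to two of the six columns):

```sql
-- Hedged sketch of the dim_text pattern; all names are invented.
CREATE TABLE dbo.dim_text (
    text_id    INT IDENTITY(1,1) PRIMARY KEY,
    text_value NVARCHAR(400) NOT NULL UNIQUE   -- indexed via the UNIQUE constraint
);

-- The main table stores integer references instead of repeating free text.
CREATE TABLE dbo.fact_search (
    search_id BIGINT IDENTITY(1,1) PRIMARY KEY,
    text_id_1 INT NULL REFERENCES dbo.dim_text(text_id),
    text_id_2 INT NULL REFERENCES dbo.dim_text(text_id)
    -- ... up to text_id_6 in the scenario described above
);

-- Lookup: resolve the text once, then search the narrow integer columns.
DECLARE @id INT = (SELECT text_id FROM dbo.dim_text WHERE text_value = N'some query');
SELECT *
FROM dbo.fact_search
WHERE @id IN (text_id_1, text_id_2);
```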
I'd also create aggregate tables (reporting tables, if you prefer the term) for your common aggregates. There are quite a few options here that your business setup will determine. For example, if each row is an item on a sales invoice and you need to show sales by date, it may be better to aggregate total sales by invoice and save that to a table; then, when a user wants totals by day, an aggregate is run over the aggregate of the invoices to determine the totals by day (so you've 'partially' aggregated the data in advance).
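And a sketch of that partial pre-aggregation idea, again with invented names:

```sql
-- Hedged sketch of a reporting/aggregate table; all names are invented.
CREATE TABLE dbo.agg_invoice_totals (
    invoice_id   BIGINT PRIMARY KEY,
    invoice_date DATE   NOT NULL,
    total_amount DECIMAL(18, 2) NOT NULL
);

-- Refreshed periodically (e.g. nightly) from the detail rows.
INSERT INTO dbo.agg_invoice_totals (invoice_id, invoice_date, total_amount)
SELECT invoice_id, invoice_date, SUM(line_amount)
FROM dbo.invoice_lines
GROUP BY invoice_id, invoice_date;

-- Daily totals now aggregate the (much smaller) per-invoice table.
SELECT invoice_date, SUM(total_amount) AS daily_total
FROM dbo.agg_invoice_totals
GROUP BY invoice_date;
```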
Hope that makes sense...I'm sure I'll need several edits here for clarity in my answer.
I have the following query:
SELECT * FROM messages GROUP BY peer
(really it's more complicated with joins, but I omitted them here for simplicity)
The problem is that SQLite doesn't use any indexes and always performs a full scan of the table. As expected, it works fast on small data sets, but it's noticeably slow with a big table containing thousands of rows. Here's the output of the EXPLAIN QUERY PLAN command:
0|0|0|SCAN TABLE messages USING INDEX messages_peer_mid (~1000000 rows)
Despite it says "USING INDEX" it still performs a full scan. Is there any way to make SQLite use index for this query or it's better to give up with GROUP BY and look for some other approach?
The plan takes the amount of data into account and performs a scan because its algorithm probably concludes it's faster to do so.
Other comments: your query has no WHERE condition and you are returning ALL columns, so why wouldn't you expect a table scan?
Indexes mainly assist in selecting records from a table (using a WHERE clause or as a result of a JOIN operation). GROUP BY is performed on the set of records after they've been selected and retrieved from the table, so an index can't reduce the number of rows that have to be read here; at best it lets SQLite group without an extra sort step, which is what your EXPLAIN output shows.
If you want to know more about what options are available for index use in your query, please post the entire query.
Also, you note that the SQL you gave is a simplified representation of the query you're running, but if you're really using *, or any non-aggregated field names other than peer, in your statement, you may not be getting the results you want.
Finally, you ask whether "it's better to give up with GROUP BY and look for some other approach". GROUP BY serves a specific function in SQL (producing new aggregated result sets from non-aggregated data). If that's your goal, GROUP BY is likely to be the best solution, because it defers the decision about how to retrieve and process the data to the database engine, which is highly optimized and aware of database statistics. If that's not your goal and you're using GROUP BY as an "approach" to some other functionality, let us know what it is you're actually trying to achieve.
I am new to columnar DB concepts and BigQuery in particular. I noticed that for the sake of performance and cost efficiency it is recommended to split data across tables not only logically - but also by time.
For example - while I need a table to store my logs (1 logical table that is called "logs"), it is actually considered a good practice to have a separate table for different periods, like "logs_2012", "logs_2013", etc... or even "logs_2013_01", "logs_2013_02", etc...
My questions:
1) Is it actually the best practice?
2) Where would be best to draw the line - an annual table? A monthly table? A daily table? You get the point...
3) In terms of retrieving the data via queries - what is the best approach? Should I construct my queries dynamically using the UNION option? If I had all my logs in one table - I would naturally use the where clause to get data for the desired time range, but having data distributed over multiple tables makes it weird. I come from the world of relational DB (if it wasn't obvious so far) and I'm trying to make the leap as smoothly as possible...
4) Using the distributed method (different tables for different periods) still raises the following question: before querying the data itself, I want to be able to determine, for a specific log type, what range is available for querying. For example, for a specific machine I would like to first present to my users the relevant scope of their available logs, and let them choose the specific period within that scope to get insights for. The question is: how do I construct such a query when my data is distributed over a number of tables (one per period) and I don't know which tables are available? How can I construct a query when I don't know which tables exist? I might try to access the table "logs_2012_12" when this table doesn't actually exist, or even worse, I might not know which tables are relevant and available for my query.
Hope my questions make sense...
Amit
Table naming
For daily tables, the suggested table name pattern is the specific name of your table + the date like in '20131225'. For example, "logs20131225" or "logs_20131225".
Ideal aggregation: Day, month, year?
The answer to this question will depend on your data and your queries.
Will you usually query one or two days of data? Then have daily tables, and your costs will be much lower, as you query only the data you need.
Will you usually query all your data? Then have all the data in one table. Having many tables in one query can get slower as the number of tables to query grows.
If in doubt, do both! You could have daily, monthly, yearly tables. For a small storage cost, you could save a lot when doing queries that target only the intended data.
Unions
Feel free to do unions.
Keep in mind that there is a limit of 1,000 tables per query. This means that if you have daily tables, you won't be able to query 3 years of data in a single query (3*365 > 1000).
Remember that unions in legacy BigQuery SQL don't use the UNION keyword, but the "," that other databases use for joins. Joins in BigQuery can be done with the explicit SQL keyword JOIN (or JOIN EACH for very big joins).
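For example, in the legacy SQL dialect this refers to, listing tables with commas unions them; the dataset and table names below are invented:

```sql
-- Hedged sketch (BigQuery legacy SQL); names are invented.
SELECT user_id, COUNT(*) AS events
FROM [mydataset.logs_20131224], [mydataset.logs_20131225]
GROUP BY user_id;
```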
Table discovery
API: tables.list will list all tables in a dataset, through the API.
SQL: To query the list of tables within SQL... stay tuned.
New 2016 answer: Partitions
Now you can have everything in one table, and BigQuery will analyze only the data contained in the desired dates - if you set up the new partitioned tables:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables
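For instance, with an ingestion-time partitioned table, a filter on the _PARTITIONTIME pseudo-column restricts the scan to the selected days; the table name below is invented:

```sql
-- Hedged sketch (BigQuery standard SQL); the table name is invented.
SELECT *
FROM mydataset.logs
WHERE _PARTITIONTIME >= TIMESTAMP('2016-06-01')
  AND _PARTITIONTIME <  TIMESTAMP('2016-06-08');
```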