How can I reduce the amount of data scanned by BigQuery during a query? - sql

Can someone explain the correct answer to the following multiple-choice question?
You have a query that filters a BigQuery table using a WHERE clause on timestamp and ID columns. By using bq query --dry_run you learn that the query triggers a full scan of the table, even though the filters on timestamp and ID select a tiny fraction of the overall data. You want to reduce the amount of data scanned by BigQuery with minimal changes to existing SQL queries. What should you do?
Create a separate table for each ID.
Use the LIMIT keyword to reduce the number of rows returned.
Recreate the table with a partitioning column and clustering column.
Use the bq query --maximum_bytes_billed flag to restrict the number of bytes billed.

As far as I know, the only ways to limit the number of bytes read by BigQuery are to remove column references entirely, remove table references, or use partitioning (and perhaps clustering in some cases).
One of the challenges when starting to use BigQuery is that a query like this:
select *
from t
limit 1;
can be really, really expensive.
However, a query like this:
select sum(x)
from t;
on the same table can be quite cheap.
To answer the question, you should learn more about how BigQuery bills for usage.

Assuming these are the only four possible answers, the answer is almost certainly "Recreate the table with a partitioning column and clustering column."
Let's eliminate the others:
Use the LIMIT keyword to reduce the number of rows returned.
This isn't going to help at all, since the LIMIT is only applied after a full table scan has already happened, so you'll still be billed the same, despite the limit.
Create a separate table for each ID.
This doesn't seem likely to help: in addition to being an organizational mess, you'd have to query every table to find all the right timestamps, and you'd process the same amount of data as before (but with a lot more work).
Use the bq query --maximum_bytes_billed flag to restrict the number of bytes billed.
You could do this, but then the query would simply fail whenever it tried to scan more than the allowed number of bytes, so you wouldn't get your results. It caps the cost; it doesn't reduce the data scanned.
So why partitioning and clustering?
BigQuery (on-demand) billing is based on the columns that you select, and the amount of data that you read in those columns. So you want to do everything you can to reduce the amount of data processed.
Depending on the exact query, partitioning by the timestamp allows you to only scan the data for the relevant days. This can obviously be a huge savings compared to an entire table scan.
Clustering allows you to put commonly used data together within a table by sorting it on the clustering column, so that BigQuery can skip data that cannot match the filter (WHERE clause). Thus, you scan less data and reduce your cost. There is a similar benefit for aggregation of data.
This of course all assumes you have a good understanding of the queries you are actually making and which columns make sense to cluster on.
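As a concrete sketch of the recommended option (the dataset, table, and column names here are made up; adapt them to your schema), recreating the table with a partitioning and clustering column can be done in a single CREATE TABLE ... AS SELECT:
-- Sketch only: assumes a table mydataset.events with a TIMESTAMP column event_ts and an id column
CREATE TABLE mydataset.events_new
PARTITION BY DATE(event_ts)
CLUSTER BY id
AS
SELECT * FROM mydataset.events;
Queries that keep the existing WHERE clause on the timestamp and ID columns should then prune partitions and clustered blocks instead of scanning the whole table, which you can verify with another --dry_run.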

Related

What is the difference between partitioning and indexing in a DB? (Performance-wise)

I am new to SQL and have been trying to optimize the query performances of my microservices to my DB (Oracle SQL).
Based on my research, I've seen that you can use indexing and partitioning to improve query performance. I think I understand each concept and how to apply it, but I'm not sure about the difference between the two.
For example, suppose I have a table Orders with 100 million entries and the following columns:
OrderId (PK)
CustomerId (6 digit unique number)
Order (what the order is. Ex: "Burger")
CreatedTime (Timestamp)
In essence, both methods "subdivide" the Orders table so that a DB query won't need to scan through all 100 million entries in the DB, right?
Let's say I want to find orders on "2020-01-30". I can create an index on CreatedTime to improve the performance.
But I could also create a partition per day based on CreatedTime to improve the performance.
Is there any difference between the two methods in this case? Is one better than the other?
There are several ways to partition - by range, by list, by hash, and by reference (although that tends to be uncommon).
If you were to partition by a date column, it would usually be by range, so one month/week/day goes into one partition, the next into another, and so on. If you filter rows where this date equals a value, the database can do a full scan of just the partition that houses that data. This can end up being quite efficient if most of the data in the partition matches your filter - apply the same thinking you would to a full table scan in general, just with the table already filtered down. If you wanted to look for an hour-long date range and you're partitioning by range with monthly intervals, then you're going to be reading about 730 times more data than necessary. Local indexes are useful in that scenario.
Smaller partitions help with this too, but you can end up with thousands of partitions. If you have selective queries that can't tell which partition needs to be read, you may want global indexes - but these add a lot of overhead to every partition maintenance operation.
If you index the date column instead, then you can quickly establish the location of the rows in your table that meet your filter. This is easy in an index because it's just a sorted list - you find the first key in the index that matches the filter and read until it no longer matches. You then have to look up those rows using single-block reads. The usual efficiency rules of an index apply - the less data you need to read with your filters, the more useful the index will be.
Usually, queries include more than just a date filter. These additional filters might be more appropriate for your partitioning scheme. You could also just include those columns in your index (remembering the golden rule of indexing: columns you filter with equality predicates should go before columns you filter with range predicates).
You can generally get all the performance you need with just indexing. Partitioning really comes into play when you have important queries that need to read huge chunks of data (generally reporting queries) or when you need to do things like purge data older than X months.
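To make the two options concrete, here is a rough sketch based on the Orders example above (Oracle 11g+ interval partitioning assumed; the Order column is renamed because ORDER is a reserved word):
-- One partition per day on CreatedTime
CREATE TABLE Orders (
OrderId NUMBER PRIMARY KEY,
CustomerId NUMBER(6) NOT NULL,
OrderDesc VARCHAR2(100),
CreatedTime TIMESTAMP NOT NULL
)
PARTITION BY RANGE (CreatedTime)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(PARTITION p0 VALUES LESS THAN (TIMESTAMP '2020-01-01 00:00:00'));
-- The indexing alternative (or complement): a local index on the same column
CREATE INDEX idx_orders_created ON Orders (CreatedTime) LOCAL;
A query for "2020-01-30" can then prune down to that day's partition, use the index, or both.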

Oracle SQL: What is the best way to select a subset of a very large table

I have been roaming these forums for a few years and I've always found my questions had already been asked, and a fitting answer was already present.
I have a pretty generic (and maybe easy) question now though, but I haven't been able to find a thread asking the same one yet.
The situation:
I have a payment table with 10-50M records per day, a history of 10 days and hundreds of columns. About 10-20 columns are indexed. One of the indices is batch_id.
I have a batch table with considerably fewer records and columns, say 10k a day and 30 columns.
If I want to select all payments from one specific sender, I could just do this:
Select * from payments p
where p.sender_id = 'SenderA'
This runs for a while, even though sender_id is also indexed. So I figure it's better to select the batches first, then go into the payments table with the batch_id:
select * from payments p
where p.batch_id in
(select b.batch_id from batches b where b.sender_id = 'SenderA')
--and p.sender_id = 'SenderA'
Now, my questions are:
In the second script, should I uncomment the Sender_id in my where clause on the payments table? It doesn't feel very efficient to filter on sender_id twice, even though it's in different tables.
Is it better if I make it an inner join instead of a nested query?
Is it better if I make it a common table expression instead of a nested query or inner join?
I suppose it could all fit into one question: What is the best way to query this?
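For reference, the join and CTE variants would look roughly like this (a sketch using the same tables and columns):
-- Inner join form
select p.*
from payments p
join batches b on b.batch_id = p.batch_id
where b.sender_id = 'SenderA';
-- CTE form
with sender_batches as (
select batch_id from batches where sender_id = 'SenderA'
)
select p.*
from payments p
join sender_batches sb on sb.batch_id = p.batch_id;
(If batch_id is not unique in batches, the join forms could return duplicate payment rows where the IN version would not.)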
In the worst case the two queries should run in the same time and in the best case I would expect the first query to run quicker. If it is running slower, there is some problem elsewhere. You don't need the additional condition in the second query.
The first query will retrieve index entries for a single value, so that is going to access fewer blocks than the second query, which has to find index entries for multiple batches (as well as executing the subquery, but that is probably not significant).
But the danger as always with Oracle is that there are a lot of factors determining which query plan the optimizer chooses. I would immediately verify that the statistics on your indexed columns are up-to-date. If they are not, this might be your problem and you don't need to read any further.
The next step is to obtain a query execution plan. My guess is that this will tell you that your query is running a full-table-scan.
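If you have not pulled a plan before, one minimal way in Oracle (using the first query from the question as the example):
EXPLAIN PLAN FOR
select * from payments p where p.sender_id = 'SenderA';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);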
Whether or not Oracle chooses to perform a full-table-scan on a query such as this is dependent on the number of rows returned and whether Oracle thinks it is more efficient to use the index or to simply read the whole table. The threshold for flipping between the two is not a fixed number: it depends on a lot of things, one of them being a parameter called DB_FILE_MULTIBLOCK_READ_COUNT.
This is set up by Oracle, and in theory it should be configured such that the transition between indexed and full-table-scan queries is smooth. In other words, at the transition point where your query is returning enough rows to just about make a full table scan more efficient, the index scan and the table scan should take roughly the same time.
Unfortunately, I have seen systems where this is way out and Oracle flips to doing full table scans far too quickly, resulting in a long query time once the number of rows gets over a certain threshold.
As I said before, first check your statistics. If that doesn't work, get a QEP and start tuning your Oracle instance.
Tuning Oracle is a very complex subject that can't be answered in full here, so I am forced to recommend links. Here is a useful page on that parameter (reducing it might help): Why Change the Oracle DB_FILE_MULTIBLOCK_READ_COUNT.
Other than that, the general Oracle performance tuning guide is here: (Oracle) Configuring a Database for Performance.
If you are still having problems, you need to progress your investigation further and then come up with a more specific question.
EDIT:
Based on your comment, your query is returning 4M rows out of the 10M-50M in the table. If it is 4M out of 10M, there is no way an index will be of any use. Even with 4M out of 50M, it is still pretty certain that a full-table-scan would be the most efficient approach.
You say that you have a lot of columns, so probably this 4M row fetch is returning a huge amount of data.
You could perhaps consider splitting off some of the columns that are not required and putting them into a child table. In particular, if you have columns containing a lot of data (e.g., some text comments or whatever) they might be better kept outside the main table.
Remember - small is fast, not only in terms of number of rows, but also in terms of the size of each row.
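A sketch of that kind of split (table and column names invented; the idea is just a 1:1 child table holding the wide, rarely used columns):
CREATE TABLE payment_details (
payment_id NUMBER PRIMARY KEY,
free_text_comments CLOB,
raw_message CLOB,
CONSTRAINT fk_payment_details FOREIGN KEY (payment_id) REFERENCES payments (payment_id)
);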
SQL is a declarative language. This means that you specify what you want, not how to get it.
Check your indexes, both primary and "normal" ones...

Performance Improve on SQL Large table

I have a 260-column table in SQL Server. Running "Select count(*) from table" takes almost 5-6 to get the count. The table contains close to 90-100 million records with 260 columns, and more than 50% of the columns contain NULLs. Apart from that, users can also build dynamic SQL queries against the table from the UI, so searching 90-100 million records takes time to return results. Is there a way to improve the find functionality on a SQL table where the filter criteria can be anything? Can anyone suggest the fastest way to get aggregate data on 25 GB of data? The UI should not hang or time out.
Investigate horizontal partitioning. This will really only help query performance if you can force users to put the partitioning key into the predicates.
Try vertical partitioning, where you split one 260-column table into several tables with fewer columns. Put all the values which are commonly required together into one table. The queries will only reference the table(s) which contain the required columns. This will give you more rows per page, i.e. fewer pages per query.
You have a high fraction of NULLs. Sparse columns may help, but calculate your percentages as they can hurt if inappropriate. There's an SO question on this.
Filtered indexes and filtered statistics may be useful if the DB often runs similar queries.
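For example, if a mostly-NULL column is only ever filtered on when it is populated, a filtered index keeps the index small (a sketch with invented names):
CREATE NONCLUSTERED INDEX IX_BigTable_SomeCode
ON dbo.BigTable (SomeCode)
WHERE SomeCode IS NOT NULL;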
As the guys state in the comments, you need to analyse a few of the queries and see which indexes would help you the most. If your queries do a lot of text searching, you could use the full-text search feature of SQL Server. Here you will find a nice reference with good examples.
Things that come to mind:
[SQL Server 2012+] If you are using SQL Server 2012, you can use the new Columnstore Indexes.
[SQL Server 2005+] If you are filtering a text column, you can use Full-Text Search
If you have some function that you apply frequently to some column (like SOUNDEX of a column, for example), you could create a PERSISTED computed column so the value doesn't have to be computed every time (see the sketch after this list).
Use temp tables (indexed ones will be much better) to reduce the number of rows to work on.
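A sketch of the persisted computed column idea from above (invented names):
-- Precompute SOUNDEX once, persist it, and index it so searches don't re-evaluate the function per row
ALTER TABLE dbo.BigTable ADD NameSoundex AS SOUNDEX(CustomerName) PERSISTED;
CREATE INDEX IX_BigTable_NameSoundex ON dbo.BigTable (NameSoundex);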
#Twelfth's comment is very good:
"I think you need to create an ETL process and start changing this into a fact table with dimensions."
Changing my comment into an answer...
You are moving from a transaction world where these 90-100 million records are recorded and into a data warehousing scenario where you are now trying to slice, dice, and analyze the information you have. Not an easy solution, but odds are you're hitting the limits of what your current system can scale to.
In a past job, I had several (6) data fields belonging to each record that were pretty much free text and randomly populated depending on where the data was generated (they were search queries, and people were entering what they would basically enter in Google). With 6 fields like this, I created a dim_text table that took each entry in any of these 6 fields and replaced it with an integer. This left me with a table with two columns, text_id and text. Any time a user was searching for a specific entry in any of these 6 columns, I would search my dim_text table, which was optimized (indexed) for this sort of query, to return an integer matching the query I wanted. I would then take the integer and search for all occurrences of that integer across the 6 fields instead. Searching 1 table highly optimized for this type of free-text search and then querying the main table for instances of the integer is far quicker than searching 6 free-text fields directly.
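A rough sketch of that idea (all names invented, SQL Server flavour):
-- The small, well-indexed text dimension
CREATE TABLE dim_text (
text_id INT IDENTITY(1,1) PRIMARY KEY,
text_value VARCHAR(400) NOT NULL UNIQUE
);
-- Resolve the search term to an integer first...
DECLARE @id INT;
SELECT @id = text_id FROM dim_text WHERE text_value = 'what the user typed';
-- ...then search the big table for that integer across the six (now integer) columns
SELECT *
FROM fact_searches
WHERE @id IN (text_id_1, text_id_2, text_id_3, text_id_4, text_id_5, text_id_6);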
I'd also create aggregate tables (reporting tables, if you prefer the term) for your common aggregates. There are quite a few options here that your business setup will determine. For example, if each row is an item on a sales invoice and you need to show sales by date, it may be better to aggregate total sales by invoice and save that to a table; then, when a user wants totals by day, an aggregate is run on the aggregate of the invoices to determine the totals by day (so you've 'partially' aggregated the data in advance).
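A sketch of such a reporting table (invented names):
-- Totals per invoice, refreshed by an ETL step
CREATE TABLE agg_invoice_totals (
invoice_id INT NOT NULL PRIMARY KEY,
invoice_date DATE NOT NULL,
total_amount DECIMAL(18,2) NOT NULL
);
INSERT INTO agg_invoice_totals (invoice_id, invoice_date, total_amount)
SELECT invoice_id, CAST(MIN(created_time) AS DATE), SUM(line_amount)
FROM invoice_lines
GROUP BY invoice_id;
-- Daily totals then come from the much smaller aggregate table
SELECT invoice_date, SUM(total_amount) AS sales_for_day
FROM agg_invoice_totals
GROUP BY invoice_date;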
Hope that makes sense...I'm sure I'll need several edits here for clarity in my answer.

Query performance on record type vs flatten table in BigQuery

I have a table with "orders" and "order lines" that come as JSON, and it is simple to store them as JSON in BigQuery. I can run a process to flatten the file to rows, but it is a burden, and it makes the BigQuery table bigger.
What would be the best-performing structure for BigQuery, assuming I have queries that sum over products and sales in the order lines?
And what is the best practice for the number of "records" (or "order lines") in a record column? Can it contain thousands, or is it meant for just a few? Assume I would query it like in a MongoDB document-based database.
This will help me plan the right architecture.
BigQuery's columnar architecture is designed to handle nested and repeated fields in a highly performant manner, and in general it can return query results as fast as it would if those records were flattened. In fact, in some cases (depending on your data and the types of queries you are running), using already-nested records can actually allow you to avoid subqueries that tack on an extra step.
Short answer: Don't worry about flattening, keep your data in the nested structure, the query performance will generally be the same either way.
However, as to your second question: Your record limit will be determined by how much data you can store in a single record. Currently BigQuery's per row maximum is 100MB. You can have many, many repeated fields in a single record, but they need to fit into this limit.
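As a sketch of what querying the nested structure looks like (assuming an orders table with a repeated RECORD column order_lines containing product and amount fields; the names are made up):
SELECT line.product AS product, SUM(line.amount) AS total_sales
FROM mydataset.orders, UNNEST(order_lines) AS line
GROUP BY product;
The data stays stored in its nested form; UNNEST only flattens it at query time.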

Effect of number of projections on query performance

I am looking to improve the performance of a query which selects several columns from a table. I was wondering if limiting the number of columns would have any effect on the performance of the query.
Reducing the number of columns would, I think, have only very limited effect on the speed of the query but would have a potentially larger effect on the transfer speed of the data. The less data you select, the less data that would need to be transferred over the wire to your application.
I might be misunderstanding the question, but here goes anyway:
The absolute number of columns you select doesn't make a huge difference. However, which columns you select can make a significant difference depending on how the table is indexed.
If you are selecting only columns that are covered by the index, then the DB engine can use just the index for the query without ever fetching table data. If you use even one column that's not covered, though, it has to fetch the entire row (key lookup) and this will degrade performance significantly. Sometimes it will kill performance so much that the DB engine opts to do a full scan instead of even bothering with the index; it depends on the number of rows being selected.
So, if by removing columns you are able to turn this into a covering query, then yes, it can improve performance. Otherwise, probably not. Not noticeably anyway.
Quick example for SQL Server 2005+ - let's say this is your table:
CREATE TABLE MyTable (
ID int IDENTITY NOT NULL PRIMARY KEY CLUSTERED,
Name varchar(50) NOT NULL,
Status tinyint NOT NULL
)
If we create this index:
CREATE INDEX IX_MyTable
ON MyTable (Name)
Then this query will be fast:
SELECT ID
FROM MyTable
WHERE Name = 'Aaron'
But this query will be slow(er):
SELECT ID, Name, Status
FROM MyTable
WHERE Name = 'Aaron'
If we change the index to a covering index, i.e.
CREATE INDEX IX_MyTable
ON MyTable (Name)
INCLUDE (Status)
Then the second query becomes fast again because the DB engine never needs to read the row.
Limiting the number of columns has no measurable effect on the query. Almost universally, an entire row is fetched to cache. The projection happens last in the SQL pipeline.
The projection part of the processing must happen last (after GROUP BY, for instance) because it may involve creating aggregates. Also, many columns may be required for JOIN, WHERE and ORDER BY processing. More columns than are finally returned in the result set. It's hardly worth adding a step to the query plan to do projections to somehow save a little I/O.
Check your query plan documentation. There's no "project" node in the query plan. It's a small part of formulating the result set.
To get away from "whole row fetch", you have to go for a columnar ("Inverted") database.
It can depend on the server you're dealing with (and, in the case of MySQL, the storage engine). Just for example, there's at least one MySQL storage engine that does column-wise storage instead of row-wise storage, and in this case more columns really can take more time.
The other major possibility would be if you had segmented your table so some columns were stored on one server, and other columns on another (aka vertical partitioning). In this case, retrieving more columns might involve retrieving data from different servers, and it's always possible that the load is imbalanced so different servers have different response times. Of course, you usually try to keep the load reasonably balanced so that should be fairly unusual, but it's still possible (especially if, for example, one of the servers handles some other data whose usage might vary independently from the rest).
Yes - if your query can be covered by a non-clustered index it will be faster, since all the data is already in the index and the base table (if you have a heap) or clustered index does not need to be touched by the optimizer.
To demonstrate what tvanfosson has already written - that there is a "transfer" cost - I ran the following two statements against an MSSQL 2000 DB from Query Analyzer.
SELECT datalength(text) FROM syscomments
SELECT text FROM syscomments
Both results returned 947 rows but the first one took 5 ms and the second 973 ms.
Also, because the fields are the same, I would not expect indexing to be a factor here.