I was reading articles about partitioned tables and got confused about whether they are a boon or a bane. I do understand that partitioning is meant for large datasets, but here is my confusion:
Let's assume that there is a table:
Orders(Orderid,Custid,Orderdate,Shipperid)
and it has a huge amount of data, easily enough to justify partitioning. There are SELECT queries filtering on every column of this table, and many queries join it to other tables.
If I partition the table on OrderId, will queries based on other columns become slow?
Will join queries involving columns other than OrderId become slow?
I will appreciate any guidance! Thanks.
Imagine you have two tables with the same schema and the same data. Both are clustered on OrderID. One of these tables is also partitioned by OrderID. Sometimes access is keyed by OrderID and sometimes not.
Lookups for a single OrderID may be faster against the partitioned table if you have sufficient data to force multiple levels in your index BTree. This is because there is one BTree per partition. Lookups for a range of OrderIDs will, in general, be faster because of partition elimination - SQL Server will only access those partitions needed to satisfy the query.
Lookups or scans on other keys will be no different.
Partitioning also allows swap in and swap out of a whole partition which can save hours in a daily load / delete cycle.
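For illustration, here is a minimal T-SQL sketch of that setup. The boundary values, filegroup mapping, and staging table name are made up, not taken from the question.

-- Partition function and scheme: the boundary values below are invented.
CREATE PARTITION FUNCTION pf_Orders (int)
    AS RANGE RIGHT FOR VALUES (1000000, 2000000, 3000000);
CREATE PARTITION SCHEME ps_Orders
    AS PARTITION pf_Orders ALL TO ([PRIMARY]);

-- The clustered primary key is created on the partition scheme, so the table
-- is both clustered and partitioned on OrderId (one BTree per partition).
CREATE TABLE Orders
(
    OrderId   int  NOT NULL,
    CustId    int  NOT NULL,
    OrderDate date NOT NULL,
    ShipperId int  NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId)
) ON ps_Orders (OrderId);

-- A range predicate on OrderId benefits from partition elimination:
-- only the partitions covering this range are touched.
SELECT *
FROM Orders
WHERE OrderId BETWEEN 1500000 AND 1600000;

-- Swapping a whole partition out is a metadata-only operation. OrdersStage is
-- a hypothetical empty table with identical structure on the same filegroup.
ALTER TABLE Orders SWITCH PARTITION 2 TO OrdersStage;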
The purpose of all this is to create a lookup table to avoid a self join down the road, which would involve joins for the same data against much bigger data sets.
In this instance a sales order may have one or both of bill to and ship to customer ID.
The tables here are aggregates of data from 5 different servers, differentiated by the box_id. The customer table is ~1.7M rows, and sales_order is ~55M. The end result is ~52M records and takes on average about 80 minutes to run.
The query:
SELECT DISTINCT sog.box_id ,
sog.sales_order_id ,
cb.cust_id AS bill_to_customer_id ,
cb.customer_name AS bill_to_customer_name ,
cs.cust_id AS ship_to_customer_id ,
cs.customer_name AS ship_to_customer_name
FROM sales_order sog
LEFT JOIN customer cb ON cb.cust_id = sog.bill_to_id AND cb.box_id = sog.box_id
LEFT JOIN customer cs ON cs.cust_id = sog.ship_to_id AND cs.box_id = sog.box_id
The execution plan:
https://www.brentozar.com/pastetheplan/?id=SkjhXspEs
All of this is happening on SQL Server.
I've tried reproducing the bill to and ship to customer sets as CTEs and joining to those, but found no performance benefit.
The only indexes on these tables are the primary keys (which are synthetic IDs). Somewhat curiously the execution plan analyzer is not recommending adding any indexes to either table; it usually wants me to slap indexes on almost everything.
I don't know that there necessarily IS a way to make this run faster, but I am trying to improve my query optimization and have hit the limit of my knowledge. Any insight is much appreciated.
When you run queries like yours -- queries with no WHERE filters -- the DBMS often decides it has to scan entire tables. (In SQL Server execution plans, "clustered index scan" means it is scanning the whole table.) It certainly has to wrangle all the data in the tables. The lookup table you want to create is often called a "materialized view." (An online version of SQL Server has built-in support for materialized views, but other versions still don't.)
Depending on how you will use your data, you may be better off avoiding this materialized lookup table. If all your uses of the proposed lookup table involve filtering down to a small subset of rows with WHERE clauses, an ordinary non-materialized view may be a good choice. When you run queries involving ordinary views, the query planner folds those views into the query, and may recommend helpful indexes.
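For example, a plain view over the same join (a sketch; the view name and the filter values in the sample query are made up):

-- Hypothetical view name; the body is the query from the question.
CREATE VIEW v_sales_order_customers
AS
SELECT DISTINCT sog.box_id ,
       sog.sales_order_id ,
       cb.cust_id AS bill_to_customer_id ,
       cb.customer_name AS bill_to_customer_name ,
       cs.cust_id AS ship_to_customer_id ,
       cs.customer_name AS ship_to_customer_name
FROM sales_order sog
LEFT JOIN customer cb ON cb.cust_id = sog.bill_to_id AND cb.box_id = sog.box_id
LEFT JOIN customer cs ON cs.cust_id = sog.ship_to_id AND cs.box_id = sog.box_id;

-- Callers filter the view; the optimizer folds the view definition into the
-- outer query, so with suitable indexes only the matching rows are touched.
SELECT *
FROM v_sales_order_customers
WHERE box_id = 3
  AND sales_order_id = 123456;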
I am new to SQL and have been trying to optimize the query performance of my microservices against my DB (Oracle SQL).
Based on my research, I learned that you can use indexing and partitioning to improve query performance. I think I understand each concept and how to apply it, but I'm not sure about the difference between the two.
For example, suppose I have a table Orders with 100 million entries.
Columns:
OrderId (PK)
CustomerId (6 digit unique number)
Order (what the order is. Ex: "Burger")
CreatedTime (Timestamp)
In essence, both methods "subdivide" the Orders table so that a DB query won't need to scan through all 100 million entries, right?
Let's say I want to find orders on "2020-01-30". I can create an index on CreatedTime to improve the performance.
But I can also partition the table on CreatedTime to improve the performance (one partition per day).
Is there any difference between the two methods in this case? Is one better than the other?
There are several ways to partition - by range, by list, by hash, and by reference (although that tends to be uncommon).
If you were to partition by a date column, it would usually be by range, so one month/week/day goes into one partition, the next into another, and so on. If you then filter rows where this date equals a value, you can read the single partition that houses this data with a full scan of that partition. This can end up being quite efficient if most of the data in the partition matches your filter: apply the same thinking as for a full table scan in general, except that the data is already filtered down to one partition. If you wanted to look for an hour-long date range and you're partitioning by range with monthly intervals, then you're going to read about 730 times more data than necessary. Local indexes are useful in that scenario.
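For example, a sketch of daily range (interval) partitioning in Oracle. The table and column names are adapted from the question (order_item instead of the reserved word ORDER), and the starting boundary is made up:

-- One partition per day, created automatically as data arrives.
CREATE TABLE orders (
    order_id     NUMBER PRIMARY KEY,
    customer_id  NUMBER(6),
    order_item   VARCHAR2(100),
    created_time DATE
)
PARTITION BY RANGE (created_time)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(
    PARTITION p_start VALUES LESS THAN (DATE '2020-01-01')
);

-- This filter prunes down to the single partition for 2020-01-30,
-- which Oracle can then simply full-scan.
SELECT *
FROM orders
WHERE created_time >= DATE '2020-01-30'
  AND created_time <  DATE '2020-01-31';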
Smaller partitions also help with that mismatch, but you can end up with thousands of partitions. If you have selective queries that cannot determine which partition needs to be read, you may want global indexes. These add a lot of overhead to all partition maintenance operations.
If you index the date column instead, then you can quickly establish the location of the rows in your table that meet your filter. This is easy in an index because it's just a sorted list: you find the first key in the index that matches the filter and read until it no longer matches. You then have to look up those rows using single-block reads. The usual efficiency rules of an index apply: the less data you need to read with your filters, the more useful the index will be.
Usually, queries include more than just a date filter. These additional filters might be more appropriate for your partitioning scheme. You could also just include those columns in your index (remembering the golden rule of indexing: columns you filter with equality should go before columns you filter with a range).
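Continuing the hypothetical schema above, a composite index that follows that rule might look like this (column names are assumptions):

-- customer_id is filtered with equality, created_time with a range,
-- so the equality column goes first in the index.
CREATE INDEX orders_cust_created_idx ON orders (customer_id, created_time);

SELECT *
FROM orders
WHERE customer_id = 123456
  AND created_time >= DATE '2020-01-30'
  AND created_time <  DATE '2020-01-31';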
You can generally get all the performance you need with just indexing. Partitioning really comes into play when you have important queries that need to read huge chunks of data (generally reporting queries) or when you need to do things like purge data older than X months.
For streaming inserts, I want to use a template table (with a user id suffix) which is itself a partitioned table. This way I can make my tables smaller than just using partitioned tables and hence make my queries more cost-effective. Also, my query cost per user stays constant irrespective of the number of users in my system. As per the documentation at https://cloud.google.com/bigquery/streaming-data-into-bigquery:
To create smaller sets of data by date, use time-partitioned tables. To create smaller tables that are not date-based, use template tables and BigQuery creates the tables for you.
It sounds as if it can either be a time-partitioned table OR a template table. Can it not be both? If not, is there another architecture that I should look into?
One more concern regarding my proposed architecture above is the 4,000-partition limit that I saw on https://cloud.google.com/bigquery/docs/partitioned-tables . Does it mean that my partitioned table can't cover more than 4,000 days? Will I have to delete old partitions in this case, or will the last partition keep storing any subsequent streamed data?
You should look into Clustered Tables on partitioned tables.
With that you can have ONE table with all users in it, partitioned by time, and clustered by user_id as you would use in a template table.
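For example, a minimal sketch in BigQuery standard SQL (the dataset, table, and column names are made up):

-- One table, partitioned by day and clustered by user.
CREATE TABLE mydataset.user_events
(
    user_id    STRING,
    event_time TIMESTAMP,
    payload    STRING
)
PARTITION BY DATE(event_time)
CLUSTER BY user_id;

-- Filtering on the partitioning and clustering columns lets BigQuery prune
-- partitions and blocks, so the per-user scan cost stays small.
SELECT *
FROM mydataset.user_events
WHERE event_time >= TIMESTAMP '2020-01-30'
  AND event_time <  TIMESTAMP '2020-01-31'
  AND user_id = 'user_123';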
Introduction to Clustered Tables
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.
Clustered table pricing
When you create and use clustered tables in BigQuery, your charges are based on how much data is stored in the tables and on the queries you run against the data. Clustered tables help you to reduce query costs by pruning data so it is not processed by the query.
I have a table with about 20-25 million records that I have to put into another table based on some condition, and it also has to be sorted. Example:
Create table X AS
select * from Y
where item <> 'ABC'
Order By id;
I know that ORDER BY uses a single reducer to guarantee total order in the output.
I need an optimized way to do the sorting for the above query.
SQL tables represent unordered sets. This is especially true in parallel databases where the data is spread among multiple processors.
That said, Hive does support bucketed tables (CLUSTERED BY, which essentially divides the rows into buckets) and sorting within those buckets (SORTED BY). The documentation is quite specific, though, that this is not supported with CREATE TABLE AS:
CTAS has these restrictions:
The target table cannot be a partitioned table.
You could do what you want by exporting the data and re-importing it.
However, I would suggest that you figure out what you really need without requiring the data to be ordered within the database.
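If you do need the rows physically bucketed and sorted inside the table, one sketch of the two-step alternative to CTAS (the bucket count and column types are assumptions) is:

-- Define the bucketed, sorted target table first, since CTAS cannot create it.
CREATE TABLE x (
    id   BIGINT,
    item STRING
)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 32 BUCKETS;

-- Then load it. (On older Hive versions you may need
-- hive.enforce.bucketing / hive.enforce.sorting set to true.)
INSERT OVERWRITE TABLE x
SELECT id, item
FROM y
WHERE item <> 'ABC';

Note that this gives you order within each bucket, not one total order across the whole table, which ties back to the point above about deciding whether you really need the data ordered inside the database.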
I have multiple tables from which data can be queried using joins.
With regard to database performance:
Should I run multiple selects from multiple tables for the required data?
or
Should I write 1 select that uses a bunch of Joins to select the required data from all the tables at once?
EDIT:
The WHERE clause I will be using for the select contains indexed fields of the tables. It sounds like, because of this, it will be faster to use one SELECT statement with many joins. I will, however, still test the performance difference between the two.
Thanks for all the great answers.
Just write one query with joins. If you are concerned about performance, there are a number of options, including:
Creating indexes that will help the performance of your selects.
Creating a persisted, denormalized form of the data you want so you can query one table. This would most likely be an indexed view or another table (a sketch follows below).
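As a rough illustration of that second option, here is a sketch of an indexed view in T-SQL. The table and column names are made up, and indexed views carry restrictions (SCHEMABINDING, COUNT_BIG(*) with GROUP BY, no outer joins, non-nullable SUM inputs):

-- Hypothetical schema: dbo.orders and dbo.order_details, with a NOT NULL line_total.
CREATE VIEW dbo.v_order_totals
WITH SCHEMABINDING
AS
SELECT o.customer_id,
       COUNT_BIG(*)      AS order_count,
       SUM(d.line_total) AS total_spent
FROM dbo.orders o
JOIN dbo.order_details d ON d.order_id = o.order_id
GROUP BY o.customer_id;
GO

-- Materializes the view; SQL Server keeps it up to date as the base tables change.
CREATE UNIQUE CLUSTERED INDEX ix_v_order_totals
    ON dbo.v_order_totals (customer_id);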
This can be one of those well-gee-it-depends cases, but generally, if you're writing straight SQL, do one query--especially since the joins might limit some of the data you get back.
There is a good chance that if you run multiple point queries, fetching one record from each table by its primary key, the connection round-trip cost of each query will be more costly than the query itself.
It depends on how the tables are joined. If you do a cross-product of all tables, then it would be better to do individual selects. However, if your tables are properly indexed and well thought out, one query with multiple joins will be more efficient.
If you have proper indexes on your tables, you might be better off with the JOINs, but they are often the cause of bottlenecks. Instead of multiple selects, you might also look at ways to de-normalize your data. It is far less "expensive" to update a count or timestamp in multiple tables when a user performs an operation, because it saves you from having to join those tables later.
The best tool I have found for performance tuning queries is EXPLAIN. You put EXPLAIN before the query and it shows how many rows are scanned; the lower the number, the better, because it means your indexes are working. The other thing is, when creating indexes, use compound indexes on multiple fields and order them left to right in the order they appear in the WHERE clause.
For example, you have 10,000 rows in sometable:
SELECT id, name, description, status FROM sometable WHERE name LIKE '%someName%' AND status = 'Active';
You could type EXPLAIN before the query and it might return 10,000 as the number of rows scanned to find matches. You then create a compound index:
ALTER TABLE sometable ADD INDEX idx_st_search (name, status);
You then run EXPLAIN on the query again and it might return 1 as the number of rows scanned, with performance significantly improved.
Depends on your table design.
Most of the time one large query is better, but be sure to:
Use primary keys in the WHERE clause as much as you can for your joins.
Use indexed fields, or create indexes for the fields that are used in WHERE clauses.