SQL aggregation question

I have three tables:
unmatched_purchases table:
unmatched_purchases_id --primary key
purchases_id --foreign key to events table
location_id --which store
purchase_date
item_id --item purchased
purchases table:
purchases_id --primary key
location_id --which store
customer_id
credit_card_transactions:
transaction_id --primary key
trans_timestamp --timestamp of when the transaction occurred
item_id --item purchased
customer_id
location_id
All three tables are very large. The purchases table has 590,130,404 records (yes, half a billion). Unmatched_purchases has 192,827,577 records. Credit_card_transactions has 79,965,740 records.
I need to find out how many purchases in the unmatched_purchases table match up with entries in the credit_card_transactions table. I need to do this for one location at a time (i.e. run the query for location_id = 123, then run it for location_id = 456). "Match up" is defined as:
1) same customer_id
2) same item_id
3) the trans_timestamp is within a certain window of the purchase_date
(e.g. if the purchase_date is Jan 3, 2005 and the trans_timestamp is 11:14 PM Jan 2, 2005, that's close enough)
I need the following aggregated:
1) How many unmatched purchases are there for that location
2) How many of those unmatched purchases could have been matched with credit_card_transactions for a location.
So, what is a query (or queries) to get this information that won't take forever to run?
Note: all three tables are indexed on location_id
EDIT: as it turns out, the credit_card_transactions table has been partitioned on location_id. So that will help speed this up for me. I'm asking our DBA whether the others could be partitioned as well, but the decision is out of my hands.
CLARIFICATION: I will only need to run this on a few of our many locations, not all of them separately. I need to run it on 3 locations. We have 155 location_ids in our system, but some of them are not used in this part of our system.

Try this (I have no idea how fast it will be - that depends on your indices):
Select Count(*) TotalPurchases,
       Sum(Case When c.transaction_id Is Not Null
                Then 1 Else 0 End) MatchablePurchases
From   unmatched_purchases u
Join   purchases p
  On   p.purchases_id = u.purchases_id
Left Join credit_card_transactions c
  On   c.customer_id = p.customer_id
 And   c.item_id = u.item_id
 And   Abs(c.trans_timestamp - u.purchase_date) < #DelayThreshold
Where  u.location_id = #Location

At a minimum, you'll need more indexes. I propose at least the following:
an index on unmatched_purchases.purchases_id, one on purchases.location_id, and
another index on credit_card_transactions (location_id, customer_id, item_id, trans_timestamp).
Without those indexes, there is little hope IMO.
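For illustration, a sketch of those index definitions (the index names are made up; the location_id ones may already exist given the note in the question):
Create Index up_purchases_id_idx
    On unmatched_purchases (purchases_id);
Create Index purchases_location_idx
    On purchases (location_id);
Create Index cct_loc_cust_item_ts_idx
    On credit_card_transactions (location_id, customer_id, item_id, trans_timestamp);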

I suggest you query ALL locations at once. It will cost you 3 full scans (each table once) plus sorting. I bet this will be faster than querying locations one by one.
But if you don't want to guess, you at least need to examine the EXPLAIN PLAN and a 10046 trace of your query...
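Given the clarification that only three locations are needed, a sketch of that all-at-once aggregate, grouped per location (#Loc1..#Loc3 and #DelayThreshold are placeholders, in the same style as the answer above):
Select u.location_id,
       Count(*) TotalPurchases,
       Sum(Case When c.transaction_id Is Not Null
                Then 1 Else 0 End) MatchablePurchases
From   unmatched_purchases u
Join   purchases p
  On   p.purchases_id = u.purchases_id
Left Join credit_card_transactions c
  On   c.customer_id = p.customer_id
 And   c.item_id = u.item_id
 And   Abs(c.trans_timestamp - u.purchase_date) < #DelayThreshold
Where  u.location_id In (#Loc1, #Loc2, #Loc3)
Group By u.location_id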

The query ought to be straightforward, but the tricky part is to get it to perform. I'd question why you need to run it once for each location when it would probably be more efficient to run it for every location in a single query.
The join would be a big challenge, but the aggregation ought to be straightforward. I would guess that your best hope performance-wise for the join would be a hash join on the customer and item columns, with a subsequent filter operation on the date range. You might have to fiddle with putting the customer and item join in an inline view and then try to stop the date predicate from being pushed into the inline view.
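A rough sketch of that shape for the matchable count alone (this assumes Oracle; NO_MERGE is a real hint that keeps the inline view as its own query block, though whether the date filter still gets pushed into it would need confirming in the execution plan; #Location and #DelayThreshold are placeholders from the earlier answer):
Select /*+ NO_MERGE(v) */
       Count(Distinct v.unmatched_purchases_id) MatchablePurchases
From  (Select u.unmatched_purchases_id, u.purchase_date, c.trans_timestamp
       From   unmatched_purchases u
       Join   purchases p
         On   p.purchases_id = u.purchases_id
       Join   credit_card_transactions c
         On   c.customer_id = p.customer_id
        And   c.item_id = u.item_id
       Where  u.location_id = #Location) v
Where Abs(v.trans_timestamp - v.purchase_date) < #DelayThreshold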
The hash join would be much more efficient if the tables being equi-joined both had the same hash partitioning key on all of the join columns, if that can be arranged.
Whether to use the location index or not ...
Whether the index is worth using or not depends on the clustering factor for the location index, which you can read from the user_indexes table. Can you post the clustering factor along with the number of blocks that the table contains? That will give a measure of the way that values for each location are distributed throughout the table. You could also extract the execution plan for a query such as:
select some_other_column
from my_table
where location_id in (value1, value2, value3)
... and see if Oracle thinks the index is useful.
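To pull the clustering factor and table size figures together, something along these lines against the data dictionary would do (the index names are placeholders for your actual location_id indexes):
select i.table_name, i.index_name, i.clustering_factor, t.blocks, t.num_rows
from   user_indexes i
join   user_tables t on t.table_name = i.table_name
where  i.index_name in ('UNMATCHED_PURCHASES_LOC_IDX',
                        'PURCHASES_LOC_IDX',
                        'CREDIT_CARD_TRANS_LOC_IDX');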

Best approach to occurrences of ids in a table and all elements in another table

Well, the query I need is simple, and maybe it has been asked in another question, but there is a performance aspect to what I need, so:
I have a table of users with 10,000 rows; the table contains id, email and more data.
In another table called orders I have way more rows, maybe 150,000.
In these orders I have the id of the user that made the order, and also a status of the order. The status can be a number from 0 to 9 (or null).
My final requirement is to have every user with the id, email, some other columns, and the number of orders with status 3 or 7. It doesn't matter whether it's 3 or 7, I just need the count.
But I need to do this query in a low-impact way (or a performant way).
What is the best approach?
I need to run this in Redash with Postgres 10.
This sounds like a join and group by:
select u.*, count(*)
from users u
join orders o
  on o.user_id = u.user_id
where o.status in (3, 7)
group by u.user_id;
Postgres is usually pretty good about optimizing these queries -- and the above assumes that users(user_id) is the primary key -- so this should work pretty well.
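If users with no status-3/7 orders should still appear with a count of 0 (the question asks for every user), a left-join variant along these lines could be used instead; this is a sketch assuming the same column names as the query above:
select u.user_id, u.email, count(o.user_id) as orders_count
from users u
left join orders o
       on o.user_id = u.user_id
      and o.status in (3, 7)
group by u.user_id, u.email;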

PostgreSQL ON vs WHERE when joining tables?

I have 2 tables, customer and coupons. A customer may or may not have a reward_id assigned to it, so it's a nullable column. A customer can have many coupons and a coupon belongs to a customer.
+-------------+------------+
| coupons | customers |
+-------------+------------+
| id | id |
| customer_id | first_name |
| code | reward_id |
+-------------+------------+
customer_id column is indexed
I would like to make a join between 2 tables.
My attempt is:
select c.*, cust.id as cust_id, cust.first_name as cust_name
from coupons c
join customer cust
on c.customer_id = cust.id and cust.reward_id is not null
However, I think there isn't an index on reward_id, so I should move cust.reward_id is not null into the where clause:
select c.*, cust.id as cust_id, cust.first_name as cust_name
from coupons c
join customer cust
on c.customer_id = cust.id
where cust.reward_id is not null
I wonder if the second attempt would be more efficient than the first attempt.
It would be better to see the execution plan for yourself. Add EXPLAIN ANALYZE before your select statement and execute both versions to see the differences.
Here's how:
EXPLAIN ANALYZE select ...
What does it do? It actually executes the select statement and gives you back the execution plan which was chosen by the query optimizer. Without the ANALYZE keyword it would only estimate the execution plan without actually executing the statement in the background.
The database won't use two indexes at the same time, so having an index on customer(id) will make it unable to use an index on customer(reward_id). That condition will actually be treated as a filter condition, which is correct behaviour.
You could experiment with the performance of a partial index created like this: customer(id) where reward_id is not null. This would decrease the index size, as it would only store the customer ids for which a reward_id is assigned.
I generally like to split the relationship/join logic from the conditions applied, and I put the conditions within the WHERE clause because they are more visible there and easier to read if there are any future changes.
I suggest you see the possible performance gain for yourself, because it depends on how much data there is and on the possibly low cardinality of reward_id. For example, if most rows have this column filled with a value, it wouldn't make much of a difference, as the index sizes (normal vs partial) would be almost the same.
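For reference, a sketch of the partial index suggested above (the index name is made up):
create index customer_id_with_reward_idx
    on customer (id)
 where reward_id is not null;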
In a PostgreSQL inner join, whether a filter condition is placed in the ON clause or the WHERE clause does not affect the query result or performance.
Here is a guide that explores this topic in more detail: https://app.pluralsight.com/guides/using-on-versus-where-clauses-to-combine-and-filter-data-in-postgresql-joins

Performance impact due to Join

I have two tables PRODUCT and ACCOUNT.
PRODUCT table columns
product_id (PK)
subscription_id
ACCOUNT table columns
account_nbr
subscription_id
(account_nbr and subscription_id are primary key columns in account table)
... Other columns
I have to find account_nbr and subscription_id for a product_id.
I get product_id as input. Using it I can get the subscription_id from PRODUCT table and
using subscription_id I can get the account_nbr value from ACCOUNT table.
Instead of getting the info in two queries, can it be done in one query?
Something like the below:
select distinct a.acct_nbr,p.subscription_id
from ACCOUNT a,PRODUCT p
where v.product_id = ' val 1' and
v.subscription_id = p.subscription_id
Will the performance of the above query be low compared to two separate queries?
I have to find account_nbr and subscription_id for a product_id.
So you're correct in your approach; you need to JOIN the two result sets together:
select a.account_nbr, p.subscription_id
from account a
join product p
  on a.subscription_id = p.subscription_id
where p.product_id = :something
Some points:
You have an alias v in your query; I don't know where this came from.
Learn to use the ANSI join syntax; if you make a mistake it's a lot more obvious.
You're selecting a.acct_nbr, which doesn't exist as-is.
There's no need for a DISTINCT. ACCOUNT_NBR and SUBSCRIPTION_ID are the primary key of ACCOUNT.
Will the performance of the above query be low compared to two separate queries?
Probably not. If your query is correct and your tables are correctly indexed, it's highly unlikely that whatever you've coded could beat the SQL engine. The database is designed to select data quickly.
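As a sketch of what "correctly indexed" could mean here (assuming only the stated primary keys exist): product(product_id) is already covered by PRODUCT's primary key, and if ACCOUNT's composite primary key does not lead with subscription_id, a separate index on that column would support the join (the index name is made up):
create index account_subscription_idx on account (subscription_id);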

Select query with join in huge table taking over 7 hours

Our system is facing performance issues selecting rows out of a 38-million-row table.
This table with 38 million rows stores information about clients/suppliers etc. These appear across many other tables, such as Invoices.
The main problem is that our database is far from normalized. The Clients_Suppliers table has a composite key made of 3 columns: Code, a varchar2(16); Category, a char(2); and up_date, a date. Every change in a client's address is stored in that same table with a new date. So we can have records such as this:
code ca up_date
---------------- -- --------
1234567890123456 CL 01/01/09
1234567890123456 CL 01/01/10
1234567890123456 CL 01/01/11
1234567890123456 CL 01/01/12
6543210987654321 SU 01/01/10
6543210987654321 SU 08/03/11
Worse, in every table that uses a client's information, instead of the full composite key, only the code and category are stored. Invoices, for instance, has its own keys, including the emission date. So we can have something like this:
invoice_no serial_no emission code ca
---------- --------- -------- ---------------- --
1234567890 12345 05/02/12 1234567890123456 CL
My specific problem is that I have to generate a list of clients for which invoices were created in a given period. Since I have to get the most recent info for the clients, I have to use max(up_date).
So here's my query (in Oracle):
SELECT
    CL.CODE,
    CL.CATEGORY,
    -- other address fields
FROM
    CLIENTS_SUPPLIERS CL,
    INVOICES I
WHERE
    CL.CODE = I.CODE AND
    CL.CATEGORY = I.CATEGORY AND
    CL.UP_DATE =
        (SELECT
             MAX(CL2.UP_DATE)
         FROM
             CLIENTS_SUPPLIERS CL2
         WHERE
             CL2.CODE = I.CODE AND
             CL2.CATEGORY = I.CATEGORY AND
             CL2.UP_DATE <= I.EMISSION
        ) AND
    I.EMISSION BETWEEN DATE1 AND DATE2
It takes up to seven hours to select 178,000 rows. Invoices has 300,000 rows between DATE1 and DATE2.
It's a (very, very, very) bad design, and I've raised the fact that we should improve it by normalizing the tables. That would involve creating a table for clients with a new int primary key for each pair of code/category, and another one for Addresses (with the client primary key as a foreign key), then using the Addresses' primary key in each table that relates to clients.
But it would mean changing the whole system, so my suggestion has been shunned. I need to find a different way of improving performance (apparently using only SQL).
I've tried indexes, views and temporary tables, but none has made any significant improvement to performance. I'm out of ideas; does anyone have a solution for this?
Thanks in advance!
What does the DBA have to say?
Has he/she tried:
Coalescing the tablespaces
Increasing the parallel query slaves
Moving indexes to a separate tablespace on a separate physical disk
Gathering stats on the relevant tables/indexes
Running an explain plan
Running the query through the index optimiser
I'm not saying the SQL is perfect, but if performance is degrading over time, the DBA really needs to be having a look at it.
SELECT
    CL2.CODE,
    CL2.CATEGORY,
    MAX(CL2.UP_DATE) AS UP_DATE
    -- for each other address field, wrap it in
    -- MAX(field) KEEP (DENSE_RANK LAST ORDER BY CL2.UP_DATE)
FROM
    CLIENTS_SUPPLIERS CL2
    INNER JOIN (
        SELECT DISTINCT
            CL.CODE,
            CL.CATEGORY,
            I.EMISSION
        FROM
            CLIENTS_SUPPLIERS CL
            INNER JOIN INVOICES I
                ON CL.CODE = I.CODE AND CL.CATEGORY = I.CATEGORY
        WHERE
            I.EMISSION BETWEEN DATE1 AND DATE2
    ) CL3
        ON CL2.CODE = CL3.CODE AND CL2.CATEGORY = CL3.CATEGORY
WHERE
    CL2.UP_DATE <= CL3.EMISSION
GROUP BY
    CL2.CODE,
    CL2.CATEGORY
The idea is to separate the process: first we tell Oracle to give us the list of clients for which there are invoices in the period you want, and then we get the last version of them. In your version there's a check against MAX 38,000,000 times, which I really think is what cost most of the time spent in the query.
However, I'm not asking about indexes; I'm assuming they are correctly set up...
Assuming that the number of rows for a (code, ca) pair is smallish, I would try to force an index scan per invoice with an inline view, such as:
SELECT i.invoice_id,
       (SELECT MAX(rowid) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
          FROM clients_suppliers c
         WHERE c.code = i.code
           AND c.category = i.category
           AND c.up_date < i.invoice_date)
  FROM invoices i
 WHERE i.invoice_date BETWEEN :p1 AND :p2
You would then join this query to CLIENTS_SUPPLIERS, hopefully triggering a join via rowid (reading 300k rows by rowid is negligible).
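A sketch of that join back via rowid (the client_rowid and x aliases are made up for illustration):
SELECT DISTINCT cs.code, cs.category /*, other address fields */
  FROM (SELECT (SELECT MAX(rowid) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
                  FROM clients_suppliers c
                 WHERE c.code = i.code
                   AND c.category = i.category
                   AND c.up_date < i.invoice_date) client_rowid
          FROM invoices i
         WHERE i.invoice_date BETWEEN :p1 AND :p2) x
  JOIN clients_suppliers cs
    ON cs.rowid = x.client_rowid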
You could improve the above query by using SQL objects:
CREATE TYPE client_obj AS OBJECT (
    name VARCHAR2(50),
    add1 VARCHAR2(50)
    /*address2, city...*/
);

SELECT i.o.name, i.o.add1 /*...*/
  FROM (SELECT DISTINCT
               (SELECT client_obj(
                           MAX(name) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC),
                           MAX(add1) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
                           /*city...*/
                       )
                  FROM clients_suppliers c
                 WHERE c.code = i.code
                   AND c.category = i.category
                   AND c.up_date < i.invoice_date) o
          FROM invoices i
         WHERE i.invoice_date BETWEEN :p1 AND :p2) i
The correlated subquery may be causing issues, but to me the real problem is what seems to be your main client table: you cannot easily grab the most recent data without doing the max(up_date) mess. It's really a mix of history and current data and, as you describe, poorly designed.
Anyway, it will help you in this and other long-running joins to have a table/view with ONLY the most recent data for a client. So, first build a materialized view for this (untested):
create or replace materialized view recent_clients_view
tablespace my_tablespace
nologging
build deferred
refresh complete on demand
as
select *
from (
    select c.*,
           row_number() over (partition by code, category
                              order by up_date desc, rowid desc) rnum
    from clients_suppliers c
)
where rnum = 1;
Add a unique index on (code, category). The assumption is that this will be refreshed periodically on some off-hours schedule, and that your queries using it will be OK with showing data AS OF the date of the last refresh. In a DW environment or for reporting, this is usually the norm.
The snapshot table for this view should be MUCH smaller than the full clients table with all the history.
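For example, the unique index mentioned above could be created like this (the index name is arbitrary):
create unique index recent_clients_view_uq
    on recent_clients_view (code, category);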
Now you are joining invoices to this smaller view, doing an equijoin on code and category (where emission is between date1 and date2). Something like:
select cv.*
from
recent_clients_view cv,
invoices i
where cv.code = i.code
and cv.category = i.category
and i.emission between :date1 and :date2;
Hope that helps.
You might try rewriting the query to use analytic functions rather than a correlated subquery:
select *
from (SELECT CL.CODE, CL.CATEGORY, CL.UP_DATE, -- other address fields
             max(up_date) over (partition by cl.code, cl.category) as max_up_date
      FROM CLIENTS_SUPPLIERS CL join
           INVOICES I
           on CL.CODE = I.CODE AND
              CL.CATEGORY = I.CATEGORY and
              I.EMISSION BETWEEN DATE1 AND DATE2 and
              up_date <= i.emission
     ) t
where t.up_date = max_up_date
You might want to remove the max_up_date column in the outer select.
As some have noticed, this query is subtly different from the original, because it is taking the max of up_date over all dates. The original query has the condition:
CL2.UP_DATE <= I.EMISSION
However, by transitivity, this means that:
CL2.UP_DATE <= DATE2
So the only difference is when the max of the update date is less than DATE1 in the original query. However, these rows would be filtered out by the comparison to UP_DATE.
Although this query is phrased slightly differently, I think it does the same thing. I must admit to not being 100% positive, since this is a subtle situation on data that I'm not familiar with.

Slow but simple Query, how to make it quicker?

I have a database which is 6 GB in size, with a multitude of tables. However, the smaller queries seem to have the most problems, and I want to know what can be done to optimise them. For example, there are Stock, Items and Orders tables.
The Stock table holds the items in stock; it has around 100,000 records with 25 fields storing ProductCode, Price and other stock-specific data.
The Items table stores the information about the items; there are over 2,000,000 of these, with over 50 fields storing Item Names and other details about the item or product in question.
The Orders table stores the orders of stock items, which is when the order was placed plus the price it sold for, and has around 50,000 records.
Here is a query from this Database:
SELECT Stock.SKU, Items.Name, Stock.ProductCode FROM Stock
INNER JOIN Order ON Order.OrderID = Stock.OrderID
INNER JOIN Items ON Stock.ProductCode = Items.ProductCode
WHERE (Stock.Status = 1 OR Stock.Status = 2) AND Order.Customer = 12345
ORDER BY Order.OrderDate DESC;
Given the information here, what could be done to improve this query? There are others like this; what alternatives are there? The nature of the data and the database cannot be detailed further, however, so if general optimisation tricks and methods are given these will be fine, as will anything which applies generally to databases.
The Database is MS SQL 2000 on Windows Server 2003 with the latest service packs for each.
DB Upgrade / OS Upgrade are not options for now.
Edit
Indices are Stock.SKU, Items.ProductCode and Orders.OrderID on the tables mentioned.
The execution plan shows 13-16 seconds for a query like this, with 75% of the time spent in Stock.
Thanks for all the responses so far - indexing seems to be the problem, and all the different examples given have been helpful (despite a few mistakes with the query). This has helped me a lot: some of these queries have run quicker, and combined with the index suggestions I think I might be on the right path now. Thanks for the quick responses - this has really helped me and made me consider things I did not think or know about before!
Indexes ARE my problem: I added one to the foreign key with Orders (Customer) and this has improved performance by halving the execution time!
Looks like I got tunnel vision and focused on the query - I have been working with DBs for a couple of years now, but this has been very helpful. Thanks also for all the query examples; they use combinations and features I had not considered and may be useful too!
Is your code correct??? I'm sure you're missing something:
INNER JOIN Batch ON Order.OrderID = Orders.OrderID
and you have a stray ) in the code ...
You can always test some variants against the execution plan tool, like:
SELECT
    s.SKU, i.Name, s.ProductCode
FROM
    Stock s, Orders o, Batch b, Items i
WHERE
    s.OrderID = o.OrderID AND
    b.OrderID = o.OrderID AND
    s.ProductCode = i.ProductCode AND
    s.Status IN (1, 2) AND
    o.Customer = 12345
ORDER BY
    o.OrderDate DESC;
Also, you should return just a fraction of the rows, like TOP 10... it will take some milliseconds to choose the TOP 10, but you will save plenty of time when binding the result to your application.
The most important thing (if not already done): define primary keys for the tables and add indexes for the foreign keys and for the columns you are using in the joins.
Did you specify indexes? On
Items.ProductCode
Stock.ProductCode
Orders.OrderID
Orders.Customer
Sometimes, IN could be faster than OR, but this is not as important as having indexes.
See balexandre's answer; your query looks wrong.
Some general pointers
Are all of the fields that you are joining on indexed?
Is the ORDER BY necessary?
What does the execution plan look like?
BTW, you don't seem to be referencing the Order table in the question query example.
A table index will certainly help, as Cătălin Pitiș suggested.
Another trick is to reduce the size of the joined rows by either using sub-selects or, to be more extreme, temp tables. For example, rather than joining on the whole Orders table, join on
(SELECT * FROM Orders WHERE Customer = 12345)
Also, don't join directly on the Stock table; join on
(SELECT * FROM Stock WHERE Status = 1 OR Status = 2)
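Put together, that suggestion would look something like this sketch, using the tables from the question:
SELECT s.SKU, i.Name, s.ProductCode
FROM (SELECT * FROM Stock WHERE Status = 1 OR Status = 2) s
INNER JOIN (SELECT * FROM Orders WHERE Customer = 12345) o
        ON o.OrderID = s.OrderID
INNER JOIN Items i
        ON s.ProductCode = i.ProductCode
ORDER BY o.OrderDate DESC;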
Setting the correct indexes on the tables is usually what makes the biggest difference for performance.
In Management Studio (or Query Analyzer for earlier versions) you can choose to view the execution plan of the query when you run it. In the execution plan you can see what the database is really doing to get the result, and which parts take the most work. There are some things to look for there, like table scans, which are usually the most costly part of a query.
The primary key of a table normally has an index, but you should verify that it's actually so. Then you probably need indexes on the fields that you use to look up records, and fields that you use for sorting.
Once you have added an index, you can rerun the query and see in the execution plan if it's actually using the index. (You may need to wait a while after creating the index for the database to build the index before it can use it.)
Could you give it a go?
SELECT Stock.SKU, Items.Name, Stock.ProductCode FROM Stock
INNER JOIN Order ON Order.OrderID = Stock.OrderID AND (Order.Customer = 12345) AND (Stock.Status = 1 OR Stock.Status = 2)
INNER JOIN Items ON Stock.ProductCode = Items.ProductCode
ORDER BY Order.OrderDate DESC;
Elaborating on what Cătălin Pitiș said already: in your query
SELECT Stock.SKU, Items.Name, Stock.ProductCode
FROM Stock
INNER JOIN Order ON Order.OrderID = Stock.OrderID
INNER JOIN Items ON Stock.ProductCode = Items.ProductCode
WHERE (Stock.Status = 1 OR Stock.Status = 2) AND Order.Customer = 12345
ORDER BY Order.OrderDate DESC;
the criterion Order.Customer = 12345 looks very specific, whereas (Stock.Status = 1 OR Stock.Status = 2) sounds unspecific. If this is correct, an efficient query consists of
1) first finding the orders belonging to a specific customer,
2) then finding the corresponding rows of Stock (with the same OrderID), keeping only those with Status in (1, 2),
3) and finally finding the items with the same ProductCode as the rows of Stock in 2)
For 1) you need an index on Customer for the table Order, for 2) an index on OrderID for the table Stock and for 3) an index on ProductCode for the table Items.
As long as your query does not become much more complicated (like being a subquery in a bigger query, or Stock, Order and Items being only views rather than tables), the query optimizer should be able to find this plan from your query already. Otherwise, you'll have to do what kuoson is suggesting (but the 2nd suggestion does not help if Status in (1, 2) is not very specific and/or Status is not indexed on the Stock table). But also remember that keeping indexes up to date costs performance if you do many inserts/updates on the table.
To shorten the answer I gave 2 hours ago (when my cookies were switched off):
You need three indexes: Customer for table Order, OrderID for Stock and ProductCode for Items.
If you miss any of these, you'll have to wait for a complete table scan on the corresponding table.
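For reference, a sketch of those three indexes in SQL Server syntax (the index names are made up; [Order] is bracketed because ORDER is a reserved word):
CREATE INDEX IX_Order_Customer ON [Order] (Customer);
CREATE INDEX IX_Stock_OrderID ON Stock (OrderID);
CREATE INDEX IX_Items_ProductCode ON Items (ProductCode);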