There's a general type of query I'm trying to perform, and I'm not sure how to express it in words so that I can find a discussion of best practices and examples for executing it.
Here's an example use case.
I have a customers table that has info about customers and an orders table. I want to fetch a subset of records from orders based on customer characteristics, limited by the "earliest" and "latest" dates contained as data in the customers table. It's essential to the solution that I limit my query results to within this date range, which varies by customer.
CUSTOMERS
+------------+------------+----------+---------------------+-------------------+
| CustomerID | Location   | Industry | EarliestActiveOrder | LatestActiveOrder |
+------------+------------+----------+---------------------+-------------------+
| 001        | New York   | Finance  | 2017-11-03          | 2019-07-30        |
| 002        | California | Tech     | 2018-06-18          | 2019-09-22        |
| 003        | New York   | Finance  | 2015-09-30          | 2019-02-26        |
| 004        | California | Finance  | 2019-02-02          | 2019-08-15        |
| 005        | New York   | Finance  | 2017-10-19          | 2018-12-20        |
+------------+------------+----------+---------------------+-------------------+
ORDERS
+---------+------------+------------+---------+
| OrderID | CustomerID | StartDate  | Details |
+---------+------------+------------+---------+
| 5430    | 003        | 2015-06-30 | ...     |
| 5431    | 003        | 2016-03-31 | ...     |
| 5432    | 003        | 2018-09-30 | ...     |
| 5434    | 001        | 2018-11-05 | ...     |
| 5435    | 001        | 2019-10-11 | ...     |
+---------+------------+------------+---------+
A sample use case expressed in words would be: "Give me all Active Orders from Finance customers in New York".
The desired result is to return the full records from the orders table for OrderIDs 5431, 5432, and 5434.
What is a generally good approach for structuring this kind of query, given an orders table with ~10^6 records?
You are looking for a join:
select o.*
from orders o
inner join customers c
    on c.CustomerID = o.CustomerID
    and o.StartDate between c.EarliestActiveOrder and c.LatestActiveOrder
    and c.Industry = 'Finance'
    and c.Location = 'New York'
For performance in this query, consider the following indexes:
orders(CustomerID, StartDate)
customers(CustomerID, Industry, Location, EarliestActiveOrder, LatestActiveOrder)
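If it helps, the corresponding DDL would look something like this (the index names are illustrative):
CREATE INDEX ix_orders_customer_start
    ON orders (CustomerID, StartDate);
CREATE INDEX ix_customers_filter
    ON customers (CustomerID, Industry, Location, EarliestActiveOrder, LatestActiveOrder);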
Assuming that the result set is a small subset of the orders (say, less than 1%; the 1% is just for illustration), I would phrase the query like this:
select o.*
from customers c join
     orders o
     on o.CustomerID = c.CustomerID and
        o.StartDate between c.EarliestActiveOrder and c.LatestActiveOrder
where c.Location = 'New York' and c.Industry = 'Finance';
The indexing strategy is tricky. For smallish result sets, you probably want to restrict the customers first and then find the matching orders. This approach suggests indexes on:
customers(Location, Industry, CustomerID, EarliestActiveOrder, LatestActiveOrder)
orders(CustomerID, StartDate)
If you had other columns for filtering, you would need separate indexes for them. For instance, for industry-only filtering:
customers(Industry, CustomerID, EarliestActiveOrder, LatestActiveOrder)
This can get cumbersome.
If, on the other hand, your result set is likely to be a significant number of orders, then scanning the orders table might be more efficient. You can try to rely on the optimizer. Or just push it in the right direction by phrasing the query as:
select o.*
from orders o
where exists (select 1
              from customers c
              where o.CustomerID = c.CustomerID and
                    o.StartDate between c.EarliestActiveOrder and c.LatestActiveOrder and
                    c.Location = 'New York' and c.Industry = 'Finance'
             );
In this case, you want an index on customers(CustomerID) -- but that is probably already the primary key, so you are fine. This approach has the advantage that you don't need to worry about the exact filtering criteria. The downside is a full table scan on orders (but no additional work for a join, group by, or order by).
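To see which plan the optimizer actually chooses for your data, you can compare the two query shapes directly; for example, in PostgreSQL (the exact EXPLAIN syntax varies by database):
EXPLAIN ANALYZE
SELECT o.*
FROM orders o
WHERE EXISTS (SELECT 1
              FROM customers c
              WHERE o.CustomerID = c.CustomerID
                AND o.StartDate BETWEEN c.EarliestActiveOrder AND c.LatestActiveOrder
                AND c.Location = 'New York'
                AND c.Industry = 'Finance');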
Related
I have three tables: ACCT, PERS, ORG. Each ACCT is owned by either a PERS or ORG. The PERS and ORG tables are very similar and so are all of their child tables, but all PERS and ORG data is separate.
I'm writing a query to get PERS and ORG information for each account in ACCT and I'm curious what the best method of combining the information is. Should I use a series of left joins and NULL functions to fill in the blanks, or should I write the queries separately and use UNION to combine?
I've already written one query for PERS accounts and another for ORG accounts and plan on using UNION. My question pertains more to best practice in the future.
I'm expecting both approaches to give me my desired results, but I want to find the method that is most efficient in both development time and run time.
EDIT: Sample Table Data
ACCT Table:
+---------+---------+--------------+-------------+
| ACCTNBR | ACCTTYP | OWNERPERSNBR | OWNERORGNBR |
+---------+---------+--------------+-------------+
| 555001  | abc     | 3010         |             |
| 555002  | abc     |              | 2255        |
| 555003  | tre     | 5125         |             |
| 555004  | tre     | 4485         |             |
| 555005  | dsa     |              | 6785        |
+---------+---------+--------------+-------------+
PERS Table:
+---------+--------------+---------------+----------+-------+
| PERSNBR | PHONE | STREET | CITY | STATE |
+---------+--------------+---------------+----------+-------+
| 3010    | 555-555-5555 | 1234 Main St  | New York | NY    |
| 5125    | 555-555-5555 | 1234 State St | New York | NY    |
| 4485    | 555-555-5555 | 6542 Vine St  | New York | NY    |
+---------+--------------+---------------+----------+-------+
ORG Table:
+--------+--------------+--------------+----------+-------+
| ORGNBR | PHONE | STREET | CITY | STATE |
+--------+--------------+--------------+----------+-------+
| 2255   | 222-222-2222 | 1000 Main St | New York | NY    |
| 6785   | 333-333-3333 | 400 4th St   | New York | NY    |
+--------+--------------+--------------+----------+-------+
Desired Output:
+---------+---------+--------------+-------------+--------------+---------------+----------+-------+
| ACCTNBR | ACCTTYP | OWNERPERSNBR | OWNERORGNBR | PHONE | STREET | CITY | STATE |
+---------+---------+--------------+-------------+--------------+---------------+----------+-------+
| 555001  | abc     | 3010         |             | 555-555-5555 | 1234 Main St  | New York | NY    |
| 555002  | abc     |              | 2255        | 222-222-2222 | 1000 Main St  | New York | NY    |
| 555003  | tre     | 5125         |             | 555-555-5555 | 1234 State St | New York | NY    |
| 555004  | tre     | 4485         |             | 555-555-5555 | 6542 Vine St  | New York | NY    |
| 555005  | dsa     |              | 6785        | 333-333-3333 | 400 4th St    | New York | NY    |
+---------+---------+--------------+-------------+--------------+---------------+----------+-------+
Query Option 1: Write 2 queries and use UNION to combine them:
select a.acctnbr, a.accttyp, a.ownerpersnbr, a.ownerorgnbr, p.phone, p.street, p.city, p.state
from acct a
inner join pers p on p.persnbr = a.ownerpersnbr
UNION
select a.acctnbr, a.accttyp, a.ownerpersnbr, a.ownerorgnbr, o.phone, o.street, o.city, o.state
from acct a
inner join org o on o.orgnbr = a.ownerorgnbr
Option 2: Use NVL() or COALESCE() to return a single data set:
SELECT a.acctnbr,
a.accttyp,
NVL(a.ownerpersnbr, a.ownerorgnbr) Owner,
NVL(p.phone, o.phone) Phone,
NVL(p.street, o.street) Street,
NVL(p.city, o.city) City,
NVL(p.state, o.state) State
FROM
acct a
LEFT JOIN pers p on p.persnbr = a.ownerpersnbr
LEFT JOIN org o on o.orgnbr = a.ownerorgnbr
There are way more fields in each of the 3 tables as well as many more PERS and ORG tables in my actual query. Is one way better (faster, more efficient) than another?
That depends on what you consider "better".
Assuming that you will always want to pull all rows from the ACCT table, I'd say go for the LEFT OUTER JOIN and no UNION. (If using UNION, prefer the UNION ALL variant.)
EDIT: As you've already shown your queries, mine is no longer required and did not match your structures, so I've removed that part.
Why LEFT JOIN? Because with UNION you'd have to go through ACCT twice, once per "parent" criterion (whether as separate queries or as INNER JOIN criteria), while with a plain LEFT OUTER JOIN you'll probably get just one pass through ACCT. In both cases, rows from the "parent" tables will most probably be accessed via their primary keys.
As you are probably weighing performance when looking for "better": as always, test your queries and look at the execution plans with adequate, fresh database statistics in place, since depending on the data layout (histograms, etc.) the "better" option may turn out to be something completely different.
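For reference, here is Option 1 rewritten with UNION ALL. It is safe here because each account is owned by either a person or an organization, never both, so the two branches cannot produce overlapping rows and the duplicate-elimination step of plain UNION is wasted work:
select a.acctnbr, a.accttyp, a.ownerpersnbr, a.ownerorgnbr, p.phone, p.street, p.city, p.state
from acct a
inner join pers p on p.persnbr = a.ownerpersnbr
UNION ALL
select a.acctnbr, a.accttyp, a.ownerpersnbr, a.ownerorgnbr, o.phone, o.street, o.city, o.state
from acct a
inner join org o on o.orgnbr = a.ownerorgnbr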
I think you misunderstand what a union does versus a join. A union takes the records from multiple tables, generally of similar or identical structure, and combines them into a single result set. It is not meant to combine multiple dissimilar tables.
What I am seeing is that you have two tables, PERS and ORG, with some of the same data in them. In this case I suggest you union those two tables and then join to ACCT to get the sample output.
To get the output as you have shown, you would want to use outer joins so that you don't drop any records without a match. That will give you nulls in some places, but most of the time that is what you want; it is much easier to filter those out later.
Very rough sample code (the two owner tables are stacked behind a shared key before the outer join):
SELECT a.*, b.*
FROM acct AS a
FULL OUTER JOIN (
    SELECT persnbr AS ownernbr, phone, street, city, state FROM pers
    UNION ALL
    SELECT orgnbr, phone, street, city, state FROM org
) AS b
ON COALESCE(a.ownerpersnbr, a.ownerorgnbr) = b.ownernbr
I am racking my brain over joining three tables. I have recreated a simple test case that shows the same problem, so it looks like I am making a fundamental mistake in my join query.
I have three tables:
case:
id (PK) | date_closed
155     | '2018-04-17 10:08'
156     | '2018-03-17 10:08'
pizza   | '2018-02-17 10:08'
registration:
id (FK) | source  | quantity
155     | market  | 300
155     | sawdust | 200
bagged:
id | case_id (FK) | kg_bagged
X  | 155          | 123
Y  | 155          | 90
I want to join these tables to compare the totals per case in the quantity and kg_bagged columns. The case table has a 1:many relationship to each of the other two. Therefore I wrote a join query like this:
SELECT case.id,
date_closed,
SUM(quantity),
SUM(kg_bagged),
SUM(kg_bagged)/SUM(quantity) AS reduction_factor
FROM case
JOIN bagged ON case.id = bagged.case_id
JOIN registration ON case.id = registration.id
Then I would think this would be a correct query, but Postgres tells me I have to add case.id and date_closed to the GROUP BY clause. So I add this:
GROUP BY case.id, date_closed;
This code runs without errors, but it shows 1000 for the quantity of case 155, not the expected 500 (200 + 300). This behaviour only appears when there is more than one record; when joining only one table to the case table it works fine. Can someone see the mistake in the JOIN query?
I also tried using a subquery to join two of the tables first and then joining the third, but it gave me similar results.
When you join two rows from one table against two rows from another table, every combination matches, so you get a multiplied result: in your example, 2 * 2 = 4 rows.
To make this easier to see, in your case when you execute the query
SELECT case.id, date_closed, source, quantity, kg_bagged
FROM case
JOIN registration ON registration.id = case.id
JOIN bagged ON bagged.case_id = case.id
You will get the data like this:
+-----+--------------------+---------+----------+-----------+
| id  | date_closed        | source  | quantity | kg_bagged |
+-----+--------------------+---------+----------+-----------+
| 155 | '2018-04-17 10:08' | market  | 300      | 123       |
| 155 | '2018-04-17 10:08' | sawdust | 200      | 123       |
| 155 | '2018-04-17 10:08' | market  | 300      | 90        |
| 155 | '2018-04-17 10:08' | sawdust | 200      | 90        |
+-----+--------------------+---------+----------+-----------+
In cases like this, in my experience, it works best to compute the sums in subqueries first and then join the results together, such as:
WITH r AS (SELECT id, SUM(quantity) AS quantity FROM registration GROUP BY id),
     b AS (SELECT case_id, SUM(kg_bagged) AS kg_bagged FROM bagged GROUP BY case_id)
SELECT case.id,
       date_closed,
       quantity,
       kg_bagged,
       kg_bagged::numeric / quantity AS reduction_factor  -- cast avoids integer division
FROM case
JOIN b ON case.id = b.case_id
JOIN r ON case.id = r.id
Hopefully, this answer will help you.
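For anyone who wants to verify the fan-out and the fix, here is a self-contained PostgreSQL reproduction (names are illustrative; the case table is renamed to cases because CASE is a reserved word):
CREATE TABLE cases        (id int PRIMARY KEY, date_closed timestamp);
CREATE TABLE registration (id int REFERENCES cases(id), source text, quantity int);
CREATE TABLE bagged       (id text, case_id int REFERENCES cases(id), kg_bagged int);

INSERT INTO cases VALUES (155, '2018-04-17 10:08');
INSERT INTO registration VALUES (155, 'market', 300), (155, 'sawdust', 200);
INSERT INTO bagged VALUES ('X', 155, 123), ('Y', 155, 90);

-- Aggregate each child table first, so the join sees one row per case on each side.
WITH r AS (SELECT id, SUM(quantity) AS quantity FROM registration GROUP BY id),
     b AS (SELECT case_id, SUM(kg_bagged) AS kg_bagged FROM bagged GROUP BY case_id)
SELECT c.id, c.date_closed, r.quantity, b.kg_bagged,
       b.kg_bagged::numeric / r.quantity AS reduction_factor
FROM cases c
JOIN b ON c.id = b.case_id
JOIN r ON c.id = r.id;
-- Returns: 155 | 2018-04-17 10:08:00 | 500 | 213 | 0.426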
The database has thousands of individual items, each with multiple first-sold dates and sales results by week. I need a total sum for each product's first 12 weeks of sales.
Previously, code with a SUM(CASE ...) was used for individual queries where we knew the start date. That is too manual with thousands of products to review, and we are looking for a smart way to speed this up.
Can I build on this so the sum finds the minimum first-shop date and then sums the following 12 weeks of results? If so, how do I structure it, or is there a better way?
Columns in the database I will need to reference, with sample data:
PROD_ID | WEEK_ID | STORE_ID | FIRST_SHOP_DATE | ITM_VALUE
12345543 | 201607 | 10000001 | 201542 | 24,356
12345543 | 201607 | 10000002 | 201544 | 27,356
12345543 | 201608 | 10000001 | 201542 | 24,356
12345543 | 201608 | 10000002 | 201544 | 27,356
32655644 | 201607 | 10000001 | 201412 | 103,245
32655644 | 201607 | 10000002 | 201420 | 123,458
32655644 | 201608 | 10000001 | 201412 | 154,867
32655644 | 201608 | 10000002 | 201420 | 127,865
You can do something like this:
select itemid, sum(sales)
from (select t.*,
             min(shopdate) over (partition by itemid) as first_shopdate
      from t
     ) t
where shopdate < first_shopdate + interval '84' day
group by itemid;
You don't specify the database, so this uses ANSI standard syntax. The date operations (in particular) vary by database.
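For instance, the 84-day cutoff would be written differently per dialect; a few common variants (assuming first_shopdate is an actual date column):
where shopdate < first_shopdate + interval '84' day          -- ANSI / Oracle
where shopdate < first_shopdate + interval '84 days'         -- PostgreSQL
where shopdate < DATEADD(day, 84, first_shopdate)            -- SQL Server
where shopdate < DATE_ADD(first_shopdate, INTERVAL 84 DAY)   -- MySQL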
Hi Kirsty, try it like this:
select a.Item, sum(a.sales) as total
from tableName a
join (select Item, min(FirstSoldDate) as FirstSoldDate
      from tableName
      group by Item) b
  on a.Item = b.Item
where a.FirstSoldDate between b.FirstSoldDate and dateadd(day, 84, b.FirstSoldDate)
group by a.Item
Thanks :)
I have an item table from which I want to get the sum of item quantity.
Query:
Select item_id, Sum(qty) from item_tbl group by item_id
Result:
+----+----------+
| ID | Quantity |
+----+----------+
| 1  | 10       |
| 2  | 20       |
| 3  | 5        |
| 4  | 20       |
+----+----------+
The second table is the invoice table, from which I am getting the item quantity that was sold. I am joining these two tables as:
Query:
Select item_tbl.item_id, Sum(item_tbl.qty) as [item_qty],
       -isnull(Sum(invoice.qty), 0) as [invoice_qty]
from item_tbl
left join invoice on item_tbl.item_id = invoice.item_id
group by item_tbl.item_id
Result:
+----+----------+-------------+
| ID | item_qty | invoice_qty |
+----+----------+-------------+
| 1  | 10       | -5          |
| 2  | 20       | -20         |
| 3  | 10       | -25         | <-- item_qty raised from 5 to 10 ??
| 4  | 20       | -20         |
+----+----------+-------------+
I don't know if I am joining these tables the right way. I want everything from the item table and the matching rows from the invoice table so I can maintain the inventory, which is why I used a left join. Help please.
Modification
When I added group by item_id, qty I got this:
+----+----------+-------------+
| ID | item_qty | invoice_qty |
+----+----------+-------------+
| 1  | 10       | -5          |
| 2  | 20       | -20         |
| 3  | 5        | -5          |
| 3  | 5        | -20         |
| 4  | 20       | -20         |
+----+----------+-------------+
As it's a view, the ID is repeated. What should I do to avoid this?
Clearing things up, my answer from the comments explained:
When using a left join (A left join B), a record is created for every B record that matches an A record, and also for every A record that has no matching B record, with NULL values filling in the fields from B where needed.
I would advise reading up on Using Joins in SQL when approaching such problems.
Below are 2 possible solutions, using different assumptions.
Solution A
Without any assumptions regarding primary key:
We have to sum the item quantity column to determine the total quantity, which means two sums need to be performed; I would advise using a subquery for readability and simplicity.
select item_tbl.item_id, Sum(item_tbl.qty) as [item_qty],
       -isnull(Sum(invoice_grouped.qty), 0) as [invoice_qty]
from item_tbl
left join (select invoice.item_id as item_id, Sum(invoice.qty) as qty
           from invoice
           group by item_id) invoice_grouped
  on (invoice_grouped.item_id = item_tbl.item_id)
group by item_tbl.item_id
Solution B
Assuming item_id is primary key for item_tbl:
Now we know we can rely on the fact that there is only one quantity for each item_id, so we can do without the subquery by selecting any one (max) of the item quantities in the join result, resulting in a quicker execution plan.
select item_tbl.item_id, Max(item_tbl.qty) as [item_qty],
       -isnull(Sum(invoice.qty), 0) as [invoice_qty]
from item_tbl
left join invoice on (invoice.item_id = item_tbl.item_id)
group by item_tbl.item_id
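With either solution, the sample row for item 3 now comes out as expected, since the item quantity is no longer multiplied by the two matching invoice rows:
+----+----------+-------------+
| ID | item_qty | invoice_qty |
+----+----------+-------------+
| 3  | 5        | -25         |
+----+----------+-------------+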
If your database design follows common rules, item_tbl.item_id should be unique.
So just change your query:
Select item_tbl.item_id, item_tbl.qty as [item_qty],
       -isnull(Sum(invoice.qty), 0) as [invoice_qty]
from item_tbl
left join invoice on item_tbl.item_id = invoice.item_id
group by item_tbl.item_id, item_tbl.qty
I have two tables, a master table and a general information table. I need to update my master table from the general table. How can I update the master table when the general info table can have slightly different values for the descriptions?
Master
+------+---------+
| Code | Desc |
+------+---------+
| 156  | Milk    |
| 122  | Eggs    |
| 123  | Diapers |
+------+---------+
Info
+------+----------------+--------+
| Code | Desc           | Price  |
+------+----------------+--------+
| 156  | Milk           | $3.00  |
| 122  | Eggs           | $2.00  |
| 123  | Diapers        | $15.00 |
| 124  | Shopright Cola | $2.00  |
| 124  | SR Cola        | $2.00  |
+------+----------------+--------+
As you can see, item 124 has two descriptions; it does not matter which one is kept.
My attempt returns 124 twice. I understand my code is looking for the unique combination of Code and description in the master, which is why it returns both 124 rows, but I'm unsure how to fix it.
INSERT INTO MASTER
(
SELECT UNIQUE(Code), Desc FROM INFO A
WHERE NOT EXISTS
(SELECT Code FROM MASTER B
WHERE A.Code = B.Code )
);
I have also tried:
INSERT INTO MASTER
(
SELECT UNIQUE(PROC_CDE), Desc FROM FIR_CLAIM_DETAIL A
WHERE Code NOT IN
(SELECT Code FROM FIR_CODE_PROC_CDE_MSTR B
WHERE A.Code = B.Code )
);
UNIQUE filters the duplicated entries in the selected result set across all columns, not just one key column.
When you want to extract the other attributes of a filtered key, you have to instruct the database to group by the unique key first. To choose one of the attributes of a grouped key, you can use an aggregate function such as MAX() or MIN().
INSERT INTO MASTER
(
  SELECT PROC_CDE, MAX(Desc)
  FROM FIR_CLAIM_DETAIL
  WHERE PROC_CDE NOT IN
    (SELECT Code FROM FIR_CODE_PROC_CDE_MSTR)
  GROUP BY PROC_CDE
);
There are also analytic functions that can be used for even more complex requirements.
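For example, here is a sketch of that analytic-function approach, using ROW_NUMBER() to keep one arbitrary description per code (column names follow the sample tables; in a real schema a column named Desc would need quoting, since DESC is a reserved word):
INSERT INTO MASTER
(
  SELECT Code, Desc
  FROM (SELECT Code, Desc,
               ROW_NUMBER() OVER (PARTITION BY Code ORDER BY Desc) AS rn
        FROM INFO A
        WHERE NOT EXISTS
          (SELECT 1 FROM MASTER B WHERE A.Code = B.Code)
       ) t
  WHERE rn = 1
);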