Most efficient way to get records from a table for which a record exists in another table for each month - sql

I have two tables as below:
User: User_ID, User_name and some other columns (has approx 1000 rows)
Fee: Created_By_User_ID, Created_Date and many other columns (has 17 million records)
Fee table does not have any index (and I can't create one).
I need a list of users for each month of a year (say 2016) who have created at least one fee record.
I do have a working query below, but it is taking a long time to execute. Can someone help me with a better query? Maybe using an EXISTS clause (I tried one, but it still takes time as it scans the Fee table):
SELECT MONTH(f.Created_Date), f.Created_By_User_ID
FROM Fees f
JOIN [User] u ON f.Created_By_User_ID= u.User_ID
WHERE f.Created_Date BETWEEN '2016-01-01' AND '2016-12-31'

Without an index, any approach will require at least one full scan of the Fee table. If you use the join directly, as in your original query, the Fee table may be scanned repeatedly, with many passes going through redundant rows while the join occurs. The same happens when you use an inner query as Mansoor suggested.
One optimization is to reduce the number of rows the join has to process.
Assuming that the User table contains only one record per user and the Fee table has multiple records per person, we can first find the distinct months in which each user created a fee record by using a CTE.
We can then join on top of this CTE. This reduces the work done by the join and should give a somewhat better execution time over a large data set.
Try this:
WITH CTE_UserMonthwiseFeeRecords AS
(
    SELECT DISTINCT Created_By_User_ID, MONTH(Created_Date) AS FeeMonth
    FROM Fee
    WHERE Created_Date BETWEEN '2016-01-01' AND '2016-12-31'
)
SELECT User_name, FeeMonth
FROM CTE_UserMonthwiseFeeRecords f
INNER JOIN [User] u ON f.Created_By_User_ID = u.User_ID
Also, you have not said whether you actually need the user names. If only the ID is required to find the distinct users creating fee records per month, you can just use the query from inside the CTE and skip the JOIN entirely:
SELECT DISTINCT Created_By_User_ID, MONTH(Created_Date) AS FeeMonth
FROM Fee
WHERE Created_Date BETWEEN '2016-01-01' AND '2016-12-31'

Try the below query:
SELECT MONTH(f.Created_Date), f.Created_By_User_ID
FROM Fees f
WHERE EXISTS (SELECT 1 FROM [User] u WHERE f.Created_By_User_ID = u.User_ID
      AND DATEDIFF(DAY, f.Created_Date, '2016-01-01') <= 0
      AND DATEDIFF(DAY, f.Created_Date, '2016-12-31') >= 0)

You may try this approach to reduce the query run time. However, it duplicates the huge data set into a separate table (Temp_Fees), and every DML operation performed on Fees/User requires a truncate and fresh reload of Temp_Fees.
SELECT * INTO Temp_Fees FROM (SELECT MONTH(f.Created_Date) AS Created_MONTH, f.Created_By_User_ID
FROM Fees f
WHERE f.Created_Date BETWEEN '2016-01-01' AND '2016-12-31') AS src
SELECT f.Created_MONTH, f.Created_By_User_ID
FROM Temp_Fees f
JOIN [User] u ON f.Created_By_User_ID= u.User_ID

Related

PostgreSQL - Optimize subquery by referencing outer query

I have two tables: users and orders. Orders is a massive table (>100k entries) and users is relatively small (around 400 entries).
I want to find the number of orders per user. The column linking both tables is the email column.
I can achieve this with the following query:
SELECT sub_1.num, u.id FROM users AS u,
(SELECT cust_email AS email, COUNT(purchaseid) AS num
FROM orders AS o
WHERE o.status = 'COMPLETED'
GROUP BY cust_email) sub_1
WHERE u.email = sub_1.email
ORDER BY createdate DESC NULLS LAST
However, as mentioned previously, the order table is very large, so I would ideally want to add another condition to the WHERE clause in the Subquery to only retrieve those emails that exist in the User table.
I can simply add the user table to the subquery like this:
SELECT sub_1.num, u.id FROM users AS u,
(SELECT cust_email AS email, COUNT(purchaseid) AS num
FROM orders AS o, users AS u
WHERE o.status = 'COMPLETED'
and o.cust_email = u.email
GROUP BY cust_email) sub_1
WHERE u.email = sub_1.email
ORDER BY createdate DESC NULLS LAST
This does speed up the query, but sometimes the outer query is much more complex than just selecting all entries from the user table, so this solution does not always work. The goal would be to somehow link the outer and the inner query. I've thought of using joins but cannot figure out how to get it to work.
I noticed that the first query seems to perform faster than I expected, so perhaps PostgreSQL is already smart enough to connect the outer and inner tables. However, I was hoping that someone could shed some light on how this works and what the best way to perform these types of subqueries is.
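One mechanism PostgreSQL (9.3+) offers for exactly this "reference the outer row from the subquery" problem is a LATERAL join. A sketch only, assuming the users/orders columns named in the question:

```sql
-- LATERAL lets the subquery reference u.email from the outer query,
-- so the count is computed only for emails present in users.
SELECT u.id, sub_1.num
FROM users AS u
CROSS JOIN LATERAL (
    SELECT COUNT(o.purchaseid) AS num
    FROM orders AS o
    WHERE o.status = 'COMPLETED'
      AND o.cust_email = u.email
) AS sub_1
ORDER BY u.createdate DESC NULLS LAST;
```

Note that this form also returns users with zero completed orders (num = 0); add a `WHERE sub_1.num > 0` if you want the original inner-join behavior.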

SQL Query Creates Duplicated Results

My task is to produce a report that shows the on time delivery of products to consumers. In essence I have achieved this. However, as you will see only some of the data is accurate.
Here is our test case: we have a sales order number '12312.' This sales order has had 5 partial shipments made (200 pieces each). The result is shown below from our DUE_DTS table.
Due Dates table data
The following code gives me the information I need (excluding due date information) to show the packing details of the 5 shipments:
DECLARE @t AS TABLE (
CUSTNAME char(35),
SONO char(10),
INVDATE date,
PACKLISTNO char(10),
PART_NO char(25),
SOBALANCE numeric(9,2)
)
INSERT INTO @t
SELECT DISTINCT c.CUSTNAME, s.SONO, p.INVDATE, p.PACKLISTNO, i.PART_NO, q.SOBALANCE
FROM [manex].[dbo].[SODETAIL]
INNER JOIN [manex].[dbo].[SOMAIN] s ON s.SONO = SODETAIL.SONO
INNER JOIN [manex].[dbo].[CUSTOMER] c ON c.CUSTNO = s.CUSTNO
INNER JOIN [manex].[dbo].[INVENTOR] i ON i.UNIQ_KEY = SODETAIL.UNIQ_KEY
INNER JOIN [manex].[dbo].[DUE_DTS] d ON d.SONO = s.SONO
INNER JOIN [manex].[dbo].[PLMAIN] p ON p.SONO = s.SONO
INNER JOIN [manex].[dbo].[PLDETAIL] q ON q.PACKLISTNO = p.PACKLISTNO
WHERE s.SONO LIKE '%12312'
SELECT * FROM @t
Here is a screenshot of the results from running this query:
Query Result
Now it should be time to join my due dates table (adding the appropriate column(s) to my table definition and select statement) and make DATEDIFF comparisons to determine whether shipments were on time or late. However, once I reference the due dates table, each of the 5 shipments is compared to all 5 dates in the due dates table, resulting in 25 rows. The only linking column DUE_DTS has is the SONO column. I've tried using DISTINCT and variations of the GROUP BY clause without success.
I've put enough together myself to figure that joining the DUE_DTS table on SONO must be causing this, as there are 5 instances of that value in the table (making it not unique), and a join should be based on a unique column. Is there a workaround for something like this?
You will need to use additional fields to join the records and reduce the results. You may need to link SONO to SODETAIL to DUE_DTS because the dates are tied to the items, not to the SONO.
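A sketch of what that might look like. The item-level join column and the DUE_DATE column here are purely hypothetical; check which columns your DUE_DTS table actually shares with SODETAIL in the real manex schema:

```sql
-- Hypothetical: assumes DUE_DTS also carries an item-level key
-- (here UNIQ_KEY) plus a DUE_DATE column; the real schema may differ.
SELECT DISTINCT s.SONO, p.PACKLISTNO, d.DUE_DATE
FROM [manex].[dbo].[SODETAIL] sd
INNER JOIN [manex].[dbo].[SOMAIN] s ON s.SONO = sd.SONO
INNER JOIN [manex].[dbo].[PLMAIN] p ON p.SONO = s.SONO
INNER JOIN [manex].[dbo].[DUE_DTS] d
        ON d.SONO = sd.SONO
       AND d.UNIQ_KEY = sd.UNIQ_KEY -- item-level link avoids the 5x5 fan-out
WHERE s.SONO LIKE '%12312'
```

With only SONO in the join condition, each shipment row matches all 5 due-date rows (5 x 5 = 25); the second condition pairs each shipment with its own due date.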

Perform SQL query and then join

Let's say I have two tables:
ticket with columns [id,date, userid] userid is a foreign key that references user.id
user with columns [id,name]
Owing to really large tables I would like to first filter the tickets table by date
SELECT id FROM ticket WHERE date >= 'some date'
then I would like to do a left join with the user table. Is there a way to do it? I tried the following but it doesn't work:
select ticket.id, user.name from ticket where ticket.date >= '2015-05-18' left join user on ticket.userid=user.id;
Apologies if it's a stupid question. I have searched on Google, but most answers involve subqueries after the join instead of what I want, which is to perform the query first and then do the join on the items returned.
To make things a little more clear, the problem I am facing is that I have large tables and the join takes time. I am joining 3 tables and the query takes almost 3 seconds. What's the best way to reduce the time? Instead of joining and then applying the where clause, I figured I should first select a small subset and then join.
Simply put everything in the right order:
select - from - where - group by - having - order by
select ticket.id, user.name
from ticket left join user on ticket.user_id=user.id
where ticket.date >= '2015-05-18'
Or put it in a Derived Table:
select ticket.id, user.name
from
(
select * from ticket
where ticket.date >= '2015-05-18'
) as ticket
left join user on ticket.user_id=user.id

Select query with join in huge table taking over 7 hours

Our system is facing performance issues selecting rows out of a 38 million rows table.
This table with 38 million rows stores information from clients/suppliers etc. These appear across many other tables, such as Invoices.
The main problem is that our database is far from normalized. The Clients_Suppliers table has a composite key made of 3 columns, the Code - varchar2(16), Category - char(2) and the last one is up_date, a date. Every change in one client's address is stored in that same table with a new date. So we can have records such as this:
code ca up_date
---------------- -- --------
1234567890123456 CL 01/01/09
1234567890123456 CL 01/01/10
1234567890123456 CL 01/01/11
1234567890123456 CL 01/01/12
6543210987654321 SU 01/01/10
6543210987654321 SU 08/03/11
Worst, in every table that uses a client's information, instead of the full composite key, only the code and category is stored. Invoices, for instance, has its own keys, including the emission date. So we can have something like this:
invoice_no serial_no emission code ca
---------- --------- -------- ---------------- --
1234567890 12345 05/02/12 1234567890123456 CL
My specific problem is that I have to generate a list of clients for which invoices were created in a given period. Since I have to get the most recent info for the clients, I have to use max(up_date).
So here's my query (in Oracle):
SELECT
CL.CODE,
CL.CATEGORY,
-- other address fields
FROM
CLIENTS_SUPPLIERS CL,
INVOICES I
WHERE
CL.CODE = I.CODE AND
CL.CATEGORY = I.CATEGORY AND
CL.UP_DATE =
(SELECT
MAX(CL2.UP_DATE)
FROM
CLIENTS_SUPPLIERS CL2
WHERE
CL2.CODE = I.CODE AND
CL2.CATEGORY = I.CATEGORY AND
CL2.UP_DATE <= I.EMISSION
) AND
I.EMISSION BETWEEN DATE1 AND DATE2
It takes up to seven hours to select 178,000 rows. Invoices has 300,000 rows between DATE1 and DATE2.
It's a (very, very, very) bad design, and I've raised the fact that we should improve it by normalizing the tables. That would involve creating a table for clients with a new int primary key for each pair of code/category, and another one for Addresses (with the client primary key as a foreign key), then using the Addresses' primary key in each table that relates to clients.
But it would mean changing the whole system, so my suggestion has been shunned. I need to find a different way of improving performance (apparently using only SQL).
I've tried indexes, views, temporary tables but none have had any significant improvement on performance. I'm out of ideas, does anyone have a solution for this?
Thanks in advance!
What does the DBA have to say?
Has he/she tried:
Coalescing the tablespaces
Increasing the parallel query slaves
Moving indexes to a separate tablespace on a separate physical disk
Gathering stats on the relevant tables/indexes
Running an explain plan
Running the query through the index optimiser
I'm not saying the SQL is perfect, but if performance is degrading over time, the DBA really needs to have a look at it.
SELECT
CL2.CODE,
CL2.CATEGORY,
... other fields
FROM
CLIENTS_SUPPLIERS CL2 INNER JOIN (
SELECT DISTINCT
CL.CODE,
CL.CATEGORY,
I.EMISSION
FROM
CLIENTS_SUPPLIERS CL INNER JOIN INVOICES I ON CL.CODE = I.CODE AND CL.CATEGORY = I.CATEGORY
WHERE
I.EMISSION BETWEEN DATE1 AND DATE2) CL3 ON CL2.CODE = CL3.CODE AND CL2.CATEGORY = CL3.CATEGORY
WHERE
CL2.UP_DATE <= CL3.EMISSION
GROUP BY
CL2.CODE,
CL2.CATEGORY
HAVING
CL2.UP_DATE = MAX(CL2.UP_DATE)
The idea is to separate the process: first we tell Oracle to give us the list of clients for which there are invoices in the period you want, and then we get the last version of each. In your version there's a check against MAX 38,000,000 times, which I really think is what cost most of the time spent in the query.
However, I'm not asking about indexes, assuming they are correctly set up...
Assuming that the number of rows for a (code,ca) is smallish, I would try to force an index scan per invoice with an inline view, such as:
SELECT invoice_id,
(SELECT MAX(rowid) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
FROM clients_suppliers c
WHERE c.code = i.code
AND c.category = i.category
AND c.up_date < i.invoice_date)
FROM invoices i
WHERE i.invoice_date BETWEEN :p1 AND :p2
You would then join this query to CLIENTS_SUPPLIERS hopefully triggering a join via rowid (300k rowid read is negligible).
You could improve the above query by using SQL objects:
CREATE TYPE client_obj AS OBJECT (
name VARCHAR2(50),
add1 VARCHAR2(50)
/*address2, city...*/
);
SELECT i.o.name, i.o.add1 /*...*/
FROM (SELECT DISTINCT
(SELECT client_obj(
max(name) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC),
max(add1) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
/*city...*/
)
FROM clients_suppliers c
WHERE c.code = i.code
AND c.category = i.category
AND c.up_date < i.invoice_date) o
FROM invoices i
WHERE i.invoice_date BETWEEN :p1 AND :p2) i
The correlated subquery may be causing issues, but to me the real problem is your main client table: you cannot easily grab the most recent data without doing the max(up_date) mess. It's really a mix of history and current data and, as you describe, poorly designed.
Anyway, it will help you, in this and other long-running joins, to have a table/view with ONLY the most recent data for each client. So, first build a materialized view for this (untested):
create or replace materialized view recent_clients_view
tablespace my_tablespace
nologging
build deferred
refresh complete on demand
as
select * from
(
select c.*, row_number() over (partition by code, category order by up_date desc, rowid desc) rnum
from clients c
)
where rnum = 1;
Add unique index on code,category. The assumption is that this will be refreshed periodically on some off hours schedule, and that your queries using this will be ok with showing data AS OF the date of the last refresh. In a DW env or for reporting, this is usually the norm.
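A minimal sketch of that index (Oracle syntax, assuming the view name from the snippet above; the index name is arbitrary):

```sql
-- Enforces one current row per (code, category) in the snapshot
CREATE UNIQUE INDEX recent_clients_ux
    ON recent_clients_view (code, category);
```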
The snapshot table for this view should be MUCH smaller than the full clients table with all the history.
Now you are joining Invoices to this smaller view, with an equijoin on code, category (where emission is between date1 and date2). Something like:
select cv.*
from
recent_clients_view cv,
invoices i
where cv.code = i.code
and cv.category = i.category
and i.emission between :date1 and :date2;
Hope that helps.
You might try rewriting the query to use analytic functions rather than a correlated subquery:
select *
from (SELECT CL.CODE, CL.CATEGORY, CL.UP_DATE, -- other address fields
             max(up_date) over (partition by cl.code, cl.category) as max_up_date
      FROM CLIENTS_SUPPLIERS CL join
           INVOICES I
           on CL.CODE = I.CODE AND
              CL.CATEGORY = I.CATEGORY and
              I.EMISSION BETWEEN DATE1 AND DATE2 and
              up_date <= i.emission
     ) t
where t.up_date = max_up_date
You might want to remove the max_up_date column in the outside select.
As some have noticed, this query is subtly different from the original, because it is taking the max of up_date over all dates. The original query has the condition:
CL2.UP_DATE <= I.EMISSION
However, by transitivity, this means that:
CL2.UP_DATE <= DATE2
So the only difference is when the max of the update date is less than DATE1 in the original query. However, these rows would be filtered out by the comparison to UP_DATE.
Although this query is phrased slightly differently, I think it does the same thing. I must admit to not being 100% positive, since this is a subtle situation on data that I'm not familiar with.

Uses of unequal joins

Of all the thousands of queries I've written, I can probably count on one hand the number of times I've used a non-equijoin. e.g.:
SELECT * FROM tbl1 INNER JOIN tbl2 ON tbl1.date > tbl2.date
And most of those instances were probably better solved using another method. Are there any good/clever real-world uses for non-equijoins that you've come across?
Bitmasks come to mind. In one of my jobs, we had permissions for a particular user or group on an "object" (usually corresponding to a form or class in the code) stored in the database. Rather than including a row or column for each particular permission (read, write, read others, write others, etc.), we would typically assign a bit value to each one. From there, we could then join using bitwise operators to get objects with a particular permission.
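A sketch with a hypothetical schema (table and column names invented for illustration): each permission is a power of two, and the grants row stores the bitwise OR of everything the user holds on that object. `&` is the bitwise AND operator in SQL Server and MySQL; Oracle spells it `BITAND()`.

```sql
-- Hypothetical: 1 = read, 2 = write, 4 = read others, ...
SELECT o.object_name, g.user_id
FROM objects o
JOIN grants g
  ON g.object_id = o.object_id
 AND (g.permission_mask & 2) = 2  -- only rows where the "write" bit is set
```

The join condition is a non-equijoin in spirit: rows match when the masked bit test passes, not when two columns are equal.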
How about for checking for overlaps?
select ...
from employee_assignments ea1
, employee_assignments ea2
where ea1.emp_id = ea2.emp_id
and ea1.end_date >= ea2.start_date
and ea1.start_date <= ea2.end_date
Whole-day intervals in date_time fields:
date_time_field >= begin_date and date_time_field < end_date_plus_1
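For instance, to select every row stamped anywhere on a given day (a sketch using a hypothetical events table and column):

```sql
-- Half-open range: no time arithmetic, and an index on event_ts
-- can still be used, unlike wrapping the column in a function.
SELECT *
FROM events
WHERE event_ts >= DATE '2016-03-15'
  AND event_ts <  DATE '2016-03-16';
```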
Just found another interesting use of an unequal join on the MCTS 70-433 (SQL Server 2008 Database Development) Training Kit book. Verbatim below.
By combining derived tables with unequal joins, you can calculate a variety of cumulative aggregates. The following query returns a running aggregate of orders for each salesperson (my note - with reference to the ubiquitous AdventureWorks sample db):
select
SH3.SalesPersonID,
SH3.OrderDate,
SH3.DailyTotal,
SUM(SH4.DailyTotal) RunningTotal
from
(select SH1.SalesPersonID, SH1.OrderDate, SUM(SH1.TotalDue) DailyTotal
from Sales.SalesOrderHeader SH1
where SH1.SalesPersonID IS NOT NULL
group by SH1.SalesPersonID, SH1.OrderDate) SH3
join
(select SH1.SalesPersonID, SH1.OrderDate, SUM(SH1.TotalDue) DailyTotal
from Sales.SalesOrderHeader SH1
where SH1.SalesPersonID IS NOT NULL
group by SH1.SalesPersonID, SH1.OrderDate) SH4
on SH3.SalesPersonID = SH4.SalesPersonID AND SH3.OrderDate >= SH4.OrderDate
group by SH3.SalesPersonID, SH3.OrderDate, SH3.DailyTotal
order by SH3.SalesPersonID, SH3.OrderDate
The derived tables are used to combine all orders for salespeople who have more than one order on a single day. The join on SalesPersonID ensures that you are accumulating rows for only a single salesperson. The unequal join allows the aggregate to consider only the rows for a salesperson where the order date is earlier than the order date currently being considered within the result set.
In this particular example, the unequal join is creating a "sliding window" kind of sum on the daily total column in SH4.
Duplicates:
SELECT
*
FROM
table a, (
SELECT
id,
min(rowid) AS min_rowid
FROM
table
GROUP BY
id
) b
WHERE
a.id = b.id
and a.rowid > b.min_rowid;
If you wanted to get all of the products to offer to a customer and don't want to offer them products that they already have:
SELECT
C.customer_id,
P.product_id
FROM
Customers C
INNER JOIN Products P ON
P.product_id NOT IN
(
SELECT
O.product_id
FROM
Orders O
WHERE
O.customer_id = C.customer_id
)
Most often though, when I use a non-equijoin it's because I'm doing some kind of manual fix to data. For example, the business tells me that a person in a user table should be given all access roles that they don't already have, etc.
If you want to do a dirty join of two not really related tables, you can join with a <>.
For example, you could have a Product table and a Customer table. Hypothetically, if you want to show a list of every product with every customer, you could do something like this:
SELECT *
FROM Product p
JOIN Customer c on p.SKU <> c.SSN
It can be useful. Be careful, though, because it can create ginormous result sets.