SQL Join Challenge

Ok, so I've been stuck on this for 2 days! I've solved it from a semantic point of view, but the query can take up to 10 minutes to execute. My database of choice for this is SQLite (for reasons I won't elaborate here), but I have tried running the same thing on SQL Server 2012 and it didn't make much of a difference in performance.
So, the problem is that I have 2 tables
prices (product_id INT, for_date DATE, value INT)
events (starts_on DATE, ends_on DATE NULLABLE)
I have approximately 500K rows in the prices table and around 100 rows in the events table.
Now I need to write a query to do the following.
Pseudo code is:
For each event:
IF the event has an ends_on value, THEN fetch all product_id(s) that have a for_date matching ends_on; for products that DO NOT MATCH, fetch the last for_date which is less than ends_on but greater than starts_on for that event.
ELSE IF the ends_on date of the event is NULL, THEN fetch all product_id(s) that have a for_date matching starts_on; for products that DO NOT MATCH, fetch the last for_date which is less than starts_on.
The query I have written in SQL Server 2012 is
SELECT
    sp.for_date, sp.value
FROM
    prices sp
INNER JOIN
    events ev ON ((ev.ends_on IS NOT NULL
                   AND sp.for_date = (SELECT for_date
                                      FROM prices
                                      WHERE for_date <= ev.ends_on
                                        AND for_date > ev.starts_on
                                      ORDER BY for_date DESC
                                      OFFSET 0 ROWS
                                      FETCH NEXT 1 ROWS ONLY))
                  OR
                  (ev.ends_on IS NULL
                   AND sp.for_date = (SELECT for_date
                                      FROM prices
                                      WHERE for_date <= ev.starts_on
                                        AND for_date > DATEADD(day, -14, ev.starts_on)
                                      ORDER BY for_date DESC
                                      OFFSET 0 ROWS
                                      FETCH NEXT 1 ROWS ONLY)));
Btw, I have also tried creating temp tables with partial data and doing the same operation on them; it just gets stuck.
The strange thing is that if I run the two "OR" conditions separately, the response time is perfect!
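For illustration, here is roughly what running the two branches separately and gluing them back together with UNION ALL looks like (a sketch, not my exact query):
-- Each SELECT is one of the OR branches; SQL Server can plan them independently.
SELECT sp.for_date, sp.value
FROM prices sp
INNER JOIN events ev
    ON ev.ends_on IS NOT NULL
   AND sp.for_date = (SELECT TOP 1 for_date
                      FROM prices
                      WHERE for_date <= ev.ends_on
                        AND for_date > ev.starts_on
                      ORDER BY for_date DESC)
UNION ALL
SELECT sp.for_date, sp.value
FROM prices sp
INNER JOIN events ev
    ON ev.ends_on IS NULL
   AND sp.for_date = (SELECT TOP 1 for_date
                      FROM prices
                      WHERE for_date <= ev.starts_on
                        AND for_date > DATEADD(day, -14, ev.starts_on)
                      ORDER BY for_date DESC);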
Update
Sample Dataset and Expected Result
Price Entries
Product ID, ForDt, Value
1, 25-01-2010, 123
1, 26-01-2010, 112
1, 29-01-2010, 334
1, 02-02-2010, 512
1, 03-02-2010, 765
1, 04-02-2010, 632
1, 05-02-2010, 311
1, 06-02-2010, 555
2, 03-02-2010, 854
2, 04-02-2010, 625
2, 05-02-2010, 919
3, 20-01-2010, 777
3, 06-02-2010, 877
3, 10-03-2010, 444
3, 11-03-2010, 888
Event Entries (to make it more understandable, I'm also adding an event ID)
Event ID, StartsOn, EndsOn
22, 27-01-2010, NULL
33, 02-02-2010, 06-02-2010
44, 01-03-2010, 13-03-2010
Expected Result Set
Event ID, Product ID, ForDt, Value
22, 1, 26-01-2010, 112
33, 1, 06-02-2010, 311
44, 1, 06-02-2010, 311
33, 2, 05-02-2010, 919
44, 2, 05-02-2010, 919
22, 3, 20-01-2010, 777
33, 3, 06-02-2010, 877
44, 3, 11-03-2010, 888

Okay, now that you have shown the expected results to be a list of events and associated products, the question makes sense. Your query, selecting only dates and values, didn't.
You are looking for the best product price record per event. This would be easily done with analytic functions, but SQLite did not support them before version 3.25. So we must write a more complicated query.
Let's look at events with ends_on null first. Here is how to find the best product prices (i.e. the last at or before starts_on):
select e.event_id, p.product_id, max(for_date) as best_for_date
from events e
join prices p on p.for_date <= e.starts_on
where e.ends_on is null
group by e.event_id, p.product_id;
We extend this query to also find the best product prices for events with an ends_on, and then access the prices table again so we get the full records with the values:
select ep.event_id, p.product_id, p.for_date, p.value
from
(
select e.event_id, p.product_id, max(for_date) as best_for_date
from events e
join prices p on (e.ends_on is null and p.for_date <= e.starts_on)
or (e.ends_on is not null and p.for_date between e.starts_on and e.ends_on)
group by e.event_id, p.product_id
) ep
join prices p on p.product_id = ep.product_id and p.for_date = ep.best_for_date;
(By the way: You are describing a very special case here. The databases I have seen so far would treat an ends_on null as unlimited or "still active". Thus the price to retrieve for such an event would not be the last before starts_on, but the most current one at or after starts_on.)
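Update: SQLite 3.25 and newer do support window functions, so if you can upgrade, the whole lookup collapses into one query. A sketch using row_number() over the same join condition:
select event_id, product_id, for_date, value
from (
    select e.event_id, p.product_id, p.for_date, p.value,
           row_number() over (partition by e.event_id, p.product_id
                              order by p.for_date desc) as rn
    from events e
    join prices p on (e.ends_on is null and p.for_date <= e.starts_on)
                  or (e.ends_on is not null and p.for_date between e.starts_on and e.ends_on)
) ranked
where rn = 1;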

Related

Redshift aggregate grouped data by date range

I have the following table that contains quantities of items per day.
ID Date Item Count
-----------------------------
1 2022-01-01 Milk 10
2 2022-01-11 Milk 20
3 2022-01-12 Milk 10
4 2022-01-15 Milk 12
5 2022-01-16 Milk 10
6 2022-01-02 Bread 20
7 2022-01-03 Bread 22
8 2022-01-05 Bread 24
9 2022-01-08 Bread 20
10 2022-01-12 Bread 10
I want to aggregate (sum, avg, ...) the quantity per item for the last 7 days (or 14, 28 days). The expected outcome would look like this table.
ID Date Item Count Sum_7d
-------------------------------------
1 2022-01-01 Milk 10 10
2 2022-01-11 Milk 20 20
3 2022-01-12 Milk 10 30
4 2022-01-15 Milk 12 42
5 2022-01-16 Milk 10 52
6 2022-01-02 Bread 20 20
7 2022-01-03 Bread 22 42
8 2022-01-05 Bread 24 66
9 2022-01-08 Bread 10 56
10 2022-01-12 Bread 10 20
My first approach was using Redshift window functions like this
SELECT *, SUM(Count) OVER (PARTITION BY Item
ORDER BY Date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS Sum_7d
FROM my_table
but it does not give the expected results because there are missing dates and I could not figure out how to put a condition on the time range.
My fallback solution is a cross product, but that's not desirable because it is inefficient for large data.
SELECT l.Date, l.Item, l.Count, sum(r.Count) as Sum_7d
FROM my_table l,
my_table r
WHERE l.Date - r.Date < 7
AND l.Date - r.Date >= 0
AND l.Item = r.Item
GROUP BY 1, 2, 3
Is there any efficient and concise way to do such an aggregation on date ranges in Redshift?
Related:
Can I put a condition on a window function in Redshift?
Redshift SQL Window Function frame_clause with days
This is a missing data problem, and a common way to "fill in the blanks" is with a cross join. You correctly point out that this can get very expensive, because the cross joining (usually) massively expands the data being worked upon AND because Redshift isn't great at creating data. But you do have to fill in the missing data. The best way I have found is to create the (near) minimum data set that will complete the data, then UNION this data to the original table. The code below takes this path.
There is a way to do this without adding rows, but the SQL is large, inflexible, error prone and just plain ugly. You could create new columns (date and count) based on LAG(6), LAG(5), LAG(4) ... and compare the date of each, using the count only if the date is truly in range. If you want to sum over a different look-back you need to add columns, and things get uglier. Also, this will only be faster than the code below in certain circumstances (very few repeats of item). It just trades making new data in rows for making new data in columns. So don't go this way unless absolutely necessary.
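Just to show the shape (and why I advise against it), here is a sketch of that column-based look-back, truncated to two LAG steps instead of the six a real 7-day window needs. It uses the test table and renamed columns from the setup below:
-- Sketch only: every LAG step needs its own in-range check.
SELECT id, dt, item, cnt,
       cnt
       + CASE WHEN LAG(dt, 1) OVER (PARTITION BY item ORDER BY dt) > dt - 7
              THEN LAG(cnt, 1) OVER (PARTITION BY item ORDER BY dt) ELSE 0 END
       + CASE WHEN LAG(dt, 2) OVER (PARTITION BY item ORDER BY dt) > dt - 7
              THEN LAG(cnt, 2) OVER (PARTITION BY item ORDER BY dt) ELSE 0 END
       AS sum_7d_partial
FROM test;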
Now to what I think will work for you. You need a dummy row for every date and item combination that doesn't already exist. This is the minimal set of new data that will make your window function work. In reality I make all the combinations of date and item and merge these with the existing rows - a slight compromise from the ideal.
First let's set up your data. I changed some names as using reserved words for column names is not ideal.
create table test (ID int, dt date, Item varchar(16), Cnt int);
insert into test values
(1, '2022-01-01', 'Milk', 10),
(2, '2022-01-11', 'Milk', 20),
(3, '2022-01-12', 'Milk', 10),
(4, '2022-01-15', 'Milk', 12),
(5, '2022-01-16', 'Milk', 10),
(6, '2022-01-02', 'Bread', 20),
(7, '2022-01-03', 'Bread', 22),
(8, '2022-01-05', 'Bread', 24),
(9, '2022-01-08', 'Bread', 20),
(10, '2022-01-12', 'Bread', 10);
The SQL for generating what you want is:
with recursive dates(dt) as
( select min(dt) as dt
from test
union all
select dt + 1
from dates d
where d.dt < current_date
)
select *
from (
SELECT *, SUM(Cnt) OVER (PARTITION BY Item
ORDER BY Dt
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS Sum_7d
FROM (
select min(id) as id, dt, item, sum(cnt) as cnt
from (
select *
from test
union all
select NULL as id, dt, item, NULL as cnt
from ( select distinct item from test) as items
cross join dates
) as all_item_dates
group by dt, item
) as grouped
) as windowed
where id is not null
order by id, dt;
Quickly, here is what this does:
A recursive CTE creates the date range in question (from min date in table until today).
These dates are cross joined with the distinct list of items resulting in every date for every unique item.
This is UNIONed to the table so all data exists.
GROUP BY is used to merge real data rows with dummy rows for the same item and date.
Your window function is run.
A surrounding SELECT has a WHERE clause to remove any dummy rows.
As you will note, this does use a cross join, but on a much reduced set of data (just the unique item list). As long as this distinct list of items is much shorter than the table itself (very likely), this will perform much faster than other techniques. Also, if this is the kind of data you work with, you might find interest in this post I wrote - http://wad-design.s3-website-us-east-1.amazonaws.com/sql_limits_wp_2.html

Finding all instances where a foreign key appears multiple times grouped by month

I am not too familiar with SQL, and I have been tasked with something that, quite frankly, I have no clue how to go about.
I am just going to simplify the tables to the point where only the necessary fields are taken into consideration.
The tables look as follows.
Submission(course(string), student(foreign_key), date-submitted)
Student(id)
What I need to do is produce a table of active students per month, per course, with a total. An active student is anyone who has more than 4 submissions in the month. I am only looking at specific courses, so I will need to hard-code the values I need; for the sake of the example, "CourseA" and "CourseB".
The result should be as follows
month | courseA | CourseB | Total
------------------------------------------
03/2020 50 27 77
02/2020 25 12 37
01/2020 43 20 63
Any help would be greatly appreciated.
You can do this with two levels of aggregation: first by month, course and student (while filtering on students having more than 4 submissions), then by month (while pivoting the dataset):
select
month_submitted,
count(*) filter(where course = 'courseA') active_students_in_courseA,
count(*) filter(where course = 'courseB') active_students_in_courseB,
count(*) total
from (
select
date_trunc('month', date_submitted) month_submitted,
course,
student_id,
count(*) no_submissions
from submission
where course in ('courseA', 'courseB')
group by 1, 2, 3
having count(*) > 4
) t
group by 1
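Note that FILTER is PostgreSQL (and standard SQL) syntax. If your engine lacks it, the same pivot works with CASE expressions, since COUNT ignores nulls:
select
    month_submitted,
    count(case when course = 'courseA' then 1 end) active_students_in_courseA,
    count(case when course = 'courseB' then 1 end) active_students_in_courseB,
    count(*) total
from (
    select
        date_trunc('month', date_submitted) month_submitted,
        course,
        student_id
    from submission
    where course in ('courseA', 'courseB')
    group by 1, 2, 3
    having count(*) > 4
) t
group by 1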
You could do subqueries using the WITH keyword like this:
WITH monthsA AS (
    SELECT to_char(date_submitted, 'MM/YYYY') AS month, course, COUNT(*) AS students
    FROM Submission
    WHERE course = 'courseA'
    GROUP BY 1, 2
), monthsB AS (
    SELECT to_char(date_submitted, 'MM/YYYY') AS month, course, COUNT(*) AS students
    FROM Submission
    WHERE course = 'courseB'
    GROUP BY 1, 2
)
SELECT ma.month,
    COALESCE(ma.students, 0) AS courseA,
    COALESCE(mb.students, 0) AS courseB,
    COALESCE(ma.students, 0) + COALESCE(mb.students, 0) AS Total
FROM monthsA ma
LEFT JOIN monthsB mb ON ma.month = mb.month
ORDER BY 1 DESC
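One caveat: as written, COUNT(*) in each CTE counts submissions rather than active students. To apply the "more than 4 submissions in the month" rule, each CTE needs a per-student level first, along these lines (shown for courseA only; courseB is the same shape):
WITH activeA AS (
    SELECT to_char(date_submitted, 'MM/YYYY') AS month, student_id
    FROM Submission
    WHERE course = 'courseA'
    GROUP BY 1, 2
    HAVING COUNT(*) > 4
)
SELECT month, COUNT(*) AS students
FROM activeA
GROUP BY month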

Hive query to select only records in certain percentile

I have a table with two columns - ID and total duration:
id tot_dur
123 1
124 2
125 5
126 8
I want to have a Hive query that selects only the 75th percentile. It should be only the last record:
id tot_dur
126 8
This is what I have, but it's hard for me to understand the use of the OVER() and PARTITION BY() functions, since from what I researched, these are the functions I should use. Before I get the tot_dur column I should sum and group by the duration column. Not sure if percentile is the correct function, because I found use cases with percentile_approx.
select k1.id as id, percentile(cast(tot_dur as bigint),0.75) OVER () as tot_dur
from (
SELECT id, sum(duration) as tot_dur
FROM data_source
GROUP BY id) k1
group by id
If I've got you right, this is what you want:
with data as (select stack(4,
123, 1,
124, 2,
125, 5,
126, 8) as (id, tot_dur))
-----------------------------------------------------------------------------
select data.id, data.tot_dur
from data
cross join (select percentile(tot_dur, 0.75) as threshold from data) as t
where data.tot_dur >= t.threshold;
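On percentile vs percentile_approx: percentile works on integer columns and is exact, which is fine at this size; for large data or DOUBLE columns you would swap in percentile_approx, and the shape of the query stays the same (same data CTE as above):
select data.id, data.tot_dur
from data
cross join (select percentile_approx(tot_dur, 0.75) as threshold from data) as t
where data.tot_dur >= t.threshold;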

Postgresql : Check if the last number is the highest

I have a large database, and one field should be an incremental number, but it sometimes resets and I must detect the resets (the marked rows):
Table 1:
Shop  #Sell  DATE
EC1   56     1/10/2015
EC1   57     2/10/2015
EC1   11     3/10/2015   <-- reset
EC1   12     4/10/2015   <-- reset
AS2   20     1/10/2015
AS2   21     2/10/2015
AS2   22     3/10/2015
AS2   23     4/10/2015
To solve this problem I thought of finding the highest number for each SHOP and checking whether it is the one with the highest DATE. Do you know an easier way to do it?
My concern is that the approach I have in mind could be a problem, since I have a large database.
Do you know how I can write the query I am thinking of, or do you have any other ideas?
The query you have in mind will give you all Shop values having a discontinuity in Sell number.
If you want to get the offending record you can use the following query:
SELECT Shop, Sell, DATE
FROM (
SELECT Shop, Sell, DATE,
LAG(Sell) OVER (PARTITION BY Shop ORDER BY DATE) AS prevSell
FROM Shops ) t
WHERE Sell < prevSell
ORDER BY DATE
LIMIT 1
The above query will return the first discontinuity found across all shops; drop the LIMIT 1 to list every one.
Output:
Shop Sell DATE
---------------------
EC1 11 2015-03-10
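If you want the first reset per shop instead, PostgreSQL's DISTINCT ON handles it neatly; a sketch built on the same LAG subquery:
SELECT DISTINCT ON (Shop) Shop, Sell, DATE
FROM (
    SELECT Shop, Sell, DATE,
           LAG(Sell) OVER (PARTITION BY Shop ORDER BY DATE) AS prevSell
    FROM Shops ) t
WHERE Sell < prevSell
ORDER BY Shop, DATE;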
EDIT:
In case you cannot use window functions and you only want the id of the shop having the discontinuity, you can use the following query:
SELECT s.Shop
FROM Shops AS s
INNER JOIN (
SELECT Shop, MAX(Sell) AS Sell, MAX(DATE) AS DATE
FROM Shops
GROUP BY Shop ) t
ON s.Shop = t.Shop AND s.DATE = t.DATE
WHERE t.Sell <> s.Sell
The above will work provided that you have unique DATE values per Shop.
I think the following is the type of query you want:
select s.*
from (select distinct shop,
             max(sell) over (partition by shop) as maxsell,
             first_value(sell) over (partition by shop order by date desc) as lastsell
      from shops
     ) s
where maxsell <> lastsell;

Oracle dedupe rows based on max values of 2 columns in conjunction

Was wondering if anyone knew an efficient way to dedupe records in a large dataset using Oracle SQL, based on the max values of 2 attributes in conjunction.
In the hypothetical example below, I am looking to remove all duplicate COMPANYID / CHILDID pairs by selecting first the maximum transactionid and then, where a pair still has duplicates, the maximum BATCHID.
note: transactionID and batchID may have null values (which should be treated as the lowest value)
Table: Transaction
CompanyID  ChildID  transactionid  BatchID  Product Details
ABC        EFG      306                     Product1
ABC        EFG      306            54       Product2
ZXY        BFG      405            003      Product1
ZXY        BFG      405            004      Product2
ZXY        BFG      407                     Product3
Expected Result:
ABC | EFG | 306 | 54  | Product2   --selected on basis of highest transactionid and batchid
ZXY | BFG | 405 | 407 | Product3   --selected on basis of highest transactionid
I envisioned simply:
1) Using a MAX on transactionid, then subquerying that result to take the MAX batchID in addition
2) Self-joining the "de-duped" set to the original set to obtain the product information
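Roughly, the shape I had in mind is the following (a sketch only, assuming the table is named transactions):
-- Step 1 of what I envisioned: keep the max transactionid per pair.
-- The same pattern would then repeat for MAX(batchid) on the survivors.
-- Note the equality join silently drops rows whose transactionid is NULL,
-- which is exactly the null handling I am unsure about.
SELECT t.*
FROM transactions t
JOIN (SELECT companyid, childid, MAX(transactionid) AS max_txn
      FROM transactions
      GROUP BY companyid, childid) m
  ON t.companyid = m.companyid
 AND t.childid = m.childid
 AND t.transactionid = m.max_txn;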
Does anybody know of a more efficient / cleaner way to achieve this and a way to handle the nulls better?
Appreciate any feedback.
From Oracle 11g, you can use this kind of query:
with w(CompanyID, ChildID, transactionid, BatchID, Product_Details) as
(
select 'ABC', 'EFG', 306, null, 'Product1 ' from dual
union all
select 'ABC', 'EFG', 306, 54, 'Product2' from dual
union all
select 'ZXY', 'BFG', 405, 003, 'Product1' from dual
union all
select 'ZXY', 'BFG', 405, 004, 'Product2' from dual
union all
select 'ZXY', 'BFG', 407, null, 'Product3' from dual
)
select w.CompanyID,
w.ChildID,
max(w.transactionid) keep (dense_rank last order by nvl(w.transactionid, 0), nvl(w.batchid, 0)) max_transactionid,
max(w.batchid) keep (dense_rank last order by nvl(w.transactionid, 0), nvl(w.batchid, 0)) max_batchid,
max(w.Product_Details) keep (dense_rank last order by nvl(w.transactionid, 0), nvl(w.batchid, 0)) max_Product_Details
from w
group by w.CompanyID, w.ChildID
;
The nvl function lets you handle the null cases. Here is the output (which does not match yours, but I wrote the query as I understood what you wanted):
COMPANYID  CHILDID  MAX_TRANSACTIONID  MAX_BATCHID  MAX_PRODUCT_DETAILS
ABC        EFG      306                54           Product2
ZXY        BFG      407                             Product3
EDIT: Let me try to explain DENSE_RANK and LAST further: inside a GROUP BY, this syntax acts as an aggregate function (like SUM, AVG, ...).
Within a group, the ORDER BY gives the sorting (here, transactionid and batchid).
Then DENSE_RANK LAST states that you will focus on the last-ranked row(s) of this sorting (you can indeed have several rows with the same rank).
The MAX takes the maximum value among these top-ranked rows. Most of the time you only have one row, so the MAX can appear useless, but it is not. This is why you will often see MIN paired with DENSE_RANK FIRST, and MAX with DENSE_RANK LAST.
Here is the Oracle doc on this subject.
Because you are dealing with multiple columns, you should also consider just using row_number():
select t.*
from (select t.*,
row_number() over (partition by CompanyId, ChildId
order by transactionid desc nulls last, BatchID desc nulls last
) as seqnum
from t
) t
where seqnum = 1;
The keep/dense_rank method is fast. I'm not sure if doing it multiple times is faster than using row_number(). Testing can give you this information.