max records with dense rank - sql

Is there a better alternative to using MAX() to get the most recent records?
I have been playing with DENSE_RANK and PARTITION BY in the query below,
but I am getting undesired results and poor performance.
select Tdate = (Select max(Date)
from Industries
where Industries.id = i.id
and Industries.Date <= '22 June 2011')
from #ii_t i
Many Thanks.

The supplied query doesn't use the DENSE_RANK windowing function. Not being familiar with your data structure, I believe your query is attempting to find the largest value of Date for each Industry id, yes? Rewriting the above query to use a ranking function, I would write it as a common table expression.
;
WITH RANKED AS
(
SELECT
II.*
-- RANK would serve just as well in this scenario
, DENSE_RANK() OVER (PARTITION BY II.id ORDER BY II.Date desc) AS most_recent
FROM Industries II
WHERE
II.Date <= '22 June 2011'
)
, MOST_RECENT AS
(
-- This query restricts it to the most recent row by id
SELECT
R.*
FROM
RANKED R
WHERE
R.most_recent = 1
)
SELECT
*
FROM
MOST_RECENT MR
INNER JOIN
#ii_t i
ON i.id = MR.id
Also, to address the question of performance, you might need to look at how Industries is structured. There may not be an index on that table, and if there is, it might not cover the Date (descending) and id columns. To improve the efficiency of the above query, don't pull back everything in the RANKED section; I did that only because I wasn't sure which columns you need, but obviously the less you pull back, the more efficiently the engine can retrieve the data.
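As a rough sketch of the kind of index that could help (the index name is made up and SQL Server syntax is assumed; adjust the key columns to whatever your queries actually filter and sort on):
-- Hypothetical covering index for the RANKED query above
CREATE NONCLUSTERED INDEX IX_Industries_id_Date
ON Industries (id, Date DESC);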

Try this (untested) code and see if it does what you want. By the looks of it, it should return the same thing and hopefully run a bit faster.
select Tdate = max(Industries.Date)
from #ii_t i
left outer join Industries
on Industries.id = i.id and
Industries.Date <= '22 June 2011'
group by i.id

Related

Can I avoid joining the same table multiple times?

Is there a way to improve the following query?
I need an optimized version of it.
The reason I'm joining Date_Table multiple times is that the ID and date_value columns are not in ascending order.
i.e.
ID  date_value
1   2022-09-07
2   2022-02-02
3   2022-11-12
Sample data: (image omitted)
The maximum Date from the Agreement table is calculated based on the Date_Table.date_value column, and the query should return only a single row; in the sample, it is the row highlighted in green.
Thank you so much!
SELECT * FROM Agreement
WHERE
dim_date_id = (
SELECT
Date_Table.ID
FROM (
SELECT
MAX(Date_Table.date_value) AS date_value
FROM Agreement
INNER JOIN Date_Table
ON Agreement.DIM_DATE_ID = Date_Table.ID
) AS last_day
INNER JOIN Date_Table
ON last_day.date_value = Date_Table.date_value
);
If Agreement is a large table, you should first find all the distinct date ids, then join them to Date_Table. Then use a rank() window function to find the id of the most recent record:
Select Agreement.* From Agreement Inner Join (
Select ID From (
Select Date_Table.ID
,rank() Over (Order by Date_Table.date_value desc) as recent
From Date_Table Inner Join (
Select Distinct Dim_Date_ID as ID From Agreement
) A On A.ID=Date_Table.ID
) R where recent=1
) X On Agreement.DIM_DATE_ID = X.ID
On first glance this looks just as complicated as your original query. But it quickly reduces the Agreement results to only a list of date ids, and especially if that field is indexed it is a fast query. Date_Table is then Inner Joined to find the best (most recent) Date_Value using a rank() function. The whole thing is filtered to retain only one record, the most recent, and that date_id is used to filter Agreement.
Again, I recommend that you index Agreement.Dim_Date_ID to make this query perform well.
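For instance, a minimal sketch of that index (the name is made up; plain CREATE INDEX syntax works in most dialects):
CREATE INDEX ix_agreement_dim_date_id ON Agreement (DIM_DATE_ID);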

How to join two tables based on a calculated field?

I have two SQL queries that produce the same kind of output and have the same grouping and ordering:
select date_trunc('month', inserted_at)::date as date, count(id) from payment_logs where payment_logs.event_name = 'subscription_created' group by date order by date desc;
select date_trunc('month', inserted_at)::date as date, count(id) from users group by date order by date desc;
I would like to join those two results based on the calculated date field (the month), and get a result with three columns: date, count_users and count_payment_logs.
How can I achieve that? Thanks.
Something like this
select plog.date as odata, usr.cntusr, plog.cntlog
from (
select date_trunc('month', inserted_at)::date as date, count(id) cntlog
from payment_logs
where payment_logs.event_name = 'subscription_created'
group by date order by date desc
) plog
join (
select date_trunc('month', inserted_at)::date as date, count(id) cntusr
from users
group by date
) usr on plog.date = usr.date
order by odata desc
Nothing wrong with the accepted answer, but I wanted to show an alternative and add some color. Instead of subqueries, you can also use common table expressions (CTEs), which improve readability and have some other useful features as well. Here is an example using CTEs:
with payments as (
select
date_trunc('month', inserted_at)::date as date,
count(id) as payment_count
from payment_logs
where
event_name = 'subscription_created'
group by date
),
users as (
select
date_trunc('month', inserted_at)::date as date,
count(id) as user_count
from users
group by date
)
select
p.date, p.payment_count, u.user_count
from
payments p
join users u on
p.date = u.date
order by
p.date desc
In my opinion the abstraction is neater and makes the code much easier to follow (and thus maintain).
Other notes:
The order by is expensive, and you can avoid it within each of the subqueries/CTEs since the sorting is done at the end anyway. Any ordering inside the subqueries will be clobbered by whatever you do in the main query, so just omit it completely. Your results will not differ, and your query will be more efficient.
In this example, you probably don't have any missing months, but it's possible, especially if you expand this concept to future queries. In such a case, you may want to consider a full outer join instead of an inner join (some months may appear in users but not in payments, or vice versa):
select
coalesce (p.date, u.date) as date,
p.payment_count, u.user_count
from
payments p
full outer join users u on
p.date = u.date
order by
1 desc
Another benefit of CTEs versus subqueries is that you can reuse them. In this example, I want to mimic the full outer join concept but with one additional twist -- I also have data from another table by month that I want in the query. CTEs let me reference "payments" and "users" as many times as I want: here I use them in the all_dates CTE and again in the main query. By creating "all_dates" I can now use left joins and avoid the awkward coalescing in the joins (not wrong, just ugly).
with payments as (
-- same as above
),
users as (
-- same as above
),
all_dates as (
select date from payments -- referred to payments here
union
select date from users
)
select
a.date, ac.days_in_month, p.payment_count, u.user_count
from
all_dates a
join accounting_calendar ac on
a.date = ac.accounting_month
left join payments p on -- referred to it here again, same CTE
a.date = p.date
left join users u on
a.date = u.date
order by
a.date desc
The point is you can reuse the CTEs.
A final advantage is that you can declare a CTE materialized or non-materialized (the default). A materialized CTE will essentially pre-compute and store its results, which in certain cases may perform better. A non-materialized one, on the other hand, behaves like a standard subquery, which is nice because conditions from the outer query can be pushed down into it.
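As a minimal sketch, assuming PostgreSQL 12 or later (earlier versions reject the keyword and always materialize), the keyword goes right after AS:
with payments as materialized (
select
date_trunc('month', inserted_at)::date as date,
count(id) as payment_count
from payment_logs
where
event_name = 'subscription_created'
group by date
)
select * from payments order by date desc;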

What is the most efficient way to find the first and last entry of an entity in SQL?

I was asked this question in an interview. A table, trips, contains the following columns (customer_id, start_from, end_at, start_at_time, end_at_time), with each trip stored as a separate row. How would you find the list of all the customers who started yesterday from point A and ended yesterday at point P?
I provided a solution using window functions that identified the list of all customers who started their day at A, and then did an inner join of that list with the customers who ended their day at P (using the same window functions).
The solution I gave was this:
SELECT a.customer_id
FROM
(SELECT a.customer_id
FROM
(SELECT customer_id,
start_from,
row_number() OVER (PARTITION BY customer_id
ORDER BY start_at_time ASC) AS rnk
FROM trips
WHERE to_date(start_at_time)= date_sub(CURRENT_DATE, 1) ) AS a
WHERE a.rnk=1
AND a.start_from='A' ) AS a
INNER JOIN
(SELECT a.customer_id
FROM
(SELECT customer_id,
end_at,
row_number() OVER (PARTITION BY customer_id
ORDER BY end_at_time DESC) AS rnk
FROM trips
WHERE to_date(end_at_time)= date_sub(CURRENT_DATE, 1) ) AS a
WHERE a.rnk=1
AND a.end_at='P' ) AS b ON a.customer_id=b.customer_id
My interviewer said my solution was correct but that there is a more efficient way to solve this problem. I've been searching for a more efficient approach but I could not find one so far. Can you suggest one?
I might use first_value() for this:
select t.customer_id
from (select t.*,
first_value(start_from) over (partition by customer_id order by start_at_time) as first_start,
first_value(end_at) over (partition by customer_id order by start_at_time desc) as last_end
from t
where start_at_time >= date_sub(CURRENT_DATE, 1) and
start_at_time < CURRENT_DATE
) t
where first_start = start_from and -- just some filtering so select distinct is not needed
first_start = 'A' and
last_end = 'P';
I should add that many databases support an equivalent function for aggregation, and I would use that instead.
This assumes that starts are not repeated. To be safe, you can add select distinct, but there is a performance hit for that.
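To sketch the aggregation alternative mentioned above: assuming a dialect that offers the min_by()/max_by() aggregates (for example Spark SQL 3.0+, Trino, or DuckDB; this is not standard SQL, and the column names are taken from the question), the whole thing collapses into one grouped query:
select customer_id
from trips
where start_at_time >= date_sub(current_date, 1) and
start_at_time < current_date
group by customer_id
having min_by(start_from, start_at_time) = 'A' and
max_by(end_at, end_at_time) = 'P';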
A generalized version of what I would probably have done:
SELECT fandl.a
FROM (
SELECT a, MIN(start) AS t0, MAX(start) AS tN
FROM someTable
WHERE start >= DATE_SUB(CURRENT_DATE, 1) AND start < CURRENT_DATE
GROUP BY a
) AS fandl
INNER JOIN someTable AS st0 ON fandl.a = st0.a AND fandl.t0 = st0.start
INNER JOIN someTable AS stN ON fandl.a = stN.a AND fandl.tN = stN.start
WHERE st0.b1 = 'A' AND stN.b2 = 'P'
;
I used the same date functions you did, since you did not specify the SQL dialect.
Note that, in many RDBMS, if there is an (a, start) index, the subquery and joins can be done with the index alone; actual table access would only be required for the final WHERE evaluation.
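For instance, a sketch of such an index (names are illustrative):
CREATE INDEX ix_sometable_a_start ON someTable (a, start);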

Complex Aggregations in SQL

All I want to do is get a data set that shows me how many orders were placed and how many calls were made for each order. I also need the date of the calls and the date of the order, in the same table. The table is 10M+ rows, so aggregation of the result set is essential for analysis. The only analysis I want to do is the sum of calls/the total orders, and to be able to see how many support_tickets were generated from orders within an order range, up to a call_date. Very simple, but surprisingly complex to code up. Here is my attempt. I have also tried to change the below into a union, but still get wrong aggregate results.
-- The query:
SELECT
category_name
count(order_code)
order_date
sum(support_ticket_call)
call_date
FROM
(Select distinct name, order_code, order_date from table1) b
left join
(select count(call_ids), call_date FROM table2) b
on b.order_ID_code = a.order_id_code
group by category_name, order_date, call_date
Whenever there are no support_ticket_calls, the call_date is NULL, as you would expect. The count of orders is around 60,000 though, which is different from the usual 12 or so in the rest of the result set. I know something is wrong with this query, but it's driving me insane trying to solve it; I've been at it literally all day.
It's a little difficult to answer this question without sample data and expected results, but the comment was getting too long.
You have several problems with your current query. First, you need to join on the date fields in your ON criteria. You also need to add a GROUP BY to the queries that use aggregation. Finally, where does support_ticket_call come from? Can I presume it's an alias for count(call_ids)?
Something like this should get you close:
SELECT
a.name as category_name,
count(a.order_code),
sum(b.support_ticket_call),
a.order_date as call_date
FROM
(Select distinct name, order_code, order_date
from table1) a
left join
(select count(call_ids) as support_ticket_call, call_date
from table2 group by call_date) b on a.order_date = b.call_date
group by a.name, a.order_date

Find max over multiple columns

I am trying to query a list of meetings from the most recent semester, where semester is determined by two fields (year, semester). Here's a basic outline of the schema:
Otherfields Year Semester
meeting1 2014 1
meeting2 2014 1
meeting3 2013 2
... etc ...
As the max should be considered for the Year first, and then the Semester, my results should look like this:
Otherfields Year Semester
meeting1 2014 1
meeting2 2014 1
Unfortunately, simply using the MAX() function on each column separately will find Year=2014, Semester=2, which is incorrect. I tried a couple of approaches using nested subqueries and inner joins but couldn't quite get anything to work. What is the most straightforward approach to solving this?
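(For illustration, the naive per-column approach described above would be something like the following, which mixes values that never occur together in one row; the table name Meeting is assumed here:)
SELECT MAX(Year) AS Year, MAX(Semester) AS Semester
FROM Meeting;
-- returns Year = 2014, Semester = 2 for the sample data above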
Using a window function:
SELECT Year, Semester, RANK() OVER(ORDER BY Year DESC, Semester DESC) R
FROM your_table;
R will be a column containing the "rank" of the pair (Year, Semester). You can then use this column as a filter, for instance:
WITH TT AS (
SELECT Year, Semester, RANK() OVER(ORDER BY Year DESC, Semester DESC) R
FROM your_table
)
SELECT ...
FROM TT
WHERE R = 1;
If you don't want gaps between ranks, you can use dense_rank instead of rank.
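For instance, it is a drop-in replacement:
SELECT Year, Semester, DENSE_RANK() OVER(ORDER BY Year DESC, Semester DESC) R
FROM your_table;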
This answer assumes you use an RDBMS that is advanced enough to offer window functions (i.e. not MySQL, which only gained them in version 8.0).
I wouldn't be surprised if there's a more efficient way to do this (and avoid the duplicate subquery), but this will get you the answer you want:
SELECT * FROM table WHERE Year =
(SELECT MAX(Year) FROM table)
AND Semester =
(SELECT MAX(Semester) FROM table WHERE Year =
(SELECT MAX(Year) FROM table))
Here's Postgres:
with table2 as /*virtual temporary table*/
(
select *, year::text || semester as yearsemester
from table
)
select Otherfields, year, semester
from table2
where (Otherfields, yearsemester) in
(
select Otherfields, max(yearsemester)
from table2
group by Otherfields
)
I've been overthinking this, there's a much simpler way to get this:
SELECT Meeting.year, Meeting.semester, Meeting.otherFields
FROM Meeting
JOIN (SELECT year, semester
FROM (SELECT year, semester
FROM Meeting
ORDER BY year DESC, semester DESC)
WHERE ROWNUM = 1) MostRecent
ON MostRecent.year = Meeting.year
AND MostRecent.semester = Meeting.semester
(and working Fiddle)
Note that variations of this should work for pretty much all dbs (anything that supports a limiting clause in a subquery); here's the MySQL version, for example:
SELECT Meeting.year, Meeting.semester, Meeting.otherFields
FROM Meeting
JOIN (SELECT year, semester
FROM Meeting
ORDER BY year DESC, semester DESC
LIMIT 1) MostRecent
ON MostRecent.year = Meeting.year
AND MostRecent.semester = Meeting.semester
(...and working fiddle)
Given some of the data in this answer this should be performant for Oracle, and I suspect other dbs as well (given the shortcuts the optimizer is allowed to take). This should be able to replace the use of things like ROW_NUMBER() in most instances where no partitioning clause is provided (no window).
Why don't you simply use ORDER BY?
That way it is easier to handle and less messy. :)
SELECT * FROM table
Where Year = (Select Max(Year) from table) /* optional clause to select only 2014*/
Order by Semester ASC, Year DESC, Otherfields; /* numerically lowest semester first; in case of a semester clash, sort by descending year */
EDIT
If you need limited results from 2014, use the LIMIT clause (for MySQL):
SELECT * FROM table
Where Year = (Select Max(Year) from table)
Order by Semester ASC, Year DESC, Otherfields
LIMIT 10;
It will order first, then apply the LIMIT of 10, so you get your limited result set.
This will fetch output like:
Otherfields Year Semester
meeting1 2014 1
meeting2 2014 1
meeting1 2013 1
meeting2 2013 2
Answering my own question here:
This query was run in a stored procedure, so I went ahead and found the maximum year/semester in separate queries before the main query. This is most likely inefficient and inelegant, but it is also the most understandable method; I don't need to worry about other members of my team getting confused by it. I'll leave this question here since it's generally applicable to many other situations, and there appear to be some good answers providing alternative approaches.
-- Find the most recent year.
SELECT MAX(year) INTO max_year FROM meeting;
-- Find the most recent semester in the year.
SELECT MAX(semester) INTO max_semester FROM meeting WHERE year = max_year;
-- Open a ref cursor for meetings in most recent year/semester.
OPEN meeting_list FOR
SELECT otherfields, year, semester
FROM meeting
WHERE year = max_year
AND semester = max_semester;