Selecting 5 Most Recent Records Of Each Group - sql

The statement below retrieves the top 2 records within each group in SQL Server. It works correctly, but it doesn't scale: if I wanted to retrieve the top 5 or 10 records instead of just 2, the query would grow very quickly.
How can I convert this query into something that returns the same records but can be changed quickly to return the top 5 or 10 records within each group, rather than just 2? (That is, I want to just tell it to return the top 5 within each group, rather than writing the 5 unions that the format below would require.)
Thanks!
WITH tSub
as (SELECT CustomerID,
TransactionTypeID,
Max(EventDate) as EventDate,
Max(TransactionID) as TransactionID
FROM Transactions
WHERE ParentTransactionID is NULL
Group By CustomerID,
TransactionTypeID)
SELECT *
from tSub
UNION
SELECT t.CustomerID,
t.TransactionTypeID,
Max(t.EventDate) as EventDate,
Max(t.TransactionID) as TransactionID
FROM Transactions t
WHERE t.TransactionID NOT IN (SELECT tSub.TransactionID
FROM tSub)
and ParentTransactionID is NULL
Group By CustomerID,
TransactionTypeID

Use ROW_NUMBER() with PARTITION BY to solve this type of problem:
SELECT <columns>
FROM (SELECT <columns>,
ROW_NUMBER() OVER (PARTITION BY <GroupColumn> ORDER BY <OrderColumn> DESC) AS rownum
FROM YourTable) ut
WHERE ut.rownum <= 5
This partitions the rows by the group column, numbers the rows in each partition by the order column (EventDate here, descending so that the most recent rows come first), and then keeps only the rows with rownum <= 5. Change the value 5 to get the top N most recent entries of each group.
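The pattern above can be tried end to end with a small sketch. This one uses Python's built-in sqlite3 module rather than SQL Server, purely so it runs anywhere (it assumes an SQLite build with window functions, 3.25 or later). The table and column names follow the question, the rows are invented, and the helper top_n_per_customer is a hypothetical name.

```python
import sqlite3

# In-memory database with invented sample rows (not from the question).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Transactions ("
    "TransactionID INTEGER, CustomerID INTEGER, EventDate TEXT)"
)
conn.executemany(
    "INSERT INTO Transactions VALUES (?, ?, ?)",
    [
        (1, 10, "2021-01-01"),
        (2, 10, "2021-01-02"),
        (3, 10, "2021-01-03"),
        (4, 20, "2021-02-01"),
        (5, 20, "2021-02-02"),
    ],
)


def top_n_per_customer(n):
    """Return the n most recent transactions for each customer."""
    sql = """
        SELECT CustomerID, TransactionID, EventDate
        FROM (SELECT t.*,
                     ROW_NUMBER() OVER (PARTITION BY CustomerID
                                        ORDER BY EventDate DESC) AS rownum
              FROM Transactions t) ut
        WHERE ut.rownum <= ?
        ORDER BY CustomerID, EventDate DESC
    """
    return conn.execute(sql, (n,)).fetchall()


# Only the argument changes when a different N is wanted.
top2 = top_n_per_customer(2)
top1 = top_n_per_customer(1)
print(top2)
```

Passing N as a parameter is the point of the design: the query text stays the same whether 2, 5, or 10 rows per group are needed.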

Related

How to use Array_agg without returning the same values in different Order?

When using Array_agg, it returns the same values in different orders. I tried using distinct in a few places and it didn't work. I tried using an order before and after the array and it would fail or not properly exclude results.
I am trying to find all fields in the field column that share the same time and same ID and put them into an array.
Columns are Fieldname, ID, Time
select b.Field, count(*)
from (select Time, ID, array_agg(fieldname) as Field
from a
group by 1,2
order by 3) b
group by b.field
order by 1 desc
This produces duplicate results
For example I will have:
Field Name    Count
Ghost,Mark    1234
Mark,Ghost    1234
I also tried this below where I add a subquery where I first order the fields alphabetically when grouping time and ID but it failed to execute. I think due to array_agg not being the root query?
select a.Field, count(*)
from
(select Time, ID, array_agg(fieldname) as field
from
(select Time, ID, fieldname
from a
group by 1,2
order by 3 desc) a
group by 1,2 ) b
group by 1
order by 2 desc

Get the top N rows by row count in GROUP BY

I'm querying a records table to find which users are my top record creators for certain record types. The basic starting point of my query looks something like this:
SELECT recordtype, createdby, COUNT(*)
FROM recordtable
WHERE recordtype in (...)
GROUP BY recordtype, createdby
ORDER BY recordtype, createdby DESC
But there are many users who have created records - I want to narrow this down further.
I added HAVING COUNT(*) > ..., but some record types only have a few records, while others have hundreds. If I do HAVING COUNT(*) > 10, I won't see that all 9 records of type "XYZ" were created by the same person, but I will have to scroll through every person that's created only 15, 30, 50, etc. of the 3,500 records of type "ABC."
I only want the top 5, 10, or so creators for each record type.
I've found a few questions that address the "select top N in group" part of the question, but I can't figure out how to apply them to what I need. The answers I could find are in cases where the "rank by" column is a value stored in the table, not an aggregate.
(Example: "what are the top cities in each country by population?", with data that looks like this:)
Country          City        Population
United States    New York    123456789
United States    Chicago     123456789
France           Paris       123456789
I don't know how to apply the methods I've seen used to answer that (row_number(), mostly) to get the top N by COUNT(*).
Here is one way to get the top 10 rows in each group:
select * from(
select *, row_number() over (partition by recordtype order by cnt desc) rn
from (
SELECT recordtype, createdby, COUNT(*) cnt
FROM recordtable
WHERE recordtype in (...)
GROUP BY recordtype, createdby
)t
)t where rn <= 10
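As a rough illustration of ranking by an aggregate, the sketch below runs the same two-step shape (count first, then number the counted rows per recordtype) against invented data. It uses Python's sqlite3 module only to make the example self-contained, and assumes an SQLite build with window functions. The createdby tie-breaker in the ORDER BY is an addition, not part of the answer, so that the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recordtable (recordtype TEXT, createdby TEXT)")

# Invented rows: one tuple per created record.
rows = (
    [("ABC", "ann")] * 3
    + [("ABC", "bob")] * 2
    + [("ABC", "cat")] * 1
    + [("XYZ", "dan")] * 2
    + [("XYZ", "eve")] * 1
)
conn.executemany("INSERT INTO recordtable VALUES (?, ?)", rows)

sql = """
    SELECT recordtype, createdby, cnt
    FROM (SELECT c.*,
                 ROW_NUMBER() OVER (PARTITION BY recordtype
                                    ORDER BY cnt DESC, createdby) AS rn
          FROM (SELECT recordtype, createdby, COUNT(*) AS cnt
                FROM recordtable
                GROUP BY recordtype, createdby) c) ranked
    WHERE rn <= ?
    ORDER BY recordtype, cnt DESC
"""

# Top 2 creators per record type; "cat" is dropped from ABC.
top_creators = conn.execute(sql, (2,)).fetchall()
print(top_creators)
```

The inner query turns COUNT(*) into an ordinary column, which is what lets the window function treat it like a stored value.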
If I understand correctly, you want to get the top N records with the biggest count. You can achieve this with a subquery like this (I assume you are using MySQL, PostgreSQL, or Db2; in other DB engines LIMIT and OFFSET may differ, for example in SQL Server this is achieved with SELECT TOP n * FROM ...):
SELECT A.recordtype, A.createdby, A.total FROM (
SELECT recordtype, createdby, COUNT(*) as total
FROM recordtable
WHERE recordtype in (...)
GROUP BY recordtype, createdby
) AS A ORDER BY total DESC, recordtype, createdby
LIMIT 10 OFFSET 0
Limit is the number of records you want in the results page, and offset is the number of records to skip before taking the result page.
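A minimal sketch of how LIMIT and OFFSET page through a result, using Python's sqlite3 module and an invented totals table (both the table and the page helper are hypothetical). Note that the limit applies to the whole ordered result, not to each group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE totals (createdby TEXT, total INTEGER)")
conn.executemany(
    "INSERT INTO totals VALUES (?, ?)",
    [("ann", 50), ("bob", 40), ("cat", 30), ("dan", 20), ("eve", 10)],
)


def page(limit, offset):
    """Return `limit` rows after skipping `offset` rows, biggest total first."""
    sql = """
        SELECT createdby, total
        FROM totals
        ORDER BY total DESC
        LIMIT ? OFFSET ?
    """
    return conn.execute(sql, (limit, offset)).fetchall()


first_page = page(2, 0)   # the two biggest totals
second_page = page(2, 2)  # skips those two, returns the next two
print(first_page, second_page)
```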
If you use SQL Server, it may look like this (there is also a way to apply an offset; take a look at SQL Server OFFSET and LIMIT):
SELECT TOP 10 A.recordtype, A.createdby, A.total FROM (
SELECT recordtype, createdby, COUNT(*) as total
FROM recordtable
WHERE recordtype in (...)
GROUP BY recordtype, createdby
) AS A ORDER BY total DESC, recordtype, createdby
For a grouped result, take a look at this post: http://www.silota.com/docs/recipes/sql-top-n-group.html
So, to take the first 10 records in each group rather than only the first 10 overall, I combine this answer with the linked post and the approach of @eshirvana:
SELECT * FROM (
SELECT *, row_number() OVER (PARTITION BY recordtype ORDER BY total DESC) rn
FROM (
SELECT recordtype, createdby, COUNT(*) as total
FROM recordtable
WHERE recordtype in (...)
GROUP BY recordtype, createdby
) t
) t WHERE rn <= 10

How to filter records by their count per date?

I have a table 'A' that has a date column, and the same date can appear in several records. I'm trying to filter the records where the number of records per day is less than 5, while still keeping all the fields of the table.
I mean that if I have only 4 records on 11/10/2017, I need to filter all 4 of these records.
You can select them with a subquery. In the subquery, group the rows by the date column and use HAVING with an aggregated count to find out how many rows each date group has, then select all rows whose date has a count lower than 5:
SELECT *
FROM A
WHERE A.date in (SELECT subA.date
FROM A AS subA
GROUP BY subA.date
HAVING COUNT(*) < 5 );
Take Care's answer is good. Alternatively, you can use an analytic/windowing function. I'd benchmark both and see which one works better.
with cte as (
select *, count(1) over (partition by date) as cnt
from table_a
)
select *
from cte
where cnt < 5
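To compare the two approaches on the same data, here is a small sketch using Python's sqlite3 module (assuming an SQLite build with window functions). The table name A and the date column follow the question; the id column and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (id INTEGER, date TEXT)")

# Invented data: 4 rows on one day, 6 rows on the next.
rows = [(i, "2017-10-11") for i in range(1, 5)]
rows += [(i, "2017-10-12") for i in range(5, 11)]
conn.executemany("INSERT INTO A VALUES (?, ?)", rows)

# Approach 1: subquery with GROUP BY / HAVING.
sql_subquery = """
    SELECT id, date
    FROM A
    WHERE date IN (SELECT date FROM A GROUP BY date HAVING COUNT(*) < 5)
    ORDER BY id
"""

# Approach 2: window function that attaches the per-date count to every row.
sql_window = """
    WITH cte AS (
        SELECT id, date, COUNT(*) OVER (PARTITION BY date) AS cnt
        FROM A
    )
    SELECT id, date
    FROM cte
    WHERE cnt < 5
    ORDER BY id
"""

via_subquery = conn.execute(sql_subquery).fetchall()
via_window = conn.execute(sql_window).fetchall()
print(via_subquery == via_window)
```

Both queries keep only the rows from the day that has fewer than 5 records; which one is faster depends on the engine and the data, so benchmarking remains the sensible advice.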

Nested SQL Server Query Max Date

Ladies and Gents,
I need to write a query that grabs data from a view, but I'm not sure how to go about this. The issue is there is really no key and there are two fields I'm concerned with that will control what rows I need to retrieve.
The view looks something like this:
Category columna columnb uploaddate
-----------------------------------------------------
a value value 1/30/2013 04:04:04:000
a value value 1/29/2013 04:04:04:000
b value value 1/28/2013 01:23:04:000
b value value 1/30/2013 04:04:04:000
b value value 1/30/2013 04:04:04:000
c value value 1/30/2013 01:01:01:000
c value value 1/30/2013 01:01:01:000
What I need to retrieve is all rows for each unique category and the newest uploaddate. So in the example above I would get 1 row for category a which would have the newest uploaddate. Category b would have 2 rows which have the 1/30/2013 date. Category c would have two rows also.
I also need to compare only the date of upload, not the time, as the loading can take a couple of seconds. I was trying to use max date, but it compares the time down to the second.
Any guidance/thoughts would be great.
Thanks!
EDIT:
Here is what I threw together so far and I think it's close but it's not working yet and I doubt this is the most efficient way to do this.
select
*
from
VIEW c
INNER JOIN
(
SELECT
Category,
MAX(CONVERT(DateTime, Convert(VarChar, UploadDate, 101))) as maxuploaddate
FROM
View
GROUP BY
Category,
UploadDate
) temp ON temp.Category = c.Category AND CONVERT(VarChar, UploadDate, 101) = temp.maxuploaddate
The problem lies in the nested selected statement as it's still grabbing all combinations of Category and Upload date. Is there a way to do a distinct on the Category and UploadDate, just getting the newest combination?
Thanks Again
Your query is close, but you have a mistake in the GROUP BY: grouping by UploadDate as well as Category returns every combination of the two. I'd also get rid of the date conversions; date comparisons work fine.
select
*
from
VIEW c
INNER JOIN
(
SELECT
Category,
MAX(UploadDate) as maxuploaddate
FROM
View
GROUP BY
Category
) temp ON temp.Category = c.Category AND UploadDate = temp.maxuploaddate
If you want to compare by date only, ignoring the time, you need to convert to a date first. In SQL Server syntax:
select *
from (select category, columna, columnb, uploaddate,
rank() over ( partition by category order by cast(uploaddate as date) desc) as seqnum
from view
) v
where seqnum = 1
In Oracle syntax:
select *
from (select category, columna, columnb, uploaddate,
rank() over ( partition by category order by to_char(uploaddate, 'YYYY-MM-DD') desc) as seqnum
from view
) v
where seqnum = 1
Because you want ties, these use rank() instead of row_number().
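The difference between the two functions can be seen in a short sketch. It uses Python's sqlite3 module, so SQLite's date() function stands in for cast(... as date) or to_char(...), and the table name uploads is hypothetical. The rows are adapted from the question, with the times for category b changed by a couple of seconds to mimic a slow load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uploads (category TEXT, uploaddate TEXT)")
conn.executemany(
    "INSERT INTO uploads VALUES (?, ?)",
    [
        ("a", "2013-01-30 04:04:04"),
        ("a", "2013-01-29 04:04:04"),
        ("b", "2013-01-28 01:23:04"),
        ("b", "2013-01-30 04:04:04"),
        ("b", "2013-01-30 04:04:06"),  # same day, two seconds later
        ("c", "2013-01-30 01:01:01"),
        ("c", "2013-01-30 01:01:01"),
    ],
)

# {fn} is replaced with RANK or ROW_NUMBER; everything else is identical.
template = """
    SELECT category, COUNT(*)
    FROM (SELECT category,
                 {fn}() OVER (PARTITION BY category
                              ORDER BY date(uploaddate) DESC) AS seqnum
          FROM uploads) v
    WHERE seqnum = 1
    GROUP BY category
    ORDER BY category
"""

# Number of rows kept per category by each function.
rank_counts = conn.execute(template.format(fn="RANK")).fetchall()
rownum_counts = conn.execute(template.format(fn="ROW_NUMBER")).fetchall()
print(rank_counts, rownum_counts)
```

With RANK, every row from the newest date gets seqnum 1, so categories b and c keep two rows each; with ROW_NUMBER only one row per category survives.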
In Oracle you can use Rank() to achieve this. Rank() creates a duplicate number if the same criteria are met.
Edit: And you can use Trunc() to "trim" the time from the uploaddate.
select *
from (select category, columna, columnb, uploaddate,
rank() over ( partition by category order by trunc(uploaddate) desc) rank
from view)
where rank = 1
Dense_Rank() also exists. It gives tied rows the same number as well, but it leaves no gaps in the numbering after a tie. Since only rank 1 is selected here, it would return the same rows. See this question for more info on the differences.

Over clause in SQL Server

I have the following query
select * from
(
SELECT distinct
rx.patid
,rx.fillDate
,rx.scriptEndDate
,MAX(datediff(day, rx.filldate, rx.scriptenddate)) AS longestScript
,rx.drugClass
,COUNT(rx.drugName) over(partition by rx.patid,rx.fillDate,rx.drugclass) as distinctFamilies
FROM [I 3 SCI control].dbo.rx
where rx.drugClass in ('h3a','h6h','h4b','h2f','h2s','j7c','h2e')
GROUP BY rx.patid, rx.fillDate, rx.scriptEndDate,rx.drugName,rx.drugClass
) r
order by distinctFamilies desc
which produces results that look like
This should mean that, for the patID in a given row, there are 5 unique drug names between the two dates in that row. However, when I run the following query:
select distinct *
from rx
where patid = 1358801781 and fillDate between '2008-10-17' and '2008-11-16' and drugClass='H4B'
I have a result set returned that looks like
You can see that while there are in fact five rows returned for the second query between the dates of 2008-10-17 and 2009-01-15, there are only three unique names. I've tried various ways of modifying the over clause, all with different levels of non-success. How can I alter my query so that I only find unique drugNames within the timeframe specified for each row?
Taking a shot at it:
SELECT DISTINCT
patid,
fillDate,
scriptEndDate,
MAX(DATEDIFF(day, fillDate, scriptEndDate)) AS longestScript,
drugClass,
MAX(rn) OVER(PARTITION BY patid, fillDate, drugClass) as distinctFamilies
FROM (
SELECT patid, fillDate, scriptEndDate, drugClass,rx.drugName,
DENSE_RANK() OVER(PARTITION BY patid, fillDate, drugClass ORDER BY drugName) as rn
FROM [I 3 SCI control].dbo.rx
WHERE drugClass IN ('h3a','h6h','h4b','h2f','h2s','j7c','h2e')
)x
GROUP BY x.patid, x.fillDate, x.scriptEndDate,x.drugName,x.drugClass,x.rn
ORDER BY distinctFamilies DESC
Not sure if DISTINCT is really necessary - left it in since you've used it.
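The core idea in this answer, taking the maximum DENSE_RANK() within a partition as a distinct count, can be checked with a simplified sketch. It uses Python's sqlite3 module, partitions by patid and drugClass only, and uses invented rows (five prescriptions, three distinct drug names), so it is an illustration of the technique rather than the full query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rx (patid INTEGER, drugClass TEXT, drugName TEXT)")
conn.executemany(
    "INSERT INTO rx VALUES (?, ?, ?)",
    [
        (1, "H4B", "drugA"),
        (1, "H4B", "drugA"),
        (1, "H4B", "drugB"),
        (1, "H4B", "drugC"),
        (1, "H4B", "drugC"),
    ],
)

sql = """
    SELECT DISTINCT
           patid,
           -- counts rows, so repeated names are counted again
           COUNT(drugName) OVER (PARTITION BY patid, drugClass) AS row_count,
           -- the highest dense rank equals the number of distinct names
           MAX(rn) OVER (PARTITION BY patid, drugClass) AS distinct_names
    FROM (SELECT patid, drugClass, drugName,
                 DENSE_RANK() OVER (PARTITION BY patid, drugClass
                                    ORDER BY drugName) AS rn
          FROM rx) x
"""

result = conn.execute(sql).fetchall()
print(result)
```

DENSE_RANK() gives equal names the same number with no gaps, so the largest number in the partition is the count of distinct names, which is what the windowed COUNT in the question could not provide.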