SQL Server - count distinct over function or row_number with rows window function - sql

I am currently trying to get a distinct count of customers over a 90-day rolling period. I got the rolling amount using SUM with an OVER clause. However, when I try the same with COUNT(DISTINCT ...), SQL Server doesn't support it as a window function.
I also attempted ROW_NUMBER() with an OVER clause and a frame of ROWS BETWEEN 89 PRECEDING AND CURRENT ROW, but a ROWS frame isn't allowed with ranking functions either.
I would greatly appreciate any suggested workaround for this problem.
I have attempted two approaches, both of which fail because of the limitations outlined above.
Approach 1
select date
,count(distinct customer_id) over (order by date rows between 89 preceding and current row) as cust_count_distinct
from table
Approach 2
select date
,customer_id
,row_number() over (partition by customer_id order by date rows between 89 preceding and current row) as rn
from table
-- was then going to filter for rn = 1, but a ROWS frame is not possible with ranking window functions.

The simplest method is a correlated subquery of some sort:
select d.date, c.cnt
from (select distinct date from t) d cross apply
     (select count(distinct customer_id) as cnt
      from t t2
      where t2.date >= dateadd(day, -89, d.date) and
            t2.date <= d.date
     ) c;
This is not particularly efficient (it can be a killer even on a medium-sized data set), but it might serve your needs.
You can restrict the dates being returned in order to test whether it works.
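For instance, here is a minimal sketch of that idea (the cutoff date is hypothetical), narrowing only the outer list of dates while the inner count still scans the full history:
select d.date, c.cnt
from (select distinct date
      from t
      where date >= '20240101'  -- hypothetical cutoff: test on recent dates only
     ) d cross apply
     (select count(distinct customer_id) as cnt
      from t t2
      where t2.date >= dateadd(day, -89, d.date) and
            t2.date <= d.date
     ) c;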

Related

can we get total count and last record from postgresql

I have a table with 23 records. I am trying to get the total count of records and also the last record in a single query, something like:
select count(*), (m order by createdDate) from music m;
Is there any way to pull out only the last record as well as the total count in PostgreSQL?
This can be done using window functions:
select *
from (
  select m.*,
         row_number() over (order by createddate desc) as rn,
         count(*) over () as total_count
  from music m
) t
where rn = 1;
Another option would be to use a scalar sub-query and combine it with a limit clause:
select *,
       (select count(*) from music) as total_count
from music
order by createddate desc
limit 1;
Depending on the indexes, your memory configuration and the table definition, this might be faster than the two window functions.
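As a rough sketch (assuming PostgreSQL and that createddate is the sort column; the index name is made up), an index like the following would let the limit query read a single row from the index instead of sorting the whole table:
-- hypothetical index; lets "order by createddate desc limit 1" stop after one row
create index idx_music_createddate on music (createddate desc);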
No, it's not possible to do what is being asked; SQL does not function that way. The second you ask for a count(), SQL changes the level of your data to an aggregation. The only way to do what you are asking is to do a count() and an order by in a separate query.
Another solution using windowing functions and no subquery:
SELECT DISTINCT count(*) OVER w, last_value(m) OVER w
FROM music m
WINDOW w AS (ORDER BY createddate DESC RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
The point here is that last_value applies on partitions defined by windows and not on groups defined by GROUP BY.
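Note that the explicit frame clause is what makes this work: the default frame ends at the current row, so last_value(m) would simply return each row itself. A sketch of the variant that does not work as intended:
-- default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):
-- last_value(m) here returns the current row, not the last row of the set
SELECT DISTINCT count(*) OVER w, last_value(m) OVER w
FROM music m
WINDOW w AS (ORDER BY createddate DESC);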
I did not perform any tests, but I suspect my solution is the least efficient of the three already posted. It is, however, the closest to your example query so far.

Define groups of rows by logic

I have a unique scenario for which I can't find a solution, so I thought to ask the experts :)
I have a query that returns a course syllabus, in which each row represents a day of training. You can see in the picture below that there are rest days in the middle of the training.
I can't find a way to group each run of consecutive training days.
Please see the screenshot below detailing the rows and what I want to achieve.
I am using MS-SQL 2014.
Here is a Fiddle with the data I have and the expected results:
SQL Fiddle
The simplest method is a difference of row_number(). The following identifies each consecutive group with a number:
select td.*,
       dense_rank() over (order by dateadd(day, -seqnum, DayOfTraining)) as grpnum
from (select td.*,
             row_number() over (order by DayOfTraining) as seqnum
      from TrainingDays td
     ) td;
The key idea is that subtracting a sequence from consecutive days produces a constant for those days.
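To see why, here is a hypothetical set of training dates together with the seqnum from the inner query and the result of the subtraction:
-- DayOfTraining   seqnum   dateadd(day, -seqnum, DayOfTraining)
-- 2014-01-01      1        2013-12-31
-- 2014-01-02      2        2013-12-31   -- same constant: same group
-- 2014-01-03      3        2013-12-31
-- 2014-01-06      4        2014-01-02   -- gap: new constant, new group
-- 2014-01-07      5        2014-01-02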
Here is the SQL Fiddle.
After much trial and error, this is the closest I could come up with:
http://rextester.com/ECBQ88563
The problem here is that if the last row belongs to another group, it will still be lumped in with the previous group. So in your sample, if you change the last date from 19 to 20, the output will still be the same. Maybe with another condition we can eliminate that. Other than that, this should work.
SELECT DayOfTraining1,
       dense_rank() over (ORDER BY grp_dt) AS grp
FROM
  (SELECT DayOfTraining1,
          min(DayOfTraining) AS grp_dt
   FROM
     (SELECT trng.DayOfTraining AS DayOfTraining1,
             dd.DayOfTraining
      FROM trng
      CROSS JOIN
        (SELECT d.*
         FROM
           (SELECT trng.*,
                   lag(DayOfTraining, 1) OVER (ORDER BY DayOfTraining) AS prev_DayOfTraining,
                   lead(DayOfTraining, 1) OVER (ORDER BY DayOfTraining) AS nxt_DayOfTraining,
                   datediff(DAY, lag(DayOfTraining, 1) OVER (ORDER BY DayOfTraining), DayOfTraining) AS ddf
            FROM trng
           ) d
         WHERE d.ddf <> 1
            OR nxt_DayOfTraining IS NULL
        ) dd
      WHERE trng.DayOfTraining <= dd.DayOfTraining
     ) t
   GROUP BY DayOfTraining1
  ) t1;
Explanation: The inner query d uses the lag and lead functions to capture the previous and next rows' values. We then take the difference in days and capture the dates where the difference is not 1 (plus the final row); these are the dates where the group should switch. The derived table dd holds these boundary dates.
Now cross join this with the main table and use an aggregate function to determine the contiguous groups (it took me many attempts to achieve this).
Then use the dense_rank function on it to get the group number.

Filtering by analytic function results without subquery or subtable

I'm working on a Netsuite project, which has limited SQL capabilities. It's difficult to test as I am basically guessing at the SQL they are building in their GUI.
I'd like to filter the results of a query down to the rows where a cumulative sum is negative.
Is the following a valid PL/SQL construct (barring any small syntactical errors)?
SELECT SUM(amount) OVER (PARTITION BY itemid ORDER BY date ROWS UNBOUNDED PRECEDING) AS "sum"
FROM table
WHERE sum < 0
Secondly, due to limitations in Netsuite, is the following a valid construct?
SELECT SUM(amount) OVER (PARTITION BY itemid ORDER BY date ROWS UNBOUNDED PRECEDING) AS "sum"
FROM table
WHERE SUM(amount) OVER (PARTITION BY itemid ORDER BY date ROWS UNBOUNDED PRECEDING) < 0
Oracle's documentation suggests that neither of these is valid and that filtering on an analytic function should be done via a subquery, but some Google Groups threads and other websites suggest otherwise. Most of them, however, use the RANK() and DENSE_RANK() functions in their examples, which may behave differently.
To filter on the result of analytic functions, you have to use inline views (subqueries in the from clause).
For example, your query might look like:
select *
from (
select itemid,
date,
sum(amount) over (
partition by itemid
order by date
rows between unbounded preceding
and current row
) as run_sum
from table
)
where run_sum < 0
This will show all items, and the associated dates, on which the running sum for that item was less than zero (if there are any such dates for a given item).

oracle sql to get min timestamp when the count of results is larger than a number

In order to improve performance, I need a SQL query that implements the following requirement.
If there is a table with the following columns:
id, timestamp, value
how can I get the min timestamp (e.g. :t1) such that the count of earlier rows is > 100000? That is, the count(*) from the following SQL will be > 100000:
select count(*) from table where timestamp < :t1
My understanding of your question is: Find the earliest timestamp in the table for which there are at least 100,000 earlier rows.
There are probably many ways to do it; the main difficulty is trying to come up with an efficient one.
I think an analytic-function approach is most likely to work well. The most obvious choice is to use COUNT:
select min(timestamp)
from (
  select timestamp,
         count(*) over (order by timestamp rows between unbounded preceding and 1 preceding) earlier_rows
  from table
)
where earlier_rows >= 100000
But I suspect using RANK or something similar will be faster:
select min(timestamp)
from (
  select timestamp,
         rank() over (order by timestamp) time_rank
  from table
)
where time_rank > 100000
I'm not sure off the top of my head, but these may give slightly different results if there are duplicate timestamps.
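If you want a purely positional count regardless of ties (so the answer is well-defined even with duplicate timestamps), a row_number() variant, sketched against the same assumed table, would be:
select min(timestamp)
from (
  select timestamp,
         row_number() over (order by timestamp) rn  -- positional: every row gets a distinct number
  from table
)
where rn > 100000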
This will give you the min and max values and the count:
select count(*),
       min(t.timestamp),
       max(t.timestamp)
from table t
where (select count(*) from table t2 where t2.timestamp < :t1) > 100000

A SQL query to work on a table with subgroups

I have a large Oracle DB table which contains nearly 200 million rows. It has only three columns: a subscriber id, a date, and an offer id.
For each row in this table, I need to find whether this row has any corresponding rows in the table such that:
1) They belong to the same subscriber (same subscriber id)
2) They are a certain distance in the past relative to the current row (for example, if our current row is A, a row B with the same subscriber id should satisfy A.date > B.date >= A.date - 30 (days))
3) In addition to 2), we will have to query for a specific offer id as well: (A.date > B.date >= A.date - 30 and B.offerid = some_id)
I am aware of the Oracle analytic functions lag and lead and I planned to use them for this purpose. These functions return the values of fields above or below the current row in the ordered table, according to some given ordering. The disturbing thing is that the number of rows sharing a subscriber id varies, up to 84. If I ORDER BY (SUBSCRIBER_ID, DATE) and use the lag function, then for each row I need to check the 84 rows above the current one just to make sure they share the same SUBSCRIBER_ID. Since some subscriber id subgroups have only around 3-4 rows, this amount of unnecessary row access is wasteful.
How can I accomplish this job without needing to check 84 rows each time, for each row? Does Oracle support any method which works solely on the subgroups generated by a GROUP BY statement?
One option is to use a self-join like this:
SELECT t1.*, NVL2(t2.subscriber_id, 'Yes', 'No') AS match_found
FROM myTable t1
LEFT JOIN myTable t2
       ON t1.subscriber_id = t2.subscriber_id
      AND t1.date > t2.date AND t2.date >= t1.date - 30
      AND t2.offerid = <filter_offer_id>
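As a variant sketch (same hypothetical table and placeholder offer id), an EXISTS check expresses the same test and avoids duplicating t1 rows when a subscriber has several matching earlier rows:
SELECT t1.*,
       CASE WHEN EXISTS (SELECT 1
                         FROM myTable t2
                         WHERE t2.subscriber_id = t1.subscriber_id
                           AND t1.date > t2.date
                           AND t2.date >= t1.date - 30
                           AND t2.offerid = <filter_offer_id>)
            THEN 'Yes' ELSE 'No' END AS match_found
FROM myTable t1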
Actually the analytic function COUNT(*) in Oracle did the necessary work for me. I used the following structure:
SELECT SUBSCRIBER_ID,
       SEGMENTATION_DATE,
       OFFER_ID,
       COUNT(*) OVER (PARTITION BY SUBSCRIBER_ID ORDER BY SEGMENTATION_DATE
                      RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS SENDEVER,
       COUNT(*) OVER (PARTITION BY SUBSCRIBER_ID ORDER BY SEGMENTATION_DATE
                      RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING) AS SEND1M,
       COUNT(CASE WHEN (OFFER_ID = 580169) THEN 1 ELSE NULL END)
             OVER (PARTITION BY SUBSCRIBER_ID ORDER BY SEGMENTATION_DATE
                   RANGE BETWEEN 180 PRECEDING AND 1 PRECEDING) AS SEND6M580169
FROM myTable
PARTITION BY groups the table according to the SUBSCRIBER_ID field, and by using the proper RANGE BETWEEN clause on each group's rows, I pick only the ones whose dates fall in the desired time interval.
By using a CASE WHEN expression on the OFFER_ID field, I further filter the rows in the current SUBSCRIBER_ID group and throw out all rows with the wrong offer id.
The nice thing is that no self join is needed here, reducing the cost of the operation by an order of magnitude.
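To get back to the original yes/no question per row, one could wrap the analytic query, for instance (a sketch; the outer flag name is made up):
SELECT s.*,
       CASE WHEN SEND6M580169 > 0 THEN 'Yes' ELSE 'No' END AS HAS_OFFER_580169_6M
FROM (SELECT SUBSCRIBER_ID,
             SEGMENTATION_DATE,
             OFFER_ID,
             COUNT(CASE WHEN (OFFER_ID = 580169) THEN 1 ELSE NULL END)
                   OVER (PARTITION BY SUBSCRIBER_ID ORDER BY SEGMENTATION_DATE
                         RANGE BETWEEN 180 PRECEDING AND 1 PRECEDING) AS SEND6M580169
      FROM myTable) s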