SQL count new values only with partition by - running count with no duplicates

Based on the table below, in Presto, I need a column that counts every new 'rid'. What I have managed so far is the same as what I can achieve with partition by, but it's not exactly what I'm looking for (db<>fiddle demo).
The goal is to have counts for many groupings, but I think this describes the problem sufficiently.
I need the data truncated by day, with a column for new users each day, as shown in the example below. In simple words: if a value repeats, don't count it. I've tried to find a correlation between this and the relational division problem, but I'm stuck.

You could use row_number() to rank the records of each rid by time; then you can aggregate and count in only the top record per group.
select
    date_trunc('day', t.time) dy,
    count(*) rid_count,
    sum(case when t.rn = 1 then 1 else 0 end) new_rid_count
from (
    select
        t.*,
        row_number() over(partition by t.rid order by t.time) rn
    from mytable t
) t
group by date_trunc('day', t.time)
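As a rough way to check the row_number() idea without a Presto cluster, here is a minimal sketch that runs the same query shape in SQLite through Python's sqlite3 module. The sample rows are made up, and date() stands in for Presto's date_trunc('day', ...), so treat it as an illustration rather than the Presto query itself.

```python
import sqlite3

def new_rid_counts(rows):
    """rows: iterable of (rid, 'YYYY-MM-DD HH:MM:SS') tuples (made-up sample shape)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mytable (rid INTEGER, time TEXT)")
    con.executemany("INSERT INTO mytable VALUES (?, ?)", rows)
    query = """
        SELECT date(t.time) AS dy,            -- SQLite stand-in for date_trunc('day', ...)
               count(*) AS rid_count,
               sum(CASE WHEN t.rn = 1 THEN 1 ELSE 0 END) AS new_rid_count
        FROM (
            SELECT m.*,
                   -- rn = 1 marks the first time each rid is ever seen
                   row_number() OVER (PARTITION BY m.rid ORDER BY m.time) AS rn
            FROM mytable m
        ) t
        GROUP BY date(t.time)
        ORDER BY dy
    """
    return con.execute(query).fetchall()

sample = [
    (1, "2021-01-01 09:00:00"),
    (2, "2021-01-01 10:00:00"),
    (1, "2021-01-02 09:00:00"),
    (3, "2021-01-02 11:00:00"),
    (2, "2021-01-03 08:00:00"),
]
result = new_rid_counts(sample)
print(result)
```

With this sample, the third day has one row but no new rid, so it still appears with a new count of 0.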

I think of this as two levels of aggregation. The inner one to get the earliest date. The outer to aggregate:
select first_day, count(*)
from (select rid, date_trunc('day', min(time))::date as first_day
from orders o
group by rid
) r
group by 1
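The two-level aggregation can be sketched the same way. This version assumes SQLite (via Python's sqlite3) and made-up sample rows; date(min(time)) replaces the date_trunc(...)::date expression, which SQLite does not have.

```python
import sqlite3

def first_day_counts(rows):
    """rows: iterable of (rid, 'YYYY-MM-DD HH:MM:SS') tuples (made-up sample shape)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (rid INTEGER, time TEXT)")
    con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    query = """
        SELECT first_day, count(*) AS new_rid_count
        FROM (
            -- inner level: earliest day on which each rid appears
            SELECT rid, date(min(time)) AS first_day
            FROM orders
            GROUP BY rid
        ) r
        GROUP BY first_day                    -- outer level: count rids per first day
        ORDER BY first_day
    """
    return con.execute(query).fetchall()

sample = [
    (1, "2021-01-01 09:00:00"),
    (2, "2021-01-01 10:00:00"),
    (1, "2021-01-02 09:00:00"),
    (3, "2021-01-02 11:00:00"),
    (2, "2021-01-03 08:00:00"),
]
result = first_day_counts(sample)
print(result)
```

One design difference from the row_number() version: a day with no new rid (2021-01-03 in this sample) produces no row at all instead of a row with 0.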

Related

count consecutive number of -1 in a column. count >=14

I'm trying to write a query that counts "-1" values that have occurred more than 14 times in a row. Can anyone help me here? I tried everything from lead to row number, but nothing is working out.
The BP is recorded every minute, and I need to find the IDs whose bp_level was "-1" for more than 14 minutes.
You may try the following:
Select Distinct B.Person_ID, B.[Consecutive]
From
(
Select D.person_ID, COUNT(D.bp_level) Over (Partition By D.grp, D.person_ID Order By D.Time_) [Consecutive]
From
(
Select Time_, Person_ID, bp_level,
DATEADD(Minute, -ROW_NUMBER() Over (Partition By Person_ID Order By Time_), Time_) grp
From mytable Where bp_level = -1
) D
) B
Where B.[Consecutive] >= 14
See a demo from db<>fiddle. Using SQL Server.
DATEADD(Minute, -ROW_NUMBER() Over (Partition By Person_ID Order By Time_), Time_): defines a unique group for consecutive times per person, where (bp_level = -1).
COUNT(D.bp_level) Over (Partition By D.grp, D.person_ID Order By D.Time_): computes the running count of bp_level rows in time order within each group.
Once a value other than -1 appears, the group splits into two groups and the counter restarts for the second group.
NOTE: this solution works only if there are no gaps between consecutive times, that is, the time increases by exactly one minute per row for each person. Otherwise the query will not work, but it can be modified to cover the gaps.
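The "time minus row number" grouping can be tried out in SQLite as a sketch. The assumptions here are mine: timestamps are stored as text, the minute number comes from strftime('%s', ...) / 60 instead of DATEADD, and the runs are counted with GROUP BY / HAVING rather than the running COUNT plus DISTINCT used above. The sample data is invented.

```python
import sqlite3
from datetime import datetime, timedelta

def long_runs(rows, min_len=14):
    """rows: (person_id, 'YYYY-MM-DD HH:MM:SS', bp_level); returns (person_id, run_length)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mytable (person_id INTEGER, time_ TEXT, bp_level INTEGER)")
    con.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    query = """
        SELECT person_id, count(*) AS consecutive
        FROM (
            SELECT person_id,
                   -- minute number minus row number is constant inside one unbroken run
                   CAST(strftime('%s', time_) AS INTEGER) / 60
                     - row_number() OVER (PARTITION BY person_id ORDER BY time_) AS grp
            FROM mytable
            WHERE bp_level = -1
        ) d
        GROUP BY person_id, grp
        HAVING count(*) >= ?
        ORDER BY person_id
    """
    return con.execute(query, (min_len,)).fetchall()

start = datetime(2022, 1, 1, 8, 0)

def minute(i):
    return (start + timedelta(minutes=i)).strftime("%Y-%m-%d %H:%M:%S")

# Person 1: 15 consecutive minutes at -1.
rows = [(1, minute(i), -1) for i in range(15)]
# Person 2: 10 minutes at -1, one reading of 1, then 10 more minutes at -1.
rows += [(2, minute(i), 1 if i == 10 else -1) for i in range(21)]
runs = long_runs(rows)
print(runs)
```

Person 2 has 20 readings of -1 in total, but the single reading of 1 splits them into two runs of 10, so only person 1 is returned.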
with data as (
select *,
count(case when bp_level <> -1 then 1 end) over
(partition by person_id order by time) as grp
from T
)
select distinct person_id
from data
where bp_level = -1
group by person_id, grp
having count(*) > 14; /* = or >= ? */
If you want to rely on timestamps rather than a count of rows then you could use the time difference:
...
-- where 1 = 1 /* all rows */
group by person_id, grp
having datediff(minute, min(time), max(time)) > 14;
The accepted answer would have issues with scenarios where there are multiple rows with the same timestamp if there's any potential for that to happen.
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=2ad6a1b515bb4091efba9b8831e5d579
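The conditional-count grouping from this answer can also be sketched in SQLite. This is an illustration under my own assumptions: the group key counts every reading other than -1, the threshold is >= 14, and the sample rows are invented. It counts rows, not elapsed minutes, so a missing minute does not break a run.

```python
import sqlite3
from datetime import datetime, timedelta

def long_runs_by_rows(rows, min_len=14):
    """rows: (person_id, 'YYYY-MM-DD HH:MM:SS', bp_level); returns (person_id, run_length)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (person_id INTEGER, time_ TEXT, bp_level INTEGER)")
    con.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    query = """
        WITH data AS (
            SELECT *,
                   -- running count of non -1 readings: it only changes when a run is broken
                   count(CASE WHEN bp_level <> -1 THEN 1 END)
                       OVER (PARTITION BY person_id ORDER BY time_) AS grp
            FROM t
        )
        SELECT person_id, count(*) AS run_length
        FROM data
        WHERE bp_level = -1
        GROUP BY person_id, grp
        HAVING count(*) >= ?
        ORDER BY person_id
    """
    return con.execute(query, (min_len,)).fetchall()

start = datetime(2022, 1, 1, 8, 0)

def minute(i):
    return (start + timedelta(minutes=i)).strftime("%Y-%m-%d %H:%M:%S")

# Person 1: 15 readings of -1 with minute 7 missing entirely.
rows = [(1, minute(i), -1) for i in range(16) if i != 7]
# Person 2: two runs of 10 separated by a reading of 1.
rows += [(2, minute(i), 1 if i == 10 else -1) for i in range(21)]
runs = long_runs_by_rows(rows)
print(runs)
```

Whether the missing minute for person 1 should break the run depends on the requirement; if it should, the datediff-based HAVING clause above is the variant to reach for.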

Using Subquery in Sequence function PrestoSQL

Use case -
I am trying to find weekly frequency of a customer from a dataset. Now, not all customers have "events" happening in all of the weeks, and I would need to fill them in with zero values for the "count" column.
I was trying to do this using the sequence function of PrestoSQL. However, this requires getting the value of the max week from the customer's orders itself (I don't want to hardcode this, since the result goes into a BI tool and I don't want to update it manually every week).
with all_orders_2020 as (select customer, cast(date_parse(orderdate, '%Y-%m-%d') as date) as order_date
from orders
where orderdate > '2020-01-01' and customer in (select customer from some_customers)),
orders_with_week_number as (select *, week(order_date) as week_number from all_orders_2020),
weekly_count as (select customer, week_number, count(*) as ride_count from orders_with_week_number
where customer = {{some_customer}} group by customer, week_number)
SELECT
week_number
FROM
(VALUES
(SEQUENCE(1,(select max(week_number) from weekly_count)))
) AS t1(week_array)
CROSS JOIN
UNNEST(week_array) AS t2(week_number)
Presto complains about this, saying:
Unexpected subquery expression in logical plan: (SELECT "max"(week_number)
FROM
weekly_count
)
Any clues how this can be done?
Had a similar use case and followed the example from here: https://docs.aws.amazon.com/athena/latest/ug/flattening-arrays.html
Bring the SEQUENCE out and define the subquery using a WITH clause:
WITH dataset AS (
SELECT SEQUENCE(1, (SELECT MAX(week_number) FROM weekly_count)) AS week_array
)
SELECT week_number FROM dataset
CROSS JOIN UNNEST(week_array) as t(week_number)
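To show how the generated week numbers feed the zero-filling step, here is a sketch in SQLite. SQLite has neither SEQUENCE nor UNNEST, so a recursive CTE is swapped in to generate 1..max(week_number); the weekly_count table and its rows are invented for the example.

```python
import sqlite3

def weekly_counts_with_zeros(rows):
    """rows: (week_number, ride_count) pairs for one customer (made-up sample shape)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE weekly_count (week_number INTEGER, ride_count INTEGER)")
    con.executemany("INSERT INTO weekly_count VALUES (?, ?)", rows)
    query = """
        WITH RECURSIVE
        bounds(max_week) AS (
            SELECT max(week_number) FROM weekly_count
        ),
        weeks(week_number) AS (
            -- recursive CTE standing in for SEQUENCE(1, max) + UNNEST
            SELECT 1
            UNION ALL
            SELECT week_number + 1 FROM weeks, bounds WHERE week_number < max_week
        )
        SELECT w.week_number, coalesce(c.ride_count, 0) AS ride_count
        FROM weeks w
        LEFT JOIN weekly_count c ON c.week_number = w.week_number
        ORDER BY w.week_number
    """
    return con.execute(query).fetchall()

filled = weekly_counts_with_zeros([(1, 3), (2, 1), (5, 4)])
print(filled)
```

In Presto the weeks CTE would be the SEQUENCE / UNNEST query above; the LEFT JOIN with coalesce is the part that fills in the zeros.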

SQL: transposing a time series table into a start-end time table if an event occur

I am trying to use a select statement to create a view that transposes a table of datetime readings into one row per event, holding the start and end time of each run of consecutive values (ordered by time, partitioned by station) where the 'record' field is not 0.
Here is a sample of the initial table.
And how it should look like after transposing.
Can anyone help?
You can use the conditional_change_event analytical function to create a special grouping identifier to split these out in a simple query:
select row_number() over () unique_id,
station,
min(datetime) startdate,
max(datetime) enddate
from (
select t.*, CONDITIONAL_CHANGE_EVENT(decode(recording,0,0,1))
over (partition by station order by datetime) chg
from mytable t
) x
where recording > 0
group by station, chg
order by 1, 2
The decode is just to set up your islands and gaps (where gaps are recording <= 0 and islands are recording > 0). Then the change event on that will generate a new identifier for grouping. Also note that I am grouping on the change event even though it isn't part of the output.
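CONDITIONAL_CHANGE_EVENT is not available on most engines, so here is a sketch of how the same grouping identifier could be emulated with lag() and a running sum. It runs in SQLite through Python's sqlite3; the emulation, the table layout and the sample rows are my assumptions, not part of the answer above.

```python
import sqlite3

def recording_intervals(rows):
    """rows: (station, 'YYYY-MM-DD HH:MM:SS', recording); returns (station, start, end)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mytable (station TEXT, datetime TEXT, recording INTEGER)")
    con.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    query = """
        WITH flagged AS (
            SELECT station, datetime, recording,
                   CASE WHEN recording = 0 THEN 0 ELSE 1 END AS active  -- like decode(recording,0,0,1)
            FROM mytable
        ),
        changes AS (
            SELECT *,
                   -- 1 whenever the flag differs from the previous row of the same station
                   CASE WHEN active <> lag(active) OVER (PARTITION BY station ORDER BY datetime)
                        THEN 1 ELSE 0 END AS changed
            FROM flagged
        ),
        grouped AS (
            SELECT *,
                   sum(changed) OVER (PARTITION BY station ORDER BY datetime) AS chg
            FROM changes
        )
        SELECT station, min(datetime) AS startdate, max(datetime) AS enddate
        FROM grouped
        WHERE recording > 0
        GROUP BY station, chg
        ORDER BY station, startdate
    """
    return con.execute(query).fetchall()

sample = [
    ("A", "2022-01-01 10:00:00", 0),
    ("A", "2022-01-01 10:05:00", 3),
    ("A", "2022-01-01 10:10:00", 2),
    ("A", "2022-01-01 10:15:00", 0),
    ("A", "2022-01-01 10:20:00", 5),
    ("B", "2022-01-01 10:00:00", 1),
    ("B", "2022-01-01 10:05:00", 1),
]
intervals = recording_intervals(sample)
print(intervals)
```

Station A yields two intervals because the zero reading at 10:15 changes the flag twice, which bumps the running sum and starts a new group.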
ROW_NUMBER() is the best for partitioning. Next, you can do a self join on the partitioned tables to see if the difference between times is greater than five minutes. I think the best solution is to partition on the rolling sum of the timestamp difference, offset by 5 minutes based on your pattern. If the five minutes is not a regular pattern then there is probably a generalized approach that can be used with the zeroes.
Solution written as a CTE below for easy view creation (it's a slow view though).
WITH partitioned as (
SELECT datetime, station, recording,
ROW_NUMBER() OVER(PARTITION BY station
ORDER BY datetime ASC) rn
FROM mytable --Not sure what the tablename is
WHERE recording != 0),
diffed as (
SELECT a.datetime, a.station,
DATEDIFF(mi,ISNULL(b.datetime,a.datetime),a.datetime)-5 Difference
--The ISNULL logic is for when a.datetime is the beginning of the block,
--so that the DATEDIFF part is 0
FROM partitioned a
LEFT JOIN partitioned b on a.rn = b.rn + 1 and a.station=b.station),
cumulative as (
SELECT a.datetime, a.station, SUM(b.difference) offset_grouping
FROM diffed a
LEFT JOIN diffed b on a.datetime >= b.datetime and a.station = b.station
GROUP BY a.datetime, a.station),
ordered as (SELECT datetime,station,offset_grouping,
ROW_NUMBER() OVER(PARTITION BY station,offset_grouping ORDER BY datetime asc) starter,
ROW_NUMBER() OVER(PARTITION BY station,offset_grouping ORDER BY datetime desc) ender
FROM cumulative)
SELECT ROW_NUMBER() OVER(ORDER BY a.datetime) unique_id,a.station,a.datetime startdate, b.datetime enddate
FROM ordered a
JOIN ordered b on a.starter = b.ender and a.station=b.station
and a.offset_grouping=b.offset_grouping and a.starter=1
This is the only solution I can think of but again, it's slow depending on the amount of data you have.

Aggregates for today and the previous day depending on data

Having trouble putting together a query to pull the aggregate values of a given timestamp and the timestamp before it. Given the following schema:
name TEXT,
ts TIMESTAMP,
X NUMERIC,
Y NUMERIC
where there are gaps in the ts column due to gaps in data, I'm trying to construct a query to produce
name,
date_trunc('day', q1.ts),
avg(q1.X),
sum(q1.Y),
date_trunc('day', q2.ts),
avg(q2.X),
sum(q2.Y)
The first half is straightforward:
SELECT q1.name, date_trunc('day', q1.ts), avg(q1.X), sum(q1.Y)
FROM data as q1
GROUP BY 1, 2
ORDER BY 1, 2;
But not sure how to generate the relation to find the "day" before for each row. I'm trying to work an inner join like this:
SELECT q1.name, q1.day, q1.avg, q1.sum, q2.day, q2.avg, q2.sum
FROM (
SELECT name, date_trunc('day', ts) AS day, avg(X) AS avg, sum(Y) as sum
FROM data
GROUP BY 1,2
ORDER BY 1,2
) q1 INNER JOIN (
SELECT name, date_trunc('day', ts) AS day, avg(X) AS avg, sum(Y) as sum
FROM data
GROUP BY 1,2
ORDER BY 1,2
) q2 ON (
q1.name = q2.name
AND q2.day = q1.day - interval '1 day'
);
The problem with this is that it doesn't cover the cases where the previous "day" is more than 1 day before the current day.
The special difficulty here is that you need to number days after aggregating rows. You can do this in a single query level with the window function row_number(), since window functions are applied after aggregation by GROUP BY.
Also, use a CTE to avoid executing the same subquery multiple times:
WITH q AS (
SELECT name, ts::date AS day
,avg(x) AS avg_x, sum(y) AS sum_y
,row_number() OVER (PARTITION BY name ORDER BY ts::date) AS rn
FROM data
GROUP BY 1,2
)
SELECT q1.name, q1.day, q1.avg_x, q1.sum_y
,q2.day AS day2, q2.avg_x AS avg_x2, q2.sum_y AS sum_y2
FROM q q1
LEFT JOIN q q2 ON q1.name = q2.name
AND q1.rn = q2.rn + 1
ORDER BY 1,2;
This uses the simpler cast to date (ts::date) instead of date_trunc('day', ts) to get "days".
LEFT [OUTER] JOIN (as opposed to [INNER] JOIN) is instrumental to preserve the corner case of the first row, where there is no previous day.
And ORDER BY should be applied to the outer query.
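The "number the aggregated days, then self-join on rn = rn + 1" idea can be checked with a small SQLite sketch. For caution this version aggregates in one CTE and applies row_number() in a second one, instead of doing both at a single query level as the PostgreSQL query does; date(ts) replaces ts::date, and the sample rows are invented.

```python
import sqlite3

def with_previous_day(rows):
    """rows: (name, 'YYYY-MM-DD HH:MM:SS', x, y); pairs each day with the previous day that has data."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE data (name TEXT, ts TEXT, x NUMERIC, y NUMERIC)")
    con.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
    query = """
        WITH daily AS (
            SELECT name, date(ts) AS day, avg(x) AS avg_x, sum(y) AS sum_y
            FROM data
            GROUP BY name, date(ts)
        ),
        q AS (
            -- number the days that actually have data, per name
            SELECT *, row_number() OVER (PARTITION BY name ORDER BY day) AS rn
            FROM daily
        )
        SELECT q1.name, q1.day, q1.sum_y, q2.day AS day2, q2.sum_y AS sum_y2
        FROM q q1
        LEFT JOIN q q2 ON q1.name = q2.name AND q1.rn = q2.rn + 1
        ORDER BY q1.name, q1.day
    """
    return con.execute(query).fetchall()

sample = [
    ("a", "2013-12-01 08:00:00", 1, 10),
    ("a", "2013-12-01 20:00:00", 3, 20),
    ("a", "2013-12-04 09:00:00", 5, 7),   # two-day gap before this row
]
pairs = with_previous_day(sample)
print(pairs)
```

The row for 2013-12-04 is paired with 2013-12-01 despite the gap, and the first day keeps NULLs for the previous-day columns because of the LEFT JOIN.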
The question isn't crystal clear, but it sounds like you're actually trying to fill gaps while keeping track of leading/lagging rows.
To fill the gaps, look into generate_series() and left join it with your table:
select d
from generate_series(timestamp '2013-12-01', timestamp '2013-12-31', interval '1 day') d;
http://www.postgresql.org/docs/current/static/functions-srf.html
For previous and next row values, look into lead() and lag() window functions:
select date_trunc('day', ts) as curr_row_day,
lag(date_trunc('day', ts)) over w as prev_row_day
from data
window w as (order by ts)
http://www.postgresql.org/docs/current/static/tutorial-window.html

Last day of the month with a twist in SQLPLUS

I would appreciate a little expert help please.
in an SQL SELECT statement I am trying to get the last day with data per month for the last year.
Example, I am easily able to get the last day of each month and join that to my data table, but the problem is, if the last day of the month does not have data, then there is no returned data. What I need is for the SELECT to return the last day with data for the month.
This is probably easy to do, but to be honest, my brain fart is starting to hurt.
I've attached the select below that works for returning the data for only the last day of the month for the last 12 months.
Thanks in advance for your help!
SELECT fd.cust_id,fd.server_name,fd.instance_name,
TRUNC(fd.coll_date) AS coll_date,fd.column_name
FROM super_table fd,
(SELECT TRUNC(daterange,'MM')-1 first_of_month
FROM (
select TRUNC(sysdate-365,'MM') + level as DateRange
from dual
connect by level<=365)
GROUP BY TRUNC(daterange,'MM')) fom
WHERE fd.cust_id = :CUST_ID
AND fd.coll_date > SYSDATE-400
AND TRUNC(fd.coll_date) = fom.first_of_month
GROUP BY fd.cust_id,fd.server_name,fd.instance_name,
TRUNC(fd.coll_date),fd.column_name
ORDER BY fd.server_name,fd.instance_name,TRUNC(fd.coll_date)
You probably need to group your data so that each month's data is in the group, and then within the group select the maximum date present. The sub-query might be:
SELECT MAX(coll_date) AS last_day_of_month
FROM Super_Table AS fd
GROUP BY YEAR(coll_date) * 100 + MONTH(coll_date);
This presumes that the functions YEAR() and MONTH() exist to extract the year and month from a date as an integer value. Clearly, this doesn't constrain the range of dates - you can do that, too. If you don't have the functions in Oracle, then you do some sort of manipulation to get the equivalent result.
Using information from Rhose (thanks):
SELECT MAX(coll_date) AS last_day_of_month
FROM Super_Table fd
GROUP BY TO_CHAR(coll_date, 'YYYYMM');
This achieves the same net result, putting all dates from the same calendar month into a group and then determining the maximum value present within that group.
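A quick way to see the month grouping at work is the following SQLite sketch, run through Python's sqlite3. strftime('%Y%m', ...) plays the role of TO_CHAR(coll_date, 'YYYYMM'), date() plays the role of TRUNC(), and the sample dates are invented.

```python
import sqlite3

def last_days_with_data(dates):
    """dates: iterable of 'YYYY-MM-DD HH:MM:SS' strings (made-up sample shape)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE super_table (coll_date TEXT)")
    con.executemany("INSERT INTO super_table VALUES (?)", [(d,) for d in dates])
    query = """
        SELECT max(date(coll_date)) AS last_day_with_data
        FROM super_table
        GROUP BY strftime('%Y%m', coll_date)   -- one group per calendar month
        ORDER BY last_day_with_data
    """
    return [row[0] for row in con.execute(query)]

sample = [
    "2023-01-15 06:00:00",
    "2023-01-30 06:00:00",   # no data on 2023-01-31
    "2023-02-28 06:00:00",
    "2023-02-28 18:00:00",
    "2023-03-10 06:00:00",
]
last_days = last_days_with_data(sample)
print(last_days)
```

January returns the 30th because nothing was collected on the 31st, which is the behavior the question asks for.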
Here's another approach, if ANSI row_number() is supported:
with RevDayRanked(itemDate,rn) as (
select
cast(coll_date as date),
row_number() over (
partition by datediff(month,coll_date,'2000-01-01') -- rewrite datediff as needed for your platform
order by coll_date desc
)
from super_table
)
select itemDate
from RevDayRanked
where rn = 1;
Rows numbered 1 will be nondeterministically chosen among rows on the last active date of the month, so you don't need distinct. If you want information out of the table for all rows on these dates, use rank() over days instead of row_number() over coll_date values, so a value of 1 appears for any row on the last active date of the month, and select the additional columns you need:
with RevDayRanked(cust_id, server_name, coll_date, rk) as (
select
cust_id, server_name, coll_date,
rank() over (
partition by datediff(month,coll_date,'2000-01-01')
order by cast(coll_date as date) desc
)
from super_table
)
select cust_id, server_name, coll_date
from RevDayRanked
where rk = 1;
If row_number() and rank() aren't supported, another approach is this (for the second query above). Select all rows from your table for which there's no row in the table from a later day in the same month.
select
cust_id, server_name, coll_date
from super_table as ST1
where not exists (
select *
from super_table as ST2
where datediff(month,ST1.coll_date,ST2.coll_date) = 0
and cast(ST2.coll_date as date) > cast(ST1.coll_date as date)
)
If you have to do this kind of thing a lot, see if you can create an index over computed columns that hold cast(coll_date as date) and a month indicator like datediff(month,'2001-01-01',coll_date). That'll make more of the predicates SARGs.
Putting the above pieces together, would something like this work for you?
SELECT fd.cust_id,
fd.server_name,
fd.instance_name,
TRUNC(fd.coll_date) AS coll_date,
fd.column_name
FROM super_table fd
WHERE fd.cust_id = :CUST_ID
AND TRUNC(fd.coll_date) IN (
SELECT MAX(TRUNC(coll_date))
FROM super_table
WHERE coll_date > SYSDATE - 400
AND cust_id = :CUST_ID
GROUP BY TO_CHAR(coll_date,'YYYYMM')
)
GROUP BY fd.cust_id,fd.server_name,fd.instance_name,TRUNC(fd.coll_date),fd.column_name
ORDER BY fd.server_name,fd.instance_name,TRUNC(fd.coll_date)