SQL Count Consecutive Rows - sql

I have a SQL table in which I need to count the rows with 0 turnover, but the challenge is that the count resets. I only need the number of consecutive zero-turnover rows since the ID last generated any turnover.
Source data (there are many different IDs; I'm just using 442 and 4500 in this case):
ID |Date | T/O |
442 |2019-12-31 | 0 |
442 |2020-01-01 |200.00|
442 |2020-01-02 | 0 |
442 |2020-02-06 | 0 |
442 |2020-02-07 | 0 |
442 |2020-02-08 | 0 |
442 |2020-02-09 |150.00|
442 |2020-02-10 | 0 |
442 |2020-02-11 | 0 |
442 |2020-02-15 | 0 |
4500 |2020-01-01 | 0 |
Intended results:
442 | 3 |
4500 | 1 |
I thought of using LAG(), but the number of rows between turnover generated can vary significantly. Sometimes it can be even 30+ rows.

SELECT id, COUNT(*) as [result]
FROM SourceData sd1
WHERE t_o=0
AND NOT EXISTS (SELECT 1
FROM SourceData sd2
WHERE sd1.id=sd2.id AND t_o != 0 AND sd2.[Date] > sd1.[Date])
GROUP BY id

First we can get the last non-zero date for each id.
select id, max(date) as date
from example
where t_o > 0
group by id
This will not show a value for 4500 because it lacks a non-zero value.
Then we can use this to select and group only the values after those dates, or all rows if there was no non-zero date for an id.
with last_to as(
select id, max(date) as date
from example
where t_o > 0
group by id
)
select example.id, count(example.t_o)
from example
-- Use a left join to get all ids in example,
-- even those missing from last_to.
left join last_to on last_to.id = example.id
-- Account for the lack of a last_to row
-- if the ID has no non-zero values.
where last_to.date is null
or example.date > last_to.date
group by example.id
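Both queries above can be checked against the sample data from the question. The following is only a sketch: it runs them in SQLite through Python's sqlite3 module rather than on the asker's server, and the table name SourceData and the column names id, date and t_o are assumptions made for the example.

```python
import sqlite3

# Sample rows copied from the question. Table and column names
# (SourceData, id, date, t_o) are assumptions made for this sketch.
ROWS = [
    (442, "2019-12-31", 0), (442, "2020-01-01", 200.0),
    (442, "2020-01-02", 0), (442, "2020-02-06", 0),
    (442, "2020-02-07", 0), (442, "2020-02-08", 0),
    (442, "2020-02-09", 150.0), (442, "2020-02-10", 0),
    (442, "2020-02-11", 0), (442, "2020-02-15", 0),
    (4500, "2020-01-01", 0),
]

# First answer: count zero rows that have no later non-zero row.
NOT_EXISTS_SQL = """
SELECT sd1.id, COUNT(*) AS result
FROM SourceData sd1
WHERE sd1.t_o = 0
  AND NOT EXISTS (SELECT 1
                  FROM SourceData sd2
                  WHERE sd2.id = sd1.id
                    AND sd2.t_o <> 0
                    AND sd2.date > sd1.date)
GROUP BY sd1.id
ORDER BY sd1.id
"""

# Second answer: find the last non-zero date per id, then count later rows.
LAST_TO_SQL = """
WITH last_to AS (
  SELECT id, MAX(date) AS date
  FROM SourceData
  WHERE t_o > 0
  GROUP BY id
)
SELECT s.id, COUNT(s.t_o) AS result
FROM SourceData s
LEFT JOIN last_to ON last_to.id = s.id
WHERE last_to.date IS NULL OR s.date > last_to.date
GROUP BY s.id
ORDER BY s.id
"""

def trailing_zero_counts(sql):
    """Load the sample rows into an in-memory database and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE SourceData (id INTEGER, date TEXT, t_o REAL)")
    conn.executemany("INSERT INTO SourceData VALUES (?, ?, ?)", ROWS)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result

not_exists_result = trailing_zero_counts(NOT_EXISTS_SQL)
last_to_result = trailing_zero_counts(LAST_TO_SQL)
print(not_exists_result)
print(last_to_result)
```

On this data both queries give 3 for ID 442 and 1 for ID 4500, matching the intended results. The dates are stored as ISO text here, so the comparisons are string comparisons; a real date column would behave the same way for ordering.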

Related

SQL Day-over-Day count miscalculation

I'm encountering a bug in my SQL code that calculates the day-over-day (DoD) count difference. This table (curr_day) summarizes the count of trades on any business day (i.e. excluding weekends and government-mandated holidays) and is joined to a similar table (prev_day) that is lagged by one day (previous day). The join is based on the day's rank; for example, the first day in the curr_day table is Jan-01 and its rank is 1, while the first day (rank 1) in the prev_day table is Dec-31.
My issue is that the trade count difference never shows positive changes (see table below), only negative changes or no change. This problem does not affect other fields that calculate the value of a trade, only the number of trades on a given day.
Sample of query
with curr_day as (select GROUP, COUNT from DB where DATE is not HOLIDAY),
prev_day as (select rank()over(partition by GROUP order by DATE) as RANK, GROUP, DATE, COUNT
from curr_day where DATE is not HOLIDAY)
select ID, DATE, curr_day.COUNT-prev_day.COUNT
from (select rank()over(partition by curr_day.GROUP order by curr_day.DATE) as RANK
from curr_day
where curr_day.DATE >= (select min(curr_day.DATE)+1) from curr_day)
left join prev_day on curr_day.RANK = prev_day.RANK and curr_day.GROUP = prev_day.GROUP)
;
Output table
Date | Group | Count | DoD_Cnt_Diff
2020-12-31 | A | 1 | 0
2021-01-01 | A | 1 | 0
2021-01-02 | A | 0 | -1
2021-01-03 | A | 1 | (null)
2021-01-04 | A | 0 | -1
2021-01-05 | A | 0 | 0
2021-12-31 | B | 0 | 0
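For comparison, here is a sketch of what the intended day-over-day difference might look like when computed with LAG() instead of a rank-based self-join. It is not a diagnosis of the query above. It runs in SQLite through Python's sqlite3 module, the table name trades and the columns grp, trade_date and cnt are hypothetical, and holidays are assumed to be filtered out already.

```python
import sqlite3

# Date, group and count values taken from the question's output table.
# The table name `trades` and the column names are hypothetical.
ROWS = [
    ("A", "2020-12-31", 1), ("A", "2021-01-01", 1), ("A", "2021-01-02", 0),
    ("A", "2021-01-03", 1), ("A", "2021-01-04", 0), ("A", "2021-01-05", 0),
    ("B", "2021-12-31", 0),
]

# LAG() reads the previous business day's count within each group,
# so no separate rank column or self-join is needed.
DOD_SQL = """
SELECT grp,
       trade_date,
       cnt,
       cnt - LAG(cnt) OVER (PARTITION BY grp ORDER BY trade_date) AS dod_cnt_diff
FROM trades
ORDER BY grp, trade_date
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (grp TEXT, trade_date TEXT, cnt INTEGER)")
conn.executemany("INSERT INTO trades VALUES (?, ?, ?)", ROWS)
dod_rows = conn.execute(DOD_SQL).fetchall()
conn.close()

for row in dod_rows:
    print(row)
```

With this shape, the first day of each group has no previous day and comes back as NULL, and the change from 0 to 1 on 2021-01-03 shows up as +1.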

Merge rows based on a condition

Is it possible to merge a collection of rows based on a condition in Spark SQL using a SQL query?
If the difference between the purch_dt of two consecutive rows, ordered by line_num, is less than 5 days, then combine them into one row; the merged row should have the max value of purch_dt for that group. I tried using the LEAD function, but I can't get it to reset after each false condition is encountered and treat the following rows as a new group. I am also unable to get the max of purch_dt for each such group.
Input:
orderid | line_num | purch_dt
1 | 1 | 10-02-2020
1 | 2 | 12-02-2020
1 | 3 | 14-02-2020
1 | 4 | 21-03-2020
1 | 5 | 23-03-2020
Output:
orderid | purch_dt
1 | 14-02-2020 -- 1 - 3 combined into 1 row because difference is <5 between each
1 | 23-03-2020 -- 4 - 5 combined into 1 row because difference is <5 between each
Total Output rows = 2 because we have 2 groups.
Please note that line_num 4 is used as a set break since the difference between it and line_num 3 is greater than 5 days. Hence it should start its own merged record set.
I have the sql below so far, but I can't get to break out and create the groups.
create temporary view next_dt as
select
orderid,
LEAD(purch_dt) over (partition by orderid order by line_num asc) AS next_purch_dt,
purch_dt
from orders;
select *
from (
select
orderid,
CASE WHEN datediff(next_purch_dt, purch_dt) < 5 OR next_purch_dt IS NULL THEN 'Y'
ELSE 'N'
END AS flg
from
next_dt)
WHERE flg = 'Y';
Any help is appreciated.
UPDATE:
Slight change in the requirements:
The comparison now has to be made between two different fields in consecutive records: purch_dt of the current record and return_dt of the next record.
Also, when a merged record group is output, its purch_dt should be populated with the value from the record with the lowest line_num in that group, and its return_dt with the value from the record with the highest line_num in that same group.
Input:
orderid | line_num | purch_dt | return_dt
1 | 1 | 10-02-2020 | 10-02-2020
1 | 2 | 12-02-2020 | 13-02-2020
1 | 3 | 14-02-2020 | 14-02-2020
1 | 4 | 21-03-2020 | 23-02-2020
1 | 5 | 23-03-2020 | 24-02-2020
Output:
orderid | purch_dt | return_dt
1 | 10-02-2020 | 14-02-2020
1 | 21-03-2020 | 24-02-2020
Total Output rows = 2 because we have 2 groups.
Note that each output record contains the purch_dt of the record with min line_num in that group. And contains return_dt populated as per the record with max line_num in that group.
You almost had it. The query below worked for me:
sql("""create temporary view next_dt_orders as
select *
from (
select
orderid,line_num,purch_dt,
case when datediff(
(lead(purch_dt) over (partition by orderid order by line_num asc)),
purch_dt) < 5
then "N"
else "Y"
end as flag
from
orders) tab
where
flag='Y'""")
sql("select * from next_dt_orders").show()
+-------+--------+----------+----+
|orderid|line_num| purch_dt|flag|
+-------+--------+----------+----+
| 1| 3|2020-02-14| Y|
| 1| 5|2020-03-23| Y|
+-------+--------+----------+----+
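The query above keeps the last row of each group. Another way to express the original requirement is the common "gaps and islands" pattern: flag each row that starts a new group, turn the flags into a group number with a running SUM, then aggregate per group. The sketch below is an assumption-laden illustration, not the answerer's method. It runs in SQLite through Python's sqlite3 module, uses julianday() where Spark SQL would use datediff(), and stores the question's dates in ISO format.

```python
import sqlite3

# Input rows from the question, with dates rewritten as ISO text.
ROWS = [
    (1, 1, "2020-02-10"), (1, 2, "2020-02-12"), (1, 3, "2020-02-14"),
    (1, 4, "2020-03-21"), (1, 5, "2020-03-23"),
]

MERGE_SQL = """
WITH flagged AS (
  -- new_group = 1 when the gap to the previous row is 5 days or more,
  -- or when there is no previous row (the comparison is then NULL).
  SELECT orderid, line_num, purch_dt,
         CASE WHEN julianday(purch_dt)
                   - julianday(LAG(purch_dt) OVER (PARTITION BY orderid
                                                   ORDER BY line_num)) < 5
              THEN 0 ELSE 1 END AS new_group
  FROM orders
),
grouped AS (
  -- A running total of the flags gives every row its group number.
  SELECT orderid, line_num, purch_dt,
         SUM(new_group) OVER (PARTITION BY orderid ORDER BY line_num) AS grp
  FROM flagged
)
SELECT orderid, MAX(purch_dt) AS purch_dt
FROM grouped
GROUP BY orderid, grp
ORDER BY orderid, grp
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (orderid INTEGER, line_num INTEGER, purch_dt TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ROWS)
merged = conn.execute(MERGE_SQL).fetchall()
conn.close()
print(merged)
```

Because every row carries its group number, other per-group aggregates (for example the value from the lowest or highest line_num) can be added to the final SELECT without changing the grouping logic.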

Oracle SQL - Selecting records into groups and filtering based on a comparison of row 1 + row 2

I've got a database that contains data on monitored manufacturing machines that has these fields within (and more) :
ID | WORK_ORDER_ID | WORK_CENTER_ID | MFGNO | ...
The records are realtime data and are entered in sequentially based on when work_order_id changes. I want to check between work orders if the MFGNO is the same but grouped based on the work_center_id.
For example:
1. 998 | 100 | 205 | TEST_MFG
2. 997 | 100 | 205 | TEST_MFG
This would return true (or 1 row), as the mfgno's are the same.
Currently I'm able to do this for each work_center_id individually like this:
SELECT * FROM
(
select * FROM (select ID, WORKORDER_ID, TIMESTAMP, MFGNO from
HIST_ILLUM_RT where WORK_CENTER_ID = 5237
ORDER BY ID desc) where rownum = 1
)
where MFGNO = (
SELECT mfgno FROM
(
select * FROM (select ID, WORKORDER_ID, TIMESTAMP, MFGNO from
HIST_ILLUM_RT where WORK_CENTER_ID = 5237
ORDER BY ID desc
) where rownum < 3 order by id asc
) where rownum = 1
)
This produces 0 rows if there are no current back-to-back MFGNOs, and a result row if there are.
This way I have to write this expression for each individual work_center_id (there's about 40). I want to have an expression that checks the top two rows of each grouped work_center_id and only returns a row if the MFGNO's match.
For example:
1. 998 | 101 | 205 | TEST_MFG
2. 997 | 098 | 206 | SomethingElse
3. 996 | 424 | 205 | TEST_MFG
4. 995 | 521 | 206 | NotAMatch
5. 994 | 123 | 205 | Doesn'tCompareThis
6. 993 | 664 | 195 | Irrelevant
For this it would only return 1, as only the work_center_id = 205 has a back to back (row 1&2) MFGNO, compared to 206 which doesn't for example.
I'm running Oracle 11g which seems to be limiting me, but I am unable to upgrade or find a work around to create this expression in this current version.
I think you want lag() and some logic:
select count(*)
from (select t.*,
lag(MFGNO) over (partition by WORK_CENTER_ID order by id) as prev_mfgno
from t
) t
where prev_mfgno = mfgno
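As written, the lag() query compares every pair of consecutive rows in a work center. If only the two most recent rows per work center should be compared, as the question describes, one option is to add ROW_NUMBER() and keep only the latest row. The sketch below shows that variation in SQLite through Python's sqlite3 module. The table name hist is hypothetical and the rows are the example rows from the question. LAG and ROW_NUMBER are analytic functions that Oracle 11g also provides, though the exact query has not been run there.

```python
import sqlite3

# Example rows from the question: ID, WORK_ORDER_ID, WORK_CENTER_ID, MFGNO.
# The table name `hist` is hypothetical.
ROWS = [
    (998, 101, 205, "TEST_MFG"),
    (997, 98, 206, "SomethingElse"),
    (996, 424, 205, "TEST_MFG"),
    (995, 521, 206, "NotAMatch"),
    (994, 123, 205, "Doesn'tCompareThis"),
    (993, 664, 195, "Irrelevant"),
]

LATEST_PAIR_SQL = """
SELECT work_center_id, id, mfgno
FROM (
  SELECT t.*,
         -- MFGNO of the previous record in the same work center
         LAG(mfgno) OVER (PARTITION BY work_center_id ORDER BY id) AS prev_mfgno,
         -- rn = 1 marks the most recent record of each work center
         ROW_NUMBER() OVER (PARTITION BY work_center_id ORDER BY id DESC) AS rn
  FROM hist t
)
WHERE rn = 1
  AND prev_mfgno = mfgno
ORDER BY work_center_id
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hist (id INTEGER, work_order_id INTEGER, "
    "work_center_id INTEGER, mfgno TEXT)"
)
conn.executemany("INSERT INTO hist VALUES (?, ?, ?, ?)", ROWS)
matches = conn.execute(LATEST_PAIR_SQL).fetchall()
conn.close()
print(matches)
```

Only work center 205 is returned for the example data, since its two latest rows (998 and 996) share the same MFGNO. A work center with a single row has no previous value, so it never matches.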

Calculate time span over a number of records

I have a table that has the following schema:
ID | FirstName | Surname | TransmissionID | CaptureDateTime
1 | Billy | Goat | ABCDEF | 2018-09-20 13:45:01.098
2 | Jonny | Cash | ABCDEF | 2018-09-20 13:45:01.108
3 | Sally | Sue | ABCDEF | 2018-09-20 13:45:01.298
4 | Jermaine | Cole | PQRSTU | 2018-09-20 13:45:01.398
5 | Mike | Smith | PQRSTU | 2018-09-20 13:45:01.498
There are well over 70,000 records and they store logs of transmissions to a web-service. What I'd like to know is how I would go about writing a script that selects the distinct TransmissionID values and also shows the timespan between the earliest CaptureDateTime record and the latest record. Essentially, I'd like to see the rate at which the web-service is reading and writing records.
Is it even possible to do so in a single SELECT statement or should I just create a stored procedure or report in code? I don't know where to start aside from SELECT DISTINCT TransmissionID for this sort of query.
Here's what I have so far (I'm stuck on the time calculation)
SELECT DISTINCT [TransmissionID],
COUNT(*) as 'Number of records'
FROM [log_table]
GROUP BY [TransmissionID]
HAVING COUNT(*) > 1
Not sure how to get the difference between the first and last record with the same TransmissionID. I would like to get a result set like:
TransmissionID | TimeToCompletion | Number of records |
ABCDEF | 2.001 | 5000 |
Simply GROUP BY and use MIN / MAX function to find min/max date in each group and subtract them:
SELECT
TransmissionID,
COUNT(*),
DATEDIFF(second, MIN(CaptureDateTime), MAX(CaptureDateTime))
FROM yourdata
GROUP BY TransmissionID
HAVING COUNT(*) > 1
Use min and max to calculate timespan
SELECT [TransmissionID],
COUNT(*) as 'Number of records',datediff(s,min(CaptureDateTime),max(CaptureDateTime)) as timespan
FROM [log_table]
GROUP BY [TransmissionID]
HAVING COUNT(*) > 1
A method that returns the average time for all transmissionids, even those with only 1 record:
SELECT TransmissionID,
COUNT(*),
DATEDIFF(second, MIN(CaptureDateTime), MAX(CaptureDateTime)) * 1.0 / NULLIF(COUNT(*) - 1, 0)
FROM yourdata
GROUP BY TransmissionID;
Note that you may not actually want the maximum of the capture date for a given transmissionId. You might want the overall maximum in the table -- so you can consider the final period after the most recent record.
If so, this looks like:
SELECT TransmissionID,
COUNT(*),
DATEDIFF(second,
MIN(CaptureDateTime),
MAX(MAX(CaptureDateTime)) OVER ()
) * 1.0 / COUNT(*)
FROM yourdata
GROUP BY TransmissionID;
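The MIN / MAX approach from the answers above can be tried on the sample rows from the question. One thing to watch: the question asks for a fractional result such as 2.001, and counting whole seconds will not give that. In SQL Server, DATEDIFF(millisecond, ...) / 1000.0 would presumably be the analogous change. The sketch below is only an illustration in SQLite through Python's sqlite3 module, where julianday() differences are converted to seconds; the table name log_table comes from the question and the reduced column list is an assumption.

```python
import sqlite3

# Sample rows from the question (ID, TransmissionID, CaptureDateTime),
# with the typo in the second timestamp corrected.
ROWS = [
    (1, "ABCDEF", "2018-09-20 13:45:01.098"),
    (2, "ABCDEF", "2018-09-20 13:45:01.108"),
    (3, "ABCDEF", "2018-09-20 13:45:01.298"),
    (4, "PQRSTU", "2018-09-20 13:45:01.398"),
    (5, "PQRSTU", "2018-09-20 13:45:01.498"),
]

# julianday() returns fractional days, so the difference between the
# latest and earliest capture time is multiplied by 86400 to get seconds.
TIMESPAN_SQL = """
SELECT TransmissionID,
       COUNT(*) AS number_of_records,
       (julianday(MAX(CaptureDateTime)) - julianday(MIN(CaptureDateTime)))
         * 86400.0 AS time_to_completion
FROM log_table
GROUP BY TransmissionID
HAVING COUNT(*) > 1
ORDER BY TransmissionID
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_table (ID INTEGER, TransmissionID TEXT, CaptureDateTime TEXT)"
)
conn.executemany("INSERT INTO log_table VALUES (?, ?, ?)", ROWS)
timespans = conn.execute(TIMESPAN_SQL).fetchall()
conn.close()

for transmission_id, count, seconds in timespans:
    print(transmission_id, count, round(seconds, 3))
```

For the sample rows this gives roughly 0.2 seconds for ABCDEF and 0.1 seconds for PQRSTU. The values are floating point, so they should be rounded for display rather than compared exactly.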

POSTGRESQL : How to select the first row of each group?

With this query :
WITH responsesNew AS
(
SELECT DISTINCT responses."studentId", notation, responses."givenHeart",
SUM(notation + responses."givenHeart") OVER (partition BY responses."studentId"
ORDER BY responses."createdAt") AS total, responses."createdAt"
FROM responses
)
SELECT responsesNew."studentId", notation, responsesNew."givenHeart", total,
responsesNew."createdAt"
FROM responsesNew
WHERE total = 3
GROUP BY responsesNew."studentId", notation, responsesNew."givenHeart", total,
responsesNew."createdAt"
ORDER BY responsesNew."studentId" ASC
I get this data table :
studentId | notation | givenHeart | total | createdAt |
----------+----------+------------+-------+--------------------+
374 | 1 | 0 | 3 | 2017-02-13 12:43:03
374 | null | 0 | 3 | 2017-02-15 22:22:17
639 | 1 | 2 | 3 | 2017-04-03 17:21:30
790 | 1 | 0 | 3 | 2017-02-12 21:12:23
...
My goal is to keep only the earliest row of each group in my data table, as shown below:
studentId | notation | givenHeart | total | createdAt |
----------+----------+------------+-------+--------------------+
374 | 1 | 0 | 3 | 2017-02-13 12:43:03
639 | 1 | 2 | 3 | 2017-04-03 17:21:30
790 | 1 | 0 | 3 | 2017-02-12 21:12:23
...
How can I get there?
I've read many topics over here, but nothing I've tried with DISTINCT, DISTINCT ON, subqueries in WHERE, LIMIT, etc. has worked for me (surely due to my poor understanding). I've run into errors related to window functions, a missing column in ORDER BY, and a few others I can't remember.
You can do this with distinct on. The query would look like this:
WITH responsesNew AS (
SELECT DISTINCT r."studentId", notation, r."givenHeart",
SUM(notation + r."givenHeart") OVER (partition BY r."studentId"
ORDER BY r."createdAt") AS total,
r."createdAt"
FROM responses r
)
SELECT DISTINCT ON (r."studentId") r."studentId", notation, r."givenHeart", total,
r."createdAt"
FROM responsesNew r
WHERE total = 3
ORDER BY r."studentId" ASC, r."createdAt";
I'm pretty sure this can be simplified. I just don't understand the purpose of the CTE. Using SELECT DISTINCT in this way is very curious.
If you want a simplified query, ask another question with sample data, desired results, and explanation of what you are doing and include the query or a link to this question.
Use the ROW_NUMBER() window function to add a row number to each partition and then only show row 1.
There is no need to fully qualify names if only one table is involved, and when you do qualify, use aliases to improve readability.
WITH responsesNew AS
(
SELECT "studentId"
, notation
, "givenHeart"
, SUM(notation + "givenHeart") OVER (PARTITION BY "studentId" ORDER BY "createdAt") AS total
, "createdAt"
FROM responses
)
, ranked AS
(
-- Number only the rows with total = 3, so RNum = 1 is the earliest of them.
SELECT RN.*
, ROW_NUMBER() OVER (PARTITION BY "studentId" ORDER BY "createdAt") AS RNum
FROM responsesNew RN
WHERE total = 3
)
SELECT "studentId"
, notation
, "givenHeart"
, total
, "createdAt"
FROM ranked
WHERE RNum = 1
ORDER BY "studentId" ASC
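To see the ROW_NUMBER() idea in action, here is a small sketch in SQLite through Python's sqlite3 module. It is an illustration under assumptions: the rows are partly taken from the question's output and partly hypothetical (the earlier row for student 374 is invented so that the running total reaches 3 on the second row), and the row number is assigned only after filtering to total = 3, so that row 1 is the earliest matching row.

```python
import sqlite3

# studentId, notation, givenHeart, createdAt.
# The first row for student 374 is hypothetical; the others follow the
# values shown in the question.
ROWS = [
    (374, 2, 0, "2017-02-10 09:00:00"),
    (374, 1, 0, "2017-02-13 12:43:03"),
    (374, None, 0, "2017-02-15 22:22:17"),
    (639, 1, 2, "2017-04-03 17:21:30"),
]

FIRST_ROW_SQL = """
WITH running AS (
  -- Running total per student; rows where notation is NULL add nothing.
  SELECT studentId, notation, givenHeart, createdAt,
         SUM(notation + givenHeart)
           OVER (PARTITION BY studentId ORDER BY createdAt) AS total
  FROM responses
),
ranked AS (
  -- Number only the rows that reached total = 3, earliest first.
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY studentId ORDER BY createdAt) AS rnum
  FROM running
  WHERE total = 3
)
SELECT studentId, notation, givenHeart, total, createdAt
FROM ranked
WHERE rnum = 1
ORDER BY studentId
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE responses (studentId INTEGER, notation INTEGER, "
    "givenHeart INTEGER, createdAt TEXT)"
)
conn.executemany("INSERT INTO responses VALUES (?, ?, ?, ?)", ROWS)
first_rows = conn.execute(FIRST_ROW_SQL).fetchall()
conn.close()

for row in first_rows:
    print(row)
```

Student 374 has two rows with total = 3 and only the earlier one (2017-02-13) is kept. If the row number were assigned before the filter, that row would be numbered 2 and the student would drop out of the result.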