Given a TripID I need to grab the next result that satistfies certain criteria (TripSource <> 1 AND HasLot = 1) but I've found the problem that the order to consider "the next Trip" has to be "ORDER BY TripDate, TripOrder". So I mean that TripID has nothing to do with the order.
(I'm using SQL Server 2008, so I can't use LEAD or LAG but I'm also interested in answers using them.)
Example datasource:
+--------+-------------------------+-----------+------------+--------+
| TripID | TripDate | TripOrder | TripSource | HasLot |
+--------+-------------------------+-----------+------------+--------+
1. | 37172 | 2019-08-01 00:00:00.000 | 0 | 1 | 0 |
2. | 37211 | 2019-08-01 00:00:00.000 | 1 | 1 | 0 |
3. | 37198 | 2019-08-01 00:00:00.000 | 2 | 2 | 1 |
4. | 37213 | 2019-08-01 00:00:00.000 | 3 | 1 | 0 |
5. | 37245 | 2019-08-02 00:00:00.000 | 0 | 1 | 0 |
6. | 37279 | 2019-08-02 00:00:00.000 | 1 | 1 | 0 |
7. | 37275 | 2019-08-02 00:00:00.000 | 2 | 1 | 0 |
8. | 37264 | 2019-08-02 00:00:00.000 | 3 | 2 | 0 |
9. | 37336 | 2019-08-03 00:00:00.000 | 0 | 1 | 1 |
10. | 37320 | 2019-08-05 00:00:00.000 | 0 | 1 | 0 |
11. | 37354 | 2019-08-05 00:00:00.000 | 1 | 1 | 0 |
12. | 37329 | 2019-08-05 00:00:00.000 | 2 | 1 | 0 |
13. | 37373 | 2019-08-06 00:00:00.000 | 0 | 1 | 0 |
14. | 37419 | 2019-08-06 00:00:00.000 | 1 | 1 | 0 |
15. | 37421 | 2019-08-06 00:00:00.000 | 2 | 1 | 0 |
16. | 37414 | 2019-08-06 00:00:00.000 | 3 | 1 | 1 |
17. | 37459 | 2019-08-07 00:00:00.000 | 0 | 2 | 1 |
18. | 37467 | 2019-08-07 00:00:00.000 | 1 | 1 | 0 |
19. | 37463 | 2019-08-07 00:00:00.000 | 2 | 1 | 0 |
20. | 37461 | 2019-08-07 00:00:00.000 | 3 | 0 | 0 |
+--------+-------------------------+-----------+------------+--------+
Results I need:
Given TripID 37211 (Row 2.) I need to get 37198 (Row 3.)
Given TripID 37198 (Row 3.) I need to get 37459 (Row 17.)
Given TripID 37459 (Row 17.) I need to get null
Given TripID 37463 (Row 19.) I need to get null
You can use a correlated subquery or outer apply:
select t.*, t2.tripid
from trips t outer apply
(select top (1) t2.*
from trips t2
where t2.tripsource <> 1 and t2.haslot = 1 and
(t2.tripdate > t.tripdate or
t2.tripdate = t.tripdate and t2.triporder > t.triporder
)
order by t2.tripdate desc, t2.triporder desc
) t2;
Related
I have a table that looks like this:
| id | date_start | gap_7_days |
| -- | ------------------- | --------------- |
| 1 | 2021-06-10 00:00:00 | 0 |
| 1 | 2021-06-13 00:00:00 | 0 |
| 1 | 2021-06-19 00:00:00 | 0 |
| 1 | 2021-06-27 00:00:00 | 0 |
| 2 | 2021-07-04 00:00:00 | 1 |
| 2 | 2021-07-11 00:00:00 | 1 |
| 2 | 2021-07-18 00:00:00 | 1 |
| 2 | 2021-07-25 00:00:00 | 1 |
| 2 | 2021-08-01 00:00:00 | 1 |
| 2 | 2021-08-08 00:00:00 | 1 |
| 2 | 2021-08-09 00:00:00 | 0 |
| 2 | 2021-08-16 00:00:00 | 1 |
| 2 | 2021-08-23 00:00:00 | 1 |
| 2 | 2021-08-30 00:00:00 | 1 |
| 2 | 2021-08-31 00:00:00 | 0 |
| 2 | 2021-09-01 00:00:00 | 0 |
| 2 | 2021-08-08 00:00:00 | 1 |
| 2 | 2021-08-15 00:00:00 | 1 |
| 2 | 2021-08-22 00:00:00 | 1 |
| 2 | 2021-08-23 00:00:00 | 1 |
For each ID, I check whether consecutive date_start values are 7 days apart, and put a 1 or 0 in gap_7_days accordingly.
I want to do the following (using Redshift SQL only):
Get the length of each sequence of consecutive 1s in gap_7_days for each ID
Expected output:
| id | date_start | gap_7_days | sequence_length |
| -- | ------------------- | --------------- | --------------- |
| 1 | 2021-06-10 00:00:00 | 0 | |
| 1 | 2021-06-13 00:00:00 | 0 | |
| 1 | 2021-06-19 00:00:00 | 0 | |
| 1 | 2021-06-27 00:00:00 | 0 | |
| 2 | 2021-07-04 00:00:00 | 1 | 6 |
| 2 | 2021-07-11 00:00:00 | 1 | 6 |
| 2 | 2021-07-18 00:00:00 | 1 | 6 |
| 2 | 2021-07-25 00:00:00 | 1 | 6 |
| 2 | 2021-08-01 00:00:00 | 1 | 6 |
| 2 | 2021-08-08 00:00:00 | 1 | 6 |
| 2 | 2021-08-09 00:00:00 | 0 | |
| 2 | 2021-08-16 00:00:00 | 1 | 3 |
| 2 | 2021-08-23 00:00:00 | 1 | 3 |
| 2 | 2021-08-30 00:00:00 | 1 | 3 |
| 2 | 2021-08-31 00:00:00 | 0 | |
| 2 | 2021-09-01 00:00:00 | 0 | |
| 2 | 2021-08-08 00:00:00 | 1 | 4 |
| 2 | 2021-08-15 00:00:00 | 1 | 4 |
| 2 | 2021-08-22 00:00:00 | 1 | 4 |
| 2 | 2021-08-23 00:00:00 | 1 | 4 |
Get the number of sequences for each ID
Expected output:
| id | num_sequences |
| -- | ------------------- |
| 1 | 0 |
| 2 | 3 |
How can I achieve this?
If you want the number of sequences, just look at the previous value. When the current value is "1" and the previous is NULL or 0, then you have a new sequence.
So:
select id,
sum( (gap_7_days = 1 and coalesce(prev_gap_7_days, 0) = 0)::int ) as num_sequences
from (select t.*,
lag(gap_7_days) over (partition by id order by date_start) as prev_gap_7_days
from t
) t
group by id;
If you actually want the lengths of the sequences, as in the intermediate results, then ask a new question. That information is not needed for this question.
I have a table of daily data and a table of monthly data. I'm trying to retrieve one daily record corresponding to each monthly record. The wrinkles are that some days are missing from the daily data and the field I care about, new_status, is sometimes null on the month_end_date.
month_df
| ID | month_end_date |
| -- | -------------- |
| 1 | 2019-07-31 |
| 1 | 2019-06-30 |
| 2 | 2019-10-31 |
daily_df
| ID | daily_date | new_status |
| -- | ---------- | ---------- |
| 1 | 2019-07-29 | 1 |
| 1 | 2019-07-30 | 1 |
| 1 | 2019-08-01 | 2 |
| 1 | 2019-08-02 | 2 |
| 1 | 2019-08-03 | 2 |
| 1 | 2019-06-29 | 0 |
| 1 | 2019-06-30 | 0 |
| 2 | 2019-10-30 | 5 |
| 2 | 2019-10-31 | NULL |
| 2 | 2019-11-01 | 6 |
| 2 | 2019-11-02 | 6 |
I want to fuzzy join daily_df to monthly_df where daily_date is >= month_end_dt and less than some buffer afterwards (say, 5 days). I want to keep only the record with the minimum daily date and a non-null new_status.
This post solves the issue using an OUTER APPLY in SQL Server, but that seems not to be an option in Spark SQL. I'm wondering if there's a method that is similarly computationally efficient that works in Spark.
I need to increase the value of the proceeding row number by 1. When the row encounters another condition I then need to reset the counter. This is probably easiest explained with an example:
+---------+------------+------------+-----------+----------------+
| Acct_ID | Ins_Date | Acct_RowID | indicator | Desired_Output |
+---------+------------+------------+-----------+----------------+
| 5841 | 07/11/2019 | 1 | 1 | 1 |
| 5841 | 08/11/2019 | 2 | 0 | 2 |
| 5841 | 09/11/2019 | 3 | 0 | 3 |
| 5841 | 10/11/2019 | 4 | 0 | 4 |
| 5841 | 11/11/2019 | 5 | 1 | 1 |
| 5841 | 12/11/2019 | 6 | 0 | 2 |
| 5841 | 13/11/2019 | 7 | 1 | 1 |
| 5841 | 14/11/2019 | 8 | 0 | 2 |
| 5841 | 15/11/2019 | 9 | 0 | 3 |
| 5841 | 16/11/2019 | 10 | 0 | 4 |
| 5841 | 17/11/2019 | 11 | 0 | 5 |
| 5841 | 18/11/2019 | 12 | 0 | 6 |
| 5132 | 11/03/2019 | 1 | 1 | 1 |
| 5132 | 12/03/2019 | 2 | 0 | 2 |
| 5132 | 13/03/2019 | 3 | 0 | 3 |
| 5132 | 14/03/2019 | 4 | 1 | 1 |
| 5132 | 15/03/2019 | 5 | 0 | 2 |
| 5132 | 16/03/2019 | 6 | 0 | 3 |
| 5132 | 17/03/2019 | 7 | 0 | 4 |
| 5132 | 18/03/2019 | 8 | 0 | 5 |
| 5132 | 19/03/2019 | 9 | 1 | 1 |
| 5132 | 20/03/2019 | 10 | 0 | 2 |
+---------+------------+------------+-----------+----------------+
The column I want to create is 'Desired_Output'. It can be seen from this table that I need to use the column 'indicator'. I want the following row to be n+1; unless the next row is 1. The counter needs to reset when the value 1 is encountered again.
I have tried to use a loop method of some sort but this did not produce the desired results.
Is this possible in some way?
The trick is to identify the group of consecutive rows starts from indicator 1 to the next 1. This is achieve by using the cross apply finding the Acct_RowID with indicator = 1 and use that as a Grp_RowID to use as partition by in the row_number() window function
select *,
Desired_Output = row_number() over (partition by t.Acct_ID, Grp_RowID
order by Acct_RowID)
from your_table t
cross apply
(
select Grp_RowID = max(Acct_RowID)
from your_table x
where x.Acct_ID = t.Acct_ID
and x.Acct_RowID <= t.Acct_RowID
and x.indicator = 1
) g
I have a table with the following design:
+------+-------------------------+-------------+
| Shop | Date | SafetyEvent |
+------+-------------------------+-------------+
| 1 | 2018-06-25 10:00:00.000 | 0 |
| 1 | 2018-06-25 10:30:00.000 | 1 |
| 1 | 2018-06-25 10:45:00.000 | 0 |
| 2 | 2018-06-25 11:00:00.000 | 0 |
| 2 | 2018-06-25 11:30:00.000 | 0 |
| 2 | 2018-06-25 11:45:00.000 | 0 |
| 3 | 2018-06-25 12:00:00.000 | 1 |
| 3 | 2018-06-25 12:30:00.000 | 0 |
| 3 | 2018-06-25 12:45:00.000 | 0 |
+------+-------------------------+-------------+
Basically at each shop, we track the date/time of a repair and flag if a safety event occurred. I want to add an additional column that tracks if a safety event has occurred in the last 8 hours at each shop. The end result will be like this:
+------+-------------------------+-------------+-------------------+
| Shop | Date | SafetyEvent | SafetyEvent8Hours |
+------+-------------------------+-------------+-------------------+
| 1 | 2018-06-25 10:00:00.000 | 0 | 0 |
| 1 | 2018-06-25 10:30:00.000 | 1 | 1 |
| 1 | 2018-06-25 10:45:00.000 | 0 | 1 |
| 2 | 2018-06-25 11:00:00.000 | 0 | 0 |
| 2 | 2018-06-25 11:30:00.000 | 0 | 0 |
| 2 | 2018-06-25 11:45:00.000 | 0 | 0 |
| 3 | 2018-06-25 12:00:00.000 | 1 | 1 |
| 3 | 2018-06-25 12:30:00.000 | 0 | 1 |
| 3 | 2018-06-25 12:45:00.000 | 0 | 1 |
+------+-------------------------+-------------+-------------------+
I was trying to use DATEDIFF but couldn't figure out how to have it occur for each row.
This isn't particularly efficient, but you can use apply or a correlated subquery:
select t.*, t8.SafetyEvent8Hours
from t apply
(select max(SafetyEvent) as SafetyEvent8Hours
from t t2
where t2.shop = t.shop and
t2.date <= t.date and
t2.date > dateadd(hour, -8, t.date)
) t8;
If you can rely on events being logged every 15 minutes, then a more efficient method is to use window functions:
select t.*,
max(SafetyEvent) over (partition by shop order by date rows between 31 preceding and current row) as SafetyEvent8Hours
from t
I have data
| account | type | position | created_date |
|---------|------|----------|------|
| 1 | 1 | 1 | 2016-08-01 00:00:00 |
| 2 | 1 | 2 | 2016-08-01 00:00:00 |
| 1 | 2 | 2 | 2016-08-01 00:00:00 |
| 2 | 2 | 1 | 2016-08-01 00:00:00 |
| 1 | 1 | 2 | 2016-08-02 00:00:00 |
| 2 | 1 | 1 | 2016-08-02 00:00:00 |
| 1 | 2 | 1 | 2016-08-03 00:00:00 |
| 2 | 2 | 2 | 2016-08-03 00:00:00 |
| 1 | 1 | 2 | 2016-08-04 00:00:00 |
| 2 | 1 | 1 | 2016-08-04 00:00:00 |
| 1 | 2 | 2 | 2016-08-07 00:00:00 |
| 2 | 2 | 1 | 2016-08-07 00:00:00 |
I need to get last positions (account, type, position) and delta from previous position. I'm trying to use Window functions but only get all rows and can't grouping them/get last.
SELECT
account,
type,
FIRST_VALUE(position) OVER w AS position,
FIRST_VALUE(position) OVER w - LEAD(position, 1, 0) OVER w AS delta,
created_date
FROM table
WINDOW w AS (PARTITION BY account ORDER BY created_date DESC)
I have result
| account | type | position | delta | created_date |
|---------|------|----------|-------|--------------|
| 1 | 1 | 1 | 1 | 2016-08-01 00:00:00 |
| 1 | 1 | 2 | 1 | 2016-08-02 00:00:00 |
| 1 | 1 | 2 | 0 | 2016-08-04 00:00:00 |
| 1 | 2 | 2 | 2 | 2016-08-01 00:00:00 |
| 1 | 2 | 1 | -1 | 2016-08-03 00:00:00 |
| 1 | 2 | 2 | 1 | 2016-08-07 00:00:00 |
| 2 | 1 | 2 | 2 | 2016-08-01 00:00:00 |
| 2 | 2 | 1 | 1 | 2016-08-01 00:00:00 |
| and so on |
but i need only last record for each account/type pair
| account | type | position | delta | created_date |
|---------|------|----------|-------|--------------|
| 1 | 1 | 2 | 0 | 2016-08-04 00:00:00 |
| 1 | 2 | 2 | 1 | 2016-08-07 00:00:00 |
| 2 | 1 | 1 | 0 | 2016-08-04 00:00:00 |
| and so on |
Sorry for my bad language and Thanks for any help.
My "best" try..
WITH cte_delta AS (
SELECT
account,
type,
FIRST_VALUE(position) OVER w AS position,
FIRST_VALUE(position) OVER w - LEAD(position, 1, 0) OVER w AS delta,
created_date
FROM table
WINDOW w AS (PARTITION BY account ORDER BY created_date DESC)
),
cte_date AS (
SELECT
account,
type,
MAX(created_date) AS created_date
FROM cte_delta
GROUP BY account, type
)
SELECT cd.*
FROM
cte_delta cd,
cte_date ct
WHERE
cd.account = ct.account
AND cd.type = ct.type
AND cd.created_date = ct.created_date