Teradata conditional expand - SQL

I have a table with dates and a value that I am trying to expand, filling in the missing dates in order. Not shown is that I am doing this by group and location, but the crux of what I need to do is below. Say I have the following table:
dt | val
2014-01-01 | 10
2014-02-17 | 9
2014-04-21 | 5
I have expanded this into a week table, filling in the missing weeks with zeros:
week_bgn_dt | week_end_dt | val
2014-01-01 | 2014-01-08 | 10
2014-01-09 | 2014-01-16 | 0
2014-01-17 | 2014-01-24 | 0
...
2014-02-10 | 2014-02-17 | 0
2014-02-18 | 2014-02-25 | 9
2014-02-26 | 2014-03-05 | 0
2014-03-06 | 2014-03-13 | 0
...
2014-03-30 | 2014-04-06 | 0
2014-04-07 | 2014-04-14 | 0
2014-04-15 | 2014-04-22 | 5
What I want is to fill in the last value until a change, so the output would look like:
week_bgn_dt | week_end_dt | val
2014-01-01 | 2014-01-08 | 10
2014-01-09 | 2014-01-16 | 10
2014-01-17 | 2014-01-24 | 10
...
2014-02-10 | 2014-02-17 | 10
2014-02-18 | 2014-02-25 | 9
2014-02-26 | 2014-03-05 | 9
2014-03-06 | 2014-03-13 | 9
...
2014-03-30 | 2014-04-06 | 9
2014-04-07 | 2014-04-14 | 9
2014-04-15 | 2014-04-22 | 5
In Teradata I have tried this:
case when val <> 0 then val
     else sum(val) over (partition by group, location
                         order by group, store, week_bgn_dt
                         rows between 1 preceding and current row)
end as val2
but this only carries the last value forward once, like so:
week_bgn_dt | week_end_dt | val | val2
2014-01-01 | 2014-01-08 | 10 | 10
2014-01-09 | 2014-01-16 | 0 | 10
2014-01-17 | 2014-01-24 | 0 | 0
...
2014-02-10 | 2014-02-17 | 0 | 0
2014-02-18 | 2014-02-25 | 9 | 9
2014-02-26 | 2014-03-05 | 0 | 9
2014-03-06 | 2014-03-13 | 0 | 0
...
2014-03-30 | 2014-04-06 | 0 | 0
2014-04-07 | 2014-04-14 | 0 | 0
2014-04-15 | 2014-04-22 | 5 | 5
If I make the window unbounded, the running total keeps accumulating when I hit a new value:
case when val <> 0 then val
     else sum(val) over (partition by group, location
                         order by group, store, week_bgn_dt
                         rows between unbounded preceding and current row)
end as val2
week_bgn_dt | week_end_dt | val | val2
2014-01-01 | 2014-01-08 | 10 | 10
2014-01-09 | 2014-01-16 | 0 | 10
2014-01-17 | 2014-01-24 | 0 | 10
...
2014-02-10 | 2014-02-17 | 0 | 10
2014-02-18 | 2014-02-25 | 9 | 9
2014-02-26 | 2014-03-05 | 0 | 19
2014-03-06 | 2014-03-13 | 0 | 19
...
2014-03-30 | 2014-04-06 | 0 | 19
2014-04-07 | 2014-04-14 | 0 | 19
2014-04-15 | 2014-04-22 | 5 | 5
I have tried with max() and min(), but with similar results. Thank you for any assistance.

This seems to be an issue with the partitioning in the SUM operation. Remember that when an OVER clause is specified, SUM calculates its result for each partition separately, starting from zero in each partition. It appears that you want SUM to carry values over multiple partitions. As we cannot tell SUM in any way (that I'm aware of) to operate over multiple partitions, the way around is to redefine the partitioning to something else.
In your case, it appears that SUM should not use partitions at all. All we need is the RESET WHEN feature and the windowing clause of OVER. Using your expanded results filled with zeros, I achieved the required output with the following query:
SELECT
week_bgn_dt,
week_end_dt,
val,
SUM(val) OVER ( PARTITION BY 1
ORDER BY location ASC, week_bgn_dt ASC
RESET WHEN val<>0
ROWS UNBOUNDED PRECEDING ) AS val2
FROM test
week_bgn_dt | week_end_dt | val | val2
2014-01-01 | 2014-01-08 | 10 | 10
2014-01-09 | 2014-01-16 | 0 | 10
2014-01-17 | 2014-01-24 | 0 | 10
2014-02-10 | 2014-02-17 | 0 | 10
2014-02-18 | 2014-02-25 | 9 | 9
2014-02-26 | 2014-03-05 | 0 | 9
2014-03-06 | 2014-03-13 | 0 | 9
2014-03-30 | 2014-04-06 | 0 | 9
2014-04-07 | 2014-04-14 | 0 | 9
2014-04-15 | 2014-04-22 | 5 | 5
You might have noticed that I added only location to the provided data. I believe you can add the rest of the fields to the ORDER BY clause and get the right results.
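For reference, a minimal sketch of what the grouped version might look like, assuming the grouping columns are named grp and location (group itself is a reserved word in Teradata, so it would need quoting or renaming). RESET WHEN restarts the running sum at every non-zero val, and the zero rows that follow then carry that value forward:
SELECT
    week_bgn_dt,
    week_end_dt,
    val,
    SUM(val) OVER ( PARTITION BY grp, location   -- hypothetical grouping columns
                    ORDER BY week_bgn_dt ASC
                    RESET WHEN val <> 0          -- restart the sum at each new value
                    ROWS UNBOUNDED PRECEDING ) AS val2
FROM test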

Related

Group by bursts of occurences in TimescaleDB/PostgreSQL

This is my first question on Stack Overflow; any advice on how to ask a well-structured question is welcome.
So, I have a TimescaleDB database, which is a time-series database built on top of Postgres. It keeps most of Postgres's functionality, so if you don't know about Timescale it won't be an issue.
I have a select statement which returns:
time | num_issues | actor_login
------------------------+------------+------------------
2015-11-10 01:00:00+01 | 2 | nifl
2015-12-10 01:00:00+01 | 1 | anandtrex
2016-01-09 01:00:00+01 | 1 | isaacrg
2016-02-08 01:00:00+01 | 1 | timbarclay
2016-06-07 02:00:00+02 | 1 | kcalmes
2016-07-07 02:00:00+02 | 1 | cassiozen
2016-08-06 02:00:00+02 | 13 | phae
2016-09-05 02:00:00+02 | 2 | phae
2016-10-05 02:00:00+02 | 13 | cassiozen
2016-11-04 01:00:00+01 | 6 | cassiozen
2016-12-04 01:00:00+01 | 4 | cassiozen
2017-01-03 01:00:00+01 | 5 | cassiozen
2017-02-02 01:00:00+01 | 8 | cassandraoid
2017-03-04 01:00:00+01 | 16 | erquhart
2017-04-03 02:00:00+02 | 3 | erquhart
2017-05-03 02:00:00+02 | 9 | erquhart
2017-06-02 02:00:00+02 | 5 | erquhart
2017-07-02 02:00:00+02 | 2 | greatwarlive
2017-08-01 02:00:00+02 | 8 | tech4him1
2017-08-31 02:00:00+02 | 7 | tech4him1
2017-09-30 02:00:00+02 | 17 | erquhart
2017-10-30 01:00:00+01 | 7 | erquhart
2017-11-29 01:00:00+01 | 12 | erquhart
2017-12-29 01:00:00+01 | 8 | tech4him1
2018-01-28 01:00:00+01 | 6 | ragasirtahk
And so on. Basically, it returns a username per bucket of time, in this case 30 days.
The SQL query is:
SELECT DISTINCT ON(time_bucket('30 days', created_at))
time_bucket('30 days', created_at) as time,
count(id) as num_issues,
actor_login
FROM
issues_event
WHERE action = 'opened' AND repo_name='netlify/netlify-cms'
group by time, actor_login
order by time, num_issues DESC
My question is: how can I detect or group rows that have the same actor_login and are consecutive?
For example, I would like to group the cassiozen rows from 2016-10-05 to 2017-01-03, but not the other cassiozen rows in the column.
I have tried auxiliary columns and window functions such as LAG, but without a function or a DO statement I don't think it is possible.
I also tried with functions but couldn't find a way.
Any approach, idea or solution will be fully appreciated.
Edit: here is my desired output:
time | num_issues | actor_login | actor_group_id
------------------------+------------+------------------+----------------
2015-11-10 01:00:00+01 | 2 | nifl | 0
2015-12-10 01:00:00+01 | 1 | anandtrex | 1
2016-01-09 01:00:00+01 | 1 | isaacrg | 2
2016-02-08 01:00:00+01 | 1 | timbarclay | 3
2016-06-07 02:00:00+02 | 1 | kcalmes | 4
2016-07-07 02:00:00+02 | 1 | cassiozen | 5
2016-08-06 02:00:00+02 | 13 | phae | 6
2016-09-05 02:00:00+02 | 2 | phae | 6
2016-10-05 02:00:00+02 | 13 | cassiozen | 7
2016-11-04 01:00:00+01 | 6 | cassiozen | 7
2016-12-04 01:00:00+01 | 4 | cassiozen | 7
2017-01-03 01:00:00+01 | 5 | cassiozen | 7
2017-02-02 01:00:00+01 | 8 | cassandraoid | 12
2017-03-04 01:00:00+01 | 16 | erquhart | 13
2017-04-03 02:00:00+02 | 3 | erquhart | 13
2017-05-03 02:00:00+02 | 9 | erquhart | 13
2017-06-02 02:00:00+02 | 5 | erquhart | 13
2017-07-02 02:00:00+02 | 2 | greatwarlive | 17
2017-08-01 02:00:00+02 | 8 | tech4him1 | 18
2017-08-31 02:00:00+02 | 7 | tech4him1 | 18
2017-09-30 02:00:00+02 | 17 | erquhart | 16
2017-10-30 01:00:00+01 | 7 | erquhart | 16
2017-11-29 01:00:00+01 | 12 | erquhart | 16
2017-12-29 01:00:00+01 | 8 | tech4him1 | 21
2018-01-28 01:00:00+01 | 6 | ragasirtahk | 24
The solution of MatBaille is almost perfect.
I just wanted to group the consecutive actors like this so I could extract a bunch of metrics with other attributes of the table.
You could use a so-called "gaps-and-islands" approach
WITH sorted AS
(
    SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY time) AS rn,
        ROW_NUMBER() OVER (PARTITION BY actor_login ORDER BY time) AS rn_actor
    FROM
        your_results
)
SELECT
    *,
    rn - rn_actor AS actor_group_id
FROM
    sorted
Then the combination of (actor_login, actor_group_id) will group consecutive rows together.
db<>fiddle demo
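Building on that, a sketch of how the islands could then be rolled up into per-group metrics, which is what the edit in the question hints at (your_results again stands in for the original query):
WITH sorted AS
(
    SELECT
        *,
        ROW_NUMBER() OVER (ORDER BY time) AS rn,
        ROW_NUMBER() OVER (PARTITION BY actor_login ORDER BY time) AS rn_actor
    FROM
        your_results
)
SELECT
    actor_login,
    rn - rn_actor   AS actor_group_id,
    MIN(time)       AS group_start,
    MAX(time)       AS group_end,
    SUM(num_issues) AS total_issues   -- any other aggregate works the same way
FROM
    sorted
GROUP BY actor_login, rn - rn_actor
ORDER BY group_start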

Get next result with specific ORDER BY satisfying the WHERE clause

Given a TripID, I need to grab the next result that satisfies certain criteria (TripSource <> 1 AND HasLot = 1), but the catch is that the order that defines "the next trip" has to be ORDER BY TripDate, TripOrder. That is, TripID has nothing to do with the order.
(I'm using SQL Server 2008, so I can't use LEAD or LAG, but I'm also interested in answers using them.)
Example datasource:
+--------+-------------------------+-----------+------------+--------+
| TripID | TripDate | TripOrder | TripSource | HasLot |
+--------+-------------------------+-----------+------------+--------+
1. | 37172 | 2019-08-01 00:00:00.000 | 0 | 1 | 0 |
2. | 37211 | 2019-08-01 00:00:00.000 | 1 | 1 | 0 |
3. | 37198 | 2019-08-01 00:00:00.000 | 2 | 2 | 1 |
4. | 37213 | 2019-08-01 00:00:00.000 | 3 | 1 | 0 |
5. | 37245 | 2019-08-02 00:00:00.000 | 0 | 1 | 0 |
6. | 37279 | 2019-08-02 00:00:00.000 | 1 | 1 | 0 |
7. | 37275 | 2019-08-02 00:00:00.000 | 2 | 1 | 0 |
8. | 37264 | 2019-08-02 00:00:00.000 | 3 | 2 | 0 |
9. | 37336 | 2019-08-03 00:00:00.000 | 0 | 1 | 1 |
10. | 37320 | 2019-08-05 00:00:00.000 | 0 | 1 | 0 |
11. | 37354 | 2019-08-05 00:00:00.000 | 1 | 1 | 0 |
12. | 37329 | 2019-08-05 00:00:00.000 | 2 | 1 | 0 |
13. | 37373 | 2019-08-06 00:00:00.000 | 0 | 1 | 0 |
14. | 37419 | 2019-08-06 00:00:00.000 | 1 | 1 | 0 |
15. | 37421 | 2019-08-06 00:00:00.000 | 2 | 1 | 0 |
16. | 37414 | 2019-08-06 00:00:00.000 | 3 | 1 | 1 |
17. | 37459 | 2019-08-07 00:00:00.000 | 0 | 2 | 1 |
18. | 37467 | 2019-08-07 00:00:00.000 | 1 | 1 | 0 |
19. | 37463 | 2019-08-07 00:00:00.000 | 2 | 1 | 0 |
20. | 37461 | 2019-08-07 00:00:00.000 | 3 | 0 | 0 |
+--------+-------------------------+-----------+------------+--------+
Results I need:
Given TripID 37211 (Row 2.) I need to get 37198 (Row 3.)
Given TripID 37198 (Row 3.) I need to get 37459 (Row 17.)
Given TripID 37459 (Row 17.) I need to get null
Given TripID 37463 (Row 19.) I need to get null
You can use a correlated subquery or outer apply:
select t.*, tnext.tripid as next_tripid
from trips t outer apply
     (select top (1) t2.*
      from trips t2
      where t2.tripsource <> 1 and t2.haslot = 1 and
            (t2.tripdate > t.tripdate or
             (t2.tripdate = t.tripdate and t2.triporder > t.triporder))
      order by t2.tripdate, t2.triporder
     ) tnext;
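For a single given TripID, a minimal usage sketch of the same query (37211 is one of the TripIDs from the sample data):
select tnext.tripid as next_tripid
from trips t outer apply
     (select top (1) t2.tripid
      from trips t2
      where t2.tripsource <> 1 and t2.haslot = 1 and
            (t2.tripdate > t.tripdate or
             (t2.tripdate = t.tripdate and t2.triporder > t.triporder))
      order by t2.tripdate, t2.triporder
     ) tnext
where t.tripid = 37211;   -- returns 37198 for the sample data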

Set a flag based on the value of another flag in the past hour

I have a table with the following design:
+------+-------------------------+-------------+
| Shop | Date | SafetyEvent |
+------+-------------------------+-------------+
| 1 | 2018-06-25 10:00:00.000 | 0 |
| 1 | 2018-06-25 10:30:00.000 | 1 |
| 1 | 2018-06-25 10:45:00.000 | 0 |
| 2 | 2018-06-25 11:00:00.000 | 0 |
| 2 | 2018-06-25 11:30:00.000 | 0 |
| 2 | 2018-06-25 11:45:00.000 | 0 |
| 3 | 2018-06-25 12:00:00.000 | 1 |
| 3 | 2018-06-25 12:30:00.000 | 0 |
| 3 | 2018-06-25 12:45:00.000 | 0 |
+------+-------------------------+-------------+
Basically, at each shop we track the date/time of a repair and flag whether a safety event occurred. I want to add an additional column that tracks whether a safety event has occurred in the last 8 hours at each shop. The end result will look like this:
+------+-------------------------+-------------+-------------------+
| Shop | Date | SafetyEvent | SafetyEvent8Hours |
+------+-------------------------+-------------+-------------------+
| 1 | 2018-06-25 10:00:00.000 | 0 | 0 |
| 1 | 2018-06-25 10:30:00.000 | 1 | 1 |
| 1 | 2018-06-25 10:45:00.000 | 0 | 1 |
| 2 | 2018-06-25 11:00:00.000 | 0 | 0 |
| 2 | 2018-06-25 11:30:00.000 | 0 | 0 |
| 2 | 2018-06-25 11:45:00.000 | 0 | 0 |
| 3 | 2018-06-25 12:00:00.000 | 1 | 1 |
| 3 | 2018-06-25 12:30:00.000 | 0 | 1 |
| 3 | 2018-06-25 12:45:00.000 | 0 | 1 |
+------+-------------------------+-------------+-------------------+
I was trying to use DATEDIFF but couldn't figure out how to have it evaluated for each row.
This isn't particularly efficient, but you can use apply or a correlated subquery:
select t.*, t8.SafetyEvent8Hours
from t cross apply
     (select max(t2.SafetyEvent) as SafetyEvent8Hours
      from t t2
      where t2.shop = t.shop and
            t2.date <= t.date and
            t2.date > dateadd(hour, -8, t.date)
     ) t8;
If you can rely on events being logged every 15 minutes, then a more efficient method is to use window functions (8 hours at 15-minute intervals is 32 rows, i.e. the current row plus the 31 preceding):
select t.*,
       max(SafetyEvent) over (partition by shop
                              order by date
                              rows between 31 preceding and current row) as SafetyEvent8Hours
from t;
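One caveat, assuming SafetyEvent is stored as the bit type: SQL Server does not accept bit as an operand of MAX, so either variant would need a cast. A sketch of the windowed version under that assumption:
select t.*,
       max(cast(t.SafetyEvent as int)) over (partition by shop
                                             order by date
                                             rows between 31 preceding and current row) as SafetyEvent8Hours
from t;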

Match every record only once in a joined table

I have two tables: the first, inv, contains invoice records; the second contains payments. I want to match the payments to the inv table by inv_amount and inv_date. There might be more than one invoice with the same amount on the same day, and also more than one payment of the same amount on the same day.
Each payment should be matched with the first matching invoice, and every payment must be matched only once.
This is my data:
Table inv
inv_id | inv_amount | inv_date | inv_number
--------+------------+------------+------------
1 | 10 | 2018-01-01 | 1
2 | 16 | 2018-01-01 | 1
3 | 12 | 2018-02-02 | 2
4 | 14 | 2018-02-03 | 3
5 | 19 | 2018-02-04 | 3
6 | 19 | 2018-02-04 | 5
7 | 5 | 2018-02-04 | 6
8 | 40 | 2018-02-04 | 7
9 | 19 | 2018-02-04 | 8
10 | 19 | 2018-02-05 | 9
11 | 20 | 2018-02-05 | 10
12 | 20 | 2018-02-07 | 11
Table pay
pay_id | pay_amount | pay_date
--------+------------+------------
1 | 10 | 2018-01-01
2 | 12 | 2018-02-02
4 | 19 | 2018-02-04
3 | 14 | 2018-02-03
5 | 5 | 2018-02-04
6 | 19 | 2018-02-04
7 | 19 | 2018-02-05
8 | 20 | 2018-02-07
My Query:
SELECT DISTINCT ON (inv.inv_id) inv.inv_id,
inv.inv_amount,
inv.inv_date,
inv.inv_number,
pay.pay_id
FROM ("2016".pay
RIGHT JOIN "2016".inv ON (((pay.pay_amount = inv.inv_amount) AND (pay.pay_date = inv.inv_date))))
ORDER BY inv.inv_id
resulting in:
inv_id | inv_amount | inv_date | inv_number | pay_id
--------+------------+------------+------------+--------
1 | 10 | 2018-01-01 | 1 | 1
2 | 16 | 2018-01-01 | 1 |
3 | 12 | 2018-02-02 | 2 | 2
4 | 14 | 2018-02-03 | 3 | 3
5 | 19 | 2018-02-04 | 3 | 4
6 | 19 | 2018-02-04 | 5 | 4
7 | 5 | 2018-02-04 | 6 | 5
8 | 40 | 2018-02-04 | 7 |
9 | 19 | 2018-02-04 | 8 | 6
10 | 19 | 2018-02-05 | 9 | 7
11 | 20 | 2018-02-05 | 10 |
12 | 20 | 2018-02-07 | 11 | 8
The record inv_id = 6 should not match pay_id = 4, as that would mean payment 4 was entered twice.
Desired result:
inv_id | inv_amount | inv_date | inv_number | pay_id
--------+------------+------------+------------+--------
1 | 10 | 2018-01-01 | 1 | 1
2 | 16 | 2018-01-01 | 1 |
3 | 12 | 2018-02-02 | 2 | 2
4 | 14 | 2018-02-03 | 3 | 3
5 | 19 | 2018-02-04 | 3 | 4
6 | 19 | 2018-02-04 | 5 | <- should be empty
7 | 5 | 2018-02-04 | 6 | 5
8 | 40 | 2018-02-04 | 7 |
9 | 19 | 2018-02-04 | 8 | 6
10 | 19 | 2018-02-05 | 9 | 7
11 | 20 | 2018-02-05 | 10 |
12 | 20 | 2018-02-07 | 11 | 8
Disclaimer: yes, I asked this question yesterday with the original data, but someone pointed out that my SQL was very hard to read. I have therefore tried to create a cleaner representation of my problem.
For convenience, here's an SQL Fiddle to test: http://sqlfiddle.com/#!17/018d7/1
After seeing the example I think I've got the query for you:
WITH payments_cte AS (
    SELECT
        payment_id,
        payment_amount,
        payment_date,
        ROW_NUMBER() OVER (PARTITION BY payment_amount, payment_date ORDER BY payment_id) AS payment_row
    FROM payments
), invoices_cte AS (
    SELECT
        invoice_id,
        invoice_amount,
        invoice_date,
        invoice_number,
        ROW_NUMBER() OVER (PARTITION BY invoice_amount, invoice_date ORDER BY invoice_id) AS invoice_row
    FROM invoices
)
SELECT invoice_id, invoice_amount, invoice_date, invoice_number, payment_id
FROM invoices_cte
LEFT JOIN payments_cte
    ON payment_amount = invoice_amount
    AND payment_date = invoice_date
    AND payment_row = invoice_row
ORDER BY invoice_id, payment_id
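Note that the query above uses longer identifiers (payments, payment_id, ...) than the tables posted in the question; transcribed to the posted pay/inv schema, the same idea would read as follows (a sketch, not run against the fiddle):
WITH pay_cte AS (
    SELECT
        pay_id,
        pay_amount,
        pay_date,
        -- number duplicate (amount, date) payments 1, 2, ... by pay_id
        ROW_NUMBER() OVER (PARTITION BY pay_amount, pay_date ORDER BY pay_id) AS pay_row
    FROM pay
), inv_cte AS (
    SELECT
        inv_id,
        inv_amount,
        inv_date,
        inv_number,
        -- the same numbering on the invoice side
        ROW_NUMBER() OVER (PARTITION BY inv_amount, inv_date ORDER BY inv_id) AS inv_row
    FROM inv
)
SELECT inv_id, inv_amount, inv_date, inv_number, pay_id
FROM inv_cte
LEFT JOIN pay_cte
    ON pay_amount = inv_amount
    AND pay_date = inv_date
    AND pay_row = inv_row   -- each payment row is consumed at most once
ORDER BY inv_id, pay_id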

Impala SQL build columns based on row data and populating columns with additional row data

I'm working in Impala and, while I'm fairly inexperienced in both Impala and SQL, I need to be able to build a data set that looks like the following:
|dayname | 2017-11-08 00:00:00 | 2017-11-08 01:00:00 | ... |
|---------|---------------------+---------------------+-----|
|Wednesday| 20 | 11 | ... |
|---------|---------------------|---------------------|-----|
|Thursday | 287 | 17 | ... |
|---------|---------------------|---------------------|-----|
|... | ... | ... | ... |
|---------|---------------------|---------------------|-----|
I am unable, due to the constraints of Impala, to use pivot, which would under normal circumstances produce the desired result.
Thus far, I have a SQL SELECT statement which looks like this:
select
dayname(date) as dayname,
utc_hour,
sum(case when (`type` IN ('Awesome')) then 1 else 0 end) as some
FROM (select *, trunc(cast(floor(date / 1000) as timestamp), "HH") as utc_hour
FROM COOLNESSTYPES
WHERE date >= 1510082633596 and month >= '2017-11'
)  a
GROUP BY utc_hour, dayname
ORDER BY utc_hour;
and returns the following data:
+-----------+---------------------+------+
| dayname   | utc_hour            | some |
+-----------+---------------------+------+
| Wednesday | 2017-11-08 00:00:00 | 20   |
| Wednesday | 2017-11-08 01:00:00 | 11   |
| Wednesday | 2017-11-08 09:00:00 | 1    |
| Wednesday | 2017-11-08 11:00:00 | 40   |
| Wednesday | 2017-11-08 12:00:00 | 0    |
| Wednesday | 2017-11-08 13:00:00 | 6    |
| Wednesday | 2017-11-08 14:00:00 | 0    |
| Wednesday | 2017-11-08 16:00:00 | 2    |
| Wednesday | 2017-11-08 17:00:00 | 10   |
| Wednesday | 2017-11-08 19:00:00 | 5    |
| Thursday  | 2017-11-09 07:00:00 | 1    |
| Thursday  | 2017-11-09 12:00:00 | 0    |
| Thursday  | 2017-11-09 13:00:00 | 0    |
| Thursday  | 2017-11-09 14:00:00 | 58   |
| Friday    | 2017-11-10 09:00:00 | 0    |
| Friday    | 2017-11-10 10:00:00 | 0    |
| Friday    | 2017-11-10 16:00:00 | 0    |
+-----------+---------------------+------+
So, how do I go about doing something like this? On Cloudera's community pages, someone recommends using unions, but I'm not really clear on how I'd label my columns with the row values from my utc_hour column. (See https://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Transpose-columns-to-rows/td-p/49667 for more information on the union suggestion, if needed.)
Any help or ideas on this would be greatly appreciated. Thanks!
There is added complexity if you really require column names that change. If you can tolerate fixed column names, the pivot is simple, along these lines:
select
    dayname
    , extract(dow from utc_hour) as d_of_w
    , max(case when date_part('hour', utc_hour) = 0  then somecol end) as hour_0
    , max(case when date_part('hour', utc_hour) = 7  then somecol end) as hour_7
    , max(case when date_part('hour', utc_hour) = 9  then somecol end) as hour_9
    , max(case when date_part('hour', utc_hour) = 12 then somecol end) as hour_12
    , max(case when date_part('hour', utc_hour) = 14 then somecol end) as hour_14
from COOLNESSTYPES
group by
    d_of_w
    , dayname
I used Postgres to develop this example, originally using extract(hour from utc_hour) instead of the date_part() shown above (thanks to hbomb).
| dayname | d_of_w | hour_0 | hour_7 | hour_9 | hour_12 | hour_14 |
|-----------|--------|--------|--------|--------|---------|---------|
| Wednesday | 3 | 20 | (null) | 1 | 0 | 0 |
| Friday | 5 | (null) | (null) | 0 | (null) | (null) |
| Thursday | 4 | (null) | 1 | (null) | 0 | 58 |
see: http://sqlfiddle.com/#!17/81cfd/2 (Postgres)
To achieve column names that change you need "dynamic SQL", and to be frank it isn't clear to me if this is possible in Impala (as I don't use that product).
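That said, the fixed-column version can be written directly in Impala's own function vocabulary, since hour() and dayofweek() are Impala built-ins. A sketch, with hourly_counts standing in as a hypothetical view over the asker's aggregated query:
select
    dayname,
    dayofweek(utc_hour) as d_of_w,
    max(case when hour(utc_hour) = 0  then some end) as hour_0,
    max(case when hour(utc_hour) = 7  then some end) as hour_7,
    max(case when hour(utc_hour) = 9  then some end) as hour_9,
    max(case when hour(utc_hour) = 12 then some end) as hour_12,
    max(case when hour(utc_hour) = 14 then some end) as hour_14
    -- add one max(case ...) line per hour column required
from hourly_counts            -- hypothetical view over the asker's query
group by dayname, dayofweek(utc_hour);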