Postgres 9.1 - Numbering groups of rows - sql

I have some data that represents different 'actions'. These 'actions' collectively comprise an 'event'.
The data looks like this:
EventID | UserID | Action | TimeStamp
--------------+------------+------------+-------------------------
1 | 111 | Start | 2012-01-01 08:00:00
1 | 111 | Stop | 2012-01-01 08:59:59
1 | 999 | Start | 2012-01-01 09:00:00
1 | 999 | Stop | 2012-01-01 09:59:59
1 | 111 | Start | 2012-01-01 10:00:00
1 | 111 | Stop | 2012-01-01 10:30:00
As you can see, each single 'event' is made of one or more 'Actions' (or as I think of them, 'sub events').
I need to identify each 'sub event' and give it an identifier. This is what I am looking for:
EventID | SubeventID | UserID | Action | TimeStamp
--------------+----------------+------------+------------+-------------------------
1 | 1 | 111 | Start | 2012-01-01 08:00:00
1 | 1 | 111 | Stop | 2012-01-01 08:59:59
1 | 2 | 999 | Start | 2012-01-01 09:00:00
1 | 2 | 999 | Stop | 2012-01-01 09:59:59
1 | 3 | 111 | Start | 2012-01-01 10:00:00
1 | 3 | 111 | Stop | 2012-01-01 10:30:00
I need something that can start counting, but only increment when some column has a specific value (like "Action" = 'Start').
I have been trying to use Window Functions for this, but with limited success. I just can't seem to find a solution that I feel will work... any thoughts?

If you have some field you can sort by, you could use the following query (untested):
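SELECT
    sum(("Action" = 'Start')::int) OVER (PARTITION BY "EventID" ORDER BY "TimeStamp" ROWS UNBOUNDED PRECEDING) AS "SubeventID"
FROM
    events
Note that if an event's rows do not begin with a 'Start' action, the rows before the first 'Start' get a SubeventID of 0, which might not be what you want. Also note that quoted identifiers are case-sensitive in Postgres, so "TimeStamp" must match the column name exactly.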
You could also use COUNT() in place of SUM():
SELECT
    EventID
  , COUNT(CASE WHEN Action = 'Start' THEN 1 END)
        OVER ( PARTITION BY EventID
               ORDER BY TimeStamp
               ROWS UNBOUNDED PRECEDING )
      AS SubeventID
  , UserID
  , Action
FROM
    tableX AS t ;
Tests at SQL-Fiddle: test
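For anyone who wants to try the running-count idea without a Postgres instance, here is a minimal sketch that runs the COUNT() variant against the sample data in an in-memory SQLite database (window functions need SQLite 3.25 or later). The function name number_subevents and the table name events are just illustrative choices, not anything from the answers above.

```python
import sqlite3

# Sample rows from the question: (EventID, UserID, Action, TimeStamp)
ACTIONS = [
    (1, 111, "Start", "2012-01-01 08:00:00"),
    (1, 111, "Stop",  "2012-01-01 08:59:59"),
    (1, 999, "Start", "2012-01-01 09:00:00"),
    (1, 999, "Stop",  "2012-01-01 09:59:59"),
    (1, 111, "Start", "2012-01-01 10:00:00"),
    (1, 111, "Stop",  "2012-01-01 10:30:00"),
]


def number_subevents(rows):
    """Return the rows with a SubeventID that increments on every 'Start'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (EventID INTEGER, UserID INTEGER, Action TEXT, TimeStamp TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT EventID,
               -- CASE yields NULL for non-Start rows, and COUNT skips NULLs,
               -- so this is a running count of 'Start' rows per event.
               COUNT(CASE WHEN Action = 'Start' THEN 1 END)
                   OVER (PARTITION BY EventID
                         ORDER BY TimeStamp
                         ROWS UNBOUNDED PRECEDING) AS SubeventID,
               UserID, Action, TimeStamp
        FROM events
        ORDER BY EventID, TimeStamp
        """
    ).fetchall()
    conn.close()
    return result


for row in number_subevents(ACTIONS):
    print(row)
```

Rows that come before the first 'Start' of an event get SubeventID 0, the same caveat as with the SUM() version.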

Related

Summarizing data using SQL

I have a problem that I am trying to solve using SQL, and I would like your input on how to approach it.
This is what the input data and the expected output look like:
container_edits - This is the input table
container | units | status | move_time
-------------------------------------------------
XYZ | 5 | Start | 2018-01-01 00:00:15
XYZ | 2 | Add | 2018-01-01 00:01:10
XYZ | 3 | Add | 2018-01-01 00:02:00
XYZ | null | Complete | 2018-01-01 00:03:00
XYZ | 5 | Start | 2018-01-01 00:04:15
XYZ | 3 | Add | 2018-01-01 00:05:10
XYZ | 4 | Add | 2018-01-01 00:06:00
XYZ | 5 | Add | 2018-01-01 00:07:10
XYZ | 6 | Add | 2018-01-01 00:08:00
XYZ | null | Complete | 2018-01-01 00:09:00
Expected summarized output
container | loop_num | units | start_time | end_time
------------------------------------------------------------------------
XYZ | 1 | 10 | 2018-01-01 00:00:15 | 2018-01-01 00:03:00
XYZ | 2 | 23 | 2018-01-01 00:04:15 | 2018-01-01 00:09:00
Essentially, I need to partition the data based on the status label, extract the minimum and maximum time within the partition and get the total number of units within that partition. I am aware of the usage of window functions and the partition by clause but I am unclear on how to apply that when I need to partition based on the value of a column ('status' in this case).
Any leads on how to go about solving this would be really helpful. Thank you!
You can assign a group using a cumulative sum of starts, which is your loop_num. The rest is aggregation:
select container, loop_num, sum(units) as units,
       min(move_time) as start_time, max(move_time) as end_time
from (select ce.*,
             sum(case when status = 'Start' then 1 else 0 end) over (partition by container order by move_time) as loop_num
      from container_edits ce
     ) ce
group by container, loop_num;
Here is a db<>fiddle (it happens to use Postgres, but the syntax is standard SQL).
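A small runnable sketch of the same approach, using an in-memory SQLite database (3.25 or later) loaded with the sample rows from the question. The function name summarize_loops is made up for this example. One thing worth knowing: because the window has an ORDER BY but no explicit frame, it defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows sharing the same move_time would be counted together as peers.

```python
import sqlite3

# Sample rows from the question: (container, units, status, move_time)
EDITS = [
    ("XYZ", 5,    "Start",    "2018-01-01 00:00:15"),
    ("XYZ", 2,    "Add",      "2018-01-01 00:01:10"),
    ("XYZ", 3,    "Add",      "2018-01-01 00:02:00"),
    ("XYZ", None, "Complete", "2018-01-01 00:03:00"),
    ("XYZ", 5,    "Start",    "2018-01-01 00:04:15"),
    ("XYZ", 3,    "Add",      "2018-01-01 00:05:10"),
    ("XYZ", 4,    "Add",      "2018-01-01 00:06:00"),
    ("XYZ", 5,    "Add",      "2018-01-01 00:07:10"),
    ("XYZ", 6,    "Add",      "2018-01-01 00:08:00"),
    ("XYZ", None, "Complete", "2018-01-01 00:09:00"),
]


def summarize_loops(rows):
    """Number each Start..Complete loop per container, then aggregate it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE container_edits "
        "(container TEXT, units INTEGER, status TEXT, move_time TEXT)"
    )
    conn.executemany("INSERT INTO container_edits VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT container, loop_num,
               SUM(units)     AS units,       -- NULL units are ignored by SUM
               MIN(move_time) AS start_time,
               MAX(move_time) AS end_time
        FROM (SELECT ce.*,
                     SUM(CASE WHEN status = 'Start' THEN 1 ELSE 0 END)
                         OVER (PARTITION BY container ORDER BY move_time) AS loop_num
              FROM container_edits AS ce) AS numbered
        GROUP BY container, loop_num
        ORDER BY container, loop_num
        """
    ).fetchall()
    conn.close()
    return result


for row in summarize_loops(EDITS):
    print(row)
```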

SQL: how to check for neither overlapping nor holes in payment records

I do have a table PaymentSchedules with percentages info, and dates from/to for which those
percentages are valid, resource by resource:
| auto_numbered | res_id | date_start | date_end | org | pct |
|---------------+--------+------------+------------+-------+-----|
| 1 | A | 2018-01-01 | 2019-06-30 | One | 100 |
| 2 | A | 2019-07-01 | (NULL) | One | 60 |
| 3 | A | 2019-07-02 | 2019-12-31 | Two | 40 |
| 4 | A | 2020-01-01 | (NULL) | Two | 40 |
| 5 | B | (NULL) | (NULL) | Three | 100 |
| 6 | C | 2018-01-01 | (NULL) | One | 100 |
| 7 | C | 2019-11-01 | (NULL) | Four | 100 |
(Records #3 and #4 could be summarized onto just one line, but duplicated on purpose, to show that there are many combinations of date_start and date_end.)
A quick reading of the data:
Org "One" is fully paying for resource A up to 2019-06-30; then, it continues
to pay 60% of the cost, but the rest (40%) is being paid by org "Two" since
2019-07-02.
This should begin on 2019-07-01... small encoding error… provoking a 1-day gap.
Org "Three" is fully paying for resource B, at all times.
Org "One" is fully paying for resource C from 2018-01-01... but, starting on
2019-11-01, org "Four" is paying for it...
... and, there, there is an encoding error: we do have 200% of resource C being
taken into account since 2019-11-01: the record #6 should have been closed
(date_end set to 2019-10-31), but hasn't...
So, when we generate a financial report for the year 2019 (from 2019-01-01 to
2019-12-31), we will have calculation errors...
So, question: how can we make sure we don't have overlapping payments for
resources, or -- also the contrary -- "holes" for some periods of time?
How is it possible to write an SQL query to check that there are neither
underpaid nor overpaid resources? That is, all resources in the table should be
paid, for every single day of the financial period being looked at, by
one or more organizations, in a way that the summed up percentage is always
equal to 100%.
I don't see how to proceed with such a query. Anybody able to give hints, to put
me on track?
EDIT -- Working with both SQL Server and Oracle.
EDIT -- I don't own the DB, I can't add triggers or views. I need to be able to detect things "after the fact"... Need to easily spot the conflicting records, or the "missing" ones (in case of "period holes"), fix them by hand, and then re-run the financial report.
EDIT -- If we make an analysis for 2019, the following report would be desired:
| res_id | pct_sum | date |
|--------+---------+------------|
| A | 60 | 2019-07-01 |
| C | 200 | 2019-11-01 |
| C | 200 | 2019-11-02 |
| C | 200 | ... |
| C | 200 | ... |
| C | 200 | ... |
| C | 200 | 2019-12-30 |
| C | 200 | 2019-12-31 |
or, of course, an even better version -- certainly unobtainable? -- where each
type of problem would be present only once, with the relevant date range for
which the problem is observed:
| res_id | pct_sum | date_start | date_end |
|--------+---------+------------+------------|
| A | 60 | 2019-07-01 | 2019-07-01 |
| C | 200 | 2019-11-01 | 2019-12-31 |
EDIT -- Fiddle code: db<>fiddle here
Here's an incomplete attempt for SQL Server.
Basically, the idea was to use a recursive CTE to unfold months for each res_id.
Then left join 'what could be' to the existing date ranges.
But I doubt it can be done in SQL that would work for both Oracle and MS SQL Server.
Sure, both have window functions and CTEs.
But the datetime functions are rarely the same across different RDBMSs.
So I give up.
Maybe someone else finds an easier solution.
create table PaymentSchedules
(
auto_numbered int identity(1,1) primary key,
res_id varchar(30),
date_start date,
date_end date,
org varchar(30),
pct decimal(3,0)
)
GO
insert into PaymentSchedules
(res_id, org, pct, date_start, date_end)
values
('A', 'One', 100, '2018-01-01', '2018-06-30')
, ('A', 'One', 100, '2019-01-01', '2019-06-30')
, ('A', 'One', 60, '2019-07-01', null)
, ('A', 'Two', 40, '2019-07-02', '2019-12-31')
, ('A', 'Two', 40, '2020-01-01', null)
, ('B', 'Three', 100, null, null)
, ('C', 'One', 100, '2018-01-01', null)
, ('C', 'Four', 100, '2019-11-01', null)
;
GO
8 rows affected
declare @MaxEndDate date;
set @MaxEndDate = (select max(iif(date_start > date_end, date_start, isnull(date_end, date_start))) from PaymentSchedules);
;with rcte as
(
    select res_id
    , datefromparts(year(min(date_start)), month(min(date_start)), 1) as month_start
    , eomonth(coalesce(max(date_end), @MaxEndDate)) as month_end
    , 0 as lvl
    from PaymentSchedules
    group by res_id
    having min(date_start) is not null
    union all
    select res_id
    , dateadd(month, 1, month_start)
    , month_end
    , lvl + 1
    from rcte
    where dateadd(month, 1, month_start) < month_end
)
, cte_gai as
(
    select c.res_id, c.month_start, c.month_end
    , t.org, t.pct, t.auto_numbered
    , sum(isnull(t.pct,0)) over (partition by c.res_id, c.month_start) as res_month_pct
    , count(t.auto_numbered) over (partition by c.res_id, c.month_start) as cnt
    from rcte c
    left join PaymentSchedules t
      on t.res_id = c.res_id
     and c.month_start >= datefromparts(year(t.date_start), month(t.date_start), 1)
     and c.month_start <= coalesce(t.date_end, @MaxEndDate)
)
select *
from cte_gai
where res_month_pct <> 100
order by res_id, month_start
GO
res_id | month_start | month_end | org | pct | auto_numbered | res_month_pct | cnt
:----- | :---------- | :--------- | :--- | :--- | ------------: | :------------ | --:
A | 2018-07-01 | 2019-12-31 | null | null | null | 0 | 0
A | 2018-08-01 | 2019-12-31 | null | null | null | 0 | 0
A | 2018-09-01 | 2019-12-31 | null | null | null | 0 | 0
A | 2018-10-01 | 2019-12-31 | null | null | null | 0 | 0
A | 2018-11-01 | 2019-12-31 | null | null | null | 0 | 0
A | 2018-12-01 | 2019-12-31 | null | null | null | 0 | 0
C | 2019-11-01 | 2020-01-31 | One | 100 | 7 | 200 | 2
C | 2019-11-01 | 2020-01-31 | Four | 100 | 8 | 200 | 2
C | 2019-12-01 | 2020-01-31 | One | 100 | 7 | 200 | 2
C | 2019-12-01 | 2020-01-31 | Four | 100 | 8 | 200 | 2
C | 2020-01-01 | 2020-01-31 | One | 100 | 7 | 200 | 2
C | 2020-01-01 | 2020-01-31 | Four | 100 | 8 | 200 | 2
db<>fiddle here
I am not giving the full answer here, but I think that you are after cursors (https://learn.microsoft.com/en-us/sql/t-sql/language-elements/declare-cursor-transact-sql?view=sql-server-ver15).
A cursor lets you iterate through the table, checking each record in turn.
The idea is sound, but cursors are generally considered bad practice: they are quite heavy, they are slow, and they block the tables involved.
I know some people rewrite cursors as loops (probably WHILE loops), so you need to understand how a cursor works, work out how you would implement it, and then translate it into a loop. (https://www.sqlbook.com/advanced/sql-cursors-how-to-avoid-them/)
Views can also be helpful, but I am assuming that you already know how to use them.
The algorithm should be something like:
1. Have table1 and table2 (table2 is a copy of table1, https://www.tutorialrepublic.com/sql-tutorial/sql-cloning-tables.php).
2. Iterate through all of the records from table1 (I would use a cursor for this in the first instance), picking up one record at a time.
3. If the record's dates overlap (check it against table2), do something; else do something else.
4. Pick another record from table1 and go to step 2.
5. Drop unnecessary tables.
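As an alternative to iterating record by record, here is a rough set-based sketch of the day-by-day check, run on SQLite (3.25 or later) with the seven sample rows from the question. It builds a calendar with a recursive CTE, sums pct per resource and day, keeps the days where the sum is not 100, and collapses consecutive days into ranges. The function name find_payment_problems is hypothetical, and date() and julianday() are SQLite-specific, so the date arithmetic would have to be ported to SQL Server or Oracle; treat it as an illustration of the logic rather than a drop-in query.

```python
import sqlite3

# Sample rows from the question: (res_id, date_start, date_end, org, pct)
SCHEDULES = [
    ("A", "2018-01-01", "2019-06-30", "One", 100),
    ("A", "2019-07-01", None,         "One", 60),
    ("A", "2019-07-02", "2019-12-31", "Two", 40),
    ("A", "2020-01-01", None,         "Two", 40),
    ("B", None,         None,         "Three", 100),
    ("C", "2018-01-01", None,         "One", 100),
    ("C", "2019-11-01", None,         "Four", 100),
]

PROBLEM_QUERY = """
WITH RECURSIVE days(d) AS (            -- one row per day of the period
    SELECT :first_day
    UNION ALL
    SELECT date(d, '+1 day') FROM days WHERE d < :last_day
),
daily AS (                             -- summed percentage per resource and day
    SELECT r.res_id, days.d, COALESCE(SUM(p.pct), 0) AS pct_sum
    FROM (SELECT DISTINCT res_id FROM PaymentSchedules) AS r
    CROSS JOIN days
    LEFT JOIN PaymentSchedules AS p
           ON p.res_id = r.res_id
          AND (p.date_start IS NULL OR p.date_start <= days.d)  -- NULL = open-ended
          AND (p.date_end   IS NULL OR p.date_end   >= days.d)
    GROUP BY r.res_id, days.d
),
bad AS (                               -- consecutive bad days share the same grp
    SELECT res_id, d, pct_sum,
           julianday(d) - ROW_NUMBER() OVER (PARTITION BY res_id, pct_sum
                                             ORDER BY d) AS grp
    FROM daily
    WHERE pct_sum <> 100
)
SELECT res_id, pct_sum, MIN(d) AS date_start, MAX(d) AS date_end
FROM bad
GROUP BY res_id, pct_sum, grp
ORDER BY res_id, date_start
"""


def find_payment_problems(rows, first_day, last_day):
    """Return (res_id, pct_sum, date_start, date_end) for every bad date range."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE PaymentSchedules "
        "(res_id TEXT, date_start TEXT, date_end TEXT, org TEXT, pct INTEGER)"
    )
    conn.executemany("INSERT INTO PaymentSchedules VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(
        PROBLEM_QUERY, {"first_day": first_day, "last_day": last_day}
    ).fetchall()
    conn.close()
    return result


for row in find_payment_problems(SCHEDULES, "2019-01-01", "2019-12-31"):
    print(row)
```

A resource with no record at all for some days shows up with a pct_sum of 0 for that range, which covers the "holes" case as well as the overlaps.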

Can I put a condition on a window function in Redshift?

I have an events-based table in Redshift. I want to tie all events to the FIRST event in the series, provided that event was in the N-hours preceding this event.
If all I cared about was the very first row, I'd simply do:
SELECT
event_time
,first_value(event_time)
OVER (ORDER BY event_time rows unbounded preceding) as first_time
FROM
my_table
But because I only want to tie this to the first event in the past N-hours, I want something like:
SELECT
event_time
,first_value(event_time)
OVER (ORDER BY event_time rows between [N-hours ago] and current row) as first_time
FROM
my_table
A little background on my table. It's user actions, so effectively a user jumps on, performs 1-100 actions, and then leaves. Most users visit 1-10 times per day. Sessions rarely last over an hour, so I could set N=1.
If I just set a PARTITION BY date_trunc('hour', event_time), I'll split any session that spans an hour boundary into two groups.
Assume my_table looks like
id | user_id | event_time
----------------------------------
1 | 123 | 2015-01-01 01:00:00
2 | 123 | 2015-01-01 01:15:00
3 | 123 | 2015-01-01 02:05:00
4 | 123 | 2015-01-01 13:10:00
5 | 123 | 2015-01-01 13:20:00
6 | 123 | 2015-01-01 13:30:00
My goal is to get a result that looks like
id | parent_id | user_id | event_time
----------------------------------
1 | 1 | 123 | 2015-01-01 01:00:00
2 | 1 | 123 | 2015-01-01 01:15:00
3 | 1 | 123 | 2015-01-01 02:05:00
4 | 4 | 123 | 2015-01-01 13:10:00
5 | 4 | 123 | 2015-01-01 13:20:00
6 | 4 | 123 | 2015-01-01 13:30:00
The answer appears to be "no" as of now.
There is a functionality in SQL Server of using RANGE instead of ROWS in the frame. This allows the query to compare values to the current row's value.
https://www.simple-talk.com/sql/learn-sql-server/window-functions-in-sql-server-part-2-the-frame/
When I attempt this syntax in Redshift I get the error that "Range is not yet supported"
Someone update this when that "yet" changes!
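In the meantime, a workaround that avoids RANGE frames entirely is gap-based sessionization: flag an event as the start of a new series when the previous event of the same user is more than N hours older, take a running SUM of those flags as a session number, and pick the first id in each session. This is a different technique from the frame syntax asked about, and it ties events together by the gap to the previous event rather than by distance from the first one, which happens to be what the expected output shows (row 3 is 65 minutes after row 1 but still has parent 1). The sketch below runs on SQLite (3.25 or later); julianday() is SQLite-specific, and the function name assign_parents is made up for the example.

```python
import sqlite3

# Sample rows from the question: (id, user_id, event_time)
EVENTS = [
    (1, 123, "2015-01-01 01:00:00"),
    (2, 123, "2015-01-01 01:15:00"),
    (3, 123, "2015-01-01 02:05:00"),
    (4, 123, "2015-01-01 13:10:00"),
    (5, 123, "2015-01-01 13:20:00"),
    (6, 123, "2015-01-01 13:30:00"),
]

SESSION_QUERY = """
WITH with_prev AS (                    -- previous event time per user
    SELECT id, user_id, event_time,
           LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
    FROM my_table
),
flagged AS (                           -- 1 when the gap exceeds the threshold
    SELECT id, user_id, event_time,
           CASE WHEN prev_time IS NULL
                  OR (julianday(event_time) - julianday(prev_time)) * 24.0 > :gap_hours
                THEN 1 ELSE 0 END AS is_new
    FROM with_prev
),
numbered AS (                          -- running count of session starts
    SELECT id, user_id, event_time,
           SUM(is_new) OVER (PARTITION BY user_id ORDER BY event_time
                             ROWS UNBOUNDED PRECEDING) AS session_no
    FROM flagged
)
SELECT id,
       FIRST_VALUE(id) OVER (PARTITION BY user_id, session_no
                             ORDER BY event_time
                             ROWS UNBOUNDED PRECEDING) AS parent_id,
       user_id, event_time
FROM numbered
ORDER BY user_id, event_time
"""


def assign_parents(rows, gap_hours):
    """Return (id, parent_id, user_id, event_time) using gap-based sessions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (id INTEGER, user_id INTEGER, event_time TEXT)")
    conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)
    result = conn.execute(SESSION_QUERY, {"gap_hours": gap_hours}).fetchall()
    conn.close()
    return result


for row in assign_parents(EVENTS, gap_hours=1):
    print(row)
```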

SQL Query to Count Number of Responses Matching Certain Criteria over a Date Range and Display as Grouped per Day

I have the following set of survey responses in a table.
It's not very clear but the numbers represent the 'satisfaction' level where:
0 = happy
1 = neutral
2 = sad
+----------+--------+-------+------+-----------+-------------------------+
| friendly | polite | clean | rate | recommend | booking_date |
+----------+--------+-------+------+-----------+-------------------------+
| 2 | 2 | 2 | 0 | 0 | 2014-02-03 00:00:00.000 |
| 1 | 2 | 0 | 0 | 2 | 2014-02-04 00:00:00.000 |
| 0 | 0 | 0 | 1 | 0 | 2014-02-04 00:00:00.000 |
| 1 | 1 | 2 | 0 | 2 | 2014-02-04 00:00:00.000 |
| 0 | 0 | 1 | 2 | 1 | 2014-02-04 00:00:00.000 |
| 2 | 2 | 0 | 2 | 0 | 2014-02-05 00:00:00.000 |
| 2 | 1 | 1 | 0 | 2 | 2014-02-05 00:00:00.000 |
| 1 | 0 | 1 | 2 | 0 | 2014-02-05 00:00:00.000 |
| 0 | 1 | 1 | 1 | 1 | 2014-02-05 00:00:00.000 |
| 1 | 0 | 2 | 2 | 0 | 2014-02-05 00:00:00.000 |
+----------+--------+-------+------+-----------+-------------------------+
For each day I need the totals of each of the columns matching each response option. This will answer the question: "How many people answered happy, neutral or sad for each of the available question options".
I would then require a recordset returned such as:
+------------+----------+------------+--------+----------+------------+--------+
| Date | FriHappy | FriNeutral | FriSad | PolHappy | PolNeutral | PolSad |
+------------+----------+------------+--------+----------+------------+--------+
| 2014-02-03 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2014-02-04 | 2 | 2 | 0 | 2 | 1 | 1 |
| 2014-02-05 | 1 | 2 | 2 | 2 | 2 | 1 |
+------------+----------+------------+--------+----------+------------+--------+
This shows that on the 4th, two responders answered "happy" for the "Polite?" question, one answered "neutral" and one answered "sad".
On the 5th, one responder answered "happy" for the Friendly option, two chose "neutral" and two chose "sad".
I really wish to avoid doing this in code but my SQL isn't great. I did have a look around but couldn't find anything matching this specific requirement.
Obviously this is never going to work (nice if it did) but this may help explain:
SELECT cast(booking_date as date) [booking_date],
COUNT(friendly=0) [FriHappy],
COUNT(friendly=1) [FriNeutral],
COUNT(friendly=2) [FriSad]
FROM [u-rate-gatwick-qsm].[dbo].[Questions]
WHERE booking_date >= '2014-02-01'
AND booking_date <= '2014-03-01'
GROUP BY cast(booking_date as date)
Any pointers would be much appreciated.
Many thanks.
Here is a working version of your sample query:
SELECT cast(booking_date as date) as [booking_date],
       sum(case when friendly = 0 then 1 else 0 end) as [FriHappy],
       sum(case when friendly = 1 then 1 else 0 end) as [FriNeutral],
       sum(case when friendly = 2 then 1 else 0 end) as [FriSad]
FROM [u-rate-gatwick-qsm].[dbo].[Questions]
WHERE booking_date >= '2014-02-01' AND booking_date <= '2014-03-01'
GROUP BY cast(booking_date as date)
ORDER BY min(booking_date);
Your expression count(friendly = 0) doesn't work in SQL Server. Even if it did, it would be the same as count(friendly) -- that is, the number of non-NULL values in the column. Remember what count() does. It counts the number of non-NULL values.
The above logic says: add 1 when there is a match to the appropriate friendly value.
By the way, SQL Server doesn't guarantee the ordering of results from an aggregation, so I also added an order by clause. The min(booking_date) is just an easy way of ordering by the date.
And, I didn't make the change, but I think the second condition in the where should be < rather than <= so you don't include bookings on March 1st (even one at exactly midnight).
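To extend the same pattern to the other questions without typing every column by hand, the SUM(CASE ...) expressions can be generated. The sketch below does this for the friendly and polite columns and runs the result against the sample data in an in-memory SQLite database, using the half-open date range suggested above. The helper names build_query and daily_counts, the plain table name Questions, and SQLite's date() in place of cast(... as date) are all assumptions of this example, not part of the original query.

```python
import sqlite3

# Sample rows: (friendly, polite, clean, rate, recommend, booking_date)
RESPONSES = [
    (2, 2, 2, 0, 0, "2014-02-03 00:00:00.000"),
    (1, 2, 0, 0, 2, "2014-02-04 00:00:00.000"),
    (0, 0, 0, 1, 0, "2014-02-04 00:00:00.000"),
    (1, 1, 2, 0, 2, "2014-02-04 00:00:00.000"),
    (0, 0, 1, 2, 1, "2014-02-04 00:00:00.000"),
    (2, 2, 0, 2, 0, "2014-02-05 00:00:00.000"),
    (2, 1, 1, 0, 2, "2014-02-05 00:00:00.000"),
    (1, 0, 1, 2, 0, "2014-02-05 00:00:00.000"),
    (0, 1, 1, 1, 1, "2014-02-05 00:00:00.000"),
    (1, 0, 2, 2, 0, "2014-02-05 00:00:00.000"),
]

# Column name -> output prefix, and answer value -> output suffix.
QUESTIONS = {"friendly": "Fri", "polite": "Pol"}
LEVELS = {0: "Happy", 1: "Neutral", 2: "Sad"}


def build_query():
    """Build one SUM(CASE ...) column per (question, answer) pair.

    Column names come only from the trusted constants above, never from
    user input, so interpolating them into the SQL text is safe here.
    """
    columns = [
        f"SUM(CASE WHEN {col} = {value} THEN 1 ELSE 0 END) AS {prefix}{label}"
        for col, prefix in QUESTIONS.items()
        for value, label in LEVELS.items()
    ]
    return (
        "SELECT date(booking_date) AS day,\n       "
        + ",\n       ".join(columns)
        + "\nFROM Questions"
        + "\nWHERE booking_date >= :start AND booking_date < :end"
        + "\nGROUP BY date(booking_date)"
        + "\nORDER BY day"
    )


def daily_counts(rows, start, end):
    """Return one row per day with the counts for every question and answer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Questions (friendly INTEGER, polite INTEGER, clean INTEGER, "
        "rate INTEGER, recommend INTEGER, booking_date TEXT)"
    )
    conn.executemany("INSERT INTO Questions VALUES (?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(build_query(), {"start": start, "end": end}).fetchall()
    conn.close()
    return result


for row in daily_counts(RESPONSES, "2014-02-01", "2014-03-01"):
    print(row)
```

Adding "clean", "rate" and "recommend" to QUESTIONS would produce the remaining nine columns in the same way.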

SQL Combine two tables with two parameters

I searched the forum for an hour and didn't find anything similar.
I have this problem: I want to compare two columns, ID and DATE. If they are the same in both tables, I want to put the number from table 2 next to the row. But if they are not the same, I want to fill in the yearly quota that applies on that date. I am working in Access.
table1
id|date|state_on_date
1|30.12.2013|23
1|31.12.2013|25
1|1.1.2014|35
1|2.1.2014|12
2|30.12.2013|34
2|31.12.2013|65
2|1.1.2014|43
table2
id|date|year_quantity
1|31.12.2013|100
1|31.12.2014|150
2|31.12.2013|200
2|31.12.2014|300
I want to get:
table 3
id|date|state_on_date|year_quantity
1|30.12.2013|23|100
1|31.12.2013|25|100
1|1.1.2014|35|150
1|2.1.2014|12|150
2|30.12.2013|34|200
2|31.12.2013|65|200
2|1.1.2014|43|300
I tried joins and reading forums but didn't find a solution.
Are you looking for this?
SELECT id, date, state_on_date,
       (
           SELECT TOP 1 year_quantity
           FROM table2
           WHERE id = t.id
             AND date >= t.date
           ORDER BY date
       ) AS year_quantity
FROM table1 t
Output:
| ID | DATE | STATE_ON_DATE | YEAR_QUANTITY |
|----|------------|---------------|---------------|
| 1 | 2013-12-30 | 23 | 100 |
| 1 | 2013-12-31 | 25 | 100 |
| 1 | 2014-01-01 | 35 | 150 |
| 1 | 2014-01-02 | 12 | 150 |
| 2 | 2013-12-30 | 34 | 200 |
| 2 | 2013-12-31 | 65 | 200 |
| 2 | 2014-01-01 | 43 | 300 |
Here is a SQLFiddle demo. It's for SQL Server but should work just fine in MS Access.
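If you want to experiment with the lookup outside Access, here is a small sketch of the same correlated subquery on an in-memory SQLite database, where LIMIT 1 stands in for TOP 1. It assumes the date columns hold real dates (ISO-formatted text here); values stored as d.m.yyyy text would not compare or sort correctly. The function name attach_year_quantity is just for illustration.

```python
import sqlite3

# Sample rows with dates rewritten in ISO format so they sort correctly.
TABLE1 = [  # (id, date, state_on_date)
    (1, "2013-12-30", 23),
    (1, "2013-12-31", 25),
    (1, "2014-01-01", 35),
    (1, "2014-01-02", 12),
    (2, "2013-12-30", 34),
    (2, "2013-12-31", 65),
    (2, "2014-01-01", 43),
]
TABLE2 = [  # (id, date, year_quantity)
    (1, "2013-12-31", 100),
    (1, "2014-12-31", 150),
    (2, "2013-12-31", 200),
    (2, "2014-12-31", 300),
]


def attach_year_quantity(table1, table2):
    """For each table1 row, look up the quota of the next year-end date."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (id INTEGER, date TEXT, state_on_date INTEGER)")
    conn.execute("CREATE TABLE table2 (id INTEGER, date TEXT, year_quantity INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", table1)
    conn.executemany("INSERT INTO table2 VALUES (?, ?, ?)", table2)
    result = conn.execute(
        """
        SELECT t.id, t.date, t.state_on_date,
               (SELECT q.year_quantity
                FROM table2 AS q
                WHERE q.id = t.id
                  AND q.date >= t.date   -- first quota date on or after the row
                ORDER BY q.date
                LIMIT 1) AS year_quantity
        FROM table1 AS t
        ORDER BY t.id, t.date
        """
    ).fetchall()
    conn.close()
    return result


for row in attach_year_quantity(TABLE1, TABLE2):
    print(row)
```

Rows dated after the last quota date for their id come back with a NULL year_quantity, since the subquery finds nothing to return.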