Summarizing data using SQL - sql

I have a problem that I am trying to solve using SQL and I need your input on how to approach it.
This is what the input data and expected output look like:
container_edits - This is the input table
container | units | status | move_time
-------------------------------------------------
XYZ | 5 | Start | 2018-01-01 00:00:15
XYZ | 2 | Add | 2018-01-01 00:01:10
XYZ | 3 | Add | 2018-01-01 00:02:00
XYZ | null | Complete | 2018-01-01 00:03:00
XYZ | 5 | Start | 2018-01-01 00:04:15
XYZ | 3 | Add | 2018-01-01 00:05:10
XYZ | 4 | Add | 2018-01-01 00:06:00
XYZ | 5 | Add | 2018-01-01 00:07:10
XYZ | 6 | Add | 2018-01-01 00:08:00
XYZ | null | Complete | 2018-01-01 00:09:00
Expected summarized output
container | loop_num | units | start_time | end_time
------------------------------------------------------------------------
XYZ | 1 | 10 | 2018-01-01 00:00:15 | 2018-01-01 00:03:00
XYZ | 2 | 23 | 2018-01-01 00:04:15 | 2018-01-01 00:09:00
Essentially, I need to partition the data based on the status label, extract the minimum and maximum time within each partition, and get the total number of units within that partition. I am aware of window functions and the PARTITION BY clause, but I am unclear on how to apply them when I need to partition based on the value of a column ('status' in this case).
Any leads on how to go about solving this would be really helpful. Thank you!

You can assign a group using a cumulative sum of 'Start' rows, which is your loop_num. The rest is aggregation:
select container, loop_num, sum(units) as units,
       min(move_time) as start_time, max(move_time) as end_time
from (select ce.*,
             sum(case when status = 'Start' then 1 else 0 end)
               over (partition by container order by move_time) as loop_num
      from container_edits ce
     ) ce
group by container, loop_num;
Here is a db<>fiddle (it happens to use Postgres, but the syntax is standard SQL).
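To make this self-contained, here is the same cumulative-sum trick as a runnable sketch using SQLite via Python's sqlite3 (my choice for illustration; the query itself is standard SQL, and the table and column names follow the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table container_edits (container text, units int, status text, move_time text);
insert into container_edits values
  ('XYZ', 5,    'Start',    '2018-01-01 00:00:15'),
  ('XYZ', 2,    'Add',      '2018-01-01 00:01:10'),
  ('XYZ', 3,    'Add',      '2018-01-01 00:02:00'),
  ('XYZ', null, 'Complete', '2018-01-01 00:03:00'),
  ('XYZ', 5,    'Start',    '2018-01-01 00:04:15'),
  ('XYZ', 3,    'Add',      '2018-01-01 00:05:10'),
  ('XYZ', 4,    'Add',      '2018-01-01 00:06:00'),
  ('XYZ', 5,    'Add',      '2018-01-01 00:07:10'),
  ('XYZ', 6,    'Add',      '2018-01-01 00:08:00'),
  ('XYZ', null, 'Complete', '2018-01-01 00:09:00');
""")

# Each 'Start' row bumps the running sum, so every Start..Complete
# stretch shares one loop_num; the outer query then aggregates per loop.
rows = conn.execute("""
select container, loop_num, sum(units) as units,
       min(move_time) as start_time, max(move_time) as end_time
from (select ce.*,
             sum(case when status = 'Start' then 1 else 0 end)
               over (partition by container order by move_time) as loop_num
      from container_edits ce) ce
group by container, loop_num
order by loop_num
""").fetchall()
```

`rows` comes back as the two summary rows from the expected output (loop 1 with 10 units, loop 2 with 23).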

Related

Finding the difference in days from the previous record, partitioning by category

I have the following table:
+----------+------------+----------------+
| Customer | Date | DesiredDayDiff |
+----------+------------+----------------+
| aaa | 12/09/2018 | 0 |
| aaa | 18/09/2018 | 6 |
| aaa | 25/09/2018 | 13 |
| aaa | 27/09/2018 | 15 |
| aaa | 28/09/2018 | 16 |
| bbb | 07/09/2018 | 0 |
| bbb | 11/09/2018 | 4 |
| bbb | 11/09/2018 | 4 |
+----------+------------+----------------+
I need to be able to calculate the difference of day from previous record, for that particular customer.
I believe SQL Server 2012+ added window function support that could help here? If this can be done using a window function, that would be a bonus, as it would hopefully make my query a lot tidier.
I couldn't find a similar thread where the solution partitions by another category (in this case, the customer).
If I follow your narrative and diff from the previous row, LAG works for that:
declare @t table (Customer char(3), Date date, DesiredDayDiff int)
insert into @t (Customer, Date, DesiredDayDiff) values
('aaa','20180912',0),
('aaa','20180918',6),
('aaa','20180925',13),
('aaa','20180927',15),
('aaa','20180928',16),
('bbb','20180907',0),
('bbb','20180911',4),
('bbb','20180911',4)

select
    *,
    COALESCE(DATEDIFF(day, LAG(Date) OVER (PARTITION BY Customer ORDER BY Date), Date), 0)
from
    @t
Results:
Customer Date       DesiredDayDiff (computed)
-------- ---------- -------------- -----------
aaa      2018-09-12 0              0
aaa      2018-09-18 6              6
aaa      2018-09-25 13             7
aaa      2018-09-27 15             2
aaa      2018-09-28 16             1
bbb      2018-09-07 0              0
bbb      2018-09-11 4              4
bbb      2018-09-11 4              0
To match your "desired" column, I have to use FIRST_VALUE instead.
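Both variants can be sketched side by side. This is my illustration using SQLite via Python's sqlite3 (not the asker's SQL Server setup): `julianday()` stands in for `DATEDIFF(day, ...)`, LAG gives the diff from the previous row, and FIRST_VALUE gives the diff from the first row per customer, which is what the "desired" column actually contains:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table t (Customer text, Date text);
insert into t values
  ('aaa','2018-09-12'), ('aaa','2018-09-18'), ('aaa','2018-09-25'),
  ('aaa','2018-09-27'), ('aaa','2018-09-28'),
  ('bbb','2018-09-07'), ('bbb','2018-09-11'), ('bbb','2018-09-11');
""")

# diff_from_prev uses LAG (previous row in the partition);
# diff_from_first uses FIRST_VALUE and reproduces DesiredDayDiff.
rows = conn.execute("""
select Customer, Date,
       cast(julianday(Date) - julianday(coalesce(lag(Date) over w, Date)) as int)
         as diff_from_prev,
       cast(julianday(Date) - julianday(first_value(Date) over w) as int)
         as diff_from_first
from t
window w as (partition by Customer order by Date)
order by Customer, Date
""").fetchall()
```

For 'aaa' the FIRST_VALUE column comes out as 0, 6, 13, 15, 16, matching the desired values, while the LAG column gives 0, 6, 7, 2, 1.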

SQL / HiveQL multiple aggregates from same window

I would like to know if there are tricks to optimizing the following HiveQL (or SQL, out of curiosity).
For example, if I have a table:
x | y | e | time
2 | 5 | 1 | 11:30:00
2 | 1 | 1 | 12:15:00
8 | 0 | 1 | 16:00:00
10 | 6 | 2 | 16:06:00
1 | 2 | 2 | 17:00:00
and I want to get multiple aggregates:
select
    e,
    time,
    sum(x) over w as cumu_x,
    sum(y) over w as cumu_y,
    count(x) over w as num_x
from my_table
window w as
    (partition by e
     order by time
     rows between unbounded preceding and current row)
should give me the desired result
e | time | cumu_x | cumu_y | num_x
1 | 11:30:00 | 2 | 5 | 1
1 | 12:15:00 | 4 | 6 | 2
1 | 16:00:00 | 12 | 6 | 3
2 | 16:06:00 | 10 | 6 | 1
2 | 17:00:00 | 11 | 8 | 2
The question: how can this be optimized? Such Hive queries are extremely slow when millions of rows are involved.
If I were looping over the data myself, I would:
Calculate all aggregates in the same loop. Does this happen if I use the window alias?
Sort the data once and keep running totals. This is because I know that at each iteration, the result will just be an increment of the prior result. Does Hive do this? Is there a way to give hints so that it will?
Process different bins of "e" in parallel. Does Hive do this? I only see a single reducer when I run. Is there a way to help Hive parallelize?
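I can't speak to Hive's execution plan, but the query itself (with the missing commas restored) runs as written on any engine that supports the named WINDOW clause. A small check of the expected output using SQLite via Python's sqlite3, which shares that syntax:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table my_table (x int, y int, e int, time text);
insert into my_table values
  (2, 5, 1, '11:30:00'),
  (2, 1, 1, '12:15:00'),
  (8, 0, 1, '16:00:00'),
  (10, 6, 2, '16:06:00'),
  (1, 2, 2, '17:00:00');
""")

# All three aggregates share the one named window w, so the
# partitioning and sort are only specified once.
rows = conn.execute("""
select e, time,
       sum(x) over w as cumu_x,
       sum(y) over w as cumu_y,
       count(x) over w as num_x
from my_table
window w as (partition by e order by time
             rows between unbounded preceding and current row)
order by e, time
""").fetchall()
```

The result matches the desired table: running totals restart at each new value of e.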

In SQL, group user actions by first-time or recurring

Imagine a sequence of actions. Each action is of a certain type.
Grouping by a given time-frame (e.g. day), how many of these actions happened for the first time, and how many were recurring?
Example Input:
+-----------+-------------+-------------------+
| user_id | action_type | timestamp |
+-----------+-------------+-------------------+
| 5 | play | 2014-02-02 00:55 |
| 2 | sleep | 2014-02-02 00:52 |
| 5 | play | 2014-02-02 00:42 |
| 5 | eat | 2014-02-02 00:31 |
| 3 | eat | 2014-02-02 00:19 |
| 2 | eat | 2014-02-01 23:52 |
| 3 | play | 2014-02-01 23:50 |
| 2 | play | 2014-02-01 23:48 |
+-----------+-------------+-------------------+
Example Output
+------------+------------+-------------+
| first_time | recurring | day |
+------------+------------+-------------+
| 4 | 1 | 2014-02-02 |
| 3 | 0 | 2014-02-01 |
+------------+------------+-------------+
Explanation
On 2014-02-02, users 2, 3, and 5 performed various actions. There were 4 instances where a user performed an action for the first time; in one case, user 5 repeated the action 'play'.
I added a column 'Total Actions' because, as I said, I believe there is a misinterpretation of the facts in the output example. You can remove it easily.
TEST in SQLFiddle.com for SQL Server 2008.
select
COUNT(q.repetitions) 'first time',
SUM(case when q.repetitions>1 then q.repetitions-1 else 0 end) as 'recurring',
day
from (
select COUNT(i.action_type) as 'repetitions',convert(date,i.time_stamp) as 'day'
from input i
group by i.user_id, i.action_type,convert(date,i.time_stamp)
) q
group by q.day
order by day desc
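The same two-level aggregation can be checked end to end. Here is my port to SQLite via Python's sqlite3 (`convert(date, ...)` becomes `date(...)`); the inner query counts repetitions per (user, action, day), and the outer query turns those into first-time and recurring counts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table input (user_id int, action_type text, time_stamp text);
insert into input values
  (5, 'play',  '2014-02-02 00:55'),
  (2, 'sleep', '2014-02-02 00:52'),
  (5, 'play',  '2014-02-02 00:42'),
  (5, 'eat',   '2014-02-02 00:31'),
  (3, 'eat',   '2014-02-02 00:19'),
  (2, 'eat',   '2014-02-01 23:52'),
  (3, 'play',  '2014-02-01 23:50'),
  (2, 'play',  '2014-02-01 23:48');
""")

# Each (user, action, day) group counts once as "first time";
# anything beyond the first repetition in a group is "recurring".
rows = conn.execute("""
select count(q.repetitions) as first_time,
       sum(case when q.repetitions > 1 then q.repetitions - 1 else 0 end) as recurring,
       q.day
from (select count(i.action_type) as repetitions,
             date(i.time_stamp) as day
      from input i
      group by i.user_id, i.action_type, date(i.time_stamp)) q
group by q.day
order by q.day desc
""").fetchall()
```

This reproduces the example output: 4 first-time and 1 recurring on 2014-02-02, 3 and 0 on 2014-02-01.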

SQL Combine two tables with two parameters

I searched the forum for an hour and didn't find anything similar.
I have this problem: I want to compare two columns, ID and DATE. If they are the same in both tables, I want to put the number from table 2 next to the row. But if they are not the same, I want to fill in the yearly quota for that date. I am working in Access.
table1
id|date|state_on_date
1|30.12.2013|23
1|31.12.2013|25
1|1.1.2014|35
1|2.1.2014|12
2|30.12.2013|34
2|31.12.2013|65
2|1.1.2014|43
table2
id|date|year_quantity
1|31.12.2013|100
1|31.12.2014|150
2|31.12.2013|200
2|31.12.2014|300
I want to get:
table 3
id|date|state_on_date|year_quantity
1|30.12.2013|23|100
1|31.12.2013|25|100
1|1.1.2014|35|150
1|2.1.2014|12|150
2|30.12.2013|34|200
2|31.12.2013|65|200
2|1.1.2014|43|300
I tried joins and reading forums but didn't find a solution.
Are you looking for this?
SELECT id, date, state_on_date,
(
SELECT TOP 1 year_quantity
FROM table2
WHERE id = t.id
AND date >= t.date
ORDER BY date
) AS year_quantity
FROM table1 t
Output:
| ID | DATE | STATE_ON_DATE | YEAR_QUANTITY |
|----|------------|---------------|---------------|
| 1 | 2013-12-30 | 23 | 100 |
| 1 | 2013-12-31 | 25 | 100 |
| 1 | 2014-01-01 | 35 | 150 |
| 1 | 2014-01-02 | 12 | 150 |
| 2 | 2013-12-30 | 34 | 200 |
| 2 | 2013-12-31 | 65 | 200 |
| 2 | 2014-01-01 | 43 | 300 |
Here is a SQLFiddle demo. It's for SQL Server, but it should work just fine in MS Access.
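For a quick self-contained check, here is my port of the correlated subquery to SQLite via Python's sqlite3. Note two substitutions: SQLite uses LIMIT 1 instead of TOP 1, and the dates are stored as ISO text so that the `>=` comparison sorts correctly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table table1 (id int, date text, state_on_date int);
create table table2 (id int, date text, year_quantity int);
insert into table1 values
  (1, '2013-12-30', 23), (1, '2013-12-31', 25), (1, '2014-01-01', 35),
  (1, '2014-01-02', 12), (2, '2013-12-30', 34), (2, '2013-12-31', 65),
  (2, '2014-01-01', 43);
insert into table2 values
  (1, '2013-12-31', 100), (1, '2014-12-31', 150),
  (2, '2013-12-31', 200), (2, '2014-12-31', 300);
""")

# For each table1 row, pick the earliest table2 quota whose date is
# on or after the row's date, for the same id.
rows = conn.execute("""
select t.id, t.date, t.state_on_date,
       (select year_quantity
        from table2
        where id = t.id and date >= t.date
        order by date
        limit 1) as year_quantity
from table1 t
order by t.id, t.date
""").fetchall()
```

The year_quantity column comes back as 100, 100, 150, 150, 200, 200, 300, matching the desired table 3.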

Postgres 9.1 - Numbering groups of rows

I have some data that represents different 'actions'. These 'actions' collectively comprise an 'event'.
The data looks like this:
EventID | UserID | Action | TimeStamp
--------------+------------+------------+-------------------------
1 | 111 | Start | 2012-01-01 08:00:00
1 | 111 | Stop | 2012-01-01 08:59:59
1 | 999 | Start | 2012-01-01 09:00:00
1 | 999 | Stop | 2012-01-01 09:59:59
1 | 111 | Start | 2012-01-01 10:00:00
1 | 111 | Stop | 2012-01-01 10:30:00
As you can see, each single 'event' is made of one or more 'Actions' (or as I think of them, 'sub events').
I need to identify each 'sub event' and give it an identifier. This is what I am looking for:
EventID | SubeventID | UserID | Action | TimeStamp
--------------+----------------+------------+------------+-------------------------
1 | 1 | 111 | Start | 2012-01-01 08:00:00
1 | 1 | 111 | Stop | 2012-01-01 08:59:59
1 | 2 | 999 | Start | 2012-01-01 09:00:00
1 | 2 | 999 | Stop | 2012-01-01 09:59:59
1 | 3 | 111 | Start | 2012-01-01 10:00:00
1 | 3 | 111 | Stop | 2012-01-01 10:30:00
I need something that can start counting, but only increment when some column has a specific value (like "Action" = 'Start').
I have been trying to use Window Functions for this, but with limited success. I just can't seem to find a solution that I feel will work... any thoughts?
If you have some field you can sort by, you could use the following query (untested):
SELECT
sum(("Action" = 'Start')::int) OVER (PARTITION BY "EventID" ORDER BY "Timestamp" ROWS UNBOUNDED PRECEDING)
FROM
events
Note that if the first SubEvent does not start with Start, it will have an event id of 0, which might not be what you want.
You could also use COUNT() in place of SUM():
SELECT
    EventID
  , COUNT(CASE WHEN Action = 'Start' THEN 1 END)
      OVER ( PARTITION BY EventID
             ORDER BY TimeStamp
             ROWS UNBOUNDED PRECEDING )
      AS SubeventID
  , UserID
  , Action
  , TimeStamp
FROM
    tableX AS t ;
Tests at SQL-Fiddle: test
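The cumulative-count approach above can be verified directly. A runnable sketch using SQLite via Python's sqlite3 (SQLite lacks Postgres's `::int` cast, so a CASE expression does the boolean-to-int conversion):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table events (EventID int, UserID int, Action text, TimeStamp text);
insert into events values
  (1, 111, 'Start', '2012-01-01 08:00:00'),
  (1, 111, 'Stop',  '2012-01-01 08:59:59'),
  (1, 999, 'Start', '2012-01-01 09:00:00'),
  (1, 999, 'Stop',  '2012-01-01 09:59:59'),
  (1, 111, 'Start', '2012-01-01 10:00:00'),
  (1, 111, 'Stop',  '2012-01-01 10:30:00');
""")

# The running sum increments only on 'Start' rows, so every
# Start/Stop pair shares one SubeventID.
rows = conn.execute("""
select EventID,
       sum(case when Action = 'Start' then 1 else 0 end)
         over (partition by EventID order by TimeStamp
               rows between unbounded preceding and current row) as SubeventID,
       UserID, Action, TimeStamp
from events
order by TimeStamp
""").fetchall()
```

The SubeventID column comes out as 1, 1, 2, 2, 3, 3, matching the desired output.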