I have the following table (known as table1):
row_id session_id date_end user_id item_id
---------------------------------------------------
3962 5958255 2017-11-07 3249480 1
4553 5959689 2017-11-07 3249484 1
4554 5959689 2017-11-07 3249484 1
8775 5968439 2017-11-08 3249492 4
6706 5965190 2017-11-08 3249492 2
6779 5965280 2017-11-08 3249492 3
6778 5965280 2017-11-08 3249492 3
8774 5968439 2017-11-08 3249492 4
6685 5965159 2017-11-08 3249502 1
5314 5962257 2017-11-07 3249504 1
5315 5962257 2017-11-07 3249504 1
13564 5982665 2017-11-09 3249510 1
13565 5982665 2017-11-09 3249510 1
238 5941818 2017-11-06 3249540 1
8078 5967039 2017-11-08 3249540 3
13981 5984747 2017-11-09 3249540 4
127080 6267047 2017-11-30 3249540 10
When querying this table I need 3 new columns:
total: the count of items bought by each user
same: the count of items bought that have the same item_id as the current row
diff: the count of items bought that have a different item_id than the current row
However, I need all of these counts to be made with respect to a 30-day period. For example, the rows for user_id 3249492 should read:
row_id session_id date_end user_id item_id total same diff
8775 5968439 2017-11-08 3249492 4 5 1 3
6706 5965190 2017-11-08 3249492 2 4 0 3
6779 5965280 2017-11-08 3249492 3 3 1 1
6778 5965280 2017-11-08 3249492 3 2 0 1
8774 5968439 2017-11-08 3249492 4 1 0 0
I have the following:
SELECT row_id, session_id, date_end, user_id, item_id,
COUNT(item_id) OVER (PARTITION BY user_id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as total,
COUNT(item_id) OVER (PARTITION BY user_id, item_id ORDER BY item_id ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as same
FROM table1
This yields the correct values for total and same, but does not take the 30-day window into account. Also, I have no idea where to start with the diff column.
SQL Fiddle: http://sqlfiddle.com/#!17/ac833/2
This is PostgreSQL 9.6.
Any help would be greatly appreciated.
30 day running count
Instead of using a window function we can use a self join to get the 30 day running count.
WITH thirty_days_window AS (
SELECT table1.row_id, table1.item_id, "window".item_id AS other_item_id
FROM table1 join table1 AS "window" ON "window".user_id = table1.user_id AND
"window".date_end BETWEEN table1.date_end - interval '30 days' AND table1.date_end AND
"window".row_id <= table1.row_id
),
counts AS (
SELECT row_id,
COUNT(*) AS total,
COUNT(CASE WHEN item_id = other_item_id THEN 1 END) - 1 AS same,
COUNT(CASE WHEN item_id != other_item_id THEN 1 END) AS diff
FROM thirty_days_window GROUP BY row_id)
SELECT table1.row_id, session_id, date_end, user_id, table1.item_id,
total, same, diff
FROM table1 JOIN counts ON counts.row_id = table1.row_id
ORDER BY row_id;
The first part, thirty_days_window, creates the window by joining every row with all rows for the same user_id whose date_end falls within the 30 days up to and including the current row's date_end. We also assume that we only want rows with a row_id lower than or equal to the current one, so each row also joins to itself.
Next we count the rows. same counts only rows where the item_id is the same as the item_id of the joined row (subtracting 1 to remove the original row); diff does exactly the opposite and counts all rows where item_id is different from the joined row.
Finally we join back to the original table to add session_id, user_id and date_end.
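To make the self-join logic easier to check, here is a minimal sketch that runs the same counting on a few rows from the question, using Python's sqlite3 module as a stand-in for PostgreSQL. Two things are assumptions of the sketch: SQLite's date(x, '-30 days') replaces the PostgreSQL interval arithmetic, and the row with row_id 200000 is hypothetical, added only so that one earlier purchase falls outside the 30-day window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (row_id INTEGER PRIMARY KEY, session_id INTEGER,"
    " date_end TEXT, user_id INTEGER, item_id INTEGER)"
)
rows = [
    (8775, 5968439, "2017-11-08", 3249492, 4),
    (6706, 5965190, "2017-11-08", 3249492, 2),
    (6779, 5965280, "2017-11-08", 3249492, 3),
    (6778, 5965280, "2017-11-08", 3249492, 3),
    (8774, 5968439, "2017-11-08", 3249492, 4),
    (238, 5941818, "2017-11-06", 3249540, 1),
    (127080, 6267047, "2017-11-30", 3249540, 10),
    # Hypothetical row (not in the question): more than 30 days after row 238,
    # so row 238 must not be counted for it even though the item_id matches.
    (200000, 7000000, "2017-12-15", 3249540, 1),
]
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?, ?)", rows)

query = """
SELECT t.row_id,
       COUNT(*) AS total,
       COUNT(CASE WHEN t.item_id = w.item_id THEN 1 END) - 1 AS same,
       COUNT(CASE WHEN t.item_id != w.item_id THEN 1 END) AS diff
FROM table1 AS t
JOIN table1 AS w
  ON w.user_id = t.user_id
 AND w.date_end BETWEEN date(t.date_end, '-30 days') AND t.date_end
 AND w.row_id <= t.row_id
GROUP BY t.row_id
ORDER BY t.row_id
"""
# Map each row_id to its (total, same, diff) triple.
counts = {row_id: (total, same, diff)
          for row_id, total, same, diff in conn.execute(query)}

for row_id, triple in counts.items():
    print(row_id, triple)
```

For user 3249492 the triples match the counts the question asks for on rows 8775 and 6779, and the hypothetical row 200000 only sees itself and row 127080.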
The final result using the data in the fiddle:
row_id | session_id | date_end | user_id | item_id | total | same | diff
--------+------------+------------+---------+---------+-------+------+------
6706 | 5965190 | 2017-11-08 | 3249492 | 151 | 1 | 0 | 0
6778 | 5965280 | 2017-11-08 | 3249492 | 151 | 2 | 1 | 0
6779 | 5965280 | 2017-11-08 | 3249492 | 158 | 3 | 0 | 2
8774 | 5968439 | 2017-11-08 | 3249492 | 151 | 4 | 2 | 1
8775 | 5968439 | 2017-11-08 | 3249492 | 158 | 5 | 1 | 3
47046 | 6063745 | 2017-11-15 | 3263305 | 157 | 1 | 0 | 0
47047 | 6063745 | 2017-11-15 | 3263305 | 158 | 2 | 0 | 1
59887 | 6094293 | 2017-11-16 | 3263305 | 157 | 3 | 1 | 1
59888 | 6094294 | 2017-11-16 | 3263305 | 157 | 4 | 2 | 1
60343 | 6095456 | 2017-11-16 | 3263305 | 157 | 5 | 3 | 1
60344 | 6095457 | 2017-11-16 | 3263305 | 157 | 6 | 4 | 1
69112 | 6116357 | 2017-11-17 | 3263305 | 157 | 7 | 5 | 1
71085 | 6119700 | 2017-11-18 | 3263305 | 157 | 8 | 6 | 1
71508 | 6120421 | 2017-11-18 | 3250078 | 157 | 1 | 0 | 0
71509 | 6120421 | 2017-11-18 | 3250078 | 152 | 2 | 0 | 1
71510 | 6120421 | 2017-11-18 | 3250078 | 156 | 3 | 0 | 2
71511 | 6120421 | 2017-11-18 | 3250078 | 154 | 4 | 0 | 3
71512 | 6120421 | 2017-11-18 | 3250078 | 151 | 5 | 0 | 4
71513 | 6120421 | 2017-11-18 | 3250078 | 158 | 6 | 0 | 5
72242 | 6121399 | 2017-11-18 | 3263305 | 157 | 9 | 7 | 1
75696 | 6126280 | 2017-11-19 | 3263305 | 157 | 10 | 8 | 1
76082 | 6126777 | 2017-11-19 | 3263305 | 157 | 11 | 9 | 1
77546 | 6129039 | 2017-11-19 | 3263305 | 157 | 12 | 10 | 1
83754 | 6143858 | 2017-11-20 | 3263305 | 157 | 13 | 11 | 1
91331 | 6167552 | 2017-11-22 | 3263305 | 157 | 14 | 12 | 1
92431 | 6171560 | 2017-11-22 | 3263305 | 157 | 15 | 13 | 1
95073 | 6177870 | 2017-11-23 | 3263305 | 157 | 16 | 14 | 1
95302 | 6178780 | 2017-11-23 | 3263305 | 157 | 17 | 15 | 1
287471 | 7164221 | 2018-02-10 | 4516965 | 154 | 1 | 0 | 0
288750 | 7170955 | 2018-02-11 | 4516965 | 158 | 2 | 0 | 1
288751 | 7170955 | 2018-02-11 | 4516965 | 151 | 3 | 0 | 2
(31 rows)
Edit
After thinking about this for a bit, it's possible to do the query in one select:
SELECT table1.row_id, MIN(table1.session_id) AS session_id,
MIN(table1.date_end) AS date_end, MIN(table1.user_id) AS user_id, MIN(table1.item_id) AS item_id,
COUNT(*) AS total,
COUNT(CASE WHEN table1.item_id = windw.item_id THEN 1 END) - 1 AS same,
COUNT(CASE WHEN table1.item_id != windw.item_id THEN 1 END) AS diff
FROM table1 JOIN table1 AS windw ON windw.user_id = table1.user_id AND
windw.date_end BETWEEN table1.date_end - INTERVAL '30 days' AND table1.date_end AND
windw.row_id <= table1.row_id
GROUP BY table1.row_id ORDER BY table1.row_id;
Related
I have a table cart_items with the following data
+-----+---------+---------+------------+----------+
| id | user_id | cart_id | product_id | quantity |
+-----+---------+---------+------------+----------+
| 303 | 9 | 44 | 1 | 2 |
| 305 | 9 | 44 | 3 | 1 |
| 307 | 9 | 44 | 3 | 1 |
| 308 | 9 | 44 | 2 | 1 |
| 309 | 9 | 44 | 6 | 1 |
| 310 | 9 | 44 | 2 | 1 |
+-----+---------+---------+------------+----------+
My problem is that there are duplicate products. My desired table would be this
+-----+---------+---------+------------+----------+
| id | user_id | cart_id | product_id | quantity |
+-----+---------+---------+------------+----------+
| 303 | 9 | 44 | 1 | 2 |
| 305 | 9 | 44 | 3 | 2 |
| 308 | 9 | 44 | 2 | 2 |
| 309 | 9 | 44 | 6 | 1 |
+-----+---------+---------+------------+----------+
So the difference is that rows with a duplicate product_id are merged and their quantities are added together.
Is there an easy way to do this with an SQL query?
You need to group by user_id, cart_id, product_id and aggregate:
select
min(id) id, user_id, cart_id, product_id, sum(quantity) quantity
from cart_items
group by user_id, cart_id, product_id
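As a quick check of the grouping, the sketch below loads the cart_items rows from the question into an in-memory SQLite database (a stand-in chosen for this example, since the question does not name its database) and runs the same aggregate. The ORDER BY is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cart_items (id INTEGER PRIMARY KEY, user_id INTEGER,"
    " cart_id INTEGER, product_id INTEGER, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO cart_items VALUES (?, ?, ?, ?, ?)",
    [
        (303, 9, 44, 1, 2),
        (305, 9, 44, 3, 1),
        (307, 9, 44, 3, 1),
        (308, 9, 44, 2, 1),
        (309, 9, 44, 6, 1),
        (310, 9, 44, 2, 1),
    ],
)

# One output row per (user_id, cart_id, product_id): keep the lowest id
# and add up the quantities of the duplicates.
merged = conn.execute(
    """
    SELECT MIN(id) AS id, user_id, cart_id, product_id, SUM(quantity) AS quantity
    FROM cart_items
    GROUP BY user_id, cart_id, product_id
    ORDER BY 1
    """
).fetchall()

for row in merged:
    print(row)
```

Note that this only produces the merged result set; rewriting the table itself would need a separate UPDATE and DELETE step.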
I have two tables. The first, inv, contains invoice records; the second, pay, contains payments. I want to match the payments to the invoices in the inv table by amount and date. There might be more than one invoice with the same amount on the same day, and also more than one payment of the same amount on the same day.
Each payment should be matched with the first matching invoice, and every payment must be matched only once.
This is my data:
Table inv
inv_id | inv_amount | inv_date | inv_number
--------+------------+------------+------------
1 | 10 | 2018-01-01 | 1
2 | 16 | 2018-01-01 | 1
3 | 12 | 2018-02-02 | 2
4 | 14 | 2018-02-03 | 3
5 | 19 | 2018-02-04 | 3
6 | 19 | 2018-02-04 | 5
7 | 5 | 2018-02-04 | 6
8 | 40 | 2018-02-04 | 7
9 | 19 | 2018-02-04 | 8
10 | 19 | 2018-02-05 | 9
11 | 20 | 2018-02-05 | 10
12 | 20 | 2018-02-07 | 11
Table pay
pay_id | pay_amount | pay_date
--------+------------+------------
1 | 10 | 2018-01-01
2 | 12 | 2018-02-02
4 | 19 | 2018-02-04
3 | 14 | 2018-02-03
5 | 5 | 2018-02-04
6 | 19 | 2018-02-04
7 | 19 | 2018-02-05
8 | 20 | 2018-02-07
My Query:
SELECT DISTINCT ON (inv.inv_id) inv.inv_id,
inv.inv_amount,
inv.inv_date,
inv.inv_number,
pay.pay_id
FROM ("2016".pay
RIGHT JOIN "2016".inv ON (((pay.pay_amount = inv.inv_amount) AND (pay.pay_date = inv.inv_date))))
ORDER BY inv.inv_id
resulting in:
inv_id | inv_amount | inv_date | inv_number | pay_id
--------+------------+------------+------------+--------
1 | 10 | 2018-01-01 | 1 | 1
2 | 16 | 2018-01-01 | 1 |
3 | 12 | 2018-02-02 | 2 | 2
4 | 14 | 2018-02-03 | 3 | 3
5 | 19 | 2018-02-04 | 3 | 4
6 | 19 | 2018-02-04 | 5 | 4
7 | 5 | 2018-02-04 | 6 | 5
8 | 40 | 2018-02-04 | 7 |
9 | 19 | 2018-02-04 | 8 | 6
10 | 19 | 2018-02-05 | 9 | 7
11 | 20 | 2018-02-05 | 10 |
12 | 20 | 2018-02-07 | 11 | 8
The record inv_id = 6 should not match pay_id = 4, because that would mean that payment 4 was inserted twice.
Desired result:
inv_id | inv_amount | inv_date | inv_number | pay_id
--------+------------+------------+------------+--------
1 | 10 | 2018-01-01 | 1 | 1
2 | 16 | 2018-01-01 | 1 |
3 | 12 | 2018-02-02 | 2 | 2
4 | 14 | 2018-02-03 | 3 | 3
5 | 19 | 2018-02-04 | 3 | 4
6 | 19 | 2018-02-04 | 5 | <- should be empty
7 | 5 | 2018-02-04 | 6 | 5
8 | 40 | 2018-02-04 | 7 |
9 | 19 | 2018-02-04 | 8 | 6
10 | 19 | 2018-02-05 | 9 | 7
11 | 20 | 2018-02-05 | 10 |
12 | 20 | 2018-02-07 | 11 | 8
Disclaimer: Yes I asked that question yesterday with the original data but someone pointed out that my sql was very hard to read. I, therefore, tried to create a cleaner representation of my problem.
For convenience, here's an SQL Fiddle to test: http://sqlfiddle.com/#!17/018d7/1
After seeing the example I think I've got the query for you:
WITH payments_cte AS (
SELECT
payment_id,
payment_amount,
payment_date,
ROW_NUMBER() OVER (PARTITION BY payment_amount, payment_date ORDER BY payment_id) AS payment_row
FROM payments
), invoices_cte AS (
SELECT
invoice_id,
invoice_amount,
invoice_date,
invoice_number,
ROW_NUMBER() OVER (PARTITION BY invoice_amount, invoice_date ORDER BY invoice_id) AS invoice_row
FROM invoices
)
SELECT invoice_id, invoice_amount, invoice_date, invoice_number, payment_id
FROM invoices_cte
LEFT JOIN payments_cte
ON payment_amount = invoice_amount
AND payment_date = invoice_date
AND payment_row = invoice_row
ORDER BY invoice_id, payment_id
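The query uses spelled-out names (invoices, payment_id and so on), while the question's tables are inv and pay. The following sketch applies the same ROW_NUMBER pairing to the question's data under the question's names, run through Python's sqlite3 as a stand-in for PostgreSQL; the "2016" schema prefix is dropped as a simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inv (inv_id INTEGER, inv_amount INTEGER, inv_date TEXT, inv_number INTEGER)")
conn.execute("CREATE TABLE pay (pay_id INTEGER, pay_amount INTEGER, pay_date TEXT)")
conn.executemany("INSERT INTO inv VALUES (?, ?, ?, ?)", [
    (1, 10, "2018-01-01", 1), (2, 16, "2018-01-01", 1), (3, 12, "2018-02-02", 2),
    (4, 14, "2018-02-03", 3), (5, 19, "2018-02-04", 3), (6, 19, "2018-02-04", 5),
    (7, 5, "2018-02-04", 6), (8, 40, "2018-02-04", 7), (9, 19, "2018-02-04", 8),
    (10, 19, "2018-02-05", 9), (11, 20, "2018-02-05", 10), (12, 20, "2018-02-07", 11),
])
conn.executemany("INSERT INTO pay VALUES (?, ?, ?)", [
    (1, 10, "2018-01-01"), (2, 12, "2018-02-02"), (4, 19, "2018-02-04"),
    (3, 14, "2018-02-03"), (5, 5, "2018-02-04"), (6, 19, "2018-02-04"),
    (7, 19, "2018-02-05"), (8, 20, "2018-02-07"),
])

query = """
WITH pay_cte AS (
    SELECT pay_id, pay_amount, pay_date,
           ROW_NUMBER() OVER (PARTITION BY pay_amount, pay_date ORDER BY pay_id) AS pay_row
    FROM pay
), inv_cte AS (
    SELECT inv_id, inv_amount, inv_date,
           ROW_NUMBER() OVER (PARTITION BY inv_amount, inv_date ORDER BY inv_id) AS inv_row
    FROM inv
)
SELECT inv_id, pay_id
FROM inv_cte
LEFT JOIN pay_cte
  ON pay_amount = inv_amount
 AND pay_date = inv_date
 AND pay_row = inv_row   -- n-th payment pairs with n-th invoice of the same amount and date
ORDER BY inv_id
"""
# inv_id -> pay_id (None when the invoice has no payment)
matches = dict(conn.execute(query).fetchall())

for inv_id, pay_id in matches.items():
    print(inv_id, pay_id)
```

On this data every payment is used exactly once. The second payment of 19 on 2018-02-04 (pay_id 6) lands on inv_id 6 and inv_id 9 stays empty, which follows the "first matching invoice" rule but differs from the desired table in the question, where pay_id 6 sits on inv_id 9.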
I have written several different SELECT statements for this one project because of the different types of reporting, but now I have an interesting scenario. I figured it would be far more common, so maybe I'm just not using the right terminology.
My latest hurdle is that I am trying to combine 2 tables so that the data from the second table shows up as extra rows in the same columns, not as extra columns in the same row.
So I have this query that partitions the pick ticket numbers and numbers the rows by when they were scanned:
WITH ticket AS
(
SELECT ticket_trail.PickT_Num
,ticket_trail.ticket_status
,ticket_trail.ID
,cast(ticket_trail.Time_stamp as DateTime)as 'time_stamped',
ROW_NUMBER() OVER(PARTITION BY ticket_trail.PickT_Num ORDER BY
ticket_trail.time_stamp Asc) as RowNum
FROM
ticket_trail
)
SELECT
ticket.RowNum,ticket.PickT_Num AS 'Pick Ticket'
,ticket.ID AS id1
,ticket.ticket_status as Ticket_Status
,ticket.time_stamped as start_time
,Row2.ID id2,ISNULL(Row2.time_stamped,GetDate()) AS "End Time"
,DATEDIFF(MINUTE,ticket.time_stamped,ISNULL(Row2.time_stamped,GetDate())) AS Diff
From
ticket left join ticket AS Row2
ON
ticket.RowNum +1 = Row2.RowNum AND ticket.PickT_Num = Row2.PickT_Num
Here is its output:
RowNum | Pick Ticket | id1 | Ticket_Status | start_time | id2 | End Time | Diff
1 | 4628750 | 65 | Yellow | 2017-11-08 09:24:14.000 | 66 | 2017-11-08 09:24:26.000 | 0
2 | 4628750 | 66 | Green | 2017-11-08 09:24:26.000 | NULL | 2017-11-21 16:33:12.733 | 19149
1 | 4647142 | 78 | Yellow | 2017-11-08 09:28:02.000 | 79 | 2017-11-08 09:28:08.000 | 0
2 | 4647142 | 79 | Flashing | 2017-11-08 09:28:08.000 | 295 | 2017-11-08 14:14:10.000 | 286
3 | 4647142 | 295 | Green | 2017-11-08 14:14:10.000 | NULL | 2017-11-21 16:33:12.733 | 18859
1 | 4647973 | 1 | Blue | 2017-11-08 09:02:04.000 | 21 | 2017-11-08 09:06:05.000 | 4
2 | 4647973 | 21 | Green | 2017-11-08 09:06:05.000 | NULL | 2017-11-21 16:33:12.733 | 19167
1 | 4648017 | 2 | Blue | 2017-11-08 09:02:26.000 | 22 | 2017-11-08 09:05:56.000 | 3
2 | 4648017 | 22 | Green | 2017-11-08 09:05:56.000 | NULL | 2017-11-21 16:33:12.733 | 19168
1 | 4648030 | 41 | Blue | 2017-11-08 09:18:20.000 | 54 | 2017-11-08 09:22:39.000 | 4
2 | 4648030 | 54 | Green | 2017-11-08 09:22:39.000 | NULL | 2017-11-21 16:33:12.733 | 19151
OK, so that query works perfectly, yet it doesn't tell the whole story. I need to add another entry for each PickT_Num, as RowNum 0, taken from another table called Orders_ent that holds a timestamp for when the pick ticket was printed.
So I figured it would have to be a CASE expression, but I'm not sure where to start.
The output can contain NULLs, but basically what I'm looking for is:
RowNum | Pick Ticket | id1 | Ticket_Status | start_time | id2 | End Time | Diff
0 | 4628750 | NULL | Printed | 2017-11-08 09:20:14.000 | 65 | 2017-11-08 09:24:14.000 | 4
1 | 4628750 | 65 | Yellow | 2017-11-08 09:24:14.000 | 66 | 2017-11-08 09:24:26.000 | 0
2 | 4628750 | 66 | Green | 2017-11-08 09:24:26.000 | NULL | 2017-11-21 16:33:12.733 | 19149
The Orders_ent Table looks like this...
UID | Order_num | PickT_Num | Date_Created | Pick_Ticket_Printed_DATE_Time | other crap...
Try adding UNION ALL and ORDER BY:
... -- your query is here
UNION ALL
SELECT
0 RowNum,
e.PickT_Num,
NULL id1,
'Printed' Ticket_Status,
e.Date_Created,
t.ID id2,
e.Pick_Ticket_Printed_DATE_Time,
DATEDIFF(MINUTE,e.Date_Created,ISNULL(e.Pick_Ticket_Printed_DATE_Time,GetDate()))
FROM Orders_ent e
JOIN
(
SELECT PickT_Num,ID
FROM ticket
WHERE RowNum=1
) t
ON e.PickT_Num=t.PickT_Num
ORDER BY 'Pick Ticket',RowNum
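To show the UNION ALL idea end to end, here is a small sketch in Python's sqlite3 rather than SQL Server, so several parts are assumptions: Orders_ent is trimmed to two columns, the printed time 09:20:14 is the one shown in the question's desired output, a fixed :now parameter replaces GetDate(), and DATEDIFF(MINUTE, ...) is approximated by subtracting whole-minute indexes. The "Printed" row here follows the question's desired output (printed time as start, first scan as end), which is a different column choice from the Date_Created start used in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ticket_trail (ID INTEGER, PickT_Num INTEGER, ticket_status TEXT, Time_stamp TEXT);
    INSERT INTO ticket_trail VALUES
        (65, 4628750, 'Yellow', '2017-11-08 09:24:14'),
        (66, 4628750, 'Green',  '2017-11-08 09:24:26');
    -- Trimmed, partly assumed version of Orders_ent
    CREATE TABLE Orders_ent (PickT_Num INTEGER, Pick_Ticket_Printed_DATE_Time TEXT);
    INSERT INTO Orders_ent VALUES (4628750, '2017-11-08 09:20:14');
""")

query = """
WITH ticket AS (
    SELECT PickT_Num, ticket_status, ID, Time_stamp,
           ROW_NUMBER() OVER (PARTITION BY PickT_Num ORDER BY Time_stamp) AS RowNum
    FROM ticket_trail
)
-- Scan rows: each row is paired with the next scan of the same ticket.
SELECT t.RowNum, t.PickT_Num, t.ID AS id1, t.ticket_status, t.Time_stamp AS start_time,
       r2.ID AS id2, COALESCE(r2.Time_stamp, :now) AS end_time,
       CAST(strftime('%s', COALESCE(r2.Time_stamp, :now)) AS INTEGER) / 60
         - CAST(strftime('%s', t.Time_stamp) AS INTEGER) / 60 AS diff
FROM ticket AS t
LEFT JOIN ticket AS r2
       ON r2.RowNum = t.RowNum + 1 AND r2.PickT_Num = t.PickT_Num
UNION ALL
-- Extra row 0: from the printed time up to the first scan of the ticket.
SELECT 0, e.PickT_Num, NULL, 'Printed', e.Pick_Ticket_Printed_DATE_Time,
       t.ID, t.Time_stamp,
       CAST(strftime('%s', t.Time_stamp) AS INTEGER) / 60
         - CAST(strftime('%s', e.Pick_Ticket_Printed_DATE_Time) AS INTEGER) / 60
FROM Orders_ent AS e
JOIN ticket AS t ON t.PickT_Num = e.PickT_Num AND t.RowNum = 1
ORDER BY 2, 1
"""
report = conn.execute(query, {"now": "2017-11-21 16:33:12"}).fetchall()

for row in report:
    print(row)
```

The column lists of both branches have to line up position by position, which is why the second branch supplies a literal 0, NULL and 'Printed' for the columns it has no data for.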
I have a dataset structured such as the one below stored in Hive, call it df:
+-----+-----+----------+--------+
| id1 | id2 | date | amount |
+-----+-----+----------+--------+
| 1 | 2 | 11-07-17 | 0.93 |
| 2 | 2 | 11-11-17 | 1.94 |
| 2 | 2 | 11-09-17 | 1.90 |
| 1 | 1 | 11-10-17 | 0.33 |
| 2 | 2 | 11-10-17 | 1.93 |
| 1 | 1 | 11-07-17 | 0.25 |
| 1 | 1 | 11-09-17 | 0.33 |
| 1 | 1 | 11-12-17 | 0.33 |
| 2 | 2 | 11-08-17 | 1.90 |
| 1 | 1 | 11-08-17 | 0.30 |
| 2 | 2 | 11-12-17 | 2.01 |
| 1 | 2 | 11-12-17 | 1.00 |
| 1 | 2 | 11-09-17 | 0.94 |
| 2 | 2 | 11-07-17 | 1.94 |
| 1 | 2 | 11-11-17 | 1.92 |
| 1 | 1 | 11-11-17 | 0.33 |
| 1 | 2 | 11-10-17 | 1.92 |
| 1 | 2 | 11-08-17 | 0.94 |
+-----+-----+----------+--------+
I wish to partition by id1 and id2, and then order by date descending within each grouping of id1 and id2, and then rank "amount" within that, where the same "amount" on consecutive days would receive the same rank. The ordered and ranked output I'd hope to see is shown here:
+-----+-----+------------+--------+------+
| id1 | id2 | date | amount | rank |
+-----+-----+------------+--------+------+
| 1 | 1 | 2017-11-12 | 0.33 | 1 |
| 1 | 1 | 2017-11-11 | 0.33 | 1 |
| 1 | 1 | 2017-11-10 | 0.33 | 1 |
| 1 | 1 | 2017-11-09 | 0.33 | 1 |
| 1 | 1 | 2017-11-08 | 0.30 | 2 |
| 1 | 1 | 2017-11-07 | 0.25 | 3 |
| 1 | 2 | 2017-11-12 | 1.00 | 1 |
| 1 | 2 | 2017-11-11 | 1.92 | 2 |
| 1 | 2 | 2017-11-10 | 1.92 | 2 |
| 1 | 2 | 2017-11-09 | 0.94 | 3 |
| 1 | 2 | 2017-11-08 | 0.94 | 3 |
| 1 | 2 | 2017-11-07 | 0.93 | 4 |
| 2 | 2 | 2017-11-12 | 2.01 | 1 |
| 2 | 2 | 2017-11-11 | 1.94 | 2 |
| 2 | 2 | 2017-11-10 | 1.93 | 3 |
| 2 | 2 | 2017-11-09 | 1.90 | 4 |
| 2 | 2 | 2017-11-08 | 1.90 | 4 |
| 2 | 2 | 2017-11-07 | 1.94 | 5 |
+-----+-----+------------+--------+------+
I attempted this with the following SQL query:
SELECT
id1,
id2,
date,
amount,
dense_rank() OVER (PARTITION BY id1, id2 ORDER BY date DESC) AS rank
FROM
df
GROUP BY
id1,
id2,
date,
amount
But that query doesn't seem to be doing what I'd like it to as I'm not receiving the output I'm looking for.
It seems like a window function using dense_rank, partition by and order by is what I need but I can't quite seem to get it to give me that sample output that I desire. Any help would be much appreciated! Thanks!
This is quite tricky. I think you need to use lag() to see where the value changes and then do a cumulative sum:
select df.*,
sum(case when prev_amount = amount then 0 else 1 end) over
(partition by id1, id2 order by date desc) as rank
from (select df.*,
lag(amount) over (partition by id1, id2 order by date desc) as prev_amount
from df
) df;
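The lag-then-sum idea can be traced on one partition from the question (id1 = 2, id2 = 2). This is a sketch in Python's sqlite3 instead of Hive, with two assumptions: dates are written in ISO form so that they sort chronologically as text, and the output column is called rnk rather than rank.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (id1 INTEGER, id2 INTEGER, date TEXT, amount REAL)")
conn.executemany("INSERT INTO df VALUES (?, ?, ?, ?)", [
    (2, 2, "2017-11-11", 1.94),
    (2, 2, "2017-11-09", 1.90),
    (2, 2, "2017-11-10", 1.93),
    (2, 2, "2017-11-08", 1.90),
    (2, 2, "2017-11-12", 2.01),
    (2, 2, "2017-11-07", 1.94),
])

query = """
SELECT id1, id2, date, amount,
       -- Add 1 whenever the amount differs from the previous (newer) row;
       -- the first row has prev_amount NULL and therefore also adds 1.
       SUM(CASE WHEN prev_amount = amount THEN 0 ELSE 1 END)
           OVER (PARTITION BY id1, id2 ORDER BY date DESC) AS rnk
FROM (SELECT id1, id2, date, amount,
             LAG(amount) OVER (PARTITION BY id1, id2 ORDER BY date DESC) AS prev_amount
      FROM df) AS with_prev
ORDER BY id1, id2, date DESC
"""
# Keep just (date, rnk) to compare against the desired output.
ranks = [(row[2], row[4]) for row in conn.execute(query)]

for date, rnk in ranks:
    print(date, rnk)
```

The two consecutive 1.90 rows share rank 4, while the 1.94 on 2017-11-07 gets a new rank 5 even though 1.94 already appeared on 2017-11-11. That is the behaviour the desired output asks for and the reason a plain dense_rank over amount would not produce it.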
In SQL I have a history table for the items we have; each record is either an IN or an OUT activity with a quantity. I'm trying to get a running count of how many of each item we have, based on whether the activity is OUT or IN. Here is my final SQL:
SELECT itemid,
activitydate,
activitycode,
SUM(quantity) AS quantity,
SUM(CASE WHEN activitycode = 'IN'
THEN quantity
WHEN activitycode = 'OUT'
THEN -quantity
ELSE 0 END) OVER (PARTITION BY itemid ORDER BY activitydate rows unbounded preceding) AS runningcount
FROM itemhistory
GROUP BY itemid,
activitydate,
activitycode
This results in:
+--------+-------------------------+--------------+----------+--------------+
| itemid | activitydate | activitycode | quantity | runningcount |
+--------+-------------------------+--------------+----------+--------------+
| 1 | 2017-06-08 13:58:00.000 | IN | 1 | 1 |
| 1 | 2017-06-08 16:02:00.000 | IN | 6 | 2 |
| 1 | 2017-06-15 11:43:00.000 | OUT | 3 | 1 |
| 1 | 2017-06-19 12:36:00.000 | IN | 1 | 2 |
| 2 | 2017-06-08 13:50:00.000 | IN | 5 | 1 |
| 2 | 2017-06-12 12:41:00.000 | IN | 4 | 2 |
| 2 | 2017-06-15 11:38:00.000 | OUT | 2 | 1 |
| 2 | 2017-06-20 12:54:00.000 | IN | 15 | 2 |
| 2 | 2017-06-08 13:52:00.000 | IN | 5 | 3 |
| 2 | 2017-06-12 13:09:00.000 | IN | 1 | 4 |
| 2 | 2017-06-15 11:47:00.000 | OUT | 1 | 3 |
| 2 | 2017-06-20 13:14:00.000 | IN | 1 | 4 |
+--------+-------------------------+--------------+----------+--------------+
I want the end result to look like this:
+--------+-------------------------+--------------+----------+--------------+
| itemid | activitydate | activitycode | quantity | runningcount |
+--------+-------------------------+--------------+----------+--------------+
| 1 | 2017-06-08 13:58:00.000 | IN | 1 | 1 |
| 1 | 2017-06-08 16:02:00.000 | IN | 6 | 7 |
| 1 | 2017-06-15 11:43:00.000 | OUT | 3 | 4 |
| 1 | 2017-06-19 12:36:00.000 | IN | 1 | 5 |
| 2 | 2017-06-08 13:50:00.000 | IN | 5 | 5 |
| 2 | 2017-06-12 12:41:00.000 | IN | 4 | 9 |
| 2 | 2017-06-15 11:38:00.000 | OUT | 2 | 7 |
| 2 | 2017-06-20 12:54:00.000 | IN | 15 | 22 |
| 2 | 2017-06-08 13:52:00.000 | IN | 5 | 27 |
| 2 | 2017-06-12 13:09:00.000 | IN | 1 | 28 |
| 2 | 2017-06-15 11:47:00.000 | OUT | 1 | 27 |
| 2 | 2017-06-20 13:14:00.000 | IN | 1 | 28 |
+--------+-------------------------+--------------+----------+--------------+
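You want sum(sum()), because this is an aggregation query. The window function is evaluated after the GROUP BY, so its argument has to be an aggregate itself: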
SELECT itemid, activitydate, activitycode,
SUM(quantity) AS quantity,
SUM(SUM(CASE WHEN activitycode = 'IN' THEN quantity
WHEN activitycode = 'OUT' THEN -quantity
ELSE 0
END)
) OVER (PARTITION BY itemid ORDER BY activitydate ) AS runningcount
FROM itemhistory
GROUP BY itemid, activitydate, activitycode
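Here is a minimal sketch of the nested aggregate for itemid 1, using Python's sqlite3 as a stand-in for the asker's database. The raw itemhistory rows are not shown in the question, so the rows below are assumed; in particular the quantity of 6 is split into two hypothetical rows with the same timestamp to show the GROUP BY doing some work before the running sum is taken.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE itemhistory (itemid INTEGER, activitydate TEXT,"
    " activitycode TEXT, quantity INTEGER)"
)
conn.executemany("INSERT INTO itemhistory VALUES (?, ?, ?, ?)", [
    (1, "2017-06-08 13:58:00", "IN", 1),
    # Hypothetical split of the quantity-6 row into two raw rows
    (1, "2017-06-08 16:02:00", "IN", 4),
    (1, "2017-06-08 16:02:00", "IN", 2),
    (1, "2017-06-15 11:43:00", "OUT", 3),
    (1, "2017-06-19 12:36:00", "IN", 1),
])

query = """
SELECT itemid, activitydate, activitycode,
       SUM(quantity) AS quantity,
       -- Inner SUM: signed quantity per group; outer SUM ... OVER: running total
       SUM(SUM(CASE WHEN activitycode = 'IN' THEN quantity
                    WHEN activitycode = 'OUT' THEN -quantity
                    ELSE 0 END))
           OVER (PARTITION BY itemid ORDER BY activitydate) AS runningcount
FROM itemhistory
GROUP BY itemid, activitydate, activitycode
ORDER BY itemid, activitydate
"""
running = conn.execute(query).fetchall()

for row in running:
    print(row)
```

The running counts come out as 1, 7, 4, 5, matching the desired rows for itemid 1 in the question.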