Get the running unique count of items till a given date, similar to running total but instead a running unique count - sql

I have a table with user shopping data as shown below
I want an output similar to a running total, but instead I want the running count of the unique categories that the user has shopped for, by date.
I know I have to make use of ROWS PRECEDING AND FOLLOWING in the count function, but I am not able to use count(distinct category) in a window function.
Dt category userId
4/10/2022 Grocery 123
4/11/2022 Grocery 123
4/12/2022 MISC 123
4/13/2022 SERVICES 123
4/14/2022 RETAIl 123
4/15/2022 TRANSP 123
4/20/2022 GROCERY 123
Desired output
Dt userID number of unique categories
4/10/2022 123 1
4/11/2022 123 1
4/12/2022 123 2
4/13/2022 123 3
4/14/2022 123 4
4/15/2022 123 5
4/20/2022 123 5

Consider the approach below:
select Dt, userId,
  ( select count(distinct category)
    from t.categories as category
  ) number_of_unique_categories
from (
  select *, array_agg(lower(category)) over(partition by userId order by Dt) categories
  from your_table
) t
If applied to the sample data in your question, the output matches the desired output above.
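An alternative sketch that avoids the correlated subquery over the array (assuming BigQuery and the same your_table as above; lower() keeps the match case-insensitive): flag the first time each category appears per user, then take a running sum of the flags.
-- Alternative sketch (assumption: BigQuery, table named your_table as above).
-- rn = 1 marks the first occurrence of each category per user;
-- the running sum of those flags is the running distinct count.
select Dt, userId,
  sum(case when rn = 1 then 1 else 0 end)
    over(partition by userId order by Dt) as number_of_unique_categories
from (
  select *,
    row_number() over(partition by userId, lower(category) order by Dt) as rn
  from your_table
) t
For the sample data this yields the same 1, 1, 2, 3, 4, 5, 5 progression as the desired output.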


Distribute large quantities over multiple rows

I have a simple Order table; one order can have different products, each with a Quantity and its product weight, as below:
OrderID  ProductName  Qty  Weight
101      ProductA     2    24
101      ProductB     1    24
101      ProductC     1    48
101      ProductD     1    12
101      ProductE     1    12
102      ProductA     5    60
102      ProductB     1    12
I am trying to partition and group the products in such a way that, for an order, the grouped products' weight does not exceed 48.
The expected output looks as below:
OrderID  ProductName  Qty  Weight  GroupedID
101      ProductA     2    24      1
101      ProductB     1    24      1
101      ProductC     1    48      2
101      ProductD     1    12      3
101      ProductE     1    12      3
102      ProductA     4    48      1
102      ProductA     1    12      2
102      ProductB     1    12      2
Kindly let me know if this is possible.
Thank you.
This is a bin packing problem, which is non-trivial in general. It's not just NP-complete but superexponential, i.e. the time required grows worse than exponentially as the problem size increases. Dai posted a link to Hugo Kornelis's article series, which is referenced by everyone trying to solve this problem. The set-based solution performs really badly. For realistic scenarios you need iteration and, preferably, bin packing libraries, e.g. in Python.
For production work it would be better to take advantage of SQL Server 2017+'s support for Python scripts and use a bin packing library like Google's OR Tools or the binpacking module. Even if you don't want to use sp_execute_external_script you can use a Python script to read the data from the database and split them.
The question's numbers are so regular, though, that you can cheat a bit (actually quite a lot): distribute all order lines into individual items, calculate the running total per order, and then divide the total by the limit to produce the group number.
This works only because the running totals are guaranteed to align with the bin size.
Distributing into items can be done using a Tally/Numbers table, a table with a single Number column storing numbers from 0 to eg 1M.
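If no Numbers table exists yet, here is a minimal sketch for building one (the name Numbers and the 0..999999 range are assumptions; any tally-table technique works):
-- Minimal sketch: a Numbers table holding 0..999999, built with a recursive CTE.
-- For large ranges a set-based generator (or GENERATE_SERIES on SQL Server 2022+) is faster.
create table Numbers(Number int primary key);

with n as (
    select 0 as Number
    union all
    select Number + 1 from n where Number < 999999
)
insert into Numbers(Number)
select Number from n
option (maxrecursion 0);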
Given the question's data:
create table #OrderItems(id int identity(1,1) primary key, OrderID int, ProductName varchar(20), Qty int, Weight int);
insert into #OrderItems(OrderId,ProductName,Qty,Weight)
values
(101,'ProductA',2,24),
(101,'ProductB',1,24),
(101,'ProductC',1,48),
(101,'ProductD',1,12),
(101,'ProductE',1,12),
(102,'ProductA',5,60),
(102,'ProductB',1,12);
The following query splits each order item into individual items: it repeats each order item row as many times as there are individual items and calculates the individual item weight.
select o.*, Weight/Qty as ItemWeight
from #OrderItems o inner join Numbers ON Qty > Numbers.Number;
This row:
1 101 ProductA 2 24
Becomes
1 101 ProductA 2 24 12
1 101 ProductA 2 24 12
Calculating the running total inside a query can be done with:
SUM(ItemWeight) OVER(Partition By OrderId
Order By Itemweight
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
The Order By ItemWeight clause means the smallest items are picked first, i.e. it's a Worst fit algorithm.
The overall query calculating the total and Group ID is
with items as (
    select o.*, Weight/Qty as ItemWeight
    from #OrderItems o INNER JOIN Numbers ON Qty > Numbers.Number
)
select Id, OrderId, ProductName, Qty, Weight, ItemWeight,
    ceiling(SUM(ItemWeight) OVER(Partition By OrderId
                                 Order By ItemWeight
                                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) / 48.0) As GroupId
from items;
After that, individual items need to be grouped back into order items and groups. This produces the final query:
with items as (
    select o.*, Weight/Qty as ItemWeight
    from #OrderItems o INNER JOIN Numbers ON Qty > Numbers.Number
)
, bins as (
    select Id, OrderId, ProductName, Qty, Weight, ItemWeight,
        ceiling(SUM(ItemWeight) OVER(Partition By OrderId
                                     Order By ItemWeight
                                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) / 48.0) As GroupId
    from items
)
select
    max(OrderId) as OrderId,
    max(ProductName) as ProductName,
    count(*) as Qty,
    sum(ItemWeight) as Weight,
    max(GroupId) as GroupId
from bins
group by Id, GroupId
order by OrderId, GroupId;
This returns
orderid  ProductName  Qty  Weight  GroupId
101      ProductA     2    24      1
101      ProductD     1    12      1
101      ProductE     1    12      1
101      ProductB     1    24      2
101      ProductC     1    48      3
102      ProductA     4    48      1
102      ProductA     1    12      2
102      ProductB     1    12      2

How to Select ID's in SQL (Databricks) in which at least 2 items from a list are present

I'm working with patient-level data in Azure Databricks and I'm trying to build out a cohort of patients that have at least 2 diagnoses from a list of specific diagnosis codes. This is essentially what the table looks like:
CLAIM_ID | PTNT_ID | ICD_CD | DATE
---------+---------+--------+------------
1 101 2500 01_25_2020
2 101 3850 03_13_2018
3 222 2500 10_26_2018
4 222 8888 11_30_2018
5 222 9155 04_01_2019
6 871 2500 02_17_2020
7 871 3200 09_09_2019
The list of ICD_CD codes of interest is something like [2500, 3850, 8888]. In this case, I would want to return TOTAL UNIQUE PTNT_ID = 2. These would be PTNT_ID = (101, 222) as these are the only two patients that have at least 2 ICD_CD codes of interest.
When I use something like this, I'm able to return all of the relevant PTNT_ID values, but I'm not able to get the total count of these PTNT_ID:
select mc.PTNT_ID
from MEDICAL_CLAIMS mc
where mc.ICD_CD in ( /* list of ICD_CD of interest */
)
group by mc.PTNT_ID
having count(distinct mc.ICD_CD) >= 2
When I try to add a COUNT statement in, it returns an error
Just select from the query:
select count(*)
from
(
    select mc.PTNT_ID
    from MEDICAL_CLAIMS mc
    where mc.ICD_CD in ( /* list of ICD_CD of interest */ )
    group by mc.PTNT_ID
    having count(distinct mc.ICD_CD) >= 2
) ptnts;
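Plugging in the example codes from the question (2500, 3850, 8888) and filtering on the diagnosis column, a complete sketch might look like this; for the sample data it returns 2, matching TOTAL UNIQUE PTNT_ID = 2:
-- Sketch with the question's example codes; adjust the IN list
-- (and add quotes if ICD_CD is stored as a string).
select count(*) as total_unique_ptnt_id
from
(
    select mc.PTNT_ID
    from MEDICAL_CLAIMS mc
    where mc.ICD_CD in (2500, 3850, 8888)
    group by mc.PTNT_ID
    having count(distinct mc.ICD_CD) >= 2
) ptnts;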

Need Distinct address, ID etc with the different amount in one table using Plsql

Need help on the below scenario, please.
I want distinct address, ID, etc. with the different amounts in one table, using PL/SQL or plain SQL.
For example below is the current table
Address           aRea  zipcode  ID       Amount  amount2  qua  number
123 Howe's drive  AL    1234     1234567  100     20       1    666666
123 Howe's drive  AL    1234     1234567  5       05       2    abcccc
123 east drive    AZ    456      8910112  200     11       1    777777
123 east drive    AZ    456      8910112  5       5        2    SDN133
116 WOOD Ave      NL    1234     2325890  3.23    1.25     1    10483210
116 WOOD Ave      NL    1234     2325890  3.24    1.26     2    10483211
I need the output as below.
Address           aRea  zipcode  ID       Amount  amount2  qua  number
123 Howe's drive  AL    1234     1234567  100     20       1    666666
                                          5       05       2    abcccc
123 east drive    AZ    456      8910112  200     11       1    777777
                                          5       5        2    SDN133
116 WOOD Ave      NL    1234     2325890  3.23    1.25     1    10483210
                                          3.24    1.26     2    10483211
Below is for BigQuery Standard SQL
I would recommend the approach below:
#standardSQL
SELECT address, area, zipcode, id,
ARRAY_AGG(STRUCT(amount, amount2, qua, number)) info
FROM `project.dataset.table`
GROUP BY address, area, zipcode, id
If applied to the sample data from your question, the output is one row per address / area / zipcode / id, with the amount, amount2, qua and number values nested in the info array.
This type of task is usually better done on the application side.
You can do this with SQL - but you need a column to order the records consistently:
select
case when rn = 1 then address end as address,
case when rn = 1 then area end as area,
case when rn = 1 then zipcode end as zipcode,
case when rn = 1 then id end as id,
amount,
amount2,
qua,
number
from (
select
t.*,
row_number() over(
partition by address, area, zipcode, id
order by ??
) rn
from mytable t
) t
order by address, area, zipcode, id, ??
The partition by clause of the window function lists the columns that you want to "group" together; you can modify it as you really need.
The order by clause of row_number() indicates how rows should be sorted within the partition: you need to decide which column (or set of columns) you want to use, as in the sketch below. For the output to make sense, you also need an order by clause in the query, where the partition and ordering columns are repeated.
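For instance, if qua is the column that defines the row order within each group (an assumption based on the sample data, where qua runs 1, 2), the ?? placeholders could be filled like this; the outer order by uses qualified t.* columns so the masked display values do not affect the sort:
-- Sketch assuming qua orders the rows within each address/ID group.
-- Note: if the column is literally named number, Oracle may require quoting it as "number".
select
    case when rn = 1 then address end as address,
    case when rn = 1 then area end as area,
    case when rn = 1 then zipcode end as zipcode,
    case when rn = 1 then id end as id,
    amount,
    amount2,
    qua,
    number
from (
    select
        t.*,
        row_number() over(
            partition by address, area, zipcode, id
            order by qua
        ) rn
    from mytable t
) t
order by t.address, t.area, t.zipcode, t.id, t.qua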

Oracle SQL Help Data Totals

I am on Oracle 12c and need help with a simple query.
Here is the sample data of what I currently have:
Table Name: customer
Table DDL
create table customer(
customer_id varchar2(50),
name varchar2(50),
activation_dt date,
space_occupied number(50)
);
Sample Table Data:
customer_id name activation_dt space_occupied
abc abc-001 2016-09-12 20
xyz xyz-001 2016-09-12 10
Sample Data Output
The query I am looking for will provide the following:
customer_id name activation_dt space_occupied
abc abc-001 2016-09-12 20
xyz xyz-001 2016-09-12 10
Total_Space null null 30
Here is a slightly hacky approach to this, using the grouping function ROLLUP().
select coalesce(customer_id, 'Total Space') as customer_id
     , name
     , activation_dt
     , sum(space_occupied) as space_occupied
from customer
group by ROLLUP(customer_id, name, activation_dt)
having grouping(customer_id) = 1
    or (grouping(name) + grouping(customer_id) + grouping(activation_dt)) = 0;

CUSTOMER_ID  NAME     ACTIVATIO SPACE_OCCUPIED
------------ -------- --------- --------------
abc          abc-001  12-SEP-16             20
xyz          xyz-001  12-SEP-16             10
Total Space                                 30
ROLLUP() generates intermediate totals for each combination of columns; the verbose HAVING clause filters out the intermediate subtotals, retaining only the detail rows and the grand total.
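If only the detail rows plus the grand total are wanted, a slightly tidier variant of the same idea (a sketch, assuming Oracle as in the question) treats the three columns as a single composite column in ROLLUP, so no HAVING filter is needed:
-- Composite ROLLUP sketch: the groupings are (customer_id, name, activation_dt) and (),
-- i.e. the detail rows plus one grand-total row.
select coalesce(customer_id, 'Total Space') as customer_id
     , name
     , activation_dt
     , sum(space_occupied) as space_occupied
from customer
group by rollup((customer_id, name, activation_dt));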
What you want is a bit unusual: if customer_id were an integer you would have to cast it to a string, etc., but if this is your requirement, it can be achieved this way.
SELECT customer_id,
name,
activation_dt,
space_occupied
FROM
(SELECT 1 AS seq,
customer_id,
name,
activation_dt,
space_occupied
FROM customer
UNION ALL
SELECT 2 AS seq,
'Total_Space' AS customer_id,
NULL AS name,
NULL AS activation_dt,
sum(space_occupied) AS space_occupied
FROM customer
)
ORDER BY seq
Explanation:
Inner query: the first part of the union all selects your resultset from customer with a hardcoded 1 as seq. The second part calculates sum(space_occupied) and hardcodes the other columns, including 2 as seq.
Outer query: selects the data columns and orders by seq, so Total_Space is returned last.
Output
+-------------+---------+---------------+----------------+
| CUSTOMER_ID | NAME | ACTIVATION_DT | SPACE_OCCUPIED |
+-------------+---------+---------------+----------------+
| abc | abc-001 | 12-SEP-16 | 20 |
| xyz | xyz-001 | 12-SEP-16 | 10 |
| Total_Space | null | null | 30 |
+-------------+---------+---------------+----------------+
This seems like a great place to use GROUP BY GROUPING SETS; this is what they were designed for.
SELECT coalesce(Customer_Id,'Total_Space') as Customer_ID
, Name
, Activation_DT
, sum(Space_occupied) space_Occupied
FROM customer
GROUP BY GROUPING SETS ((Customer_ID, Name, Activation_DT, Space_Occupied)
,())
The key thing here is that we are summing space_occupied. The two grouping sets tell the engine to keep each row in its original form and to produce one record with space_occupied summed; since one of the sets is the empty set (), only the aggregated value is returned for that row, along with the constants (the coalesce-hardcoded value for the total!).
The power of this is that if you needed to group by other things as well, you could have multiple grouping sets. Imagine a material with a product division, group and line, and you want a report with sales totals by division, group and line. You could simply group by () to get the grand total, (product_division, product_group, product_line) to get product line totals, (product_division, product_group) to get product group totals, and (product_division) to get product division totals. Pretty powerful stuff for partial cube generation; a sketch follows below.
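A hedged sketch of that partial-cube idea, with a hypothetical sales table (the names sales, product_division, product_group, product_line and sales_amount are illustrative, not from the question):
-- Hypothetical example: line, group and division subtotals plus a grand total in one pass.
select product_division
     , product_group
     , product_line
     , sum(sales_amount) as sales_amount
from sales
group by grouping sets ( (product_division, product_group, product_line)
                       , (product_division, product_group)
                       , (product_division)
                       , () );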

Select all rows based on alternative publisher

I want to list all the rows, alternating publisher, with price ascending; see the example table below.
id publisher price
1 ABC 100.00
2 ABC 150.00
3 ABC 105.00
4 XYZ 135.00
5 XYZ 110.00
6 PQR 105.00
7 PQR 125.00
The expected result would be:
id publisher price
1 ABC 100.00
6 PQR 105.00
5 XYZ 110.00
3 ABC 105.00
7 PQR 125.00
4 XYZ 135.00
2 ABC 150.00
What would be the required SQL?
This should do it:
select id, publisher, price
from (
select id, publisher, price,
row_number() over (partition by publisher order by price) as rn
from publisher
) t
order by rn, publisher, price
The window function assigns sequential numbers to each publisher's prices. Based on that, the outer order by will first display all rows with rn = 1, which are the rows with the lowest price for each publisher. The second row for each publisher has the second lowest price, and so on.
SQLFiddle example: http://sqlfiddle.com/#!4/06ece/2
SELECT id, publisher, price
FROM tbl
ORDER BY row_number() OVER (PARTITION BY publisher ORDER BY price), publisher;
One cannot use the output of window functions in the WHERE or HAVING clauses because window functions are applied after those, but one can use window functions in the ORDER BY clause.
SQL Fiddle.
Not sure what your table name is - I have called it publishertable. But the following will order the result by price in ascending order - which is the result you are looking for:
select id, publisher, price from publishertable order by price asc
If I've got it right, you should use the ROW_NUMBER() function to rank prices inside each publisher and then order by this rank and publisher.
SELECT ID,
Publisher,
Price,
Row_number() OVER (PARTITION BY Publisher ORDER BY Price) as rn
FROM T
ORDER BY RN,Publisher
SQLFiddle demo