Group by and provide groups only if unique in group - sql

i have the Following dataset :
Amount Document Number
0 200 12345
1 90 2222
2 200 456789
3 90 4444
4 300 4789
5 300 4789
So basically i want to get group numbers for the above data (using ngroup maybe)
Grouping the data on the basis of amount. assign a group number to one group only if the Document numbers in that group has unique numbers.
This is what i would like the outcome to be.
Amount Document Number Group
0 200 12345 1
1 90 2222 2
2 200 456789 1
3 90 4444 2
4 300 4789
5 300 4789

Grouping the data on the basis of amount. assign the rows to one group only if the Document number is a unique number.
I think you want rank():
select t.*, rank() over (order by amount, document_number) as grouping
from t;

In pandas, you could first create a mask where any group by amount has a dup is flagged as False with groupby.transform and duplicated, then use this mask and groupby.ngroup like:
mask_dup = ~(df.duplicated().groupby(df['Amount']).transform(any))
df.loc[mask_dup, 'Group'] = df[mask_dup].groupby('Amount').ngroup()+1
print (df)
Amount Document Number Group
0 200 12345 2.0
1 90 2222 1.0
2 200 456789 2.0
3 90 4444 1.0
4 300 4789 NaN
5 300 4789 NaN
if you have more than the two columns at first you need to specify the subset in duplicated

Related

Average of Rows Based On Another Columns Values

I have a table like this. I'm looking for a clean way in SQL to create a new column with the average between the Column 2 values for the rows where Column 1 equals 1 and 2 for each id.
I have some ideas on gross ways to do this, but I am looking for a straightforward solution since this seems like it should not be too difficult.
ID
Column 1
Column 2
1
1
100
1
2
75
1
3
50
2
1
45
2
2
90
2
3
60
Use window function avg with filtering.
select *,
avg(column2)
filter (where column1 in (1,2))
over (partition by id) as avrg
from the_table;
id
column1
column2
avrg
1
1
100
87.50
1
2
75
87.50
1
3
50
87.50
2
1
45
67.50
2
2
90
67.50
2
3
60
67.50
db-fiddle

How to calculate leftovers of each balance top-up using first in first out technique?

Imagine we have user balances. There's a table with top-up and withdrawals. Let's call it balance_updates.
transaction_id
user_id
current_balance
amount
created_at
1
1
100
100
...
2
1
0
-100
3
2
400
400
4
2
300
-100
5
2
200
-200
6
2
300
100
7
2
50
-50
What I want to get off this is a list of top-ups and their leftovers using the first in first out technique for each user.
So the result could be this
top_up
user_id
leftover
1
1
0
3
2
50
6
2
100
Honestly, I struggle to turn it to SQL. Tho I know how to do it on paper. Got any ideas?

Get average of rows group by value intervals

I have a table as follows:
ID | Value
1 5
1 1000
1 1500
2 1000
2 1800
3 40
3 1000
3 1200
3 2000
3 2500
I want to obtain the average of each ID groupped by a given range r of value. For instance, if in this case r=1000, The expected result would be:
ID | Value
1 5
1 1250
2 1400
3 40
3 1100
3 2250
I have seen that this can be done with time intervals as seen here. My question is, how can I perform this type of group by operation for integer/float types?
You could try this way:
SELECT id, avg(value) as AvgValue
FROM (SELECT id, value, ROUND(value/1000, 0) AS range FROM yourtable) t
GROUP BY id, range

Is it possible to set a dynamic window frame bound in SQL OVER(ROW BETWEEN ...)-Clause?

Consider the following table, describing a patients medication plan. For example, the first row describes that the patient with patient_id = 1 is treated from timestamp 0 to 4. At time = 0, the patient has not yet become any medication (kum_amount_start = 0). At time = 4, the patient has received a kumulated amount of 100 units of a certain drug. It can be assumed, that the drug is given in with a constant rate. Regarding the first row, this means that the drug is given with a rate of 25 units/h.
patient_id
starttime [h]
endtime [h]
kum_amount_start
kum_amount_end
1
0
4
0
100
1
4
5
100
300
1
5
15
300
550
1
15
18
550
700
2
0
3
0
150
2
3
6
150
350
2
6
10
350
700
2
10
15
700
1100
2
15
19
1100
1500
I want to add the two columns "kum_amount_start_last_6hr" and "kum_amount_end_last_6hr" that describe the amount that has been given within the last 6 hours of the treatment (for the respective timestamps start, end).
I'm stuck with this problem for a while now.
I tried to tackle it with something like this
SUM(kum_amount) OVER (PARTITION BY patient_id ROWS BETWEEN "dynmaic window size" AND CURRENT ROW)
but I'm not sure whether this is the right approach.
I would be very happy if you could help me out here, thanks!

Calculate percentiles using SQL for a group/partition

I want to calculate the percentiles for a given partition/group in SQL. For example the input data looks like -
CustID Product ID quantity_purchased
1 111 2
2 111 3
3 111 2
4 111 5
1 222 2
2 222 6
4 222 7
6 222 2
I want to get percentiles on each product ID group. The output should be -
Product ID min 25% 50% 75% max
111 2 2 2.5 3.5 5
222 2 2 4 6.25 7
How to achieve this using SQL?
You can use percentile_cont():
select product_id, min(quantity_purchased), max(quantity_purchased),
percentile_cont(0.25) within group (order by quantity_purchased),
percentile_cont(0.50) within group (order by quantity_purchased),
percentile_cont(0.75) within group (order by quantity_purchased)
from t
group by product_id;