Order by descending aggregation within window function in PostgreSQL - sql

I have a dataset that features duplicate values of the primary variable, something like the following:
col1 col2 counts
110 False 1
111 False 2
111 False 1
112 True 3
112 False 2
112 False 1
113 False 1
114 False 1
115 False 2
115 False 1
116 False 1
117 False 1
118 False 4
118 False 3
118 False 2
118 False 1
I have achieved this by using the following code:
SELECT DISTINCT ctm_nbr
,col1
,col2
,RANK () OVER (PARTITION BY col1 ORDER BY col2) AS counts
FROM my_table
GROUP BY 1,2,3
ORDER BY ctm_nbr, row_numb DESC
However, my desired output needs to be ordered so that counts descends while rows stay grouped by col1, so that I can see, for example, which value of col1 has the highest count. Like this...
col1 col2 counts
118 False 4
118 False 3
118 False 2
118 False 1
112 True 3
112 False 2
112 False 1
115 False 2
115 False 1
111 False 2
111 False 1
110 False 1
113 False 1
114 False 1
116 False 1
117 False 1
I have tried various iterations of the final ORDER BY clause but just can't quite produce the output I need. Guidance appreciated.

You can use window functions in the ORDER BY. I think you just want:
ORDER BY COUNT(*) OVER (PARTITION BY ctm_nbr) DESC,
         ctm_nbr,
         row_numb DESC
This assumes that the count is the maximum value of row_numb. So you can also express this as:
ORDER BY MAX(row_numb) OVER (PARTITION BY ctm_nbr) DESC,
         ctm_nbr,
         row_numb DESC
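Putting it together, a complete query might look like the following (a sketch only, assuming the table and column names from the question; grp_size is an illustrative alias, and the subquery avoids mixing SELECT DISTINCT with window expressions in the ORDER BY):
SELECT col1, col2, counts
FROM (SELECT col1,
             col2,
             RANK() OVER (PARTITION BY col1 ORDER BY col2) AS counts,
             COUNT(*) OVER (PARTITION BY col1) AS grp_size  -- rows per col1 group
      FROM my_table) t
ORDER BY grp_size DESC,  -- largest groups first
         col1,           -- keep each col1 group together
         counts DESC;    -- highest count at the top of each group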

Related

How to count if there does not exist TRUE in the same category?

Assume I have two tables:
cameraNum  roadNum  isWorking
100        1        TRUE
101        1        FALSE
102        1        TRUE
103        3        FALSE
104        3        FALSE
105        7        TRUE
106        7        TRUE
107        7        TRUE
108        9        FALSE
109        9        FALSE
110        9        FALSE

roadNum  length
1        90
3        140
7        110
9        209
I want to select a table like this:
If a road has no working camera, I put it in the result table.
roadNum  length
3        140
9        209
I tried this below:
SELECT r.roadNum, r.length
FROM Cameras c, Road r
WHERE c.isWorking = FALSE
AND r.roadNum = c.roadNum
But this code only filters for roads where at least one camera has isWorking = FALSE:
roadNum  length
1        90
3        140
9        209
You want roads where none of the cameras are working. Here is one way to do it with aggregation and having:
select r.*
from road r
inner join camera c on c.roadNum = r.roadNum
group by r.roadNum
having not bool_or(isWorking)
Demo on DB Fiddle
roadnum  length
3        140
9        209
Regarding using not exists: yes, you can use it. The following uses a CTE to get only the roadnum of the roads satisfying the camera requirement, then joins that to road (see demo):
with no_working_camera (roadnum) as
  ( select distinct on (c1.roadnum)
           c1.roadnum
      from cameras c1
     where not c1.isworking
       and not exists (select null
                         from cameras c2
                        where c2.roadnum = c1.roadnum
                          and c2.isworking)
     order by c1.roadnum
  )
select r.*
from no_working_camera nwc
join road r
on nwc.roadnum = r.roadnum;
Don't join, don't aggregate, just use NOT IN or NOT EXISTS in order to find roads that don't have a working camera:
select *
from road
where roadnum not in (select roadnum from cameras where isworking);
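The NOT EXISTS version of the same idea (a sketch against the same road and cameras tables):
select *
from road r
where not exists (select 1
                  from cameras c
                  where c.roadnum = r.roadnum
                    and c.isworking);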

How to pick a single row from multiple rows in pandas

If I have two rows for the same ID, I have to check Col2 and pick the rows with values N and Q, skipping the row with U. If there is a single record with Col2 = U, leave it as is. So for IDs 123 and 555, the output rows have Col2 = N and Q respectively.
ID Col1 Col2 Col3
123 AAA N true
123 BBB U true
000 AAA N true
222 CCC U false
555 FIC Q false
555 VAN U true
Expected output:
ID Col1 Col2 Col3
123 AAA N true
000 AAA N true
222 CCC U false
555 FIC Q false
How can I do this in pandas?
In SQL, I tried HAVING COUNT(*) > 1 and then picked those columns.
You can use this code:
df.drop_duplicates('ID')
The code above always keeps the first record. You can keep the last record instead:
df.drop_duplicates(subset='ID', keep="first")
df.drop_duplicates(subset='ID', keep="last")
Alternatively, you can sort by any column first and then use drop_duplicates: sorting ascending or descending with keep="first" keeps the row with the minimum or maximum value, as sketched below.
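A minimal sketch of that idea on the question's data (here plain alphabetical order already pushes the 'U' rows last, since 'N' < 'Q' < 'U'):
import pandas as pd

df = pd.DataFrame({'ID':   [123, 123, 0, 222, 555, 555],
                   'Col1': ['AAA', 'BBB', 'AAA', 'CCC', 'FIC', 'VAN'],
                   'Col2': ['N', 'U', 'N', 'U', 'Q', 'U'],
                   'Col3': [True, True, True, False, False, True]})

# Sort so the 'U' rows sink to the bottom, keep the first row per ID,
# then restore the original row order.
out = (df.sort_values('Col2')
         .drop_duplicates(subset='ID', keep='first')
         .sort_index())
print(out)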
One simple approach is to sort your dataframe by Col2 ensuring that 'U' will end up last. There are several possibilities:
pandas.Categorical
This sets an ordered categorical type on Col2
import numpy as np
import pandas as pd

categories = np.append(np.setdiff1d(df['Col2'], ['U']), ['U'])
df['Col2'] = pd.Categorical(df['Col2'], categories=categories, ordered=True)
df.sort_values(by='Col2').groupby('ID').first()
Split dataframe
This splits the dataframe in two based on the values of Col2 (not-'U' and 'U'), and concatenates the two parts to ensure the 'U' rows are at the end
pd.concat([df.query('Col2 != "U"'), df.query('Col2 == "U"')]).groupby('ID').first()
Custom sort order
This manually defines the sorting order from a list
custom_order = ['N', 'Q', 'Z', 'U']
custom_order_dict = dict(zip(custom_order, range(len(custom_order))))
df.sort_values(by='Col2', key=lambda x: x.map(custom_order_dict)).groupby('ID').first()
input
ID Col1 Col2 Col3
0 123 AAA N True
1 123 BBB U True
2 0 AAA N True
3 222 CCC U False
4 555 FIC Q False
5 555 VAN U True
6 777 UUU U False
7 777 ZZZ Z True
8 999 UUU U False
9 999 NNN N True
output
Col1 Col2 Col3
ID
000 AAA N True
123 AAA N True
222 CCC U False
555 FIC Q False
777 ZZZ Z True
999 NNN N True
I tried a solution with multiple steps. This might not be the best way to do it, but I did not find any other solution.
First step:
Separate the rows whose ID occurs more than once:
df_multiple_record = pd.concat(g for _, g in df.groupby("ID") if len(g) > 1)
Output:
ID Col1 Col2 Col3
123 AAA N true
123 BBB U true
555 FIC Q false
555 VAN U true
Second step:
Drop the records with Col2 = 'U':
df_drop_U = df_multiple_record[df_multiple_record['Col2'] != 'U']
output:
ID Col1 Col2 Col3
123 AAA N true
555 FIC Q false
Third step:
Drop duplicates on ID (keep=False) from the main dataframe to get the records whose ID occurs only once:
df_single_record = df.drop_duplicates(subset=['ID'], keep=False)
output:
ID Col1 Col2 Col3
000 AAA N true
222 CCC U false
Fourth step:
Concatenate the single-record dataframe with the dataframe where we dropped the 'U' rows:
df_final = pd.concat([df_single_record, df_drop_U], ignore_index=True)
output:
ID Col1 Col2 Col3
000 AAA N true
222 CCC U false
123 AAA N true
555 FIC Q false

Create a flag column based on condition pandas

I have a dataframe as shown below
ID Price Duration
1 100 60
2 200 2
3 1 366
4 1 365
I would like to create flag columns based on conditions on the Price and Duration columns.
Steps:
If Price is less than 20, flag it as False; else flag it as True.
If Duration is less than 30, flag it as False; else flag it as True.
Expected Output:
ID Price Duration Price_Flag Duration_Flag
1 100 60 True True
2 200 2 True False
3 1 366 False True
4 1 365 False True
One idea is to compare against a list, ordered as the column names ['Price', 'Duration'], with DataFrame.gt:
df[['Price_Flag','Duration_Flag']] = df[['Price','Duration']].gt([20,30])
Or use Series.gt for each column separately:
df['Price_Flag'] = df['Price'].gt(20)
df['Duration_Flag'] = df['Duration'].gt(30)
Or use DataFrame.assign:
df = df.assign(Price_Flag=df['Price'].gt(20),
               Duration_Flag=df['Duration'].gt(30))
print (df)
ID Price Duration Price_Flag Duration_Flag
0 1 100 60 True True
1 2 200 2 True False
2 3 1 366 False True
3 4 1 365 False True

How to assign multiple output values of a function to multiple new columns of a dataframe?

I have the following function:
def sum(x):  # note: this name shadows Python's built-in sum
    oneS = x.iloc[0:len(x)//10].agg('sum')
    twoS = x.iloc[len(x)//10:2*len(x)//10].agg('sum')
    threeS = x.iloc[2*len(x)//10:3*len(x)//10].agg('sum')
    fourS = x.iloc[3*len(x)//10:4*len(x)//10].agg('sum')
    fiveS = x.iloc[4*len(x)//10:5*len(x)//10].agg('sum')
    sixS = x.iloc[5*len(x)//10:6*len(x)//10].agg('sum')
    sevenS = x.iloc[6*len(x)//10:7*len(x)//10].agg('sum')
    eightS = x.iloc[7*len(x)//10:8*len(x)//10].agg('sum')
    nineS = x.iloc[8*len(x)//10:9*len(x)//10].agg('sum')
    tenS = x.iloc[9*len(x)//10:len(x)].agg('sum')  # was len(x)//10, which made this slice empty
    return [oneS, twoS, threeS, fourS, fiveS, sixS, sevenS, eightS, nineS, tenS]
How do I assign the outputs of this function to columns of a dataframe (which already exists)?
The dataframe I am applying the function to is below:
Cycle Type Time
1 1 101
1 1 102
1 1 103
1 1 104
1 1 105
1 1 106
9 1 101
9 1 102
9 1 103
9 1 104
9 1 105
9 1 106
The dataframe I want to add the columns to is shown below; the new columns OneS, TwoS, ... should be added as shown and filled with the results of the function.
Cycle Type OneS TwoS ThreeS
1 1
9 1
8 1
10 1
3 1
5 2
6 2
7 2
If I write a function for just one value and apply it like the following, it is possible:
grouped_data['fm'] = data_train_bel1800.groupby(['Cycle', 'Type'])['Time'].apply(lambda x: fm(x))
But I want to do it all at once so that it is neat and clear.
You can use:
def f(x):
    out = []
    for i in range(10):
        out.append(x.iloc[i*len(x)//10:(i+1)*len(x)//10].agg('sum'))
    return pd.Series(out)
df1 = (data_train_bel1800.groupby(['Cycle', 'Type'])['Time']
         .apply(f)
         .unstack()
         .add_prefix('new_')
         .reset_index())
print (df1)
Cycle Type new_0 new_1 new_2 new_3 new_4 new_5 new_6 new_7 new_8 \
0 1 1 0 101 102 205 207 209 315 211 211
1 9 1 0 101 102 205 207 209 315 211 211
new_9
0 106
1 106

SQL query - selecting the correct row from groups specified by id

id risk origin strength strength_sol
13456 1 1 3 3
13456 134 0 5 NULL
13456 128 0 7 NULL
13456 121 0 5 NULL
13456 122 0 4 NULL
13456 190 0 2 NULL
22367 1 1 5 5
22367 128 0 4 NULL
22367 1 0 2 NULL
22367 36 0 6 NULL
12789 1 1 5 5
12789 1 0 4 NULL
12789 118 1 2 NULL
12789 118 1 5 NULL
12789 1 0 7 NULL
16908 1 0 5 5
16908 36 0 4 NULL
16908 28 1 3 NULL
16908 128 1 5 NULL
16908 1 0 7 NULL
12439 1 0 4 4
12439 134 0 2 NULL
12439 16 0 5 NULL
15678 36 0 4 NULL
15678 28 0 2 NULL
15678 134 0 5 NULL
Problem and data description:
I have a big dataset. Above you can see just a small sample in order to describe my problem.
I need to choose exactly one row for each id.
In the dataset above there are all the possible cases that can happen.
The last two columns are not a part of the data set. They are the result that I need to get.
Origin is a 0/1 variable.
I need to choose this, for one id:
Situation 1: when risk = 1 and origin = 1, I'm fine; I will take this row. There can be only one such row per id in the dataset.
Situation 2: when there is no row with risk = 1 and origin = 1 for the id, I have to choose a row where risk = 1 and origin = 0. If there are several such rows, it doesn't matter which one I choose (but I have to choose only ONE of them, not all of them).
Situation 3: when no row for the id has risk = 1 (no matter what the value of origin is), I simply put NULL as strength_sol.
My solution is like this (but it is not correct):
case when risk = 1 and origin = 1 then strength
     when risk = 1 and origin = 0 then strength
     else NULL
end as strength_sol
This solution is not correct because in situation 1 it can happen that there is also a row with risk = 1 and origin = 0, and I'm not interested in that row (I want NULL for that row).
You can use the row_number function to number the rows so that the first row in each group is the one with the highest priority (in this case risk = 1 and origin = 1) and the second has the next highest priority (risk = 1 and origin = 0). All other rows are numbered arbitrarily, and then you choose the first row from each group.
select id, risk, origin, strength,
       case when rnum = 1 then strength end as strength_sol
from (select t.*,
             row_number() over (partition by id
                                order by case when risk = 1 and origin = 1 then 1
                                              when risk = 1 and origin = 0 then 2
                                              else 3
                                         end) as rnum
      from t
     ) x
This is the final code :) The added risk = 1 check ensures that an id with no risk = 1 row at all (such as 15678) gets NULL even for its rnum = 1 row:
SELECT
id
,risk
,origin
,strength
,CASE
WHEN (X.rnum = 1 AND risk = 1) THEN strength
ELSE NULL
END AS strength_sol
FROM
(
SELECT
t.*
,ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY CASE
WHEN risk = 1 AND
origin = 1 THEN 1
WHEN risk = 1 AND
origin = 0 THEN 2
ELSE 3
END
) rnum
FROM
t
) AS [X]