i have the following SQL that I would like to use to plot cumulative distribution plot but i can't seem to get the data right.
Sample Data:
token_Length
Frequency
1
6436
2
7489
3
3724
4
2440
5
667
6
396
7
264
8
215
9
117
10
90
11
61
12
29
13
69
15
40
18
45
How do i prepare this data to create a CDF plot in looker?
So that it looks like
token_Length
Frequency
cume_dist
1
6436
0.291459107
2
7489
0.630604112
3
3724
0.799248256
4
2440
0.909745494
5
667
0.939951091
6
396
0.95788425
7
264
0.969839688
8
215
0.979576125
9
117
0.984874558
10
90
0.988950276
11
61
0.991712707
12
29
0.993025994
13
69
0.996150711
15
40
0.997962141
18
45
1
I have tried a measure as follows:
measure: cume_dist {
type: number
sql: cume_dist() over (order by ${token_length} ASC);;
}
This generates SQL as:
SELECT
token_length,
COUNT(*) AS "count",
cume_dist() over (order by (token_length) ASC) AS "cume_dist"
FROM string_facts
so i have a Dataframe that has a repeating Number Series that i want to group like this:
Number Pattern
Value
Desired Group
Value.1
1
723
1
Max of Group
2
400
1
Max of Group
8
235
1
Max of Group
5
387
2
Max of Group
7
911
2
Max of Group
3
365
3
Max of Group
4
270
3
Max of Group
5
194
3
Max of Group
7
452
3
Max of Group
100
716
4
Max of Group
104
69
4
Max of Group
2
846
5
Max of Group
3
474
5
Max of Group
4
524
5
Max of Group
So essentially the number pattern is always monotonly increasing.
Any Ideas?
You can compare Number Pattern by 1 with cumulative sum by Series.cumsum and then is used GroupBy.transform with max:
df['Desired Group'] = df['Number Pattern'].eq(1).cumsum()
df['Value.1'] = df.groupby('Desired Group')['Value'].transform('max')
print (df)
Number Pattern Value Desired Group Value.1
0 1 723 1 723
1 2 400 1 723
2 3 235 1 723
3 1 387 2 911
4 2 911 2 911
5 1 365 3 452
6 2 270 3 452
7 3 194 3 452
8 4 452 3 452
9 1 716 4 716
10 2 69 4 716
11 1 846 5 846
12 2 474 5 846
13 3 524 5 846
For monotically increasing use:
df['Desired Group'] = (~df['Number Pattern'].diff().gt(0)).cumsum()
So basically I have some simple SQL code that looks like the following;
SELECT
[Column1]
,[Column2]
,[Column3]
,[Column4]
,MIN([Column5]) AS maxColumn5
,MAX([Column6]) AS minColumn6
,SUM([Column7]) AS sumColumn7
,SUM([Column8]) AS sumColumn8
,SUM([Column9]) AS sumColumn9
FROM
[tableName]
GROUP BY
[Column1]
,[Column2]
,[Column3]
,[Column4]
What I am trying to do is also find the column either 'Column1', 'Column2', or 'Column3' that corresponds to the MIN([Column6]) and then the column that corresponds to MAX([Column8]).
The output should be exactly the same except there will be an extra 2 column at the end specifying which one the min and max are associated with.
I think there is a simple problem in your question, as Col1,Col2,Col3 that correspond to the max or min, are displayed directly, in other words you have them as you are grouping by Col1,Col2,Col3 & Col4.
As you did not provide some data, I will set some random data to prove my point.
Lets create a memory table similar to yours with 9 columns and fill it with random data for col6-8 with 10 rows for example, you can use the below:-
Declare #data Table(
Column1 int,Column2 int,Column3 int,Column4 int,Column5 int,Column6 int,Column7 int,Column8 int,Column9 int
)
declare #index int=5
while(#index>0)
begin
insert into #data values(1,2,3,4,RAND()*1000,RAND()*1000,RAND()*1000,RAND()*1000,RAND()*1000)
insert into #data values(5,6,7,8,RAND()*1000,RAND()*1000,RAND()*1000,RAND()*1000,RAND()*1000)
set #index=#index-1
end
we can see the data with the below
select * from #data order BY [Column1],[Column2],[Column3],[Column4]
Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8 Column9
1 2 3 4 669 203 278 364 577
1 2 3 4 389 316 290 548 661
1 2 3 4 835 555 942 985 604
1 2 3 4 477 743 580 305 414
1 2 3 4 431 296 471 150 352
1 2 3 4 346 220 573 941 633
1 2 3 4 392 450 652 978 883
1 2 3 4 235 479 751 136 978
1 2 3 4 906 183 141 915 783
1 2 3 4 329 342 682 977 870
5 6 7 8 218 740 41 299 816
5 6 7 8 800 630 674 888 799
5 6 7 8 27 307 446 743 345
5 6 7 8 501 928 824 592 691
5 6 7 8 439 624 260 757 547
5 6 7 8 287 610 287 708 652
5 6 7 8 441 711 433 642 343
5 6 7 8 751 928 237 53 535
5 6 7 8 594 768 708 173 33
5 6 7 8 352 703 943 867 661
now lets see the result of your grouping that you provided without any change
Col1 Col2 Col3 Col4 minCol5 maxCol6 maxCol8 sumCol7 sumCol8 sumCol9
1 2 3 4 235 743 985 5360 6299 6755
5 6 7 8 27 928 888 4853 5722 5422
so if we go back to your question, what is the value of Col1,Col2,Col3 for the maxCol6, well for each maxCol6 you have the values of Col1,Col2,Col3 & even Col4.
so what are the values for Col1,Col2,Col3 for maxCol16 that is 928, well they are 5,6 & 7.
ok, now lets say you want the record key that have that maxCol6, that is easy too, we would add an identity col as ID as below:-
Declare #data Table(
ID int identity(1,1), Column1 int,Column2 int,Column3 int,Column4 int,Column5 int,Column6 int,Column7 int,Column8 int,Column9 int
)
declare #index int=10
while(#index>0)
begin
insert into #data values(1,2,3,4,RAND()*1000,RAND()*1000,RAND()*1000,RAND()*1000,RAND()*1000)
insert into #data values(5,6,7,8,RAND()*1000,RAND()*1000,RAND()*1000,RAND()*1000,RAND()*1000)
set #index=#index-1
end
select * from #data order BY [Column1],[Column2],[Column3],[Column4]
;with agg as (
SELECT
[Column1]
,[Column2]
,[Column3]
,[Column4]
,MIN([Column5]) AS minColumn5
,MAX([Column6]) AS maxColumn6
,MAX([Column8]) AS maxColumn8
,SUM([Column7]) AS sumColumn7
,SUM([Column8]) AS sumColumn8
,SUM([Column9]) AS sumColumn9
FROM
#data [tableName]
GROUP BY
[Column1]
,[Column2]
,[Column3]
,[Column4]
)
--select * from agg order BY [Column1],[Column2],[Column3],[Column4]
select agg.*,maxCol6.ID [MaxCol6Seq],maxCol8.ID [MaxCol8Seq] from agg
inner join #data maxCol6
on agg.Column1=maxCol6.Column1
and agg.Column2=maxCol6.Column2
and agg.Column3=maxCol6.Column3
and agg.Column4=maxCol6.Column4
and agg.maxColumn6=maxCol6.Column6
inner join #data maxCol8
on agg.Column1=maxCol8.Column1
and agg.Column2=maxCol8.Column2
and agg.Column3=maxCol8.Column3
and agg.Column4=maxCol8.Column4
and agg.maxColumn8=maxCol8.Column8
As this is a new run for this set of data , below:-
ID Column1 Column2 Column3 Column4 Column5 Column6 Column7 Column8 Column9
1 1 2 3 4 201 848 993 50 304
3 1 2 3 4 497 207 644 399 104
5 1 2 3 4 445 321 822 151 185
7 1 2 3 4 611 402 620 61 543
9 1 2 3 4 460 409 182 915 211
11 1 2 3 4 886 804 180 213 282
13 1 2 3 4 614 709 932 806 162
15 1 2 3 4 795 752 110 474 463
17 1 2 3 4 737 545 77 648 727
19 1 2 3 4 788 862 266 464 851
20 5 6 7 8 218 561 943 572 54
18 5 6 7 8 741 621 610 214 536
16 5 6 7 8 579 248 374 693 761
14 5 6 7 8 866 415 198 528 657
12 5 6 7 8 905 947 500 50 387
10 5 6 7 8 492 860 948 299 220
8 5 6 7 8 861 328 727 40 327
6 5 6 7 8 435 534 707 769 777
4 5 6 7 8 587 68 45 184 614
2 5 6 7 8 189 24 289 121 772
The result is as below:-
C1 C2 C3 C4 minC5 maxC6 maxC8 sumC7 sumC8 sumC9 MaxCol6Seq MaxCol8Seq
1 2 3 4 201 862 915 4826 4181 3832 19 9
5 6 7 8 189 947 769 5341 3470 5105 12 6
Hope this helps.
If you just want a flag on each row specifying whether the value is the overall maximum or minimum, you can use window functions and CASE:
SELECT [Column1], [Column2], [Column3], [Column4],
MAX([Column5]) AS maxColumn5,
MIN([Column6]) AS minColumn6,
SUM([Column7]) AS sumColumn7,
SUM([Column8]) AS sumColumn8,
SUM([Column9]) AS sumColumn9,
(CASE WHEN MIN([Column6]) = MIN(MIN([Column6])) OVER () THEN 1 ELSE 0 END) as is_min_column6,
(CASE WHEN MAX([Column7]) = MAX(MAX([Column7])) OVER () THEN 1 ELSE 0 END) as is_max_column7
FROM [tableName]
GROUP BY [Column1], [Column2], [Column3], [Column4]
I have a query that looks like this:
select nvl(trim(a.code), 'Blanks') as Ward, count(b.apcasekey) as UNSP, count(c.apcasekey) as GRAPH,
count(d.apcasekey) as "ANI/PIG",
(count(b.apcasekey) + count(c.apcasekey) + count(d.apcasekey)) as "TOTAL ACTIVE",
count(a.apcasekey) as "TOTAL OPEN" from (etc...)
group by a.code
order by Ward
The reason I have nvl(trim(a.code), 'Blanks') as Ward is that sometimes a.code is a blank string, sometimes it's a null.
The problem is that when I use the Group By statement, I can't use Ward or I get the error
Ward: Invalid Identifier
I can only use a.code so I get 2 rows for 'Blanks', as per below
1 Blanks 7 0 0 7 7
2 Blanks 23 1 1 25 30
3 W01 75 4 0 79 91
4 W02 62 1 0 63 72
5 W03 140 2 0 142 162
6 W04 6 1 0 7 7
7 W05 46 0 1 47 48
8 W06 322 46 1 369 425
9 W07 91 0 1 92 108
10 W08 93 2 0 95 104
11 W09 28 1 0 29 30
12 W10 25 0 0 25 28
What I need, is for the row with 'Blanks' to combined into 1 row. Little help?
Thanks.
You can not use the alias in the GROUP BY, but you can use the expression that builds the value:
GROUP BY nvl(trim(a.code), 'Blanks')
I will explain my problem briefly
have duplicate rino like below (actually this rino is the serial number in front end)
chqid rino branchid
----- ---- --------
876 6 2
14 6 2
18 10 2
828 10 2
829 11 2
19 11 2
830 12 2
20 12 2
78 40 2
1092 40 2
1094 41 2
79 41 2
413 43 2
1103 43 2
82 44 2
1104 44 2
1105 45 2
83 45 2
91 46 2
1106 46 2
here in my case I don't want to delete these duplicate rino instead of that I planned to update the rino having max date(column not specified in the above sample actually a date column is there) to the next rino number
what exactly I meant is :
if I sort out the above result according to the max(date) I will get
chqid rino branchid
----- ---- --------
876 6 2
828 10 2
19 11 2
830 12 2
1092 40 2
79 41 2
413 43 2
82 44 2
83 45 2
1106 46 2
(NOTE : total number of duplicate rows are 10 in branchid=2)
the last entered rino in the table for branchid=2 is 245
So I just want to update the 10 rows(Of column rino) with numbers starting from 246 to 255( just added 245+10 like this select lastno+ generate_series(1,10) nos from tab where cola=4 and branchid = 2 and vrid=20;)
Expected Output:
chqid rino branchid
----- ---- --------
876 246 2
828 247 2
19 248 2
830 249 2
1092 250 2
79 251 2
413 252 2
82 253 2
83 254 2
1106 255 2
using postgresql
Finally I found a solution, am using dynamic-sql to solve my issue
do
$$
declare
arow record;
begin
for arow in
select chqid,rino,branchid from (
select chqid,rino::int ,vrid,branchid ,row_number()over (partition by rino::int ) rn
from tab
where vrid =20
and branchid = 2)t
where rn >1
loop
execute format('
update tab
set rino=(select max(rino::int)+1 from gtab19 where acyrid=4 and branchid = 2 and vrid=20)
where chqid=%s
',arow.chqid);
end loop;
end;
$$;