I have the following SQL that I would like to use to plot a cumulative distribution, but I can't seem to get the data right.
Sample Data:
token_Length    Frequency
1               6436
2               7489
3               3724
4               2440
5               667
6               396
7               264
8               215
9               117
10              90
11              61
12              29
13              69
15              40
18              45
How do I prepare this data to create a CDF plot in Looker, so that it looks like this:
token_Length    Frequency    cume_dist
1               6436         0.291459107
2               7489         0.630604112
3               3724         0.799248256
4               2440         0.909745494
5               667          0.939951091
6               396          0.95788425
7               264          0.969839688
8               215          0.979576125
9               117          0.984874558
10              90           0.988950276
11              61           0.991712707
12              29           0.993025994
13              69           0.996150711
15              40           0.997962141
18              45           1
I have tried a measure as follows:
measure: cume_dist {
  type: number
  sql: cume_dist() over (order by ${token_length} ASC) ;;
}
This generates SQL as:
SELECT
token_length,
COUNT(*) AS "count",
cume_dist() over (order by (token_length) ASC) AS "cume_dist"
FROM string_facts
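As written, cume_dist() is evaluated over one row per distinct token_length, so every length gets equal weight instead of being weighted by its frequency. To reproduce the expected cume_dist column you need a running total of the counts divided by the grand total. A minimal sketch of that SQL, assuming the string_facts table and columns from the generated SQL above (in Looker this would typically live in a derived table rather than a plain measure, since it windows over the grouped counts):
SELECT
  token_length,
  COUNT(*) AS frequency,
  -- running sum of the per-length counts divided by the grand total;
  -- the 1.0 factor guards against integer division in some dialects
  1.0 * SUM(COUNT(*)) OVER (ORDER BY token_length)
      / SUM(COUNT(*)) OVER () AS cume_dist
FROM string_facts
GROUP BY token_length
ORDER BY token_length
For token_length = 1 this gives 6436 / 22082 ≈ 0.2915, matching the expected output above.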
My df:
Plate Route Speed Dif Latitude Longitude
1 724TL054M RUTA 23 0 32.0 19.489872 -99.183970
2 0350021 RUTA 35 0 33.0 19.303572 -99.083700
3 0120480 RUTA 12 0 32.0 19.356400 -99.125694
4 1000106 RUTA 100 0 32.0 19.212614 -99.131874
5 0030719 RUTA 3 0 36.0 19.522831 -99.258500
... ... ... ... ... ... ...
1617762 923CH113M RUTA 104 0 33.0 19.334467 -99.016880
1617763 0120077 RUTA 12 0 32.0 19.302448 -99.084530
1617764 0470053 RUTA 47 0 33.0 19.399706 -99.209190
1617765 0400070 CETRAM 0 33.0 19.265041 -99.163290
1617766 0760175 RUTA 76 0 33.0 19.274513 -99.240150
I want to keep only the plates whose summed Dif (hence the groupby) is greater than 3600 (one hour, since Dif is in seconds) and discard the rest.
I tried (following a post from here):
df.groupby('Plate').filter(lambda x: x['Dif'].sum() > 3600)
But I still get about 60 plates whose sum is under 3600:
df.groupby('Plate').agg({'Dif':'sum'}).reset_index().nsmallest(60, 'Dif')
Plate Dif
952 655NZ035M 268.0
1122 949CH002C 814.0
446 0440220 1318.0
1124 949CH005C 1334.0
1042 698NZ011M 1434.0
1038 697NZ011M 1474.0
1 0010193 1509.0
282 0270302 1513.0
909 614NZ021M 1554.0
156 0140236 1570.0
425 0430092 1577.0
603 0620123 1586.0
510 0530029 1624.0
213 0180682 1651.0
736 0800126 1670.0
I have been at this for some hours and I can't solve it. Any help is appreciated.
Assign it back. groupby(...).filter(...) returns a new DataFrame; it does not modify df in place:
df = df.groupby('Plate').filter(lambda x: x['Dif'].sum() > 3600)
Then
df.groupby('Plate').agg({'Dif':'sum'}).reset_index().nsmallest(60, 'Dif')
will return only plates whose summed Dif exceeds 3600.
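On a frame this size (about 1.6M rows), a transform-based variant avoids calling a Python lambda per group; a minimal sketch, assuming df as above:
# Broadcast each plate's total Dif back onto its rows,
# then keep only the rows of plates above one hour.
totals = df.groupby('Plate')['Dif'].transform('sum')
df = df[totals > 3600]
This produces the same result as the filter call, just with the threshold test vectorized.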
There are 5 members contributing a value for every [E,M,S], as below:
E,M,S,Mem1,Mem2,Mem3,Mem4,Mem5
1,365,-10,15,21,18,16,,
1,365,10,23,34,,45,65
365,365,-20,34,45,43,32,23
365,365,20,56,45,,32,38
730,365,-5,82,64,13,63,27
730,365,15,24,68,,79,78
Notice that there are missing contributions (the ,,). I want to know the number of contributions for each [E,M,S]. For this example the output is:
1,365,-10,4
1,365,10,4
365,365,-20,5
365,365,20,4
730,365,-5,5
730,365,15,4
Grouping by ['E','M','S'] and then aggregating (counting), or applying a function across axis=1, would do it. How is that done? Or is there another idiomatic way to do this?
The answer posted by @Wen is brilliant and definitely seems like the easiest way to do this.
If you wanted another way to do this, you could use .melt to view the groups in the DF, then use groupby and, within each group of the melted DF, sum the .notnull() flags so that NaNs are ignored. One way to do this follows the approach in this SO post: .notnull() applied to groups.
Input DF
print(df)
E M S Mem1 Mem2 Mem3 Mem4 Mem5
0 1 365 -10 15 21 18.0 16 NaN
1 1 365 10 23 34 NaN 45 65.0
2 365 365 -20 34 45 43.0 32 23.0
3 365 365 20 56 45 NaN 32 38.0
4 730 365 -5 82 64 13.0 63 27.0
5 730 365 15 24 68 NaN 79 78.0
Here is the approach:
# Apply melt to view groups
dfm = pd.melt(df, id_vars=['E','M','S'])
print(dfm.head(10))
E M S variable value
0 1 365 -10 Mem1 15.0
1 1 365 10 Mem1 23.0
2 365 365 -20 Mem1 34.0
3 365 365 20 Mem1 56.0
4 730 365 -5 Mem1 82.0
5 730 365 15 Mem1 24.0
6 1 365 -10 Mem2 21.0
7 1 365 10 Mem2 34.0
8 365 365 -20 Mem2 45.0
9 365 365 20 Mem2 45.0
# GROUP BY
grouped = dfm.groupby(['E','M','S'])
# Aggregate within each group, while ignoring NaNs
gtotals = grouped['value'].apply(lambda x: x.notnull().sum())
# (Optional) Reset grouped DF index
gtotals = gtotals.reset_index(drop=False)
print(gtotals)
E M S value
0 1 365 -10 4
1 1 365 10 4
2 365 365 -20 5
3 365 365 20 4
4 730 365 -5 5
5 730 365 15 4
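For completeness, the axis=1 route hinted at in the question also works without melting; a minimal sketch, assuming df as above (and that each [E,M,S] appears on a single row, as it does here):
# Move the keys into the index, then count non-null member columns per row.
out = df.set_index(['E', 'M', 'S']).notnull().sum(axis=1).reset_index(name='value')
This yields the same six counts, since summing the .notnull() flags across the Mem columns counts the contributions in each row.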
I have the below dataframe in a messy shape. I need to combine rows 0 and 1 into the column names and keep the remaining data rows as-is:
Start Date 2005-01-01 Unnamed: 3 Unnamed: 4 Unnamed: 5
Dat an_1 an_2 an_3 an_4 an_5
mt mt s t inch km
23 45 67 78 89 9000
Change it to the below dataframe:
Dat_mt an_1_mt an_2_s an_3_t an_4_inch an_5_km
23 45 67 78 89 9000
IIUC
df.columns = df.loc[0] + '_' + df.loc[1]
df = df.loc[[2]]
df
Out[429]:
Dat_mt an_1_mt an_2_s an_3_t an_4_inch an_5_km
2 23 45 67 78 89 9000
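If more data rows followed row 1 than the single row shown, a variant of the same idea (a sketch, using the same df) keeps everything from row 2 down and tidies the index:
# Build the new header from rows 0 and 1, then keep all remaining data rows.
df.columns = df.loc[0] + '_' + df.loc[1]
df = df.loc[2:].reset_index(drop=True)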
There are three tables. The first, baseline, contains all beneficiaries' information, including a PPI_SCORE column. The second, PPI_SCORE_TOOKUP, contains the six columns shown below. The third, endline, contains the beneficiaries' endline assessment data and also has a PPI_SCORE column. I want to join these tables somehow, but there is no foreign key from the baseline and endline tables in the PPI_SCORE_TOOKUP table; the only common field is the PPI score. I want a query that shows some baseline data along with the PPI result where the baseline PPI score is between (or equal to) PPI_SCORE_START and PPI_SCORE_END, and that also shows the same member's endline data, with its PPI score and the six lookup columns, where the endline PPI score is between (or equal to) PPI_SCORE_START and PPI_SCORE_END, all in one row.
Note: I have not tried any query yet, since I had no idea how to approach this, but the expected result is at the bottom of this question.
The tables are as follows:
baseline table
ID NAME LAST_NAME DISTRICT PPI_SCORE
1 A A A 10
2 B B B 23
3 C C C 90
4 D D D 47
endline table
baseline_ID Enterprise Market PPI_SCORE
3 Bee Keeping Yes
2 Poultry No 74
1 Agriculture Yes 80
PPI_SCORE_TOOKUP table
ppi_start ppi_end national national_150 national_200 usaid
0 4 100 100 100 100
10 14 66.1 89.5 96.5 39.2
5 9 68.8 90.2 96.7 44.4
15 19 59.5 89.1 97.2 35.2
20 24 51.3 85.5 96.4 28.8
25 29 43.5 81.1 93.2 20
30 34 31.9 74.5 90.4 13.6
35 39 24.6 66.9 87.3 7.9
40 44 15.2 58 82.8 4.5
45 49 11.4 47.9 73.4 4.2
50 54 6 37.2 68.4 2.6
55 59 2.7 26.1 61.3 0.5
60 64 0.9 21 50.4 0.5
65 69 0 14.3 37.1 0
70 74 3 14.3 29.2 0
75 79 0 1.4 5.1 0
80 84 0 0 9.5 0
85 89 0 0 15.2 0
90 94 0 0 0 0
95 100 0 0 0 0
Expected Result
Your query can be made in the following way:
SELECT *
FROM baseline b
LEFT JOIN endline e ON b.id = e.baseline_ID
LEFT JOIN PPI_SCORE_TOOKUP ppi ON b.PPI_SCORE BETWEEN ppi.ppi_start AND ppi.ppi_end
LEFT JOIN PPI_SCORE_TOOKUP ppi2 ON e.PPI_SCORE BETWEEN ppi2.ppi_start AND ppi2.ppi_end
This matches the id values from the baseline table with the baseline_ID values from the endline table, keeping possible null values from baseline. It then matches the PPI_SCORE from baseline against ppi_start and ppi_end in PPI_SCORE_TOOKUP, and joins the PPI_SCORE from endline against ppi_start and ppi_end in the same way.
Replace * with whatever fields you want to have.
See the fiddle for a working example.
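Because the two lookup joins bring in identically named columns, selecting with aliases keeps the single-row output unambiguous; a sketch using only the tables and columns shown above:
SELECT
  b.ID,
  b.NAME,
  b.DISTRICT,
  b.PPI_SCORE   AS baseline_ppi,
  ppi.national  AS baseline_national,
  e.Enterprise,
  e.Market,
  e.PPI_SCORE   AS endline_ppi,
  ppi2.national AS endline_national
FROM baseline b
LEFT JOIN endline e ON b.ID = e.baseline_ID
-- one lookup row per score range, matched separately for baseline and endline
LEFT JOIN PPI_SCORE_TOOKUP ppi  ON b.PPI_SCORE BETWEEN ppi.ppi_start AND ppi.ppi_end
LEFT JOIN PPI_SCORE_TOOKUP ppi2 ON e.PPI_SCORE BETWEEN ppi2.ppi_start AND ppi2.ppi_end
Rows like baseline_ID 3, whose endline PPI_SCORE is missing, will simply carry NULLs in the ppi2 columns.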