How to find the mean (average) of every 3 values in one column of unknown size using pandas (no NumPy)

I am trying to figure out how to apply such functions (mean, std, etc.) to different values of a CSV file. To keep it simple, here is an example with one column:
S08
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
My attempt so far:
D1 = D.loc[:, 'S08']
rang = len(D1)
for i in range(rang):
    x = D1.iloc[:, i+2]
    m = x.mean()
    print(m)
The full CSV looks like this:
time S08 S09 S15 S37 S38 S39 S41 S45 S49
1 10 5 100 5 145 1500 1 10 99
2 20 15 200 15 135 1400 2 150 99
3 30 25 300 25 125 1300 3 140 99
4 40 35 400 35 115 1200 4 130 99
5 50 45 500 45 105 1100 5 120 99
6 60 55 600 55 95 1000 6 110 99
7 70 65 700 65 85 900 7 100 99
8 80 75 800 75 75 800 8 90 99
9 90 85 900 85 65 700 9 80 99
10 100 95 1000 95 55 600 10 70 99
11 110 105 1100 105 45 500 11 60 99
12 120 115 1200 115 35 400 12 50 99
13 130 125 1300 125 25 300 13 40 99
14 140 135 1400 135 15 200 14 30 99
15 150 145 1500 145 5 100 15 20 99

Use groupby on the index floor-divided by 3 and then aggregate the columns with agg:
#create monotonic unique index (0,1,2...) if necessary
#df = df.reset_index(drop=True)
df = df.groupby(df.index // 3).agg({'col1':'mean', 'col2':'std'})
Sample:
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.randint(5, size=(10,3)), columns=list('ABC'))
print (df)
A B C
0 0 0 3
1 0 2 4
2 2 2 2
3 2 1 0
4 0 4 3
5 4 2 0
6 3 1 2
7 3 4 4
8 1 3 4
9 4 3 3
df1 = df.groupby(df.index // 3).agg({'A':'mean', 'B':'std'})
print (df1)
A B
0 0.666667 1.154701
1 2.000000 1.527525
2 2.333333 1.527525
3 4.000000 NaN
#floor divide index values to create groups of three
print (df.index // 3)
Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')
EDIT:
df1 = df.groupby(df.index // 3).agg(['mean','std'])
df1.columns = df1.columns.map('_'.join)
print (df1)
A_mean A_std B_mean B_std C_mean C_std
0 0.666667 1.154701 1.333333 1.154701 3.000000 1.000000
1 2.000000 2.000000 2.333333 1.527525 1.000000 1.732051
2 2.333333 1.154701 2.666667 1.527525 3.333333 1.154701
3 4.000000 NaN 3.000000 NaN 3.000000 NaN
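A minimal sketch (not from the original answer) applying the same idea to the frame from the question, assuming it was read into df with pd.read_csv and has the time/S08/S09/... columns shown above; the file name is hypothetical:
import pandas as pd

# hypothetical file name; the question does not give one
df = pd.read_csv('data.csv')

# mean and std of every 3 consecutive rows, for all sensor columns
out = df.drop(columns='time').groupby(df.index // 3).agg(['mean', 'std'])
out.columns = out.columns.map('_'.join)
print (out)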

Related

How to calculate shift and rolling sum over missing dates without adding them to data frame in Pandas?

I have a data set with dates, customers and income:
Date Customer Income
0 1/1/2018 A 53
1 2/1/2018 A 36
2 3/1/2018 A 53
3 5/1/2018 A 89
4 6/1/2018 A 84
5 8/1/2018 A 84
6 9/1/2018 A 54
7 10/1/2018 A 19
8 11/1/2018 A 44
9 12/1/2018 A 80
10 1/1/2018 B 24
11 2/1/2018 B 100
12 9/1/2018 B 40
13 10/1/2018 B 47
14 12/1/2018 B 10
15 2/1/2019 B 5
For both customers there are missing dates, as they purchased nothing in some months.
For each customer, I want to add the previous month's income and also the rolling sum of income over the last year.
Meaning, if a month is missing, the shift(1) column of the next month that does have income should show '0'. And the rolling sum should cover 12 months even if there weren't 12 observations.
This is the expected result:
Date Customer Income S(1) R(12)
0 1/1/2018 A 53 0 53
1 2/1/2018 A 36 53 89
2 3/1/2018 A 53 36 142
3 5/1/2018 A 89 0 231
4 6/1/2018 A 84 89 315
5 8/1/2018 A 84 0 399
6 9/1/2018 A 54 84 453
7 10/1/2018 A 19 54 472
8 11/1/2018 A 44 19 516
9 12/1/2018 A 80 44 596
10 1/1/2018 B 24 0 24
11 2/1/2018 B 100 24 124
12 9/1/2018 B 40 0 164
13 10/1/2018 B 47 40 211
14 12/1/2018 B 10 0 221
15 2/1/2019 B 5 0 102
So far, I've added the rows with missing dates using stack and unstack, but with many dates and customers this explodes the data to millions of rows (most of them 0) and crashes the kernel.
You can use .shift, with extra logic that sets S(1) = 0 when the gap to the previous row is more than 31 days.
The rolling-12 calculation requires figuring out the "Rolling Date" and a somewhat involved list comprehension to decide whether or not to include a value; then take the sum of each list per row.
import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date']).dt.date
df['S(1)'] = df.groupby('Customer')['Income'].transform('shift').fillna(0)
s = (df['Date'] - df['Date'].shift())/np.timedelta64(1, '31D') <= 1
df['S(1)'] = df['S(1)'].where(s, 0).astype(int)
df['Rolling Date'] = (df['Date'] - pd.Timedelta('1Y'))
df['R(12)'] = df.apply(lambda d: sum([z for x, y, z in
                                      zip(df['Customer'], df['Date'], df['Income'])
                                      if y > d['Rolling Date']
                                      if y <= d['Date']
                                      if x == d['Customer']]), axis=1)
df = df.drop('Rolling Date', axis=1)
df
Out[1]:
Date Customer Income S(1) R(12)
0 2018-01-01 A 53 0 53
1 2018-02-01 A 36 53 89
2 2018-03-01 A 53 36 142
3 2018-05-01 A 89 0 231
4 2018-06-01 A 84 89 315
5 2018-08-01 A 84 0 399
6 2018-09-01 A 54 84 453
7 2018-10-01 A 19 54 472
8 2018-11-01 A 44 19 516
9 2018-12-01 A 80 44 596
10 2018-01-01 B 24 0 24
11 2018-02-01 B 100 24 124
12 2018-09-01 B 40 0 164
13 2018-10-01 B 47 40 211
14 2018-12-01 B 10 0 221
15 2019-02-01 B 5 0 102
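As an alternative sketch (not part of the original answer), pandas' time-based rolling windows can compute the 12-month sum directly; this assumes df is sorted by Customer and then Date, and approximates the 1-year window with 365 days:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])

# previous month's income, zeroed when the gap to the previous row is more than 31 days
gap = df.groupby('Customer')['Date'].diff()
df['S(1)'] = (df.groupby('Customer')['Income'].shift()
                .where(gap <= pd.Timedelta('31D'), 0)
                .astype(int))

# rolling 365-day sum per customer; needs Date monotonic within each customer
r12 = (df.set_index('Date')
         .groupby('Customer')['Income']
         .rolling('365D')
         .sum())
df['R(12)'] = r12.to_numpy()  # relies on df being sorted by Customer, then Date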

BigQuery: count if two record values are greater than or equal to the values in their columns and find the percent overall

Let's say I have a table of millions of records resulting from a simulation; below is a sample:
TO Sim DUR Cost
1 1 20 145
1 2 24 120
1 3 27 176
1 4 30 170
1 5 23 173
1 6 26 148
1 7 21 175
1 8 22 171
1 9 23 169
1 10 23 178
2 1 23 172
2 2 29 152
2 3 25 162
2 4 20 179
2 5 26 154
2 6 27 137
2 7 27 131
2 8 28 148
2 9 25 156
2 10 22 169
How do I do the calculation in BigQuery to find the percentage of rows satisfying both conditions? (I can do a UDF, but I would like it to be all in SQL statements.)
The Excel equivalent of the new calculated column would be =countifs($C$2:$C$21,">="&C2,$D$2:$D$21,">="&D2,$A$2:$A$21,A2) / countif($A$2:$A$21,A2)
The results would look like:
TO Sim DUR Cost f0
1 1 20 145 0.90
1 2 24 120 0.40
1 3 27 176 0.10
1 4 30 170 0.10
1 5 23 173 0.30
1 6 26 148 0.30
1 7 21 175 0.30
1 8 22 171 0.40
1 9 23 169 0.50
1 10 23 178 0.10
2 1 23 172 0.10
2 2 29 152 0.10
2 3 25 162 0.10
2 4 20 179 0.10
2 5 26 154 0.10
2 6 27 137 0.30
2 7 27 131 0.40
2 8 28 148 0.20
2 9 25 156 0.20
2 10 22 169 0.20
Below is for BigQuery Standard SQL
#standardSQL
SELECT ANY_VALUE(a).*, COUNTIF(b.dur >= a.dur AND b.cost >= a.cost) / COUNT(1) calc
FROM `project.dataset.table` a
JOIN `project.dataset.table` b
USING (to_)
GROUP BY FORMAT('%t', a)
-- ORDER BY to_, sim
If applied to the sample data from your question, the result is:
Row to_ sim dur cost calc
1 1 1 20 145 0.9
2 1 2 24 120 0.4
3 1 3 27 176 0.1
4 1 4 30 170 0.1
5 1 5 23 173 0.3
6 1 6 26 148 0.3
7 1 7 21 175 0.3
8 1 8 22 171 0.4
9 1 9 23 169 0.5
10 1 10 23 178 0.1
11 2 1 23 172 0.1
12 2 2 29 152 0.1
13 2 3 25 162 0.1
14 2 4 20 179 0.1
15 2 5 26 154 0.1
16 2 6 27 137 0.3
17 2 7 27 131 0.4
18 2 8 28 148 0.2
19 2 9 25 156 0.2
20 2 10 22 169 0.2
Note: I am using the field name to_ instead of to, which is a keyword and not allowed as a column name.
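Not part of the original answer, but since the rest of this page is pandas: a sketch of the same countifs-style calculation in pandas (pairwise comparison within each TO group). It is quadratic per group, so it is only meant for cross-checking small samples, not the millions-of-rows table:
import pandas as pd

def countifs_pct(g):
    # share of rows in the same TO group whose DUR and Cost are both >= this row's values
    dur = g['DUR'].to_numpy()
    cost = g['Cost'].to_numpy()
    hits = (dur[None, :] >= dur[:, None]) & (cost[None, :] >= cost[:, None])
    return pd.Series(hits.mean(axis=1), index=g.index)

df['f0'] = df.groupby('TO', group_keys=False).apply(countifs_pct)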

How to replace the last n values of a column with zero

I want to replace the last 2 values of one of the columns with zero. I understand that for NaN values I can use .fillna(0), but I would like to replace the row 6 value of the last column as well.
Weight Name Age d_id_max
0 45 Sam 14 2
1 88 Andrea 25 1
2 56 Alex 55 1
3 15 Robin 8 3
4 71 Kia 21 3
5 44 Sia 43 2
6 54 Ryan 45 1
7 34 Dimi 65 NaN
I tried df.drop(df.tail(2).index, inplace=True), but that drops the rows entirely. The expected output is:
Weight Name Age d_id_max
0 45 Sam 14 2
1 88 Andrea 25 1
2 56 Alex 55 1
3 15 Robin 8 3
4 71 Kia 21 3
5 44 Sia 43 2
6 54 Ryan 45 0
7 34 Dimi 65 0
Before pandas 0.20.0 (a long time ago) this was a job for ix, but it is now deprecated. So you can use:
DataFrame.iloc to get the last rows, together with Index.get_loc for the position of column d_id_max:
df.iloc[-2:, df.columns.get_loc('d_id_max')] = 0
print (df)
Weight Name Age d_id_max
0 45 Sam 14 2.0
1 88 Andrea 25 1.0
2 56 Alex 55 1.0
3 15 Robin 8 3.0
4 71 Kia 21 3.0
5 44 Sia 43 2.0
6 54 Ryan 45 0.0
7 34 Dimi 65 0.0
Or DataFrame.loc with the last index values:
df.loc[df.index[-2:], 'd_id_max'] = 0
Try .iloc and get_loc
df.iloc[[-1,-2], df.columns.get_loc('d_id_max')] = 0
Out[232]:
Weight Name Age d_id_max
0 45 Sam 14 2.0
1 88 Andrea 25 1.0
2 56 Alex 55 1.0
3 15 Robin 8 3.0
4 71 Kia 21 3.0
5 44 Sia 43 2.0
6 54 Ryan 45 0.0
7 34 Dimi 65 0.0
You can use:
df['d_id_max'].iloc[-2:] = 0
Note that this is chained assignment, so it may raise a SettingWithCopyWarning and will not modify df under copy-on-write in newer pandas versions.
Weight Name Age d_id_max
0 45 Sam 14 2.0
1 88 Andrea 25 1.0
2 56 Alex 55 1.0
3 15 Robin 8 3.0
4 71 Kia 21 3.0
5 44 Sia 43 2.0
6 54 Ryan 45 0.0
7 34 Dimi 65 0.0
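A hedged generalization, not in the original answers: since the title asks about the last n values, the iloc/get_loc pattern above parameterizes directly; n = 2 is just the value from this question:
n = 2  # number of trailing rows to zero out
df.iloc[-n:, df.columns.get_loc('d_id_max')] = 0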

Getting all rows after the smallest value in a column in each group after groupby

I have a dataframe as below.
After I do the groupby on 'Cycle' & 'Type', I want to find the sum/mean/std of only the positive (or only the negative) values of the rows AFTER the smallest 'Switch' value.
How do I do this? I think the easier option would be to get a dataframe which has only the rows after the smallest Switch for each 'Cycle' & 'Type' group.
Cycle Type Time Switch
10 1 101 -0.134
10 1 102 0.001
10 1 103 -0.058
10 1 104 0.035
10 1 105 -0.209
10 1 106 0.002
10 1 107 -0.0443
10 1 108 0.001
10 1 109 -0.368
10 1 110 0.015
10 1 111 -0.009
10 1 112 0.055
10 1 113 -0.014
10 1 114 0.004
10 1 115 -0.033
10 1 116 0.003
10 1 117 -0.0401
10 1 118 0.003
10 1 119 -0.088
10 1 120 0.005
10 1 121 -0.026
10 1 122 0.001
10 1 123 -0.115
10 1 124 0.005
10 1 125 -0.085
10 1 126 0.002
10 1 127 -0.054
10 1 128 0.012
10 1 129 -0.034
8 1 101 -1.876
8 1 102 0.003
8 1 103 -0.134
8 1 104 0.002
8 1 105 -0.036
8 1 106 0.012
8 1 107 -0.08
8 1 108 0.037
8 1 109 -0.027
8 1 110 0.022
8 1 111 -0.001
8 1 112 0.028
8 1 113 -0.009
8 1 114 0.002
8 1 115 -0.006
8 1 116 0.01
8 1 117 -0.002
8 1 118 0.002
8 1 119 -0.002
8 1 120 0.008
8 1 121 -0.011
8 1 122 0.001
8 1 123 -0.028
8 1 124 0.003
8 1 125 -0.063
8 1 126 0.013
8 1 127 -0.003
8 1 128 0.02
8 1 129 -0.113
8 1 130 0.003
8 1 131 -0.03
8 1 132 0.012
8 1 133 -0.078
8 1 134 0.001
8 1 135 -0.764
8 1 136 0.006
8 1 137 -0.268
8 1 138 0.016
8 1 139 -0.171
8 1 140 0.013
8 1 141 -0.286
8 1 142 0.023
For the given dataframe, the output would be all the rows below -0.368 for cycle=10 & type=1. On the other hand for cycle=8 & type=1, all rows below -1.876 (so all rows below the first row).
The output dataframe would be as below (the first 9 rows of cycle 10, type 1 are removed & the first row of cycle 8, type 1 is removed):
Cycle Type Time Switch
10 1 110 0.015
10 1 111 -0.009
10 1 112 0.055
10 1 113 -0.014
10 1 114 0.004
10 1 115 -0.033
10 1 116 0.003
10 1 117 -0.0401
10 1 118 0.003
10 1 119 -0.088
10 1 120 0.005
10 1 121 -0.026
10 1 122 0.001
10 1 123 -0.115
10 1 124 0.005
10 1 125 -0.085
10 1 126 0.002
10 1 127 -0.054
10 1 128 0.012
10 1 129 -0.034
8 1 102 0.003
8 1 103 -0.134
8 1 104 0.002
8 1 105 -0.036
8 1 106 0.012
8 1 107 -0.08
8 1 108 0.037
8 1 109 -0.027
8 1 110 0.022
8 1 111 -0.001
8 1 112 0.028
8 1 113 -0.009
8 1 114 0.002
8 1 115 -0.006
8 1 116 0.01
8 1 117 -0.002
8 1 118 0.002
8 1 119 -0.002
8 1 120 0.008
8 1 121 -0.011
8 1 122 0.001
8 1 123 -0.028
8 1 124 0.003
8 1 125 -0.063
8 1 126 0.013
8 1 127 -0.003
8 1 128 0.02
8 1 129 -0.113
8 1 130 0.003
8 1 131 -0.03
8 1 132 0.012
8 1 133 -0.078
8 1 134 0.001
8 1 135 -0.764
8 1 136 0.006
8 1 137 -0.268
8 1 138 0.016
8 1 139 -0.171
8 1 140 0.013
8 1 141 -0.286
8 1 142 0.023
How do I accomplish this?
In the same way, if I have to get all the rows after the 2nd lowest, how do I do it?
The time keeps increasing, so if we know the time of the lowest 'Switch', all rows with time values above this 'Time' can be the contents of the new dataframe.
My progress so far is below.
I was able to find the time of the lowest value with the code below.
after_min = switch.groupby(['Cycle','Type'],as_index=False).apply(lambda x: x.nsmallest(1, 'Switch'))
a = after_min.groupby(['Cycle','Type'])['Time'].agg('first').reset_index(name='Time')
I am not able to create a mask or something like that which would filter the values below the lowest time in each group. Can anyone help?
If there is a way to get all the rows after the lowest value of 'Switch' even when we did not have the 'Time' column, please let me know.
Update: The answer suggested by WeNYoBen works perfectly if I want to get all the rows after the LOWEST value. However, if I want to get all the rows after the 2nd lowest, it will not work.
With the same logic as mentioned by WeNYoBen, if I can transform the last value in level=1 of the result of the code below to my group, there may be a way to get the rows after the 2nd lowest.
df.groupby(['Cycle','Type'],as_index=False).apply(lambda x: x.nsmallest(2, 'Switch'))
The code above gives output where 63 & 4 are the indexes of the 2nd-lowest 'Switch' values. If I could transform these values to each group respectively, then I could get the rows below the 2nd-lowest values using the logic from WeNYoBen (this also scales by changing the nsmallest value in the above code to the number desired). I am just not able to transform 63 & 4 to each group. Can anyone help?
Here is a solution using transform and idxmin:
df[df.index>df.groupby(['Cycle','Type']).Switch.transform('idxmin')]
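For the update about the 2nd lowest value, a sketch following the same logic (my adaptation, assuming every (Cycle, Type) group has at least two rows): broadcast the index label of the 2nd-smallest Switch with transform and nsmallest, then keep the rows after it:
# index label of the 2nd-smallest Switch per group, broadcast to every row of that group
cutoff = (df.groupby(['Cycle', 'Type'])['Switch']
            .transform(lambda s: s.nsmallest(2).index[-1]))
df[df.index > cutoff]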

Pandas, getting mean and sum with groupby

I have a data frame, df, which looks like this:
index New Old Map Limit count
1 93 35 54 > 18 1
2 163 93 116 > 18 1
3 134 78 96 > 18 1
4 117 81 93 > 18 1
5 194 108 136 > 18 1
6 125 57 79 <= 18 1
7 66 39 48 > 18 1
8 120 83 95 > 18 1
9 150 98 115 > 18 1
10 149 99 115 > 18 1
11 148 85 106 > 18 1
12 92 55 67 <= 18 1
13 64 24 37 > 18 1
14 84 53 63 > 18 1
15 99 70 79 > 18 1
I need to produce a data frame that looks like this:
Limit <=18 >18
total mean total mean
New xx1 yy1 aa1 bb1
Old xx2 yy2 aa2 bb2
MAP xx3 yy3 aa3 bb3
I tried this without success:
df.groupby('Limit')['New', 'Old', 'MAP'].[sum(), mean()].T
How can I achieve this in pandas?
You can use groupby with agg, then transpose by T and unstack:
print (df[['New', 'Old', 'Map', 'Limit']].groupby('Limit').agg([sum, 'mean']).T.unstack())
Limit <= 18 > 18
sum mean sum mean
New 217.0 108.5 1581.0 121.615385
Old 112.0 56.0 946.0 72.769231
Map 146.0 73.0 1153.0 88.692308
Edited per the comment, it looks nicer:
print (df.groupby('Limit')[['New', 'Old', 'Map']].agg([sum, 'mean']).T.unstack())
And if you need 'total' columns:
print (df.groupby('Limit')[['New', 'Old', 'Map']]
         .agg({'total':sum, 'mean': 'mean'})
         .T
         .unstack(0))
Limit <= 18 > 18
total mean total mean
New 217.0 108.5 1581.0 121.615385
Old 112.0 56.0 946.0 72.769231
Map 146.0 73.0 1153.0 88.692308
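Note that the dict-based renaming in the last snippet (.agg({'total': sum, 'mean': 'mean'})) was removed in later pandas versions. A sketch that should produce the same total/mean layout on recent pandas (my adaptation, not from the original answer):
print (df.groupby('Limit')[['New', 'Old', 'Map']]
         .agg(['sum', 'mean'])
         .rename(columns={'sum': 'total'}, level=1)
         .T
         .unstack())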