I have a dataframe whose rows I need to repeat.
from io import StringIO
import pandas as pd
audit_trail = StringIO('''
course_id AcademicYear_to months TotalFee
260 2017 24 100
260 2018 12 140
274 2016 36 300
274 2017 24 340
274 2018 12 200
285 2017 24 300
285 2018 12 200
''')
df11 = pd.read_csv(audit_trail, sep=" " )
For course id 260 there are 2 entries, one per year: 2017 and 2018. I need to repeat each months group for every year of the course.
That gives 2 more rows: 2018 for months 24 and 2017 for months 12. The final dataframe will look like this...
audit_trail = StringIO('''
course_id AcademicYear_to months TotalFee
260 2017 24 100
260 2018 24 100
260 2017 12 140
260 2018 12 140
274 2016 36 300
274 2017 36 300
274 2018 36 300
274 2016 24 340
274 2017 24 340
274 2018 24 340
274 2016 12 200
274 2017 12 200
274 2018 12 200
285 2017 24 300
285 2018 24 300
285 2017 12 200
285 2018 12 200
''')
df12 = pd.read_csv(audit_trail, sep=" " )
I tried concatenating the dataframe with itself, but that does not solve the problem: the years need to change, and for 36 months the data needs to be repeated 3 times.
pd.concat([df11, df11])
The groupby object returns the years per course. I simply need to join the years in each group back to the original dataframe (a sketch of that idea follows the output below).
df11.groupby('course_id')['AcademicYear_to'].apply(list)
260 [2017, 2018]
274 [2016, 2017, 2018]
285 [2017, 2018]
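One way to express that join, as a sketch against the inline df11 sample above: merge the distinct (course_id, AcademicYear_to) pairs with the distinct (course_id, months, TotalFee) rows, which produces the years x months cross product per course.
years = df11[['course_id', 'AcademicYear_to']].drop_duplicates()
fees = df11[['course_id', 'months', 'TotalFee']].drop_duplicates()
df12 = years.merge(fees, on='course_id')  # 2*2 + 3*3 + 2*2 = 17 rows, as in the target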
A simple join can work if the number of records matches the number of years. For example, course id 274 has 48 months and 285 has a duration of 24 months, and there are 3 and 2 entries respectively. The problem is course id 260, which is a 24-month course but has only 1 entry; the join will not return the second year for that course.
df11=pd.read_csv('https://s3.amazonaws.com/todel162/myso.csv')
df11.course_id.value_counts()
274 3
285 2
260 1
df=df11.merge(df11[['course_id']], on='course_id')
df.course_id.value_counts()
274 9
285 4
260 1
Is it possible to write a query that will also consider the number of months?
The following query returns the records for which a simple join will not return the expected results.
df11=pd.read_csv('https://s3.amazonaws.com/todel162/myso.csv')
df11['m1']=df11.groupby('course_id').course_id.transform( lambda x: x.count() * 12)
df11.query( 'm1 != duration_inmonths')
df11.course_id.value_counts()
274 3
285 2
260 1
df=df11.merge(df11[['course_id']], on='course_id')
df.course_id.value_counts()
274 9
285 4
260 1
The expected count in this case is
274 6
285 4
260 2
This is because even though there are 3 years for id 274, the course duration is only 24 months. And even though there is only 1 record for 260, since the duration is 24 months it should return 2 records (one for the current year and one for current_year + 1).
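To make the expected counts concrete, a small sketch (assuming, as in the query above, that the S3 file's duration_inmonths column is constant within each course_id):
rows_per_course = df11.groupby('course_id').size()
years_needed = df11.groupby('course_id')['duration_inmonths'].first() // 12
print(rows_per_course * years_needed)  # 260 -> 2, 274 -> 6, 285 -> 4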
IIUC we can merge df11 to itself:
In [14]: df11.merge(df11[['course_id']], on='course_id')
Out[14]:
course_id AcademicYear_to months TotalFee
0 260 2017 24 100
1 260 2017 24 100
2 260 2018 12 140
3 260 2018 12 140
4 274 2016 36 300
5 274 2016 36 300
6 274 2016 36 300
7 274 2017 24 340
8 274 2017 24 340
9 274 2017 24 340
10 274 2018 12 200
11 274 2018 12 200
12 274 2018 12 200
13 285 2017 24 300
14 285 2017 24 300
15 285 2018 12 200
16 285 2018 12 200
Not Pretty!
def f(x):
    # within one course_id group, rebuild the index as the full
    # months x AcademicYear_to cross product, leaving NaN for new combinations
    idx = x.index.remove_unused_levels()
    idx = pd.MultiIndex.from_product(idx.levels, names=idx.names)
    return x.reindex(idx)

# reindex each course to all (months, year) combinations, then fill each
# (course_id, months) group with the fee it already has
df11.set_index(['months', 'AcademicYear_to']) \
    .groupby('course_id').TotalFee.apply(f) \
    .groupby(level=[0, 1]).transform('first') \
    .astype(df11.TotalFee.dtype).reset_index()
course_id months AcademicYear_to TotalFee
0 260 24 2017 100
1 260 24 2018 100
2 260 12 2017 140
3 260 12 2018 140
4 274 12 2016 200
5 274 12 2017 200
6 274 12 2018 200
7 274 24 2016 340
8 274 24 2017 340
9 274 24 2018 340
10 274 36 2016 300
11 274 36 2017 300
12 274 36 2018 300
13 285 24 2017 300
14 285 24 2018 300
15 285 12 2017 200
16 285 12 2018 200
Related
I have a pandas dataframe with 200k+ unique IDs and, for each ID, 33 time periods of numeric data (padded where needed so every ID has the same sequence length for my RNN model). The data is already sorted by ID and time. Overall, the dataframe has just over 8M rows! I'm currently generating the np array sequences, but it takes very long (code below).
I'm wondering if there is a faster or more efficient way to generate them. I'll share a data sample below as well as my current code. Thank you in advance and let me know if there are any follow up questions!
# pandas dataframe example
id time v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 y
a1 1 161 11 137 15 386 15 275 10 422 1 344 3 18 0
a1 2 487 14 77 11 329 7 188 12 174 2 462 14 14 0
a1 3 92 2 12 11 226 20 50 5 313 9 65 19 13 0
… … … … … … … … … … … … … … … …
a1 31 434 6 30 12 216 12 151 18 470 20 414 4 13 0
a1 32 271 2 456 12 100 19 198 12 377 14 205 13 1 0
a1 33 183 15 34 10 499 20 229 6 191 20 145 13 2 0
a2 1 247 10 54 6 115 14 102 9 39 6 34 1 8 0
a2 2 216 19 423 4 205 13 458 12 226 7 264 18 6 0
a2 3 311 7 285 5 147 6 30 17 332 10 116 13 1 0
… … … … … … … … … … … … … … … …
a2 31 124 13 62 11 229 4 242 20 261 6 350 16 8 0
a2 32 359 9 290 2 72 14 478 2 197 14 188 7 11 0
a2 33 410 15 370 18 34 9 387 5 218 9 257 5 1 1
# current sequence generation code
def create_sequence_data(df, num_features=num_features, y_label='y'):
    id_list = df['id'].unique()
    x_sequence = []  # one list of per-timestep feature lists per id
    y_local = []     # one label per id (max of y within the id)
    ids = []
    for idx in id_list:
        # filtering the whole frame once (twice, in fact) per id is the slow part
        x_sequence.append(df[df['id'] == idx][num_features].values.tolist())
        y_local.append(df[df['id'] == idx][y_label].max())
        ids.append(idx)
    return x_sequence, y_local, ids
x_seq, y_vals, idx_ls = create_sequence_data(df=df, y_label='y')
This solution actually worked fast enough for me. In the code prior to the one above I was also looping through the rows for each id; filtering on id alone and then calling values.tolist() outputs a list of lists, which is perfect for sequence modeling.
In terms of time: the total row count across the several dataframes was just over 8M, and I was able to get through each one in under a couple of hours, so not too bad compared to before.
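If it ever becomes too slow again, one possible speed-up (a sketch, not benchmarked on the full 8M-row data) is a single groupby pass, so the frame is not filtered twice per id:
def create_sequence_data_grouped(df, num_features, y_label='y'):
    # sort=False keeps the existing id order; each group is scanned only once
    x_sequence, y_local, ids = [], [], []
    for idx, g in df.groupby('id', sort=False):
        x_sequence.append(g[num_features].values.tolist())
        y_local.append(g[y_label].max())
        ids.append(idx)
    return x_sequence, y_local, ids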
I have the following problem:
A dataframe named df1 like this:
Id PVF PM_year Year
0 A6489 75 25 2018
1 A175 56 54 2018
2 A2856 34 65 2018
3 A6489 35 150 2019
4 A175 45 700 2019
5 A2856 55 120 2019
6 A6489 205 100 2020
7 A2856 35 445 2020
I want to create a new column named PM_previous_year which, for each combination (Id + Year), is equal to the value of PM_year for the same Id in the previous year...
Example :
For the line indexed 3, the Id is 'A6489' and the year is 2019, so the value of the new column "PM_previous_year" should be the value from the line where the Id is the same ('A6489') and the year equals 2018 (2019 - 1). In this simple example that is the line indexed 0, and the value expected for the new column on line 3 is then 25.
Finally, the targeted DataFrame df2 for this short example looks like this :
Id PVF PM_year Year PM_previous_year
0 A6489 75 25 2018 NaN
1 A175 56 54 2018 NaN
2 A2856 34 65 2018 NaN
3 A6489 35 150 2019 25.0
4 A175 45 700 2019 54.0
5 A2856 55 120 2019 65.0
6 A6489 205 100 2020 150.0
7 A2856 35 445 2020 120.0
I haven't found any obvious solution yet. Maybe there is a way by reshaping the df, but I'm not very familiar with that.
If somebody has any idea, I would be very grateful.
Thanks
If a simple solution is possible, shift PM_year per Id:
df['PM_previous_year'] = df.groupby('Id')['PM_year'].shift()
print (df)
Id PVF PM_year Year PM_previous_year
0 A6489 75 25 2018 NaN
1 A175 56 54 2018 NaN
2 A2856 34 65 2018 NaN
3 A6489 35 150 2019 25.0
4 A175 45 700 2019 54.0
5 A2856 55 120 2019 65.0
6 A6489 205 100 2020 150.0
7 A2856 35 445 2020 120.0
Or:
s = df.pivot(index='Year', columns='Id', values='PM_year').shift().unstack().rename('PM_previous_year')
df = df.join(s, on=['Id','Year'])
print (df)
Id PVF PM_year Year PM_previous_year
0 A6489 75 25 2018 NaN
1 A175 56 54 2018 NaN
2 A2856 34 65 2018 NaN
3 A6489 35 150 2019 25.0
4 A175 45 700 2019 54.0
5 A2856 55 120 2019 65.0
6 A6489 205 100 2020 150.0
7 A2856 35 445 2020 120.0
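Note that the groupby shift takes the previous row per Id, so it only matches the previous calendar year when every Id has consecutive, sorted years. A merge-based sketch that aligns on the actual previous year (assuming Year is an integer column):
prev = df[['Id', 'Year', 'PM_year']].copy()
prev['Year'] = prev['Year'] + 1  # each row now describes the following year
df = df.merge(prev.rename(columns={'PM_year': 'PM_previous_year'}),
              on=['Id', 'Year'], how='left')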
I have a data set with dates, customers and income:
Date Customer Income
0 1/1/2018 A 53
1 2/1/2018 A 36
2 3/1/2018 A 53
3 5/1/2018 A 89
4 6/1/2018 A 84
5 8/1/2018 A 84
6 9/1/2018 A 54
7 10/1/2018 A 19
8 11/1/2018 A 44
9 12/1/2018 A 80
10 1/1/2018 B 24
11 2/1/2018 B 100
12 9/1/2018 B 40
13 10/1/2018 B 47
14 12/1/2018 B 10
15 2/1/2019 B 5
For both customers there are missing dates, as they purchased nothing in some months.
I want to add, per customer, the income of the previous month and also the rolling sum of income over the last year.
Meaning: if there's a missing month, I'll see '0' in the shift(1) column of the following month that has income, and I'll see a rolling sum over 12 months even if there weren't 12 observations.
This is the expected result:
Date Customer Income S(1) R(12)
0 1/1/2018 A 53 0 53
1 2/1/2018 A 36 53 89
2 3/1/2018 A 53 36 142
3 5/1/2018 A 89 0 231
4 6/1/2018 A 84 89 315
5 8/1/2018 A 84 0 399
6 9/1/2018 A 54 84 453
7 10/1/2018 A 19 54 472
8 11/1/2018 A 44 19 516
9 12/1/2018 A 80 44 596
10 1/1/2018 B 24 0 24
11 2/1/2018 B 100 24 124
12 9/1/2018 B 40 0 164
13 10/1/2018 B 47 40 211
14 12/1/2018 B 10 0 221
15 2/1/2019 B 5 0 102
So far, I've added the rows for the missing dates using stack and unstack, but with many dates and customers this explodes the data to millions of rows and crashes the kernel, with most of the rows being 0's.
You can use .shift, but add logic so that if the gap to the previous row is more than 31 days, S(1) is set to 0.
The rolling-12 calculation requires figuring out a "Rolling Date" and a somewhat complicated list comprehension to decide which incomes to include, then taking the sum of each list per row.
import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date']).dt.date
# previous income per customer; the first row of each customer gets 0
df['S(1)'] = df.groupby('Customer')['Income'].transform('shift').fillna(0)
# zero out S(1) when the gap to the previous row is more than 31 days
s = (df['Date'] - df['Date'].shift()) / np.timedelta64(1, '31D') <= 1
df['S(1)'] = df['S(1)'].where(s, 0).astype(int)
# start of the 12-month window ('1Y' is ambiguous and rejected by newer pandas Timedelta)
df['Rolling Date'] = df['Date'] - pd.Timedelta(days=365)
df['R(12)'] = df.apply(lambda d: sum([z for x, y, z in
                                      zip(df['Customer'], df['Date'], df['Income'])
                                      if y > d['Rolling Date']
                                      if y <= d['Date']
                                      if x == d['Customer']]), axis=1)
df = df.drop('Rolling Date', axis=1)
df
Out[1]:
Date Customer Income S(1) R(12)
0 2018-01-01 A 53 0 53
1 2018-02-01 A 36 53 89
2 2018-03-01 A 53 36 142
3 2018-05-01 A 89 0 231
4 2018-06-01 A 84 89 315
5 2018-08-01 A 84 0 399
6 2018-09-01 A 54 84 453
7 2018-10-01 A 19 54 472
8 2018-11-01 A 44 19 516
9 2018-12-01 A 80 44 596
10 2018-01-01 B 24 0 24
11 2018-02-01 B 100 24 124
12 2018-09-01 B 40 0 164
13 2018-10-01 B 47 40 211
14 2018-12-01 B 10 0 221
15 2019-02-01 B 5 0 102
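If the row-wise apply becomes too slow on larger data, a possible alternative for R(12) (a sketch, relying on the frame being sorted by Customer and then Date, as in the sample) is a time-based rolling window:
df['Date'] = pd.to_datetime(df['Date'])
df['R(12)'] = (df.set_index('Date')
                 .groupby('Customer')['Income']
                 .rolling('365D')  # window is (date - 365 days, date]
                 .sum()
                 .values)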
Let's say I have a table of millions of records resulting from a simulation; below is a sample:
TO Sim DUR Cost
1 1 20 145
1 2 24 120
1 3 27 176
1 4 30 170
1 5 23 173
1 6 26 148
1 7 21 175
1 8 22 171
1 9 23 169
1 10 23 178
2 1 23 172
2 2 29 152
2 3 25 162
2 4 20 179
2 5 26 154
2 6 27 137
2 7 27 131
2 8 28 148
2 9 25 156
2 10 22 169
How do I do the calculation in BigQuery to find the percentage of rows that satisfy two conditions? (I can do a UDF, but I would like it to be all in SQL statements.)
The Excel equivalent of the new calculated column would be =countifs($C$2:$C$21,">="&C2,$D$2:$D$21,">="&D2,$A$2:$A$21,A2) / countif($A$2:$A$21,A2)
The results would look like:
TO Sim DUR Cost f0
1 1 20 145 0.90
1 2 24 120 0.40
1 3 27 176 0.10
1 4 30 170 0.10
1 5 23 173 0.30
1 6 26 148 0.30
1 7 21 175 0.30
1 8 22 171 0.40
1 9 23 169 0.50
1 10 23 178 0.10
2 1 23 172 0.10
2 2 29 152 0.10
2 3 25 162 0.10
2 4 20 179 0.10
2 5 26 154 0.10
2 6 27 137 0.30
2 7 27 131 0.40
2 8 28 148 0.20
2 9 25 156 0.20
2 10 22 169 0.20
Below is for BigQuery Standard SQL
#standardSQL
SELECT ANY_VALUE(a).*, COUNTIF(b.dur >= a.dur AND b.cost >= a.cost) / COUNT(1) calc
FROM `project.dataset.table` a
JOIN `project.dataset.table` b
USING (to_)
GROUP BY FORMAT('%t', a)
-- ORDER BY to_, sim
If applied to the sample data from your question, the result is:
Row to_ sim dur cost calc
1 1 1 20 145 0.9
2 1 2 24 120 0.4
3 1 3 27 176 0.1
4 1 4 30 170 0.1
5 1 5 23 173 0.3
6 1 6 26 148 0.3
7 1 7 21 175 0.3
8 1 8 22 171 0.4
9 1 9 23 169 0.5
10 1 10 23 178 0.1
11 2 1 23 172 0.1
12 2 2 29 152 0.1
13 2 3 25 162 0.1
14 2 4 20 179 0.1
15 2 5 26 154 0.1
16 2 6 27 137 0.3
17 2 7 27 131 0.4
18 2 8 28 148 0.2
19 2 9 25 156 0.2
20 2 10 22 169 0.2
Note: I am using the field name to_ instead of to, which is a keyword and not allowed as a column name.
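For anyone wanting to sanity-check the numbers outside BigQuery, a small pandas sketch of the same countifs logic (hypothetical: the column names TO, DUR and Cost are taken from the sample table):
import pandas as pd

def share(row, df):
    # rows in the same TO group whose DUR and Cost are both >= the current row's values
    grp = df[df['TO'] == row['TO']]
    return ((grp['DUR'] >= row['DUR']) & (grp['Cost'] >= row['Cost'])).mean()

df['f0'] = df.apply(lambda r: share(r, df), axis=1)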
This
SELECT
AVG(s.Amount/100)[Avg],
STDEV(s.Amount/100) [StDev],
VAR(s.Amount/100) [Var]
Returns this:
Avg StDev Var
133 550.82021581146 303402.910146583
Statistics aren't my strongest suit, but how is it possible that the standard deviation and variance are larger than the average? Not only that, but the variance is almost 100x larger than the largest sample in the set.
Here is the entire sample set, with the above replaced with
SELECT s.Amount/100
while the rest of the query is identical
Amount
4645
3182
422
377
359
298
278
242
230
213
182
180
174
166
150
130
116
113
109
107
102
96
84
78
78
76
66
64
61
60
60
60
59
59
56
49
46
41
41
39
38
36
29
27
26
25
25
25
24
24
24
22
22
22
20
20
19
19
19
19
19
18
17
17
17
16
14
13
12
12
12
11
11
10
10
10
10
9
9
9
8
8
8
7
7
6
6
6
3
3
3
3
2
2
2
2
2
1
1
1
1
1
1
You need to read a book on statistics, or at least start with the Wikipedia pages that describe the concepts.
The standard deviation and the variance are closely related: the variance is the square of the standard deviation. You can check that this is true of your numbers.
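Indeed, 550.82021581146 squared is about 303,402.91, which matches the reported variance.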
There is no particular relationship between the standard deviation and the average. The standard deviation measures the dispersal of the data around the average, and the data can be arbitrarily dispersed around an average.
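A tiny illustration with made-up numbers (not your data), just to show that the standard deviation can exceed the mean for skewed data:
import statistics
sample = [0, 0, 0, 1000]           # heavily skewed, like the Amount column above
print(statistics.mean(sample))     # 250
print(statistics.stdev(sample))    # 500.0 -- the sample standard deviation exceeds the mean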
You might be confused because there are estimates of the standard deviation/standard error that assume a particular distribution of the data. However, those estimates are about the distribution and not about the data.