Get row index based on multiple column values in pandas

Here is a sample DataFrame; the real one has more than 500k rows. I am trying to get the row index of every instance where column ‘Trigger’ == 1 so that I can get the value in column ‘Price’. See the ‘desired’ column.
df10 = pd.DataFrame({
'Trigger': [0,0,1,1,1,0,0,1,0,1],
'Price': [12,14,16,18,20,2,4,6,8,10],
'Stock': ['AAPL', 'AAPL', 'AAPL', 'AAPL', 'AAPL', 'IBM','IBM','IBM','IBM','IBM'],
'desired':[0,0,16,18,20,0,0,6,0,10]
})
I was looking at answers online, and you can use the code below, but it gives an array of all instances and I don’t know how to move through the positions in that array, or whether that is even possible.
df10['not_correct'] = np.where(df10['Trigger'] ==1 , df10.iloc[df10.index[df10['Trigger'] == 1][0],0],0)
So essentially, I want to find the row index of all instances where column ‘Trigger’ == 1. It would be similar to a simple IF statement in Excel: if (a[row#] == 1, b[row#], 0)
Keep in mind this is only an example. I will NOT know where the 1s and 0s are in the actual df, or how many 1s there are in the ‘Trigger’ column: it could be 0, 1 or 50.

To get the row number, use df.index in your np.where.
df10['row']=np.where(df10['Trigger']==1,df10.index,0)
df10
Out[7]:
Trigger Price Stock desired row
0 0 12 AAPL 0 0
1 0 14 AAPL 0 0
2 1 16 AAPL 16 2
3 1 18 AAPL 18 3
4 1 20 AAPL 20 4
5 0 2 IBM 0 0
6 0 4 IBM 0 0
7 1 6 IBM 6 7
8 0 8 IBM 0 0
9 1 10 IBM 10 9
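One caveat with writing 0 for the non-matching rows is that row 0 itself becomes indistinguishable from "no trigger". If the row labels or positions are needed on their own, a boolean mask can be used instead. The following is a minimal sketch on the sample data; the names mask, trigger_labels, trigger_positions and trigger_prices are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

df10 = pd.DataFrame({
    'Trigger': [0, 0, 1, 1, 1, 0, 0, 1, 0, 1],
    'Price': [12, 14, 16, 18, 20, 2, 4, 6, 8, 10],
})

mask = df10['Trigger'].eq(1)                         # True where Trigger == 1
trigger_labels = df10.index[mask]                    # index labels of matching rows
trigger_positions = np.flatnonzero(mask.to_numpy())  # 0-based positions of matching rows
trigger_prices = df10.loc[mask, 'Price']             # Price values on those rows only

print(trigger_labels.tolist())   # [2, 3, 4, 7, 9]
print(trigger_prices.tolist())   # [16, 18, 20, 6, 10]
```

With the default RangeIndex the labels and the positions coincide; they differ once the frame has been filtered or re-indexed.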

np.where does not need the result to be filtered first: pass the Price column itself as the value to use where the condition is true.
df10['New']=np.where(df10.Trigger==1,df10.Price,0)
df10
Out[180]:
Trigger Price Stock desired New
0 0 12 AAPL 0 0
1 0 14 AAPL 0 0
2 1 16 AAPL 16 16
3 1 18 AAPL 18 18
4 1 20 AAPL 20 20
5 0 2 IBM 0 0
6 0 4 IBM 0 0
7 1 6 IBM 6 6
8 0 8 IBM 0 0
9 1 10 IBM 10 10
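The same result can probably also be reached without NumPy by using Series.where, which keeps a value where the condition is true and substitutes the second argument elsewhere. This is a sketch on the sample data; the no_trigger frame is a made-up example to show the case with zero triggers mentioned in the question.

```python
import pandas as pd

df10 = pd.DataFrame({
    'Trigger': [0, 0, 1, 1, 1, 0, 0, 1, 0, 1],
    'Price': [12, 14, 16, 18, 20, 2, 4, 6, 8, 10],
})

# Keep Price where Trigger == 1, otherwise use 0
df10['New'] = df10['Price'].where(df10['Trigger'].eq(1), 0)
print(df10['New'].tolist())  # [0, 0, 16, 18, 20, 0, 0, 6, 0, 10]

# Hypothetical frame with no triggers at all: every row falls back to 0
no_trigger = pd.DataFrame({'Trigger': [0, 0], 'Price': [5, 7]})
no_trigger['New'] = no_trigger['Price'].where(no_trigger['Trigger'].eq(1), 0)
print(no_trigger['New'].tolist())  # [0, 0]
```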

Related

Maximum of calculated pandas column and 0

I have a very simple problem (I guess) but can't find the right syntax for it.
Given the following DataFrame:
A B C
0 7 12 2
1 5 4 4
2 4 8 2
3 9 2 3
I need to create a new column D that is equal, for each row, to max(0, A-B+C).
I tried np.maximum(df.A-df.B+df.C,0), but it doesn't give what I expect: I get the maximum value of the calculated column for every row (= 10 in the example).
In the end, I would like to obtain the DataFrame below:
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
Any help appreciated
Thanks
Let us try
df['D'] = df.eval('A-B+C').clip(lower=0)
Out[256]:
0 0
1 5
2 0
3 10
dtype: int64
You can use np.where:
s = df["A"]-df["B"]+df["C"]
df["D"] = np.where(s>0, s, 0) #or s.where(s>0, 0)
print (df)
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
To do this in one line you can use apply to apply the max function to each row separately.
In [19]: df['D'] = df.apply(lambda s: max(s['A'] - s['B'] + s['C'], 0), axis=1)
In [20]: df
Out[20]:
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
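As a side note, np.maximum is elementwise, so it should give the same column as clip or np.where. One possible explanation for the single value 10 reported in the question is that a reduction such as np.max was called instead; that is only an assumption, since the exact call is not shown. A small sketch of the difference:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [7, 5, 4, 9], 'B': [12, 4, 8, 2], 'C': [2, 4, 2, 3]})
s = df['A'] - df['B'] + df['C']   # -3, 5, -2, 10

elementwise = np.maximum(s, 0)    # compares each element with 0
overall = np.max(s)               # reduces the whole column to one number
clipped = s.clip(lower=0)         # same result as np.maximum(s, 0)

print(elementwise.tolist())       # [0, 5, 0, 10]
print(overall)                    # 10
```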

Counting consecutive occurrences in dataframe based on condition

I am trying to find whether 3 or more consecutive occurrences of any number are present in a column, and if so mark the last one with a 1 and the rest with zeros.
df['a'] = df.assign(consecutive=df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size')).query('consecutive > #threshold')
is what I have found here: Identifying consecutive occurrences of a value. However, this gives me the error: ValueError: Wrong number of items passed 6, placement implies 1. I understand the issue is that the result cannot be assigned to a single column of the dataframe, but what would be the correct approach to get the desired result?
Secondly, if this condition is satisfied, I would like to apply an equation (e.g. 2*b) to multiple rows neighbouring the 1 (either the previous rows or the ones that follow), like the shift function but repeated over e.g. the 3 previous rows. I'm quite sure this must be possible but have not been able to get the whole thing to work. It does not necessarily have to be based on the 1 in column c; this is just a proposal.
A small data excerpt is below for interpretation; columns c and d show the desired result:
a b c d
16215 2 0 0
24848 4 0 0
24849 4 0 8
24850 4 0 8
24851 4 1 8
24852 6 0 0
24853 6 0 0
24854 8 0 0
24855 8 0 0
24856 8 0 16
25208 8 0 16
25932 8 1 16
28448 10 0 0
28449 10 0 0
28450 10 0 0
Use cumsum with diff to create the group key, then find the last position of each group whose total count is more than 3, then use bfill with a limit to fill the preceding rows:
s=df.b.diff().ne(0).cumsum()
s1=s.groupby(s).transform('count')
s2=s.groupby(s).cumcount()
df['c']=((s1==s2+1)&(s1>3)).astype(int)
df['d']=(df.c.mask(df.c==0)*df.b*2).bfill(limit=2).combine_first(df.c)
df
Out[87]:
a b c d
0 16215 2 0 0.0
1 24848 4 0 0.0
2 24849 4 0 8.0
3 24850 4 0 8.0
4 24851 4 1 8.0
5 24852 6 0 0.0
6 24853 6 0 0.0
7 24854 8 0 0.0
8 24855 8 0 0.0
9 24856 8 0 16.0
10 25208 8 0 16.0
11 25932 8 1 16.0
12 28448 10 0 0.0
13 28449 10 0 0.0
14 28450 10 0 0.0
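The run-detection step above can be wrapped in a small helper so that the minimum run length becomes a parameter. This is a sketch only: mark_run_ends and min_len are hypothetical names, not part of the answer. Note that the answer's condition s1>3 corresponds to min_len=4, which matches the desired column c, whereas "3 or more" taken literally (min_len=3) would also flag the last row of the run of 10s.

```python
import pandas as pd

def mark_run_ends(values: pd.Series, min_len: int) -> pd.Series:
    """Return 1 on the last row of each run of equal values that has
    at least min_len rows, and 0 everywhere else."""
    run_id = values.ne(values.shift()).cumsum()          # new id whenever the value changes
    run_size = run_id.groupby(run_id).transform('size')  # length of the run each row is in
    is_last = run_id.ne(run_id.shift(-1))                # True on the final row of each run
    return (is_last & run_size.ge(min_len)).astype(int)

b = pd.Series([2, 4, 4, 4, 4, 6, 6, 8, 8, 8, 8, 8, 10, 10, 10])

c = mark_run_ends(b, min_len=4)
print(c.tolist())  # [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

c3 = mark_run_ends(b, min_len=3)
print(c3.tolist())  # [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
```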

Pandas: Calculate percentage of column for each class

I have a dataframe like this:
Class Boolean Sum
0 1 0 10
1 1 1 20
2 2 0 15
3 2 1 25
4 3 0 52
5 3 1 48
I want to calculate percentage of 0/1's for each class, so for example the output could be:
Class Boolean Sum %
0 1 0 10 0.333
1 1 1 20 0.666
2 2 0 15 0.375
3 2 1 25 0.625
4 3 0 52 0.520
5 3 1 48 0.480
Divide column Sum by the result of GroupBy.transform, which returns a Series of the same length as the original DataFrame filled with the aggregated values:
df['%'] = df['Sum'].div(df.groupby('Class')['Sum'].transform('sum'))
print (df)
Class Boolean Sum %
0 1 0 10 0.333333
1 1 1 20 0.666667
2 2 0 15 0.375000
3 2 1 25 0.625000
4 3 0 52 0.520000
5 3 1 48 0.480000
Detail:
print (df.groupby('Class')['Sum'].transform('sum'))
0 30
1 30
2 40
3 40
4 100
5 100
Name: Sum, dtype: int64
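To see why transform is used rather than a plain aggregation, the sketch below compares the two: the aggregation returns one row per class, while transform repeats each class total on every row so it lines up with the original frame. The names per_class, class_total and check are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Class': [1, 1, 2, 2, 3, 3],
    'Boolean': [0, 1, 0, 1, 0, 1],
    'Sum': [10, 20, 15, 25, 52, 48],
})

per_class = df.groupby('Class')['Sum'].sum()                 # one value per class (3 rows)
class_total = df.groupby('Class')['Sum'].transform('sum')    # one value per original row (6 rows)

df['%'] = df['Sum'] / class_total

# Sanity check: the shares within each class should add up to 1
check = df.groupby('Class')['%'].sum()
print(len(per_class), len(class_total))  # 3 6
```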

Determine the max count in a pandas Grouped By df and use this as a criteria to return records

Afternoon All,
I have a large amount of data over a one month period. I would like to:
a. Find the book with the highest number of trades over that month.
b. Knowing this, provide a groupby summary of all the trades done on that book for the month, with the month's trades bucketed into each hour of the 24-hour clock.
Here is a sample dataset:
df_Highest_Traded_Away_Book = [
('trading_book', ['A', 'A','A','A','B','C','C','C']),
('rfq_create_date_time', ['2018-09-03 01:06:09', '2018-09-08 01:23:29',
'2018-09-15 02:23:29','2018-09-20 03:23:29',
'2018-09-20 00:23:29','2018-09-25 01:23:29',
'2018-09-25 02:23:29','2018-09-30 02:23:29',])
]
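# DataFrame.from_items was removed in pandas 1.0; build the frame from a dict of the (name, values) pairs instead
df_Highest_Traded_Away_Book = pd.DataFrame(dict(df_Highest_Traded_Away_Book))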
display(df_Highest_Traded_Away_Book)
trading_book rfq_create_date_time
0 A 2018-09-03 01:06:09
1 A 2018-09-08 01:23:29
2 A 2018-09-15 02:23:29
3 A 2018-09-20 03:23:29
4 B 2018-09-20 00:23:29
5 C 2018-09-25 01:23:29
6 C 2018-09-25 02:23:29
7 C 2018-09-30 02:23:29
df_Highest_Traded_Away_Book['rfq_create_date_time'] = pd.to_datetime(df_Highest_Traded_Away_Book['rfq_create_date_time'])
df_Highest_Traded_Away_Book['Time_in_GMT'] = df_Highest_Traded_Away_Book['rfq_create_date_time'].dt.hour
display(df_Highest_Traded_Away_Book)
trading_book rfq_create_date_time Time_in_GMT
0 A 2018-09-03 01:06:09 1
1 A 2018-09-08 01:23:29 1
2 A 2018-09-15 02:23:29 2
3 A 2018-09-20 03:23:29 3
4 B 2018-09-20 00:23:29 0
5 C 2018-09-25 01:23:29 1
6 C 2018-09-25 02:23:29 2
7 C 2018-09-30 02:23:29 2
df_Highest_Traded_Away_Book = df_Highest_Traded_Away_Book.groupby(['trading_book']).size().reset_index(name='Traded_Away_for_the_Hour').sort_values(['Traded_Away_for_the_Hour'], ascending=False)
display(df_Highest_Traded_Away_Book)
trading_book Traded_Away_for_the_Hour
0 A 4
2 C 3
1 B 1
display(df_Highest_Traded_Away_Book['Traded_Away_for_the_Hour'].max())
4
i.e. Book A has the highest number of trades in the month.
Now return a grouped by result of all trades done on this book (for the month) but display such that trades are bucketed into the hour they were traded.
Time_in_GMT Trades_Book_A_Bucketted_into_the_Hour_They_Occured
0 0
1 2
2 1
3 1
4 0
. 0
. 0
. 0
24 0
Any help would be appreciated. I figure there is some way to return the criteria in one line of code.
Use Series.idxmax to get the top book:
df_Highest_Traded_Away_Book['rfq_create_date_time'] = pd.to_datetime(df_Highest_Traded_Away_Book['rfq_create_date_time'])
df_Highest_Traded_Away_Book['Time_in_GMT'] = df_Highest_Traded_Away_Book['rfq_create_date_time'].dt.hour
df_Highest_Book = df_Highest_Traded_Away_Book.groupby(['trading_book']).size().idxmax()
#alternative solution
#df_Highest_Book = df_Highest_Traded_Away_Book['trading_book'].value_counts().idxmax()
print(df_Highest_Book)
A
Then compare with eq (==), aggregate with sum to count the True values, and add the missing hours with reindex:
df_Highest_Traded_Away_Book = (df_Highest_Traded_Away_Book['trading_book']
.eq(df_Highest_Book)
.groupby(df_Highest_Traded_Away_Book['Time_in_GMT'])
.sum()
.astype(int)
.reindex(np.arange(25), fill_value=0)
.to_frame(df_Highest_Book))
print(df_Highest_Traded_Away_Book)
A
Time_in_GMT
0 0
1 2
2 1
3 1
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
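An alternative sketch, assuming the same sample data, filters to the top book first and then counts hours with value_counts. Here the index is reindexed to hours 0 to 23, since dt.hour never returns 24; use range(25) if the extra row from the desired output is wanted. The names trades, top_book and hourly are illustrative.

```python
import pandas as pd

trades = pd.DataFrame({
    'trading_book': ['A', 'A', 'A', 'A', 'B', 'C', 'C', 'C'],
    'rfq_create_date_time': pd.to_datetime([
        '2018-09-03 01:06:09', '2018-09-08 01:23:29',
        '2018-09-15 02:23:29', '2018-09-20 03:23:29',
        '2018-09-20 00:23:29', '2018-09-25 01:23:29',
        '2018-09-25 02:23:29', '2018-09-30 02:23:29',
    ]),
})
trades['hour'] = trades['rfq_create_date_time'].dt.hour

# Book with the most rows (trades) in the data
top_book = trades['trading_book'].value_counts().idxmax()

# Count that book's trades per hour, then fill hours with no trades with 0
hourly = (trades.loc[trades['trading_book'].eq(top_book), 'hour']
          .value_counts()
          .reindex(range(24), fill_value=0))

print(top_book)                  # A
print(hourly.loc[[0, 1, 2, 3]].tolist())  # [0, 2, 1, 1]
```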

What's the problem of this one-hot encoding?

In [4]: data = pd.read_csv('student_data.csv')
In [5]: data[:10]
Out[5]:
admit gre gpa rank
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4
5 1 760 3.00 2
6 1 560 2.98 1
7 0 400 3.08 2
8 1 540 3.39 3
9 0 700 3.92 2
one_hot_data = pd.get_dummies(data['rank'])
# TODO: Drop the previous rank column
data = data.drop('rank', axis=1)
data = data.join(one_hot_data)
# Print the first 10 rows of our data
data[:10]
It always gives an error:
KeyError: 'rank'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-25-6a749c8f286e> in <module>()
1 # TODO: Make dummy variables for rank
----> 2 one_hot_data = pd.get_dummies(data['rank'])
3
4 # TODO: Drop the previous rank column
5 data = data.drop('rank', axis=1)
If you get:
KeyError: 'rank'
it means there is no column named rank. The likely cause is trailing whitespace or an encoding problem in the header, so check the exact column names:
print (data.columns.tolist())
['admit', 'gre', 'gpa', 'rank']
Your solution can be simplified with DataFrame.pop, which selects a column and removes it from the original DataFrame:
data = data.join(pd.get_dummies(data.pop('rank')))
# Print the first 10 rows of our data
print(data[:10])
admit gre gpa 1 2 3 4
0 0 380 3.61 0 0 1 0
1 1 660 3.67 0 0 1 0
2 1 800 4.00 1 0 0 0
3 1 640 3.19 0 0 0 1
4 0 520 2.93 0 0 0 1
5 1 760 3.00 0 1 0 0
6 1 560 2.98 1 0 0 0
7 0 400 3.08 0 1 0 0
8 1 540 3.39 0 0 1 0
9 0 700 3.92 0 1 0 0
I tried your code and it works fine. You may need to rerun the previous cells, including the one that loads the data, because rank has already been dropped if the cell was run before.
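If the header really does contain stray whitespace, the column names can be cleaned before encoding. The sketch below uses a made-up in-memory CSV with a trailing space after rank to stand in for student_data.csv; the prefix argument is an optional addition that gives the dummy columns readable names.

```python
import io
import pandas as pd

# Hypothetical CSV text whose last header cell is "rank " (with a trailing space)
csv_text = "admit,gre,gpa,rank \n0,380,3.61,3\n1,660,3.67,3\n1,800,4.00,1\n"
data = pd.read_csv(io.StringIO(csv_text))

raw_columns = data.columns.tolist()
print(raw_columns)  # ['admit', 'gre', 'gpa', 'rank ']

# Remove leading/trailing whitespace from every column name
data.columns = data.columns.str.strip()

# pop returns the rank column and removes it from data in one step
data = data.join(pd.get_dummies(data.pop('rank'), prefix='rank'))
print(data.columns.tolist())  # ['admit', 'gre', 'gpa', 'rank_1', 'rank_3']
```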