Pandas groupby [['col_name', 'values from another dataframe']]

I have two very large pandas dataframes, df and df_new
Sample df:
  A    B   DU   DR
100  103   -2  -10
100  110   -8   -9
100  112    0   -4
100  105    2    0
100  111  NaN   12
...
264  100  NaN  -15
...
Sample df_new:
  A  TD
100   0
100   1
100   2
...
103   0
103   1
...
I wish to get another pandas dataframe with the count of B values whose DU is less than or equal to the TD of df_new for the same value of A in both df and df_new. Similarly, I need the count of B values whose DU is greater than the TD of df_new for the same value of A (this count should also include np.nan values).
i.e., my expected dataframe should be something like this:
  A  TD  Count_Less  Count_More
100   0           3           2
100   1           3           2
100   2           4           1
...
103   0           0           5
103   1           1           4
...
How can I do this in Python?
Please note the data size is huge.
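For reference, a minimal reproducible pair of frames built from the visible sample rows (a sketch; the real frames are much larger, and only the A == 100 rows of df are reconstructed here):
import numpy as np
import pandas as pd

# Sample frames reconstructed from the excerpt above (assumption: only the
# rows shown in the question are included)
df = pd.DataFrame({'A':  [100, 100, 100, 100, 100],
                   'B':  [103, 110, 112, 105, 111],
                   'DU': [-2, -8, 0, 2, np.nan],
                   'DR': [-10, -9, -4, 0, 12]})
df_new = pd.DataFrame({'A':  [100, 100, 100, 103, 103],
                       'TD': [0, 1, 2, 0, 1]})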

First use DataFrame.merge with a left join, then compare the columns with Series.le for <= and Series.gt for >, assign the results to new columns with DataFrame.assign, and finally aggregate with sum. NaN in DU is first replaced by a value larger than any TD, so those rows land in the Count_More bucket:
df1 = df_new.merge(df.assign(DU=df['DU'].fillna(df_new['TD'].max() + 1)), on='A', how='left')
df2 = (df1.assign(Count_Less=df1['DU'].le(df1['TD']).astype(int),
                  Count_More=df1['DU'].gt(df1['TD']).astype(int))
          .groupby(['A', 'TD'], as_index=False)[['Count_Less', 'Count_More']].sum())
print (df2)
     A  TD  Count_Less  Count_More
0  100   0           3           2
1  100   1           3           2
2  100   2           4           1
3  103   0           0           0
4  103   1           0           0
Another solution with a custom function, but it is slow if df_new is large:
df1 = df.assign(DU=df['DU'].fillna(df_new['TD'].max() + 1))

def f(x):
    du = df1.loc[df1['A'].eq(x['A']), 'DU']
    Count_Less = du.le(x['TD']).sum()
    Count_More = du.gt(x['TD']).sum()
    return pd.Series([Count_Less, Count_More], index=['Count_Less', 'Count_More'])

df_new = df_new.join(df_new.apply(f, axis=1))
print (df_new)
     A  TD  Count_Less  Count_More
0  100   0           3           2
1  100   1           3           2
2  100   2           4           1
3  103   0           0           0
4  103   1           0           0
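Since the data is huge, a merge-free sketch using np.searchsorted may also help: sort the DU values within each A group once, then each (A, TD) lookup becomes a binary search. counts_by_searchsorted is a hypothetical helper name, and NaN again counts toward Count_More:
import numpy as np

def counts_by_searchsorted(df, df_new):
    # sorted non-NaN DU values per A, plus total row counts (NaN included)
    sorted_du = {a: np.sort(g.dropna().to_numpy()) for a, g in df.groupby('A')['DU']}
    totals = df.groupby('A').size().to_dict()
    less, more = [], []
    for a, td in zip(df_new['A'], df_new['TD']):
        du = sorted_du.get(a, np.array([]))
        n_le = int(np.searchsorted(du, td, side='right'))  # count of DU <= TD
        less.append(n_le)
        more.append(totals.get(a, 0) - n_le)               # the rest, incl. NaN
    return df_new.assign(Count_Less=less, Count_More=more)
Like the merge solution above, A values absent from df come out as 0/0.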

Related

Maximum of calculated pandas column and 0

I have a very simple problem (I guess) but can't find the right syntax for it.
Given the following DataFrame:
   A   B  C
0  7  12  2
1  5   4  4
2  4   8  2
3  9   2  3
I need to create a new column D equal, for each row, to max(0, A - B + C).
I tried np.maximum(df.A - df.B + df.C, 0) but it doesn't match: it gives me the maximum value of the calculated column for each row (= 10 in the example).
Finally, I would like to obtain the DF below:
   A   B  C   D
0  7  12  2   0
1  5   4  4   5
2  4   8  2   0
3  9   2  3  10
Any help appreciated
Thanks
Let us try
df['D'] = df.eval('A-B+C').clip(lower=0)
Out[256]:
0     0
1     5
2     0
3    10
dtype: int64
You can use np.where:
s = df["A"]-df["B"]+df["C"]
df["D"] = np.where(s>0, s, 0) #or s.where(s>0, 0)
print (df)
   A   B  C   D
0  7  12  2   0
1  5   4  4   5
2  4   8  2   0
3  9   2  3  10
To do this in one line you can use apply to apply the maximum function to each row separately.
In [19]: df['D'] = df.apply(lambda s: max(s['A'] - s['B'] + s['C'], 0), axis=1)
In [20]: df
Out[20]:
   A   B  C   D
0  7  12  2   0
1  5   4  4   5
2  4   8  2   0
3  9   2  3  10
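For what it's worth, np.maximum (unlike Python's built-in max or np.max) operates elementwise, so the expression from the question should also work directly, and it avoids the slow row-by-row apply:
import numpy as np

# elementwise maximum of the computed column and 0
df['D'] = np.maximum(df['A'] - df['B'] + df['C'], 0)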

Delete rows in dataframe based on info from Series

I would like to delete all rows in the DataFrame that have a number of appearances = 10 and status = 1.
Example of Dataframe X is
       ID  Status
0  366804       0
1  371391       1
2  383537       1
3  383538       0
4  383539       0
...
First I found all rows with status=1 with count()=10
exclude = X[X.Status == 1].groupby('ID')['Status'].value_counts().loc[lambda x: x == 10].index
exclude is a MultiIndex:
MultiIndex([(371391, 1),
            (383537, 1),
            ...
Is it possible to delete rows in DataFrame X based on the IDs from this index?
If your original DataFrame looks something like this:
print(df)
        ID  Status
0   366804       0
1   371391       1
2   383537       1
3   383538       0
4   383539       0
5   371391       1
6   371391       1
7   371391       1
8   371391       1
9   371391       1
10  371391       1
11  371391       1
12  371391       1
13  371391       1
And you group IDs and statuses together to find the IDs you want to exclude:
df2 = df.groupby(['ID', 'Status']).size().to_frame('size').reset_index()
print(df2)
       ID  Status  size
0  366804       0     1
1  371391       1    10
2  383537       1     1
3  383538       0     1
4  383539       0     1
excludes = df2.loc[(df2['size'] == 10) & (df2['Status'] == 1), 'ID']
print(excludes)
1    371391
Name: ID, dtype: int64
Then you could use Series.isin and invert the boolean mask with ~:
df = df[~df['ID'].isin(excludes)]
print(df)
       ID  Status
0  366804       0
2  383537       1
3  383538       0
4  383539       0
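If you already have the MultiIndex from the question's value_counts approach, a sketch working directly from it: pull the IDs out of its first level and filter the same way:
# `exclude` is the (ID, Status) MultiIndex built in the question
ids_to_drop = exclude.get_level_values('ID')
X = X[~X['ID'].isin(ids_to_drop)]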

How to extract the first digit of every value of a dataframe (df) column into a new dataframe (df1)

Sample dataframe (df) with a column price:
   price
0   2500
1   2600
2   5400
3   3250
4   6245
...
How to achieve df1 as:
   price
0      2
1      2
2      5
3      3
4      6
...
My idea is to convert each number to a string and take index 0 of each value, but is there any other approach?
Convert values to strings and select the first character:
df['new'] = df['price'].astype(str).str[0].astype(int)
print (df)
   price  new
0   2500    2
1   2600    2
2   5400    5
3   3250    3
4   6245    6
Or use integer division (this works here because every price has 4 digits):
df['new'] = df['price'] // 1000
Or generally:
print (df)
    price
0      20
1     260
2       5
3  325000
4    6245
import numpy as np
df['new'] = df['price'] // (10 ** np.log10(df['price']).astype(int))
print (df)
    price  new
0      20    2
1     260    2
2       5    5
3  325000    3
4    6245    6
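A quick self-contained check (a sketch assuming positive integer prices; np.log10 would break on zero or negative values) that the string route and the arithmetic route agree:
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [20, 260, 5, 325000, 6245]})
via_str = df['price'].astype(str).str[0].astype(int)   # first character
via_math = df['price'] // (10 ** np.log10(df['price']).astype(int))
assert (via_str == via_math).all()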

Need to loop over pandas series to find indices of variable

I have a dataframe and a list. I would like to iterate over the elements in the list, find their locations in the dataframe, and then store those in a new dataframe.
my_list = ['1','2','3','4','5']
df1 = pd.DataFrame(my_list, columns=['Num'])
dataframe: df1
  Num
0   1
1   2
2   3
3   4
4   5
dataframe: df2
    0   1  2   3   4
0   9  12  8   6   7
1  11   1  4  10  13
2   5  14  2   0   3
I've tried something similar to this, but it doesn't work:
for x in my_list:
    i, j = np.array(np.where(df == x)).tolist()
    df2['X'] = df.append(i)
    df2['Y'] = df.append(j)
so I'm looking for a result like this:
dataframe: df1 updated
  Num  X  Y
0   1  1  1
1   2  2  2
2   3  2  4
3   4  1  2
4   5  2  0
any hints or ideas would be appreciated
Instead of trying to find each value in df2, why not just flatten df2 into a long dataframe. Keeping the original index (ignore_index=False, available since pandas 1.1) preserves the row position of each value:
df2 = pd.melt(df2, ignore_index=False)
df2.reset_index(inplace=True)
df2.columns = ['X', 'Y', 'Num']
so now your df2 just looks like this:
    X  Y  Num
0   0  0    9
1   1  0   11
2   2  0    5
3   0  1   12
4   1  1    1
5   2  1   14
You can of course sort by Num, and if you just want the values from your list you can further filter df2 (note that my_list holds strings while df2 holds integers, so align the types first):
df2 = df2[df2.Num.isin([int(x) for x in my_list])]
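An alternative sketch, starting again from the original (unmelted) df2, uses DataFrame.stack, which yields a (row, column) MultiIndex per value and can be merged straight onto df1; the astype is needed because my_list held strings:
# stack gives a Series indexed by (row, column); flatten it into columns
pos = df2.stack().rename('Num').reset_index()
pos.columns = ['X', 'Y', 'Num']
df1['Num'] = df1['Num'].astype(int)
df1 = df1.merge(pos, on='Num', how='left')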

Pandas, groupby operation

Using groupby() in pandas I get the following output (A, B, C are the columns in the input table):
     C
A B
0 0  6
  2  1
  6  5
...
Output details: [244 rows x 1 columns]. I just want to have all 3 columns instead of one; how is it possible to do that?
The output I wish for:
A  B  C
0  0  6
0  2  1
...
It appears to be undocumented, but gb.bfill() does it; see this example:
In [68]:
df = pd.DataFrame({'A': [0,0,0,0,0,0,0,0],
                   'B': [0,0,0,0,1,1,1,1],
                   'C': [1,2,3,4,1,2,3,4]})
In [69]:
gb = df.groupby(['A', 'B'])
In [70]:
print(gb.bfill())
   A  B  C
0  0  0  1
1  0  0  2
2  0  0  3
3  0  0  4
4  0  1  1
5  0  1  2
6  0  1  3
7  0  1  4
[8 rows x 3 columns]
But I don't see why you need to do that; don't you end up with the original DataFrame (only maybe rearranged)?
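That said, the conventional way to get a grouped result back as three flat columns is reset_index, or as_index=False up front (a sketch assuming the original aggregation was a sum over C):
out = df.groupby(['A', 'B'], as_index=False)['C'].sum()
# or equivalently:
out = df.groupby(['A', 'B'])['C'].sum().reset_index()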