Why this inconsistency between a DataFrame and one of its columns? - pandas

When debugging a nasty error in my code I came across something that looks like an inconsistency in the way DataFrames work (using pandas 1.0.3):
import pandas as pd
df = pd.DataFrame([[10*k, 11, 22, 33] for k in range(4)], columns=['d', 'k', 'c1', 'c2'])
y = df.k
X = df[['c1', 'c2']]
Then I tried to add a column to y (forgetting that y is a Series, not a DataFrame):
y['d'] = df['d']
I'm now aware that this adds a weird row to the Series; y is now:
0 11
1 11
2 11
3 11
d 0 0
1 10
2 20
3 30
Name: d, dtype...
Name: k, dtype: object
But the weird thing is that now:
>>> df.shape, df['k'].shape
((4, 4), (5,))
And df and df['k'] look like:
d k c1 c2
0 0 11 22 33
1 10 11 22 33
2 20 11 22 33
3 30 11 22 33
and
0 11
1 11
2 11
3 11
d 0 0
1 10
2 20
3 30
Name: d, dtype...
Name: k, dtype: object

There are a few things at work here:
A pandas series can store objects of arbitrary types.
y['d'] = _ adds a new entry to the series y with label 'd'.
Thus, y['d'] = df['d'] adds a new entry to the series y with label 'd' whose value is the entire series df['d'].
So you have added a series as the last entry of the series y. You can verify that
(y['d'] == y.iloc[-1]).all() == True and
(y.iloc[-1] == df['d']).all() == True.
To clarify the inconsistency between df and df.k: note that df.k, df['k'], and df.loc[:, 'k'] all return a 'view' of column k, so adding an entry to that series appends directly to the shared data. However, df.k displays the entire series, whereas df displays the series only up to the frame's length df.shape[0]. Hence the apparently inconsistent behavior.
I agree that this behavior is prone to bugs and should be fixed. View vs. copy is a common cause of such issues. In this case, df.iloc[:, 1] behaves correctly and should be used instead.
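If the goal is simply to avoid this shared-data surprise, a minimal sketch of the safer pattern (the .copy() is the key step here; it decouples y from df):
y = df['k'].copy()        # independent copy, no longer shares data with df
y['d'] = df['d']          # still adds an odd nested-series entry to y...
df.shape, df['k'].shape   # ...but now ((4, 4), (4,)): df stays intact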

Related

Pandas with a condition select a value from a column and multiply by scalar in new column, row by row

A value in 'Target_Labels' is either 0.0, 1.0 or 2.0 (float64).
Based on this value, I would like to look up a value in one of three columns, 'B365A', 'B365D' and 'B365H', and multiply it by 10 in a new column. This operation needs to be row-wise throughout the entire DataFrame.
I have tried many combinations but nothing seems to work...
final['amount'] = final['Target_Labels'].apply(lambda x: 'B365A' * 10 if x == 0.0 else ('B365D' * 10 if x == 1 else 'B365H' * 10))
def prod(x, var1, var2, var3, var4):
    if (x[var4]) == 0:
        x[var3] * 10
    elif (x[var4]) == 1:
        x[var1] * 10
    else:
        x[var2] * 10
    return x
final['montant'] = final.apply(lambda x: prod(x, 'B365D', 'B365H','B365A', 'Target_Labels'), axis=1)
I'm new to Pandas and any help is welcome...
Use numpy indexing to pick one cell per row, pairing each row position with the column position derived from Target_Labels:
array = final.values
row = range(len(final))
col = final['Target_Labels'] - 1
>>> final
B365A B365D B365H Target_Labels
0 11 12 13 1
1 11 12 13 2
2 11 12 13 3
>>> final['amount'] = final.values[(range(len(final)),
final['Target_Labels'] - 1)] * 10
>>> final
B365A B365D B365H Target_Labels amount
0 11 12 13 1 110
1 11 12 13 2 120
2 11 12 13 3 130
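The example above uses integer labels 1-3. With the float labels 0.0/1.0/2.0 described in the question (assuming the mapping 0 -> B365A, 1 -> B365D, 2 -> B365H implied by the lambda attempt), a sketch would cast the labels to int and fix the column order explicitly:
import numpy as np

cols = final['Target_Labels'].astype(int).to_numpy()     # 0.0/1.0/2.0 -> 0/1/2
vals = final[['B365A', 'B365D', 'B365H']].to_numpy()     # explicit column order
final['amount'] = vals[np.arange(len(final)), cols] * 10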

Pandas Groupby -- efficient selection/filtering of groups based on multiple conditions?

I am trying to filter dataframe groups in Pandas based on multiple ('any') conditions, but I cannot seem to get to a fast, 'native' Pandas one-liner.
Here I generate an example dataframe of 2*n*n rows and 4 columns:
import itertools
import random

import pandas as pd

n = 100
lst = range(0, n)
df = pd.DataFrame(
    {'A': list(itertools.chain.from_iterable(itertools.repeat(x, n*2) for x in lst)),
     'B': list(itertools.chain.from_iterable(itertools.repeat(x, 1*2) for x in lst)) * n,
     'C': random.choices(list(range(100)), k=2*n*n),
     'D': random.choices(list(range(100)), k=2*n*n)
    })
resulting in dataframes such as:
A B C D
0 0 0 26 49
1 0 0 29 80
2 0 1 70 92
3 0 1 7 2
4 1 0 90 11
5 1 0 19 4
6 1 1 29 4
7 1 1 31 95
I want to
group by A and B, and
keep only those groups in which column C contains at least one value greater than 50 and column D contains at least one value greater than 50.
A "native" Pandas one-liner would be the following:
df.groupby([df.A, df.B]).filter(lambda x: ((x.C > 50).any() & (x.D > 50).any()))
which produces
A B C D
2 0 1 70 92
3 0 1 7 2
This is all fine for small dataframes (say n < 20).
But this solution takes quite a long time for large dataframes (for example, 4.58 s when n = 100).
I have an alternative, step-by-step solution which achieves the same result, but runs much faster (28.1 ms when n = 100):
df_g = df.assign(key_C=df.C > 50, key_D=df.D > 50).groupby([df.A, df.B])
df_C_bool = df_g.key_C.transform('any')
df_D_bool = df_g.key_D.transform('any')
df[df_C_bool & df_D_bool]
but it is arguably a bit uglier. My questions are:
Is there a better 'native' Pandas solution for this task?
Is there a reason for the sub-optimal performance of my version of the 'native' solution?
Bonus question:
In fact I only want to extract the groups themselves, not their data. I.e., I only need
A B
0 1
in the above example. Is there a way to do this with Pandas without going through the intermediate step I did above?
This is similar in spirit to your second approach, but chained together. Note one difference: all(axis=1) requires C and D to exceed 50 on the same row, whereas your filter lets them do so on different rows:
mask = (df[['C','D']].gt(50)           # adjust here if C and D need different thresholds, e.g. [50, 60]
        .all(axis=1)                   # check that both hold on the same row
        .groupby([df['A'], df['B']])   # normal groupby
        .transform('max')              # 'any' instead of 'max' also works
       )
df.loc[mask]
If you don't want the data, you can forgo the transform:
mask = df[['C','D']].min(axis=1).gt(50).groupby([df['A'],df['B']]).any()
mask[mask].index
# out
# MultiIndex([(0, 1)],
# names=['A', 'B'])
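If C and D may exceed 50 on different rows within a group (which is what the separate .any() calls in the original filter allow), a sketch that keeps the per-column semantics is to run the groupby before combining the two columns:
gt = df[['C', 'D']].gt(50)                                       # row-wise threshold checks
mask = gt.groupby([df['A'], df['B']]).transform('any').all(axis=1)
df[mask]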

subset df by masking between specific rows

I'm trying to subset a pandas df by removing rows that fall between specific values. The problem is that these values can appear at different rows, so I can't select fixed row positions.
Specifically, I want to remove the rows that fall between ABC xxx and the integer 5. These markers could fall anywhere in the df, and the spans between them can be of unequal length.
Note: The string ABC will be followed by different values.
I thought about returning all the indexes that contain these two values, but a mask might work better if I could select all rows between those two values.
df = pd.DataFrame({
    'Val' : ['None','ABC','None',1,2,3,4,5,'X',1,2,'ABC',1,4,5,'Y',1,2],
})
mask = (df['Val'].str.contains(r'ABC(?!$)')) & (df['Val'] == 5)
Intended Output:
Val
0 None
8 X
9 1
10 2
15 Y
16 1
17 2
If ABC always comes before 5 and they always form (ABC, 5) pairs: get the indices of both values with np.where, zip them, generate the index ranges in between, and finally filter with isin using an inverted mask (~):
import numpy as np
import pandas as pd

# 2 occurrences of the (ABC, 5) pair in the data
df = pd.DataFrame({
    'Val' : ['None','ABC','None',1,2,3,4,5,'None','None','None',
             'None','ABC','None',1,2,3,4,5,'None','None','None']
})
m1 = np.where(df['Val'].str.contains(r'ABC', na=False))[0]
m2 = np.where(df['Val'] == 5)[0]
print (m1)
[ 1 12]
print (m2)
[ 7 18]
idx = [x for y, z in zip(m1, m2) for x in range(y, z + 1)]
print (df[~df.index.isin(idx)])
     Val
0   None
8   None
9   None
10  None
11  None
19  None
20  None
a = df.index[df['Val'].str.contains('ABC') == True][0]
b = df.index[df['Val'] == 5][0] + 1
c = np.array(range(a, b))
bad_df = df.index.isin(c)
df[~bad_df]
Output (for the question's df; only the first ABC...5 span is removed)
     Val
0   None
8      X
9      1
10     2
11   ABC
12     1
13     4
14     5
15     Y
16     1
17     2
If there is more than one 'ABC' / 5 pair, use the version below instead.
With this you drop everything from the first ABC through the last 5:
a = (df['Val'].str.contains('ABC') == True).idxmax()
b = df['Val'].where(df['Val'] == 5).last_valid_index() + 1
c = np.array(range(a, b))
bad_df = df.index.isin(c)
df[~bad_df]
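If the ABC...5 spans can repeat an arbitrary number of times, one more sketch (assuming every ABC is eventually closed by a 5) is to track open spans with cumulative counters instead of explicit index ranges:
starts = df['Val'].astype(str).str.startswith('ABC').cumsum()  # spans opened so far
ends = (df['Val'] == 5).cumsum().shift(fill_value=0)           # spans closed before this row
inside = starts > ends   # True from each ABC row down to (and including) its 5 row
df[~inside]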

apply() function to generate new value in a new column

I am new to Python 3 and pandas. I tried to add a new column to a dataframe whose value is the difference between two existing columns.
My current code is:
import pandas as pd
from io import StringIO

x = """a,b,c
1,2,3
4,5,6
7,8,9"""
with StringIO(x) as df:
    new = pd.read_csv(df)
print(new)
y = new.copy()
y.loc[:, "d"] = 0
# My lambda function is completely wrong, but I don't know how to make it right.
y["d"] = y["d"].apply(lambda x: y["a"] - y["b"], axis=1)
Desired output is
a b c d
1 2 3 -1
4 5 6 -1
7 8 9 -1
Does anyone have any idea how I can make my code work?
Thanks for your help.
You need to apply over the DataFrame y itself: DataFrame.apply with axis=1 processes row by row:
y["d"] = y.apply(lambda x: x["a"] - x["b"], axis=1)
For easier debugging, you can create a custom function that prints each row as it is processed:
def f(x):
    print(x)
    a = x["a"] - x["b"]
    return a

y["d"] = y.apply(f, axis=1)
a 1
b 2
c 3
Name: 0, dtype: int64
a 4
b 5
c 6
Name: 1, dtype: int64
a 7
b 8
c 9
Name: 2, dtype: int64
A better solution, if you only need to subtract two columns, is plain vectorized arithmetic:
y["d"] = y["a"] - y["b"]
print (y)
a b c d
0 1 2 3 -1
1 4 5 6 -1
2 7 8 9 -1
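As a side note, a minimal sketch of the same subtraction using assign, which returns a new frame instead of mutating y in place:
y = y.assign(d=y["a"] - y["b"])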

Combine two columns of numbers in dataframe into single column using pandas/python

I'm very new to Pandas and Python.
I have a 3226 x 61 dataframe and I would like to combine two columns into a single one.
The two columns I would like to combine are both integers - one has either one or two digits (1 through 52) while the other has three digits (e.g., 1 or 001, 23 or 023). I need the output to be a five digit integer (e.g., 01001 or 52023). There will be no mathematical operations with the resulting integers - I will need them only for look-up purposes.
Based on some of the other posts on this fantastic site, I tried the following:
df['YZ'] = df['Y'].map(str) + df['Z'].map(str)
But that returns "1.00001" for a first-column value of "1" and a second-column value of "001"; I believe this is because the column is stored as float, so converting to str yields "1.0", to the end of which "001" is then appended.
I've also tried:
df['YZ'] = df['Y'].join(df['Z'])
Getting the following error:
AttributeError: 'Series' object has no attribute 'join'
I've also tried:
df['Y'] = df['Y'].astype(int)
df['Z'] = df['Z'].astype(int)
df['YZ'] = df[['Y','Z']].apply(lambda x: ''.join(x), axis=1)
Getting the following error:
TypeError: ('sequence item 0: expected str instance, numpy.int32
found', 'occurred at index 0')
A copy of the columns is below:
1 1
1 3
1 5
1 7
1 9
1 11
1 13
I understand there are two issues here:
Combining the two columns
Getting the correct format (five digits)
Frankly, I need help with both but would be most appreciative of the column combining problem.
I think you need to convert the columns to strings, pad them with leading zeros using zfill, and then concatenate with +:
df['YZ'] = df['Y'].astype(str).str.zfill(2) + df['Z'].astype(str).str.zfill(3)
Sample:
df=pd.DataFrame({'Y':[1,3,5,7], 'Z':[10,30,51,74]})
print (df)
Y Z
0 1 10
1 3 30
2 5 51
3 7 74
df['YZ'] = df['Y'].astype(str).str.zfill(2) + df['Z'].astype(str).str.zfill(3)
print (df)
Y Z YZ
0 1 10 01010
1 3 30 03030
2 5 51 05051
3 7 74 07074
If need also change original columns:
df['Y'] = df['Y'].astype(str).str.zfill(2)
df['Z'] = df['Z'].astype(str).str.zfill(3)
df['YZ'] = df['Y'] + df['Z']
print (df)
Y Z YZ
0 01 010 01010
1 03 030 03030
2 05 051 05051
3 07 074 07074
Solution with join (a '-' separator is used here for illustration; ''.join gives the plain five-digit form):
df['Y'] = df['Y'].astype(str).str.zfill(2)
df['Z'] = df['Z'].astype(str).str.zfill(3)
df['YZ'] = df[['Y','Z']].apply('-'.join, axis=1)
print (df)
Y Z YZ
0 01 010 01-010
1 03 030 03-030
2 05 051 05-051
3 07 074 07-074
And without changing the original columns:
df['YZ'] = df['Y'].astype(str).str.zfill(2) + '-' + df['Z'].astype(str).str.zfill(3)
print (df)
Y Z YZ
0 1 10 01-010
1 3 30 03-030
2 5 51 05-051
3 7 74 07-074
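An alternative sketch, assuming both columns stay numeric and Z never exceeds three digits: combine them arithmetically and zero-pad once at the end:
df['YZ'] = (df['Y'] * 1000 + df['Z']).astype(str).str.zfill(5)   # e.g. 1 and 10 -> '01010'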