pandas dataframe index access not raising an exception when out of bounds?

How can the following MWE script work? I actually want the assignment (right before the print) to fail. Instead it changes nothing and raises no exception. This is some of the weirdest behaviour.
import pandas as pd
import numpy as np
l = ['a', 'b']
d = np.array([[False]*len(l)]*3)
df = pd.DataFrame(columns=l, data=d, index=range(1,4))
df["a"][4] = True
print(df)

When you write df["a"][4] = True, you are modifying the Series object that df["a"] returns (which is why printing df['a'] later shows the extra row), but you aren't really modifying the df DataFrame, because df's index does not have an entry 4. I wrote up a snippet of code exhibiting this behavior:
In [90]:
import pandas as pd
import numpy as np
l = ['a', 'b']
d = np.array([[False]*len(l)]*3)
df = pd.DataFrame(columns=l, data=d, index=range(1,4))
df['a'][4] = True
print "DataFrame:"
print df
DataFrame:
a b
1 False False
2 False False
3 False False
In [91]:
df['b'][4] = False
print("DataFrame:")
print(df)
DataFrame:
       a      b
1  False  False
2  False  False
3  False  False
In [92]:
print "DF's Index"
print df.index
DF's Index
Int64Index([1, 2, 3], dtype='int64')
In [93]:
print "Series object a:"
print df['a']
Series object a:
1 False
2 False
3 False
4 True
Name: a, dtype: bool
In [94]:
print "Series object b:"
print df['b']
Series object b:
1 False
2 False
3 False
4 False
Name: b, dtype: bool
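As a small follow-up, here is a minimal sketch (same toy frame as the question, using only the standard .loc/.iloc indexers) of how to make the write either reach the DataFrame or fail loudly:
import numpy as np
import pandas as pd

l = ['a', 'b']
d = np.array([[False] * len(l)] * 3)
df = pd.DataFrame(columns=l, data=d, index=range(1, 4))

# .loc writes through to the DataFrame; a missing label enlarges it
# (a new row 4 appears, with NaN in column 'b') instead of raising.
df.loc[4, 'a'] = True

# Positional access with .iloc does raise for an out-of-bounds position.
try:
    df.iloc[10, 0] = True
except IndexError as err:
    print('IndexError:', err)

# To make label-based assignment fail instead of enlarging, check the index first.
if 5 not in df.index:
    print('label 5 is not in df.index, refusing to assign')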

Related

ValueError: Must be all encoded bytes when reading csv with 0 and 1 in pandas

I am trying to read a CSV with 1s and 0s and convert them to True and False. Because I have a lot of columns, I would like to use the true_values and false_values arguments, but I got
ValueError: Must be all encoded bytes:
from io import StringIO
import numpy as np
import pandas as pd
pd.read_csv(StringIO("""var1, var2
0, 0
0, 1
1, 1
0, 0
0, 1
1, 0"""), true_values=[1],false_values=[0])
I cannot find the problem with the code that I wrote.
You don't need the true_values and false_values parameters. Use dtype instead:
>>> pd.read_csv(StringIO("""var1,var2
0,0
0,1
1,1
0,0
0,1
1,0"""), dtype={'var1': bool, 'var2': bool})
var1 var2
0 False False
1 False True
2 True True
3 False False
4 False True
5 True False
If your columns have the same prefix, use filter:
df = pd.read_csv(StringIO("""..."""))
cols = df.filter(like='var').columns
df[cols] = df[cols].astype(bool)
If your columns are consecutive, use iloc:
df = pd.read_csv(StringIO("""..."""))
cols = df.iloc[:, 0:2].columns
df[cols] = df[cols].astype(bool)
Auto-detection: cast every column whose minimum is 0 and maximum is 1:
m = df.min().eq(0) & df.max().eq(1)
df.loc[:, m] = df.loc[:, m].astype(bool)
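For reference, a minimal end-to-end sketch of the dtype approach run against the question's original input; skipinitialspace=True is my addition here, needed only because the posted CSV has a space after every comma:
from io import StringIO
import pandas as pd

csv = StringIO("""var1, var2
0, 0
0, 1
1, 1
0, 0
0, 1
1, 0""")

# skipinitialspace strips the blank after each comma, so the second column
# is named 'var2' rather than ' var2'.
df = pd.read_csv(csv, skipinitialspace=True, dtype={'var1': bool, 'var2': bool})
print(df.dtypes)
print(df)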

Boolean comparison between a series and a dataframe (element-wise)

Here are the Series and DataFrame to be compared element-wise (AND condition):
import pandas as pd
se = pd.Series(data=[False, True])
df = pd.DataFrame(data=[[True, False], [True, True]],
                  columns=['A', 'B'])
Desired result:
df2 = pd.DataFrame(data=[[False, False], [True, True]],
                   columns=['A', 'B'])
I could achieve that with a slow for loop, but I am sure there is a way to vectorise it.
Many thanks!
Convert the Series to a NumPy array and compare with broadcasting:
print (df & se.to_numpy()[:,None])
A B
0 False False
1 True True
You can use conversion to a NumPy array to benefit from broadcasting:
out = np.logical_and(df, se.to_numpy()[:,None])
output:
A B
0 False False
1 True True
intermediate:
se.to_numpy()[:,None]
array([[False],
       [ True]])
Another possible solution:
(df & np.vstack(se))
Output:
A B
0 False False
1 True True
Or, using multiplication along the index (for boolean data, multiplying is the same as AND):
df.mul(se, 0)
Output:
A B
0 False False
1 True True
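To spell out why the naive df & se gives the wrong answer: the & operator aligns the Series' index with the DataFrame's columns, so nothing lines up. A small sketch (same data as the question) that stays in pandas and aligns column by column instead:
import pandas as pd

se = pd.Series(data=[False, True])
df = pd.DataFrame(data=[[True, False], [True, True]],
                  columns=['A', 'B'])

# Each column shares the row index 0, 1 with se, so col & se aligns row-wise.
out = df.apply(lambda col: col & se)
print(out)
#        A      B
# 0  False  False
# 1   True   True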

Drop pandas column with constant alphanumeric values

I have a dataframe df that contains around 2 million records.
Some of the columns contain only alphanumeric values (e.g. "wer345", "gfer34", "123fdst").
Is there a pythonic way to drop those columns (e.g. using isalnum())?
Apply Series.str.isalnum column-wise to mask all the alphanumeric values of the DataFrame. Then use DataFrame.all to find the columns that only contain alphanumeric values. Invert the resulting boolean Series to select only the columns that contain at least one non-alphanumeric value.
is_alnum_col = df.apply(lambda col: col.str.isalnum()).all()
res = df.loc[:, ~is_alnum_col]
Example
import pandas as pd
df = pd.DataFrame({
    'a': ['aas', 'sd12', '1232'],
    'b': ['sdds', 'nnm!!', 'ab-2'],
    'c': ['sdsd', 'asaas12', '12.34'],
})
is_alnum_col = df.apply(lambda col: col.str.isalnum()).all()
res = df.loc[:, ~is_alnum_col]
Output:
>>> df
a b c
0 aas sdds sdsd
1 sd12 nnm!! asaas12
2 1232 ab-2 12.34
>>> df.apply(lambda col: col.str.isalnum())
a b c
0 True True True
1 True False True
2 True False False
>>> is_alnum_col
a True
b False
c False
dtype: bool
>>> res
b c
0 sdds sdsd
1 nnm!! asaas12
2 ab-2 12.34
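One hedged caveat on the approach above: col.str.isalnum() assumes every column already holds strings. If the real frame also contains non-string columns, casting first keeps the same logic working (note this sketch treats plain numbers as alphanumeric, which may or may not be what you want):
import pandas as pd

df = pd.DataFrame({
    'a': ['aas', 'sd12', '1232'],
    'b': ['sdds', 'nnm!!', 'ab-2'],
    'n': [1, 2, 3],  # a hypothetical numeric column, for illustration
})

# astype(str) avoids the .str accessor failing on the numeric column.
is_alnum_col = df.apply(lambda col: col.astype(str).str.isalnum()).all()
res = df.loc[:, ~is_alnum_col]
print(res)  # keeps only column 'b', the one with non-alphanumeric values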

Pandas: Memory error when using apply to split single column array into columns

I am wondering if anybody has a quick fix for a memory error that appears when doing the same thing as in the example below, but on larger data.
Example:
import pandas as pd
import numpy as np
nRows = 2
nCols = 3
df = pd.DataFrame(index=range(nRows), columns=range(1))
df2 = df.apply(lambda row: [np.random.rand(nCols)], axis=1)
df3 = pd.concat(df2.apply(pd.DataFrame, columns=range(nCols)).tolist())
It is when creating df3 that I get the memory error.
The DataFrames in the example:
df
0
0 NaN
1 NaN
df2
0 [[0.6704675101784022, 0.41730480236712697, 0.5...
1 [[0.14038693859523377, 0.1981014890848788, 0.8...
dtype: object
df3
0 1 2
0 0.670468 0.417305 0.558690
0 0.140387 0.198101 0.800745
First, I think working with lists inside pandas columns is not a good idea; avoid it if possible.
So I believe you can simplify your code a lot:
nRows = 2
nCols = 3
np.random.seed(2019)
df3 = pd.DataFrame(np.random.rand(nRows, nCols))
print (df3)
0 1 2
0 0.903482 0.393081 0.623970
1 0.637877 0.880499 0.299172
Here's an example with a solution to the problem (note that in this example the column holds arrays rather than lists; I cannot avoid that, since my original problem comes with lists or arrays in a column).
import pandas as pd
import numpy as np
import time
np.random.seed(1)
nRows = 25000
nCols = 10000
numberOfChunks = 5
df = pd.DataFrame(index=range(nRows), columns=range(1))
df2 = df.apply(lambda row: np.random.rand(nCols), axis=1)
for start, stop in zip(np.arange(0, nRows, int(round(nRows / float(numberOfChunks)))),
                       np.arange(int(round(nRows / float(numberOfChunks))),
                                 nRows + int(round(nRows / float(numberOfChunks))),
                                 int(round(nRows / float(numberOfChunks))))):
    df2tmp = df2.iloc[start:stop]
    if start == 0:
        df3 = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
        continue
    df3tmp = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
    df3 = pd.concat([df3, df3tmp])
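A hedged, more compact variant of the same chunked conversion, assuming df2 is (as above) a Series whose values are equal-length 1-D arrays; np.array_split picks the chunk boundaries, so the boundary arithmetic disappears:
import numpy as np
import pandas as pd

chunks = []
for labels in np.array_split(df2.index, numberOfChunks):
    part = df2.loc[labels]
    # materialise only one chunk of rows at a time, in float16 as above
    chunks.append(pd.DataFrame(part.tolist(), index=part.index).astype('float16'))
df3 = pd.concat(chunks)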

pandas quantile comparison: indexes not aligned [duplicate]

How can I perform comparisons between DataFrames and Series? I'd like to mask elements in a DataFrame/Series that are greater/less than elements in another DataFrame/Series.
For instance, the following doesn't replace elements greater than the mean with NaNs, although I was expecting it to:
>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x[x > x.mean(axis=1)] = np.nan
>>> x
a b
0 1 3
1 2 4
If we look at the boolean array created by the comparison, it is really weird:
>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x > x.mean(axis=1)
       a      b      0      1
0  False  False  False  False
1  False  False  False  False
I don't understand by what logic the resulting boolean array is like that. I'm able to work around this problem by using transpose:
>>> (x.T > x.mean(axis=1).T).T
a b
0 False True
1 False True
But I believe there is some "correct" way of doing this that I'm not aware of. And at least I'd like to understand what is going on.
The problem here is that the Series' index is being aligned against the DataFrame's columns to perform the comparison. If you use .gt and pass axis=0, then you get the result you desire:
In [203]:
x.gt(x.mean(axis=1), axis=0)
Out[203]:
a b
0 False True
1 False True
You can see what I mean when you perform the comparison against the underlying NumPy array:
In [205]:
x > x.mean(axis=1).values
Out[205]:
a b
0 False False
1 False True
Here you can see that with a plain array there is no index alignment; the values broadcast along the columns, which is why the result is different.
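Putting that to work on the original goal of masking values above each row's mean, a short sketch based on the question's frame:
import numpy as np
import pandas as pd

x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})

# gt(..., axis=0) aligns the row means with the rows, so the mask is correct.
x[x.gt(x.mean(axis=1), axis=0)] = np.nan
print(x)
#    a   b
# 0  1 NaN
# 1  2 NaN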