Assign a list with a missing value to a Pandas Series in Python - pandas

Something weird happened when I tried to assign a list containing a missing value (np.nan) to a Pandas Series.
Below is the code to reproduce it.
import numpy as np
import pandas as pd
S = pd.Series(0, index = list('ABCDE'))
>>> S
A 0
B 0
C 0
D 0
E 0
dtype: int64
ind = [True, False, True, False, True]
x = [1, np.nan, 2]
>>> S[ind]
A 0
C 0
E 0
dtype: int64
Assign x to S[ind]
S[ind] = x
The result in S looks weird:
>>> S
A 1
B 0
C 2
D 0
E NaN
dtype: float64
I am expecting S to be
>>> S
A 1
B 0
C NaN
D 0
E 2
dtype: float64
Can anyone explain this?

You can try this:
S[S[ind].index] = x
or
S[S.index[ind]] = x
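As a minimal sketch of the label-based workaround above: the boolean list is first turned into index labels, and the values are then assigned onto those labels in order. The Series is created with a float dtype here (an assumption of this sketch, not part of the question) so that NaN fits without an upcast; how the plain boolean-list assignment behaves may depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Float dtype from the start (an assumption of this sketch) so NaN fits without upcasting.
S = pd.Series(0.0, index=list("ABCDE"))
ind = [True, False, True, False, True]
x = [1, np.nan, 2]

# Turn the boolean list into the labels it selects: A, C, E.
labels = S.index[ind]

# Assign by label; the values of x are placed in order onto A, C, E.
S.loc[labels] = x
print(S)
```

With this form, the NaN lands on C and the 2 on E, which is the result the question expected.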

Related

How can I delete the index from the data?

I was trying to use the re.sub() on my data, but it keeps showing the TypeError.
(TypeError: expected string or bytes-like object).
This (example) is the data that I'm using:
I was trying to do:
import re
example_sub = re.sub('\n', ' ', example)
example_sub
I tried to resolve it by removing the index using reset_index(), but it didn't work.
What should I do?
Thank you!
You can use pandas.Series.str.replace:
>>> import pandas as pd
>>> df = pd.DataFrame({"a": ["a\na", "b\nb", "c\nc\nc\nc\n"]})
>>> df.a.str.replace("\n", " ")
0 a a
1 b b
2 c c c c
Name: a, dtype: object
For more complex substitutions, you can use a regex pattern:
>>> import re
>>> import pandas as pd
>>> df = pd.DataFrame({"a": ["a\na", "b\nb", "c\nc\nc\nc\n"]})
>>> pattern = re.compile(r"\n")
>>> df.a.str.replace(pattern, " ", regex=True)
0 a a
1 b b
2 c c c c
Name: a, dtype: object
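To connect this back to the error in the question, here is a small sketch. The Series named example is a hypothetical stand-in for the asker's data, which is not shown. re.sub expects a single string, so passing a whole Series raises the TypeError, while the .str accessor applies the replacement element by element.

```python
import re
import pandas as pd

# Hypothetical stand-in for the asker's data (the original data is not shown).
example = pd.Series(["a\na", "b\nb"])

# re.sub works on one string at a time, so a Series is rejected.
error_message = None
try:
    re.sub("\n", " ", example)
except TypeError as exc:
    error_message = str(exc)

# The .str accessor applies the replacement to every element.
cleaned = example.str.replace("\n", " ", regex=False)
print(error_message)
print(cleaned)
```

The index is not the cause of the error, which is why reset_index() did not help.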

Pandas: Memory error when using apply to split single column array into columns

I am wondering if anybody has a quick fix for a memory error that appears when doing the same thing as in the below example on larger data?
Example:
import pandas as pd
import numpy as np
nRows = 2
nCols = 3
df = pd.DataFrame(index=range(nRows ), columns=range(1))
df2 = df.apply(lambda row: [np.random.rand(nCols)], axis=1)
df3 = pd.concat(df2.apply(pd.DataFrame, columns=range(nCols)).tolist())
The memory error occurs when creating df3.
The DataFrames in the example:
df
0
0 NaN
1 NaN
df2
0 [[0.6704675101784022, 0.41730480236712697, 0.5...
1 [[0.14038693859523377, 0.1981014890848788, 0.8...
dtype: object
df3
0 1 2
0 0.670468 0.417305 0.558690
0 0.140387 0.198101 0.800745
First, I think working with lists in pandas is not a good idea; avoid it if possible.
I believe you can simplify your code a lot:
nRows = 2
nCols = 3
np.random.seed(2019)
df3 = pd.DataFrame(np.random.rand(nRows, nCols))
print (df3)
0 1 2
0 0.903482 0.393081 0.623970
1 0.637877 0.880499 0.299172
Here's an example with a solution to the problem. Note that in this example the column holds arrays rather than lists. I cannot avoid this, since my original data comes with lists or arrays in a column.
import pandas as pd
import numpy as np
import time
np.random.seed(1)
nRows = 25000
nCols = 10000
numberOfChunks = 5
df = pd.DataFrame(index=range(nRows ), columns=range(1))
df2 = df.apply(lambda row: np.random.rand(nCols), axis=1)
chunkSize = int(round(nRows / float(numberOfChunks)))
for start in range(0, nRows, chunkSize):
    df2tmp = df2.iloc[start:start + chunkSize]
    df3tmp = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
    if start == 0:
        df3 = df3tmp
        continue
    df3 = pd.concat([df3, df3tmp])
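A different way to limit peak memory, shown here only as a sketch and not as the method used above, is to preallocate the output array once and fill it row by row, so no intermediate list of lists and no repeated concatenation is needed. The names col, out and wide, the small sizes, and the float32 dtype are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_rows, n_cols = 6, 4  # tiny sizes for illustration only

# A Series whose elements are 1-D arrays, similar in shape to df2 above.
col = pd.Series([rng.random(n_cols) for _ in range(n_rows)])

# Preallocate the target once, in a smaller dtype, and copy each row in.
out = np.empty((n_rows, n_cols), dtype="float32")
for i, arr in enumerate(col):
    out[i] = arr

wide = pd.DataFrame(out, index=col.index)
print(wide.shape, wide.dtypes.unique())
```

Whether this is enough for the full-size data depends on available memory; the target array alone needs n_rows * n_cols * 4 bytes in float32.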

How to build column by column dataframe pandas

I have a dataframe looking like this example
A | B | C
__|___|___
s s nan
nan x x
I would like to create a table of intersections between columns like this
| A | B | C
__|______|____|______
A | True |True| False
__|______|____|______
B | True |True|True
__|______|____|______
C | False|True|True
__|______|____|______
Is there an elegant loop-free way to do it?
Thank you!
Setup
df = pd.DataFrame(dict(A=['s', np.nan], B=['s', 'x'], C=[np.nan, 'x']))
Option 1
You can use numpy broadcasting to compare each column with every other column, then determine whether any of the comparisons is True.
v = df.values
pd.DataFrame(
(v[:, :, None] == v[:, None]).any(0),
df.columns, df.columns
)
A B C
A True True False
B True True True
C False True True
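To make the broadcasting step easier to follow, the sketch below spells out the intermediate shapes. The names left, right, pairwise and overlap are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=['s', np.nan], B=['s', 'x'], C=[np.nan, 'x']))
v = df.to_numpy(dtype=object)   # shape (rows, cols) = (2, 3)

left = v[:, :, None]            # shape (2, 3, 1): each value down the "column i" axis
right = v[:, None]              # shape (2, 1, 3): each value along the "column j" axis

# Broadcasting compares column i with column j within every row.
pairwise = left == right        # shape (2, 3, 3)

# Reduce over the row axis: True if columns i and j agree in at least one row.
overlap = pd.DataFrame(pairwise.any(axis=0), index=df.columns, columns=df.columns)
print(pairwise.shape)
print(overlap)
```

Note that the intermediate array has rows * cols * cols elements, so this approach can become memory-hungry for wide frames.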
By replacing any with sum you can get a count of how many intersections there are.
v = df.values
pd.DataFrame(
(v[:, :, None] == v[:, None]).sum(0),
df.columns, df.columns
)
A B C
A 1 1 0
B 1 2 1
C 0 1 1
Or use np.count_nonzero instead of sum
v = df.values
pd.DataFrame(
np.count_nonzero(v[:, :, None] == v[:, None], 0),
df.columns, df.columns
)
A B C
A 1 1 0
B 1 2 1
C 0 1 1
Option 2
Fun & Creative way
d = pd.get_dummies(df.stack(), dtype=int).unstack(fill_value=0)
d = d.T.dot(d)
d.groupby(level=1).sum().T.groupby(level=1).sum().T
A B C
A 1 1 0
B 1 2 1
C 0 1 1

Creating a logical pandas Series by comparing two Series

In pandas I'm trying to get two series combined to one logical one
f = pd.Series(['a','b','c','d','e'])
x = pd.Series(['a','c'])
As a result I would like to have the series
[1, 0, 1, 0, 0]
I tried
f.map(lambda e: e in x)
Series f is large (30000) so looping over the elements (with map) is probably not very efficient. What would be a good approach?
Use isin:
In [207]:
f = pd.Series(['a','b','c','d','e'])
x = pd.Series(['a','c'])
f.isin(x)
Out[207]:
0 True
1 False
2 True
3 False
4 False
dtype: bool
You can convert the dtype using astype if you prefer:
In [208]:
f.isin(x).astype(int)
Out[208]:
0 1
1 0
2 1
3 0
4 0
dtype: int32
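One more point about the map attempt in the question: the in operator on a Series tests the index labels, not the values, so e in x does not check membership in the values of x. The short sketch below contrasts the two; the variable names are introduced here for illustration.

```python
import pandas as pd

f = pd.Series(['a', 'b', 'c', 'd', 'e'])
x = pd.Series(['a', 'c'])

# `in` on a Series looks at the index (0, 1), not at the values.
label_check = 'a' in x            # False: 'a' is not an index label
index_check = 0 in x              # True: 0 is an index label
value_check = 'a' in x.tolist()   # True: membership in the values

# So the map from the question yields False everywhere ...
via_map = f.map(lambda e: e in x)

# ... while isin compares against the values and is vectorized.
via_isin = f.isin(x).astype(int)
print(via_map.tolist())
print(via_isin.tolist())
```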

How do you filter out rows with NaN in a pandas DataFrame

I have a few entries in a pandas DataFrame that are NaN. How would I remove any row with a NaN?
Just use dropna() on the DataFrame:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame(np.random.randn(5, 2))
In [4]: df.iloc[0, 1] = np.nan
In [5]: df.iloc[4, 0] = np.nan
In [6]: print(df)
0 1
0 2.264727 NaN
1 0.229321 1.615272
2 -0.901608 -1.407787
3 -0.198323 0.521726
4 NaN 0.692340
In [7]: df2 = df.dropna()
In [8]: print(df2)
0 1
1 0.229321 1.615272
2 -0.901608 -1.407787
3 -0.198323 0.521726
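As a small sketch of the variations on dropna(), using a hypothetical frame with columns named a and b: the how and subset parameters control which rows count as missing, and a boolean mask built with notna() gives the same result as the default call.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 8.0],
})

any_nan_dropped = df.dropna()              # drop rows with at least one NaN
all_nan_dropped = df.dropna(how="all")     # drop rows only if every value is NaN
subset_dropped = df.dropna(subset=["a"])   # only look at column "a"

# Equivalent to the default dropna(), written as an explicit mask.
mask_based = df[df.notna().all(axis=1)]

print(any_nan_dropped.index.tolist())
print(all_nan_dropped.index.tolist())
print(subset_dropped.index.tolist())
```

dropna() returns a new DataFrame and leaves df unchanged, so assign the result as in the transcript above.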