I am trying to use re.sub() on my data, but it keeps raising a TypeError:
TypeError: expected string or bytes-like object
This (example) is the data that I'm using:
I was trying to do:
import re
example_sub = re.sub('\n', ' ', example)
example_sub
I tried to resolve it by removing the index using reset_index(), but it didn't work.
What should I do?
Thank you!
You can use pandas.Series.str.replace:
>>> import pandas as pd
>>> df = pd.DataFrame({"a": ["a\na", "b\nb", "c\nc\nc\nc\n"]})
>>> df.a.str.replace("\n", " ")
0 a a
1 b b
2 c c c c
Name: a, dtype: object
For more complex substitutions, you can use a regex pattern:
>>> import re
>>> import pandas as pd
>>> df = pd.DataFrame({"a": ["a\na", "b\nb", "c\nc\nc\nc\n"]})
>>> pattern = re.compile(r"\n")
>>> df.a.str.replace(pattern, " ", regex=True)
0 a a
1 b b
2 c c c c
Name: a, dtype: object
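The original TypeError most likely arises because re.sub() expects a single string, while here it receives a whole Series. As a minimal sketch (the Series `s` below is made-up sample data, not the asker's), re.sub() can also be applied to each element with Series.map:

```python
import re

import pandas as pd

# Made-up sample data standing in for the asker's column.
s = pd.Series(["a\na", "b\nb", "c\nc"])

# Passing the Series itself reproduces the error from the question.
try:
    re.sub("\n", " ", s)
    error = None
except TypeError as exc:
    error = str(exc)

# Applying re.sub to each string individually works.
result = s.map(lambda text: re.sub("\n", " ", text))
print(result.tolist())
```

For plain replacements like this one, str.replace is usually the simpler choice; map with re.sub is mainly useful when the per-element logic is more involved.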
I have a pandas column which has special characters such as {{,}},[,],,. (commas are separators).
I tried using the following to replace the special characters with an underscore ('_'), but it is not working. Can you please let me know what I am doing wrong? Thanks.
import pandas as pd
data = [["facebook_{{campaign.name}}"], ["google_[email]"]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Marketing'])
print(df)
df['Marketing'].str.replace(r"\(|\)|\{|\}|\[|\]|\|", "_")
print(df)
Output:
Marketing
0 facebook_{{campaign.name}}
1 google_[email]
Marketing
0 facebook_{{campaign.name}}
1 google_[email]
Starting from this DataFrame:
>>> import pandas as pd
>>> data = [["facebook_{{campaign.name}}"], ["google_[email]"]]
>>> df = pd.DataFrame(data, columns = ['Marketing'])
>>> df
Marketing
0 facebook_{{campaign.name}}
1 google_[email]
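We can use str.replace with a regex, as you suggested. In the pattern, each | is the "or" operator, except for the escaped \|, which matches the literal symbol |. Note that str.replace returns a new Series rather than modifying the column in place, so the result has to be assigned back to df['Marketing'].
Then we collapse repeated _ into a single _ and remove the trailing _ to get the expected result: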
>>> df['Marketing'] = df['Marketing'].str.replace(r"\(+|\)+|\{+|\}+|\[+|\]+|\|+|\_+|\.+", "_", regex=True).str.replace(r"_+", "_", regex=True).str.replace(r"_$", "", regex=True)
>>> df['Marketing']
0 facebook_campaign_name
1 google_email
Name: Marketing, dtype: object
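The same cleanup can probably be written more compactly with a single character class. This is a sketch under the assumption that every run of the listed characters should become one underscore and that trailing underscores should be dropped; the name `cleaned` is just for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"Marketing": ["facebook_{{campaign.name}}", "google_[email]"]}
)

# One character class covers all the special characters; the trailing +
# collapses each run into a single underscore in the same pass.
cleaned = (
    df["Marketing"]
    .str.replace(r"[(){}\[\]|_.]+", "_", regex=True)
    .str.rstrip("_")  # drop any underscore left at the end
)

# str.replace returns a new Series, so assign it back to keep the result.
df["Marketing"] = cleaned
print(df)
```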
I am wondering if anybody has a quick fix for a memory error that appears when doing the same thing as in the below example on larger data?
Example:
import pandas as pd
import numpy as np
nRows = 2
nCols = 3
df = pd.DataFrame(index=range(nRows ), columns=range(1))
df2 = df.apply(lambda row: [np.random.rand(nCols)], axis=1)
df3 = pd.concat(df2.apply(pd.DataFrame, columns=range(nCols)).tolist())
The memory error occurs when creating df3.
The DataFrames in the example:
df
0
0 NaN
1 NaN
df2
0 [[0.6704675101784022, 0.41730480236712697, 0.5...
1 [[0.14038693859523377, 0.1981014890848788, 0.8...
dtype: object
df3
0 1 2
0 0.670468 0.417305 0.558690
0 0.140387 0.198101 0.800745
First, I think storing lists in pandas objects is not a good idea; avoid it if you can.
If so, I believe you can simplify your code a lot:
nRows = 2
nCols = 3
np.random.seed(2019)
df3 = pd.DataFrame(np.random.rand(nRows, nCols))
print (df3)
0 1 2
0 0.903482 0.393081 0.623970
1 0.637877 0.880499 0.299172
Here's an example with a solution to the problem. Note that in this example the column holds arrays rather than lists; I cannot avoid this, since my original data comes with lists or arrays in a column.
import pandas as pd
import numpy as np
import time
np.random.seed(1)
nRows = 25000
nCols = 10000
numberOfChunks = 5
df = pd.DataFrame(index=range(nRows ), columns=range(1))
df2 = df.apply(lambda row: np.random.rand(nCols), axis=1)
# Number of rows per chunk; converting chunk by chunk keeps the intermediates small.
chunkSize = int(round(nRows / float(numberOfChunks)))
df3 = None
for start in range(0, nRows, chunkSize):
    df2tmp = df2.iloc[start:start + chunkSize]
    # Expand this chunk's arrays into columns and downcast to float16.
    df3tmp = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
    df3 = df3tmp if df3 is None else pd.concat([df3, df3tmp])
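If all the arrays in the column have the same length, another option might be to preallocate the float16 target once and fill it row by row, which avoids both the intermediate lists and the repeated concatenation. This is only a sketch: the sizes are deliberately tiny, and the names `s`, `out`, `n_rows` and `n_cols` are placeholders rather than part of the original code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_rows, n_cols = 6, 4  # tiny sizes for illustration only

# A Series whose elements are equal-length arrays, as in the problem.
s = pd.Series([rng.random(n_cols) for _ in range(n_rows)])

# Preallocate the result in the target dtype, then copy one row at a time.
out = np.empty((n_rows, n_cols), dtype="float16")
for i, row in enumerate(s):
    out[i] = row  # each row is downcast to float16 on assignment

df3 = pd.DataFrame(out, index=s.index)
print(df3.dtypes.unique())
```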
Something weird happens when I try to assign a list containing the missing value np.nan to a pandas Series.
Below is the code to reproduce it.
import numpy as np
import pandas as pd
S = pd.Series(0, index = list('ABCDE'))
>>> S
A 0
B 0
C 0
D 0
E 0
dtype: int64
ind = [True, False, True, False, True]
x = [1, np.nan, 2]
>>> S[ind]
A 0
C 0
E 0
dtype: int64
Assign x to S[ind]
S[ind] = x
Something weird happens in S:
>>> S
A 1
B 0
C 2
D 0
E NaN
dtype: float64
I am expecting S to be
>>> S
A 1
B 0
C NaN
D 0
E 2
dtype: float64
Can anyone explain this?
You can try this:
S[S[ind].index] = x
or
S[S.index[ind]] = x
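The label-based workaround above can be sketched end to end as follows. Two things here are assumptions on my part rather than part of the answer: the Series is created as float from the start, so that NaN is representable without a dtype change, and the assignment goes through .loc with explicit labels.

```python
import numpy as np
import pandas as pd

# Start with a float Series (assumption) so NaN fits without upcasting.
S = pd.Series(0.0, index=list("ABCDE"))

mask = np.array([True, False, True, False, True])
x = [1, np.nan, 2]

# Turn the boolean mask into labels, then assign by label.
labels = S.index[mask]
S.loc[labels] = x
print(S)
```

With explicit labels, each value in x is paired with the label at the same position, so C receives the NaN and E receives 2, which is the result the question expected.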
How can the following MWE script work? I actually want the assignment (right before the print) to fail. Instead it changes nothing and raises no exception. This is some of the weirdest behaviour.
import pandas as pd
import numpy as np
l = ['a', 'b']
d = np.array([[False]*len(l)]*3)
df = pd.DataFrame(columns=l, data=d, index=range(1,4))
df["a"][4] = True
print(df)
When you write df["a"][4] = True, you are modifying the Series object a, not the DataFrame df, because df's index has no entry 4. Here is a snippet of code exhibiting this behavior:
In [90]:
import pandas as pd
import numpy as np
l = ['a', 'b']
d = np.array([[False]*len(l)]*3)
df = pd.DataFrame(columns=l, data=d, index=range(1,4))
df['a'][4] = True
print("DataFrame:")
print(df)
DataFrame:
a b
1 False False
2 False False
3 False False
In [91]:
df['b'][4]=False
print("DataFrame:")
print(df)
DataFrame:
a b
1 False False
2 False False
3 False False
In [92]:
print("DF's Index")
print(df.index)
DF's Index
Int64Index([1, 2, 3], dtype='int64')
In [93]:
print("Series object a:")
print(df['a'])
Series object a:
1 False
2 False
3 False
4 True
Name: a, dtype: bool
In [94]:
print("Series object b:")
print(df['b'])
Series object b:
1 False
2 False
3 False
4 False
Name: b, dtype: bool
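If the goal is for the assignment to fail when the row label is missing, one option is to check the labels explicitly before assigning. The helper `set_existing` below is hypothetical (it is not a pandas function); it is a minimal sketch that relies only on membership tests and .loc:

```python
import numpy as np
import pandas as pd


def set_existing(frame, row, col, value):
    """Hypothetical helper: assign only if both labels already exist."""
    if row not in frame.index:
        raise KeyError(f"row label {row!r} is not in the index")
    if col not in frame.columns:
        raise KeyError(f"column label {col!r} is not in the columns")
    # A single .loc call avoids chained indexing such as df["a"][4].
    frame.loc[row, col] = value


df = pd.DataFrame(
    np.zeros((3, 2), dtype=bool), columns=["a", "b"], index=range(1, 4)
)

set_existing(df, 2, "a", True)  # label 2 exists, so this assigns

try:
    set_existing(df, 4, "a", True)  # label 4 is missing, so this raises
    failed = False
except KeyError:
    failed = True

print(df)
print("assignment to missing label failed:", failed)
```

The explicit check is needed because a plain df.loc[4, "a"] = True would not fail either: .loc enlarges the DataFrame with a new row when the label does not exist.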
The shape of the two arrays x and y is (a,b). How do I get a combined array of shape (a,b,2)?
My current solution is
z = np.zeros((a,b,2))
z[:,:,0] = x
z[:,:,1] = y
Is it possible to achieve this without creating a new array?
You can use np.dstack, which saves the manual slicing. Note that it still allocates a new array for the result:
In [2]: import numpy as np
In [3]: a = np.random.normal(size=(4,6))
In [4]: b = np.random.normal(size=(4,6))
In [5]: np.dstack((a,b)).shape
Out[5]: (4, 6, 2)
And a comparison:
In [10]: d = np.dstack((a,b))
In [11]: c = np.zeros((4,6,2))
In [12]: c[:,:,0] = a
In [13]: c[:,:,1] = b
In [14]: np.allclose(c,d)
Out[14]: True
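An equivalent spelling is np.stack with axis=-1, which makes the position of the new axis explicit. The sketch below uses made-up random inputs named `x` and `y` after the question; like np.dstack, it returns a newly allocated array.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))
y = rng.normal(size=(4, 6))

# Stack along a new last axis: shape (a, b) + (a, b) -> (a, b, 2).
z = np.stack((x, y), axis=-1)
print(z.shape)
```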