Replace pandas column special characters - pandas

I have a pandas column that contains special characters such as {{, }}, [, ], and . (the commas here are separators).
I tried using the following to replace the special characters with an underscore ('_'), but it is not working. Can you please let me know what I am doing wrong? Thanks.
import pandas as pd
data = [["facebook_{{campaign.name}}"], ["google_[email]"]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Marketing'])
print(df)
df['Marketing'].str.replace(r"\(|\)|\{|\}|\[|\]|\|", "_")
print(df)
Output:
Marketing
0 facebook_{{campaign.name}}
1 google_[email]
Marketing
0 facebook_{{campaign.name}}
1 google_[email]

From this DataFrame :
>>> import pandas as pd
>>> data = [["facebook_{{campaign.name}}"], ["google_[email]"]]
>>> df = pd.DataFrame(data, columns = ['Marketing'])
>>> df
Marketing
0 facebook_{{campaign.name}}
1 google_[email]
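The call in the question has no visible effect because str.replace returns a new Series instead of modifying the column in place, so the result has to be assigned back to df['Marketing']. In recent pandas versions the pattern is also only treated as a regular expression when regex=True is passed. Inside the regex, each | is the "or" operator, except for the escaped \| near the end, which matches a literal | character; \_+ and \.+ are added so that existing underscores and dots are replaced as well.
Then we collapse each run of _ into a single _ and strip the trailing _ to get the expected result: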
>>> df['Marketing'] = df['Marketing'].str.replace(r"\(+|\)+|\{+|\}+|\[+|\]+|\|+|\_+|\.+", "_", regex=True).str.replace(r"_+", "_", regex=True).str.replace(r"_$", "", regex=True)
>>> df['Marketing']
0 facebook_campaign_name
1 google_email
Name: Marketing, dtype: object
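The three chained replacements above can probably be shortened with a character class. The following is a minimal sketch under that assumption, not the answer's original method: one str.replace turns every run of the listed characters into a single underscore, and str.strip removes underscores left at either end (note that this also strips leading underscores, which the original chain does not).

```python
import pandas as pd

df = pd.DataFrame(
    {"Marketing": ["facebook_{{campaign.name}}", "google_[email]"]}
)

# One character class covers ( ) { } [ ] | . and _ ; the trailing +
# collapses each run of such characters into a single underscore.
cleaned = (
    df["Marketing"]
    .str.replace(r"[(){}\[\]|._]+", "_", regex=True)
    .str.strip("_")  # drop underscores left at the start or end
)

# Assign back, since str.replace returns a new Series.
df["Marketing"] = cleaned
print(df)
```

Inside a character class only [ and ] need escaping here, which keeps the pattern easier to read than a long chain of alternatives.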

Related

Pandas: Getting indices (numeric position) from external array for each value in Column

I have a fixed array of values, ['string1', 'string2', 'string3'], and a pandas DataFrame:
>>> pd.DataFrame({'column': ['string1', 'string1', 'string2']})
column
0 string1
1 string1
2 string2
I want to add a new column holding the position of each value in that array, so that it becomes:
>>> pd.DataFrame({'column': ['string1', 'string1', 'string2', pd.NA], 'indices': [0,0,1, pd.NA]})
column indices
0 string1 0
1 string1 0
2 string2 1
3 <NA> <NA>
That is, the position of the value in the main array. This will later be fed into pyarrow's DictionaryArray [1]. The DataFrame can contain null values as well.
Is there a fast way to do this? I have been trying to figure out how to vectorize it. Naive implementation:
def create_dictionary_array_indices(column_name, arrow_array):
    global dictionary_values
    values = arrow_array.to_pylist()
    indices = []
    for i, value in enumerate(values):
        # Treat empty values and NaN (value != value) as null.
        if not value or value != value:
            indices.append(None)
        else:
            indices.append(
                dictionary_values[column_name].index(value)
            )
    indices = pd.array(indices, dtype=pd.Int32Dtype())
    return pa.DictionaryArray.from_arrays(indices, dictionary_values[column_name])
[1] https://lists.apache.org/thread/xkpyb3zboksbhmyqzzkj983y6l0t9bjs
Given your two dataframes:
import pandas as pd
df1 = pd.DataFrame({"column": ["string1", "string1", "string2"]})
df2 = pd.DataFrame({"column": ["string1", "string1", "string2", pd.NA]})
Here is one way to do it:
df1 = df1.drop_duplicates(keep="first").reset_index(drop=True)
indices = {value: key for key, value in df1["column"].items()}
df2["indices"] = df2["column"].apply(lambda x: indices.get(x, pd.NA))
print(df2)
# Output
column indices
0 string1 0
1 string1 0
2 string2 1
3 <NA> <NA>
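Since the question asks for a vectorized approach, here is a minimal sketch of one alternative, assuming the fixed array is available as a plain Python list (named dictionary here purely for illustration). pd.Categorical computes integer codes for all rows at once and uses -1 for values that are missing or not in the list; those are then masked to <NA>.

```python
import pandas as pd

# Hypothetical name for the fixed array from the question.
dictionary = ["string1", "string2", "string3"]
df = pd.DataFrame({"column": ["string1", "string1", "string2", pd.NA]})

# codes[i] is the position of df["column"][i] in `dictionary`,
# or -1 when the value is null or not present in the list.
codes = pd.Categorical(df["column"], categories=dictionary).codes

# Convert to a nullable integer column and turn -1 into <NA>.
indices = pd.Series(codes, index=df.index).astype("Int32")
df["indices"] = indices.mask(codes == -1)
print(df)
```

This avoids the per-row Python call made by apply, which may matter on large frames.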

Drop pandas column with constant alphanumeric values

I have a dataframe df that contains around 2 million records.
Some of the columns contain only alphanumeric values (e.g. "wer345", "gfer34", "123fdst").
Is there a pythonic way to drop those columns (e.g. using isalnum())?
Apply Series.str.isalnum column-wise to mask all the alphanumeric values of the DataFrame. Then use DataFrame.all to find the columns that only contain alphanumeric values. Invert the resulting boolean Series to select only the columns that contain at least one non-alphanumeric value.
is_alnum_col = df.apply(lambda col: col.str.isalnum()).all()
res = df.loc[:, ~is_alnum_col]
Example
import pandas as pd
df = pd.DataFrame({
'a': ['aas', 'sd12', '1232'],
'b': ['sdds', 'nnm!!', 'ab-2'],
'c': ['sdsd', 'asaas12', '12.34'],
})
is_alnum_col = df.apply(lambda col: col.str.isalnum()).all()
res = df.loc[:, ~is_alnum_col]
Output:
>>> df
a b c
0 aas sdds sdsd
1 sd12 nnm!! asaas12
2 1232 ab-2 12.34
>>> df.apply(lambda col: col.str.isalnum())
a b c
0 True True True
1 True False True
2 True False False
>>> is_alnum_col
a True
b False
c False
dtype: bool
>>> res
b c
0 sdds sdsd
1 nnm!! asaas12
2 ab-2 12.34
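The snippet above assumes every column holds strings; the .str accessor raises an AttributeError on numeric columns. The following is a minimal sketch for a frame with mixed dtypes, under the assumption that non-string columns should simply be kept. The helper name is_alnum_only and the extra column n are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

df = pd.DataFrame({
    "a": ["aas", "sd12", "1232"],
    "b": ["sdds", "nnm!!", "ab-2"],
    "n": [1, 2, 3],  # numeric column, added for illustration
})

def is_alnum_only(col):
    """Return True only for string columns whose values are all alphanumeric."""
    if not is_string_dtype(col):
        return False  # keep non-string columns untouched
    return bool(col.str.isalnum().all())

drop_cols = [name for name in df.columns if is_alnum_only(df[name])]
res = df.drop(columns=drop_cols)
print(res)
```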

How can I delete the index from the data?

I was trying to use the re.sub() on my data, but it keeps showing the TypeError.
(TypeError: expected string or bytes-like object).
This (example) is the data that I'm using:
I was trying to do:
import re
example_sub = re.sub('\n', ' ', example)
example_sub
I tried to resolve it by removing the index using reset_index(), but it didn't work.
What should I do?
Thank you!
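re.sub expects a single string, so it raises that TypeError when it is given a pandas Series or DataFrame; the index is not the cause. You can use pandas.Series.str.replace instead, which applies the substitution to every element: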
>>> import pandas as pd
>>> df = pd.DataFrame({"a": ["a\na", "b\nb", "c\nc\nc\nc\n"]})
>>> df.a.str.replace("\n", " ")
0 a a
1 b b
2 c c c c
Name: a, dtype: object
For more complex substitutions, you can use a regex pattern:
>>> import re
>>> import pandas as pd
>>> df = pd.DataFrame({"a": ["a\na", "b\nb", "c\nc\nc\nc\n"]})
>>> pattern = re.compile(r"\n")
>>> df.a.str.replace(pattern, " ", regex=True)
0 a a
1 b b
2 c c c c
Name: a, dtype: object
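To make the cause of the error concrete, here is a minimal sketch that assumes example is a pandas Series of strings (the original data is not shown, so this is a guess). It reproduces the TypeError and then shows two element-wise alternatives.

```python
import re
import pandas as pd

# Assumed stand-in for the `example` data from the question.
example = pd.Series(["a\na", "b\nb"])

# re.sub wants one string, so passing the whole Series fails.
try:
    re.sub("\n", " ", example)
    raised = False
except TypeError:
    raised = True

# Option 1: call re.sub once per element.
via_apply = example.apply(lambda s: re.sub(r"\n", " ", s))

# Option 2: use the vectorized string method (literal match).
via_str = example.str.replace("\n", " ", regex=False)
print(via_str)
```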

Pandas how to append data from df into df2

The code below loops over a list of tickers and prints a table of data scraped from a website for each one.
How can I get the data from this output table into a new, organized table?
I need to get the 'Códigos de Negociação' and the 'CNPJ' into this new table.
This is a sample of the scraped table
0 1
0 Nome de Pregão FII ALIANZA
1 Códigos de Negociação ALZR11
2 CNPJ 28.737.771/0001-85
3 Classificação Setorial Financeiro e Outros/Fundos/Fundos Imobiliários
4 Site www.btgpactual.com
This is the code
import pandas as pd
list = pd.read_html('http://bvmf.bmfbovespa.com.br/Fundos-Listados/FundosListados.aspx?tipoFundo=imobiliario&Idioma=pt-br')[0]
Tickers = list['Código'].tolist()
removechars = str.maketrans('', '', './-')
for i in Tickers:
    try:
        df = pd.read_html("http://bvmf.bmfbovespa.com.br/Fundos-Listados/FundosListadosDetalhe.aspx?Sigla="+i+"&tipoFundo=Imobiliario&aba=abaPrincipal&idioma=pt-br")[0]
        print(df)
    except:
        print('y')
I would also like to apply removechars to the CNPJ, to strip the dots, slashes, and dashes from it.
Expected result:
Código CNPJ
0 ALZR11 28737771000185
This code worked for me:
import pandas as pd

# Table listing every fund; the 'Código' column holds the tickers.
funds = pd.read_html('http://bvmf.bmfbovespa.com.br/Fundos-Listados/FundosListados.aspx?tipoFundo=imobiliario&Idioma=pt-br')[0]
tickers = funds['Código'].tolist()
removechars = str.maketrans('', '', './-')
codigo = []
cnpj = []
for ticker in tickers:
    try:
        df = pd.read_html("http://bvmf.bmfbovespa.com.br/Fundos-Listados/FundosListadosDetalhe.aspx?Sigla="+ticker+"&tipoFundo=Imobiliario&aba=abaPrincipal&idioma=pt-br")[0]
        # Row 1 holds 'Códigos de Negociação', row 2 holds 'CNPJ'; column 1 holds the values.
        code = df.at[1, 1]
        # Strip dots, slashes, and dashes from the CNPJ.
        clean_cnpj = df.at[2, 1].translate(removechars)
        codigo.append(code)
        cnpj.append(clean_cnpj)
    except Exception:
        print('y')
# Build the result table once, after the loop.
df2 = pd.DataFrame({'Codigo': codigo, 'CNPJ': cnpj})
print(df2)
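The positional lookups df.at[1, 1] and df.at[2, 1] depend on the row order of the scraped table. Here is a minimal offline sketch, assuming each detail table has the labels in column 0 and the values in column 1 as in the sample above, that looks the fields up by label instead. The frame detail is a hand-built stand-in for one scraped table.

```python
import pandas as pd

# Hand-built stand-in for one scraped detail table (see the sample above).
detail = pd.DataFrame({
    0: ["Nome de Pregão", "Códigos de Negociação", "CNPJ"],
    1: ["FII ALIANZA", "ALZR11", "28.737.771/0001-85"],
})

# Index the values by their label so row order no longer matters.
fields = detail.set_index(0)[1]

# Translation table that deletes dots, slashes, and dashes.
removechars = str.maketrans("", "", "./-")

row = {
    "Código": fields["Códigos de Negociação"],
    "CNPJ": fields["CNPJ"].translate(removechars),
}
df2 = pd.DataFrame([row])
print(df2)
```

In the scraping loop, one such row dictionary could be collected per ticker and the list passed to pd.DataFrame at the end.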

Pandas: Memory error when using apply to split single column array into columns

Does anybody have a quick fix for a memory error that appears when doing the same thing as in the example below on larger data?
Example:
import pandas as pd
import numpy as np
nRows = 2
nCols = 3
df = pd.DataFrame(index=range(nRows ), columns=range(1))
df2 = df.apply(lambda row: [np.random.rand(nCols)], axis=1)
df3 = pd.concat(df2.apply(pd.DataFrame, columns=range(nCols)).tolist())
I get the memory error when creating df3.
The DataFrames in the example:
df
0
0 NaN
1 NaN
df2
0 [[0.6704675101784022, 0.41730480236712697, 0.5...
1 [[0.14038693859523377, 0.1981014890848788, 0.8...
dtype: object
df3
0 1 2
0 0.670468 0.417305 0.558690
0 0.140387 0.198101 0.800745
First, I think storing lists in pandas is not a good idea; avoid it if possible.
So I believe you can simplify your code a lot:
nRows = 2
nCols = 3
np.random.seed(2019)
df3 = pd.DataFrame(np.random.rand(nRows, nCols))
print (df3)
0 1 2
0 0.903482 0.393081 0.623970
1 0.637877 0.880499 0.299172
Here is an example that solves the problem. Note that the column in this example holds arrays rather than lists; I cannot avoid this, since my original data comes with lists or arrays in a column. The conversion is done in chunks, and each chunk is cast to float16 to reduce memory use.
import pandas as pd
import numpy as np
np.random.seed(1)
nRows = 25000
nCols = 10000
numberOfChunks = 5
# Number of rows converted per iteration.
chunkSize = int(round(nRows / float(numberOfChunks)))
df = pd.DataFrame(index=range(nRows), columns=range(1))
df2 = df.apply(lambda row: np.random.rand(nCols), axis=1)
for start in range(0, nRows, chunkSize):
    stop = start + chunkSize
    df2tmp = df2.iloc[start:stop]
    if start == 0:
        df3 = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
        continue
    df3tmp = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
    df3 = pd.concat([df3, df3tmp])
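Repeated pd.concat copies the accumulated frame on every iteration. A possible variation, sketched here on small sizes and assuming every array in the column has the same length, is to preallocate one float16 array and fill it chunk by chunk. The names n_rows, n_cols, and chunk_size are illustrative.

```python
import numpy as np
import pandas as pd

n_rows, n_cols, chunk_size = 6, 4, 2  # small sizes for illustration
rng = np.random.default_rng(1)

# A Series whose elements are equal-length arrays, as in the question.
s = pd.Series([rng.random(n_cols) for _ in range(n_rows)])

# Allocate the final result once, already in the smaller dtype.
out = np.empty((n_rows, n_cols), dtype=np.float16)
for start in range(0, n_rows, chunk_size):
    stop = min(start + chunk_size, n_rows)
    # Stack only one chunk of rows at a time into a 2-D block.
    out[start:stop] = np.stack(s.iloc[start:stop].to_numpy())

df3 = pd.DataFrame(out, index=s.index)
print(df3.dtypes)
```

Only one chunk exists in float64 form at any time, so peak memory stays close to the size of the final float16 array plus the original column.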