Pandas - string replace but element is a list of strings


This line lets me replace the substring /data/ in every row of the column path with "../datasets/"
df['path']=df['path'].astype(str).str.replace("/data/","../datasets/")
What if every row of the column path contains a list of strings e.g. ["/data/1","/data/2"] ? How can I use replace?
for example df['path'][0] should go from ["/data/1","/data/2"] to ["../datasets/1","../datasets/2"]

Use apply:
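import pandas as pd
df = pd.DataFrame({
    'path': [["/data/1","/data/2"]]
})
# apply runs once per row; the list comprehension rewrites each string in that row's list
df['path'] = df['path'].apply(lambda lst: [s.replace('/data/', '../datasets/') for s in lst])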
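If some rows might hold a missing value instead of a list, the list comprehension would fail on them. Here is a minimal sketch, assuming such rows should pass through unchanged. The helper name replace_in_list and the sample rows are hypothetical, not from the question.

```python
import pandas as pd

def replace_in_list(value, old="/data/", new="../datasets/"):
    """Replace `old` with `new` in every string of a list; leave non-lists untouched."""
    if isinstance(value, list):
        return [s.replace(old, new) for s in value]
    return value  # assumption: missing values (None/NaN) are passed through as-is

# Hypothetical sample data: two list rows and one missing row.
df = pd.DataFrame({"path": [["/data/1", "/data/2"], None, ["/data/3"]]})
df["path"] = df["path"].apply(replace_in_list)
print(df["path"].tolist())
```

A named function is used here instead of a lambda only so the type check stays readable.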

Related

How can I create a pandas column based on another pandas column that has for values a list?

I am working with a dataframe in which one column holds a list of strings in each row. Each list contains a number of links (the number can differ from row to row). I want to create a new column based on this column of lists that keeps only the links containing the keyword "uploads".
In my example, the first entry of the column looks like this:
['https://seekingalpha.com/instablog/5006891-hfir/4960045-natural-gas-daily',
'https://seekingalpha.com/article/4116929-weekly-natural-gas-storage-report',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647719993095_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647854075453_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-1509065004154725_origin.png',
'https://seekingalpha.com/account/research/subscribe?slug=hfir-energy&sasource=upsell']
And I want to keep only
['https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647719993095_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647854075453_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-1509065004154725_origin.png']
And put the clean version in a new column of the same dataframe.
Can you please suggest a way to do it?

I just found a way where I create a function that looks within a list for a specific pattern (in my case the keyword "uploads")
def clean_alt_list(list_):
    list_ = [s for s in list_ if "uploads" in s]
    return list_
And then I apply this function into the column I am interested in
df['clean_links'] = df['links'].apply(clean_alt_list)

IIUC, this should work for you:
df = pd.DataFrame({'url': [['https://seekingalpha.com/instablog/5006891-hfir/4960045-natural-gas-daily', 'https://seekingalpha.com/article/4116929-weekly-natural-gas-storage-report', 'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647719993095_origin.png', 'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647854075453_origin.png', 'https://static.seekingalpha.com/uploads/2017/10/26/5006891-1509065004154725_origin.png', 'https://seekingalpha.com/account/research/subscribe?slug=hfir-energy&sasource=upsell']]})
df = df.explode('url').reset_index(drop=True)
df[df['url'].str.contains('uploads')]
Result:
                                                 url
2  https://static.seekingalpha.com/uploads/2017/1...
3  https://static.seekingalpha.com/uploads/2017/1...
4  https://static.seekingalpha.com/uploads/2017/1...
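The explode approach returns one row per link. If one list per original row is wanted in a new column, as the question asks, the exploded values can be grouped back by their original index. This is a sketch under assumptions: the example.com URLs are made up, and rows with no matching link are assumed to need an empty list.

```python
import pandas as pd

# Hypothetical sample data: row 0 has one "uploads" link, row 1 has none.
df = pd.DataFrame({
    "url": [
        ["https://example.com/article/1", "https://example.com/uploads/a.png"],
        ["https://example.com/article/2"],
    ]
})

# Explode WITHOUT resetting the index so each link remembers its source row.
exploded = df["url"].explode()
kept = exploded[exploded.str.contains("uploads", regex=False, na=False)]

# Collect the surviving links back into one list per original row.
regrouped = kept.groupby(level=0).agg(list)

# Rows with no matching link vanish in the groupby; give them an empty list.
df["clean_links"] = [regrouped.get(i, []) for i in df.index]
print(df["clean_links"].tolist())
```

For small frames the plain apply with a list comprehension is usually simpler; the explode route mainly helps when further vectorised string operations are needed on the individual links.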

Pandas splitting a column with new line separator

I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True) and that splits it into two columns however I want the new column names to be A and B and not 0 and 1. Also I need to pass a generalized column label instead of actual column name since I need to implement this for several docs which may have different column names. I can determine such column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However when I pass colNew in split function, it throws an attribute error
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'

You can take advantage of the Pandas split function.
import pandas as pd
# recreate your pandas series above.
df = pd.DataFrame({'A\nB':['1\n2','2\n3','3\n4']})
# First: cast the column to str.
# Second: split the column on the separator \n.
# Third: pass expand=True so the pieces become two new columns.
test = df['A\nB'].astype('str').str.split('\n', expand=True)
# Rename the new columns.
test.columns = ['A','B']
I hope this is helpful.

I reproduced the error on my side. The issue is that colNew is an Index of labels rather than a single label, so df[colNew] is still a DataFrame, even when only one column matches.
But .str.split() only works on a Series, so, taking your code as the example, I would convert the DataFrame to a Series using iloc[:,0].
Then another line splits the column header to name the new columns:
df2=df[colNew].iloc[:,0].str.split('\n', expand=True)
df2.columns = colNew[0].split('\n')
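Since the question asks for something that works across documents with different column names, the two answers can be combined into one helper. This is only a sketch: the function name split_newline_columns is hypothetical, and it assumes every cell splits into exactly as many pieces as its header does.

```python
import pandas as pd

def split_newline_columns(df, sep="\n"):
    """Split every column whose label contains `sep` into separate columns."""
    out = df.copy()
    merged = [c for c in df.columns if isinstance(c, str) and sep in c]
    for col in merged:
        parts = out[col].astype(str).str.split(sep, expand=True)
        # Assumption: the header and the cells contain the same number of pieces;
        # otherwise this assignment raises a length-mismatch error.
        parts.columns = col.split(sep)
        out = out.drop(columns=col).join(parts)
    return out

# Sample frame: one merged column plus an ordinary column "C" (made up).
df = pd.DataFrame({"A\nB": ["1\n2", "2\n3", "3\n4"], "C": [7, 8, 9]})
result = split_newline_columns(df)
print(result)
```

The split columns are appended at the end of the frame, so the original column order is not preserved.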

How to replace element in pandas DataFrame column [duplicate]

I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
and I want to replace the , comma with - dash. I'm currently using this method but nothing is changed.
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?

Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
      range
0    (2-30)
1  (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
From the docs we see this description of to_replace:
str or regex: str: string exactly matching to_replace will be replaced with value
So, because none of the str values is exactly ',', no replacement occurs. Compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
here we get an exact match on the second row and the replacement occurs.
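To make the difference concrete, here is a small side-by-side sketch of the whole-value and substring behaviours on the same two-row data. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(["(2,30)", ","], name="range")

# Series.replace without regex: only cells that are exactly "," are replaced.
whole_value = s.replace(",", "-")

# Series.str.replace: every "," inside each string is replaced.
substring = s.str.replace(",", "-", regex=False)

# Series.replace with regex=True also matches inside strings.
via_regex = s.replace(",", "-", regex=True)

print(whole_value.tolist())  # ['(2,30)', '-']
print(substring.tolist())    # ['(2-30)', '-']
print(via_regex.tolist())    # ['(2-30)', '-']
```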

For anyone else arriving here from Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built in replace method available on a dataframe object.
df.replace(',', '-', regex=True)
Source: Docs

If you only need to replace characters in one specific column and regex=True and inplace=True both failed for you, I think this way will work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
lambda is more like a function that works like a for loop in this scenario.
x here represents every one of the entries in the current column.
The only thing you need to do is to change the "column_name", "characters_need_to_replace" and "new_characters".

Replace all spaces with underscores in the column names:
data.columns= data.columns.str.replace(' ','_',regex=True)

In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re
chars_to_remove = ['.', '-', '(', ')', '']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'
df['string_col'].str.replace(regular_expression, '', regex=True)

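Almost the same as the answer by Nancy K, but applied to a DataFrame selection (note the double brackets). There, apply passes each column to the lambda as a Series, so .str is available; on a single Series, apply passes plain strings and x.str would raise an AttributeError. This works for me:
data[["column_name"]] = data[["column_name"]].apply(lambda x: x.str.replace("characters_need_to_replace", "new_characters"))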

If you want to remove two or more characters from a string, for example '$' and ',':
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
=> [ '100000', '1100000' ] (the values are still strings)
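Note that str.replace returns strings, so the cleaned values still need a conversion before arithmetic. Here is a minimal sketch, assuming the cleaned column should become integers; the new column name amount is hypothetical.

```python
import pandas as pd

data = pd.DataFrame({"Column_Name": ["$100,000", "$1,100,000"]})

# Inside a character class, "$" and "," are matched literally.
cleaned = data["Column_Name"].str.replace("[$,]", "", regex=True)

# Convert the cleaned strings to a numeric dtype.
data["amount"] = pd.to_numeric(cleaned)
print(data["amount"].tolist())  # [100000, 1100000]
```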

Remove the ending of a column in pandas dataframe starting from a specific substring

I have a dataframe df with only one column which have strings like: 005f12b33ac4bdb310d8e503a065ef10b28566ea#code_id#alarm|clock
I want to remove the ending of the string starting from #code_id# and want the string to look like
005f12b33ac4bdb310d8e503a065ef10b28566ea
How can I do it?

Try this :
df['Column_name']= df['Column_name'].astype(str).str.split('#').str[0]

Use:
df['col'] = df['col'].str.extract(r'(.*)#code_id#', expand=False)
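The two answers behave differently on edge cases: splitting on '#' cuts at the first '#' anywhere in the string, and extract yields NaN for rows that lack the marker. A possible middle ground is to split on the full marker once. This is a sketch with made-up sample values, not the asker's data.

```python
import pandas as pd

# Hypothetical rows: a normal value, a prefix that itself contains '#',
# and a value without the marker.
df = pd.DataFrame({"col": ["005f12b3#code_id#alarm|clock", "ab#12#code_id#x", "no_marker"]})

# Split on the whole marker at most once and keep the part before it.
df["prefix"] = df["col"].str.split("#code_id#", n=1).str[0]
print(df["prefix"].tolist())  # ['005f12b3', 'ab#12', 'no_marker']
```

Rows without the marker are left unchanged here, which may or may not be what is wanted.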

Is there a way to selectively replace content in a dataframe?

df.replace(['《', '》', ' ,\t'], ['', '', '|'], regex=True, inplace=True)
I have a dataFrame in which I want to replace the 3 characters. But these replace all the columns. Can I exclude a particular column in the above code that won't be replaced? For example, I have a column called 'Summary' that I don't want to apply these replacements.
Is that possible?

Assuming you only want to exclude df['Summary'], you can use .loc to select every column except that one and then replace the strings you need to replace.
I dropped inplace=True: with it, replace returns None, so there would be nothing to assign back.
df.loc[:, df.columns != 'Summary'] = df.loc[:, df.columns != 'Summary'].replace(['《','》',' ,\t'],['','','|'], regex=True)

Try passing a dict:
df.update(df.drop(columns='Summary').replace(dict(zip(['《', '》', ' ,\t'], ['', '', '|'])), regex=True))
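Another way to express the same idea is to build the list of target columns first and assign the result back to just those columns. This is a sketch on made-up data; the Title column and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame: "Title" should be cleaned, "Summary" must stay untouched.
df = pd.DataFrame({
    "Title": ["《A》", "B ,\tC"],
    "Summary": ["《keep》", "x ,\ty"],
})

# Every column label except "Summary".
cols = df.columns.difference(["Summary"])

# Replace only within the selected columns and write the result back.
df[cols] = df[cols].replace(["《", "》", " ,\t"], ["", "", "|"], regex=True)
print(df)
```

To protect several columns, more labels can be passed to difference.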
