Replacing partial text in cells in a dataframe - pandas

This is an extension of a question asked and answered earlier (Replace specific values inside a cell without changing other values in a dataframe).
I have a dataframe where numeric codes are used in place of text strings, and I would like to replace those codes with their text values. The regex approach from the referenced question used to work, but it no longer does, and I don't know whether any changes were made to the .replace method.
Example of my dataframe:
col1
0 1,2,3
1 1,2
2 2-3
3 2, 3
My code uses a dictionary of the values to be changed, with regex set to True.
I used the following code:
d = {'1':'a', '2':'b', '3':'c'}
df['col2'] = df['col1'].replace(d, regex=True)
The result I got is:
col1 col2
0 1,2,3 a,2,3
1 1,2 a,2
2 2-3 b-3
3 2, 3 b, 3
Whereas, I was expecting:
col1 col2
0 1,2,3 a,b,c
1 1,2 a,b
2 2-3 b-c
3 2, 3 b, c
Or alternatively:
col1
0 a,b,c
1 a,b
2 b-c
3 b, c
Have there been any changes to the .replace method in the last year, or am I doing something wrong here? The same code worked before but no longer does.

OK, after some experimenting, I found that I need a separate regex replacement statement for each code in my cells, such as:
df.replace({'col1': r'1'}, {'col1': 'a'}, regex=True, inplace=True)
df.replace({'col1': r'2'}, {'col1': 'b'}, regex=True, inplace=True)
df.replace({'col1': r'3'}, {'col1': 'c'}, regex=True, inplace=True)
Which results in:
col1
0 a,b,c
1 a,b
2 b-c
3 b, c
This is just a workaround, as it overwrites the existing column, but it works in my case since the main objective was to replace the codes with values.
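As an alternative to one statement per code, a single pass is also possible with str.replace and a callable replacement that maps each match through the dictionary (a sketch, assuming the codes are single digits):

```python
import pandas as pd

# Sample data matching the question
df = pd.DataFrame({'col1': ['1,2,3', '1,2', '2-3', '2, 3']})
d = {'1': 'a', '2': 'b', '3': 'c'}

# Replace every digit in one pass; digits not in d are left unchanged
df['col2'] = df['col1'].str.replace(
    r'\d', lambda m: d.get(m.group(), m.group()), regex=True)
```

Because the callable runs once per match, every code in a cell gets replaced, not just the first.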

In dataframe, merge row by matching multiple id but, condition is different for all id like (full or partial match) [duplicate]

I want to merge several strings in a dataframe based on a groupby in pandas.
This is my code so far:
import pandas as pd
from io import StringIO
data = StringIO("""
"name1","hej","2014-11-01"
"name1","du","2014-11-02"
"name1","aj","2014-12-01"
"name1","oj","2014-12-02"
"name2","fin","2014-11-01"
"name2","katt","2014-11-02"
"name2","mycket","2014-12-01"
"name2","lite","2014-12-01"
""")
# load string as stream into dataframe
df = pd.read_csv(data, header=None, names=["name","text","date"], parse_dates=[2])
# add column with month
df["month"] = df["date"].dt.month
I want the end result to be one row per name and month, with the strings in the "text" column concatenated. I don't get how I can use groupby and apply some sort of concatenation of the strings in the column "text". Any help appreciated!
You can group by the 'name' and 'month' columns, then call transform, which returns data aligned to the original df, and apply a lambda that joins the text entries:
In [119]:
df['text'] = df[['name','text','month']].groupby(['name','month'])['text'].transform(lambda x: ','.join(x))
df[['name','text','month']].drop_duplicates()
Out[119]:
name text month
0 name1 hej,du 11
2 name1 aj,oj 12
4 name2 fin,katt 11
6 name2 mycket,lite 12
I subset the original df by passing a list of the columns of interest, df[['name','text','month']], and then call drop_duplicates.
EDIT: actually I can just call apply and then reset_index:
In [124]:
df.groupby(['name','month'])['text'].apply(lambda x: ','.join(x)).reset_index()
Out[124]:
name month text
0 name1 11 hej,du
1 name1 12 aj,oj
2 name2 11 fin,katt
3 name2 12 mycket,lite
Update: the lambda is unnecessary here:
In[38]:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()
Out[38]:
name month text
0 name1 11 hej,du
1 name1 12 aj,oj
2 name2 11 fin,katt
3 name2 12 mycket,lite
We can group by the 'name' and 'month' columns, then call the agg() function of pandas DataFrame objects.
The agg() function allows multiple statistics to be calculated per group in one call.
df.groupby(['name', 'month'], as_index=False).agg({'text': ' '.join})
The answer by EdChum gives you a lot of flexibility, but if you just want to concatenate the strings into a column of list objects, you can also do:
output_series = df.groupby(['name','month'])['text'].apply(list)
If you want to concatenate your "text" into a list:
df.groupby(['name', 'month'], as_index=False).agg({'text': list})
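With the sample data from the question, the list aggregation would look roughly like this (a sketch using a trimmed-down frame):

```python
import pandas as pd

df = pd.DataFrame({'name': ['name1', 'name1', 'name2'],
                   'month': [11, 11, 12],
                   'text': ['hej', 'du', 'lite']})

# One list of text values per (name, month) group
out = df.groupby(['name', 'month'], as_index=False).agg({'text': list})
```

Each group keeps its original row order inside the resulting list.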
For me the above solutions were close but added some unwanted \n's and dtype: object, so here's a modified version:
df.groupby(['name', 'month'])['text'].apply(lambda text: ''.join(text.to_string(index=False))).str.replace(r'\n', '', regex=True).reset_index()
Please try this line of code:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()
Although this is an old question, just in case: I used the code below and it seems to work like a charm.
text = ''.join(df[df['date'].dt.month==8]['text'])
Thanks to all the other answers, the following is probably the most concise and feels more natural. Using df.groupby("X")["A"].agg() aggregates over one or many selected columns.
import pandas
df = pandas.DataFrame({'A': ['a', 'a', 'b', 'c', 'c'],
                       'B': ['i', 'j', 'k', 'i', 'j'],
                       'X': [1, 2, 2, 1, 3]})
A B X
a i 1
a j 2
b k 2
c i 1
c j 3
df.groupby("X", as_index=False)["A"].agg(' '.join)
X A
1 a c
2 a b
3 c
df.groupby("X", as_index=False)[["A", "B"]].agg(' '.join)
X A B
1 a c i i
2 a b j k
3 c j

Convert transactions with several products from columns to row [duplicate]

I'm having a very tough time trying to figure out how to do this with python. I have the following table:
NAMES VALUE
john_1 1
john_2 2
john_3 3
bro_1 4
bro_2 5
bro_3 6
guy_1 7
guy_2 8
guy_3 9
And I would like to go to:
NAMES VALUE1 VALUE2 VALUE3
john 1 2 3
bro 4 5 6
guy 7 8 9
I have tried this with pandas: I first split the NAMES column and can create the new columns, but I have trouble getting the values into the right column.
Can someone at least give me a direction where the solution to this problem is? I don't expect a full code (I know that this is not appreciated) but any help is welcome.
After splitting the NAMES column, use .pivot to reshape your DataFrame.
# Split Names and Pivot.
df['NAME_NBR'] = df['NAMES'].str.split('_').str.get(1)
df['NAMES'] = df['NAMES'].str.split('_').str.get(0)
df = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
# Rename columns and reset the index.
df.columns = ['VALUE{}'.format(c) for c in df.columns]
df.reset_index(inplace=True)
If you want to be slick, you can do the split in a single line:
df['NAMES'], df['NAME_NBR'] = zip(*[s.split('_') for s in df['NAMES']])
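The same split can also be done with str.split(expand=True), which returns both pieces as columns in one call (a sketch, assuming every name contains exactly one underscore):

```python
import pandas as pd

df = pd.DataFrame({'NAMES': ['john_1', 'john_2', 'john_3', 'bro_1', 'bro_2', 'bro_3'],
                   'VALUE': [1, 2, 3, 4, 5, 6]})

# Split 'john_1' into 'john' and '1' in a single call
df[['NAMES', 'NAME_NBR']] = df['NAMES'].str.split('_', expand=True)

# Pivot to wide form and rename the columns
wide = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
wide.columns = ['VALUE{}'.format(c) for c in wide.columns]
wide = wide.reset_index()
```

Note that pivot sorts the index, so the output rows come back in alphabetical order of NAMES.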

Delete all rows with an empty cell anywhere in the table at once in pandas

I have googled it and found lots of question in stackoverflow. So suppose I have a dataframe like this
A B
-----
1
2

4 4
The first three rows (every row with at least one empty cell) should be deleted. And suppose I have not 2 but 200 columns. How can I do that?
As per your request, first replace empty strings with NaN:
import numpy as np
df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.dropna()
If you want to drop rows based only on a specific column, you need to specify the column name in dropna's subset argument.
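For example, to drop rows only where one particular column is empty (a sketch; 'A' is a placeholder column name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['1', '', '4'], 'B': ['', '2', '4']})

# Treat empty or whitespace-only strings as missing values
df = df.replace(r'^\s*$', np.nan, regex=True)

# Drop rows that are missing in column 'A' only; blanks in other columns survive
df = df.dropna(subset=['A'])
```

The row where only 'B' is blank is kept, while the row with an empty 'A' is removed.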

pandas get columns without copy

I have a dataframe with multiple columns, and I want to keep some of them and drop the others, without creating a copy of the dataframe.
I suppose it should be
df = df['col_a','col_b']
but I'm not sure whether it copies the data or not. Is there any better way to do this?
Your approach should work, apart from one minor issue:
df = df['col_a','col_b']
should be:
df = df[['col_a','col_b']]
Because you assign the subset df back to df, it's essentially equivalent to dropping the other columns.
If you would like to drop other columns in place, you can do:
df.drop(columns=df.columns.difference(['col_a','col_b']),inplace=True)
Let me know if this is what you want.
Say you have a dataframe df with multiple columns a, b, c, d and e, and you want to select, say, a and b and store them back in df. To achieve this, you can do:
df=df[['a', 'b']]
Input dataframe df:
a b c d e
1 1 1 1 1
3 2 3 1 4
When you do:
df=df[['a', 'b']]
output will be :
a b
1 1
3 2

Pandas dataframe values not changing outside of function

I have a pandas dataframe inside a for loop where I change a value in pandas dataframe like this:
df[item].ix[(e1,e2)] = 1
However, when I access the df, the values are still unchanged. Do you know where exactly I am going wrong?
Any suggestions?
You are using chained indexing, which usually causes problems. In your code, df[item] returns a series, and then .ix[(e1,e2)] = 1 modifies that series, leaving the original dataframe untouched. You need to modify the original dataframe instead (note that .ix has since been deprecated in favor of .loc), like this:
import pandas as pd
df = pd.DataFrame({'colA': [5, 6, 1, 2, 3],
                   'colB': ['a', 'b', 'c', 'd', 'e']})
print(df)
df.loc[[1, 2], 'colA'] = 111
print(df)
That code sets rows 1 and 2 of colA to 111, which I believe is the kind of thing you were looking to do. 1 and 2 could be replaced with variables of course.
colA colB
0 5 a
1 6 b
2 1 c
3 2 d
4 3 e
colA colB
0 5 a
1 111 b
2 111 c
3 2 d
4 3 e
For more information on chained indexing, see the documentation:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Side note: you may also want to rethink your code in general since you mentioned modifying a dataframe in a loop. When using pandas, you usually can and should avoid looping and leverage set-based operations instead. It takes some getting used to, but it's the way to unlock the full power of the library.
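As a small illustration (with made-up column names), a loop that assigns one cell at a time can usually be collapsed into a single vectorized expression:

```python
import pandas as pd

df = pd.DataFrame({'colA': [5, 6, 1, 2, 3]})

# Instead of looping over rows and setting cells one by one,
# compute the whole column in one vectorized expression
df['flag'] = (df['colA'] > 4).astype(int)
```

The boolean mask df['colA'] > 4 is evaluated for every row at once, which is both shorter and much faster than a Python-level loop.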