How to map integer to string value in pandas dataframe

I have this Python dictionary:
dictionary = {
    '1':'A',
    '2':'B',
    '3':'C',
    '4':'D',
    '5':'E',
    '6':'F',
    '7':'G',
    '8':'H',
    '8':'I',
    '9':'J',
    '0':'L'
}
Then I have created this simple pandas dataframe:
import pandas as pd
ds = {'col1' : [12345,67890], 'col2' : [12364,78910]}
df = pd.DataFrame(data=ds)
print(df)
Which looks like this:
    col1   col2
0  12345  12364
1  67890  78910
I would like to transform each digit in col1 (which is an int field) into the corresponding letter as per the dictionary above. So, basically, I'd like the resulting dataframe to look like this:
    col1   col2 col1_transformed
0  12345  12364            ABCDE
1  67890  78910            FGHIJ
Is there a quick, pythonic way to do so by any chance?

A possible solution (notice that 8 is repeated in your dictionary -- a typo? -- and, therefore, my result does not match yours):
def f(x):
    return ''.join([dictionary[y] for y in str(x)])
df['col3'] = df['col1'].map(f)
Output:
    col1   col2   col3
0  12345  12364  ABCDE
1  67890  78910  FGIJL
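For longer columns, a sketch that avoids the explicit per-digit loop: str.maketrans accepts a dict of single-character string keys, and Series.str.translate applies the resulting table to every string at once (assuming the dictionary as literally written, where the second '8' key wins).
# Not from the original answer: build a digit -> letter translation table.
table = str.maketrans(dictionary)
df['col3'] = df['col1'].astype(str).str.translate(table)
This gives the same ABCDE / FGIJL result as above.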

Try:
df[df.columns + "_transformed"] = df.apply(
    lambda x: [
        "".join(dictionary.get(ch, "") for ch in s) for s in map(str, x)
    ],
    axis=1,
    result_type="expand",
)
print(df)
Prints:
    col1   col2 col1_transformed col2_transformed
0  12345  12364            ABCDE            ABCFD
1  67890  78910            FGIJL            GIJAL
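Here result_type="expand" spreads the list returned for each row across the new *_transformed columns, and dictionary.get(ch, "") silently skips any character that has no mapping.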

Related

Add row to an existing dataframe

I have an existing data frame with known columns. I want to insert a row with data for each column inserted one at a time.
I first created an empty data frame with a few columns:
df = pd.DataFrame(columns=['col1', 'col2', 'col3'])
df.to_csv('test.csv', sep='|', index=False)
test.csv
col1|col2|col3
Then, added a row with data inserted for each column one at a time.
list = ['col1', 'col2', 'col3']
turn = 2
df = pd.read_csv('test.csv', sep='|')
while turn:
    for each in list:
        df[each] = turn
    turn -= 1
Expected output test.csv
col1|col2|col3
2 |2 |2
1 |1 |1
But I am unable to get the expected output, instead, I'm getting this
col1|col2|col3
Kindly let me know where I'm making a mistake; I would really appreciate any sort of help.
You can use df.append() to append a row:
import pandas as pd

df = pd.DataFrame(columns=['col1', 'col2', 'col3'])
turn = 2
while turn:
    new_row = {'col1': turn, 'col2': turn, 'col3': turn}
    df = df.append(new_row, ignore_index=True)
    turn -= 1
Out[11]:
  col1 col2 col3
0    2    2    2
1    1    1    1
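Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A minimal equivalent with pd.concat:
# pd.concat replacement for the removed DataFrame.append
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)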
To modify your while loop, do:
turn = 2
while turn:
    for each in list:
        df.loc[len(df.dropna()), each] = turn
    turn -= 1
>>> df
  col1 col2 col3
0    2    2    2
1    1    1    1
The reason it doesn't work is that you're assigning to the whole column, not to the specific row value.
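For example, df['col1'] = 2 overwrites every row of col1 with 2, whereas df.loc[0, 'col1'] = 2 touches only row 0.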

Substring column in Pandas based on another column

I'm trying to substring a column based on the length of another column, but the result is NaN. What am I doing wrong?
import pandas as pd
df = pd.DataFrame([['abcdefghi','xyz'], ['abcdefghi', 'z']], columns=['col1', 'col2'])
df.col1.str[:df.col2.str.len()]
0    NaN
1    NaN
Name: col1, dtype: float64
Here is what I am expecting:
0 'abc'
1 'a'
I don't think string indexing will accept a Series. I would do a list comprehension:
df['extract'] = [r.col1[:len(r.col2)] for _,r in df.iterrows()]
Or
df['extract'] = [s1[:len(s2)] for s1,s2 in zip(df.col1, df.col2)]
Output:
        col1 col2 extract
0  abcdefghi  xyz     abc
1  abcdefghi    z       a
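Of the two, the zip version is usually faster, since iterrows constructs a Series object for every row.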
Using numpy and converting the array to pd.Series:
import numpy as np

def slicer(start=None, stop=None, step=1):
    return np.vectorize(lambda x: x[start:stop:step], otypes=[str])

df["new_str"] = pd.Series(
    [slicer(0, i)(c) for i, c in zip(df["col2"].apply(len), df["col1"].values)]
)
print(df)
        col1 col2 new_str
0  abcdefghi  xyz     abc
1  abcdefghi    z       a
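Note that np.vectorize is documented as essentially a Python loop under the hood, so this variant is a stylistic choice rather than a performance win.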
Here is a solution using lambda:
df['new'] = df.apply(lambda row: row['col1'][0:len(row['col2'])], axis=1)
Result:
        col1 col2  new
0  abcdefghi  xyz  abc
1  abcdefghi    z    a

Pandas - groupby and count series string over column

I have a df like this:
import pandas as pd
df = pd.DataFrame(columns=['Concat','SearchTerm'])
df = df.append({'Concat':'abc','SearchTerm':'aa'}, ignore_index=True)
df = df.append({'Concat':'abc','SearchTerm':'aab'}, ignore_index=True)
df = df.append({'Concat':'abc','SearchTerm':'aac'}, ignore_index=True)
df = df.append({'Concat':'abc','SearchTerm':'ddd'}, ignore_index=True)
df = df.append({'Concat':'def','SearchTerm':'cef'}, ignore_index=True)
df = df.append({'Concat':'def','SearchTerm':'plo'}, ignore_index=True)
df = df.append({'Concat':'def','SearchTerm':'cefa'}, ignore_index=True)
print(df)
  Concat SearchTerm
0    abc         aa
1    abc        aab
2    abc        aac
3    abc        ddd
4    def        cef
5    def        plo
6    def       cefa
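As an aside, the same frame can be built in one constructor call instead of repeated appends:
df = pd.DataFrame({
    'Concat': ['abc'] * 4 + ['def'] * 3,
    'SearchTerm': ['aa', 'aab', 'aac', 'ddd', 'cef', 'plo', 'cefa'],
})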
I want to group up the df by Concat, and count how many times each SearchTerm appears within the strings of that subset. So the final result should look like this:
  Concat SearchTerm  Count
0    abc         aa      3
1    abc        aab      1
2    abc        aac      1
3    abc        ddd      1
4    def        cef      2
5    def        plo      1
6    def       cefa      1
For Concat abc, aa is found 3 times among the 4 SearchTerms. I can get the solution using a loop, but for my larger dataset, it is too slow.
I have tried two solutions from this thread and this thread.
df['Count'] = df['SearchTerm'].str.contains(df['SearchTerm']).groupby(df['Concat']).sum()
df['Count'] = df.groupby(['Concat'])['SearchTerm'].transform(lambda x: x[x.str.contains(x)].count())
In either case, there is a TypeError:
'Series' objects are mutable, thus they cannot be hashed
Any help would be appreciated.
Use transform and a list comprehension:
s = df.groupby('Concat').SearchTerm.transform('|'.join)
df['Count'] = [s[i].count(term) for i, term in enumerate(df.SearchTerm)]
Out[77]:
  Concat SearchTerm  Count
0    abc         aa      3
1    abc        aab      1
2    abc        aac      1
3    abc        ddd      1
4    def        cef      2
5    def        plo      1
6    def       cefa      1
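Here transform('|'.join) broadcasts each group's concatenated SearchTerms back onto every row, so s[i].count(term) counts each term's occurrences within its own Concat group; 'aa' counts 3 because it also matches inside 'aab' and 'aac'.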

pandas: replace values conditionally based on another column

I have a dataframe that looks like this:
col1      col2
Yes      23123
No    23423423
Yes      34234
No       13213
I want to replace values in col2 so that if col1 is 'Yes' the value becomes blank, and if 'No' it keeps its initial value.
I want to see this:
col1      col2
Yes
No    23423423
Yes
No       13213
I have tried this but 'No' is returning None:
def map_value(x):
    if x in ['Yes']:
        return ''
    else:
        return None

df['col2'] = df['col1'].apply(map_value)
There are many ways to go about this; one of them is:
df.loc[df.col1 == 'Yes', 'col2'] = ''
Output:
col1      col2
Yes
No    23423423
Yes
No       13213
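The original attempt fails because map_value sees only col1 and returns None for the 'No' rows, discarding the col2 values. A row-wise sketch that keeps them:
# apply over rows so both columns are visible to the function
df['col2'] = df.apply(lambda r: '' if r['col1'] == 'Yes' else r['col2'], axis=1)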
You can use numpy for this:
import pandas as pd
import numpy as np
d = {'col1': ['yes', 'no', 'yes', 'no'], 'col2': [23123,23423423,34234,13213]}
df = pd.DataFrame(data=d)
df['col2'] = np.where(df.col1 == 'yes', '', df.col2)
df
Created df by copying the sample data from OP's post and using the following command:
df = pd.read_clipboard()
df
  col1      col2
0  Yes     23123
1   No  23423423
2  Yes     34234
3   No     13213
Could you please try the following:
m = df['col1'] == 'No'
df['col2'] = df['col2'].where(m, '')
df
After running the code, the output will be as follows:
  col1      col2
0  Yes
1   No  23423423
2  Yes
3   No     13213
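Series.where keeps the values where the mask m is True (the 'No' rows) and replaces everything else with '', which is why the 'Yes' rows come out blank.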

Removing part of a string in every column in pandas dataframe after a symbol

I want to remove everything after '-' in each row in one column in a pandas dataframe. I have tried str.split to no avail.
Try this (pass regex=True explicitly, since pandas 2.0 made str.replace match literally by default):
df['column'] = df['column'].str.replace(r'-.*$', '', regex=True)
Demo:
In [154]: df
Out[154]:
        column
0          aaa
1  asd-bfd-asd
2  -xsdert-...
3      123-345
In [155]: df['column'] = df['column'].str.replace(r'-.*$', '', regex=True)
In [156]: df
Out[156]:
  column
0    aaa
1    asd
2
3    123
or using .str.split():
In [159]: df['column'] = df['column'].str.split('-').str[0]
In [160]: df
Out[160]:
  column
0    aaa
1    asd
2
3    123
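A third option, as a sketch, is a regex capture of everything before the first '-':
# extract the leading run of non-hyphen characters; expand=False returns a Series
df['column'] = df['column'].str.extract(r'^([^-]*)', expand=False)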