I have a CSV file like so:
Id col1 col2 col3
a 7.04 0.3 1.2
b 0.3 1.7 .1
c 0.34 0.05 1.3
d 0.4 1.60 3.1
I want to convert it to a data frame by thresholding on 0.5: if a value is greater than or equal to 0.5, its column is counted; otherwise it is not.
Id classes
a col1,col3
b col2
c col3
d col2,col3
The closest solution I found is this one. However, it deals with a single row, not multiple rows. For multiple rows, the best I have is to iterate through all of them, but I need a succinct expression without a for loop.
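For reference, the loop-based approach described above might look like the sketch below (my own sketch; it assumes the file is whitespace-delimited, and the filename data.csv is hypothetical):

import pandas as pd

# Hypothetical filename; the sample above is whitespace-delimited.
df = pd.read_csv('data.csv', sep=r'\s+')

# Row-by-row baseline: collect the names of the columns >= 0.5.
classes = []
for _, row in df.set_index('Id').iterrows():
    classes.append(','.join(col for col, val in row.items() if val >= 0.5))

out = pd.DataFrame({'Id': df['Id'], 'classes': classes})
print(out)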
Use set_index first, then numpy.where to extract the column names by condition. Finally, remove the empty strings with a list comprehension:
import numpy as np
import pandas as pd

df = df.set_index('Id')
# Emit the column name where the value passes the threshold, '' otherwise
s = np.where(df >= .5, ['{}, '.format(x) for x in df.columns], '')
df['new'] = pd.Series([''.join(x).strip(', ') for x in s], index=df.index)
print(df)
col1 col2 col3 new
Id
a 7.04 0.30 1.2 col1, col3
b 0.30 1.70 0.1 col2
c 0.34 0.05 1.3 col3
d 0.40 1.60 3.1 col2, col3
Similarly, for a new DataFrame:
df1 = pd.DataFrame({'classes': [''.join(x).strip(', ') for x in s],
                    'Id': df.index})
print(df1)
Id classes
0 a col1, col3
1 b col2
2 c col3
3 d col2, col3
And if necessary, remove the spaces after the commas:
df1 = pd.DataFrame({'classes': [''.join(x).strip(', ').replace(', ', ',') for x in s],
                    'Id': df.index})
print(df1)
Id classes
0 a col1,col3
1 b col2
2 c col3
3 d col2,col3
Detail:
print(s)
[['col1, ' '' 'col3, ']
['' 'col2, ' '']
['' '' 'col3, ']
['' 'col2, ' 'col3, ']]
Alternative with apply (slower, since it evaluates a Python function for each row):
df1 = (df.set_index('Id')
         .apply(lambda x: ','.join(x.index[x >= .5]), axis=1)
         .reset_index(name='classes'))
print(df1)
Id classes
0 a col1,col3
1 b col2
2 c col3
3 d col2,col3
Comprehension after a clever multiplication. This assumes that Id is the index.
df.assign(classes=[
    ','.join(s for s in row if s)  # drop the empty strings
    for row in df.ge(.5).mul(df.columns).values  # True * 'colX' -> 'colX', False -> ''
])
col1 col2 col3 classes
Id
a 7.04 0.30 1.2 col1,col3
b 0.30 1.70 0.1 col2
c 0.34 0.05 1.3 col3
d 0.40 1.60 3.1 col2,col3
Setup
Custom subclass of str that redefines string addition to insert a ',' between non-empty operands:
class s(str):
    def __add__(self, other):
        if self and other:
            return s(super().__add__(',' + other))
        else:
            return s(super().__add__(other))
Fun Trick
df.ge(.5).mul(df.columns).applymap(s).sum(1)
Id
a col1,col3
b col2
c col3
d col2,col3
dtype: object
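In the same spirit, a mask dotted with the column names is another common idiom (a sketch, not taken from the answers above): matrix multiplication of the boolean mask with the string column names concatenates the matching names per row, because True * 'col1,' is 'col1,' and False * 'col1,' is ''.

# Sketch of the mask.dot(columns) idiom; assumes Id is the index.
out = df.ge(.5).dot(df.columns + ',').str.rstrip(',')
print(out)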
Related
I have 2 dataframes like these:
df1 = pd.DataFrame(data = {'col1' : ['finance', 'accounting'], 'col2' : ['f1', 'a1']})
df2 = pd.DataFrame(data = {'col1' : ['finance', 'finance', 'finance', 'accounting', 'accounting', 'IT', 'IT'], 'col2' : ['f1', 'f2', 'f3', 'a1', 'a2', 'I1', 'I2']})
df1
col1 col2
0 finance f1
1 accounting a1
df2
col1 col2
0 finance f1
1 finance f2
2 finance f3
3 accounting a1
4 accounting a2
5 IT I1
6 IT I2
I would like to do LEFT JOIN on col1 and ANTI-JOIN on col2. The output should look like this:
col1 col2
finance f2
finance f3
accounting a2
Could someone please show me how to do this properly in pandas? I tried both join and merge, but neither worked for me. Thanks in advance.
You can merge and filter:
(df1.merge(df2, on='col1', suffixes=('_', None))  # left join on col1; df1's col2 becomes col2_
    .loc[lambda d: d['col2'] != d.pop('col2_')]   # keep rows where col2 differs, dropping the helper
)
Output:
col1 col2
1 finance f2
2 finance f3
4 accounting a2
Just for fun, here's another way (other than the really elegant solution by @mozway):
df2 = (df2
       .reset_index()                    # save index as column 'index'
       .set_index('col1')                # make 'col1' the index
       .loc[df1.col1, :]                 # filter for 'col1' values in df1
       .set_index('col2', append=True)   # add 'col2' to the index
       .drop(index=df1.set_index(list(df1.columns)).index)
                                         # build a multi-index from df1 and drop all matches from df2
       .reset_index()                    # make 'col1' and 'col2' columns again
       .set_index('index')               # make 'index' the index again
       .rename_axis(index=None))         # make the index anonymous
Output:
col1 col2
1 finance f2
2 finance f3
4 accounting a2
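A third option (a sketch, not taken from either answer) is the standard merge-indicator anti-join: merge on both columns with indicator=True, keep the left-only rows, then restrict to col1 values that exist in df1:

# Anti-join on (col1, col2) via the merge indicator, then a semi-join on col1.
m = df2.merge(df1, on=['col1', 'col2'], how='left', indicator=True)
out = (m[m['_merge'] == 'left_only']
       .drop(columns='_merge')
       .loc[lambda d: d['col1'].isin(df1['col1'])])
print(out)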
I have a pandas data frame with one column of comma-separated strings,
eg:
col1 col2
B1,B2,B3 20
B4,B5,B6 15
and I want to create another data frame with the pairwise combinations, like:
Col1 Col2 col3 col4 col5
B1,B2,B3 20 B1,B2 B2,B3 B1,B3
B4,B5,B6 15 B4,B5 B5,B6 B4,B6
How can I do this in pandas?
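For reproducibility, the sample frame might be built like this (a sketch based on the table above):

import pandas as pd

df = pd.DataFrame({'col1': ['B1,B2,B3', 'B4,B5,B6'], 'col2': [20, 15]})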
If there are always 3 values in col1, you could use itertools.combinations:
from itertools import combinations
splits = df['col1'].str.split(',', expand=True).values.tolist()
df[['col3', 'col4', 'col5']] = [[','.join(c) for c in combinations(split, 2)] for split in splits]
print(df)
Output
col1 col2 col3 col4 col5
0 B1,B2,B3 20 B1,B2 B1,B3 B2,B3
1 B4,B5,B6 15 B4,B5 B4,B6 B5,B6
A more general solution is to do:
from itertools import combinations
splits = df['col1'].str.split(',').values.tolist()
rows = [[','.join(c) for c in combinations(split, 2)] for split in splits]
length = max(len(row) for row in rows)
new_cols = pd.DataFrame(data=rows, columns=[f'col{i}' for i in range(3, length + 3)])
res = pd.concat((df, new_cols), axis=1)
print(res)
Output
col1 col2 col3 col4 col5
0 B1,B2,B3 20 B1,B2 B1,B3 B2,B3
1 B4,B5 15 B4,B5 None None
Note that the input for the second example was:
col1 col2
B1,B2,B3 20
B4,B5 15
I have a dataframe df as:
Col1 Col2
A -5
A 3
B -2
B 15
I need to get the following:
Col1 Col2
A -5
B 15
Where the row for each group in Col1 is selected by the maximum absolute value in Col2. I am not sure how to proceed with this.
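For reference, the sample frame can be constructed as follows (a sketch based on the tables above):

import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'B', 'B'], 'Col2': [-5, 3, -2, 15]})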
Use DataFrameGroupBy.idxmax, passing the absolute values, to get the indices, then select the rows with DataFrame.loc:
df = df.loc[df['Col2'].abs().groupby(df['Col1']).idxmax()]
# alternative: reassign the absolute values first, then take idxmax per group
df = df.loc[df.assign(Col2=df['Col2'].abs()).groupby('Col1')['Col2'].idxmax()]
print(df)
Col1 Col2
0 A -5
3 B 15
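Another common pattern (a sketch, not part of the answer above) is to sort by the absolute value and keep the last row per group:

# Sort by |Col2| so the absolute maximum ends up last within each group.
out = (df.assign(abs_col2=df['Col2'].abs())
         .sort_values('abs_col2')
         .drop_duplicates('Col1', keep='last')
         .drop(columns='abs_col2')
         .sort_values('Col1'))
print(out)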
I'm trying to substring a column based on the length of another column, but the result is all NaN. What am I doing wrong?
import pandas as pd
df = pd.DataFrame([['abcdefghi','xyz'], ['abcdefghi', 'z']], columns=['col1', 'col2'])
df.col1.str[:df.col2.str.len()]
0 NaN
1 NaN
Name: col1, dtype: float64
Here is what I am expecting:
0 'abc'
1 'a'
I don't think string indexing will take a Series. I would use a list comprehension:
df['extract'] = [r.col1[:len(r.col2)] for _,r in df.iterrows()]
Or
df['extract'] = [s1[:len(s2)] for s1,s2 in zip(df.col1, df.col2)]
Output:
col1 col2 extract
0 abcdefghi xyz abc
1 abcdefghi z a
Using numpy and converting the array to a pd.Series:
import numpy as np

def slicer(start=None, stop=None, step=1):
    return np.vectorize(lambda x: x[start:stop:step], otypes=[str])

df["new_str"] = pd.Series(
    [slicer(0, i)(c) for i, c in zip(df["col2"].apply(len), df["col1"].values)]
)
print(df)
col1 col2 new_str
0 abcdefghi xyz abc
1 abcdefghi z a
Here is a solution using lambda:
df['new'] = df.apply(lambda row: row['col1'][0:len(row['col2'])], axis=1)
Result:
col1 col2 new
0 abcdefghi xyz abc
1 abcdefghi z a
I only want to apply a function to each row if col1 contains the value 12; for rows that don't have 12, return 0 in col3.
That is, I only want to apply the following to rows where col1 == 12:
df['col3'] = df['col2'].str.lower().str.contains('apple',na=0)
df
col1 col2
12 apple
13 apple
12 grape
Expected results:
df1
col1 col2 col3
12 apple True
13 apple False
12 grape False
Thanks
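For reference, the sample input can be constructed like this (a sketch based on the tables above):

import pandas as pd

df = pd.DataFrame({'col1': [12, 13, 12], 'col2': ['apple', 'apple', 'grape']})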
You can try:
df['col3'] = (df['col1'] == 12) & (df['col2'].str.contains('apple'))
Output:
col1 col2 col3
0 12 apple True
1 13 apple False
2 12 grape False
You can simply do:
df['col3'] = df['col1'].eq(12) & df['col2'].str.contains('apple')
Or in two separate steps:
s = df.loc[df.col1.eq(12), 'col2'].str.lower().str.contains('apple', na=0)
df.loc[:, 'col3'] = s
df = df.fillna(False)
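Equivalently, np.where makes the conditional explicit (a sketch, not from the answers above):

import numpy as np

# Select per row: the contains() check where col1 == 12, False everywhere else.
df['col3'] = np.where(df['col1'].eq(12),
                      df['col2'].str.lower().str.contains('apple', na=False),
                      False)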