Iterating over all columns of dataframe to find list of strings - pandas

Suppose I have the following df:
import pandas as pd

df = pd.DataFrame({
    'col1': ['x1', 'x2', 'x3'],
    'col2': ['y1', 'y2', 'y3'],
    'col3': ['z1', 'z2', 'z3'],
    'col4': ['a1', 'b2', 'c3']
})
and a list of elements:
l = ['x1','x2','y3']
I want to search for the elements of l across all the columns of my df. As it stands, x1 and x2 from my list appear in col1 and y3 appears in col2, so I did:
df.loc[df['col1'].apply(lambda x: any(i in x for i in l)) |
       df['col2'].apply(lambda x: any(i in x for i in l))]
which gives me
col1 col2 col3 col4
0 x1 y1 z1 a1
1 x2 y2 z2 b2
2 x3 y3 z3 c3
as desired, but this approach requires an explicit | for every column. How can I run the check over all columns efficiently without writing | for each one?

A much, much more efficient way of doing this would be to use numpy broadcasting.
import numpy as np

# l must be converted to a NumPy array for the broadcasted comparison to work
row_mask = (df.to_numpy() == np.asarray(l)[:, None, None]).sum(axis=0).any(axis=1)
filtered = df[row_mask]
Output:
>>> filtered
col1 col2 col3 col4
0 x1 y1 z1 a1
1 x2 y2 z2 b2
2 x3 y3 z3 c3
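A pandas-native alternative that is just as column-agnostic is DataFrame.isin, which, like the broadcasted comparison above, tests exact cell equality rather than substring containment. A minimal self-contained sketch using the question's data:

import pandas as pd

df = pd.DataFrame({
    'col1': ['x1', 'x2', 'x3'],
    'col2': ['y1', 'y2', 'y3'],
    'col3': ['z1', 'z2', 'z3'],
    'col4': ['a1', 'b2', 'c3']
})
l = ['x1', 'x2', 'y3']

# isin marks every cell equal to an element of l;
# any(axis=1) keeps rows with at least one match.
filtered = df[df.isin(l).any(axis=1)]
print(filtered)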

Related

Datapoints ordering: catplot seaborn

In the df, all the color types are together after sorting; however, in the first plot Y1 sits between the Xs (see below).
Adding col= to build a grid breaks the order up even further.
How can you keep the color types together on the plot?
I've tried hue_order, order, and col_order without any success.
Plot 1: need the orange X color types to be in order X3 > X2 > X1, not X3 > X2 > Y1 > X1
Plot 2: need all the color types to be in order (one after another)
Thank you in advance!
color_data.csv:
YEAR,SITE,G_annotation,color,Type
2018,Alpha,Y1,Y,A
2017,Alpha,X1,X,A
2016,Alpha,X2,X,B
2018,Alpha,X3,X,B
2017,Alpha,Z1,Z,B
2017,Alpha,T1,T,A
2018,Alpha,T2,T,A
2016,Alpha,T3,T,A
df = pd.read_csv('color_data.csv')
g = sns.catplot(data=df.sort_values(by='color'),
                x="Type", y="G_annotation", hue="color")

df = pd.read_csv('color_data.csv')
g = sns.catplot(data=df.sort_values(by='color'),
                x="Type", y="G_annotation", hue="color", col="YEAR")

Joining two DataFrames with Pandas, one with 1 row per key, and the other with several rows per key

First, I want to point out that I didn't find an answer to my question here on Stack Overflow or in the pandas documentation, so if this has been asked before, I'd appreciate a link to that thread.
I want to join two DataFrames as follows.
df1 =
key x y z
0 x0 y0 z0
1 x1 y1 z1
...
10 x10 y10 z10
df2 =
key w v u
0 w0 v0 u0
0 w0 v0 u0
0 w0 v0 u0
1 w1 v1 u1
1 w1 v1 u1
2 w2 v2 u2
3 w3 v3 u3
...
10 w10 v10 u10
10 w10 v10 u10
desired_df_output =
key x y z w v u
0 x0 y0 z0 w0 v0 u0
1 x1 y1 z1 w1 v1 u1
...
10 x10 y10 z10 w10 v10 u10
I've tried this df1.join(df2, how='inner', on='key'), but I get this error: TypeError: object of type 'NoneType' has no len().
Thanks
It seems df2 has duplicate values, so if you drop them with the drop_duplicates method and merge the result with df1, you get the desired outcome.
out = df1.merge(df2.drop_duplicates(), on='key')
Output:
key x y z w v u
0 0 x0 y0 z0 w0 v0 u0
1 1 x1 y1 z1 w1 v1 u1
2 10 x10 y10 z10 w10 v10 u10
import pandas as pd

df1 = pd.DataFrame({'k': [0, 1, 2, 3],
                    'x': ['x0', 'x1', 'x2', 'x3'],
                    'y': ['y0', 'y1', 'y2', 'y3'],
                    'z': ['z0', 'z1', 'z2', 'z3']})
df1.set_index('k', inplace=True)

df2 = pd.DataFrame({'k': [0, 0, 0, 1, 1, 1],
                    'v': ['v0', 'v0', 'v0', 'v1', 'v1', 'v1'],
                    'w': ['w0', 'w0', 'w0', 'w1', 'w1', 'w1'],
                    'u': ['u0', 'u0', 'u0', 'u1', 'u1', 'u1']})
df2.set_index('k', inplace=True)

df_merged = pd.merge(df1, df2.drop_duplicates(), how='inner',
                     left_index=True, right_index=True)
df_merged
x y z v w u
k
0 x0 y0 z0 v0 w0 u0
1 x1 y1 z1 v1 w1 u1
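As an aside on the original error: join() aligns the calling frame's on column with the other frame's index, so df1.join(df2, how='inner', on='key') fails when key is an ordinary (and overlapping) column in df2. A sketch of making join itself work, assuming the question's frames where key is a regular column in both:

# Deduplicate df2 and move 'key' into its index so join can align on it.
out = df1.join(df2.drop_duplicates().set_index('key'), on='key', how='inner')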

Finding max row after groupby in pandas dataframe

I have a dataframe as follows:
Month Col1 Col2 Val
A p a1 31
A q a1 78
A r b2 13
B x a1 54
B y b2 56
B z b2 65
I want to get the following:
Month a1 b2
A q r
B x z
Essentially, for each pair of Month and Col2, I want to find the value in Col1 that has the maximum Val.
I am not sure how to approach this.
Your problem has two steps:
find the row with the max Val within each group, which is sort_values plus drop_duplicates, and
transform the data, which is pivot:
(df.sort_values('Val')
   .drop_duplicates(['Month', 'Col2'], keep='last')
   .pivot(index='Month', columns='Col2', values='Col1')
)
Output:
Col2 a1 b2
Month
A q r
B x z
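An equivalent route swaps the sort/drop_duplicates step for groupby plus idxmax; a sketch, which, like the answer above, assumes you are happy with a single winner per group in case of ties:

# idxmax returns the index label of the max Val per (Month, Col2) group;
# loc keeps just those rows before pivoting.
out = (df.loc[df.groupby(['Month', 'Col2'])['Val'].idxmax()]
         .pivot(index='Month', columns='Col2', values='Col1'))
print(out)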

Pandas: Convert a row into a column and make all other entries the second column

I have a pandas DataFrame like this:
Col1 Col2 Col3 Col4
control x1 x2 x3 x4
obs1 o11 o12 o13 o14
obs2 o21 o22 o23 o24
...
obsn on1 on2 on3 on4
I want to reshape it as follows (column headers are not needed):
control Observation
1 x1 o11
2 x1 o12
3 x1 o13
...
m xk ok1
m+1 xk ok2
...
How do I go about this?
You can select your "control" row and use it to set your columns via set_axis; from there it's a simple melt.
The sort_values and reset_index aren't functionally necessary, but they align the dataframe with your expected output, so I've included them here:
control = df.loc["control", :]
observations = df.drop("control")

out = (observations.set_axis(control, axis=1)
                   .melt(value_name="observation")
                   .sort_values("observation")
                   .reset_index(drop=True))
print(out)
control observation
0 x1 o11
1 x2 o12
2 x3 o13
3 x4 o14
4 x1 o21
5 x2 o22
6 x3 o23
7 x4 o24
I think I have a crude solution but it's not elegant. Say df is my data frame.
mdf = df.melt()
for col in df.columns:
    mdf.loc[mdf['variable'] == col, 'variable'] = df.loc['control', col]
mdf.drop(mdf[mdf['variable'] == mdf['value']].index, inplace=True)
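The column loop in that crude version can be collapsed into a single vectorized map against the control row; a sketch, assuming the same df as above:

# Melt only the observation rows, then translate each original column
# name into its control value in one map call.
mdf = df.drop('control').melt()
mdf['variable'] = mdf['variable'].map(df.loc['control'])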

Pandas: set order of newly generated columns

I am working on some data and wrote code that splits the column COL on commas and writes the split pieces into new columns. Now I want the code to generate the new columns in the given order (desired output). The code is attached below. Thank you in advance.
Input
X1 COL Y1
----------------
A X,Y,Z 146#12
B Z 223#13
C Y,X 725#14
Current output:
X1 Y1 COL-0 COL-1 COL-2
-----------------------------
A 146#12 X Y Z
B 223#13 Z NaN NaN
C 725#14 Y X NaN
Desired output:
X1 COL-1 COL-2 COL-3 Y1
------------------------------
A X Y Z 146#12
B Z - - 223#13
C Y X - 725#14
Script:
import pandas as pd
import numpy as np

df = pd.read_csv(r"<PATH TO YOUR CSV>")
for row, item in enumerate(df["COL"]):
    l = item.split(",")
    for idx, elem in enumerate(l):
        col = "COL-%s" % idx
        if col not in df.columns:
            df[col] = np.nan
        df[col][row] = elem
df = df.drop(columns=["COL"])
print(df)
Use DataFrame.pop to move Y1 to the end:
df['Y1'] = df.pop('Y1')
The splitting itself is better done with Series.str.split:
df = (df.join(df.pop('COL').str.split(',', expand=True)
                .fillna('-')
                .rename(columns=lambda x: f'COL-{x+1}')))
df['Y1'] = df.pop('Y1')
print(df)
X1 COL-1 COL-2 COL-3 Y1
0 A X Y Z 146#12
1 B Z - - 223#13
2 C Y X - 725#14
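For completeness, a minimal self-contained run of that answer, with the input frame reconstructed from the question's table:

import pandas as pd

df = pd.DataFrame({'X1': ['A', 'B', 'C'],
                   'COL': ['X,Y,Z', 'Z', 'Y,X'],
                   'Y1': ['146#12', '223#13', '725#14']})

# pop removes COL and returns it; the split pieces are padded with '-',
# renamed COL-1..COL-3, joined back, and Y1 is moved to the end.
df = (df.join(df.pop('COL').str.split(',', expand=True)
                .fillna('-')
                .rename(columns=lambda x: f'COL-{x+1}')))
df['Y1'] = df.pop('Y1')
print(df)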
If you wish to replace the NaN values with dashes, you can use fillna(), and to keep the columns in the order you specified you can simply select the dataframe with that column order:
df_output = df[['X1', 'COL-1', 'COL-2', 'COL-3', 'Y1']].fillna(value='-')
Not the most elegant of methods, but this should handle your real data and produce the intended result:
import re

cols = df.filter(like='COL').columns.tolist()
pat = r'(\w+)'
new_cols = [f'{re.match(pat, col).group(0)} {i}' for i, col in enumerate(cols, 1)]
df.rename(columns=dict(zip(cols, new_cols)), inplace=True)
df['Y1'] = df.pop('Y1')
out:
X1 COL 1 COL 2 COL 3 Y1
0 A X Y Z 146#12
1 B Z NaN NaN 223#13
2 C Y X NaN 725#14