Smart / pythonic way to find which columns contain / match another column - pandas

My question title sounds a little cryptic, so I hope the example makes it clear.
I have a value in column "FindMe", and I want to know whether it appears in either of the options in "Search1" or "Search2". The logic I have works (though if it's present in both Search1 and Search2 I know I have an issue):
import pandas as pd
import numpy as np

data = {"Search1": ["one_two", "two_ten", "five_ten"],
        "Search2": ["three_four", "one_four", "two_twelve"],
        "FindMe": ["three", "one", "nine"]}
df = pd.DataFrame(data)
df["Present1"] = df.apply(lambda x: str(x.FindMe) in str(x.Search1), axis=1)
df["Present2"] = df.apply(lambda x: str(x.FindMe) in str(x.Search2), axis=1)
df["Present"] = np.where(df.apply(lambda x: str(x.FindMe) in str(x.Search1), axis=1) == 1,
                         df.Search1,
                         np.where(df.apply(lambda x: str(x.FindMe) in str(x.Search2), axis=1) == 1,
                                  df.Search2, ""))
print(df)
Like I say, my "Present" column works as it should, returning the value of the column where the match is found. In reality I have far more columns to check, and yes, I could keep nesting np.where calls, but it feels like there should be a better solution.
Any thoughts?
J

A list comprehension would do the job:
df['Present'] = [[s for s in l if w in s] for l, w in
                 zip(df.filter(like='Search').to_numpy(), df['FindMe'])]
Search1 Search2 FindMe Present
0 one_two three_four three [three_four]
1 two_ten one_four one [one_four]
2 five_ten two_twelve nine []
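If you would rather get a single value per row (the first matching column's value, or an empty string as your np.where version produces), a small variation on the same comprehension should do it; this is just a sketch of the same idea using next with a default:
# take the first Search value containing FindMe, or "" when there is none
df['Present'] = [next((s for s in l if w in s), '') for l, w in
                 zip(df.filter(like='Search').to_numpy(), df['FindMe'])]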


pandas add multiple columns with apply [duplicate]

I am currently projecting latitude/longitude coordinates onto a Cartesian plane in my pandas data frame. My projection method looks like this:
def convert_lat_long_xy(lat, lo):
    # ... projection logic ...
    return x, y
This returns a tuple, and I can use the method on my dataframe as:
df.apply(lambda x: convert_lat_long_xy(x.latitude, x.longitude), axis=1)
Now, what I would like to do is create two extra columns in my data frame, called 'x' and 'y', to hold these values. I know I can do something like:
df['proj'] = df.apply(lambda x: convert_lat_long_xy(x.latitude, x.longitude), axis=1)
But is it possible to add the values to two different columns?
Yes, you need to convert the output of the lambda into a pd.Series. Here's an example:
In [1]: import pandas as pd
In [2]: pd.DataFrame(["1,2", "2,3"], columns=["coord"])
Out[2]:
coord
0 1,2
1 2,3
In [3]: df = pd.DataFrame(["1,2", "2,3"], columns=["coord"])
In [4]: df.apply(lambda x: pd.Series(x["coord"].split(",")), axis=1)
Out[4]:
0 1
0 1 2
1 2 3
In [5]: df[["x", "y"]] = df.apply(lambda x: pd.Series(x["coord"].split(",")), axis=1)
In [6]: df
Out[6]:
coord x y
0 1,2 1 2
1 2,3 2 3
For your particular case, the df.apply call becomes:
df[['x', 'y']] = df.apply(lambda x: pd.Series(convert_lat_long_xy(x.latitude, x.longitude)), axis=1)
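On reasonably recent pandas (0.23+, if I remember correctly), a variant sketch that avoids building a pd.Series inside the lambda is to let apply expand the returned tuple itself; this still assumes the convert_lat_long_xy function from the question:
# result_type='expand' spreads the returned (x, y) tuple across two columns
df[['x', 'y']] = df.apply(
    lambda row: convert_lat_long_xy(row['latitude'], row['longitude']),
    axis=1,
    result_type='expand')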

Discrepancies in Pandas groupby aggregates vs dataframe, particularly on axis=1

import pandas as pd
import numpy as np

def main():
    df = pd.DataFrame([["a", "b", "c", "k"], ["d", "e", "f", "l"], ["g", "h", "i", "J"]],
                      columns=["ay", "be", "ce", "jay"])
    print(df)
    gb1 = df.groupby({"ay": "x", "be": "x"}, axis=1)
    gb2 = df.groupby({"ay": "x", "be": "x", "ce": "y", "jay": "y"}, axis=1)
    print("apply sum by axis 0")
    #print(df.apply(sum))
    print("fails")
    print("apply sum by axis 1")
    # print(df.apply(sum, axis=1))
    print("fails")
    print("agg sum by axis 0")
    print(df.agg(sum))
    print("agg sum by axis 1")
    print(df.agg(sum, axis=1))
    print("gb1 apply sum axis 1")
    print(gb1.apply(sum))
    print("gb1 agg sum axis 1")
    print(gb1.agg(sum))
    print("gb2 apply sum axis 1")
    # print(gb2.apply(sum))
    print("fails")
    print("gb2 agg sum axis 1")
    print(gb2.agg(sum))
    print(gb1.agg(lambda x: ";".join([x[0], x[1]]))

if __name__ == "__main__":
    main()
I don't understand the failures that occur, and I don't understand why apply on the groups fails with two groups but not with one.
I've solved my overall goal (I was trying to concatenate the strings of some columns together), but it concerns me that I'm somewhat bewildered by these failures.
For reference, the driving goal was to be able to do
gb1.agg(lambda x: ";".join(x))
and I also don't understand why that doesn't work, especially since
gb1.agg(lambda x: ";".join([x[0], x[1]]) does
There's a lot to unpack in there.
print("apply sum by axis 0")
#print(df.apply(sum))
print("fails")
print("apply sum by axis 1")
# print(df.apply(sum, axis=1))
print("fails")
...the above are failing because you're apply-ing the built-in Python sum function, which expects numeric types (it starts from 0 and can't add strings to that). You could use either of the following to fix it (which, I think, works because numpy/pandas can handle the object dtype these string columns get converted to):
df.apply(np.sum)
df.sum()
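Keep in mind that on this all-string frame, "summing" a column is just string concatenation, so the fixed version should print something roughly like:
print(df.sum())
# ay     adg
# be     beh
# ce     cfi
# jay    klJ
# dtype: object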
Next, these two print statements say axis=1, but the calls themselves aren't actually passing it:
print("gb1 apply sum axis 1")
print(gb1.apply(sum))
print("gb2 apply sum axis 1")
# print(gb2.apply(sum))
print("fails")
...if you add axis=1 they'll work and give sensible results.
Note that you have a missing closing parenthesis in:
gb1.agg(lambda x: ";".join([x[0], x[1]])
...both in the sample code and in the later comment about it.
It seems like you're saying that the final bit of code is what accomplishes your goal. The previous attempt:
gb1.agg(lambda x: ";".join(x))
...is joining the items in the index of the one group that is present instead of the individual series. Examine:
print(gb1.groups)
Finally, given your dataframe, if what you wanted to do was concatenate columns with ";" between them, you could also do:
cols = ['ay','be']
df.apply(lambda x: ";".join((x[c] for c in cols)), axis=1)
or for a small number of items,
df['concat'] = df['ay'] + ";" + df['be']
...rather than using groupby.
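If the real goal is just a row-wise string join over a handful of columns, another sketch (assuming the selected columns all hold strings) is to let agg do the join directly, with no groupby at all:
cols = ['ay', 'be']
# ';'.join is applied to each row of the selected columns, giving e.g. 'a;b'
df['concat'] = df[cols].agg(';'.join, axis=1)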

get less correlated variable names

I have a dataset (50 columns, 100 rows).
I also have 50 variable names, 0, 1, 2, ..., 49, for the 50 columns.
I have to find the less-correlated variables, say those with correlation < 0.7.
I tried as follows:
import os, glob, time, numpy as np, pandas as pd
data = np.random.randint(1,99,size=(100, 50))
dataframe = pd.DataFrame(data)
print (dataframe.shape)
codes = np.arange(50).astype(str)
dataframe.columns = codes
corr = dataframe.corr()
corr = corr.unstack().sort_values()
print (corr)
corr = corr.values
indices = np.where(corr < 0.7)
print (indices)
res = codes[indices[0]].tolist() + codes[indices[1]].tolist()
print (len(res))
res = list(set(res))
print (len(res))
The result is 50 (all variables!), which is unexpected.
How can I solve this?
As mentioned in the comments, your question is somewhat ambiguous. First, there is the possibility that no column pair is correlated. Second, the unstacking doesn't make sense, because you create an index array that you can't use directly on your 2D array. Third (which should really come first, but I was blind to it): as @AmiTavory mentioned, there is no point in "correlating names".
The correlation procedure per se works, as you can see in the following example:
import numpy as np
import pandas as pd
A = np.arange(100).reshape(25, 4)
#random order in column 2, i.e. a low correlation to the first columns
np.random.shuffle(A[:,2])
#flip column 3 to create a negative correlation with the first columns
A[:,3] = np.flipud(A[:,3])
#column 1 is unchanged, therefore positively correlated to column 0
df = pd.DataFrame(A)
print(df)
#establish a correlation matrix
corr = df.corr()
#retrieve index of pairs below a certain value
#use only the upper triangle with np.triu to filter for symmetric solutions
#use np.abs to take also negative correlation into account
res = np.argwhere(np.triu(np.abs(corr.values) <0.7))
print(res)
Output:
[[0 2]
[1 2]
[2 3]]
As expected, column 2 is the only one that is not correlated to any other, meaning that all the other columns are correlated with each other.
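If you want the pairs as column labels rather than positions (for instance the string codes 0..49 from your own dataframe), a possible follow-up step, sketched here, is to translate the argwhere indices back through corr.columns:
# map the positional pairs from np.argwhere back to column labels
pairs = [(corr.columns[i], corr.columns[j]) for i, j in res]
print(pairs)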

search and compare data between dataframes

I have an issue with merging data frames.
I have two data frames, as follows:
df1:
ID name-group status
1 bob,david good
2 CC,robben good
3 jack bad
df2:
ID leader location
2 robben JAPAN
3 jack USA
4 bob UK
I want to get a result as follows:
dft
ID name-group Leader location
1 bob,david
2 CC,robben Robben JAPAN
3 jack Jack USA
the [Leader] and [location] will be merged when
[leader] in df2 **IN** [name-group] of df1
&
[ID] of df2 **=** [ID] of df1
I have tried a for loop, but it is far too slow.
any ideas for this issue?
Thanks
See the end of the post for runnable code. The proposed solution is in the function using_tidy.
The main problem here is that having multiple names in name-group, separated
by commas, makes searching for membership difficult. If, instead, df1 had each
member of name-group in its own row, then testing for membership would be
easy. That is, suppose df1 looked like this:
ID name-group status
0 1 bob good
0 1 david good
1 2 CC good
1 2 robben good
2 3 jack bad
Then you could simply merge df1 and df2 on ID and test if leader
equals name-group... almost (see why "almost" below).
Putting df1 in tidy format
is the main idea in the solution below. The reason why it improves performance
is because testing for equality between two columns is much much faster than
testing if a column of strings are substrings of another column of strings, or
are members of a column containing a list of strings.
The reason why I said "almost" above is because there is another difficulty --
after merging df1 and df2 on ID, some rows are leaderless such as the bob,david row:
ID name-group Leader location
1 bob,david
Since we simply want to keep these rows and we don't want to test if criteria #1 holds in this case, we need to treat these rows differently -- don't expand them.
We can handle this problem by separating the leaderless rows from those with potential leaders (see below).
The second criterion, that the IDs match, is easy to enforce by merging df1 and df2 on ID:
dft = pd.merge(df1, df2, on='ID', how='left')
The first criterion is that dft['leader'] is in dft['name-group'].
This criteria could be expressed as
In [293]: dft.apply(lambda x: pd.isnull(x['leader']) or (x['leader'] in x['name-group'].split(',')), axis=1)
Out[293]:
0 True
1 True
2 True
dtype: bool
but using dft.apply(..., axis=1) calls the lambda function once for each
row. This can be very slow if there are many rows in dft.
If there are many rows in dft we can do better by first converting dft to
tidy format -- placing each member in dft['name-group'] on its own row. But first, let's split dft into two
sub-DataFrames: those rows which have a leader, and those which don't:
has_leader = pd.notnull(dft['leader'])
leaderless, leaders = dft.loc[~has_leader, :], dft.loc[has_leader, :]
Now put the leaders in tidy format (one member per row):
member = leaders['name-group'].str.split(',', expand=True)
member = member.stack()
member.index = member.index.droplevel(1)
member.name = 'member'
leaders = pd.concat([member, leaders], axis=1)
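As a tiny standalone illustration of the split/stack/droplevel pattern used here (toy data, not the question's frames), the expansion turns one row per comma-separated string into one row per member:
s = pd.Series(['CC,robben', 'jack'])
member = s.str.split(',', expand=True).stack()
member.index = member.index.droplevel(1)
print(member)
# roughly:
# 0        CC
# 0    robben
# 1      jack
# dtype: object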
The payoff for all this work is that criteria #1 can now be expressed by a fast calculation:
# this enforces criteria #1 (leader of df2 is in name-group of df1)
mask = (leaders['leader'] == leaders['member'])
leaders = leaders.loc[mask, :]
leaders = leaders.drop('member', axis=1)
and the desired result is:
dft = pd.concat([leaderless, leaders], axis=0)
We had to do some work to get df1 into tidy format. We need to benchmark to
determine if the cost of doing that extra work pays off by being able to compute criteria #1 faster.
Here is a benchmark using largish dataframes of 1000 rows for df1 and df2:
In [356]: %timeit using_tidy(df1, df2)
100 loops, best of 3: 17.8 ms per loop
In [357]: %timeit using_apply(df1, df2)
10 loops, best of 3: 98.2 ms per loop
The speed advantage of using_tidy over using_apply increases as the number
of rows in pd.merge(df1, df2, on='ID', how='left') increases.
Here is the setup for the benchmark:
import string
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name-group': ['bob,david', 'CC,robben', 'jack'],
                    'status': ['good', 'good', 'bad'],
                    'ID': [1, 2, 3]})
df2 = pd.DataFrame({'leader': ['robben', 'jack', 'bob'],
                    'location': ['JAPAN', 'USA', 'UK'],
                    'ID': [2, 3, 4]})

def using_apply(df1, df2):
    dft = pd.merge(df1, df2, on='ID', how='left')
    mask = dft.apply(lambda x: pd.isnull(x['leader']) or (x['leader'] in x['name-group'].split(',')), axis=1)
    return dft.loc[mask, :]

def using_tidy(df1, df2):
    # this enforces criteria #2 (the IDs are the same)
    dft = pd.merge(df1, df2, on='ID', how='left')
    # split dft into 2 sub-DataFrames, based on rows which have a leader and those which do not
    has_leader = pd.notnull(dft['leader'])
    leaderless, leaders = dft.loc[~has_leader, :], dft.loc[has_leader, :]
    # expand leaders so each member in name-group has its own row
    member = leaders['name-group'].str.split(',', expand=True)
    member = member.stack()
    member.index = member.index.droplevel(1)
    member.name = 'member'
    leaders = pd.concat([member, leaders], axis=1)
    # this enforces criteria #1 (leader of df2 is in name-group of df1)
    mask = (leaders['leader'] == leaders['member'])
    leaders = leaders.loc[mask, :]
    leaders = leaders.drop('member', axis=1)
    dft = pd.concat([leaderless, leaders], axis=0)
    return dft

def make_random_str_array(letters=string.ascii_uppercase, strlen=10, size=100):
    return (np.random.choice(list(letters), size*strlen)
            .view('|U{}'.format(strlen)))

def make_dfs(N=1000):
    names = make_random_str_array(strlen=4, size=10)
    df1 = pd.DataFrame({
        'name-group': [','.join(np.random.choice(names, size=np.random.randint(1, 10), replace=False))
                       for i in range(N)],
        'status': np.random.choice(['good', 'bad'], size=N),
        'ID': np.random.randint(4, size=N)})
    df2 = pd.DataFrame({
        'leader': np.random.choice(names, size=N),
        'location': np.random.randint(10, size=N),
        'ID': np.random.randint(4, size=N)})
    return df1, df2

df1, df2 = make_dfs()
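As a quick sanity check before the timings (note that make_dfs overwrites the small df1/df2 defined above), you could rebuild the question's small frames and confirm both functions return the expected rows:
# rebuild the small example from the question and compare both approaches
small_df1 = pd.DataFrame({'name-group': ['bob,david', 'CC,robben', 'jack'],
                          'status': ['good', 'good', 'bad'],
                          'ID': [1, 2, 3]})
small_df2 = pd.DataFrame({'leader': ['robben', 'jack', 'bob'],
                          'location': ['JAPAN', 'USA', 'UK'],
                          'ID': [2, 3, 4]})
print(using_tidy(small_df1, small_df2))
print(using_apply(small_df1, small_df2))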
Why don't you use
dft = pd.merge(df1, df2, how='left', left_on=['ID'], right_on=['ID'])

How to avoid temporary variables when creating new column via groupby.apply

I would like to create a new column newcol in a dataframe df as the result of
df.groupby('keycol').apply(somefunc)
The obvious:
df['newcol'] = df.groupby('keycol').apply(somefunc)
does not work: either df['newcol'] ends up containing all NaNs (which is certainly not what the RHS evaluates to), or some exception is raised (the details of the exception vary wildly depending on what somefunc returns).
I have tried many variations of the above, including stuff like
import pandas as pd
df['newcol'] = pd.Series(df.groupby('keycol').apply(somefunc), index=df.index)
They all fail.
The only thing that has worked requires defining an intermediate variable:
import pandas as pd
tmp = df.groupby('keycol').apply(lambda x: pd.Series(somefunc(x)))
tmp.index = df.index
df['rank'] = tmp
Is there a way to achieve this without having to create an intermediate variable?
(The documentation for GroupBy.apply is almost content-free.)
Let's build up an example and I think I can illustrate why your first attempts are failing:
Example data:
import numpy as np
import pandas as pd

n = 25
df = pd.DataFrame({'expenditure': np.random.choice(['foo', 'bar'], n),
                   'groupid': np.random.choice(['one', 'two'], n),
                   'coef': np.random.randn(n)})
print(df.head(10))
results in:
coef expenditure groupid
0 0.874076 bar one
1 -0.972586 foo two
2 -0.003457 bar one
3 -0.893106 bar one
4 -0.387922 bar two
5 -0.109405 bar two
6 1.275657 foo two
7 -0.318801 foo two
8 -1.134889 bar two
9 1.812964 foo two
So if we apply a simple function, mean, to the grouped data we get the following:
df2 = df.groupby('groupid').apply(np.mean)
print(df2)
Which is:
coef
groupid
one -0.215539
two 0.149459
So the dataframe above is indexed by groupid and has one column, coef.
What you tried to do first was, effectively, the following:
df['newcol'] = df2
That gives all NaNs for newcol. Honestly, I have no idea why that doesn't throw an error, and I'm not sure why it would produce anything at all. I think what you really want to do is merge df2 back into df.
To merge df and df2 we need to remove the index from df2, rename the new column, and then merge:
df2 = df.groupby('groupid').apply(np.mean)
df2.reset_index(inplace=True)
df2.columns = ['groupid', 'newcol']
df.merge(df2)
which I think is what you were after.
This is such a common idiom that pandas includes the transform method, which wraps all this up into a much simpler syntax:
df['newcol'] = df.groupby('groupid').transform(np.mean)
print(df.head())
results:
coef expenditure groupid newcol
0 1.705825 foo one -0.025112
1 -0.608750 bar one -0.025112
2 -1.215015 bar one -0.025112
3 -0.831478 foo two -0.073560
4 2.174040 bar one -0.025112
See the pandas groupby documentation for a fuller treatment of transform.
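On current pandas versions, a minimal sketch of the same idea, selecting the column explicitly and passing the aggregation by name so that only coef is transformed, would be:
# broadcast the per-group mean of 'coef' back onto every row
df['newcol'] = df.groupby('groupid')['coef'].transform('mean')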