How to filter a pandas dataframe if a column is a list - pandas

In the dataframe that comes from
http://bit.ly/imdbratings
one column, actors_list, is a list of the actors in the movie.
How do I filter the dataframe for movies where Al Pacino took part?
e.g. [u'Marlon Brando', u'Al Pacino', u'James Caan']

You can filter with a boolean mask built by map. Suppose you are looking for actor number 32:
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'Actors': [[1, 2, 3], [2, 4, 3], [3, 4, 5, 32, 1],
                              [4, 5, 2, 3], [102, 302], [1, 2, 3, 32, 5]]})
# keep the rows whose Actors list contains 32
df[df['Actors'].map(lambda x: 32 in x)]
Output:
name Actors
2 C [3, 4, 5, 32, 1]
5 F [1, 2, 3, 32, 5]
Or, if you want to check whether at least one actor from your list appears in a movie, use any in combination with lambda:
important_actors = [32,3]
print(df[df['Actors'].map(lambda x: any(i in x for i in important_actors))])
Output:
name Actors
0 A [1, 2, 3]
1 B [2, 4, 3]
2 C [3, 4, 5, 32, 1]
3 D [4, 5, 2, 3]
5 F [1, 2, 3, 32, 5]
You can change any to all if you want to keep only the movies in which every listed actor appears, and so on. Feel free to leave a comment if you need further explanation or have any doubts.
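Applied to the original question, a minimal sketch (assuming the CSV behind the link stores actors_list as a stringified list, so it has to be parsed into real lists first) would be:
import ast
import pandas as pd

movies = pd.read_csv('http://bit.ly/imdbratings')
# assumption: actors_list arrives as a string like "[u'Marlon Brando', ...]",
# so convert it into an actual Python list before filtering
movies['actors_list'] = movies['actors_list'].apply(ast.literal_eval)
pacino = movies[movies['actors_list'].map(lambda x: 'Al Pacino' in x)]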

Alternatively, you can join each row's list into a single string and use str.contains:
l = [u'Marlon Brando', u'Al Pacino', u'James Caan']
m = df['actors_list'].str.join('|').str.contains('|'.join(l))
df = df[m]
Or
m = pd.DataFrame(df['actors_list'].tolist()).isin(l).any(axis=1)
df = df[m.values]
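Note that str.contains performs regex substring matching, so a partial string such as 'Caan' would also hit 'James Caan', and regex metacharacters in names would need escaping. If exact names are required, a per-row set intersection is a safe sketch:
# exact matching: a row qualifies only if a whole actor name from l appears
wanted = set(l)
m = df['actors_list'].map(lambda x: bool(wanted & set(x)))
df = df[m]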

Related

Finding values from different rows in pandas

I have one dataframe holding the data and another dataframe containing a single row of indices.
import pandas as pd

data = {'col_1': [4, 5, 6, 7], 'col_2': [3, 4, 9, 8], 'col_3': [5, 5, 6, 9], 'col_4': [8, 7, 6, 5]}
df = pd.DataFrame(data)
ind = {'ind_1': [2], 'ind_2': [1], 'ind_3': [3], 'ind_4': [2]}
ind = pd.DataFrame(ind)
Both have the same number of columns. I want to extract the values of df corresponding to the index stored in ind so that I get a single row at the end.
For this data it should be: [6, 4, 9, 6]. I tried df.loc[ind.loc[0]] but that of course gives me four different rows, not one.
The other idea I have is to zip columns and rows and iterate over them. But I feel there should be a simpler way.
You can go to the NumPy domain and index there:
In [14]: df.to_numpy()[ind, np.arange(len(df.columns))]
Out[14]: array([[6, 4, 9, 6]], dtype=int64)
this pairs up 2, 1, 3, 2 from ind and 0, 1, 2, 3 from 0 to number of columns - 1; so we get the values at [2, 0], [1, 1] and so on.
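Since ind is a one-row DataFrame, the result above is 2D; if you prefer a flat 1D array, pull the row out first. A small sketch:
import numpy as np
# use ind's single row as a per-column positional index
flat = df.to_numpy()[ind.to_numpy().ravel(), np.arange(len(df.columns))]
# flat is array([6, 4, 9, 6])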
There's also df.lookup but it's being deprecated, so...
In [19]: df.lookup(ind.iloc[0], df.columns)
~\Anaconda3\Scripts\ipython:1: FutureWarning: The 'lookup' method is deprecated and will be removed in a future version. You can use DataFrame.melt and DataFrame.loc as a substitute.
Out[19]: array([6, 4, 9, 6], dtype=int64)
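A deprecation-safe alternative that avoids both lookup and the NumPy round-trip is a plain comprehension over iat; a minimal sketch:
# pair each column position with its row index from ind and collect the scalars
values = [df.iat[r, c] for c, r in enumerate(ind.iloc[0])]
# values == [6, 4, 9, 6]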

pandas dataframe how to remove values from cell that is a list based on other column

I have a dataframe with 2 columns that each hold a list:
a  b  vals             locs
1  2  [1, 2, 3, 4, 5]  [2, 3]
5  1  [1, 7, 2, 4, 9]  [0, 1]
8  2  [1, 9, 4, 7, 8]  [3]
For each row, I want to exclude from the vals column all the positions listed in locs,
so I will get:
a  b  vals             locs    new_vals
1  2  [1, 2, 3, 4, 5]  [2, 3]  [1, 2, 5]
5  1  [1, 7, 2, 4, 9]  [0, 1]  [2, 4, 9]
8  2  [1, 9, 4, 7, 8]  [3]     [1, 9, 4, 8]
What is the best way to do so?
Thanks!
You can use a list comprehension with an internal filter based on enumerate:
df['new_vals'] = [[v for i, v in enumerate(a) if i not in b]
                  for a, b in zip(df['vals'], df['locs'])]
However, this quickly becomes inefficient when the lists in locs get large.
A much better approach is to use Python sets, which provide fast (O(1)) membership tests:
df['new_vals'] = [[v for i, v in enumerate(a) if i not in S]
                  for a, b in zip(df['vals'], df['locs']) for S in [set(b)]]
output:
a b vals locs new_vals
0 1 2 [1, 2, 3, 4, 5] [2, 3] [1, 2, 5]
1 5 1 [1, 7, 2, 4, 9] [0, 1] [2, 4, 9]
2 8 2 [1, 9, 4, 7, 8] [3] [1, 9, 4, 8]
Use a list comprehension with enumerate, converting the locs values to sets:
df['new_vals'] = [[z for i, z in enumerate(x) if i not in y]
                  for x, y in zip(df['vals'], df['locs'].apply(set))]
print(df)
a b vals locs new_vals
0 1 2 [1, 2, 3, 4, 5] [2, 3] [1, 2, 5]
1 5 1 [1, 7, 2, 4, 9] [0, 1] [2, 4, 9]
2 8 2 [1, 9, 4, 7, 8] [3] [1, 9, 4, 8]
One way to do this is to create a function that works on each row:
def func(row):
    # enumerate yields positions directly; list.index would always return the
    # first occurrence and misbehave when vals contains duplicate values
    return [v for i, v in enumerate(row['vals']) if i not in row['locs']]
Then call this function for each row using apply:
df['new_value'] = df.apply(func, axis=1)
This works well if the lists are short.
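If the lists grow, converting locs to a set inside the function keeps each membership test O(1); a minimal variant of the same apply approach:
def func(row):
    locs = set(row['locs'])  # set gives O(1) membership tests
    return [v for i, v in enumerate(row['vals']) if i not in locs]
df['new_value'] = df.apply(func, axis=1)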

How can we convert two pandas dataframe columns to a python list after merging the columns vertically?

I have a dataframe...
print(df)
Name ae_rank adf de_rank
a 1 lk 4
b 2 lp 5
c 3 yi 6
How can I concatenate the ae_rank and de_rank columns vertically and convert them into a python list?
Expectation...
my_list = [1, 2, 3, 4, 5, 6]
The simplest is to join the two lists:
my_list = df['ae_rank'].tolist() + df['de_rank'].tolist()
If you need to reshape the DataFrame, use DataFrame.melt:
my_list = df.melt(['Name','adf'])['value'].tolist()
print(my_list)
[1, 2, 3, 4, 5, 6]
Another option is
my_list = df[['ae_rank', 'de_rank']].T.stack().tolist()
#[1, 2, 3, 4, 5, 6]
Most efficiently, use filter to select the columns whose names contain "_rank", then ravel the underlying numpy array in 'F' (column-major) order:
my_list = df.filter(like='_rank').to_numpy().ravel('F').tolist()
output: [1, 2, 3, 4, 5, 6]
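For completeness, pd.concat stacks the two columns vertically without leaving pandas; a small sketch:
my_list = pd.concat([df['ae_rank'], df['de_rank']]).tolist()
# [1, 2, 3, 4, 5, 6]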

How to compute how many elements in three arrays in python are equal to some value in the same position between the arrays?

I have three numpy arrays
a = [0, 1, 2, 3, 4]
b = [5, 1, 7, 3, 9]
c = [10, 1, 3, 3, 1]
and I want to compute, position by position, how many elements of a, b, c are equal to 3; for this example the answer is 3, since at index 3 all three arrays hold the value 3.
An elegant solution is to use NumPy functions, like:
np.count_nonzero(np.vstack([a, b, c])==3, axis=0).max()
Details:
np.vstack([a, b, c]) - builds an array with 3 rows, composed of your 3 source arrays.
np.count_nonzero(... == 3, axis=0) - counts how many values of 3 occur in each column. For your data the result is array([0, 0, 1, 3, 0], dtype=int64).
max() - takes the greatest value, in your case 3.
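Putting it together as a self-contained snippet (note that np.vstack accepts plain lists as well as arrays):
import numpy as np

a = [0, 1, 2, 3, 4]
b = [5, 1, 7, 3, 9]
c = [10, 1, 3, 3, 1]

stacked = np.vstack([a, b, c])                    # shape (3, 5)
counts = np.count_nonzero(stacked == 3, axis=0)   # per-position counts of 3
print(counts.max())                               # 3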

How to remove certain array items in one column by using a condition array from another column in pandas

I am trying to trim the array in one of the columns by using a condition array built from another column. For example, I have data like the df below:
import pandas as pd

df = pd.DataFrame({'nHit': [4, 3, 5],
                   'hit_id': [[10, 20, 30, 50], [20, 40, 50], [30, 50, 60, 70, 80]],
                   'hit_val': [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11, 12]]}, index=[0, 1, 2])
I want to know if there is a way to filter the values in the hit_val column based on a condition on the hit_id array (such as keeping only the values at the same positions where hit_id is 30 or 50).
The output I expect is something like the df below:
df = pd.DataFrame({'nHit': [2, 1, 2],
                   'hit_id': [[30, 50], [50], [30, 50]],
                   'hit_val': [[3, 4], [7], [8, 9, 10]]}, index=[0, 1, 2])
My thought is to create a condition array from the hit_id column using df.apply() and then use it to filter hit_val; does anyone know how to implement this?
From what I understand, starting from the original df, you can explode both columns, filter on the condition, then groupby with agg as list:
l = [30, 50]
m = pd.concat([df[i].explode() for i in ['hit_id', 'hit_val']], axis=1)
out = m[m['hit_id'].isin(l)].groupby(level=0).agg(list)
out.insert(0, 'nHit', out['hit_id'].str.len())
print(out)
nHit hit_id hit_val
0 2 [30, 50] [3, 4]
1 1 [50] [7]
2 2 [30, 50] [8, 9]
Using a copy-and-paste of the two expressions (thanks), here are their displays, which should help us visualize the desired action:
In [247]: df
Out[247]:
nHit hit_id hit_val
0 4 [10, 20, 30, 50] [1, 2, 3, 4]
1 3 [20, 40, 50] [5, 6, 7]
2 5 [30, 50, 60, 70, 80] [8, 9, 10, 11, 12]
In [249]: df1
Out[249]:
nHit hit_id hit_val
0 2 [30, 50] [3, 4]
1 1 [50] [7]
2 2 [30, 50] [8, 9, 10]