Advanced condition lookup in pandas(numpy) - pandas

given:
A list of elements 'ls' and a large DataFrame 'df'; every element of 'ls' appears somewhere in 'df'.
ls = ['a0','a1','a2','b0','b2','c0',...,'c_k']
df = [['a0','b0','c0'],
['a0','b0','c1'],
['a0','b0','c2'],
...
['a_i','b_j','c_k']]
goal:
I want to collect the set of rows of 'df' that contain the most elements of 'ls'; for example, ['a0','b0','c0'] would be the best one. However, it may be that no row contains more than 2 elements of 'ls'.
tried:
I tried enumerating combinations of 3 or 2 elements of 'ls', but it was too expensive and would probably return None, since some rows contain only 2 of the elements.
I tried to use a dictionary to count, but it didn't work either.
I've been puzzling over this problem all day, any help will be greatly appreciated.

I would go like this:
counts = df.apply(lambda x: x.isin(ls).sum(), axis=1)
This gives, for each row, the number of entries that appear in the list.
The rows with the most matches can then be obtained like so:
df[counts == counts.max()]
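As a minimal sketch of the counting idea, the following uses the vectorized DataFrame.isin on a small made-up frame and keeps every row tied for the highest count. The sample data and the names counts and best are illustrative assumptions, not taken from the question.

```python
import pandas as pd

# Made-up sample data for illustration
ls = ['a0', 'a1', 'a2', 'b0', 'b2', 'c0']
df = pd.DataFrame([
    ['a0', 'b0', 'c0'],
    ['a0', 'b0', 'c1'],
    ['a1', 'b1', 'c1'],
    ['a3', 'b3', 'c3'],
])

# Number of cells per row whose value appears in ls
counts = df.isin(ls).sum(axis=1)

# Keep all rows that reach the maximum count (ties included)
best = df[counts == counts.max()]
print(best)
```

Because the filter compares against counts.max(), it still returns rows when no row contains three elements of ls.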

Related

How to apply function to each column and row of dataframe pandas

I have two dataframes.
df1 has an index list made of strings like (row1,row2,..,rown) and a column list made of strings like (col1,col2,..,colm) while df2 has k rows and 3 columns (char_1,char_2,value). char_1 contains strings like df1 indexes while char_2 contains strings like df1 columns. I only want to assign the df2 value to df1 in the right position. For example if the first row of df2 reads ['row3','col1','value2'] I want to assign value2 to df1 in the position ([2,0]) (third row and first column).
I tried to use two functions to slide rows and columns of df1:
def func1(val):
    # first I convert the series to dataframe
    val=val.to_frame()
    val=val.reset_index()
    val=val.set_index('index') # I set the index so that it's the right column

    def func2(val2):
        try: # maybe the combination doesn't exist
            idx1=list(cou.index[df2[char_2]==(val2.name)]) #val2.name reads col name of df1
            idx2=list(cou.index[df2[char_1]==val2.index.values[0]]) #val2.index.values[0] reads index name of df1
            idx= list(reduce(set.intersection, map(set, [idx1,idx2])))
            idx=int(idx[0]) # final index of df2 where I need to take value to assign to df1
            check=1
        except:
            check=0
        if check==1: # if index exists
            val2[0]=df2['value'][idx] # assign value to df1
        return val2

    val=val.apply(func2,axis=1) #apply the function for columns
    val=val.squeeze() #convert again to series
    return val

df1=df1.apply(func1,axis=1) #apply the function for rows
I made the conversion inside func1 because without this step I wasn't able to work with series keeping index and column names so I wasn't able to find the index idx in func2.
Well, the problem is that it takes forever. df1 has shape (3'600 x 20'000) and df2 has shape (500 x 3), so it's not too much data. I really don't understand the problem. I ran the code for the first row and column to check the result: it's fine and takes 1 second, but the entire process has now been running for hours and still has not finished.
Is there a way to optimize it? As I wrote in the title I only need to run a function that keeps column and index names and works sliding the entire dataframe. Thanks in advance!
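Only the roughly 500 rows of df2 carry information, so there is no need to visit every cell of df1. One possible approach, sketched here under the assumptions that df1's index and columns are unique and that all of df1 shares one dtype, is to translate the labels in df2 into integer positions with Index.get_indexer and assign all values in a single NumPy indexing step. The tiny frames below are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of df1 and df2
df1 = pd.DataFrame(0.0, index=['row1', 'row2', 'row3'], columns=['col1', 'col2'])
df2 = pd.DataFrame({
    'char_1': ['row3', 'row1', 'rowX'],   # 'rowX' does not exist in df1
    'char_2': ['col1', 'col2', 'col1'],
    'value': [2.0, 5.0, 9.0],
})

# Integer positions of each label; -1 marks labels missing from df1
rows = df1.index.get_indexer(df2['char_1'])
cols = df1.columns.get_indexer(df2['char_2'])
valid = (rows >= 0) & (cols >= 0)

# Write all matching values at once into a copy of the underlying array
arr = df1.to_numpy(copy=True)
arr[rows[valid], cols[valid]] = df2['value'].to_numpy()[valid]
df1 = pd.DataFrame(arr, index=df1.index, columns=df1.columns)
print(df1)
```

The work now scales with the number of rows in df2 rather than with the 3'600 x 20'000 cells of df1, and combinations that do not exist in df1 are skipped through the valid mask instead of a try/except.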

Efficient way to map items from one column into new column in Pandas

Let's say if I have a Pandas df called df_1 where one of the rows looks like this:
id    rank_url_agg                         url_list
2223  ['gtech.com','gm.com', 'ford.com']   ['google.com','gtech.com','autoblog.com','gm.com', 'ford.com']
I want to create a new column called url_list_agg which does the following things for each row:
Iterate through the URLs in url_list
If URL doesn't exist in rank_url_agg in the same row, assign a value of 0.
If URL exists in rank_url_agg, then assign the value that corresponds to the difference between the length of the rank_url_agg list and the index of that URL in rank_url_agg.
Once done iterating through all URLs in url_list, wrap the results into a list.
So at the end, the first row in the new url_list_agg column will become [0,3,0,2,1].
I've tried running the following script (only to test the 1st row and not entire dataframe):
for item in agg_report['url_list'][0]:
    if item in agg_report['rank_url_agg'][0]:
        item=len(rank_url_agg[0]) - agg_report['rank_url_agg'][0].index(item)
    else:
        item=0
But when I checked agg_report['url_list'][0], it still returned just this list: ['google.com','gtech.com','autoblog.com','gm.com', 'ford.com']. So my code didn't work.
Any advice on how to achieve this goal for every row in the dataframe will be greatly appreciated!
You're not assigning back to the actual dataframe.
def idx(a, b):
    return [len(a) - a.index(x) if x in a else 0 for x in b]
df_1 = df_1.assign(url_list_agg=[*map(idx, df_1.rank_url_agg, df_1.url_list)])
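As a quick check, here is the same helper applied to a one-row frame rebuilt from the example in the question. The frame construction is an assumption about how the data is stored (Python lists inside each cell), and a plain zip is used in place of map purely for readability.

```python
import pandas as pd

def idx(a, b):
    # For each URL in b: len(a) minus its position in a, or 0 if it is absent from a
    return [len(a) - a.index(x) if x in a else 0 for x in b]

# One-row frame rebuilt from the question's example
df_1 = pd.DataFrame({
    'id': [2223],
    'rank_url_agg': [['gtech.com', 'gm.com', 'ford.com']],
    'url_list': [['google.com', 'gtech.com', 'autoblog.com', 'gm.com', 'ford.com']],
})

# Build the new column row by row and assign it back to the frame
df_1['url_list_agg'] = [idx(a, b) for a, b in zip(df_1['rank_url_agg'], df_1['url_list'])]
print(df_1['url_list_agg'].iloc[0])  # [0, 3, 0, 2, 1]
```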

Get True/False boolean list of row in pandas dataframe out of condition

I am working with several Pandas DataFrames and I need the following filtering:
Suppose I get a list like
L=['EP6','EP3','EP2']
I need to get the following vector of a row:
For the row 'concept 1': True where the column label is in L, False where it is not.
I am trying:
# D being the DataFrame
L=['EP6', 'EP3','EP2']
[True for ind in D.columns if ind in L ]
But I only get [True,True,True]
I need the complete list like:
desire_result=[0,0,0,0,1,0,0,1,1,0]
Note: the 1s in the desired result have nothing to do with the 1s that the DataFrame is populated with.
Thanks
pandas has isin for this:
D.columns.isin(L)
Your list comprehension acts as a filter: it yields True when ind is in L, and otherwise yields no element at all.
What you want is a mapping. You can still use a list comprehension, but the condition belongs in the expression part:
[ind in L for ind in D.columns]
or if you want integers:
[int(ind in L) for ind in D.columns]
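A small sketch comparing the two approaches on a made-up frame. The column labels here are hypothetical, since the original DataFrame is not shown.

```python
import pandas as pd

# Hypothetical frame; only the column labels matter here
D = pd.DataFrame(columns=['EP1', 'EP2', 'EP3', 'EP4', 'EP5'])
L = ['EP6', 'EP3', 'EP2']

# Vectorized membership test on the column index (NumPy boolean array)
mask = D.columns.isin(L)
as_ints = mask.astype(int).tolist()

# Equivalent list comprehension with the condition in the expression part
from_comprehension = [int(ind in L) for ind in D.columns]

print(as_ints)             # [0, 1, 1, 0, 0]
print(from_comprehension)  # [0, 1, 1, 0, 0]
```

Both results have one entry per column, in column order; 'EP6' is simply ignored because no column carries that label.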

How do I change the index of a pandas dataframe object so as not to get null values in the dataframe entries?

I do not seem to know what the issue is when I combined three dataframes into one and tried changing the index of the combined dataframe. The following is what I have done:
1) I first combined (concatenated) three dataframes into a 'combo' dataframe. Below is an excerpt ('TSP_JuMP_Obtained_Solu') of one of the three. The index runs from 0 to 9 for all three dataframes as well as for the combined one.
2) I then used the following line of code to combine them:
df_solu_tsp = pd.concat([list_TSP, list_Scenario1, list_Scenario2], axis=1,
                        sort=True)
3) I subsequently used the following line of code to change the index of the combined dataframe (df_solu_tsp):
df_solu_tsp = df_solu_tsp.reindex(proTy_uniq_list)
NB: 'proTy_uniq_list' is a list with membership as shown below:
[u'lau15', u'gr17', u'fri26', u'bays29', u'dantzig42', u'KATRINA_38',
u'HARVEY_50', u'HARVEY_100', u'HARVEY_200', u'HARVEY_415']
Below is the result of the combined dataframe (df_solu_tsp ):
Thank you in advance for the help.
Without an example DataFrame, I will try to answer as well as possible:
Solution 1
As Peter Leimbigler mentioned in the comments:
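df_solu_tsp = df_solu_tsp.set_index(pd.Index(proTy_uniq_list))
This replaces the original index with the new labels, which works here because the list has the same length as the dataframe. Wrapping the list in pd.Index matters: a plain list of strings passed to set_index is interpreted as a list of column names.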
Solution 2
As mentioned in the pandas docs
df_solu_tsp.set_index([pd.Index(proTy_uniq_list), 'proTy'])
Solution 3
I see that you're creating a dataframe from three lists, so we can go a step back and create your data in one go:
df_solu_tsp = pd.DataFrame({'TSP_JuMP_Obtained_Solu': list_TSP,
                            'Scenario1': list_Scenario1,
                            'Scenario2': list_Scenario2}, index=proTy_uniq_list)
Example solution 3
data1 = ['hi', 'goodbye']
data2 = ['hello', 'bye']
idx = ['arriving', 'leaving']
df = pd.DataFrame({'column1': data1,
                   'column2': data2}, index=idx)
print(df)
          column1 column2
arriving       hi   hello
leaving   goodbye     bye
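To see where the null values came from in the first place: reindex does not rename rows, it looks up the requested labels in the existing index and fills rows it cannot find with NaN. A minimal sketch on made-up data (the frame and the two labels below are illustrative only):

```python
import pandas as pd

# Made-up frame with the default integer index 0, 1
df = pd.DataFrame({'Scenario1': [1.0, 2.0], 'Scenario2': [3.0, 4.0]})
new_labels = ['lau15', 'gr17']

# reindex looks for rows labelled 'lau15' and 'gr17'; none exist, so every cell is NaN
reindexed = df.reindex(new_labels)

# Assigning to .index keeps the data and only swaps the labels
relabeled = df.copy()
relabeled.index = new_labels

print(reindexed)
print(relabeled)
```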

isin pandas doesn't show all values in dataframe

I am using the Amazon database for my research where I want to select the 100 most rated items. So first I have counted the values of the itemID's (asin)
data = amazon_data_parse('data/reviews_Movies_and_TV_5.json.gz')
unique, counts = np.unique(data['asin'], return_counts=True)
test = np.asarray((unique, counts)).T
test.sort(axis=1)
which gives:
array([[5, '0005019281'],
[5, '0005119367'],
[5, '0307141985'],
...,
[1974, 'B00LG7VVPO'],
[2110, 'B00LH9ROKM'],
[2213, 'B00LT1JHLW']], dtype=object)
It is clear that at least 6,000 rows should be selected. But if I run:
a= test[49952:50054,1]
a = a.tolist()
test2 = data[data.asin.isin(a)]
It only selects 2000 rows from the dataset. I have already tried multiple things, like filtering on only one asin, but it just doesn't seem to work. Can someone please help? If there is a better way to get a dataframe with the rows of the 100 most frequent values in the asin column, I would be glad to hear it too.
I found the solution, had to change the sorting line to:
test = test[test[:,1].argsort()]
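As for the "better option" asked about above, one possibility is to let pandas do the counting with value_counts, which returns counts sorted in descending order, and feed the top labels straight into isin. This is a sketch on made-up data; the 'rating' column and the top_n value are assumptions for illustration (the question would use top_n = 100).

```python
import pandas as pd

# Made-up review data; only the 'asin' column matters here
data = pd.DataFrame({
    'asin': ['A', 'B', 'A', 'C', 'A', 'B', 'D'],
    'rating': [5, 4, 3, 5, 2, 1, 4],
})
top_n = 2

# value_counts sorts by frequency (descending), so head(top_n) holds the most rated items
top_items = data['asin'].value_counts().head(top_n).index

# Keep every row whose asin is among the most frequent ones
subset = data[data['asin'].isin(top_items)]
print(subset)
```

This avoids building and sorting a mixed array of IDs and counts by hand, which is where the original sorting line went wrong.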