Efficient way to map items from one column into new column in Pandas - pandas

Let's say if I have a Pandas df called df_1 where one of the rows looks like this:
id
rank_url_agg
url_list
2223
['gtech.com','gm.com', 'ford.com']
['google.com','gtech.com','autoblog.com','gm.com', 'ford.com']
I want to create a new column called url_list_agg which does the following things for each row:
Iterate through the URLs in url_list
If URL doesn't exist in rank_url_agg in the same row, assign a value of 0.
If URL exists in rank_url_agg, then assign the value that corresponds to the difference between the length of the rank_url_agg list and the index of that URL in rank_url_agg.
Once done iterating through all URLs in url_list, wrap the results into a list.
So at the end, the first row in the new url_list_agg column will become [0,3,0,2,1].
I've tried running the following script (only to test the 1st row and not entire dataframe):
for item in agg_report['url_list'][0]:
if item in agg_report['rank_url_agg'][0]:
item=len(rank_url_agg[0]) - agg_report['rank_url_agg'][0].index(item)
else:
item=0
But when I checked agg_report['url_list'][0], it still returned just this list: ['google.com','gtech.com','autoblog.com','gm.com', 'ford.com']. So my code didn't work.
Any advice on how to achieve this goal for every row in the dataframe will be greatly appreciated!

You're not assigning back to the actual dataframe.
def idx(a, b):
return [len(a) - a.index(x) if x in a else 0 for x in b]
df_1 = df_1.assign(url_list_agg=[*map(idx, df_1.rank_url_agg, df_1.url_list)])

Related

How to apply function to each column and row of dataframe pandas

I have two dataframes.
df1 has an index list made of strings like (row1,row2,..,rown) and a column list made of strings like (col1,col2,..,colm) while df2 has k rows and 3 columns (char_1,char_2,value). char_1 contains strings like df1 indexes while char_2 contains strings like df1 columns. I only want to assign the df2 value to df1 in the right position. For example if the first row of df2 reads ['row3','col1','value2'] I want to assign value2 to df1 in the position ([2,0]) (third row and first column).
I tried to use two functions to slide rows and columns of df1:
def func1(val):
# first I convert the series to dataframe
val=val.to_frame()
val=val.reset_index()
val=val.set_index('index') # I set the index so that it's the right column
def func2(val2):
try: # maybe the combination doesn't exist
idx1=list(cou.index[df2[char_2]==(val2.name)]) #val2.name reads col name of df1
idx2=list(cou.index[df2[char_1]==val2.index.values[0]]) #val2.index.values[0] reads index name of df1
idx= list(reduce(set.intersection, map(set, [idx1,idx2])))
idx=int(idx[0]) # final index of df2 where I need to take value to assign to df1
check=1
except:
check=0
if check==1: # if index exists
val2[0]=df2['value'][idx] # assign value to df1
return val2
val=val.apply(func2,axis=1) #apply the function for columns
val=val.squeeze() #convert again to series
return val
df1=df1.apply(func1,axis=1) #apply the function for rows
I made the conversion inside func1 because without this step I wasn't able to work with series keeping index and column names so I wasn't able to find the index idx in func2.
Well the problem is that it takes forever. df1 size is (3'600 X 20'000) and df2 is ( 500 X 3 ) so it's not too much. I really don't understand the problem.. I run the code for the first row and column to check the result and it's fine and it takes 1 second, but now for the entire process I've been waiting for hours and it's still not finished.
Is there a way to optimize it? As I wrote in the title I only need to run a function that keeps column and index names and works sliding the entire dataframe. Thanks in advance!

How to broadcast a list of data into dataframe (Or multiIndex )

I have a big dataframe its about 200k of rows and 3 columns (x, y, z). Some rows doesn't have y,z values and just have x value. I want to make a new column that first set of data with z value be 1,second one be 2,then 3, etc. Or make a multiIndex same format.
Following image shows what I mean
Like this image
I made a new column called "NO." and put zero as initial value. Then
I tried to record the index of where I want the new column get a new value. with following code
df = pd.read_fwf(path, header=None, names=['x','y','z'])
df['NO.']=0
index_NO_changed = df.index[df['z'].isnull()]
Then I loop through it and change the number:
for i in range(len(index_NO_changed)-1):
df['NO.'].iloc[index_NO_changed[i]:index_NO_changed[i+1]]=i+1
df['NO.'].iloc[index_NO_changed[-1]:]=len(index_NO_changed)
But the problem is I get a warning that "
A value is trying to be set on a copy of a slice from a DataFrame
I was wondering
Is there any better way? Is creating multiIndex instead of adding another column easier considering size of dataframe?

How can I create a pandas column based on another pandas column that has for values a list?

I am working with a dataframe and one of the columns has for values a list of strings in each row. The list contains a number of links (each list can have a different number of links). I want to create a new column that will be based on this column of lists but keep only the links that have the keyword "uploads".
To my example, the first entry of the column is like that:
['https://seekingalpha.com/instablog/5006891-hfir/4960045-natural-gas-daily',
'https://seekingalpha.com/article/4116929-weekly-natural-gas-storage-report',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647719993095_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647854075453_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-1509065004154725_origin.png',
'https://seekingalpha.com/account/research/subscribe?slug=hfir-energy&sasource=upsell']
And I want to keep only
['https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647719993095_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647854075453_origin.png',
'https://static.seekingalpha.com/uploads/2017/10/26/5006891-1509065004154725_origin.png']
And put the clean version in a new column of the same dataframe.
Can you please suggest a way to do it?
I just found a way where I create a function that looks within a list for a specific pattern (in my case the keyword "uploads")
def clean_alt_list(list_):
list_ = [s for s in list_ if "uploads" in s]
return list_
And then I apply this function into the column I am interested in
df['clean_links'] = df['links'].apply(clean_alt_list)
IIUC, this should work for you:
df = pd.DataFrame({'url': [['https://seekingalpha.com/instablog/5006891-hfir/4960045-natural-gas-daily', 'https://seekingalpha.com/article/4116929-weekly-natural-gas-storage-report', 'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647719993095_origin.png', 'https://static.seekingalpha.com/uploads/2017/10/26/5006891-15090647854075453_origin.png', 'https://static.seekingalpha.com/uploads/2017/10/26/5006891-1509065004154725_origin.png', 'https://seekingalpha.com/account/research/subscribe?slug=hfir-energy&sasource=upsell']]})
df = df.explode('url').reset_index(drop=True)
df[df['url'].str.contains('uploads')]
Result:
url
2 https://static.seekingalpha.com/uploads/2017/1...
3 https://static.seekingalpha.com/uploads/2017/1...
4 https://static.seekingalpha.com/uploads/2017/1...

Pandas Dataframe: How to get the cell instead of is value

I have a task to compare two dataframe with same columns name but different size, we can call it previous and current. I am trying to get the difference between (previous and current) in the Quantity and Booked Columns and highlight it as yellow. The common key between the two dataframe would be the 'SN' columns
I have coded out the following
for idx, rows in df_n.iterrows():
if rows["Quantity"] == rows['Available'] + rows['Booked']:
continue
else:
rows["Quantity"] = rows["Quantity"] - rows['Available'] - rows['Booked']
df_n.loc[idx, 'Quantity'].style.applymap('background-color: yellow')
# pdb.set_trace()
if (df_o['Booked'][df_o['SN'] == rows["SN"]] != rows['Booked']).bool():
df_n.loc[idx, 'Booked'].style.apply('background-color: yellow')
I realise I have a few problems here and need some help
df_n.loc[idx, 'Quantity'] returns value instead of a dataframe type. How can I get a dataframe from one cell. Do I have to pd.DataFrame(data=df_n.loc[idx, 'Quantity'], index=idx, columns ='Quantity'). Will this create a copy or will update the reference?
How do I compare the SN of both dataframe, looking for a better way to compare. One thing I could think of is to use set index for both dataframe and when finished using them, reset them back?
My dataframe:
Previous dataframe
Current Dataframe
df_n.loc[idx, 'Quantity'] returns value instead of a dataframe type.
How can I get a dataframe from one cell. Do I have to
pd.DataFrame(data=df_n.loc[idx, 'Quantity'], index=idx, columns
='Quantity'). Will this create a copy or will update the reference?
To create a DataFrame from one cell you can try: df_n.loc[idx, ['Quantity']].to_frame().T
How do I compare the SN of both dataframe, looking for a better way to
compare. One thing I could think of is to use set index for both
dataframe and when finished using them, reset them back?
You can use df_n.merge(df_o, on='S/N') to merge dataframes and 'compare' columns.

delete rows from pandas data frame that contains one of its columns as list , when one of its values match value in another compared list

delete rows from pandas data frame that contains one of its columns as list , when one of its values match value in another compared list column in another data frame.
here is the first data frame column: enter image description here
and the other data frame column is here: enter image description here
I have tried a lot of codes
Revdf=Revdf.drop(lambda x: [i for i in Revdf.AffiliationHistory if i in Authdf.Affiliations.values], axis=1)
or
Revdf=Revdf[~(Revdf.AffiliationHistory.isin(Authdf.Affiliations.values))]
but these can't help
There has to be an easier way, but i wrote a function for it and it works:
def remove_row(df1,x1,y1,df2,x2,y2):
assert type(df1.loc[x1,y1])==list,"type have to be list"
assert type(df2.loc[x2,y2])==list,"type have to be list"
flag =False
l1=df1.loc[x1,y1]
print(l1)
l2=df2.loc[x2,y2]
print(l2)
for i in l1:
if i in l2:
flag=True
break
if flag==True:
return df1.drop(x1)
else:
return df1
x is the row index, y is the column name, tried it on synthetic data and it works:
df1=pd.DataFrame({'col1':[0,0,0,0,1],
'col2':[[1,2,3,4],0,0,0,0]})
df2=pd.DataFrame({'col1':[0,0,0,0],
'col2':[[0,0,0,4],0,0,0]})
remove_row(df1,0,'col2',df2,0,'col2')
Also, i think a mistake you're making is this:
[1,2,3,4] in [0,1,2,3,4]
will return false, because you're asking if the second list contains the first.