replace values for specific rows more efficiently in pandas / Python

I have two data frames. Based on a condition that I get from a list (whose length is 2 million), I find the rows that match that condition; then, for those rows, I replace the values in columns x and y of the first data frame with the values of x and y from the second data frame. Here is my code, but it is very slow and makes my computer freeze. Any idea how I can do this more efficiently?
for ids in List_id:
    a = df1.index[(df1['id'] == ids) == True].values[0]
    b = df2.index[(df2['id'] == ids) == True].values[0]
    df1['x'][a] = df2['x'][b]
    df1['y'][a] = df2['y'][b]
Thank you.
--
Example:
List_id = [1, 11, 12, 13]
ids = 1
a = df1.index[(df1['id'] == 1) == True].values[0]
print(a)  # 234
b = df2.index[(df2['id'] == 1) == True].values[0]
print(b)  # 789
df1['x'][a] = 0   # current value in df1
df2['x'][b] = 15  # current value in df2
So at the end I want, in my data frame 1:
df1['x'][a] = df2['x'][b]

Assuming you don't have repeated ids in either dataframe, you can try something like the below:
Step 1: filter df2.
Step 2: join df1 with the filtered frame.
Step 3: replace the values in the joined df and drop the extra columns.
df2_filtered = df2[df2['id'].isin(List_id)]
join_df = df1.set_index('id').join(df2_filtered.set_index('id'), rsuffix="_ignore", how='left')
# the columns coming from df2 will be null for unmatched ids; you can use that to find the rows that need updating
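A sketch of what step 3 could look like (assuming df2's value columns are x and y, so the left join above produced x_ignore and y_ignore):

# step 3 (sketch): take df2's value where the join found a match,
# otherwise keep df1's original value, then drop the helper columns
for col in ['x', 'y']:
    join_df[col] = join_df[col + '_ignore'].combine_first(join_df[col])
df1 = join_df.drop(columns=['x_ignore', 'y_ignore']).reset_index()

Because isin, join and combine_first are all vectorized, this avoids the per-id index lookups that make the original loop crawl on 2 million ids.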

Related

pandas data frame columns - how to select a subset of columns based on multiple criteria

Let us say I have the following columns in a data frame:
title
year
actor1
actor2
cast_count
actor1_fb_likes
actor2_fb_likes
movie_fb_likes
I want to select the following columns from the data frame and ignore the rest of the columns:
the first 2 columns (title and year)
some columns based on name - cast_count
some columns which contain the string "actor1" - actor1 and actor1_fb_likes
I am new to pandas. For each of the above operations I know which method to use, but I want to do all three operations together, since all I need for further analysis is a dataframe containing just these columns. How do I do this?
Here is example code that I have written:
import pandas as pd

data = {
"title":['Hamlet','Avatar','Spectre'],
"year":['1979','1985','2007'],
"actor1":['Christoph Waltz','Tom Hardy','Doug Walker'],
"actor2":['Rob Walker','Christian Bale ','Tom Hardy'],
"cast_count":['15','24','37'],
"actor1_fb_likes":[545,782,100],
"actor2_fb_likes":[50,78,35],
"movie_fb_likes":[1200,750,475],
}
df_input = pd.DataFrame(data)
print(df_input)
df1 = df_input.iloc[:, 0:2]  # select the first 2 columns
df2 = df_input[['cast_count']]  # select some columns by name - cast_count
df3 = df_input.filter(like='actor1')  # select columns which contain the string "actor1" - actor1 and actor1_fb_likes
df_output = pd.concat(df1, df2, df3)  # this throws an error that I don't understand
print(df_output)
Question 1:
df_1 = df[['title', 'year']]
Question 2:
# This is an example but you can put whatever criteria you'd like
df_2 = df[df['cast_count'] > 10]
Question 3:
# This is an example but you can put whatever criteria you'd like this way
df_3 = df[(df['actor1_fb_likes'] > 1000) & (df['actor1'] == 'actor1')]
Make sure each filter is contained within its own set of parentheses () before using the & or | operators. & acts as an and operator; | acts as an or operator.
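For completeness, here is one way the three selections from the question can be glued back together column-wise. This is only a sketch reusing the question's own frames; the concat error in the question comes from passing the frames as separate positional arguments instead of a single list (and from the default axis=0):

import pandas as pd

df1 = df_input.iloc[:, 0:2]            # first 2 columns: title and year
df2 = df_input[['cast_count']]         # select by name
df3 = df_input.filter(like='actor1')   # columns containing "actor1"

# concat expects a list of frames; axis=1 stacks them side by side
df_output = pd.concat([df1, df2, df3], axis=1)
print(df_output)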

How to select rows that have missing values in columns depending on conditions for dataframes?

I have a dataframe extracted from an Excel sheet.
I am looking for the rows that are NOT legit.
A legit row is such that it meets ANY of the following conditions:
exactly 1 column filled in but the other columns are empty or null
exactly 2 columns are filled in but the other columns are empty or null
exactly all 8 columns are filled in
So a non-legit row is the opposite of the above, such as:
7 of the 8 columns are filled in but one is empty
6 of the 8 columns are filled in but two are empty
and so on...
The 8 columns I am interested in are: columns A, B, D, E, F, G, I, L.
I only want to return those rows that are NOT legit.
I know how to find rows which are empty in specific columns but not sure how to find the non legit rows based on the above conditions.
empty_A = sheet[sheet[sheet.columns[0]].isnull()]
empty_B = sheet[sheet[sheet.columns[1]].isnull()]
empty_D = sheet[sheet[sheet.columns[3]].isnull()]
empty_E = sheet[sheet[sheet.columns[4]].isnull()]
empty_F = sheet[sheet[sheet.columns[5]].isnull()]
empty_G = sheet[sheet[sheet.columns[6]].isnull()]
empty_I = sheet[sheet[sheet.columns[8]].isnull()]
empty_L = sheet[sheet[sheet.columns[11]].isnull()]
print(empty_G)
UPDATE:
I solved using list comprehension
If you have already populated your dataframe, then you can do it like this:
import numpy as np
import pandas as pd
## Generate Random Data
raw_data=np.random.choice([None,1], (50,8))
raw_data= np.r_[raw_data, np.random.choice([None, 1,2,3], (50,8))]
## Create dataframe from random data
df = pd.DataFrame(raw_data, columns="A, B, D, E, F, G, I, L".split(", "))
notnull_counts = (~df.isnull()).sum(axis=1)
## filter rows with your condition
legit_rows = df[((notnull_counts==1) | (notnull_counts==2) | (notnull_counts==8))]
non_legit_rows = df[~((notnull_counts==1) | (notnull_counts==2) | (notnull_counts==8))]
display(legit_rows)
It seems like you want to count the number of null values in these 8 particular columns and select rows based on how many nulls are found. That phrasing suggests summing and selecting based on that sum. Most pandas operations default to performing columnwise operations, so you need to tell sum() to perform the sum for each row by using axis="columns", like so:
# This is a series indexed like df.
# It counts the number of null values in the given columns.
n_null = df[["A", "B", "D", "E", "F", "G", "I", "L"]].isnull().sum(axis="columns")
# This selects the rows where n_null has one of the non-legit counts.
df_notlegit = df.loc[n_null.isin([1, 2, 3, 4, 5, 8])]
# This is another way to do it: negate the legit null counts (0, 6 or 7).
df_nonlegit = df.loc[~n_null.isin([0, 6, 7])]
# This picks out the legit rows (0, 6 or 7 nulls across the 8 columns); wrap the combined mask in ~ to get the non-legit rows instead
df.loc[(df.isna().sum(axis=1) == 0) | (df.isna().sum(axis=1) == 7) | (df.isna().sum(axis=1) == 6)]

Replace a subset of pandas data frame with another data frame

I have a data frame (DF1) with 100 columns (one of the columns is ID).
I have one more data frame (DF2) with 30 columns (one column is ID).
I have to update the first 30 columns of DF1 with the values in DF2, keeping the values in the remaining columns of DF1 intact.
In other words: update the first 30 column values in DF1 (out of the 100 columns) whenever the ID in DF2 is present in DF1.
I tested this on Python 3.7 but I see no reason for it not to work on 2.7:
joined = df1.reset_index()[['index', 'ID']].merge(df2, on='ID')
df1.loc[joined['index'], df1.columns[:30]] = joined.drop(columns=['index', 'ID'])
This assumes that df2 doesn't have a column called index or the merge will fail saying duplicate key with suffix.
Here is a slow-motion view of its inner workings:
df1.reset_index() returns a dataframe that is the same as df1 but with an additional column: index
[['index', 'ID']] extracts a dataframe containing just these 2 columns from the dataframe in #1
.merge(...) merges with df2, matching on ID. The result (joined) is a dataframe with 32 columns: index, ID and the original 30 columns of df2.
df1.loc[<row_indexes>, <column_names>] = <another_dataframe> means you want to replace those particular cells with data from another_dataframe. Since joined has 32 columns, we need to drop the extra 2 (index and ID).
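An alternative sketch (it assumes the 30 columns of DF2 carry the same names as the corresponding columns of DF1, which the merge approach above does not need): let pandas do the ID matching itself with DataFrame.update.

# Align both frames on ID; update() overwrites only cells whose row (ID)
# and column labels exist in both frames, leaving the other 70 columns intact.
# Note that update() skips NaN values coming from DF2.
df1_indexed = df1.set_index('ID')
df1_indexed.update(df2.set_index('ID'))
df1 = df1_indexed.reset_index()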

Pandas dataframes subsetting performance optimization

I need to subset dataframe rows on the basis of multiple conditions. Each condition is described by a set of columns. Say there are columns
size_10ml
size_20ml
size_30ml
and there will be a 1 in only one of those columns and zeroes in all the others.
So to choose items (rows) by size and brand, I will pass [["size_10ml", "size_20ml"], ["brand_A", "brand_E"]] to the following function:
import pandas as pd
from logging import error  # assumption: error() in the original comes from logging

def any_of_intersect_columns(df, *column_lists):
    """Choose rows by ANDing multiple conditions, i.e. choose rows having a
    nonzero value in at least one of the columns in every set.

    column_lists : each argument is an iterable of column labels.
        A row meets a condition if any of the labeled columns from the current
        list is true. The rows from each condition (list) are then intersected.

    Returns
    -------
    df : subset of df rows
    """
    by_row = df
    for columns in column_lists:
        # choose columns of interest
        try:
            by_col = df[columns]
            # keep rows evaluating True in at least one of the chosen columns
            by_row = by_row.loc[by_col.any(axis=1), :]
        except KeyError:
            error("None of the columns has labels {}".format(columns))
            by_row = pd.DataFrame()
    # return all rows if nothing fits the conditions
    return by_row if by_row.shape[0] else df
The function is called a few times with different condition "levels" to choose one item, and there are many items, all from one table. I need ways to optimize this, since it is the performance bottleneck.
Data and output example:
>>> df
   size_10ml  size_20ml  brand_A  brand_E  property_1
0          1          0        1        0           0
1          0          1        0        1           1
2          0          1        1        0           0
>>> any_of_intersect_columns(df, [["size_10ml", "size_20ml"], ["brand_A"]])
[0, 2]
Finally, it would be possible to refactor to have string property values in the columns instead of ones and zeroes, but I think that would only slow things down.
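One possible direction, sketched rather than benchmarked: build one boolean array per condition with .to_numpy() and AND the masks together, so the frame is sliced only once at the end. The KeyError handling from the original is dropped for brevity, and the column lists are passed unpacked to match the *column_lists signature.

import numpy as np

def any_of_intersect_columns_fast(df, *column_lists):
    """Rows with a nonzero value in at least one column of every list."""
    mask = np.ones(len(df), dtype=bool)
    for columns in column_lists:
        # one boolean array per condition; no intermediate DataFrame slices
        mask &= df[columns].to_numpy().any(axis=1)
    # return all rows if nothing fits the conditions, as in the original
    return df[mask] if mask.any() else df

# e.g. any_of_intersect_columns_fast(df, ["size_10ml", "size_20ml"], ["brand_A"])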

Pandas dataframe row removal

I am trying to repair a csv file.
Some data rows need to be removed based on a couple conditions.
Say you have the following dataframe:
-A----B-----C
000---0-----0
000---1-----0
001---0-----1
011---1-----0
001---1-----1
If two or more rows have column A in common, I want to keep the row that has column B set to 1.
The resulting dataframe should look like this:
-A----B-----C
000---1-----0
011---1-----0
001---1-----1
I've experimented with merges and drop_duplicates but cannot seem to get the result I need. It is not certain that the row with column B = 1 will come after a row with B = 0. The take_last argument of drop_duplicates seemed attractive, but I don't think it applies here.
Any advice will be greatly appreciated. Thank you.
Not straightforward, but this should work:
DF = pd.DataFrame({'A' : [0,0,1,11,1], 'B' : [0,1,0,1,1], 'C' : [0,0,1,0,1]})
DF.loc[DF.groupby('A').apply(lambda df: df[df.B == 1].index[0] if len(df) > 1 else df.index[0])]
A B C
1 0 1 0
4 1 1 1
3 11 1 0
Notes:
groupby divides DF into groups of rows with unique A values i.e. groups with A = 0 (2 rows), A=1 (2 rows) and A=11 (1 row)
Apply then calls the function on each group and assimilates the results
In the function (lambda) I'm looking for the index of row with value B == 1 if there is more than one row in the group, else I use the index of the default row
The result of apply is a list of index values that represent rows with B==1 if more than one row in the group else the default row for given A
The index values are then used to access the corresponding rows via the loc operator.
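A shorter route using drop_duplicates, which the question already mentions (a sketch; it assumes you don't care which row survives when several rows in an A group have B == 1): sort so the B == 1 rows land last within each A, then keep the last duplicate.

import pandas as pd

DF = pd.DataFrame({'A': [0, 0, 1, 11, 1], 'B': [0, 1, 0, 1, 1], 'C': [0, 0, 1, 0, 1]})

# within each A, rows with B == 1 sort last, so keep='last' retains them
result = DF.sort_values(['A', 'B']).drop_duplicates(subset='A', keep='last')
print(result)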
I was able to weave my way around pandas to get the result I want.
It's not pretty, but it gets the job done:
# (grouped is assumed to be something like df.groupby('CARD_NO'))
res = pd.DataFrame(columns=('CARD_NO', 'STATUS'))
for i in grouped.groups:
    if len(grouped.groups[i]) > 1:
        card_no = i
        print(card_no)
        for a in grouped.groups[card_no]:
            status = df.iloc[a]['STATUS']
            print('iloc:' + str(a) + '\t' + 'status:' + str(status))
            if status == 1:
                print('yes')
                row = pd.DataFrame([dict(CARD_NO=card_no, STATUS=status)])
                res = res.append(row, ignore_index=True)
            else:
                print('no')
    else:
        # only 1 record found
        # could be a status of 0 or 1
        # add to dataframe
        print('UNIQUE RECORD')
        card_no = i
        print(card_no)
        status = df.iloc[grouped.groups[card_no][0]]['STATUS']
        print(grouped.groups[card_no][0])
        # print(status)
        print('iloc:' + str(grouped.groups[card_no][0]) + '\t' + 'status:' + str(status))
        row = pd.DataFrame([dict(CARD_NO=card_no, STATUS=status)])
        res = res.append(row, ignore_index=True)
print(res)