Pandas dataframe subsetting performance optimization

I need to subset dataframe rows on the basis of multiple conditions. Each condition is described by a set of columns. Say, there are columns
size_10ml
size_20ml
size_30ml
and exactly one of these columns will contain a 1, with zeroes in all the others.
So to choose items (rows) by size and brand I will pass the lists ["size_10ml", "size_20ml"] and ["brand_A", "brand_E"] to the following function:
import logging

import pandas as pd


def any_of_intersect_columns(df, *column_lists):
    """Choose rows by ANDing multiple conditions, i.e. keep rows having a
    nonzero value in at least one of the columns of every set.

    column_lists : each argument is an iterable of column labels.
        A row meets a condition if any of the labeled columns is true.
        The rows satisfying each condition (list) are then intersected.

    Return
    ------
    df : subset of df rows
    """
    by_row = df
    for columns in column_lists:
        # choose columns of interest
        try:
            by_col = df[columns]
            # keep rows evaluating True in at least one of the chosen columns
            by_row = by_row.loc[by_col.any(axis=1), :]
        except KeyError:
            logging.error("None of columns has labels {}".format(columns))
            by_row = pd.DataFrame()
    # return all rows if nothing fits the conditions
    return by_row if by_row.shape[0] else df
The function is called several times, with different condition "levels", to choose one item, and there are many items, all drawn from one table. I need to optimize this, since it is the performance bottleneck.
Data and output example:
>>> df
   size_10ml  size_20ml  brand_A  brand_E  property_1
0          1          0        1        0           0
1          0          1        0        1           1
2          0          1        1        0           0
>>> any_of_intersect_columns(df, ["size_10ml", "size_20ml"], ["brand_A"])
# returns the rows with index 0 and 2
Finally, it would be possible to refactor the table to hold string property values in the columns instead of ones and zeroes, but I suspect that would only slow things down.
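For reference, here is a sketch of one direction I have been considering (not benchmarked, and with the error handling omitted; the name any_of_intersect_columns_v2 is only for illustration): build every per-condition mask first and index the frame only once at the end.
import numpy as np


def any_of_intersect_columns_v2(df, *column_lists):
    # One boolean mask per condition: True where any of the listed columns
    # is nonzero for that row.
    masks = [df[list(columns)].any(axis=1) for columns in column_lists]
    if not masks:
        return df
    # AND all condition masks together in a single pass.
    combined = np.logical_and.reduce(masks)
    subset = df[combined]
    # keep the original behaviour: return everything if nothing matches
    return subset if len(subset) else df
Whether this is actually faster probably depends on how many conditions and columns are involved.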

Related

Joining two data frames on column name and comparing result side by side

I have two data frames which look like df1 and df2 below and I want to create df3 as shown.
I could do this using a left join to get all the rows into one dataframe and then a numpy.where to see whether they match or not.
I can get what I want, but I feel there should be a more elegant way that eliminates renaming columns, reshuffling columns in the dataframe and then using np.where.
Is there a better way to do this?
Code to reproduce the dataframes:
import pandas as pd

df1 = pd.DataFrame({'product': ['apples', 'bananas', 'oranges', 'pineapples'],
                    'price': [1, 2, 3, 7], 'quantity': [5, 7, 11, 4]})
df2 = pd.DataFrame({'product': ['apples', 'bananas', 'oranges'],
                    'price': [2, 2, 4], 'quantity': [5, 7, 13]})
df3 = pd.DataFrame({'product': ['apples', 'bananas', 'oranges'],
                    'price_df1': [1, 2, 3], 'price_df2': [2, 2, 4],
                    'price_match': ['No', 'Yes', 'No'],
                    'quantity_df1': [5, 7, 11], 'quantity_df2': [5, 7, 13],
                    'quantity_match': ['Yes', 'Yes', 'No']})
An elegant way to do your task is to:
generate "partial" DataFrames from each source column,
and then concatenate them.
The first step is to define a function that joins 2 source columns and appends a "match" column:
import numpy as np

def myJoin(s1, s2):
    rv = s1.to_frame().join(s2.to_frame(), how='inner',
                            lsuffix='_df1', rsuffix='_df2')
    rv[s1.name + '_match'] = np.where(rv.iloc[:, 0] == rv.iloc[:, 1], 'Yes', 'No')
    return rv
Then, from df1 and df2, generate 2 auxiliary DataFrames setting product as the index:
wrk1 = df1.set_index('product')
wrk2 = df2.set_index('product')
And the final step is:
result = pd.concat([myJoin(wrk1[col], wrk2[col]) for col in wrk1.columns],
                   axis=1).reset_index()
Details:
for col in wrk1.columns - generates names of columns to join.
myJoin(wrk1[col], wrk2[col]) - generates the partial result for this column from
both source DataFrames.
[…] - a list comprehension, collecting the above partial results in a list.
pd.concat(…) - concatenates these partial results into the final result.
reset_index() - converts the index (product names) into a regular column.
For your source data, the result is:
   product  price_df1  price_df2 price_match  quantity_df1  quantity_df2 quantity_match
0   apples          1          2          No             5             5            Yes
1  bananas          2          2         Yes             7             7            Yes
2  oranges          3          4          No            11            13             No

replace values for specific rows more efficiently in pandas / Python

I have two data frames. Based on a list of ids (whose length is 2 million) I find the rows that match, and for those rows I replace the values in columns x and y of the first data frame with the values of x and y from the second data frame. Here is my code, but it is very slow and makes my computer freeze. Any idea how I can do this more efficiently?
for ids in List_id:
    a = df1.index[df1['id'] == ids].values[0]
    b = df2.index[df2['id'] == ids].values[0]
    df1['x'][a] = df2['x'][b]
    df1['y'][a] = df2['y'][b]
thank you
--
Example:
List_id = [1, 11, 12, 13]
ids = 1
a = df1.index[df1['id'] == 1].values[0]
print(a)   # 234
b = df2.index[df2['id'] == 1].values[0]
print(b)   # 789
df1['x'][a] = 0
df2['x'][b] = 15
So at the end I want, in my data frame 1:
df1['x'][a] = df2['x'][b]   # i.e. 15 in this example
Assuming you don't have repeated ids in either dataframe, you can try something like the following:
Step 1: filter df2.
Step 2: join df1 with the filtered frame.
Step 3: replace the values in the joined df and drop the extra columns (a sketch follows after the code).
df2_filtered = df2[df2['id'].isin(List_id)]
join_df = df1.set_index('id').join(df2_filtered.set_index('id'), rsuffix="_ignore", how='left')
# other columns from df2 will be null; you can use that to find the rows which need to be updated
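A possible sketch of step 3, assuming both frames have the columns 'id', 'x' and 'y' and that the values to copy are not NaN in df2 (the name df1_updated is only for illustration): the suffixed columns are non-null exactly where df2 matched, so they can overwrite the originals and then be dropped.
# where df2 matched, x_ignore / y_ignore hold the new values; elsewhere
# they are NaN, so fillna keeps the original df1 values
join_df['x'] = join_df['x_ignore'].fillna(join_df['x'])
join_df['y'] = join_df['y_ignore'].fillna(join_df['y'])
df1_updated = join_df.drop(columns=['x_ignore', 'y_ignore']).reset_index()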

Sum pandas columns, excluding some rows based on other column values

I'm attempting to determine the number of widget failures from a test population.
Each widget can fail in 0, 1, or multiple ways. I'd like to calculate the number of failures for each failure mode, but once a widget is known to have failed, it should be excluded from subsequent sums. In other words, the failure modes are known and ordered. If a widget fails via mode 1 and mode 3, I don't care about mode 3: I just want to count mode 1.
I have a dataframe with one row per item, and one column per failure mode. If the widget fails in that mode, the column value is 1, else it is 0.
d = {"item_1":
{"failure_1":0, "failure_2":0},
"item_2":
{"failure_1":1, "failure_2":0},
"item_3":
{"failure_1":0, "failure_2":1},
"item_4":
{"failure_1":1, "failure_2":1}}
df = pd.DataFrame(d).T
display(df)
Output:
        failure_1  failure_2
item_1          0          0
item_2          1          0
item_3          0          1
item_4          1          1
If I just want to sum the columns, that's easy: df.sum(). And if I want to calculate percentage failures, easy too: df.sum()/len(df). But this counts widgets that fail in multiple ways, multiple times. For the problem stated, the best I can come up with is this:
# create an empty df to store the results
df2 = pd.DataFrame(columns=["total_failures"])
for col in df.columns:
    # create a row, named after the column, and assign it the value of the column sum
    df2.loc[col] = df[col].sum()
    # drop the rows whose value in this column equals 1
    df = df.loc[df[col] != 1]
display(df2)
Output:
           total_failures
failure_1               2
failure_2               1
This requires creating another dataframe (that's fine), but it also requires iterating over the existing dataframe's columns and deleting a few of its rows at a time. If the dataframe takes a while to generate, or is needed for future calculations, this is not workable. I can deal with iterating over the columns.
Is there a way to do this without deleting the original df, or making a temporary copy? (Not workable with large data sets.)
You can do a cumsum on axis=1 and, wherever the value is greater than 1, mask it as 0 and then take the sum:
out = df.mask(df.cumsum(axis=1).gt(1), 0).sum().to_frame('total_failures')
print(out)
           total_failures
failure_1               2
failure_2               1
This way the original df is retained too.
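To see what each piece of that expression produces on the example df (the names step1 and step2 are only for illustration):
step1 = df.cumsum(axis=1)        # row-wise running count of failures
#         failure_1  failure_2
# item_1          0          0
# item_2          1          1
# item_3          0          1
# item_4          1          2   <- item_4's second failure
step2 = df.mask(step1.gt(1), 0)  # zero out every failure after the first
#         failure_1  failure_2
# item_1          0          0
# item_2          1          0
# item_3          0          1
# item_4          1          0
out = step2.sum().to_frame('total_failures')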

Deleting/Selecting rows from pandas based on conditions on multiple columns

From a pandas dataframe, I need to delete specific rows based on a condition applied on two columns of the dataframe.
The dataframe is
          0         1         2         3
0 -0.225730 -1.376075  0.187749  0.763307
1  0.031392  0.752496 -1.504769 -1.247581
2 -0.442992 -0.323782 -0.710859 -0.502574
3 -0.948055 -0.224910 -1.337001  3.328741
4  1.879985 -0.968238  1.229118 -1.044477
5  0.440025 -0.809856 -0.336522  0.787792
6  1.499040  0.195022  0.387194  0.952725
7 -0.923592 -1.394025 -0.623201 -0.738013
I need to delete some rows where the difference between column 1 and column 2 is less than a threshold t.
abs(column1.iloc[index]-column2.iloc[index]) < t
I have seen examples where conditions are applied individually on column values but did not find anything where a row is deleted based on a condition applied on multiple columns.
First select the columns by position with DataFrame.iloc, subtract, take Series.abs, compare against the threshold with the inverted operator (< becomes >= or >) and filter by boolean indexing:
df = df[(df.iloc[:, 0]-df.iloc[:, 1]).abs() >= t]
If you need to select the columns by name, here 0 and 1:
df = df[(df[0]-df[1]).abs() >= t]
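For example, with the sample frame above and a threshold picked just for illustration, t = 1.0, only the rows where the absolute difference between the first two columns is at least 1 survive (here rows 0, 4, 5 and 6):
t = 1.0  # illustrative threshold
df = df[(df.iloc[:, 0] - df.iloc[:, 1]).abs() >= t]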

Adding lists stored in dataframe

I have two dataframes as:
df1.ix[1:3]
DateTime
2018-01-02    [-0.0031537018416199097, 0.006451397621428631,...
2018-01-03    [-0.0028882814454597745, -0.005829869983964528...

df2.ix[1:3]
DateTime
2018-01-02    [-0.03285881500135208, -0.027806145786217932, ...
2018-01-03    [-0.0001314381449719178, -0.006278235444742629...

len(df1.ix['2018-01-02'][0])
500
len(df2.ix['2018-01-02'][0])
500
When I do df1 + df2 I get:
len((df1 + df2).ix['2018-01-02'][0])
1000
So instead of being summed, the lists are being concatenated.
How do I add the lists in the dataframes df1 and df2 element-wise?
When an operation is applied between two dataframes, it is broadcast at the element level. The element in your case is a list, and when the '+' operator is applied to two lists it concatenates them. That's why the resulting dataframe contains concatenated lists.
There are multiple approaches for actually summing up the elements of the lists instead of concatenating them.
One approach is converting the list elements into columns, adding the dataframes, and then merging the columns back into a single list (which has been suggested in the other answer, but in a wrong way).
Step 1: Converting list elements to columns
df1 = df1.apply(lambda row: pd.Series(row[0]), axis=1)
df2 = df2.apply(lambda row: pd.Series(row[0]), axis=1)
We need to pass row[0] instead of row to get rid of the column index associated with the Series.
Step 2: Add dataframes
df = df1 + df2  # this dataframe will have 500 columns
Step 3: Merge columns back to lists
df = df.apply(lambda row: pd.Series({0: list(row)}), axis=1)
This is the interesting part. Why are we returning a Series here? Why doesn't simply returning list(row) work, leaving the 500 columns behind?
The reason is: if the length of the returned list equals the number of columns, the list gets fitted into the columns and it seems as if nothing happened. Whereas if the length of the list is not equal to the number of columns, it is returned as a single list.
Let's look at an example.
Suppose I have a dataframe with columns 0, 1 and 2:
df = pd.DataFrame({0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]})
   0  1  2
0  1  4  7
1  2  5  8
2  3  6  9
The number of columns in the original dataframe is 3. If I return a list with two elements, it works and a Series is returned:
df1 = df.apply(lambda row: [row[0], row[1]], axis=1)
0    [1, 4]
1    [2, 5]
2    [3, 6]
dtype: object
If instead I try to return a list of three numbers, it gets fitted into the columns:
df1 = df.apply(list, axis=1)
   0  1  2
0  1  4  7
1  2  5  8
2  3  6  9
So if we want to return a list of the same size as the number of columns, we have to return it in the form of a Series, where one row's value is the whole list.
Another approach is to bring the column of one dataframe into the other and then add the columns using apply:
import numpy as np

df1[1] = df2[0]
df = df1.apply(lambda r: list(np.array(r[0]) + np.array(r[1])), axis=1)
We can take advantage of numpy arrays here: the '+' operator on numpy arrays sums corresponding values and gives back a single numpy array.
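For instance, a tiny illustration of that element-wise behaviour:
>>> np.array([1, 2, 3]) + np.array([10, 20, 30])
array([11, 22, 33])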
Cast them to series so that they become columns, then add your dfs:
df1 = df1.apply(pd.Series, axis=1)
df2 = df2.apply(pd.Series, axis=1)
df1 + df2