Sum pandas columns, excluding some rows based on other column values - pandas

I'm attempting to determine the number of widget failures from a test population.
Each widget can fail in 0, 1, or multiple ways. I'd like to calculate the number of failures for each failure mode, but once a widget is known to have failed, it should be excluded from later sums. In other words, the failure modes are known and ordered. If a widget fails via mode 1 and mode 3, I don't care about mode 3: I just want to count mode 1.
I have a dataframe with one row per item, and one column per failure mode. If the widget fails in that mode, the column value is 1, else it is 0.
import pandas as pd
d = {"item_1": {"failure_1": 0, "failure_2": 0},
     "item_2": {"failure_1": 1, "failure_2": 0},
     "item_3": {"failure_1": 0, "failure_2": 1},
     "item_4": {"failure_1": 1, "failure_2": 1}}
df = pd.DataFrame(d).T
display(df)
Output:
        failure_1  failure_2
item_1          0          0
item_2          1          0
item_3          0          1
item_4          1          1
If I just want to sum the columns, that's easy: df.sum(). And if I want to calculate percentage failures, easy too: df.sum()/len(df). But this counts widgets that fail in multiple ways, multiple times. For the problem stated, the best I can come up with is this:
# create empty df to store results
df2 = pd.DataFrame(columns=["total_failures"])
for col in df.columns:
    # create a row, named after the column, and assign it the value of the sum
    df2.loc[col] = df[col].sum()
    # drop rows in the df column that are equal to 1
    df = df.loc[df[col] != 1]
display(df2)
Output:
           total_failures
failure_1               2
failure_2               1
This requires creating another dataframe (that's fine), but it also requires iterating over the existing dataframe's columns and deleting rows from it a few at a time. If the dataframe takes a while to generate, or is needed for later calculations, this is not workable. I can live with iterating over the columns.
Is there a way to do this without destroying the original df or making a temporary copy? (A copy is not workable with large data sets.)

You can take a cumsum along axis=1, replace every value with 0 wherever that running total is greater than 1 (the widget has already failed in an earlier mode), and then sum the columns:
out = df.mask(df.cumsum(axis=1).gt(1), 0).sum().to_frame('total_failures')
print(out)
           total_failures
failure_1               2
failure_2               1
This way the original df is retained too.
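To see how the mask behaves with more than two failure modes, here is a small self-contained sketch. The third column, failure_3, and the fifth item are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical test population: one row per widget, one 0/1 column per failure mode.
df = pd.DataFrame(
    {"failure_1": [0, 1, 0, 1, 0],
     "failure_2": [0, 0, 1, 1, 1],
     "failure_3": [0, 1, 1, 0, 0]},
    index=["item_1", "item_2", "item_3", "item_4", "item_5"],
)

# Running count of failures per widget, left to right across the ordered modes.
running = df.cumsum(axis=1)

# Once the running count exceeds 1, the widget had already failed in an
# earlier mode, so that cell is replaced by 0. df itself is not modified.
first_only = df.mask(running.gt(1), 0)

out = first_only.sum().to_frame("total_failures")
print(out)
```

Each row of first_only contains at most a single 1, in the column of the first failure mode that widget hit, so the column sums count every failed widget exactly once.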

Related

Rolling apply lambda function based on condition

I have a dataframe with normalised (to 100) returns for 18 products (columns). I want to apply a lambda function which multiplies the next row by the previous row.
I can do :
df= df.rolling(2).apply(lambda x: (x[0]*x[1]),raw=True)
But some of my columns don't have values on row 1 (they go live on row 4). So I need to either:
1. Have a lambda function that starts only on row 4 yet applies to the entire df. I can create the first 4 rows manually.
2. As my values are 100 until "live", have the lambda function apply only when the value does not equal 100.
I have tried both:
1.
df.iloc[3:,:] = df.iloc[3:,:].rolling(2).apply(lambda x: (x[0]*x[1]),raw=True)
2.
df= df.rolling(2).apply(lambda x: (x[0]*x[1]) if x[0] != 100 else x,raw=True)
But both meet with total failure.
Any advice welcomed - I've spent hours looking through the site and have yet to find any outcome that works for this situation.
Given the lack of responses, I came up with a solution where I split my df into 2 parts and appended them back together.
My lambda function was also garbage; I needed something like:
df2 = df.copy()
for i in range(df2.index.size):
    if not i:
        continue
    df2.iloc[i] = (df2.iloc[i - 1] * (df.iloc[i]))
df2
to actually achieve what I was after.
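The loop above multiplies each row by the accumulated product of all rows before it, which is a running product. Assuming that is the intent and the data has no missing values, DataFrame.cumprod should give the same result without a Python loop. The sample values below are made up.

```python
import pandas as pd

# Hypothetical per-period growth factors for two products.
df = pd.DataFrame({"a": [1.0, 1.01, 0.99, 1.02],
                   "b": [1.0, 1.0, 1.05, 0.98]})

# The row-by-row loop from the post, for comparison.
looped = df.copy()
for i in range(1, len(df)):
    looped.iloc[i] = looped.iloc[i - 1] * df.iloc[i]

# Vectorized equivalent: cumulative product down each column.
vectorized = df.cumprod()

print(vectorized)
```

One difference to keep in mind: cumprod skips NaN by default, while the loop propagates a NaN to every later row, so the two only agree on data without gaps.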

Adding lists stored in dataframe

I have two dataframes as:
df1.ix[1:3]
DateTime
2018-01-02 [-0.0031537018416199097, 0.006451397621428631,...
2018-01-03 [-0.0028882814454597745, -0.005829869983964528...
df2.ix[1:3]
DateTime
2018-01-02 [-0.03285881500135208, -0.027806145786217932, ...
2018-01-03 [-0.0001314381449719178, -0.006278235444742629...
len(df1.ix['2018-01-02'][0])
500
len(df2.ix['2018-01-02'][0])
500
When I do df1 + df2 I get:
len((df1 + df2).ix['2018-01-02'][0])
1000
So the lists are being concatenated instead of summed.
How do I add the lists in dataframes df1 and df2 element-wise?
When an operation is applied between two dataframes, it is applied element by element. In your case each element is a list, and the '+' operator concatenates two lists. That's why the resulting dataframe contains concatenated lists.
There are multiple approaches for actually summing the elements of the lists instead of concatenating them.
One approach is to convert the list elements into columns, add the dataframes, and then merge the columns back into a single list (the first answer suggests this, but does it incorrectly).
Step 1: Converting list elements to columns
df1=df1.apply(lambda row:pd.Series(row[0]), axis=1)
df2=df2.apply(lambda row:pd.Series(row[0]), axis=1)
We need to pass row[0] instead of row to get rid of the column index associated with the series.
Step 2: Add dataframes
df=df1+df2 #this dataframe will have 500 columns
Step 3: Merge columns back to lists
df=df.apply(lambda row:pd.Series({0:list(row)}),axis=1)
This is the interesting part. Why return a Series here? Why doesn't returning just list(row) work, instead of leaving the 500 columns in place?
The reason: in older pandas versions (before 0.23), if the returned list has the same length as the number of columns, the list is fitted back into the columns, so it looks as if nothing happened. If the length of the list differs from the number of columns, it is returned as a single list. Since pandas 0.23, apply with axis=1 returns a Series of lists in both cases unless result_type='expand' is passed.
Let's look at an example.
Suppose I have a dataframe with columns 0, 1 and 2.
df=pd.DataFrame({0:[1,2,3],1:[4,5,6],2:[7,8,9]})
   0  1  2
0  1  4  7
1  2  5  8
2  3  6  9
The original dataframe has 3 columns. If I return a list with two elements, it works and a Series is returned:
df1=df.apply(lambda row:[row[0],row[1]],axis=1)
0 [1, 4]
1 [2, 5]
2 [3, 6]
dtype: object
If I instead return a list of three numbers, it gets fitted into the columns (this is the behavior on pandas versions before 0.23):
df1=df.apply(list,axis=1)
   0  1  2
0  1  4  7
1  2  5  8
2  3  6  9
So if we want to return a list of the same size as the number of columns, we have to return it as a Series in which a single entry holds the list.
Another approach is to copy the column of one dataframe into the other and then add the two columns using apply.
df1[1]=df2[0]
df=df1.apply(lambda r: list(np.array(r[0])+np.array(r[1])),axis=1)
We can take advantage of numpy arrays here. The '+' operator on numpy arrays sums corresponding values and returns a single numpy array.
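The numpy idea can also be written without touching df1, by pairing the two columns with zip. This is a minimal sketch, assuming both frames share the same index in the same order and hold their lists in column 0; the short three-element lists stand in for the 500-element ones in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's frames: one column (0) of lists.
idx = pd.to_datetime(["2018-01-02", "2018-01-03"])
df1 = pd.DataFrame({0: [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]}, index=idx)
df2 = pd.DataFrame({0: [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]]}, index=idx)

# np.add sums the two lists element-wise; tolist() turns the result back
# into a plain Python list so the column keeps holding lists.
summed = pd.DataFrame(
    {0: [np.add(a, b).tolist() for a, b in zip(df1[0], df2[0])]},
    index=df1.index,
)

print(summed)
```

Because zip pairs rows by position, the frames would need to be aligned first (for example with reindex) if their indexes differ.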
Cast them to series so that they become columns, then add your dfs:
df1 = df1.apply(pd.Series, axis=1)
df2 = df2.apply(pd.Series, axis=1)
df1 + df2

Pandas dataframes subsetting performance optimization

I need to subset dataframe rows on the basis of multiple conditions. Each condition is described by a set of columns. Say, there are columns
size_10ml
size_20ml
size_30ml
and there will be 1 in only one of the columns and zeroes in all others.
So to choose items (rows) by size and brand I will pass [["size_10ml", "size_20ml"], ["brand_A", "brand_E"]] to the following function:
def any_of_intersect_columns(df, *column_lists):
    """ Choose rows ANDing multiple conditions, i.e. choose rows having a nonzero value in at least one of the columns
    in all sets.
    column_lists : Each argument is an iterable: a list of column labels.
        A row meets a condition if any of the labeled columns from the current list is true.
        Then the rows from each condition (list) are intersected.
    Return
    -----
    df : subset of df rows
    """
    by_row = df
    for columns in column_lists:
        # choose columns of interest
        try:
            by_col = df[columns]
            # leave rows, evaluating True in at least one of chosen columns
            by_row = by_row.loc[by_col.any(axis=1), :]
        except KeyError:
            error("None of columns has labels {}".format(columns))
            by_row = pd.DataFrame()
    # return all, if nothing fits conditions
    return by_row if by_row.shape[0] else df
The function is called a few times for different condition "levels" to choose one item and there are many items, all from one table. I need ways to optimize this since this is the performance bottleneck.
Data and output example:
>>> df
   size_10ml  size_20ml  brand_A  brand_E  property_1
0          1          0        1        0           0
1          0          1        0        1           1
2          0          1        1        0           0
>>> any_of_intersect_columns(df, ["size_10ml", "size_20ml"], ["brand_A"]).index.tolist()
[0, 2]
Finally, it is possible to refactor so that the columns hold string property values instead of ones and zeroes, but I think this would only slow things down.
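One direction that may be worth timing is to build a single boolean mask with NumPy and index the frame once at the end, instead of re-slicing by_row for every condition. The sketch below is not benchmarked, and the function name any_of_intersect_columns_fast is hypothetical. It also makes one assumption that differs slightly from the original: labels missing from df are ignored, and if none of a condition's labels exist the whole frame is returned.

```python
import numpy as np
import pandas as pd


def any_of_intersect_columns_fast(df, *column_lists):
    """Sketch: AND together one 'any column is nonzero' test per column list."""
    mask = np.ones(len(df), dtype=bool)
    for columns in column_lists:
        # Assumption: unknown labels are skipped rather than raising KeyError.
        present = [c for c in columns if c in df.columns]
        if not present:
            # Nothing can satisfy this condition; mirror the "return all" fallback.
            return df
        # Row-wise "any nonzero" on the raw array, combined into the running mask.
        mask &= df[present].to_numpy().any(axis=1)
    # Return all rows if nothing fits the conditions, as in the original.
    return df[mask] if mask.any() else df


df = pd.DataFrame({
    "size_10ml": [1, 0, 0],
    "size_20ml": [0, 1, 1],
    "brand_A": [1, 0, 1],
    "brand_E": [0, 1, 0],
    "property_1": [0, 1, 0],
})

picked = any_of_intersect_columns_fast(df, ["size_10ml", "size_20ml"], ["brand_A"])
print(picked.index.tolist())
```

If the same table is queried many times, converting the indicator columns to one NumPy array up front and reusing it across calls might save more than any change inside the function.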

Fillna (forward fill) on a large dataframe efficiently with groupby?

What is the most efficient way to forward fill information in a large dataframe?
I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.
Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?
id      start_date  end_date    is_current  location  dimensions...
xyz987  2016-03-11  2016-04-02  Expired     CA        lots_of_stuff
xyz987  2016-04-03  2016-04-21  Expired     NaN       lots_of_stuff
xyz987  2016-04-22  NaN         Current     CA        lots_of_stuff
That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). For example, the location is filled out in one row but blank in the next. I know that the location has not changed, but the row is captured as unique because the value is blank.
I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?
cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)
There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a
df.fillna(method='ffill', inplace=True)
but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
It is likely efficient to execute the fillna directly on the groupby object:
df = df.groupby(['id']).fillna(method='ffill')
This method is referenced in the documentation.
How about forward filling each group?
df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
github/jreback: this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group.
here's an easy way to do this.
url:https://github.com/pandas-dev/pandas/issues/11296
According to jreback's comment, ffill() on a groupby was not optimized at the time, but cumsum() is. Try this:
df = df.sort_values('id')
# True once a non-null value has been seen within the same id
seen = df.notnull().astype(int).groupby(df['id']).cumsum().gt(0)
df = df.ffill().where(seen)
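For comparison, recent pandas versions expose forward fill directly on the groupby object, which avoids both the lambda and the masking trick. The sketch below uses a tiny made-up frame; the second id, abc123, is hypothetical and only there to show that values do not leak from one id into the next.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the shape of the question's data.
df = pd.DataFrame({
    "id": ["xyz987", "xyz987", "xyz987", "abc123", "abc123"],
    "location": ["CA", np.nan, "CA", np.nan, "NY"],
    "is_current": ["Expired", "Expired", "Current", "Expired", "Current"],
})

# Every column except the grouping key gets forward filled within its id.
value_cols = [c for c in df.columns if c != "id"]

filled = df.copy()
filled[value_cols] = df.groupby("id")[value_cols].ffill()

print(filled)
```

The first abc123 row keeps its NaN because there is no earlier value inside that group. Rows are filled in their existing order, so the frame should already be sorted by date within each id.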

Taking second last observed row

I am new to pandas. I know how to use drop_duplicates and take the last observed row in a dataframe. Is there any way that I can use it to take only the second-last observed row, or is there any other way of doing it?
For example:
I would like to go from
df = pd.DataFrame(data={'A':[1,1,1,2,2,2],'B':[1,2,3,4,5,6]})
to
df1 = pd.DataFrame(data={'A':[1,2],'B':[2,5]})
The idea is to group the data by the duplicated column and then check the length of each group. If a group has two or more rows, take its second element; if it has only one row, the value is not duplicated, so take index 0, the only element in the group.
df.groupby(df['A']).apply(lambda x : x.iloc[1] if len(x) >= 2 else x.iloc[0])
The first answer I think was on the right track, but possibly not quite right. I have extended your data to include 'A' groups with two observations, and an 'A' group with one observation, for the sake of completeness.
import pandas as pd
df = pd.DataFrame(data={'A':[1,1,1,2,2,2, 3, 3, 4],'B':[1,2,3,4,5,6, 7, 8, 9]})
def user_apply_func(x):
    if len(x) == 2:
        return x.iloc[0]
    if len(x) > 2:
        return x.iloc[-2]
    return
df.groupby('A').apply(user_apply_func)
Out[7]:
     A    B
A
1    1    2
2    2    5
3    3    7
4  NaN  NaN
For your reference, the apply method automatically passes each group's data frame as the first argument.
Also, as you are always going to be reducing each group of data to a single observation, you could also use the agg method (aggregate). apply is more flexible in terms of the length of the sequences that can be returned, whereas agg must reduce the data to a single value.
df.groupby('A').agg(user_apply_func)
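The same selection can be sketched without a Python function per group, using groupby's tail, head and cumcount. Both variants below are illustrations on the extended data from this answer; which one fits depends on what should happen to groups with a single observation.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2, 3, 3, 4],
                   "B": [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# Variant 1: keep the last two rows of each group, then the first of those.
# A group with a single row falls back to that row.
second_last = df.groupby("A").tail(2).groupby("A").head(1)

# Variant 2: count rows from the end of each group (0 = last, 1 = second last)
# and keep only the rows that have exactly one row after them.
# Groups with a single row are dropped.
strict = df[df.groupby("A").cumcount(ascending=False) == 1]

print(second_last)
print(strict)
```

Both results keep the original row labels, so reset_index(drop=True) can be added if a fresh index is wanted.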