Conditional imputation with the average of non-missing columns using the pandas toolbox - pandas

This question focuses on pandas' own functions. There are existing solutions (pandas DataFrame: replace nan values with average of columns), but they rely on custom-written functions.
In SPSS there is a function MEAN.n which gives you the mean of a list of numbers only when at least n elements of that list are valid (not pandas.NA). With that function you can impute missing values only if a minimum number of items are valid.
Is there a pandas function to do this?
Example
Values [1, 2, 3, 4, NA].
Mean of the valid values is 2.5.
The resulting list should be [1, 2, 3, 4, 2.5].
Assume the rule that in a 5-item list at least 3 values must be valid for imputation; otherwise the result stays NA.
Values [1, 2, NA, NA, NA].
The mean of the valid values is 1.5, but it does not matter.
The resulting list should stay unchanged, [1, 2, NA, NA, NA], because imputation is not allowed.

Assuming you want to work with pandas, you can define a custom wrapper (using only pandas functions) to fillna with the mean only if a minimum number of items are not NA:
import pandas as pd
from pandas import NA

s1 = pd.Series([1, 2, 3, 4, NA])
s2 = pd.Series([1, 2, NA, NA, NA])

def fillna_mean(s, N=4):
    # fill the missing values with the mean only if at least N values are not NA
    return s if s.notna().sum() < N else s.fillna(s.mean())
fillna_mean(s1)
# 0 1.0
# 1 2.0
# 2 3.0
# 3 4.0
# 4 2.5
# dtype: float64
fillna_mean(s2)
# 0 1
# 1 2
# 2 <NA>
# 3 <NA>
# 4 <NA>
# dtype: object
fillna_mean(s2, N=2)
# 0 1.0
# 1 2.0
# 2 1.5
# 3 1.5
# 4 1.5
# dtype: float64
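If the data lives in a DataFrame and the rule should be applied row-wise (as in the SPSS use case), one option is to apply the same wrapper along axis=1. This is only a minimal sketch of that idea, assuming each row is a 5-item record and at least 3 valid values are required; the expected output in the comments is approximate:
import pandas as pd
from pandas import NA

df = pd.DataFrame([[1, 2, 3, 4, NA],
                   [1, 2, NA, NA, NA]], dtype="Float64")

def fillna_mean_row(s, N=3):
    # fill missing values with the row mean only if at least N values are valid
    return s if s.notna().sum() < N else s.fillna(s.mean())

df.apply(fillna_mean_row, axis=1)
#      0    1     2     3     4
# 0  1.0  2.0   3.0   4.0   2.5
# 1  1.0  2.0  <NA>  <NA>  <NA>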

Let's try a list comprehension, though it will be messy.
Option 1
You can use pd.Series and numpy:
s = [x if np.isnan(lst).sum() >= 3 else pd.Series(lst).mean(skipna=True) if x is np.nan else x for x in lst]
Option 2
Use numpy all the way through:
s = [x if np.isnan(lst).sum() >= 3 else np.mean([x for x in lst if str(x) != 'nan']) if x is np.nan else x for x in lst]
Case 1
lst=[1, 2, 3, 4, np.nan]
outcome
[1, 2, 3, 4, 2.5]
Case 2
lst=[1, 2, np.nan, np.nan, np.nan]
outcome
[1, 2, nan, nan, nan]
If you wanted it as a pd.Series, simply
pd.Series(s, name='lst')
How it works
s = [x if np.isnan(lst).sum() >= 3                                 # give me element x unchanged if the number of NaNs in the list is greater than or equal to 3
     else pd.Series(lst).mean(skipna=True) if x is np.nan else x   # otherwise replace each NaN in the list with the mean of the non-NaN elements
     for x in lst]                                                 # for every element in lst
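For comparison, here is a vectorized sketch of the same rule using plain numpy (my own addition, not part of the original answer; it uses the same threshold of at least 3 valid values in a 5-item list):
import numpy as np

lst = [1, 2, 3, 4, np.nan]
arr = np.asarray(lst, dtype=float)

# impute only when fewer than 3 values are NaN, i.e. at least 3 are valid
if np.isnan(arr).sum() < 3:
    arr = np.where(np.isnan(arr), np.nanmean(arr), arr)

arr
# array([1. , 2. , 3. , 4. , 2.5])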

Related

Pandas aggregate to a list of dicts [duplicate]

I have a pandas data frame df like:
a b
A 1
A 2
B 5
B 5
B 4
C 6
I want to group by the first column and get the second column as lists in rows:
A [1,2]
B [5,5,4]
C [6]
Is it possible to do something like this using pandas groupby?
You can do this using groupby to group on the column of interest and then apply list to every group:
In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
df
Out[1]:
a b
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
In [2]: df.groupby('a')['b'].apply(list)
Out[2]:
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object
In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
df1
Out[3]:
a new
0 A [1, 2]
1 B [5, 5, 4]
2 C [6]
A handy way to achieve this would be:
df.groupby('a').agg({'b':lambda x: list(x)})
Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py
If performance is important, go down to the numpy level:
import numpy as np
df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})
def f(df):
    keys, values = df.sort_values('a').values.T
    ukeys, index = np.unique(keys, True)
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})
    return df2
Tests:
In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop
In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop
To solve this for several columns of a dataframe:
In [5]: df = pd.DataFrame({'a': ['A','A','B','B','B','C'], 'b': [1,2,5,5,4,6], 'c': [3,3,3,4,4,4]})
In [6]: df
Out[6]:
a b c
0 A 1 3
1 A 2 3
2 B 5 3
3 B 5 4
4 B 4 4
5 C 6 4
In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]:
b c
a
A [1, 2] [3, 3]
B [5, 5, 4] [3, 4, 4]
C [6] [4]
This answer was inspired by Anamika Modi's answer. Thank you!
Use any of the following groupby and agg recipes.
# Setup
df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [1, 2, 5, 5, 4, 6],
    'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df
a b c
0 A 1 x
1 A 2 y
2 B 5 z
3 B 5 x
4 B 4 y
5 C 6 z
To aggregate multiple columns as lists, use any of the following:
df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)
b c
a
A [1, 2] [x, y]
B [5, 5, 4] [z, x, y]
C [6] [z]
To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg:
df.groupby('a').agg({'b': list}) # 4.42 ms
df.groupby('a')['b'].agg(list) # 2.76 ms - faster
a
A [1, 2]
B [5, 5, 4]
C [6]
Name: b, dtype: object
As you were saying, the groupby method of a pd.DataFrame object can do the job.
Example
L = ['A','A','B','B','B','C']
N = [1,2,5,5,4,6]
import pandas as pd
df = pd.DataFrame(zip(L,N),columns = list('LN'))
groups = df.groupby(df.L)
groups.groups
{'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}
which gives an index-wise description of the groups.
To get elements of single groups, you can do, for instance
groups.get_group('A')
L N
0 A 1
1 A 2
groups.get_group('B')
L N
2 B 5
3 B 5
4 B 4
It is time to use agg instead of apply.
When
df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})
If you want multiple columns stacked into lists, the result is a pd.DataFrame:
df.groupby('a')[['b', 'c']].agg(list)
# or
df.groupby('a').agg(list)
If you want a single column as a list, the result is a pd.Series:
df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)
Note: producing a pd.DataFrame result is about 10x slower than producing a pd.Series result when you only aggregate a single column, so reserve it for the multi-column case.
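As a rough illustration of that note (my own sketch, not from the original answer; exact numbers depend on machine and data size), the two forms can be timed against each other for a single column:
%timeit df.groupby('a')[['b']].agg(list)   # DataFrame result (double brackets)
%timeit df.groupby('a')['b'].agg(list)     # Series result (single brackets), typically much faster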
Just a supplement. pandas.pivot_table is much more universal and seems more convenient:
"""data"""
df = pd.DataFrame({'a': ['A','A','B','B','B','C'],
                   'b': [1,2,5,5,4,6],
                   'c': [1,2,1,1,1,6]})
print(df)
a b c
0 A 1 1
1 A 2 2
2 B 5 1
3 B 5 1
4 B 4 1
5 C 6 6
"""pivot_table"""
pt = pd.pivot_table(df,
                    values=['b', 'c'],
                    index='a',
                    aggfunc={'b': list,
                             'c': set})
print(pt)
b c
a
A [1, 2] {1, 2}
B [5, 5, 4] {1}
C [6] {6}
If you are looking for a unique list while grouping multiple columns, this could probably help:
df.groupby('a').agg(lambda x: list(set(x))).reset_index()
Building upon B.M's answer, here is a more general version, updated to work with newer library versions (numpy 1.19.2, pandas 1.2.1).
This solution can also deal with multi-indices.
However, it is not heavily tested, so use it with caution.
If performance is important, go down to the numpy level:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'a': np.random.randint(0, 10, 90), 'b': [1,2,3]*30, 'c':list('abcefghij')*10, 'd': list('hij')*30})
def f_multi(df, col_names):
    if not isinstance(col_names, list):
        col_names = [col_names]

    values = df.sort_values(col_names).values.T

    col_idcs = [df.columns.get_loc(cn) for cn in col_names]
    other_col_names = [name for idx, name in enumerate(df.columns) if idx not in col_idcs]
    other_col_idcs = [df.columns.get_loc(cn) for cn in other_col_names]

    # split df into indexing columns (=keys) and data columns (=vals)
    keys = values[col_idcs, :]
    vals = values[other_col_idcs, :]

    # list of tuples of key pairs
    multikeys = list(zip(*keys))

    # remember unique key pairs and their indices
    ukeys, index = np.unique(multikeys, return_index=True, axis=0)

    # split data columns according to those indices
    arrays = np.split(vals, index[1:], axis=1)

    # resulting list of subarrays has same number of subarrays as unique key pairs
    # each subarray has the following shape:
    #   rows = number of non-grouped data columns
    #   cols = number of data points grouped into that unique key pair

    # prepare multi index
    idx = pd.MultiIndex.from_arrays(ukeys.T, names=col_names)

    list_agg_vals = dict()
    for tup in zip(*arrays, other_col_names):
        col_vals = tup[:-1]  # first entries are the subarrays from above
        col_name = tup[-1]   # last entry is the data-column name
        list_agg_vals[col_name] = col_vals

    df2 = pd.DataFrame(data=list_agg_vals, index=idx)
    return df2
Tests:
In [227]: %timeit f_multi(df, ['a','d'])
2.54 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [228]: %timeit df.groupby(['a','d']).agg(list)
4.56 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Results:
for the random seed 0 one would get:
The easiest way I have found to achieve the same thing, at least for one column, is similar to Anamika's answer, just with the tuple syntax for the aggregate function:
df.groupby('a').agg(b=('b','unique'), c=('c','unique'))
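Note that 'unique' drops duplicates and returns arrays rather than lists. If plain lists (duplicates kept) are wanted with the same named-aggregation syntax, this variation should work (assuming the Setup DataFrame with columns a, b and c from the recipes above; output shown approximately):
df.groupby('a').agg(b=('b', list), c=('c', list))
#            b          c
# a
# A     [1, 2]     [x, y]
# B  [5, 5, 4]  [z, x, y]
# C        [6]        [z]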
Let us use df.groupby with a dict comprehension and the Series constructor:
pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]:
A [1, 2]
B [5, 5, 4]
C [6]
dtype: object
Here I have grouped elements with "|" as a separator
import pandas as pd
df = pd.read_csv('input.csv')
df
Out[1]:
Area Keywords
0 A 1
1 A 2
2 B 5
3 B 5
4 B 4
5 C 6
df.dropna(inplace = True)
df['Area']=df['Area'].apply(lambda x:x.lower().strip())
print(df.columns)
df_op = df.groupby('Area').agg({"Keywords":lambda x : "|".join(x)})
df_op.to_csv('output.csv')
Out[2]:
df_op
Area Keywords
A [1| 2]
B [5| 5| 4]
C [6]
Answer based on @EdChum's comment on his answer. The comment is this:
groupby is notoriously slow and memory hungry, what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe would be faster I think
Let's first create a dataframe with 500k categories in the first column and a total df shape of 20 million rows, as mentioned in the question.
df = pd.DataFrame(columns=['a', 'b'])
df['a'] = (np.random.randint(low=0, high=500000, size=(20000000,))).astype(str)
df['b'] = list(range(20000000))
print(df.shape)
df.head()
# Sort data by first column
df.sort_values(by=['a'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)
# Create a temp column
df['temp_idx'] = list(range(df.shape[0]))
# Take all values of b in a separate list
all_values_b = list(df.b.values)
print(len(all_values_b))
# For each category in column a, find min and max indexes
gp_df = df.groupby(['a']).agg({'temp_idx': [np.min, np.max]})
gp_df.reset_index(inplace=True)
gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max']
# Now create final list_b column, using min and max indexes for each category of a and filtering list of b.
gp_df['list_b'] = gp_df[['temp_idx_min', 'temp_idx_max']].apply(lambda x: all_values_b[x[0]:x[1]+1], axis=1)
print(gp_df.shape)
gp_df.head()
This above code takes 2 minutes for 20 million rows and 500k categories in first column.
Sorting consumes O(n log(n)) time, which is the most time-consuming operation in the solutions suggested above.
For a simple solution (containing a single column), pd.Series.to_list would work and can be considered more efficient, unless other frameworks are being considered.
e.g.
import pandas as pd
from string import ascii_lowercase
import random
def generate_string(case=4):
    return ''.join([random.choice(ascii_lowercase) for _ in range(case)])
df = pd.DataFrame({'num_val':[random.randint(0,100) for _ in range(20000000)],'string_val':[generate_string() for _ in range(20000000)]})
%timeit df.groupby('string_val').agg({'num_val':pd.Series.to_list})
For 20 million records it takes about 17.2 seconds, compared to apply(list), which takes about 19.2 seconds, and the lambda function, which takes about 20.6 seconds.
Just to add to the previous answers: in my case, I wanted the list as well as other functions like min and max. The way to do that is:
df = pd.DataFrame({
    'a': ['A','A','B','B','B','C'],
    'b': [1,2,5,5,4,6]
})

df = df.groupby('a').agg({
    'b': ['min', 'max', lambda x: list(x)]
})

# then flatten and rename if necessary
df.columns = df.columns.to_flat_index()
df.rename(columns={('b', 'min'): 'b_min', ('b', 'max'): 'b_max', ('b', '<lambda_0>'): 'b_list'}, inplace=True)
It's a bit old but I was directed here. Is there any way to group it by multiple different columns?
"column1", "column2", "column3"
"foo", "val1", 3
"foo", "val2", 0
"foo", "val2", 3
"bar", "other", 99
to this:
"column1", "column2", "column3"
"foo", "val1", [ 3 ]
"foo", "val2", [ 0, 3 ]
"bar", "other", [ 99 ]

Generate a new column based on other columns' value

here is my sample data input and output:
df=pd.DataFrame({'A_flag': [1, 1,1], 'B_flag': [1, 1,0],'C_flag': [0, 1,0],'A_value': [5, 3,7], 'B_value': [2, 7,4],'C_value': [4, 2,5]})
df1=pd.DataFrame({'A_flag': [1, 1,1], 'B_flag': [1, 1,0],'C_flag': [0, 1,0],'A_value': [5, 3,7], 'B_value': [2, 7,4],'C_value': [4, 2,5], 'Final':[3.5,3,7]})
I want to generate another column called 'Final', conditional on A_flag, B_flag and C_flag:
(a) If the number of flag columns equal to 1 is 3, then 'Final' = median of (A_value, B_value, C_value)
(b) If the number of satisfied conditions is 2, then 'Final' = mean of those two values
(c) If the number is 1, then 'Final' = that one value
For example, in row 1, A_flag=1 and B_flag=1, so 'Final' = (A_value + B_value)/2 = (5 + 2)/2 = 3.5;
in row 2, all three flags are 1, so 'Final' = median of (3, 7, 2) = 3;
in row 3, only A_flag=1, so 'Final' = A_value = 7.
I tried the following:
df.loc[df[['A_flag','B_flag','C_flag']].eq(1).sum(axis=1)==3, "Final"]= df[['A_flag','B_flag','C_flag']].median(axis=1)
df.loc[df[['A_flag','B_flag','C_flag']].eq(1).sum(axis=1)==2, "Final"]=
df.loc[df[['A_flag','B_flag','C_flag']].eq(1).sum(axis=1)==1, "Final"]=
I don't know how to subset the columns for the second and third scenarios.
Assuming the order of the flag and value columns match, you can first filter the flag-like and value-like columns, then mask the values in the value columns where the flag is 0, and finally calculate the median along axis=1:
flag = df.filter(like='_flag')
value = df.filter(like='_value')
df['median'] = value.mask(flag.eq(0).to_numpy()).median(1)
A_flag B_flag C_flag A_value B_value C_value median
0 1 1 0 5 2 4 3.5
1 1 1 1 3 7 2 3.0
2 1 0 0 7 4 5 7.0
When dealing with functions and dataframes, usually the easiest way to go is defining a function and then applying it to the dataframe, either by iterating over the columns or over the rows. I think in your case this might work:
import pandas as pd
df = pd.DataFrame(
    {
        "A_flag": [1, 1, 1],
        "B_flag": [1, 1, 0],
        "C_flag": [0, 1, 0],
        "A_value": [5, 3, 7],
        "B_value": [2, 7, 4],
        "C_value": [4, 2, 5],
    }
)
def make_final_column(row):
    flags = [(row['A_flag'], row['A_value']), (row['B_flag'], row['B_value']), (row['C_flag'], row['C_value'])]
    # keep only the values whose flag is 1, then average them
    met_condition = [value for flag, value in flags if flag == 1]
    return sum(met_condition) / len(met_condition)
df["Final"] = df.apply(make_final_column, axis=1)
df
With numpy:
flags = df[["A_flag", "B_flag", "C_flag"]].to_numpy()
values = df[["A_value", "B_value", "C_value"]].to_numpy()
# Sort each row so that the 0 flags appear first
index = np.argsort(flags)
flags = np.take_along_axis(flags, index, axis=1)
# Rearrange the values to match the flags
values = np.take_along_axis(values, index, axis=1)
# Result
np.select(
    [
        flags[:, 0] == 1,  # when all flags are 1
        flags[:, 1] == 1,  # when two flags are 1
        flags[:, 2] == 1,  # when one flag is 1
    ],
    [
        np.quantile(values, 0.5, axis=1),  # median of all 3 values
        np.mean(values[:, -2:], axis=1),   # mean of the two 1-flagged values
        values[:, 2],                      # value of the single 1-flag
    ],
    default=np.nan
)
Quite interesting solutions already. I have used a masked approach.
Explanation:
With the flags given, it becomes easy to find which values matter simply by multiplying the values by the flags. Then mask the entries which are zero in the respective rows and take the median over the axis.
>>> import numpy as np
>>> t_arr = np.array((df.A_flag * df.A_value, df.B_flag * df.B_value, df.C_flag * df.C_value)).T
>>> maskArr = np.ma.masked_array(t_arr, mask=t_arr==0)
>>> df["Final"] = np.ma.median(maskArr, axis=1)
>>> df
A_flag B_flag C_flag A_value B_value C_value Final
0 1 1 0 5 2 4 3.5
1 1 1 1 3 7 2 3.0
2 1 0 0 7 4 5 7.0

How to remove all types of NaN from the dataframe?

I had a data frame, which is shown below. I want to merge column values into one column, excluding nan values.
Image 1:
When I am using the code
df3["Generation"] = df3[df3.columns[5:]].apply(lambda x: ','.join(x.dropna()), axis=1)
I am getting results like this.
Image 2:
I suspect that these columns are of type string; thus, they are not affected by x.dropna().
One example that I made is this, which gives similar results as yours.
df = pd.DataFrame({'a': [np.nan, np.nan, 1, 2], 'b': [1, 1, np.nan, None]}).astype(str)
df.apply(lambda x: ','.join(x.dropna()), axis=1)
0 nan,1.0
1 nan,1.0
2 1.0,nan
3 2.0,nan
dtype: object
-----------------
# using simple string comparing solves the problem
df.apply(lambda x: ','.join(x[x!='nan']), axis=1)
0 1.0
1 1.0
2 1.0
3 2.0
dtype: object
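An alternative sketch (my own suggestion, not from the original answer): convert the literal string 'nan' back into a real missing value first, and then dropna behaves as expected. This assumes the only sentinel present is the string 'nan':
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, np.nan, 1, 2], 'b': [1, 1, np.nan, None]}).astype(str)

# turn the string 'nan' back into a real NaN, then dropna works again
df = df.replace('nan', np.nan)
df.apply(lambda x: ','.join(x.dropna()), axis=1)
# 0    1.0
# 1    1.0
# 2    1.0
# 3    2.0
# dtype: object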

dataframe groupby nth with same behaviour as first and last

In a dataframe, when performing groupby('col').first() we get the first non-NaN value in each column (same for last).
I am trying to get the second non-NaN value and I cannot find how. The only relevant function that I found is groupby('col').nth(1), but it just gives me the second row, with NaNs if they exist. groupby('col').nth(1, dropna='any') doesn't do the job since it skips rows with NaNs and doesn't check each column separately.
example:
df = pd.DataFrame({
    'A': [1, 1, 1, 1, 1],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [np.nan, np.nan, 3, 4, 5]
}, columns=['A', 'B', 'C'])
first() behaviour:
df.groupby('A').first().reset_index()
results with:
A B C
0 1 2.0 3.0
on the other hand:
df.groupby('A').nth(0, dropna='any').reset_index()
gives:
A B C
0 1 3.0 3.0
Is there a way to get the same behaviour of first/last in the nth function so I can apply it also for second or any nth item?
You can use the generic aggregate method to filter each series with notna and then pick the index you want, for example:
df.groupby('A').aggregate(lambda x: x.array[pd.notna(x)][0])
Produces:
B C
A
1 2.0 3.0
Changing the index to 1 to get the second notna value gives:
B C
A
1 3.0 4.0
Of course that lambda is a bit naive because it will raise an IndexError if the array isn't long enough. A function like this should work:
def nth_notna(n):
    def inner(series):
        a = series.array[pd.notna(series)]
        if len(a) - 1 < n:
            return np.nan
        return a[n]
    return inner
Then df.groupby('A').aggregate(nth_notna(3)) will produce:
B C
A
1 5.0 NaN

pandas using qcut on series with fewer values than quantiles

I have thousands of series (rows of a DataFrame) that I need to apply qcut on. Periodically there will be a series (row) that has fewer values than the desired quantile (say, 1 value vs 2 quantiles):
>>> s = pd.Series([5, np.nan, np.nan])
When I apply .quantile() to it, it has no problem breaking into 2 quantiles (of the same boundary value)
>>> s.quantile([0.5, 1])
0.5 5.0
1.0 5.0
dtype: float64
But when I apply .qcut() with an integer value for number of quantiles an error is thrown:
>>> pd.qcut(s, 2)
...
ValueError: Bin edges must be unique: array([ 5., 5., 5.]).
You can drop duplicate edges by setting the 'duplicates' kwarg
Even after I set the duplicates argument, it still fails:
>>> pd.qcut(s, 2, duplicates='drop')
....
IndexError: index 0 is out of bounds for axis 0 with size 0
How do I make this work? (And equivalently, pd.qcut(s, [0, 0.5, 1], duplicates='drop') also doesn't work.)
The desired output is to have the 5.0 assigned to a single bin and the NaN are preserved:
0 (4.999, 5.000]
1 NaN
2 NaN
Ok, this is a workaround which might work for you.
pd.qcut(s, len(s.dropna()), duplicates='drop')
Out[655]:
0 (4.999, 5.0]
1 NaN
2 NaN
dtype: category
Categories (1, interval[float64]): [(4.999, 5.0]]
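Since the original problem involves thousands of rows, here is a minimal sketch (my own, building on that workaround) of applying it row by row; it assumes each row of df is one series to be binned and caps the number of quantiles at the number of valid values in the row:
import numpy as np
import pandas as pd

df = pd.DataFrame([[5, np.nan, np.nan],
                   [1, 2, 3]])

def qcut_row(row, q=2):
    n_valid = row.notna().sum()
    if n_valid == 0:
        return row  # nothing to bin
    # never ask for more quantiles than there are valid values
    return pd.qcut(row, min(q, n_valid), duplicates='drop')

binned = df.apply(qcut_row, axis=1)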
You can try filling your object/numeric columns with an appropriate filler ('null' for strings and 0 for numeric columns):
#fill numeric cols with 0
numeric_columns = df.select_dtypes(include=['number']).columns
df[numeric_columns] = df[numeric_columns].fillna(0)
#fill object cols with null
string_columns = df.select_dtypes(include=['object']).columns
df[string_columns] = df[string_columns].fillna('null')
Use Python 3.5 instead of Python 2.7.
This worked for me.