The Problem:
I have a list of pandas.Series, where the series all have dates as their index, but they are not guaranteed to share the same index.
The values are guaranteed to be bools (no NaNs possible).
The result I want is a single pandas.Series whose index is the union of all indices found in the list of series. The value at each index should be the logical AND of the values of all series that contain that index.
Example:
A = pd.Series(index=[datetime(2015, 5, 1, 20),
                     datetime(2015, 5, 1, 20, 15),
                     datetime(2015, 5, 1, 20, 30)],
              data=[False, True, True])
B = pd.Series(index=[datetime(2015, 5, 1, 20),
                     datetime(2015, 5, 1, 20, 30),
                     datetime(2015, 5, 1, 20, 45)],
              data=[True, True, True])
series = [A, B]
A common index is datetime(2015, 5, 1, 20); the result at this index should be False and True, i.e. False.
An uncommon index is datetime(2015, 5, 1, 20, 45); it is only found in series B. The expected result is the value of B at this index, i.e. True.
The desired result in total looks like this:
result = pd.Series(index=[datetime(2015, 5, 1, 20),
                          datetime(2015, 5, 1, 20, 15),
                          datetime(2015, 5, 1, 20, 30),
                          datetime(2015, 5, 1, 20, 45)],
                   data=[False, True, True, True])
My Approach:
I came up with a good start (I think) but cannot find the correct operation; it currently looks like this:
result = None
for next in series:
    if result is None:
        result = next
    else:
        result = result.reindex(index=result.index | next.index)
        # the next line sadly raises:
        # ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
        result.loc[next.index] = result.loc[next.index] and next.loc[next.index]
I should have dug a little further before asking. I found a solution that works for me and looks like the pandas way of doing it, but I'm happy to stand corrected if an even more convenient way is presented!
result = None
for next in series:
    if result is None:
        result = next
    else:
        index = result.index | next.index
        result = result.reindex(index, fill_value=True) & next.reindex(index, fill_value=True)
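For what it's worth, the same loop can be written as a reduction; this is just a sketch of the approach above, assuming series is a non-empty list of boolean Series:

from functools import reduce

def and_pair(a, b):
    # Reindex both series onto the union of their indices, filling gaps with True
    # (the neutral element of AND), then combine element-wise.
    index = a.index | b.index
    return a.reindex(index, fill_value=True) & b.reindex(index, fill_value=True)

result = reduce(and_pair, series)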
If I understand what you want, I'd concat the 2 series column-wise and then call a function row-wise that drops the NaN values and returns the logical and of the 2 columns or the lone column value:
In [231]:
df = pd.concat([A, B], axis=1)

def func(x):
    l = x.dropna()
    if len(l) > 1:
        return l.iloc[0] & l.iloc[1]  # positional access; l[0] would fail on a datetime index
    return l.values[0]

df['result'] = df.apply(func, axis=1)
df
Out[231]:
                         0     1 result
2015-05-01 20:00:00  False  True  False
2015-05-01 20:15:00   True   NaN   True
2015-05-01 20:30:00   True  True   True
2015-05-01 20:45:00    NaN  True   True
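A vectorized alternative (a sketch, not part of the answer above): since a missing value should not change the result of an AND, you can fill the NaNs with True and take the row-wise all instead of applying a Python function per row:

df = pd.concat([A, B], axis=1)
result = df.fillna(True).all(axis=1)  # missing entries act as the neutral element of AND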
Related
I am using pandas and have run into a few occasions where I have a programmatically generated list of conditionals, like so
conditionals = [
    df['someColumn'] == 'someValue',
    df['someOtherCol'] == 'someOtherValue',
    df['someThirdCol'].isin(['foo','bar','baz']),
]
and I want to select rows where ALL of these conditions are true. I figure I'd do something like this.
bigConditional = IHaveNoIdeaOfWhatToPutHere
for conditional in conditionals:
    bigConditional = bigConditional && conditional
filteredDf = df[bigConditional]
I know that I WANT to use the identity property, where bigConditional is initialized to a Series of True for every index in my dataframe, so that if any condition in my conditionals list evaluates to False that row won't be in the filtered dataframe, but initially every row is considered.
I don't know how to do that, or at least not the most succinct way that shows it's intentional.
I've also run into the inverse scenario, where I only need one of the conditionals to match to include the row in the new dataframe, so I would need bigConditional to be set to False for every index in the dataframe.
What about summing the conditions and checking whether the sum equals the number of conditions?
filteredDf = df.loc[sum(conditionals)==len(conditionals)]
Or, even simpler, with np.all:
filteredDf = df.loc[np.all(conditionals, axis=0)]
Otherwise, for your original question, you can create a Series of True indexed like df, and your for loop should work:
bigConditional = pd.Series(True, index=df.index)
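Spelled out, the loops with the identity elements for both cases might look like this (a sketch; note that element-wise operations on Series use & and |, not && and ||):

# AND case: start from all-True (the identity for &) and accumulate.
bigConditional = pd.Series(True, index=df.index)
for conditional in conditionals:
    bigConditional = bigConditional & conditional
filteredDf = df[bigConditional]

# OR case: start from all-False (the identity for |) and accumulate.
anyConditional = pd.Series(False, index=df.index)
for conditional in conditionals:
    anyConditional = anyConditional | conditional
filteredDf = df[anyConditional]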
Maybe you can use query and generate your conditions like this:
conditionals = [
    "someColumn == 'someValue'",
    "someOtherCol == 'someOtherValue'",
    "someThirdCol.isin(['foo', 'bar', 'baz'])",
]
qs = ' & '.join(conditionals)
out = df.query(qs)
Or use eval to create boolean values instead of filtering your dataframe:
mask = df.eval(qs)
Demo
Suppose this dataframe:
>>> df
someColumn someOtherCol someThirdCol
0 someValue someOtherValue foo
1 someValue someOtherValue baz
2 someValue anotherValue anotherValue
3 anotherValue anotherValue anotherValue
>>> df.query(qs)
someColumn someOtherCol someThirdCol
0 someValue someOtherValue foo
1 someValue someOtherValue baz
>>> df.eval(qs)
0 True
1 True
2 False
3 False
dtype: bool
You can even use f-strings or another template language to pass variables to your condition list.
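For example, a hypothetical sketch where the filter values come from variables (wanted_value and wanted_list are made up here):

wanted_value = 'someValue'
wanted_list = ['foo', 'bar', 'baz']

conditionals = [
    f"someColumn == '{wanted_value}'",
    f"someThirdCol.isin({wanted_list})",
]
out = df.query(' & '.join(conditionals))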
I have a dataframe that looks like the following:
arr = pd.DataFrame([[0,0],[0,1],[0,4],[1,4],[1,5],[1,6],[2,5],[2,8],[2,6]])
My desired output is booleans that represent whether the value in column 2 is in the next consecutive group or not. The groups are represented by the values in column 1. So for example, 4 shows up in group 0 and the next consecutive group, group 1:
output = pd.DataFrame([[False],[False],[True],[False],[True],[True],[np.nan],[np.nan],[np.nan]])
The outputs for group 2 would be NaN because group 3 doesn't exist.
So far I have tried this:
output = arr.groupby([0])[1].isin(arr.groupby([0])[1].shift(periods=-1))
This doesn't work because I can't apply the isin() on a groupby series.
You could create a helper column with lists of the shifted group's items, then check against that with a function that returns True, False or NaN:
import pandas as pd
import numpy as np

arr = pd.DataFrame([[0,0],[0,1],[0,4],[1,4],[1,5],[1,6],[2,5],[2,8],[2,6]])
arr = pd.merge(arr, arr.groupby([0]).agg(list).shift(-1).reset_index(), on=[0], how='outer')

def check_columns(row):
    try:
        if row['1_x'] in row['1_y']:
            return True
        else:
            return False
    except TypeError:  # no next group: the shifted list is NaN, so the membership test fails
        return np.nan

arr.apply(check_columns, axis=1)
Result:
0 False
1 False
2 True
3 False
4 True
5 True
6 NaN
7 NaN
8 NaN
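As an alternative sketch (not part of the original answer), you could build a plain dict that maps each group to the set of values appearing in the following group and check membership row by row:

import pandas as pd
import numpy as np

arr = pd.DataFrame([[0,0],[0,1],[0,4],[1,4],[1,5],[1,6],[2,5],[2,8],[2,6]])

# Map each group g to the set of values that appear in group g + 1.
next_vals = {g - 1: set(vals) for g, vals in arr.groupby(0)[1]}

result = arr.apply(
    lambda row: row[1] in next_vals[row[0]] if row[0] in next_vals else np.nan,
    axis=1,
)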
I have a pandas DataFrame with multiple columns. I want to check whether a specific column's value is NaN and, if so, return a boolean (True or False).
I tried
pandas_df['col1'].isnull()
But it returns all the rows, with the index and a boolean value for each.
IIUC you need .any() to check if there are any null values:
pandas_df.col1.isnull().any()
To return a boolean scalar use Series.any, which tests whether there is at least one NaN (at least one True) per column:
pandas_df['col1'].isnull().any()
If you need to test whether all values are NaN, use Series.all:
pandas_df['col1'].isnull().all()
pandas_df = pd.DataFrame({'col1':[1,2,np.nan],
                          'col2':[np.nan, np.nan, np.nan]})
print (pandas_df['col1'].isnull().any())
True
print (pandas_df['col2'].isnull().all())
True
You can also use isna(). This is identical to isnull()
df.isna() will detect missing values on the whole dataframe.
   age        born    name        toy
0  5.0         NaT  Alfred       None
1  6.0  1939-05-27  Batman  Batmobile
2  NaN  1940-04-25              Joker
Calling df.isna() will return True on missing values (the trio NaN, NaT and None):
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
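If you only need a scalar check for one specific cell rather than a whole column, pd.isna also works on a single value (a small sketch against the frame above):

pd.isna(df.at[0, 'born'])  # True: row 0 of 'born' is NaT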
I am performing a groupby operation on a DataFrame. On each of the groups I have to rename two columns and drop one, so that each group will have the following form:
index(timestamp) | column-x | column-y
... | .... | .....
The index is a timestamp and it will be common to each group. 'column-x' and 'column-y' instead will be different to each group. My goal is then to join all groups on the index so that I have a unique DataFrame such as:
index(timestamp) | column-x1 | column-y1 | column-x2 | column-y2 | ...
... | ..... | ...... | ....... | ....... | ...
The function I apply to each group is (can I make an in-place edit to the group while iterating?):
def process_ssp(df_ssp):
    sensor_name = df_ssp.iloc[0]['subsystem-sensor-parameter']  # to be used as column name
    df_ssp.rename(columns={
        'value_raw': '%s_raw' % sensor_name,
        'value_hrf': '%s_hrf' % sensor_name,
    }, inplace=True)
    # since this is the column I am grouping on I guess this isn't the right thing to do?
    df_ssp.drop('subsystem-sensor-parameter', axis='columns', inplace=True)
    return df_ssp
Then I call:
res = df_node.groupby('subsystem-sensor-parameter', as_index=False).apply(process_ssp)
Which produces the error:
ValueError: cannot reindex from a duplicate axis
EDIT:
Dataset sample https://drive.google.com/file/d/1RvPE1t3BmjeaqCNkVqGwmokCFQQp77n8/view?usp=sharing
You can first append the subsystem-sensor-parameter column to the index to create a MultiIndex, reshape with unstack, sort the MultiIndex in the columns by the second level and swap the levels' positions. Last, flatten the MultiIndex columns with map and join:
res = (df_node.set_index('subsystem-sensor-parameter', append=True)
              .unstack()
              .sort_index(axis=1, level=1)
              .swaplevel(0, 1, axis=1))
res.columns = res.columns.map('_'.join)
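For illustration, a toy frame with made-up timestamps and sensor names (only the column names are taken from the question) shows what the reshape does:

import pandas as pd

df_node = pd.DataFrame({
    'subsystem-sensor-parameter': ['s1', 's2', 's1', 's2'],
    'value_raw': [1, 2, 3, 4],
    'value_hrf': [10, 20, 30, 40],
}, index=pd.to_datetime(['2015-05-01 20:00', '2015-05-01 20:00',
                         '2015-05-01 20:15', '2015-05-01 20:15']))

res = (df_node.set_index('subsystem-sensor-parameter', append=True)
              .unstack()
              .sort_index(axis=1, level=1)
              .swaplevel(0, 1, axis=1))
res.columns = res.columns.map('_'.join)
# Columns become s1_value_hrf, s1_value_raw, s2_value_hrf, s2_value_raw,
# with one row per timestamp.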
I'm able to successfully apply your code and produce the output you want by iterating over the groups rather than using apply:
import pandas as pd
df = pd.read_csv('/Users/jeffmayse/Downloads/sample.csv')
df.set_index('timestamp', inplace=True)
def process_ssp(df_ssp):
    sensor_name = df_ssp.iloc[0]['subsystem-sensor-parameter']  # to be used as column name
    df_ssp.rename(columns={
        'value_raw': '%s_raw' % sensor_name,
        'value_hrf': '%s_hrf' % sensor_name,
    }, inplace=True)
    # since this is the column I am grouping on I guess this isn't the right thing to do?
    df_ssp.drop('subsystem-sensor-parameter', axis='columns', inplace=True)
    return df_ssp

groups = df.groupby('subsystem-sensor-parameter')
out = []
for name, group in groups:
    try:
        out.append(process_ssp(group))
    except:
        print(name)
pd.concat(out).shape
Out[7]: (16131, 114)
And in fact, the issue is in the apply method, as your function is not needed to produce the error:
df.groupby('subsystem-sensor-parameter', as_index=False).apply(lambda x: x)
evaluates to ValueError: cannot reindex from a duplicate axis as well.
However, this statement evaluates as we'd expect:
df.reset_index(inplace=True)
df.groupby('subsystem-sensor-parameter', as_index=False).apply(process_ssp)
Out[22]:
nc-devices-alphasense_hrf ... wagman-uptime-uptime_raw
0 0 ... NaN
1 NaN ... NaN
2 NaN ... NaN
3 NaN ... NaN
...
The issue is that you have a DatetimeIndex with duplicate values. .apply is attempting to combine the result sets back together, but is not sure how to combine an index with duplicate values. At least, I believe that's it. Reset your index and try again.
Edit: to expand, you commonly see this error when trying to reindex a DatetimeIndex, e.g. you have an hourly index and want to convert it to a second-resolution index, or fill in missing hours. You use reindex, but it fails if your index has duplicate values. I'd guess that is what is happening here: the dataframes produced by the function being applied have duplicate index values, and the error comes from trying to produce the output by calling reindex on a DatetimeIndex with duplicates. Resetting the index works because your index is now all unique, and the timestamp column is not important to this operation.
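A minimal sketch of that failure mode (the exact wording of the error varies between pandas versions):

import pandas as pd

s = pd.Series([1, 2, 3],
              index=pd.to_datetime(['2015-05-01', '2015-05-01', '2015-05-02']))

# Reindexing an axis that already contains duplicate labels raises, e.g.
# ValueError: cannot reindex from a duplicate axis
s.reindex(pd.date_range('2015-05-01', periods=3, freq='D'))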
I have a pandas dataframe. I want to check the value in a particular column and create a flag column based on whether it is null or not.
df_have:
A B
1
2 X
df_want
A B B_Available
1 N
2 X Y
I did:
def chkAvail(row):
    return (pd.isnull(row['B']) == False)

if (df_have.apply(lambda row: chkAvail(row), axis=1)):
    df_want['B_Available'] = 'Y'
I got:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What did I do wrong?
You can use
df['B_available'] = df.B.notnull().map({False: 'N', True:'Y'})
This works if the blank values are NaN or None. If they are whitespace, do
df['B_available'] = (df.B != ' ').map({False: 'N', True:'Y'})
Using if on a Series is not a good idea, because there might be many True and False values in the Series. E.g., what would if pd.Series([True, False, True, True]) mean? It makes no sense ;)
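Applied to the frame in the question (assuming the blank value is NaN), a quick sketch:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2], 'B': [np.nan, 'X']})
df['B_Available'] = df.B.notnull().map({False: 'N', True: 'Y'})
#    A    B B_Available
# 0  1  NaN           N
# 1  2    X           Y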
You can also use np.select:
# In case blank values are NaN
df['B_Available'] = np.select([df.B.isnull()], ['N'], 'Y')
# In case blank values are empty strings:
df['B_Available'] = np.select([df.B == ''], ['N'], 'Y')
>>> df
A B B_Available
0 1 NaN N
1 2 X Y
By using np.where
df['B_Available']=np.where(df.B.eq(''),'N','Y')
df
Out[86]:
A B B_Available
0 1 N
1 2 X Y