Check if at least one value exists in a pandas dataframe index

How can I check if at least one of the index of df2 is in df1?
df1
Val
StartDate
2015-03-31 NaN
2015-04-03 NaN
2015-04-05 8.08
2015-04-06 23.48
df2
Val
StartDate
2015-03-31 True
2015-04-01 True
2015-04-02 True
2015-04-03 True
2015-04-04 True
2015-04-05 True
2015-04-06 True
df2.index in df1.index
returns False

Use Index.isin with Index.any to check whether at least one value matches:
a = df1.index.isin(df2.index).any()
print (a)
True
Detail:
print (df1.index.isin(df2.index))
[ True True True True]

A list comprehension returns the shared values themselves:
[i for i in df1.index if i in df2.index]
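Equivalently, since only the existence of an overlap matters, a non-empty Index.intersection also answers the question; a minimal sketch:
# a non-empty intersection means at least one shared index value
print(not df1.index.intersection(df2.index).empty)
True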

Related

Working with a multiindex dataframe, to get summation results over a boolean column, based on a condition from another column

We have a multiindex dataframe that looks like:
date condition_1 condition_2
item1 0 2021-06-10 06:30:00+00:00 True False
1 2021-06-10 07:00:00+00:00 False True
2 2021-06-10 07:30:00+00:00 True True
item2 3 2021-06-10 06:30:00+00:00 True False
4 2021-06-10 07:00:00+00:00 True True
5 2021-06-10 07:30:00+00:00 True True
item3 6 2021-06-10 06:30:00+00:00 True True
7 2021-06-10 07:00:00+00:00 False True
8 2021-06-10 07:30:00+00:00 True True
The value of date repeats between items (because the df is a result of a default concat on a dictionary of dataframes).
The logic we basically want to vectorize is "for every date where condition_1 is true for all items: sum the occurrences where condition_2 is true in a new results column for all of them".
The result would basically look like this based on the above example (comments on how it's derived: next to the results column):
date condition_1 condition_2 result
item1 0 2021-06-10 06:30:00+00:00 True False 1 [because condition_1 is True for all items and condition_2 is True once]
1 2021-06-10 07:00:00+00:00 False True 0 [condition_1 is not True for all items so condition_2 is irrelevant]
2 2021-06-10 07:30:00+00:00 True True 3 [both conditions are True for all 3 items]
item2 3 2021-06-10 06:30:00+00:00 True False 1 [a repeat for the same reasons]
4 2021-06-10 07:00:00+00:00 True True 0 [a repeat for the same reasons]
5 2021-06-10 07:30:00+00:00 True True 3 [a repeat for the same reasons]
item3 6 2021-06-10 06:30:00+00:00 True True 1 [a repeat for the same reasons]
7 2021-06-10 07:00:00+00:00 False True 0 [a repeat for the same reasons]
8 2021-06-10 07:30:00+00:00 True True 3 [a repeat for the same reasons]
Here is what I came up with.
def cond_sum(s):
    return s.condition_1.all() * s.condition_2.sum()

df.reset_index(level=0, inplace=True)
df['result'] = df['date'].map(df.groupby('date').apply(cond_sum))
Then if you want the original index, you can add it back:
df = df.set_index('item', append=True).swaplevel()
Note, you mentioned vectorized, so you could swap the apply out for:
dfg = df.groupby('date').agg({'condition_1': 'all', 'condition_2': 'sum'})
df['result'] = df['date'].map(dfg.condition_1 * dfg.condition_2)
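For reference, a minimal end-to-end sketch of the vectorized version; the index level name 'item' and the exact timestamps are assumptions reconstructed from the displayed output:
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('item1', 0), ('item1', 1), ('item1', 2),
     ('item2', 3), ('item2', 4), ('item2', 5),
     ('item3', 6), ('item3', 7), ('item3', 8)], names=['item', None])
dates = pd.to_datetime(['2021-06-10 06:30', '2021-06-10 07:00', '2021-06-10 07:30'] * 3, utc=True)
df = pd.DataFrame({'date': dates,
                   'condition_1': [True, False, True, True, True, True, True, False, True],
                   'condition_2': [False, True, True, False, True, True, True, True, True]},
                  index=idx)

# per-date aggregates, mapped back onto every row sharing that date
dfg = df.groupby('date').agg({'condition_1': 'all', 'condition_2': 'sum'})
df['result'] = df['date'].map(dfg.condition_1 * dfg.condition_2)
print(df['result'].tolist())  # [1, 0, 3, 1, 0, 3, 1, 0, 3]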

Consolidating columns by the number before the decimal point in the column name

I have the following dataframe (three example columns below):
import pandas as pd
array = {'25.2': [False, True, False], '25.4': [False, False, True], '27.78': [True, False, True]}
df = pd.DataFrame(array)
25.2 25.4 27.78
0 False False True
1 True False False
2 False True True
I want to create a new dataframe with consolidated column names, i.e. combine 25.2 and 25.4 into a new 25 column. If any of the original columns is True, the value in the new column should be True.
Expected output:
25 27
0 False True
1 True False
2 True True
Any ideas?
Use rename() + groupby() + sum():
df = (df.rename(columns=lambda x: x.split('.')[0])
        .groupby(axis=1, level=0).sum().astype(bool))
Or in 2 steps:
df.columns = [x.split('.')[0] for x in df]
# OR
# df.columns = df.columns.str.replace(r'\.\d+', '', regex=True)
df = df.groupby(axis=1, level=0).sum().astype(bool)
output:
25 27
0 False True
1 True False
2 True True
Note: if the column labels are numeric rather than strings, truncate them with int()/np.floor() instead of split() (round() would turn 27.78 into 28).
Another way:
>>> df.T.groupby(np.floor(df.columns.astype(float))).sum().astype(bool).T
25.0 27.0
0 False True
1 True False
2 True True
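Since the consolidation is really a boolean OR across the grouped columns, groupby's any() states that intent directly; a small sketch, starting from the original df, that is equivalent to the sum().astype(bool) versions above:
# group the transposed frame by the integer part of each column label, then OR
out = df.T.groupby(df.columns.str.split('.').str[0]).any().T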

How to get the count of non-duplicates in a column

This is my code to get the duplicates; how do I negate it to count the non-duplicates instead?
df.duplicated(subset='col', keep='last').sum()
len(df['col'])-len(df['col'].drop_duplicates())
I think you need DataFrame.duplicated with keep=False to mark all duplicates, then invert the mask and sum to count the True values:
df = pd.DataFrame({'col':[1,2,2,3,3,3,4,5,5]})
print (df.duplicated(subset='col', keep=False))
0 False
1 True
2 True
3 True
4 True
5 True
6 False
7 True
8 True
dtype: bool
print (~df.duplicated(subset='col', keep=False))
0 True
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
dtype: bool
print ((~df.duplicated(subset='col', keep=False)).sum())
2
Another solution uses Series.drop_duplicates with keep=False and takes the length of the resulting Series:
print (df['col'].drop_duplicates(keep=False))
0 1
6 4
Name: col, dtype: int64
print (len(df['col'].drop_duplicates(keep=False)))
2
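A value_counts-based sketch gives the same answer by counting the values that occur exactly once:
print((df['col'].value_counts() == 1).sum())
2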

How to parse data from time period to periodic time series with Pandas

I'm currently trying to parse data, which is given with a start and an end date, from one dataframe into a second dataframe that has a periodic DatetimeIndex.
df1 is my input dataframe, and I would like to transform it into the structure of df2.
I don't actually need the values themselves, I just want to mark the periods where they occur.
df1
Start End Value1 Value2
1 2018-01-02 15:20 2018-01-02 19:50 x NaN
2 2018-03-21 05:40 2018-03-22 11:20 a b
3 ...
df2
Value1 Value2
2018-01-02 15:10 False False
2018-01-02 15:20 True False
2018-01-02 15:30 True False
2018-01-02 15:40 True False
...
2018-01-02 19:50 True False
2018-01-02 20:00 False False
I got already the structure for df2, but I couldn't figure out how to transform the data.
date_rng=pd.date_range(start='2018-01-01', end='2018-12-31', freq='10min')
df2=pd.DataFrame(date_rng, columns=['Date'])
df2['datetime'] = pd.to_datetime(df2['Date'])
df2 = df2.set_index('datetime')
df2.drop(['Date'], axis=1, inplace=True)
Can anyone help? Many thanks!
You could initialize df2 with all values set to False, then iterate over the intervals in df1 and flag every row of df2 whose timestamp falls within one or more of them.
Here is a working example:
import numpy as np
import pandas as pd

# | create some dummy data
data = [{'Start': '2018-01-02 15:20', 'End': '2018-01-02 19:50', 'Value1': 'x', 'Value2': np.nan},
        {'Start': '2018-01-01 00:00:00', 'End': '2018-01-01 00:15:00', 'Value1': 'a', 'Value2': np.nan}]
df1 = pd.DataFrame(data)
df1['Start'] = pd.to_datetime(df1['Start'])
df1['End'] = pd.to_datetime(df1['End'])
date_rng = pd.date_range(start='2018-01-01', end='2018-12-31', freq='10min')
df2 = pd.DataFrame(index=date_rng)
df2.index.name = 'Date'
# | initialize all values with False
df2['Value1'] = False
df2['Value2'] = False
# | iterate over the intervals and flag the timestamps they cover
# | (write through df2.loc so the changes stick; assigning to the row
# | copies yielded by iterrows would leave df2 untouched)
for _, row_1 in df1.iterrows():
    in_range = (df2.index >= row_1['Start']) & (df2.index <= row_1['End'])
    if not pd.isnull(row_1['Value1']):
        df2.loc[in_range, 'Value1'] = True
    if not pd.isnull(row_1['Value2']):
        df2.loc[in_range, 'Value2'] = True
output:
Value1 Value2
Date
2018-01-01 00:00:00 True False
2018-01-01 00:10:00 True False
2018-01-01 00:20:00 False False
2018-01-01 00:30:00 False False
.
.
.
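If the loop is too slow for a full year of 10-minute stamps, the same logic can be written as a single numpy broadcast over all intervals at once; a sketch, assuming df1 and df2 are built as above:
import numpy as np

ts = df2.index.values                  # (n,) array of timestamps
starts = df1['Start'].values[:, None]  # (m, 1)
ends = df1['End'].values[:, None]      # (m, 1)
hit = (ts >= starts) & (ts <= ends)    # (m, n) interval membership matrix
df2['Value1'] = (hit & df1['Value1'].notna().values[:, None]).any(axis=0)
df2['Value2'] = (hit & df1['Value2'].notna().values[:, None]).any(axis=0)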

Replace a string value with NaN in a pandas data frame

I have to replace the value '?' with NaN so I can invoke the .isnull() method. I have found several solutions, but some errors are always returned. Suppose:
data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
and if I try:
data.replace('?', np.nan)
I have:
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
but data.isnull() returns:
0 1 2
0 False False False
1 False False False
2 False False False
Why?
I think you forgot to assign the result back:
data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
data = data.replace('?', np.nan)
#alternative
#data.replace('?', np.nan, inplace=True)
print (data)
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
print (data.isnull())
0 1 2
0 False True False
1 True True False
2 True False False
# a dataframe with string values
dat = pd.DataFrame({'a':[1,'FG', 2, 4], 'b':[2, 5, 'NA', 7]})
Removing non-numerical elements from the dataframe:
# Method 1 - with regex
dat2 = dat.replace(r'^([A-Za-z]|[0-9]|_)+$', np.nan, regex=True)
dat2
# Method 2 - with pd.to_numeric
dat3 = pd.DataFrame()
for col in dat.columns:
    dat3[col] = pd.to_numeric(dat[col], errors='coerce')
dat3
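The column loop in Method 2 can also be collapsed into a single apply, since DataFrame.apply forwards keyword arguments to the applied function:
# same result as the loop above
dat3 = dat.apply(pd.to_numeric, errors='coerce')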
'?' is not null, so you should expect isnull to return False:
>>> data = pandas.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
>>> data.isnull()
0 1 2
0 False False False
1 False False False
2 False False False
After you replace '?' with np.nan, the test looks quite different:
>>> data = data.replace('?', np.nan)
>>> data.isnull()
0 1 2
0 False True False
1 True True False
2 True False False
I believe that when you do data.replace('?', np.nan) the action is not done in place, so you must assign the result back:
data = data.replace('?', np.nan)