I have a dataframe that looks like the following:
arr = pd.DataFrame([[0,0],[0,1],[0,4],[1,4],[1,5],[1,6],[2,5],[2,8],[2,6]])
My desired output is booleans that represent whether the value in column 2 is in the next consecutive group or not. The groups are represented by the values in column 1. So for example, 4 shows up in group 0 and the next consecutive group, group 1:
output = pd.DataFrame([[False],[False],[True],[False],[True],[True],[np.nan],[np.nan],[np.nan]])
The outputs for group 2 would be NaN because group 3 doesn't exist.
So far I have tried this:
output = arr.groupby([0])[1].isin(arr.groupby([0])[1].shift(periods=-1))
This doesn't work because I can't call isin() on a GroupBy series.
You could create a helper column holding the shifted (next) group's items as lists, then check against it with a function that returns True, False, or NaN:
import pandas as pd
import numpy as np
arr = pd.DataFrame([[0,0],[0,1],[0,4],[1,4],[1,5],[1,6],[2,5],[2,8],[2,6]])
arr = pd.merge(arr, arr.groupby([0]).agg(list).shift(-1).reset_index(), on=[0], how='outer')
def check_columns(row):
    try:
        if row['1_x'] in row['1_y']:
            return True
        else:
            return False
    except TypeError:
        # row['1_y'] is NaN for the last group, which has no following group
        return np.nan
arr.apply(check_columns, axis=1)
Result:
0 False
1 False
2 True
3 False
4 True
5 True
6 NaN
7 NaN
8 NaN
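As a rough alternative sketch that skips the merge (my addition, assuming the same integer column names 0 and 1): collect each group's values into a set, shift so every group points at the next group's set, then test membership per row.
import pandas as pd
import numpy as np

arr = pd.DataFrame([[0,0],[0,1],[0,4],[1,4],[1,5],[1,6],[2,5],[2,8],[2,6]])

# set of column-1 values per group, shifted so each group maps to the NEXT group's values
next_vals = arr.groupby(0)[1].agg(set).shift(-1)

def in_next_group(row):
    vals = next_vals.get(row[0])
    # the last group has no next group, so its lookup is NaN rather than a set
    if not isinstance(vals, set):
        return np.nan
    return row[1] in vals

output = arr.apply(in_next_group, axis=1)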
We can count occurrences of NaN with df.isna().sum().
Is there a similar function to count inf?
This worked for me:
number_inf = df[df == np.inf].count()
Use np.isinf():
df = pd.DataFrame({'data' : [0,0,float('inf'),float('inf')]})
print(df)
data
0 0.0
1 0.0
2 inf
3 inf
df.groupby(np.isinf(df['data'])).count()
data
data
False 2
True 2
You can use isinf() and ravel() in one line:
# Pandas Series.ravel() function returns the flattened underlying data as an ndarray.
np.isinf(df["Col"]).values.ravel().sum()
I have a Dataframe that has list of dates with sales count for each of the days as shown below:
date,count
11/1/2018,345
11/2/2018,100
11/5/2018,432
11/7/2018,500
11/11/2018,555
11/17/2018,754
I am trying to check, of all the sales that were done, how many were done on a weekday. To pull all weekdays in November I am doing the below:
weekday = pd.DataFrame(pd.bdate_range('2018-11-01', '2018-11-30'))
Now I am trying to compare dates in df with value in weekday as below:
df_final = df[df['date'].isin(weekday)]
But the above returns no rows.
You should remove the pd.DataFrame wrapper when creating weekday. When isin is passed a Series or DataFrame, it matches not only the values but also the index and columns; since the original index and columns differ from those of the newly created dataframe weekday, everything comes back False.
df.date=pd.to_datetime(df.date)
weekday = pd.bdate_range('2018-11-01', '2018-11-30')
df_final = df[df['date'].isin(weekday)]
df_final
Out[39]:
date count
0 2018-11-01 345
1 2018-11-02 100
2 2018-11-05 432
3 2018-11-07 500
A simple example that addresses the issue mentioned above:
df=pd.DataFrame({'A':[1,2,3,4,5]})
newdf=pd.DataFrame({'B':[2,3]})
df.isin(newdf)
Out[43]:
A
0 False
1 False
2 False
3 False
4 False
df.isin(newdf.B.tolist())
Out[44]:
A
0 False
1 True
2 True
3 False
4 False
Use a DatetimeIndex and let pandas do the work for you as follows:
# generate some sample sales data for the month of November
df = pd.DataFrame(
{'count': np.random.randint(0, 900, 30)},
index=pd.date_range('2018-11-01', '2018-11-30', name='date')
)
# resample by business day and call `.asfreq()` on the resulting groupby-like object to get your desired filtering
df.resample(rule='B').asfreq()
Other values for the resampling rule can be found here
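If the dates sit in a plain date column rather than in the index, a hedged alternative sketch (my addition; it assumes the column has been parsed with pd.to_datetime) is to test the weekday directly, since dayofweek runs from 0 for Monday to 4 for Friday:
df['date'] = pd.to_datetime(df['date'])

# keep Monday-Friday rows and total their sales
weekday_sales = df[df['date'].dt.dayofweek < 5]
total = weekday_sales['count'].sum()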
I'm looking for a way to determine if a column or set of columns of a pandas dataframe uniquely identifies the rows of that dataframe. I've seen this called the isid function in Stata.
The best I can think of is to get the unique values of a subset of columns using a set comprehension, and then assert that there are as many values in the set as there are rows in the dataframe:
subset = df[["colA", "colC"...]]
unique_vals = {tuple(x) for x in subset.values}
assert(len(unique_vals) == len(df))
This isn't the most elegant answer in the world, so I'm wondering if there's a built-in function that does this, or perhaps a way to test if a subset of columns are a uniquely-valued index.
You could make an index and check its is_unique attribute:
import pandas as pd
df1 = pd.DataFrame([(1,2),(1,2)], columns=list('AB'))
df2 = pd.DataFrame([(1,2),(1,3)], columns=list('AB'))
print(df1.set_index(['A','B']).index.is_unique)
# False
print(df2.set_index(['A','B']).index.is_unique)
# True
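Applied to the subset from the question (using the question's placeholder column names), the same check is a one-liner:
df.set_index(["colA", "colC"]).index.is_unique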
Maybe use groupby with size:
df.groupby(['x','y']).size()==1
Out[308]:
x y
1 a True
2 b True
3 c True
4 d False
dtype: bool
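To collapse that per-group Series into a single answer for the whole frame, a short follow-up:
# True only when every (x, y) combination occurs exactly once
(df.groupby(['x', 'y']).size() == 1).all()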
You can check
df[['x', 'y']].transform(tuple,1).duplicated(keep=False).any()
to see whether any rows are duplicated with respect to the pair of values from columns x and y.
Example:
df = pd.DataFrame({'x':[1,2,3,4,4], 'y': ["a", "b", "c", "d","d"]})
x y
0 1 a
1 2 b
2 3 c
3 4 d
4 4 d
Then the transform gives:
0 (1, a)
1 (2, b)
2 (3, c)
3 (4, d)
4 (4, d)
dtype: object
Then check which rows are duplicated:
0 False
1 False
2 False
3 True
4 True
dtype: bool
Notice that transforming into tuples might not be necessary here:
df.duplicated(keep=False)
0 False
1 False
2 False
3 True
4 True
dtype: bool
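A closely related sketch (my addition, assuming the same df) uses the subset parameter of duplicated, which avoids building tuples altogether:
# True when the pair (x, y) uniquely identifies every row
not df.duplicated(subset=['x', 'y']).any()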
I want to select a subset of rows in a pandas dataframe, based on a particular string column, where the value starts with any number of values in a list.
A small version of this:
df = pd.DataFrame({'a': ['aa10', 'aa11', 'bb13', 'cc14']})
valids = ['aa', 'bb']
So I want just those rows where a starts with aa or bb in this case.
You need str.startswith with a tuple of the valid prefixes:
df.a.str.startswith(tuple(valids))
Out[191]:
0 True
1 True
2 True
3 False
Name: a, dtype: bool
Then filter the original df:
df[df.a.str.startswith(tuple(valids))]
Out[192]:
a
0 aa10
1 aa11
2 bb13
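An equivalent regex-based sketch with str.match, useful if the prefixes might contain characters that need escaping (my addition, not part of the answer above):
import re

# str.match anchors the pattern at the start of each string
pattern = '|'.join(map(re.escape, valids))
df[df.a.str.match(pattern)]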
If you would like to filter those rows for which a string is in a column value, it is possible to use something like data.sample_id.str.contains('hph') (answered before: check if string in pandas dataframe column is in list, or Check if string is in a pandas dataframe).
However, my lookup column contains empty cells. Therefore, str.contains() yields NaN values and I get a ValueError upon indexing:
ValueError: cannot index with vector containing NA / NaN values
What works:
# get all runs
mask = [index for index, item in enumerate(data.sample_id.values) if 'zent' in str(item)]
Is there a more elegant and faster method (similar to str.contains()) than this one?
You can set the parameter na in str.contains to False:
print (df.a.str.contains('hph', na=False))
Using EdChum's sample:
df = pd.DataFrame({'a':['hph', np.NaN, 'sadhphsad', 'hello']})
print (df)
a
0 hph
1 NaN
2 sadhphsad
3 hello
print (df.a.str.contains('hph', na=False))
0 True
1 False
2 True
3 False
Name: a, dtype: bool
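And to actually subset the frame, that mask can be used directly:
df[df.a.str.contains('hph', na=False)]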
IIUC you can also filter those rows out:
data['sample'].dropna().str.contains('hph')
Example:
In [38]:
df =pd.DataFrame({'a':['hph', np.NaN, 'sadhphsad', 'hello']})
df
Out[38]:
a
0 hph
1 NaN
2 sadhphsad
3 hello
In [39]:
df['a'].dropna().str.contains('hph')
Out[39]:
0 True
2 True
3 False
Name: a, dtype: bool
So by calling dropna first, you can then safely use str.contains on the Series, as there will be no NaN values.
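If the goal is to filter the rows of df rather than just inspect the boolean Series, one rough way (my addition, not part of the answer) is to re-align the shorter mask to the full index first:
mask = df['a'].dropna().str.contains('hph')
# rows that were NaN are missing from the mask, so treat them as False
df[mask.reindex(df.index, fill_value=False)]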
Another way to handle the null values would be to use notnull:
In [43]:
(df['a'].notnull()) & (df['a'].str.contains('hph'))
Out[43]:
0 True
1 False
2 True
3 False
Name: a, dtype: bool
But I think passing na=False would be cleaner (see jezrael's answer).