I have a list of conditions to run on a dataset in order to filter huge data.
df = a huge DataFrame, e.g.:
Index D1 D2 D3 D5 D6
0 8 5 0 False True
1 45 35 0 True False
2 35 10 1 False True
3 40 5 2 True False
4 12 10 5 False False
5 18 15 13 False True
6 25 15 5 True False
7 35 10 11 False True
8 95 50 0 False False
I have to filter the above df based on the given orders:
orders = [[A, B],[D, ~E, B], [~C, ~A], [~C, A]...]
# (where A, B, C, D, E are the conditions)
e.g.
A = df['D1'].le(50)
B = df['D2'].ge(5)
C = df['D3'].ne(0)
D = df['D1'].ne(False)
E = df['D1'].ne(True)
# In the real scenario, I have 64 such conditions to be run on 5 million records.
I have to run all these conditions to get the resultant output. What is the easiest way to achieve the following, applying the orders with a for loop, map, or .apply?
df = df.loc[A & B]
df = df.loc[D & ~E & B]
df = df.loc[~C & ~A]
df = df.loc[~C & A]
The resultant df would be my expected output.
Here I am more interested in knowing how you would use a loop, map, or .apply to run multiple conditions stored in a list, rather than in the resultant output itself. Something like:
for i in orders:
    df = df[all(i)] # I am not able to implement this logic for each order
You are looking for the bitwise AND of all the elements inside orders. In which case:
df = df[np.concatenate(orders).all(0)]
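If you specifically want to loop over the orders list, mirroring the sequential df.loc[...] lines above, one sketch is to AND the conditions inside each order with functools.reduce. The sample data and the reduced set of orders below are only assumptions for illustration:

from functools import reduce
import pandas as pd

df = pd.DataFrame({'D1': [8, 45, 35, 40, 12, 18, 25, 35, 95],
                   'D2': [5, 35, 10, 5, 10, 15, 15, 10, 50],
                   'D3': [0, 0, 1, 2, 5, 13, 5, 11, 0]})

A = df['D1'].le(50)
B = df['D2'].ge(5)
C = df['D3'].ne(0)

orders = [[A, B], [~C, A]]  # each inner list is one "order" (a subset of the full list, for illustration)

# AND the conditions inside each order, then filter sequentially;
# .loc aligns each pre-computed mask on df's (shrinking) index.
for conditions in orders:
    df = df.loc[reduce(lambda x, y: x & y, conditions)]

print(df)  # for this subset of orders, rows 0 and 1 remain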
This is a DataFrame having a column 'customer' with repetitive values:
df=pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10],'customer':['a','b','c','b','b','b','d','e','e','f'],'address':['xx','yy','rr','yy','oo','ee','vv','zz','nn','cc']})
I want the values that repeat more than 3 times.
df.groupby(['customer']).count()>3
result==>
In the result I am getting boolean values:
id address
customer
a False False
b True True
c False False
d False False
e False False
f False False
expected result==>
id customer address
1 2 b yy
You can GroupBy.filter() the dataframe and then .drop_duplicates() by the "customer" column:
x = (
    df.groupby("customer")
    .filter(lambda x: len(x) > 3)
    .drop_duplicates("customer")
)
print(x)
Prints:
id customer address
1 2 b yy
You can use groupby.transform and boolean indexing:
df[df.groupby('customer')['customer'].transform('count').gt(3)]
Output:
id customer address
1 2 b yy
3 4 b yy
4 5 b oo
5 6 b ee
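If you also want just one row per customer, as in the expected result, you could presumably chain drop_duplicates onto the same boolean indexing:

df[df.groupby('customer')['customer'].transform('count').gt(3)].drop_duplicates('customer')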
Fix your code with isin
s = df.groupby(['customer'])['id'].count()>3
out = df.loc[df['customer'].isin(s[s].index)]
Out[389]:
id customer address
1 2 b yy
3 4 b yy
4 5 b oo
5 6 b ee
I have a DataFrame consisting of medical data where the columns are ["Patient_ID", "Code", "Date"], where "Code" just represents some medical interaction that patient "Patient_ID" had on "Date". Any patient will generally have more than one row, since they have more than one interaction. I want to apply two types of filtering to this data.
Remove any patients who have fewer than some min_len interactions
To each patient apply a half-overlapping, sliding window of length T days. Within each window keep only the first of any duplicate codes, and then shuffle the codes within the window
So I need to modify subsets of the overall dataframe, but the modification involves changing the size of the subset. I have both of these implemented as part of a larger pipeline; however, they are a significant bottleneck in terms of time. I'm wondering if there's a more efficient way to achieve the same thing, as I really just threw together what worked and I'm not too familiar with the efficiency of pandas operations. Here is how I have them currently:
import datetime as dt

import numpy as np
import pandas as pd
from tqdm import tqdm


def Filter_by_length(df, min_len = 1):
    # Keep only patients (IDs) with at least min_len rows.
    print("Filtering short sequences...")
    df = df.sort_values(axis = 0, by = ['ID', 'DATE']).copy(deep = True)
    new_df = []
    for sub_df in tqdm((df[df.ID == sub] for sub in df.ID.unique()), total = len(df.ID.unique()), miniters = 1):
        if len(sub_df) >= min_len:
            new_df.append(sub_df.copy(deep = True))
    if len(new_df) != 0:
        df = pd.concat(new_df, sort = False)
    else:
        df = pd.DataFrame({})
    print("Done")
    return df

def shuffle_col(df, col):
    # Shuffle a single column and return the frame.
    df[col] = np.random.permutation(df[col])
    return df

def Filter_by_redundancy(df, T, min_len = 1):
    # Apply a half-overlapping sliding window of T days per patient,
    # dropping duplicate codes and shuffling codes within each window,
    # then apply the same min_len filter as above.
    print("Filtering redundant concepts and short sequences...")
    df = df.sort_values(axis = 0, by = ['ID', 'DATE']).copy(deep = True)
    new_df = []
    for sub_df in tqdm((df[df.ID == sub] for sub in df.ID.unique()), total = len(df.ID.unique()), miniters = 1):
        start_date = sub_df.DATE.min()
        end_date = sub_df.DATE.max()
        next_date = start_date + dt.timedelta(days = T)
        while start_date <= end_date:
            sub_df = pd.concat([sub_df[sub_df.DATE < start_date],
                                shuffle_col(sub_df[(sub_df.DATE <= next_date) & (sub_df.DATE >= start_date)]
                                            .drop_duplicates(subset = ['CODE']), "CODE"),
                                sub_df[sub_df.DATE > next_date]], sort = False)
            start_date += dt.timedelta(days = int(T/2))
            next_date += dt.timedelta(days = int(T/2))
        if len(sub_df) >= min_len:
            new_df.append(sub_df.copy(deep = True))
    if len(new_df) != 0:
        df = pd.concat(new_df, sort = False)
    else:
        df = pd.DataFrame({})
    print("Done")
    return df
As you can see, in the second case I am actually applying both filters, because it is important to have the option to apply both together or either one on its own, but I am interested in any performance improvement that can be made to either one or both.
For the first part, instead of counting in your group-by like that, I would use this approach:
>>> d = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'q': [np.random.randint(1, 15, size=np.random.randint(1, 5)) for _ in range(5)]}).explode('q')
id q
0 1 1
0 1 9
1 2 9
1 2 10
1 2 4
2 3 3
2 3 6
2 3 2
2 3 10
3 4 11
3 4 5
4 5 5
4 5 6
4 5 3
4 5 2
>>> sizes = d.groupby('id').size()
>>> d[d['id'].isin(sizes[sizes >= 3].index)] # index is list of IDs meeting criteria
id q
1 2 9
1 2 10
1 2 4
2 3 3
2 3 6
2 3 2
2 3 10
4 5 5
4 5 6
4 5 3
4 5 2
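Applied to your actual frame (with the ID column and the min_len parameter from your functions), the same pattern presumably reduces Filter_by_length to two lines and avoids the per-patient Python loop:

sizes = df.groupby('ID').size()
df = df[df['ID'].isin(sizes[sizes >= min_len].index)]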
I'm not sure why you want to shuffle your codes within some window. To avoid an X-Y problem, what are you in fact trying to do there?
I'm trying to subset a pandas df by removing rows that fall between specific values. The problem is these values can be at different rows so I can't select fixed rows.
Specifically, I want to remove rows that fall between ABC xxx and the integer 5. These values could fall anywhere in the df and be of unequal length.
Note: The string ABC will be followed by different values.
I thought about returning all the indexes that contain these two values.
But a mask might work better if I could return all rows between these two values.
df = pd.DataFrame({
'Val' : ['None','ABC','None',1,2,3,4,5,'X',1,2,'ABC',1,4,5,'Y',1,2],
})
mask = (df['Val'].str.contains(r'ABC(?!$)')) & (df['Val'] == 5)
Intended Output:
Val
0 None
8 X
9 1
10 2
15 Y
16 1
17 2
If ABC always comes before 5 and they always form (ABC, 5) pairs: get the indices of both values with np.where, zip them and generate the index values in between, then filter with isin and invert the mask with ~:
#2 values of ABC, 5 in data
df = pd.DataFrame({
    'Val' : ['None','ABC','None',1,2,3,4,5,'X',1,2,
             'None','ABC','None',1,2,3,4,5,'X',1,2]
})
m1 = np.where(df['Val'].str.contains(r'ABC', na=False))[0]
m2 = np.where(df['Val'] == 5)[0]
print (m1)
[ 1 12]
print (m2)
[ 7 18]
idx = [x for y, z in zip(m1, m2) for x in range(y, z + 1)]
print (df[~df.index.isin(idx)])
Val
0 None
8 X
9 1
10 2
11 None
19 X
20 1
21 2
a = df.index[df['Val'].str.contains('ABC')==True][0]
b = df.index[df['Val']==5][0]+1
c = np.array(range (a,b))
bad_df = df.index.isin(c)
df[~bad_df]
Output
Val
0 None
8 X
9 1
10 2
If there is more than one 'ABC' and 5, then use the version below.
With this you get the df excluding everything from the first ABC to the last 5:
a = (df['Val'].str.contains('ABC')==True).idxmax()
b = df['Val'].where(df['Val']==5).last_valid_index()+1
c = np.array(range (a,b))
bad_df = df.index.isin(c)
df[~bad_df]
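Another option, assuming the ABC and 5 markers always alternate in that order, is a cumulative-sum mask, which avoids building explicit index ranges (shift with fill_value needs pandas 0.24+):

starts = df['Val'].str.contains('ABC', na=False)  # na=False so non-string rows become False
ends = df['Val'].eq(5)

# A row is inside an ABC..5 block while the number of starts seen so far
# exceeds the number of ends seen before the current row; this also drops
# the closing 5 itself.
inside = starts.cumsum() > ends.shift(fill_value=False).cumsum()
df[~inside]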
I have a DataFrame like this:
RTD Val
BA 2
BA 88
BA 15
BA 67
BA 83
BA 77
BA 79
BA 90
BA 1
BA 14
First:
df['count'] = df.Val > 15
print(df)
I get as a result:
RTD Val count
0 BA 2 False
1 BA 88 True
2 BA 15 False
3 BA 67 True
4 BA 83 True
5 BA 77 True
6 BA 79 True
7 BA 90 True
8 BA 1 False
9 BA 14 False
Now, to count the maximum consecutive occurrences I use:
def rolling_count(val):
    if val == rolling_count.previous:
        rolling_count.count += 1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count

rolling_count.count = 0  # static variable
rolling_count.previous = None  # static variable

ddf = df['count'].apply(rolling_count)
print (max(ddf))
I get as result: 5.
My question is:
How should I count the maximum number of consecutive occurrences of False?
The correct value is equal to 2.
I am interested in knowing the maximum number of consecutive occurrences both for Val > 15 (True) and, conversely, for False.
Here is a longer method that coerces count to be an integer rather than a boolean by adding 0. The absolute difference indicates changes in the boolean value, and the first value is filled with 1.
This change Series is evaluated as to whether elements are greater than 0 in the 'bools' variable, and the corresponding elements from df['count'] are extracted.
The change vector is also used with cumsum to form run IDs, which are used in a groupby; the counts within each run ID are then constructed in the 'runs' variable.
countDf = DataFrame({'bools': list(df['count'][(df['count'] + 0)
                                   .diff().abs().fillna(1) > 0]),
                     'runs': list(df['Val'].groupby((df['count'] + 0)
                                  .diff().abs().fillna(1).cumsum()).count())})
countDf
bools runs
0 False 1
1 True 1
2 False 1
3 True 5
4 False 2
You can extract the maximum runs using standard subsetting like
countDf[countDf.bools == False]['runs'].max()
2
countDf[countDf.bools == True]['runs'].max()
5
This is my attempt
gt15 = df.Val.gt(15)
counts = df.groupby([gt15, (gt15 != gt15.shift()) \
                     .cumsum()]).size().rename_axis(['>15', 'grp'])
counts
>15 grp
False 1 1
3 1
5 2
True 2 1
4 5
dtype: int64
counts.loc[False].max()
2
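For the converse, the longest run of True (Val > 15), the same counts Series gives it directly:

counts.loc[True].max()
5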
It seems DataFrame.le doesn't operate in a column-wise fashion.
df = DataFrame(randn(8,12))
series=Series(rand(8))
df.le(series)
I would expect that each column in df is compared to the series (so 12 column-vs-series comparisons in total, i.e. 12 columns * 8 rows of element comparisons). But it appears that each element in df is compared against every element in series, which would involve 12 (columns) * 8 (rows) * 8 (elements in series) comparisons. How can I achieve column-by-column comparison?
The second question is: once I am done with the column-wise comparison, I want to count, for each row, how many True values there are. I am currently doing astype(int32) to turn bool into int and then sum; does this sound reasonable?
Let me give an example for the first question to show what I mean (using a smaller example, since showing 8*12 is unwieldy):
>>>from pandas import *
>>>from numpy.random import *
>>>df = DataFrame(randn(2,5))
>>>t = DataFrame(randn(2,1))
>>>df
0 1 2 3 4
0 -0.090283 1.656517 -0.183132 0.904454 0.157861
1 1.667520 -1.242351 0.379831 0.672118 -0.290858
>>>t
0
0 1.291535
1 0.151702
>>>df.le(t)
0 1 2 3 4
0 True False False False False
1 False False False False False
What I expect for df's column 1 is:
1
False
True
Because 1.656517 <= 1.291535 is False and -1.242351 <= 0.151702 is True; that is the column-wise comparison I want. However, the printout is False, False.
I'm not sure I understand the first part of your question, but as to the second part, you can count the Trues in a boolean DataFrame using sum:
In [11]: df.le(s).sum(axis=0)
Out[11]:
0 4
1 3
2 7
3 3
4 6
5 6
6 7
7 6
8 0
9 0
10 0
11 0
dtype: int64
Essentially le is testing for each column:
In [21]: df[0] < s
Out[21]:
0 False
1 True
2 False
3 False
4 True
5 True
6 True
7 True
dtype: bool
Which for each index is testing:
In [22]: df[0].loc[0] < s.loc[0]
Out[22]: False
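If the goal in the first part is to compare each of the 12 columns against the 8-element series aligned on the row index, passing axis=0 to le should do that, and the per-row count of True values then comes from sum without any astype cast. A minimal sketch using the df and series from the question:

result = df.le(series, axis=0)     # align the series on the index; compare every column against it
true_per_row = result.sum(axis=1)  # booleans sum directly to the count of True per row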