Pandas dataframe: count max consecutive values - pandas

I have a DataFrame like this:
RTD Val
BA 2
BA 88
BA 15
BA 67
BA 83
BA 77
BA 79
BA 90
BA 1
BA 14
First:
df['count'] = df.Val > 15
print(df)
I get as a result:
RTD Val count
0 BA 2 False
1 BA 88 True
2 BA 15 False
3 BA 67 True
4 BA 83 True
5 BA 77 True
6 BA 79 True
7 BA 90 True
8 BA 1 False
9 BA 14 False
Now, to count the maximum consecutive occurrences I use:
def rolling_count(val):
    if val == rolling_count.previous:
        rolling_count.count += 1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count

rolling_count.count = 0        # static variable
rolling_count.previous = None  # static variable

ddf = df['count'].apply(rolling_count)
print(max(ddf))
The result I get is 5.
My question is: how should I count the maximum number of consecutive occurrences of False? The correct value here is 2.
In other words, I want the maximum run of consecutive occurrences for each case: Val > 15 (True) and the opposite (False).

Here is a longer method that coerces count to an integer rather than a boolean by adding 0. The absolute difference then indicates changes in the boolean value, and the first value is filled with 1.
This change Series is checked for elements greater than 0, and the corresponding elements of df['count'] are extracted into the 'bools' column.
The change vector is also fed through cumsum to form run IDs, which are used in a groupby; the count of each ID becomes the 'runs' column.
from pandas import DataFrame

countDf = DataFrame({'bools': list(df['count'][(df['count'] + 0)
                                   .diff().abs().fillna(1) > 0]),
                     'runs': list(df['Val'].groupby((df['count'] + 0)
                                   .diff().abs().fillna(1).cumsum()).count())})
countDf
bools runs
0 False 1
1 True 1
2 False 1
3 True 5
4 False 2
You can extract the maximum runs using standard subsetting, like:
countDf[countDf.bools == False]['runs'].max()
2
countDf[countDf.bools == True]['runs'].max()
5
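Equivalently, grouping the runs by the boolean value gives both maxima at once (a small aside using the countDf built above, not part of the original answer):
countDf.groupby('bools')['runs'].max()
bools
False    2
True     5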

This is my attempt:
gt15 = df.Val.gt(15)
counts = df.groupby([gt15, (gt15 != gt15.shift()) \
.cumsum()]).size().rename_axis(['>15', 'grp'])
counts
>15    grp
False  1      1
       3      1
       5      2
True   2      1
       4      5
dtype: int64
counts.loc[False].max()
2
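For the True side, the same indexing on the counts Series above gives the longest run of values greater than 15 (a small usage note, assuming the counts object from this answer):
counts.loc[True].max()
5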

Related

Multiple conditions on pandas dataframe

I have a list of conditions to be run on the dataset to filter a huge amount of data.
df is a huge DataFrame, e.g.:
Index D1 D2 D3 D5 D6
0 8 5 0 False True
1 45 35 0 True False
2 35 10 1 False True
3 40 5 2 True False
4 12 10 5 False False
5 18 15 13 False True
6 25 15 5 True False
7 35 10 11 False True
8 95 50 0 False False
I have to filter the above df based on the given orders:
orders = [[A, B],[D, ~E, B], [~C, ~A], [~C, A]...]
#(where A, B, C , D, E are the conditions)
eg.
A = df['D1'].le(50)
B = df['D2'].ge(5)
C = df['D3'].ne(0)
D = df['D1'].ne(False)
E = df['D1'].ne(True)
# In the real scenario, I have 64 such conditions to be run on 5 million records.
I have to run all these conditions to get the resultant output. What is the easiest way to achieve this, applying them in order using a for loop, map, or .apply? E.g.:
df = df.loc[A & B]
df = df.loc[D & ~E & B]
df = df.loc[~C & ~A]
df = df.loc[~C & A]
Resultant df would be my expected output.
Here I am more interested in knowing how you would use a loop, map, or .apply to run multiple conditions stored in a list, not in the resultant output itself.
such as:
for i in orders:
    df = df[all(i)]  # I am not able to implement this logic for each order
You are looking for a bitwise AND of all the elements inside orders. In that case:
import numpy as np

df = df[np.concatenate(orders).all(0)]
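If you specifically want the loop form from the question, here is a minimal sketch of the same idea, assuming df and orders are as defined above and each entry of orders is a list of boolean Series aligned with df's index; it ANDs everything into one mask and subsets once, which selects the same rows as the expression above:
import pandas as pd

mask = pd.Series(True, index=df.index)
for group in orders:          # e.g. [A, B] or [D, ~E, B]
    for cond in group:        # AND every condition in the group
        mask &= cond
df = df[mask]                 # subset once at the end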

Iterate over every row and compare a column value of a dataframe

I have the following dataframe. I want to iterate over every row and compare the score column to check whether the value is >= each value present in the cut_off list.
seq score status
7 TTGTTCTCTGTGTATTTCAGGCT 10.42 positive
56 CAGGTGAGA 9.22 positive
64 AATTCCTGTGGACTTTCAAGTAT 1.23 positive
116 AAGGTATAT 7.84 positive
145 AAGGTAATA 8.49 positive
172 TGGGTAGGT 6.86 positive
204 CAGGTAGAG 7.10 positive
214 GCGTTTCTTGAATCCAGCAGGGA 3.58 positive
269 GAGGTAATG 8.73 positive
274 CACCCATTCCTGTACCTTAGGTA 8.96 positive
325 GCCGTAAGG 5.46 positive
356 GAGGTGAGG 8.41 positive
cut_off = range(0, 11)
The code I tried so far is:
cutoff_list_pos = []
number_list_pos = []
cut_off = range(0, int(new_df['score'].max()) + 1)
for co in cut_off:
    for df in df_elements:
        val = (df['score'] >= co).value_counts()
        cutoff_list_pos.append(co)
        number_list_pos.append(val)
The desired output is:
cutoff true false
0 0 12.0 0
1 1 12.0 0
and so on..
If the score is >= the value in cut_off, the row should be counted as true, otherwise as false.
You can use the keys parameter of concat with the values of cutoff_list_pos, then transpose and convert the index to a column with DataFrame.reset_index:
df = (pd.concat(number_list_pos, axis=1, keys=cutoff_list_pos, sort=False)
        .T
        .rename_axis('cutoff')
        .reset_index())
Another pandas implementation:
res_df = pd.DataFrame(columns=['cutoff', 'true'])
for i in range(1, int(df['score'].max() + 1)):
    temp_df = pd.DataFrame(data={'cutoff': i, 'true': (df['score'] >= i).sum()}, index=[i])
    res_df = pd.concat([res_df, temp_df])
res_df
cutoff true
1 1 12
2 2 11
3 3 11
4 4 10
5 5 10
6 6 9
7 7 8
8 8 6
9 9 2
10 10 1
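If you also need the false column from the desired output, a minimal sketch along the same lines (assuming df is the table of positives shown above) could be:
import pandas as pd

cut_off = range(0, int(df['score'].max()) + 1)
res_df = pd.DataFrame({
    'cutoff': list(cut_off),
    'true':  [(df['score'] >= co).sum() for co in cut_off],   # rows at or above the cutoff
    'false': [(df['score'] <  co).sum() for co in cut_off],   # rows below the cutoff
})
res_df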

Replace values based on index pandas

I'm working with a dataset in which a subset of observations has initial values and final values. I created an id that lets me identify those observations, so after applying this:
df['aux']=df.duplicated(subset=['id'], keep=False)
df_dup = df[df.aux == True]
df_dup.sort_values(by='id').reset_index(inplace=True)
I get something like this:
index id status value
88 1 'initial' 8
95 1 'final' 12
63 2 'initial' 9
52 2 'final' 13
What I want to achieve is to copy the final value into the initial row:
index id status value
88 1 'initial' 12
95 1 'final' 12
63 2 'initial' 13
52 2 'final' 13
I tried several things, my last attempt was this:
df_dup[df_dup.status=='initial'].reset_index().value= \
df_dup[df_dup.status=='final'].reset_index().value
But that fills initial values with nan:
index id status value
88 1 'initial' nan
95 1 'final' 12
63 2 'initial' nan
52 2 'final' 13
What am I missing?
Thanks
Use GroupBy.transform with 'last' - it also replaces values for ids that occur only once, but for those it returns the same value:
df['value'] = df.groupby('id')['value'].transform('last')
print (df)
index id status value
0 88 1 'initial' 12
1 95 1 'final' 12
2 63 2 'initial' 13
3 52 2 'final' 13
If you want to replace only the duplicated id rows (useful when there are many unique ids, for better performance):
mask = df.duplicated(subset=['id'], keep=False)
df.loc[mask, 'value'] = df[mask].groupby('id')['value'].transform('last')
Without groupby, based on drop_duplicates:
df.value=df.id.map(df.drop_duplicates('id',keep='last').set_index('id').value)
df
Out[436]:
index id status value
0 88 1 'initial' 12
1 95 1 'final' 12
2 63 2 'initial' 13
3 52 2 'final' 13
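The same map idea written out step by step, in case the one-liner is hard to read (a sketch, assuming df has the id, status and value columns shown above):
# keep the last row per id (the 'final' row) and turn it into an id -> value lookup
last_per_id = df.drop_duplicates('id', keep='last').set_index('id')['value']
# every row, initial or final, gets the last value of its id
df['value'] = df['id'].map(last_per_id)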

Groupby Apply Filter without Lambda

Let's say I have this data:
data = {
'batch_no': [42, 42, 52, 52, 52, 73],
'quality': ['OK', 'NOT OK', 'OK', 'NOT OK', 'NOT OK', 'OK'],
}
df = pd.DataFrame(data, columns = ['batch_no', 'quality'])
This gives me the following dataframe
batch_no quality
42 OK
42 NOT OK
52 OK
52 NOT OK
52 NOT OK
73 OK
Now I need to find the count of NOT OK for each batch_no.
I can achieve this using groupby and apply with a lambda function as follows:
df.groupby('batch_no')['quality'].apply(lambda x: x[x.eq('NOT OK')].count())
This gives me the following desired output
batch_no
42 1
52 2
73 0
However, this is extremely slow even on my moderately sized data of around 3 million rows, and it is not feasible for my needs.
Is there a fast alternative to this?
You can compare the quality column, then group by batch_no and aggregate with sum; True values are treated as 1, so this counts them:
df = (df['quality'].eq('NOT OK')
        .groupby(df['batch_no']).sum()
        .astype(int)
        .reset_index(name='count'))
print (df)
batch_no count
0 42 1
1 52 2
2 73 0
Detail:
print (df['quality'].eq('NOT OK'))
0 False
1 True
2 False
3 True
4 True
5 False
Name: quality, dtype: bool
You could use
In [77]: df.quality.eq('NOT OK').groupby(df.batch_no).sum()
Out[77]:
batch_no
42 1.0
52 2.0
73 0.0
Name: quality, dtype: float64
Using pd.factorize and np.bincount
f, u = pd.factorize(df.batch_no)
pd.Series(np.bincount(f, df.quality.eq('NOT OK')).astype(int), u)
42 1
52 2
73 0
dtype: int64
Incorporating 'OK' (inspired by Wen)
i, r = pd.factorize(df.batch_no)
j = df.quality.eq('NOT OK')
pd.DataFrame(
    np.bincount(i * 2 + j, minlength=len(r) * 2).reshape(len(r), -1),
    r, ['OK', 'NOT OK']
)
OK NOT OK
42 1 1
52 1 2
73 1 0
This will provide all the value counts:
df.groupby('batch_no').quality.value_counts().unstack(fill_value=0)
Out[231]:
quality NOT OK OK
batch_no
42 1 1
52 2 1
73 0 1
Using crosstab
pd.crosstab(df.batch_no,df.quality)
Out[242]:
quality NOT OK OK
batch_no
42 1 1
52 2 1
73 0 1
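If only the NOT OK counts are needed (the original request), selecting that column of the crosstab reduces it back to one count per batch_no (a small usage note, not part of the original answer):
pd.crosstab(df.batch_no, df.quality)['NOT OK']
batch_no
42    1
52    2
73    0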

How would you do this task using SQL or R library sqldf?

I need to implement the following function (ideally in R or SQL): given two data frames (each with a userid column, where the rest of the columns are boolean attributes that are only permitted to be 0 or 1), return a new data frame with two columns (userid and count), where count is the number of matching 0's and 1's for each user across the two tables. A user could occur in both data frames or in just one; in the latter case, I need to return NA for that user's count. Here is an example:
DF1
ID c1 c2 c3 c4 c5
1 0 1 0 1 1
10 1 0 1 0 0
5 0 1 1 1 0
20 1 1 0 0 1
3 1 1 0 0 1
6 0 0 1 1 1
71 1 0 1 0 0
15 0 1 1 1 0
80 0 0 0 1 0
DF2
ID c1 c2 c3 c4 c5
5 1 0 1 1 0
6 0 1 0 0 1
15 1 0 0 1 1
80 1 1 1 0 0
78 1 1 1 0 0
98 0 0 1 1 1
1 0 1 0 0 1
2 1 0 0 1 1
9 0 0 0 1 0
My function must return something like this: (the following is a subset)
DF_Return
ID Count
1 4
2 NA
80 1
20 NA
.
.
.
Could you give me any suggestions on how to carry this out? I'm not much of an expert in SQL.
Here is the R code to generate the example data I used above:
id1=c(1,10,5,20,3,6,71,15,80)
c1=c(0,1,0,1,1,0,1,0,0)
c2=c(1,0,1,1,1,0,0,1,0)
c3=c(0,1,1,0,0,1,1,1,0)
c4=c(1,0,1,0,0,1,0,1,1)
c5=c(1,0,0,1,1,1,0,0,0)
DF1=data.frame(ID=id1,c1=c1,c2=c2,c3=c3,c4=c4,c5=c5)
DF2=data.frame(ID=c(5,6,15,80,78,98,1,2,9),c1=c2,c2=c1,c3=c5,c4=c4,c5=c3)
Many thanks in advance.
Best Regards!
Here are two approaches for you. The first hardcodes the columns to compare, while the second is more general and agnostic to how many columns DF1 and DF2 have:
#Merge together using all = TRUE for the equivalent of an outer join
DF3 <- merge(DF1, DF2, by = "ID", all = TRUE, suffixes= c(".1", ".2"))
#Calculate the rowSums where the same columns match
out1 <- data.frame(ID = DF3[, 1], count = rowSums(DF3[, 2:6] == DF3[, 7:ncol(DF3)]))
#Approach that is agnostic to the number of columns you have
library(reshape2)
library(plyr)
DF3.m <- melt(DF3, id.vars = 1)
DF3.m[, c("level", "DF")] <- with(DF3.m, colsplit(variable, "\\.", c("level", "DF")))
out2 <- dcast(data = DF3.m, ID + level ~ DF, value.var="value")
colnames(out2)[3:4] <- c("DF1", "DF2")
out2 <- ddply(out2, "ID", summarize, count = sum(DF1 == DF2))
#Are they the same?
all.equal(out1, out2)
#[1] TRUE
> head(out1)
ID count
1 1 4
2 2 NA
3 3 NA
4 5 3
5 6 2
6 9 NA
SELECT
COALESCE(DF1.ID, DF2.ID) AS ID,
CASE WHEN DF1.c1 = DF2.c1 THEN 1 ELSE 0 END +
CASE WHEN DF1.c2 = DF2.c2 THEN 1 ELSE 0 END +
CASE WHEN DF1.c3 = DF2.c3 THEN 1 ELSE 0 END +
CASE WHEN DF1.c4 = DF2.c4 THEN 1 ELSE 0 END +
CASE WHEN DF1.c5 = DF2.c5 THEN 1 ELSE 0 END AS count_of_matches
FROM
DF1
FULL OUTER JOIN
DF2
ON DF1.ID = DF2.ID
There's probably a more elegant way, but this works:
x <- merge(DF1,DF2,by="ID",all=TRUE)
pre <- paste("c",1:5,sep="")
x$Count <- rowSums(x[,paste(pre,"x",sep=".")]==x[,paste(pre,"y",sep=".")])
DF_Return <- x[,c("ID","Count")]
We could use safe_full_join from my package safejoin, and apply ==
between conflicting columns. This will yield a new data frame with logical
c* columns that we can use rowSums on.
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)
safe_full_join(DF1, DF2, by = "ID", conflict = `==`) %>%
transmute(ID, count = rowSums(.[-1]))
# ID count
# 1 1 4
# 2 10 NA
# 3 5 3
# 4 20 NA
# 5 3 NA
# 6 6 2
# 7 71 NA
# 8 15 1
# 9 80 1
# 10 78 NA
# 11 98 NA
# 12 2 NA
# 13 9 NA
You can use the apply function to handle this. To get the sum of each row, you can use:
sums <- apply(df1[2:ncol(df1)], 1, sum)
cbind(df1[1], sums)
which will return the sum of all but the first column, then bind that to the first column to get the ID back.
You could do that on both data frames. I'm not really clear what the desired behavior is after that, but maybe look at the merge function.